
Query your Apache Hive metastore with AWS Lake Formation permissions

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/query-your-apache-hive-metastore-with-aws-lake-formation-permissions/

Apache Hive is a SQL-based data warehouse system for processing highly distributed datasets on the Apache Hadoop platform. There are two key components to Apache Hive: the Hive SQL query engine and the Hive metastore (HMS). The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive metastore to retrieve metadata to run queries. The Hive metastore can be hosted on an Apache Hadoop cluster or can be backed by a relational database that is external to a Hadoop cluster. Although the Hive metastore stores the metadata of tables, the actual table data may reside in Amazon Simple Storage Service (Amazon S3), the Hadoop Distributed File System (HDFS) of the Hadoop cluster, or any other Hive-supported data store.
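The metadata/data split described above is visible in the definition of an external Hive table: only the schema, serde, and location live in the metastore, while the rows live in Amazon S3. A minimal sketch of such a DDL statement, using hypothetical database, table, column, and bucket names:

```python
# Sketch: an external Hive table keeps only metadata (schema, serde,
# location) in the metastore; the data itself lives in Amazon S3.
# Database, table, column, and bucket names here are hypothetical.

def external_table_ddl(db: str, table: str, s3_location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement whose data lives in S3."""
    return (
        f"CREATE EXTERNAL TABLE {db}.{table} (\n"
        "  product_id STRING,\n"
        "  category   STRING\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'"
    )

ddl = external_table_ddl(
    "emrhms_salesdb",
    "hms_productcategory",
    "s3://my-datalake-bucket/sales/productcategory/",
)
print(ddl)
```

Dropping such a table removes only the metastore entry; the S3 objects are untouched, which is why external tables are the common choice for data lake workloads.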

Because Apache Hive was built on top of Apache Hadoop, many organizations have used it for as long as they have used Hadoop for big data processing. The Hive metastore also integrates with many other open-source big data tools, such as Apache HBase, Apache Spark, Presto, and Apache Impala. As a result, organizations have come to host huge volumes of metadata for their structured datasets in the Hive metastore. A metastore is a critical part of a data lake, and having this information available, wherever it resides, is important. However, many AWS analytics services don’t integrate natively with the Hive metastore, so organizations have had to migrate their data to the AWS Glue Data Catalog to use these services.

AWS Lake Formation has launched support for managing user access to Apache Hive metastores through a federated AWS Glue connection. Previously, you could use Lake Formation to manage user permissions on AWS Glue Data Catalog resources only. With the Hive metastore connection from AWS Glue, you can connect to a database in a Hive metastore external to the Data Catalog, map it to a federated database in the Data Catalog, apply Lake Formation permissions on the Hive database and tables, share them with other AWS accounts, and query them using services such as Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL (extract, transform, and load). For additional details on how the Hive metastore integration with Lake Formation works, refer to Managing permissions on datasets that use external metastores.

Use cases for Hive metastore integration with the Data Catalog include the following:

  • An external Apache Hive metastore used for legacy big data workloads, such as on-premises Hadoop clusters with data in Amazon S3
  • Transient Amazon EMR workloads with underlying data in Amazon S3 and the Hive metastore on Amazon Relational Database Service (Amazon RDS) clusters

In this post, we demonstrate how to apply Lake Formation permissions on a Hive metastore database and tables and query them using Athena. We illustrate a cross-account sharing use case, where a Lake Formation steward in producer account A shares a federated Hive database and tables using LF-Tags to consumer account B.

Solution overview

Producer account A hosts an Apache Hive metastore in an EMR cluster, with underlying data in Amazon S3. We launch the AWS Glue Hive metastore connector from AWS Serverless Application Repository in account A and create the Hive metastore connection in account A’s Data Catalog. After we create the HMS connection, we create a database in account A’s Data Catalog (called the federated database) and map it to a database in the Hive metastore using the connection. The tables from the Hive database are then accessible to the Lake Formation admin in account A, just like any other tables in the Data Catalog. The admin continues to set up Lake Formation tag-based access control (LF-TBAC) on the federated Hive database and share it to account B.

The data lake users in account B will access the Hive database and tables of account A, just like querying any other shared Data Catalog resource using Lake Formation permissions.

The following diagram illustrates this architecture.

The solution consists of steps in both accounts. In account A, perform the following steps:

  1. Create an S3 bucket to host the sample data.
  2. Launch an EMR 6.10 cluster with Hive. Download the sample data to the S3 bucket. Create a database and external tables, pointing to the downloaded sample data, in its Hive metastore.
  3. Deploy the application GlueDataCatalogFederation-HiveMetastore from AWS Serverless Application Repository and configure it to use the Amazon EMR Hive metastore. This will create an AWS Glue connection to the Hive metastore that shows up on the Lake Formation console.
  4. Using the Hive metastore connection, create a federated database in the AWS Glue Data Catalog.
  5. Create LF-Tags and associate them to the federated database.
  6. Grant permissions on the LF-Tags to account B. Grant database and table permissions to account B using LF-Tag expressions.
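Step 4 above can also be done programmatically. The following is a hedged sketch of the AWS Glue CreateDatabase request that maps a local Data Catalog database to the external Hive metastore database; the names are the ones used in this post, but treat the exact field shapes as an assumption to verify against the AWS Glue API reference:

```python
# Build the CreateDatabase payload for a federated database. The
# FederatedDatabase block links the local catalog entry to a database
# in the external HMS via the Glue connection. Field shapes assumed
# from the Glue API; the boto3 call is shown only as a comment.

def federated_database_input(name: str, connection: str, hive_db: str) -> dict:
    return {
        "DatabaseInput": {
            "Name": name,  # federated database in the local Data Catalog
            "FederatedDatabase": {
                "Identifier": hive_db,         # database in the external HMS
                "ConnectionName": connection,  # the Glue HMS connection
            },
        }
    }

params = federated_database_input(
    "federated_emrhivedb", "hive-metastore-connection", "emrhms_salesdb"
)
# boto3.client("glue").create_database(**params)  # not executed here
print(params["DatabaseInput"]["Name"])
```

The one-to-one mapping between the federated database and the HMS database is exactly what the console flow in the later sections sets up by hand.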

In account B, perform the following steps:

  1. As a data lake admin, review and accept the AWS Resource Access Manager (AWS RAM) invites for the shares from account A.
  2. The data lake admin then sees the shared database and tables. The admin creates a resource link to the database and grants fine-grained permissions to a data analyst in this account.
  3. Both the data lake admin and the data analyst query the Hive tables that are available to them using Athena.

Account A has the following personas:

  • hmsblog-producersteward – Manages the data lake in the producer account A

Account B has the following personas:

  • hmsblog-consumersteward – Manages the data lake in the consumer account B
  • hmsblog-analyst – A data analyst who needs access to selected Hive tables

Prerequisites

To follow the tutorial in this post, you need the following:

  • Two AWS accounts: a producer account (account A) and a consumer account (account B)
  • IAM administrator access to both accounts

Lake Formation and AWS CloudFormation setup in account A

To keep the setup simple, we have an IAM admin registered as the data lake admin. Complete the following steps:

  1. Sign in to the AWS Management Console and choose the us-west-2 Region.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  3. Choose Manage Administrators in the Data lake administrators section.
  4. Under IAM users and roles, choose the IAM admin user that you are logged in as and choose Save.
  5. Choose Launch Stack to deploy the CloudFormation template:
  6. Choose Next.
  7. Provide a name for the stack and choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation takes about 10 minutes. The stack establishes the producer account A setup as follows:

  • Creates an S3 data lake bucket
  • Registers the data lake bucket to Lake Formation with the Enable catalog federation flag
  • Launches an EMR 6.10 cluster with Hive and runs two steps in Amazon EMR:
    • Downloads the sample data from public S3 bucket to the newly created bucket
    • Creates a Hive database and four external tables for the data in Amazon S3, using a HQL script
  • Creates an IAM user (hmsblog-producersteward) and sets this user as Lake Formation administrator
  • Creates LF-Tags (LFHiveBlogCampaignRole = Admin, Analyst)

Review CloudFormation stack output in account A

To review the output of your CloudFormation stack, complete the following steps:

  1. Log in to the console as the IAM admin user you used earlier to run the CloudFormation template.
  2. Open the CloudFormation console in another browser tab.
  3. Review and note down the stack Outputs tab details.
  4. Choose the link under Value for ProducerStewardCredentials.

This will open the AWS Secrets Manager console.

  1. Choose Retrieve value and note down the credentials of hmsblog-producersteward.

Set up a federated AWS Glue connection in account A

To set up a federated AWS Glue connection, complete the following steps:

  1. Open the AWS Serverless Application Repository console in another browser tab.
  2. In the navigation pane, choose Available applications.
  3. Select Show apps that create custom IAM roles or resource policies.
  4. In the search bar, enter Glue.

This will list various applications.

  1. Choose the application named GlueDataCatalogFederation-HiveMetastore.

This will open the AWS Lambda console configuration page for a Lambda function that runs the connector application code.

To configure the Lambda function, you need details of the EMR cluster launched by the CloudFormation stack.

  1. On another tab of your browser, open the Amazon EMR console.
  2. Navigate to the cluster launched for this post and note down the following details from the cluster details page:
    1. Primary node public DNS
    2. Subnet ID
    3. Security group ID of the primary node

  3. Back on the Lambda configuration page, under Review, configure, and deploy, in the Application settings section, provide the following details. Leave the rest as the default values.
    1. For GlueConnectionName, enter hive-metastore-connection.
    2. For HiveMetastoreURIs, enter thrift://<Primary-node-public-DNS-of-your-EMR-cluster>:9083. For example, thrift://ec2-54-70-203-146.us-west-2.compute.amazonaws.com:9083, where 9083 is the Hive metastore port on the EMR cluster.
    3. For VPCSecurityGroupIds, enter the security group ID of the EMR primary node.
    4. For VPCSubnetIds, enter the subnet ID of the EMR cluster.
  4. Choose Deploy.
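A common misconfiguration at this step is a malformed HiveMetastoreURIs value. A quick sanity check for the URI format (the connector expects a thrift:// scheme on the HMS port, 9083 by default on EMR; the DNS name below is a placeholder):

```python
# Build and sanity-check the HiveMetastoreURIs value for the connector.
# The primary-node DNS name is a placeholder; 9083 is the default Hive
# metastore port on EMR.

def hms_uri(primary_node_dns: str, port: int = 9083) -> str:
    return f"thrift://{primary_node_dns}:{port}"

uri = hms_uri("ec2-54-70-203-146.us-west-2.compute.amazonaws.com")
print(uri)
```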

Wait for the Create Completed status of the Lambda application. You can review the details of the Lambda application on the Lambda console.

  1. Open Lake Formation console and in the navigation pane, choose Data sharing.

You should see hive-metastore-connection under Connections.

  1. Choose it and review the details.
  2. In the navigation pane, under Administrative roles and tasks, choose LF-Tags.

You should see the created LF-Tag LFHiveBlogCampaignRole with two values: Analyst and Admin.

  1. Choose LF-Tag permissions and choose Grant.
  2. Choose IAM users and roles and enter hmsblog-producersteward.
  3. Under LF-Tags, choose Add LF-Tag.
  4. Enter LFHiveBlogCampaignRole for Key and enter Analyst and Admin for Values.
  5. Under Permissions, select Describe and Associate for LF-Tag permissions and Grantable permissions.
  6. Choose Grant.

This gives LF-Tags permissions for the producer steward.
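The grant above corresponds to Lake Formation's GrantPermissions API with an LFTag resource. A sketch of that request follows; the account ID in the principal ARN is a placeholder, and the field shapes are assumptions to check against the Lake Formation API reference:

```python
# Build a GrantPermissions request giving a principal DESCRIBE and
# ASSOCIATE on an LF-Tag, with grant option. The account ID in the
# ARN below is a placeholder.

def lf_tag_grant(principal_arn: str, tag_key: str, tag_values: list) -> dict:
    perms = ["DESCRIBE", "ASSOCIATE"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"LFTag": {"TagKey": tag_key, "TagValues": tag_values}},
        "Permissions": perms,
        "PermissionsWithGrantOption": perms,
    }

grant = lf_tag_grant(
    "arn:aws:iam::111122223333:user/hmsblog-producersteward",
    "LFHiveBlogCampaignRole",
    ["Analyst", "Admin"],
)
# boto3.client("lakeformation").grant_permissions(**grant)  # not run here
print(grant["Permissions"])
```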

  1. Log out as the IAM administrator user.

Grant Lake Formation permissions as producer steward

Complete the following steps:

  1. Sign in to the console as hmsblog-producersteward, using the credentials from the CloudFormation stack Output tab that you noted down earlier.
  2. On the Lake Formation console, in the navigation pane, choose Administrative roles and tasks.
  3. Under Database creators, choose Grant.
  4. Add hmsblog-producersteward as a database creator.
  5. In the navigation pane, choose Data sharing.
  6. Under Connections, choose the hive-metastore-connection hyperlink.
  7. On the Connection details page, choose Create database.
  8. For Database name, enter federated_emrhivedb.

This is the federated database in the local AWS Glue Data Catalog that will point to a Hive metastore database. This is a one-to-one mapping of a database in the Data Catalog to a database in the external Hive metastore.

  1. For Database identifier, enter the name of the database in the EMR Hive metastore that was created by the Hive SQL script. For this post, we use emrhms_salesdb.
  2. Once created, select federated_emrhivedb and choose View tables.

This will fetch the database and table metadata from the Hive metastore on the EMR cluster and display the tables created by the Hive script.

Now you associate the LF-Tags created by the CloudFormation script on this federated database and share it to the consumer account B using LF-Tag expressions.

  1. In the navigation pane, choose Databases.
  2. Select federated_emrhivedb and on the Actions menu, choose Edit LF-Tags.
  3. Choose Assign new LF-Tag.
  4. Enter LFHiveBlogCampaignRole for Assigned keys and Admin for Values, then choose Save.
  5. In the navigation pane, choose Data lake permissions.
  6. Choose Grant.
  7. Select External accounts and enter the consumer account B number.
  8. Under LF-Tags or catalog resources, choose Resource matched by LF-Tags.
  9. Choose Add LF-Tag.
  10. Enter LFHiveBlogCampaignRole for Key and Admin for Values.
  11. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  12. In the Table permissions section, select Select and Describe for Table permissions and Grantable permissions.
  13. Choose Grant.
  14. In the navigation pane, under Administrative roles and tasks, choose LF-Tag permissions.
  15. Choose Grant.
  16. Select External accounts and enter the account ID of consumer account B.
  17. Under LF-Tags, enter LFHiveBlogCampaignRole for Key and enter Analyst and Admin for Values.
  18. Under Permissions, select Describe and Associate under LF-Tag permissions and Grantable permissions.
  19. Choose Grant and verify that the granted LF-Tag permissions display correctly.
  20. In the navigation pane, choose Data lake permissions.

You can review and verify the permissions granted to account B.
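The cross-account database- and table-level grants above are matched by an LF-Tag expression rather than by named resources. A sketch of the two equivalent GrantPermissions requests; the consumer account ID is a placeholder, and the LFTagPolicy shape is an assumption based on the Lake Formation API:

```python
# Cross-account TBAC grants: one request at DATABASE scope (Describe)
# and one at TABLE scope (Select, Describe), both matched by the
# LF-Tag expression LFHiveBlogCampaignRole = Admin.
# The consumer account ID below is a placeholder.

def tbac_grant(account_id: str, resource_type: str, permissions: list) -> dict:
    return {
        "Principal": {"DataLakePrincipalIdentifier": account_id},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": resource_type,  # "DATABASE" or "TABLE"
                "Expression": [
                    {"TagKey": "LFHiveBlogCampaignRole", "TagValues": ["Admin"]}
                ],
            }
        },
        "Permissions": permissions,
        "PermissionsWithGrantOption": permissions,
    }

db_grant = tbac_grant("444455556666", "DATABASE", ["DESCRIBE"])
table_grant = tbac_grant("444455556666", "TABLE", ["SELECT", "DESCRIBE"])
# boto3.client("lakeformation").grant_permissions(**db_grant)     # not run
# boto3.client("lakeformation").grant_permissions(**table_grant)  # not run
```

Because the grant is tag-based, any future database or table tagged LFHiveBlogCampaignRole = Admin is shared to account B automatically, with no new named-resource grant needed.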

  1. In the navigation pane, under Administrative roles and tasks, choose LF-Tag permissions.

You can review and verify the LF-Tag permissions granted to account B.

  1. Log out of account A.

Lake Formation and AWS CloudFormation setup in account B

To keep the setup simple, we use an IAM admin registered as the data lake admin.

  1. Sign in to the AWS Management Console of account B and select the us-west-2 Region.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  3. Choose Manage Administrators in the Data lake administrators section.
  4. Under IAM users and roles, choose the IAM admin user that you are logged in as and choose Save.
  5. Choose Launch Stack to deploy the CloudFormation template:
  6. Choose Next.
  7. Provide a name for the stack and choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation should take about 5 minutes. The stack establishes the consumer account B setup as follows:

  • Creates an IAM user hmsblog-consumersteward and sets this user as Lake Formation administrator
  • Creates another IAM user hmsblog-analyst
  • Creates an S3 data lake bucket to store Athena query results, with ListBucket and write object permissions to both hmsblog-consumersteward and hmsblog-analyst

Note down the stack output details.

Accept resource shares in account B

Sign in to the console as hmsblog-consumersteward and complete the following steps:

  1. On the AWS CloudFormation console, navigate to the stack Outputs tab.
  2. Choose the link for ConsumerStewardCredentials to be redirected to the Secrets Manager console.
  3. On the Secrets Manager console, choose Retrieve secret value and copy the password for the consumer steward user.
  4. Use the ConsoleIAMLoginURL value from the CloudFormation stack Outputs tab to log in to account B with the user name hmsblog-consumersteward and the password you copied from Secrets Manager.
  5. Open the AWS RAM console in another browser tab.
  6. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations.

You should see two resource share invitations from producer account A: one for a database-level share and one for a table-level share.

  1. Choose each resource share link, review the details, and choose Accept.

After you accept the invitations, the status of the resource shares changes from Pending to Active.
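The two Accept clicks map onto the AWS RAM API: list the pending invitations, then accept each by ARN. The filtering step is sketched below against sample data; the invitation ARNs are placeholders and the actual accept call is shown only as a comment:

```python
# Pick out RAM resource share invitations that are still pending, as a
# helper for accepting the database-level and table-level shares from
# the producer account. Invitation ARNs below are placeholders.

def pending_invitation_arns(invitations: list) -> list:
    """Return the ARNs of invitations awaiting acceptance."""
    return [
        inv["resourceShareInvitationArn"]
        for inv in invitations
        if inv["status"] == "PENDING"
    ]

sample = [
    {"resourceShareInvitationArn": "arn:aws:ram:invitation/db-share",
     "status": "PENDING"},
    {"resourceShareInvitationArn": "arn:aws:ram:invitation/table-share",
     "status": "PENDING"},
]
to_accept = pending_invitation_arns(sample)
# ram = boto3.client("ram")
# for arn in to_accept:
#     ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
print(len(to_accept))
```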

  1. Open the Lake Formation console in another browser tab.
  2. In the navigation pane, choose Databases.

You should see the shared database federated_emrhivedb from producer account A.

  1. Choose the database and choose View tables to review the list of tables shared under that database.

You should see the four tables of the Hive database that is hosted on the EMR cluster in the producer account.

Grant permissions in account B

To grant permissions in account B, complete the following steps as hmsblog-consumersteward:

  1. On the Lake Formation console, in the navigation pane, choose Administrative roles and tasks.
  2. Under Database creators, choose Grant.
  3. For IAM users and roles, enter hmsblog-consumersteward.
  4. For Catalog permissions, select Create database.
  5. Choose Grant.

This allows hmsblog-consumersteward to create a database resource link.

  1. In the navigation pane, choose Databases.
  2. Select federated_emrhivedb and on the Actions menu, choose Create resource link.
  3. Enter rl_federatedhivedb for Resource link name and choose Create.
  4. Choose Databases in the navigation pane.
  5. Select the resource link rl_federatedhivedb and on the Actions menu, choose Grant.
  6. Choose hmsblog-analyst for IAM users and roles.
  7. Under Resource link permissions, select Describe, then choose Grant.
  8. Select Databases in the navigation pane.
  9. Select the resource link rl_federatedhivedb and on the Actions menu, choose Grant on target.
  10. Choose hmsblog-analyst for IAM users and roles.
  11. Choose hms_productcategory and hms_supplier for Tables.
  12. For Table permissions, select Select and Describe, then choose Grant.
  13. In the navigation pane, choose Data lake permissions and review the permissions granted to hmsblog-analyst.
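A resource link, as created in steps 2–3 above, is itself a Glue database whose target points at the shared database in the producer account. A sketch of the equivalent CreateDatabase payload; the producer account ID is a placeholder and the field shapes are assumptions per the Glue API:

```python
# Build the CreateDatabase payload for a resource link: a local
# database entry whose TargetDatabase points at the shared database
# in the producer account. The account ID below is a placeholder.

def resource_link_input(link_name: str, owner_account_id: str,
                        target_db: str) -> dict:
    return {
        "DatabaseInput": {
            "Name": link_name,
            "TargetDatabase": {
                "CatalogId": owner_account_id,  # producer account A
                "DatabaseName": target_db,
            },
        }
    }

params = resource_link_input(
    "rl_federatedhivedb", "111122223333", "federated_emrhivedb"
)
# boto3.client("glue").create_database(**params)  # not executed here
```

The resource link is what query engines such as Athena address locally; permissions on the underlying tables are still evaluated against the grants from account A.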

Query the Apache Hive database of the producer from the consumer Athena

Complete the following steps:

  1. On the Athena console, navigate to the query editor.
  2. Choose Edit settings to configure the Athena query results bucket.
  3. Browse and choose the S3 bucket hmsblog-athenaresults-<your-account-B>-us-west-2 that the CloudFormation template created.
  4. Choose Save.

hmsblog-consumersteward has access to all four tables under federated_emrhivedb from the producer account.

  1. In the Athena query editor, choose the database rl_federatedhivedb and run a query on any of the tables.

You were able to query an external Apache Hive metastore database of the producer account through the AWS Glue Data Catalog and Lake Formation permissions using Athena from the recipient consumer account.
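The same query can be issued through Athena's StartQueryExecution API. A sketch of the request payload; the table name, results bucket, and account ID are placeholders, and the database is the resource link created earlier:

```python
# Build a StartQueryExecution request against the resource link
# database. Table name, results bucket, and account ID are
# placeholders; the boto3 call is shown only as a comment.

def athena_query_params(sql: str, database: str, output_s3: str) -> dict:
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = athena_query_params(
    "SELECT * FROM hms_supplier LIMIT 10",
    "rl_federatedhivedb",
    "s3://hmsblog-athenaresults-444455556666-us-west-2/",
)
# boto3.client("athena").start_query_execution(**params)  # not run here
```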

  1. Sign out of the console as hmsblog-consumersteward and sign back in as hmsblog-analyst.
  2. Use the same method as explained earlier to get the login credentials from the CloudFormation stack Outputs tab.

hmsblog-analyst has Describe permissions on the resource link and access to two of the four Hive tables. You can verify that you see them on the Databases and Tables pages on the Lake Formation console.

On the Athena console, you now configure the Athena query results bucket, similar to how you configured it as hmsblog-consumersteward.

  1. In the query editor, choose Edit settings.
  2. Browse and choose the S3 bucket hmsblog-athenaresults-<your-account-B>-us-west-2 that the CloudFormation template created.
  3. Choose Save.
  4. In the Athena query editor, choose the database rl_federatedhivedb and run a query on the two tables.
  5. Sign out of the console as hmsblog-analyst.

You were able to restrict sharing the external Apache Hive metastore tables using Lake Formation permissions from one account to another and query them using Athena. You can also query the Hive tables using Redshift Spectrum, Amazon EMR, and AWS Glue ETL from the consumer account.

Clean up

To avoid incurring charges on the AWS resources created in this post, you can perform the following steps.

Clean up resources in account A

There are two CloudFormation stacks associated with producer account A. You need to delete the dependencies and the two stacks in the correct order.

  1. Log in as the admin user to producer account A.
  2. On the Lake Formation console, choose Data lake permissions in the navigation pane.
  3. Choose Grant.
  4. Grant Drop permissions to your role or user on federated_emrhivedb.
  5. In the navigation pane, choose Databases.
  6. Select federated_emrhivedb and on the Actions menu, choose Delete to delete the federated database that is associated with the Hive metastore connection.

This makes the AWS Glue connection’s CloudFormation stack ready to be deleted.

  1. In the navigation pane, choose Administrative roles and tasks.
  2. Under Database creators, select hmsblog-producersteward and choose Revoke to remove its database creator permissions.
  3. On the CloudFormation console, delete the stack named serverlessrepo-GlueDataCatalogFederation-HiveMetastore first.

This is the stack created when you deployed the Hive metastore connector application from AWS Serverless Application Repository. Wait for it to finish deleting.

  1. Delete the CloudFormation stack that you created for the producer account set up.

This deletes the S3 buckets, EMR cluster, custom IAM roles and policies, and the LF-Tags, database, tables, and permissions.

Clean up resources in account B

Complete the following steps in account B:

  1. Revoke permission to hmsblog-consumersteward as database creator, similar to the steps in the previous section.
  2. Delete the CloudFormation stack that you created for the consumer account setup.

This deletes the IAM users, S3 bucket, and all the permissions from Lake Formation.

If there are any resource links and permissions left, delete them manually in Lake Formation from both accounts.

Conclusion

In this post, we showed you how to launch the AWS Glue Hive metastore federation application from AWS Serverless Application Repository, configure it with a Hive metastore running on an EMR cluster, create a federated database in the AWS Glue Data Catalog, and map it to a Hive metastore database on the EMR cluster. We illustrated how to share and access the Hive database tables for a cross-account scenario and the benefits of using Lake Formation to restrict permissions.

All Lake Formation features, such as sharing to IAM principals within the same account, sharing to external accounts, sharing to IAM principals in external accounts, restricting column access, and setting data filters, work on federated Hive databases and tables. You can use any of the AWS analytics services that are integrated with Lake Formation, such as Athena, Redshift Spectrum, AWS Glue ETL, and Amazon EMR, to query the federated Hive database and tables.

We encourage you to check out the features of the AWS Glue Hive metastore federation connector and explore Lake Formation permissions on your Hive database and tables. Please comment on this post or talk to your AWS Account Team to share feedback on this feature.

For more details, see Managing permissions on datasets that use external metastores.


About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

Post Syndicated from Nitin Arora original https://aws.amazon.com/blogs/big-data/how-amazon-finance-automation-built-a-data-mesh-to-support-distributed-data-ownership-and-centralize-governance/

Amazon Finance Automation (FinAuto) is the tech organization of Amazon Finance Operations (FinOps). Its mission is to enable FinOps to support the growth and expansion of Amazon businesses. It works as a force multiplier through automation and self-service, while providing accurate and on-time payments and collections. FinAuto has a unique position to look across FinOps and provide solutions that help satisfy multiple use cases with accurate, consistent, and governed delivery of data and related services.

In this post, we discuss how the Amazon Finance Automation team used AWS Lake Formation and the AWS Glue Data Catalog to build a data mesh architecture that simplified data governance at scale and provided seamless data access for analytics, AI, and machine learning (ML) use cases.

Challenges

Amazon businesses have grown over the years. In the early days, financial transactions could be stored and processed on a single relational database. In today’s business world, however, even a subset of the financial space dedicated to entities such as Accounts Payable (AP) and Accounts Receivable (AR) requires separate systems handling terabytes of data per day. Within FinOps, we can curate more than 300 datasets and consume many more raw datasets from dozens of systems. These datasets can then be used to power front end systems, ML pipelines, and data engineering teams.

This exponential growth necessitated a data landscape that was geared towards keeping FinOps operating. However, as we added more transactional systems, data started to grow in operational data stores. Data copies were common, with duplicate pipelines creating redundant and often out-of-sync domain datasets. Multiple curated data assets were available with similar attributes. To resolve these challenges, FinAuto decided to build a data services layer based on a data mesh architecture. FinAuto wanted to verify that the data domain owners would retain ownership of their datasets while users got access to the data by using a data mesh architecture.

Solution overview

Being customer focused, we started by understanding our data producers’ and consumers’ needs and priorities. Consumers prioritized data discoverability, fast data access, low latency, and high accuracy of data. Producers prioritized ownership, governance, access management, and reuse of their datasets. These inputs reinforced the need for a unified data strategy across the FinOps teams. We decided to build a scalable data management product based on the best practices of modern data architecture. Our source system and domain teams were mapped as data producers, and they would have ownership of the datasets. FinAuto provided the data services tools and controls necessary to enable data owners to apply data classification, access permissions, and usage policies. It was necessary for domain owners to continue this responsibility because they had visibility into the business rules and classifications and applied them to the dataset. This enabled producers to publish data products that were curated and authoritative assets for their domain. For example, the AR team created and governed their cash application dataset in the AWS Glue Data Catalog in their AWS account.

With many such partners building their data products, we needed a way to centralize data discovery, access management, and vending of these data products. So we built a global data catalog in a central governance account based on the AWS Glue Data Catalog. The FinAuto team built AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, and API tools to maintain a metadata store that ingests from domain owner catalogs into the global catalog. This global catalog captures new or updated partitions from the data producer AWS Glue Data Catalogs. The global catalog is also periodically fully refreshed to resolve issues during metadata sync processes to maintain resiliency. With this structure in place, we then needed to add governance and access management. We selected AWS Lake Formation in our central governance account to help secure the data catalog, and added secure vending mechanisms around it. We also built a front-end discovery and access control application where consumers can browse datasets and request access. When a consumer requests access, the application validates the request and routes them to a respective producer via internal tickets for approval. Only after the data producer approves the request are permissions provisioned in the central governance account through Lake Formation.

Solution tenets

A data mesh architecture has its own advantages and challenges. By democratizing the data product creation, we removed dependencies on a central team. We made reuse of data possible with data discoverability and minimized data duplicates. This also helped remove data movement pipelines, thereby reducing data transfer and maintenance costs.

We realized, however, that our implementation could potentially impact day-to-day tasks and inhibit adoption. For example, data producers need to onboard their dataset to the global catalog, and complete their permissions management before they can share that with consumers. To overcome this obstacle, we prioritized self-service tools and automation with a reliable and simple-to-use interface. We made interaction, including producer-consumer onboarding, data access request, approvals, and governance, quicker through the self-service tools in our application.

Solution architecture

Within Amazon, we isolate different teams and business processes with separate AWS accounts. From a security perspective, the account boundary is one of the strongest security boundaries in AWS. Because of this, the global catalog resides in its own locked-down AWS account.

The following diagram shows AWS account boundaries for producers, consumers, and the central catalog. It also describes the steps involved for data producers to register their datasets, as well as how data consumers get access. Most of these steps are automated through convenience scripts, with both AWS CDK and CloudFormation templates for our producers and consumers to use.

Solution Architecture Diagram

The workflow contains the following steps:

  1. Data is saved by the producer in their own Amazon Simple Storage Service (Amazon S3) buckets.
  2. Data source locations hosted by the producer are created within the producer’s AWS Glue Data Catalog.
  3. Data source locations are registered with Lake Formation.
  4. An onboarding AWS CDK script creates a role for the central catalog to use to read metadata and generate the tables in the global catalog.
  5. The metadata sync is set up to continuously sync data schema and partition updates to the central data catalog.
  6. When a consumer requests table access from the central data catalog, the producer grants Lake Formation permissions to the consumer account AWS Identity and Access Management (IAM) role and tables are visible in the consumer account.
  7. The consumer account accepts the AWS Resource Access Manager (AWS RAM) share and creates resource links in Lake Formation.
  8. The consumer data lake admin provides grants to IAM users and roles mapping to data consumers within the account.
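At the heart of steps 4–5 is copying table definitions from a producer catalog into the central catalog. One practical detail is that a GetTable response contains read-only fields (creation time, catalog ID, and so on) that CreateTable and UpdateTable reject, so a sync must strip them. A simplified sketch of that step, with a trimmed sample response and a field list assumed from Glue's TableInput structure:

```python
# Sketch of the metadata sync: convert a GetTable response from a
# producer catalog into a TableInput the central catalog accepts.
# Allowed keys assumed from the Glue TableInput structure; the sample
# table and account ID are placeholders, and the real sync service is
# simplified away.

TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}

def to_table_input(table: dict) -> dict:
    """Strip read-only fields (CreateTime, CatalogId, ...) so the
    definition can be re-created in another catalog."""
    return {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}

producer_table = {                           # trimmed GetTable response
    "Name": "ar_invoices",
    "TableType": "EXTERNAL_TABLE",
    "CreateTime": "2023-01-01T00:00:00Z",    # read-only, dropped
    "CatalogId": "111122223333",             # read-only, dropped
}
table_input = to_table_input(producer_table)
# central_glue.create_table(DatabaseName="ar_domain", TableInput=table_input)
```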

The global catalog

The basic building blocks of our business-focused solutions are data products. A data product is a single domain attribute that a business understands as accurate, current, and available. This could be a dataset (a table) representing a business attribute like a global AR invoice, invoice aging, aggregated invoices by a line of business, or a current ledger balance. These attributes are calculated by the domain team and are available for consumers who need that attribute, without duplicating pipelines to recreate it. Data products, along with raw datasets, reside within their data owner’s AWS account. Data producers register their data catalog’s metadata to the central catalog. We have services to review source catalogs to identify and recommend classification of sensitive data columns such as name, email address, customer ID, and bank account numbers. Producers can review and accept those recommendations, which results in corresponding tags applied to the columns.

Producer experience

Producers onboard their accounts when they want to publish a data product. Our job is to sync the metadata between the AWS Glue Data Catalog in the producer account with the central catalog account, and register the Amazon S3 data location with Lake Formation. Producers and data owners can use Lake Formation for fine-grained access controls on the table. It is also now searchable and discoverable via the central catalog application.

Consumer experience

When a data consumer discovers the data product that they’re interested in, they submit a data access request from the application UI. Internally, we route the request to the data owner for the disposition of the request (approval or rejection). We then create an internal ticket to track the request for auditing and traceability. If the data owner approves the request, we run automation to create an AWS RAM resource share to share with the consumer account covering the AWS Glue database and tables approved for access. These consumers can now query the datasets using the AWS analytics services of their choice like Amazon Redshift Spectrum, Amazon Athena, and Amazon EMR.

Operational excellence

Along with building the data mesh, it’s also important to verify that we can operate with efficiency and reliability. We recognize that the metadata sync process is at the heart of this global data catalog. As such, we are hypervigilant of this process and have built alarms, notifications, and dashboards to verify that this process doesn’t fail silently and create a single point of failure for the global data catalog. We also have a backup repair service that syncs the metadata from producer catalogs into the central governance account catalog periodically. This is a self-healing mechanism to maintain reliability and resiliency.

Empowering customers with the data mesh

The FinAuto data mesh hosts around 850 discoverable and shareable datasets from multiple partner accounts. There are more than 300 curated data products to which producers can provide access and apply governance with fine-grained access controls. Our consumers use AWS analytics services such as Redshift Spectrum, Athena, Amazon EMR, and Amazon QuickSight to access their data. This capability with standardized data vending from the data mesh, along with self-serve capabilities, allows you to innovate faster without dependency on technical teams. You can now get access to data faster with automation that continuously improves the process.

By serving the FinOps team’s data needs with high availability and security, we enabled them to effectively support operation and reporting. Data science teams can now use the data mesh for their finance-related AI/ML use cases such as fraud detection, credit risk modeling, and account grouping. Our finance operations analysts are now enabled to dive deep into their customer issues, which is most important to them.

Conclusion

FinOps implemented a data mesh architecture with Lake Formation to improve data governance with fine-grained access controls. With these improvements, the FinOps team is now able to innovate faster with access to the right data at the right time in a self-serve manner to drive business outcomes. The FinOps team will continue to innovate in this space with AWS services to further provide for customer needs.

To learn more about how to use Lake Formation to build a data mesh architecture, see Design a data mesh architecture using AWS Lake Formation and AWS Glue.


About the Authors

Nitin Arora is a Sr. Software Development Manager for Finance Automation in Amazon. He has over 18 years of experience building business-critical, scalable, high-performance software. Nitin leads several data and analytics initiatives within Finance, including building the data mesh. In his spare time, he enjoys listening to music and reading.

Pradeep Misra is a Specialist Solutions Architect at AWS. He works across Amazon to architect and design modern distributed analytics and AI/ML platform solutions. He is passionate about solving customer challenges using data, analytics, and AI/ML. Outside of work, Pradeep likes exploring new places, trying new cuisines, and playing board games with his family. He also likes doing science experiments with his daughters.

Rajesh Rao is a Sr. Technical Program Manager in Amazon Finance. He works with Data Services teams within Amazon to build and deliver data processing and data analytics solutions for Financial Operations teams. He is passionate about delivering innovative and optimal solutions using AWS to enable data-driven business outcomes for his customers.

Andrew Long, the lead developer for data mesh, has designed and built many of the big data processing systems that have fueled Amazon’s financial data processing infrastructure. His work encompasses a range of areas, including S3-based table formats for Spark, diverse Spark performance optimizations, distributed orchestration engines, and the development of data cataloging systems. Additionally, Andrew finds pleasure in sharing his knowledge of partner acrobatics.

Kumar Satyen Gaurav is an experienced Software Development Manager at Amazon, with over 16 years of expertise in big data analytics and software development. He leads a team of engineers who build products and services using AWS big data technologies, providing key business insights for Amazon Finance Operations across diverse business verticals. Beyond work, he finds joy in reading, traveling, and learning the strategic challenges of chess.

Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view

Post Syndicated from Nishchai JM original https://aws.amazon.com/blogs/big-data/harmonize-data-using-aws-glue-and-aws-lake-formation-findmatches-ml-to-build-a-customer-360-view/

In today’s digital world, data is generated by a large number of disparate sources and growing at an exponential rate. Companies are faced with the daunting task of ingesting all this data, cleansing it, and using it to provide outstanding customer experience.

Typically, companies ingest data from multiple sources into their data lake to derive valuable insights from the data. These sources are often related but use different naming conventions, which will prolong cleansing, slowing down the data processing and analytics cycle. This problem particularly impacts companies trying to build accurate, unified customer 360 profiles. There are customer records in this data that are semantic duplicates, that is, they represent the same user entity, but have different labels or values. It’s commonly referred to as a data harmonization or deduplication problem. The underlying schemas were implemented independently and don’t adhere to common keys that can be used for joins to deduplicate records using deterministic techniques. This has led to so-called fuzzy deduplication techniques to address the problem. These techniques utilize various machine learning (ML) based approaches.

In this post, we look at how we can use AWS Glue and the AWS Lake Formation ML transform FindMatches to harmonize (deduplicate) customer data coming from different sources to get a complete customer profile to be able to provide better customer experience. We use Amazon Neptune to visualize the customer data before and after the merge and harmonization.

Overview of solution

In this post, we go through the various steps to apply ML-based fuzzy matching to harmonize customer data across two different datasets for auto and property insurance. These datasets are synthetically generated and represent a common problem for entity records stored in multiple, disparate data sources with their own lineage that appear similar and semantically represent the same entity but don’t have matching keys (or keys that work consistently) for deterministic, rule-based matching. The following diagram shows our solution architecture.

We use an AWS Glue job to transform the auto insurance and property insurance customer source data to create a merged dataset containing fields that are common to both datasets (identifiers) that a human expert (data steward) would use to determine semantic matches. The merged dataset is then used to deduplicate customer records using an AWS Glue ML transform to create a harmonized dataset. We use Neptune to visualize the customer data before and after the merge and harmonization to see how the transform FindMatches can bring all related customer data together to get a complete customer 360 view.

To demonstrate the solution, we use two separate data sources: one for property insurance customers and another for auto insurance customers, as illustrated in the following diagram.

The data is stored in an Amazon Simple Storage Service (Amazon S3) bucket, labeled as Raw Property and Auto Insurance data in the following architecture diagram. The diagram also describes detailed steps to process the raw insurance data into harmonized insurance data to avoid duplicates and build logical relations with related property and auto insurance data for the same customer.

The workflow includes the following steps:

  1. Catalog the raw property and auto insurance data, using an AWS Glue crawler, as tables in the AWS Glue Data Catalog.
  2. Transform raw insurance data into CSV format acceptable to Neptune Bulk Loader, using an AWS Glue extract, transform, and load (ETL) job.
  3. When the data is in CSV format, use an Amazon SageMaker Jupyter notebook to run a PySpark script to load the raw data into Neptune and visualize it in a Jupyter notebook.
  4. Run an AWS Glue ETL job to merge the raw property and auto insurance data into one dataset and catalog the merged dataset. This dataset will have duplicates and no relations are built between the auto and property insurance data.
  5. Create and train an AWS Glue ML transform to harmonize the merged data to remove duplicates and build relations between the related data.
  6. Run the AWS Glue ML transform job. The job also catalogs the harmonized data in the Data Catalog and transforms the harmonized insurance data into CSV format acceptable to Neptune Bulk Loader.
  7. When the data is in CSV format, use a Jupyter notebook to run a PySpark script to load the harmonized data into Neptune and visualize it in a Jupyter notebook.

Prerequisites

To follow along with this walkthrough, you must have an AWS account. Your account should have permission to provision and run an AWS CloudFormation script to deploy the AWS services mentioned in the architecture diagram of the solution.

Provision required resources using AWS CloudFormation

To launch the CloudFormation stack that configures the required resources for this solution in your AWS account, complete the following steps:

  1. Log in to your AWS account and choose Launch Stack.
  2. Follow the prompts on the AWS CloudFormation console to create the stack.
  3. When the launch is complete, navigate to the Outputs tab of the launched stack and note all the key-value pairs of the resources provisioned by the stack.

Verify the raw data and script files S3 bucket

On the CloudFormation stack’s Outputs tab, choose the value for S3BucketName. The S3 bucket name should be cloud360-s3bucketstack-xxxxxxxxxxxxxxxxxxxxxxxx and should contain folders similar to the following screenshot.

The following are some important folders:

  • auto_property_inputs – Contains raw auto and property data
  • merged_auto_property – Contains the merged data for auto and property insurance
  • output – Contains the delimited files (separate subdirectories)

Catalog the raw data

To help walk through the solution, the CloudFormation stack created and ran an AWS Glue crawler to catalog the property and auto insurance data. To learn more about creating and running AWS Glue crawlers, refer to Working with crawlers on the AWS Glue console. You should see the following tables created by the crawler in the c360_workshop_db AWS Glue database:

  • source_auto_address – Contains address data of customers with auto insurance
  • source_auto_customer – Contains auto insurance details of customers
  • source_auto_vehicles – Contains vehicle details of customers
  • source_property_addresses – Contains address data of customers with property insurance
  • source_property_customers – Contains property insurance details of customers

You can review the data using Amazon Athena. For more information about using Athena to query an AWS Glue table, refer to Running SQL queries using Amazon Athena. For example, you can run the following SQL query:

SELECT * FROM "c360_workshop_db"."source_auto_address" limit 10;

Convert the raw data into CSV files for Neptune

The CloudFormation stack created and ran the AWS Glue ETL job prep_neptune_data to convert the raw data into CSV format acceptable to Neptune Bulk Loader. To learn more about building an AWS Glue ETL job using AWS Glue Studio and to review the job created for this solution, refer to Creating ETL jobs with AWS Glue Studio.

Verify the completion of the job run by navigating to the Runs tab and checking the status of the most recent run.

Verify the CSV files created by the AWS Glue job in the S3 bucket under the output folder.

Load and visualize the raw data in Neptune

This section uses SageMaker Jupyter notebooks to load, query, explore, and visualize the raw property and auto insurance data in Neptune. Jupyter notebooks are web-based interactive platforms. We use Python scripts to analyze the data in a Jupyter notebook. A Jupyter notebook with the required Python scripts has already been provisioned by the CloudFormation stack.

  1. Start Jupyter Notebook.
  2. Choose the Neptune folder on the Files tab.
  3. Under the Customer360 folder, open the notebook explore_raw_insurance_data.ipynb.
  4. Run Steps 1–5 in the notebook to analyze and visualize the raw insurance data.

The rest of the instructions are inside the notebook itself. The following is a summary of the tasks for each step in the notebook:

  • Step 1: Retrieve Config – Run this cell to run the commands to connect to Neptune for Bulk Loader.
  • Step 2: Load Source Auto Data – Load the auto insurance data into Neptune as vertices and edges.
  • Step 3: Load Source Property Data – Load the property insurance data into Neptune as vertices and edges.
  • Step 4: UI Configuration – This block sets up the UI config and provides UI hints.
  • Step 5: Explore entire graph – The first block builds and displays a graph for all customers with more than four coverages of auto or property insurance policies. The second block displays the graph for four different records for a customer with the name James.

These are all records for the same customer, but because they’re not linked in any way, they appear as different customer records. The AWS Glue FindMatches ML transform job will identify these records as customer James, and the records provide complete visibility on all policies owned by James. The Neptune graph looks like the following example. The vertex covers represents the coverage of auto or property insurance by the owner (James in this case) and the vertex locatedAt represents the address of the property or vehicle that is covered by the owner’s insurance.
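Behind the notebook’s load steps, Neptune Bulk Loader accepts a JSON request posted to the cluster’s loader endpoint. The following is a hedged sketch of that payload; the S3 path, IAM role ARN, and region are placeholders.

```python
# Hedged sketch of the Neptune Bulk Loader request used by the notebook's
# load steps. The S3 path, IAM role ARN, and region are placeholders.
import json

def build_loader_request(s3_source, iam_role_arn, region):
    return {
        "source": s3_source,         # CSV vertex/edge files from the Glue job
        "format": "csv",             # Gremlin load data format
        "iamRoleArn": iam_role_arn,  # role Neptune assumes to read from S3
        "region": region,
        "failOnError": "TRUE",
    }

payload = build_loader_request(
    "s3://<your-bucket>/output/",   # hypothetical; see the output folder above
    "arn:aws:iam::111122223333:role/NeptuneLoadFromS3",
    "us-east-1",
)
# POST json.dumps(payload) to https://<neptune-endpoint>:8182/loader
```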

Merge the raw data and crawl the merged dataset

The CloudFormation stack created and ran the AWS Glue ETL job merge_auto_property to merge the raw property and auto insurance data into one dataset and catalog the resultant dataset in the Data Catalog. The AWS Glue ETL job does the following transforms on the raw data and merges the transformed data into one dataset:

  • Changes the following fields on the source table source_auto_customer:
    1. Changes policyid to id and data type to string
    2. Changes fname to first_name
    3. Changes lname to last_name
    4. Changes work to company
    5. Changes dob to date_of_birth
    6. Changes phone to home_phone
    7. Drops the fields birthdate, priority, policysince, and createddate
  • Changes the following fields on the source_property_customers:
    1. Changes customer_id to id and data type to string
    2. Changes social to ssn
    3. Drops the fields job, email, industry, city, state, zipcode, netnew, sales_rounded, sales_decimal, priority, and industry2
  • After converting the unique ID field in each table to string type and renaming it to id, the AWS Glue job appends the suffix -auto to all id fields in the source_auto_customer table and the suffix -property to all id fields in the source_property_customers table before copying all the data from both tables into the merged_auto_property table.
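The renames, drops, and id suffixing above can be sketched in plain Python (the actual job uses AWS Glue DynamicFrame transforms; the sample record values here are invented):

```python
# Plain-Python sketch of the merge_auto_property renames and id suffixing
# (the real job uses AWS Glue DynamicFrame transforms; sample values invented).
AUTO_RENAMES = {
    "policyid": "id", "fname": "first_name", "lname": "last_name",
    "work": "company", "dob": "date_of_birth", "phone": "home_phone",
}
AUTO_DROPS = {"birthdate", "priority", "policysince", "createddate"}

def transform_auto_record(rec):
    out = {AUTO_RENAMES.get(k, k): v for k, v in rec.items() if k not in AUTO_DROPS}
    out["id"] = f"{out['id']}-auto"  # cast to string and tag the source system
    return out

merged = transform_auto_record(
    {"policyid": 29800, "fname": "James", "lname": "Sanchez",
     "dob": "1980-01-01", "priority": 1}
)
# merged: {"id": "29800-auto", "first_name": "James",
#          "last_name": "Sanchez", "date_of_birth": "1980-01-01"}
```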

Verify the new table created by the job in the Data Catalog and review the merged dataset in Athena with the following SQL query:

SELECT * FROM "c360_workshop_db"."merged_auto_property" limit 10

For more information about how to review the data in the merged_auto_property table, refer to Running SQL queries using Amazon Athena.

Create, teach, and tune the Lake Formation ML transform

The AWS Glue merge job created a Data Catalog table called merged_auto_property. Preview the table in the Athena query editor and download the dataset as a CSV from the Athena console. You can open the CSV file for a quick comparison of duplicates.

The rows with IDs 11376-property and 11377-property are mostly the same except for the last two digits of their SSN, which likely stem from human error. Such fuzzy matches are easy to spot for a human expert or data steward with domain knowledge of how this data was generated, cleansed, and processed in the various source systems. Although a human expert can identify those duplicates on a small dataset, it becomes tedious when dealing with thousands of records. The AWS Glue ML transform builds on this intuition and provides an easy-to-use ML-based algorithm to automatically apply this approach to large datasets efficiently.

Create the FindMatches ML transform

  1. On the AWS Glue console, expand Data Integration and ETL in the navigation pane.
  2. Under Data classification tools, choose Record Matching.

This will open the ML transforms page.

  3. Choose Create transform.
  4. For Name, enter c360-ml-transform.
  5. For Existing IAM role, choose GlueServiceRoleLab.
  6. For Worker type, choose G.2X (Recommended).
  7. For Number of workers, enter 10.
  8. For Glue version, choose Spark 2.4 (Glue Version 2.0).
  9. Keep the other values as default and choose Next.

  10. For Database, choose c360_workshop_db.
  11. For Table, choose merged_auto_property.
  12. For Primary key, select id.
  13. Choose Next.

  14. In the Choose tuning options section, you can tune performance and cost metrics available for the ML transform. We stay with the default trade-offs for a balanced approach.

We have specified these values to achieve balanced results. If needed, you can adjust these values later by selecting the transform and using the Tune menu.

  15. Review the values and choose Create ML transform.

The ML transform is now created with the status Needs training.

Teach the transform to identify the duplicates

In this step, we teach the transform by providing labeled examples of matching and non-matching records. You can create your labeling set yourself or allow AWS Glue to generate the labeling set based on heuristics. AWS Glue extracts records from your source data and suggests potential matching records. The file will contain approximately 100 data samples for you to work with.

  1. On the AWS Glue console, navigate to the ML transforms page.
  2. Select the transform c360-ml-transform and choose Train model.

  3. Select I have labels and choose Browse S3 to upload labels from Amazon S3.

Two labeled files have been created for this example. We upload these files to teach the ML transform.

  4. Navigate to the label folder in your S3 bucket, select the labeled file (Label-1-iteration.csv), choose Choose, and then choose Upload labeling file from S3.
  5. A green banner appears when the upload is successful.
  6. Upload the second label file (Label-2-iteration.csv) and select Append to my existing labels.
  7. Wait for the successful upload, then choose Next.

  8. Review the details in the Estimate quality metrics section and choose Close.

Verify that the ML transform status is Ready for use. Note that the label count is 200 because we successfully uploaded two labeled files to teach the transform. Now we can use it in an AWS Glue ETL job for fuzzy matching of the full dataset.

Before proceeding to the next steps, note the transform ID (tfm-xxxxxxx) for the created ML transform.

Harmonize the data, catalog the harmonized data, and convert the data into CSV files for Neptune

In this step, we run an AWS Glue ML transform job to find matches in the merged data. The job also catalogs the harmonized dataset in the Data Catalog and converts the merged dataset into CSV files for Neptune to show the relations in the matched records.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job perform_ml_dedup.

  3. On the job details page, expand Additional properties.
  4. Under Job parameters, enter the transform ID you saved earlier and save the settings.
  5. Choose Run and monitor the job status for completion.

  6. Run the following query in Athena to review the data in the new table ml_matched_auto_property, created and cataloged by the AWS Glue job, and observe the results:

SELECT * FROM c360_workshop_db.ml_matched_auto_property WHERE first_name like 'Jam%' and last_name like 'Sanchez%';

The job has added a new column called match_id. If multiple records follow the match criteria, then all matching records have the same match_id.

Match IDs play a crucial role in data harmonization using Lake Formation FindMatches. Each row is assigned an integer match ID based on matching criteria such as first_name, last_name, SSN, or date_of_birth, as defined in the uploaded label files. For instance, match ID 25769803941 is assigned to all records that meet the match criteria, such as rows 1, 2, 4, and 5, which share the same last_name, SSN, and date_of_birth. Consequently, the records with IDs 19801-property, 29801-auto, 19800-property, and 29800-auto all share the same match ID. It’s important to take note of the match ID because it will be used in the Neptune Gremlin queries.
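As a toy illustration of how a shared match ID groups rows, consider deterministic exact matching on a few fields. This is not the fuzzy ML matching that FindMatches performs, and the sample rows and IDs below are invented; the real match IDs (such as 25769803941) come from the transform itself.

```python
# Toy illustration of match_id assignment: deterministic exact matching on a
# few fields, NOT the fuzzy ML matching that FindMatches performs.
from itertools import count

def assign_match_ids(records, keys=("last_name", "ssn", "date_of_birth")):
    ids, counter = {}, count(1)
    for rec in records:
        sig = tuple(rec[k] for k in keys)
        if sig not in ids:
            ids[sig] = next(counter)
        rec["match_id"] = ids[sig]
    return records

rows = assign_match_ids([
    {"id": "19801-property", "last_name": "Sanchez", "ssn": "123-45-6789",
     "date_of_birth": "1980-01-01"},
    {"id": "29801-auto", "last_name": "Sanchez", "ssn": "123-45-6789",
     "date_of_birth": "1980-01-01"},
    {"id": "11376-property", "last_name": "Lopez", "ssn": "987-65-4321",
     "date_of_birth": "1975-06-15"},
])
# The two Sanchez records share one match_id; the Lopez record gets another.
```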

The output of the AWS Glue job also has created two files, master_vertex.csv and master_edge.csv, in the S3 bucket output/master_data. We use these files to load data into the Neptune database to find the relationship among different entities.

Load and visualize the harmonized data in Neptune

This section uses Jupyter notebooks to load, query, explore, and visualize the ML matched auto and property insurance data in Neptune. Complete the following steps:

  1. Start Jupyter Notebook.
  2. Choose the Neptune folder on the Files tab.
  3. Under the Customer360 folder, open the notebook explore_harmonized_insurance_data.ipynb.
  4. Run Steps 1–5 in the notebook to analyze and visualize the harmonized insurance data.

The rest of the instructions are inside the notebook itself. The following is a summary of the tasks for each step in the notebook:

  • Step 1. Retrieve Config – Run this cell to run the commands to connect to Neptune for Bulk Loader.
  • Step 2. Load Harmonized Customer Data – Load the final vertex and edge files into Neptune.
  • Step 3. Initialize Neptune node traversals – This block sets up the UI config and provides UI hints.
  • Step 4. Exploring Customer 360 graph – Replace the Match_id 25769803941 copied from the previous step into g.V('REPLACE_ME') (if it isn’t already replaced) and run the cell.

This displays the graph for four different records of a customer whose first_name values James and JamE are now connected with the SameAs vertex. The Neptune graph helps connect different entities with match criteria; the AWS Glue FindMatches ML transform job has identified these records as the customer James, and the records show the same Match_id for all of them. The following diagram shows an example of the Neptune graph. The vertex covers represents the coverage of auto or property insurance by the owner (James in this case) and the vertex locatedAt represents the address of the property or vehicle that is covered by the owner’s insurance.

Clean up

To avoid incurring additional charges to your account, on the AWS CloudFormation console, select the stack that you provisioned as part of this post and delete it.

Conclusion

In this post, we showed how to use the AWS Lake Formation FindMatches ML transform for fuzzy matching data on a data lake, linking records that lack join keys and grouping similar records under a shared match ID. You can use Amazon Neptune to establish the relationships between records and visualize the connected graph to derive insights.

We encourage you to explore our range of services and see how they can help you achieve your goals. For more data and analytics blog posts, check out AWS Blogs.


About the Authors

Nishchai JM is an Analytics Specialist Solutions Architect at Amazon Web Services. He specializes in building big data applications and helping customers modernize their applications on the cloud. He thinks data is the new oil and spends most of his time deriving insights out of it.

Varad Ram is a Senior Solutions Architect at Amazon Web Services. He likes to help customers adopt cloud technologies and is particularly interested in artificial intelligence. He believes deep learning will power future technology growth. In his spare time, he likes to be outdoors with his daughter and son.

Narendra Gupta is a Specialist Solutions Architect at AWS, helping customers on their cloud journey with a focus on AWS analytics services. Outside of work, Narendra enjoys learning new technologies, watching movies, and visiting new places.

Arun A K is a Big Data Solutions Architect with AWS. He works with customers to provide architectural guidance for running analytics solutions on the cloud. In his free time, Arun enjoys quality time with his family.

How Novo Nordisk built distributed data governance and control at scale

Post Syndicated from Jonatan Selsing original https://aws.amazon.com/blogs/big-data/how-novo-nordisk-built-distributed-data-governance-and-control-at-scale/

This is a guest post co-written with Jonatan Selsing and Moses Arthur from Novo Nordisk.

This is the second post of a three-part series detailing how Novo Nordisk, a large pharmaceutical enterprise, partnered with AWS Professional Services to build a scalable and secure data and analytics platform. The first post of this series describes the overall architecture and how Novo Nordisk built a decentralized data mesh architecture, including Amazon Athena as the data query engine. The third post will show how end-users can consume data from their tool of choice, without compromising data governance. This will include how to configure Okta, AWS Lake Formation, and a business intelligence tool to enable SAML-based federated use of Athena for an enterprise BI activity.

When building a scalable data architecture on AWS, giving autonomy and ownership to the data domains is crucial for the success of the platform. By providing the right mix of freedom and control to the people with the business domain knowledge, your business can maximize value from the data as quickly and effectively as possible. The challenge facing organizations, however, is how to provide the right balance between freedom and control. At the same time, data is a strategic asset that needs to be protected with the highest degree of rigor. How can organizations strike that balance?

In this post, you will learn how to build decentralized governance with Lake Formation and AWS Identity and Access Management (IAM) using attribute-based access control (ABAC). We discuss some of the patterns we use, including Amazon Cognito identity pool federation using ABAC in permission policies, and Okta-based SAML federation with ABAC enforcement on role trust policies.

Solution overview

In the first post of this series, we explained how Novo Nordisk and AWS Professional Services built a modern data architecture based on data mesh tenets. This architecture enables data governance on distributed data domains, using an end-to-end solution to create data products and providing federated data access control. This post dives into three elements of the solution:

  • How IAM roles and Lake Formation are used to manage data access across data domains
  • How data access control is enforced at scale, using a group membership mapping with an ABAC pattern
  • How the system maintains state across the different layers, so that the ecosystem of trust is configured appropriately

From the end-user perspective, the objective of the mechanisms described in this post is to enable simplified data access from the different analytics services adopted by Novo Nordisk, such as those provided by software as a service (SaaS) vendors like Databricks, or self-hosted ones such as JupyterHub. At the same time, the platform must guarantee that any change in a dataset is immediately reflected at the service user interface. The following figure illustrates at a high level the expected behavior.

High-level data platform expected behavior

Following the layer nomenclature established in the first post, the services are created and managed in the consumption layer. The domain accounts are created and managed in the data management layer. Because changes can occur from both layers, continuous communication in both directions is required. The state information is kept in the virtualization layer along with the communication protocols. Additionally, at sign-in time, the services need information about data resources required to provide data access abstraction.

Managing data access

The data access control in this architecture is designed around the core principle that all access is encapsulated in isolated IAM role sessions. The layer pattern that we described in the first post ensures that the creation and curation of the IAM role policies involved can be delegated to the different data management ecosystems. Each data management platform integrated can use their own data access mechanisms, with the unique requirement that the data is accessed via specific IAM roles.

To illustrate the potential mechanisms that can be used by data management solutions, we show two examples of data access permission mechanisms used by two different data management solutions. Both systems utilize the same trust policies as described in the following sections, but have a completely different permission space.

Example 1: Identity-based ABAC policies

The first mechanism we discuss is an ABAC role that provides access to a home-like data storage area, where users can share within their departments and with the wider organization in a structure that mimics the organizational structure. Here, we don’t utilize the group names, but instead forward user attributes from the corporate Active Directory directly into the permission policy through claim overrides. We do this by having the corporate Active Directory as the identity provider (IdP) for the Amazon Cognito user pool and mapping the relevant IdP attributes to user pool attributes. Then, in the Amazon Cognito identity pool, we map the user pool attributes to session tags to use them for access control. Custom overrides can be included in the claim mapping, through the use of a pre token generation Lambda trigger. This way, claims from AD can be mapped to Amazon Cognito user pool attributes and then ultimately used in the Amazon Cognito identity pool to control IAM role permissions. The following is an example of an IAM policy with session tags:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "",
                        "public/",
                        "public/*",
                        "home/",
                        "home/${aws:PrincipalTag/initials}/*",
                        "home/${aws:PrincipalTag/department}/*"
                    ]
                }
            },
            "Action": "s3:ListBucket",
            "Resource": [
                "arn:aws:s3:::your-home-bucket"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetObject*",
                "s3:PutObject*",
                "s3:DeleteObject*"
            ],
            "Resource": [
                "arn:aws:s3:::your-home-bucket/home/${aws:PrincipalTag/initials}",
                "arn:aws:s3:::your-home-bucket/home/${aws:PrincipalTag/initials}/*",
                "arn:aws:s3:::your-home-bucket/public/${aws:PrincipalTag/initials}",
                "arn:aws:s3:::your-home-bucket/public/${aws:PrincipalTag/initials}/*",
                "arn:aws:s3:::your-home-bucket/home/${aws:PrincipalTag/department}",
                "arn:aws:s3:::your-home-bucket/home/${aws:PrincipalTag/department}/*",
                "arn:aws:s3:::your-home-bucket/public/${aws:PrincipalTag/department}",
                "arn:aws:s3:::your-home-bucket/public/${aws:PrincipalTag/department}/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "s3:GetObject*",
            "Resource": [
                "arn:aws:s3:::your-home-bucket/public/",
                "arn:aws:s3:::your-home-bucket/public/*"
            ],
            "Effect": "Allow"
        }
    ]
}

This role is then embedded in the analytics layer (together with the data domain roles) and assumed on behalf of the user. This enables users to mix and match between data domains—as well as utilizing private and public data paths that aren’t necessarily tied to any data domain. For more examples of how ABAC can be used with permission policies, refer to How to scale your authorization needs by using attribute-based access control with S3.
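The claim-override step described earlier can be sketched as a Cognito pre token generation Lambda trigger. The attribute names (`custom:initials`, `custom:department`) are illustrative assumptions, not the production schema:

```python
# Sketch of a Cognito pre token generation Lambda trigger (V1 event format).
# It copies IdP-sourced user pool attributes into token claims so the
# identity pool can map them to the session tags used in the policy above.
# The custom attribute names are illustrative assumptions.
def lambda_handler(event, context):
    attrs = event["request"]["userAttributes"]
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {
            # hypothetical custom attributes mapped from Active Directory
            "initials": attrs.get("custom:initials", ""),
            "department": attrs.get("custom:department", ""),
        }
    }
    return event
```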

Example 2: Lake Formation name-based access controls

In the data management solution named Novo Nordisk Enterprise Datahub (NNEDH), which we introduced in the first post, we use Lake Formation to enable standardized data access. The NNEDH datasets are registered in the Lake Formation Data Catalog as databases and tables, and permissions are granted using the named resource method. The following screenshot shows an example of these permissions.

Lakeformation named resource method for permissions management

In this approach, data access governance is delegated to Lake Formation. Every data domain in NNEDH has isolated permissions synthesized by NNEDH as the central governance management layer. This is a similar pattern to what is adopted for other domain-oriented data management solutions. Refer to Use an event-driven architecture to build a data mesh on AWS for an example of tag-based access control in Lake Formation.

These patterns don’t exclude implementations of peer-to-peer type data sharing mechanisms, such as those that can be achieved using AWS Resource Access Manager (AWS RAM), where a single IAM role session can have permissions that span across accounts.

Delegating role access to the consumption layer

The following figure illustrates the data access workflow from an external service.

Data access workflow from external service

The workflow steps are as follows:

  1. A user authenticates on an IdP used by the analytics tool that they are trying to access. A wide range of analytics tools are supported by Novo Nordisk platform, such as Databricks and JupyterHub, and the IdP can be either SAML or OIDC type depending on the capabilities of the third-party tool. In this example, an Okta SAML application is used to sign into a third-party analytics tool, and an IAM SAML IdP is configured in the data domain AWS account to federate with the external IdP. The third post of this series describes how to set up an Okta SAML application for IAM role federation on Athena.
  2. The SAML assertion obtained during the sign-in process is used to request temporary security credentials for an IAM role through the AssumeRole operation. In this example, the SAML assertion is used on the AssumeRoleWithSAML operation. For OpenID Connect-compatible IdPs, the AssumeRoleWithWebIdentity operation must be used with the JWT. The SAML attributes in the assertion or the claims in the token can be generated at sign-in time to ensure that the group memberships are forwarded, for the ABAC policy pattern described in the following sections.
  3. The analytics tool, such as Databricks or JupyterHub, abstracts the usage of the IAM role session credentials in the tool itself, and data can be accessed directly according to the permissions of the IAM role assumed. This pattern is similar in nature to IAM passthrough as implemented by Databricks, but in Novo Nordisk it’s extended across all analytics services. In this example, the analytics tool accesses the data lake on Amazon Simple Storage Service (Amazon S3) through Athena queries.

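Step 2 can be sketched with boto3. The role ARN, IdP ARN, and the base64-encoded assertion are placeholders for whatever the analytics tool supplies:

```python
# Sketch of step 2: exchanging a SAML assertion for temporary credentials.
# The ARNs and the assertion are placeholders.

def build_assume_role_request(role_arn, principal_arn, saml_assertion_b64, duration=3600):
    """Pure helper: the parameters for sts.assume_role_with_saml."""
    return {
        "RoleArn": role_arn,
        "PrincipalArn": principal_arn,
        "SAMLAssertion": saml_assertion_b64,
        "DurationSeconds": duration,
    }

def assume_role_with_saml(role_arn, principal_arn, saml_assertion_b64):
    import boto3  # deferred so the helper above is testable without AWS access
    sts = boto3.client("sts")
    resp = sts.assume_role_with_saml(
        **build_assume_role_request(role_arn, principal_arn, saml_assertion_b64)
    )
    return resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken, Expiration
```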
As the data mesh pattern expands across domains covering more downstream services, we need a mechanism to keep IdPs and IAM role trusts continuously updated. We come back to this part later in the post, but first we explain how role access is managed at scale.

Attribute-based trust policies

In previous sections, we emphasized that this architecture relies on IAM roles for data access control. Each data management platform can implement its own data access control method using IAM roles, such as identity-based policies or Lake Formation access control. For data consumption, it’s crucial that these IAM roles are only assumable by users that are part of Active Directory groups with the appropriate entitlements to use the role. To implement this at scale, the IAM role’s trust policy uses ABAC.

When a user authenticates on the external IdP of the consumption layer, we add in the access token a claim derived from their Active Directory groups. This claim is propagated by the AssumeRole operation into the trust policy of the IAM role, where it is compared with the expected Active Directory group. Only users that belong to the expected groups can assume the role. This mechanism is illustrated in the following figure.

Architecture of the integration with the identity provider

Translating group membership to attributes

To enforce the group membership entitlement at the role assumption level, we need a way to compare the required group membership with the group memberships that a user comes with in their IAM role session. To achieve this, we use a form of ABAC, where we have a way to represent the sum of context-relevant group memberships in a single attribute. A single IAM role session tag value is limited to 256 characters. The corresponding limit for SAML assertions is 100,000 characters, so for systems where a very large number of either roles or group-type mappings are required, SAML can support a wider range of configurations.

In our case, we have opted for a compression algorithm that takes a group name and compresses it to a 4-character string hash. This means that, together with a group-separation character, we can fit 51 groups in a single attribute. This drops to approximately 20 groups for OIDC-type role assumption due to the PackedPolicySize, but is higher for a SAML-based flow. This has proven sufficient for our case. There is a risk that two different groups could hash to the same character combination; however, we have verified that there are no collisions among the existing groups. To mitigate this risk going forward, we have introduced guardrails in multiple places. First, before adding new group entitlements in the virtualization layer, we check whether there's a hash collision with any existing group. When someone attempts to add a duplicated group, our service team is notified and we can react accordingly. But as stated earlier, the probability of clashes is low, so the flexibility this provides outweighs the overhead of managing clashes (we have not had any yet). We also enforce this at SAML assertion creation time, ensuring that there are no duplicated groups in the user's group list; in cases of duplication, we remove both entirely. This means malicious actors can at most limit the access of other users, but not gain unauthorized access.
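The hashing and packing scheme can be sketched as follows. The specific hash function (truncated SHA-256) and separator character are assumptions standing in for the actual algorithm:

```python
import hashlib

TAG_LIMIT = 256   # max characters in one IAM session tag value
HASH_LEN = 4      # compressed group-name length
SEP = ":"         # group-separation character (illustrative choice)

def group_hash(group_name: str) -> str:
    """Compress an AD group name to a 4-character hash (scheme is illustrative)."""
    return hashlib.sha256(group_name.encode("utf-8")).hexdigest()[:HASH_LEN]

def pack_groups(group_names):
    """Pack hashed group memberships into a single session-tag value.
    Duplicated hashes are removed entirely, so a collision can only
    reduce access, never widen it."""
    hashes = [group_hash(g) for g in group_names]
    unique = [h for h in hashes if hashes.count(h) == 1]
    packed = SEP.join(unique)
    if len(packed) > TAG_LIMIT:
        raise ValueError("too many groups for one session tag")
    return packed
```

With 4-character hashes plus one separator, 51 groups occupy 51 × 4 + 50 = 254 characters, just under the 256-character tag limit.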

Enforcing audit functionality across sessions

As mentioned in the first post, on top of governance, there are strict requirements around auditability of data accesses. This means that for all data access requests, it must be possible to trace the specific user across services and retain this information. We achieve this by setting (and enforcing) a source identity for all role sessions and make sure to propagate enterprise identity to this attribute. We use a combination of Okta inline hooks and SAML session tags to achieve this. This means that the AWS CloudTrail logs for an IAM role session have the following information:

{
    "eventName": "AssumeRoleWithSAML",
    "requestParameters": {
        "SAMLAssertionlD": "id1111111111111111111111111",
        "roleSessionName": "[email protected]",
        "principalTags": {
            "nn-initials": "user",
            "department": "NNDepartment",
            "GroupHash": "xxxx",
            "email": "[email protected]",
            "cost-center": "9999"
        },
        "sourceIdentity": "[email protected]",
        "roleArn": "arn:aws:iam::111111111111:role/your-assumed-role",
        "principalArn": "arn:aws:iam,111111111111:saml-provider/your-saml-provider",
        ...
    },
    ...
}

On the IAM role level, we can enforce the required attribute configuration with the following example trust policy. This is an example for a SAML-based app. We support the same patterns through OpenID Connect IdPs.

We now go through the elements of an IAM role trust policy, based on the following example:

{
    "Version": "2008-10-17",
    "Statement": {
        "Effect": "Allow",
        "Principal": {
            "Federated": [SAML_IdP_ARN]
        },
        "Action": [
            "sts:AssumeRoleWithSAML",
            "sts:TagSession",
            "sts:SetSourceIdentity"
        ],
        "Condition": {
            "StringEquals": {
                "SAML:aud": "https://signin.aws.amazon.com/saml"
            },
            "StringLike": {
                "sts:SourceIdentity": "*@novonordisk.com",
                "aws:RequestTag/GroupHash": ["*xxxx*"]
            },
            "StringNotLike": {
                "sts:SourceIdentity": "*"
            }
        }
    }
}

The policy contains the following details:

  • The Principal statement should point to the list of apps that are served through the consumption layer. These can be Azure app registrations, Okta apps, or Amazon Cognito app clients. This means that SAML assertions (in the case of SAML-based flows) minted from these applications can be used to run the AssumeRoleWithSAML operation if the remaining elements are also satisfied.
  • The Action statement includes the required permissions for the AssumeRole call to succeed, including adding the contextual information to the role session.
  • In the first condition, the audience of the assertion needs to be targeting AWS.
  • In the second condition, there are two StringLike requirements:
    • A requirement on the source identity as the naming convention to follow at Novo Nordisk (users must come with enterprise identity, following our audit requirements).
    • The aws:RequestTag/GroupHash needs to be xxxx, which represents the hashed group name mentioned in the preceding section.
  • Lastly, we enforce that sessions can’t be started without setting the source identity.

This policy enforces that all calls come from recognized services, include the required audit information, target the right audience, and are made by users with the right group memberships.

Building a central overview of governance and trust

In this section, we discuss how Novo Nordisk keeps track of the relevant group-role relations and maps these at sign-in time.

Entitlements

In Novo Nordisk, all accesses are based on Active Directory group memberships. There is no user-based access. Because this pattern is so central, we have extended this access philosophy into our data accesses. As mentioned earlier, at sign-in time, the hooks need to be able to know which roles to assume for a given user, given this user’s group membership. We have modeled this data in Amazon DynamoDB, where just-in-time provisioning ensures that only the required user group memberships are available. By building our application around the use of groups, and by having the group propagation done by the application code, we avoid having to make a more general Active Directory integration, which would, for a company the size of Novo Nordisk, severely impact the application, simply due to the volume of users and groups.

The DynamoDB entitlement table contains all relevant information for all roles and services, including role ARNs and IdP ARNs. This means that when users log in to their analytics services, the sign-in hook can construct the required information for the Roles SAML attribute.

When new data domains are added to the data management layer, the data management layer needs to communicate both the role information and the group name that gives access to the role.
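A sketch of the sign-in hook's entitlement lookup, assuming a DynamoDB table keyed on the hashed group name with role and IdP ARNs as attributes (the schema is our illustration, not NNEDH's actual one). The Roles SAML attribute expects comma-joined `role-ARN,provider-ARN` pairs:

```python
# Sketch of the entitlement lookup. The DynamoDB schema (partition key
# 'group_hash', attributes 'role_arn' and 'idp_arn') is an assumption.

def roles_attribute_values(entitlements):
    """Pure helper: format entitlement items as values for the SAML attribute
    https://aws.amazon.com/SAML/Attributes/Role ('role-ARN,provider-ARN')."""
    return [f"{e['role_arn']},{e['idp_arn']}" for e in entitlements]

def fetch_entitlements(table_name, group_hashes):
    import boto3  # deferred; requires AWS credentials at runtime
    from boto3.dynamodb.conditions import Key
    table = boto3.resource("dynamodb").Table(table_name)
    items = []
    for gh in group_hashes:
        resp = table.query(KeyConditionExpression=Key("group_hash").eq(gh))
        items.extend(resp["Items"])
    return items
```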

Single sign-on hub for analytics services

When scaling this permission model and data management pattern to a large enterprise such as Novo Nordisk, we ended up creating a large number of IAM roles distributed across different accounts. Then, a solution is required to map and provide access for end-users to the required IAM role. To simplify user access to multiple data sources and analytics tools, Novo Nordisk developed a single sign-on hub for analytics services. From the end-user perspective, this is a web interface that glues together different offerings in a unified system, making it a one-stop tool for data and analytics needs. When signing in to each of the analytical offerings, the authenticated sessions are forwarded, so users never have to reauthenticate.

Common for all the services supported in the consumption layer is that we can run a piece of application code at sign-in time, allowing sign-in time permissions to be calculated. The hooks that achieve this functionality can, for instance, be run by Okta inline hooks. This means that each of the target analytics services can have custom code to translate relevant contextual information or provide other types of automations for the role forwarding.

The sign-in flow is demonstrated in the following figure.

Sign-in flow

The workflow steps are as follows:

  1. A user accesses an analytical service such as Databricks in the Novo Nordisk analytics hub.
  2. The service uses Okta as the SAML-based IdP.
  3. Okta invokes an AWS Lambda-based SAML assertion inline hook.
  4. The hook uses the entitlement database, converting application-relevant group memberships into role entitlements.
  5. Relevant contextual information is returned from the entitlement database.
  6. The Lambda-based hook adds new SAML attributes to the SAML assertion, including the hashed group memberships and other contextual information such as source identity.
  7. A modified SAML assertion is used to sign users in to the analytical service.
  8. The user can now use the analytical tool with active IAM role sessions.
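Steps 4-6 can be sketched as the body of such a Lambda-based hook. The command shape follows Okta's SAML assertion inline hook patch format, but the claim names, paths, and values here are illustrative assumptions rather than the production configuration:

```python
# Sketch of an Okta SAML-assertion inline hook response that patches extra
# attributes into the assertion. Claim names and paths are illustrative.

def saml_hook_response(group_hash_value, source_identity):
    def claim(name, value):
        return {
            "op": "add",
            "path": f"/claims/{name}",
            "value": {
                "attributes": {
                    "NameFormat": "urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified"
                },
                "attributeValues": [
                    {"attributes": {"xsi:type": "xs:string"}, "value": value}
                ],
            },
        }
    return {
        "commands": [{
            "type": "com.okta.assertion.patch",
            "value": [
                claim("https://aws.amazon.com/SAML/Attributes/PrincipalTag:GroupHash",
                      group_hash_value),
                claim("https://aws.amazon.com/SAML/Attributes/SourceIdentity",
                      source_identity),
            ],
        }]
    }
```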

Synchronizing role trust

The preceding section gives an overview of how federation works in this solution. Now we can go through how we ensure that all participating AWS environments and accounts are in sync with the latest configuration.

From the end-user perspective, the synchronization mechanism must ensure that every analytics service instantiated can access the data domains assigned to the groups that the user belongs to. Also, changes in data domains—such as granting data access to an Active Directory group—must be effective immediately to every analytics service.

Two event-based mechanisms are used to maintain all the layers synchronized, as detailed in this section.

Synchronize data access control on the data management layer with changes to services in the consumption layer

As described in the previous section, the IAM roles used for data access are created and managed by the data management layer. These IAM roles have a trust policy providing federated access to the external IdPs used by the analytics tools of the consumption layer. This implies that for every new analytics service created with a different IdP, the IAM roles used for data access on data domains must be updated to trust this new IdP.

Using NNEDH as an example of a data management solution, the synchronization mechanism is demonstrated in the following figure.

Synchronization mechanism in a data management solution

Taking as an example a scenario where a new analytics service is created, the steps in this workflow are as follows:

  1. A user with access to the administration console of the consumption layer instantiates a new analytics service, such as JupyterHub.
  2. A job running on AWS Fargate creates the resources needed for this new analytics service, such as an Amazon Elastic Compute Cloud (Amazon EC2) instance for JupyterHub, and the IdP required, such as a new SAML IdP.
  3. When the IdP is created in the previous step, an event is added in an Amazon Simple Notification Service (Amazon SNS) topic with its details, such as name and SAML metadata.
  4. In the NNEDH control plane, a Lambda job is triggered by new events on this SNS topic. This job creates the IAM IdP, if needed, and updates the trust policy of the required IAM roles in all the AWS accounts used as data domains, adding the trust on the IdP used by the new analytics service.

In this architecture, all the update steps are event-triggered and scalable. This means that users of new analytics services can access their datasets almost instantaneously when they are created. In the same way, when a service is removed, the federation to the IdP is automatically removed if not used by other services.
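Step 4's trust policy update can be sketched as follows, assuming the single-statement trust policy shape shown earlier. The IAM `update_assume_role_policy` call is the real API, but error handling and cross-account role assumption are omitted:

```python
import json

def add_idp_to_trust_policy(trust_policy: dict, idp_arn: str) -> dict:
    """Return a copy of a role trust policy with the new SAML IdP added to the
    Federated principal list (idempotent). Simplified to a single Allow
    statement, matching the example trust policy shown earlier."""
    policy = json.loads(json.dumps(trust_policy))  # deep copy
    federated = policy["Statement"]["Principal"]["Federated"]
    if isinstance(federated, str):
        federated = [federated]
    if idp_arn not in federated:
        federated.append(idp_arn)
    policy["Statement"]["Principal"]["Federated"] = federated
    return policy

def apply_trust_policy(role_name: str, policy: dict) -> None:
    import boto3  # deferred; needs credentials in the target data domain account
    boto3.client("iam").update_assume_role_policy(
        RoleName=role_name, PolicyDocument=json.dumps(policy)
    )
```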

Propagate changes on data domains to analytics services

Changes to data domains, such as the creation of a new S3 bucket used as a dataset, or adding or removing data access for a group, must be reflected immediately on analytics services of the consumption layer. To accomplish this, a mechanism synchronizes the entitlement database with the relevant changes made in NNEDH. This flow is demonstrated in the following figure.

Changes propagation flow

Taking as an example a scenario where access to a specific dataset is granted to a new group, the steps in this workflow are as follows:

  1. Using the NNEDH admin console, a data owner approves a dataset sharing request that grants access on a dataset to an Active Directory group.
  2. In the AWS account of the related data domain, the dataset components such as the S3 bucket and Lake Formation are updated to provide data access to the new group. The cross-account data sharing in Lake Formation uses AWS RAM.
  3. An event is added in an SNS topic with the current details about this dataset, such as the location of the S3 bucket and the groups that currently have access to it.
  4. In the virtualization layer, the updated information from the data management layer is used to update the entitlement database in DynamoDB.

These steps make sure that changes on data domains are automatically and immediately reflected on the entitlement database, which is used to provide data access to all the analytics services of the consumption layer.

Limitations

Many of these patterns rely on the analytical tool to support a clever use of IAM roles. When this is not the case, the platform teams themselves need to develop custom functionality at the host level to ensure that role accesses are correctly controlled. This, for example, includes writing custom authenticators for JupyterHub.

Conclusion

This post shows an approach to building a scalable and secure data and analytics platform. It showcases some of the mechanisms used at Novo Nordisk and how to strike the right balance between freedom and control. The architecture laid out in the first post in this series enables layer independence, and exposes some extremely useful primitives for data access and governance. We make heavy use of contextual attributes to modulate role permissions at the session level, which provide just-in-time permissions. These permissions are propagated at scale, across data domains. The upside is that much of the complexity related to managing data access permissions can be delegated to the relevant business groups, while enabling the end-user consumers of data to think as little as possible about data access and focus on providing value for the business use cases. In the case of Novo Nordisk, this helps them provide better outcomes for patients and accelerate innovation.

The next post in this series describes how end-users can consume data from their analytics tool of choice, aligned with the data access controls detailed in this post.


About the Authors

Jonatan Selsing is a former research scientist with a PhD in astrophysics who has turned to the cloud. He is currently the Lead Cloud Engineer at Novo Nordisk, where he enables data and analytics workloads at scale. With an emphasis on reducing the total cost of ownership of cloud-based workloads, while giving full benefit of the advantages of cloud, he designs, builds, and maintains solutions that enable research for future medicines.

Hassen Riahi is a Sr. Data Architect at AWS Professional Services. He holds a PhD in Mathematics & Computer Science on large-scale data management. He works with AWS customers on building data-driven solutions.

Alessandro Fior is a Sr. Data Architect at AWS Professional Services. He is passionate about designing and building modern and scalable data platforms that accelerate companies to extract value from their data.

Moses Arthur comes from a mathematics and computational research background and holds a PhD in Computational Intelligence specialized in Graph Mining. He is currently a Cloud Product Engineer at Novo Nordisk, building GxP-compliant enterprise data lakes and analytics platforms for Novo Nordisk global factories producing digitalized medical products.

Anwar Rizal is a Senior Machine Learning consultant based in Paris. He works with AWS customers to develop data and AI solutions to sustainably grow their business.

Kumari Ramar is an Agile certified and PMP certified Senior Engagement Manager at AWS Professional Services. She delivers data and AI/ML solutions that speed up cross-system analytics and machine learning models, which enable enterprises to make data-driven decisions and drive new innovations.

Build a transactional data lake using Apache Iceberg, AWS Glue, and cross-account data shares using AWS Lake Formation and Amazon Athena

Post Syndicated from Vikram Sahadevan original https://aws.amazon.com/blogs/big-data/build-a-transactional-data-lake-using-apache-iceberg-aws-glue-and-cross-account-data-shares-using-aws-lake-formation-and-amazon-athena/

Building a data lake on Amazon Simple Storage Service (Amazon S3) provides numerous benefits for an organization. It allows you to access diverse data sources, build business intelligence dashboards, build AI and machine learning (ML) models to provide customized customer experiences, and accelerate the curation of new datasets for consumption by adopting a modern data architecture or data mesh architecture.

However, many use cases, like performing change data capture (CDC) from an upstream relational database to an Amazon S3-based data lake, require handling data at a record level. Performing an operation like inserting, updating, and deleting individual records from a dataset requires the processing engine to read all the objects (files), make the changes, and rewrite entire datasets as new files. Furthermore, making the data available in the data lake in near-real time often leads to the data being fragmented over many small files, resulting in poor query performance and compaction maintenance.

In 2022, we announced that you can enforce fine-grained access control policies using AWS Lake Formation and query data stored in any supported file format using table formats such as Apache Iceberg, Apache Hudi, and more using Amazon Athena queries. You get the flexibility to choose the table and file format best suited for your use case and get the benefit of centralized data governance to secure data access when using Athena.

In this post, we show you how to configure Lake Formation using Iceberg table formats. We also explain how to upsert and merge in an S3 data lake using an Iceberg framework and apply Lake Formation access control using Athena.

Iceberg is an open table format for very large analytic datasets. Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. The Iceberg specification allows seamless table evolution such as schema and partition evolution, and its design is optimized for usage on Amazon S3. Iceberg also helps guarantee data correctness under concurrent write scenarios.

Solution overview

To explain this setup, we present the following architecture, which integrates Amazon S3 for the data lake (Iceberg table format), Lake Formation for access control, AWS Glue for ETL (extract, transform, and load), and Athena for querying the latest inventory data from the Iceberg tables using standard SQL.

The solution workflow consists of the following steps, including data ingestion (Steps 1–3), data governance (Step 4), and data access (Step 5):

  1. We use AWS Database Migration Service (AWS DMS) or a similar tool to connect to the data source and move incremental data (CDC) to Amazon S3 in CSV format.
  2. An AWS Glue PySpark job reads the incremental data from the S3 input bucket and performs deduplication of the records.
  3. The job then invokes Iceberg’s MERGE statements to merge the data with the target S3 bucket.
  4. We use the AWS Glue Data Catalog as a centralized catalog, which is used by AWS Glue and Athena. An AWS Glue crawler is integrated on top of S3 buckets to automatically detect the schema. Lake Formation allows you to centrally manage permissions and access control for Data Catalog resources in your S3 data lake. You can use fine-grained access control in Lake Formation to restrict access to data in query results.
  5. We use Athena integrated with Lake Formation to query data from the Iceberg table using standard SQL and validate table- and column-level access on Iceberg tables.

For this solution, we assume that the raw data files are already available in Amazon S3, and focus on processing the data using AWS Glue with Iceberg table format. We use sample item data that has the following attributes:

  • op – This represents the operation on the source record. This shows values I to represent insert operations, U to represent updates, and D to represent deletes. You need to make sure this attribute is included in your CDC incremental data before it gets written to Amazon S3. Make sure you capture this attribute, so that your ETL logic can take appropriate action while merging it.
  • product_id – This is the primary key column in the source data table.
  • category – This column represents the category of an item.
  • product_name – This is the name of the product.
  • quantity_available – This is the quantity available in the inventory. When we showcase the incremental data for UPSERT or MERGE, we reduce the quantity available for the product to showcase the functionality.
  • last_update_time – This is the time when the item record was updated at the source data.

We demonstrate implementing the solution with the following steps:

  1. Create an S3 bucket for input and output data.
  2. Create input and output tables using Athena.
  3. Insert the data into the Iceberg table from Athena.
  4. Query the Iceberg table using Athena.
  5. Upload incremental (CDC) data for further processing.
  6. Run the AWS Glue job again to process the incremental files.
  7. Query the Iceberg table again using Athena.
  8. Define Lake Formation policies.

Prerequisites

For Athena queries, we need to configure an Athena workgroup with engine version 3 to support Iceberg table format.
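A workgroup like this can be created with boto3's `create_work_group`; the workgroup name and query result location below are placeholders:

```python
# Sketch of creating an Athena workgroup pinned to engine version 3,
# which Iceberg table support requires. Name and output location are
# placeholders.

def workgroup_config(output_location: str) -> dict:
    """Pure helper: workgroup configuration pinned to Athena engine version 3."""
    return {
        "EngineVersion": {"SelectedEngineVersion": "Athena engine version 3"},
        "ResultConfiguration": {"OutputLocation": output_location},
    }

def create_workgroup(name: str, output_location: str) -> None:
    import boto3  # deferred; requires AWS credentials at runtime
    boto3.client("athena").create_work_group(
        Name=name, Configuration=workgroup_config(output_location)
    )
```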

To validate cross-account access through Lake Formation for Iceberg table, in this post we used two accounts (primary and secondary).

Now let’s dive into the implementation steps.

Create an S3 bucket for input and output data

Before we run the AWS Glue job, we have to upload the sample CSV files to the input bucket and process them with AWS Glue PySpark code for the output.

To create an S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Specify the bucket name as iceberg-blog and leave the remaining fields as default.

S3 bucket names are globally unique. While implementing the solution, you may get an error saying the bucket name already exists. Make sure to provide a unique name and use that same name throughout the rest of the implementation steps. Formatting the bucket name as <Bucket-Name>-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE} might help you get a unique name.
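That naming pattern can be sketched as a small helper; the base name is whatever you chose in the step above:

```python
# Sketch of the <Bucket-Name>-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE} pattern
# for deriving a bucket name that is unique per account and Region.

def unique_bucket_name(base: str, account_id: str, region: str) -> str:
    name = f"{base}-{account_id}-{region}".lower()
    if not (3 <= len(name) <= 63):  # S3 bucket name length limits
        raise ValueError("S3 bucket names must be 3-63 characters")
    return name
```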

  1. On the bucket details page, choose Create folder.
  2. Create two subfolders. For this post, we create iceberg-blog/raw-csv-input and iceberg-blog/iceberg-output.
  3. Upload the LOAD00000001.csv file into the raw-csv-input folder.

The following screenshot provides a sample of the input dataset.

Create input and output tables using Athena

To create input and output Iceberg tables in the AWS Glue Data Catalog, open the Athena query editor and run the following queries in sequence:

-- Create database for the demo
CREATE DATABASE iceberg_lf_db;

As we explain later in this post, it's essential to register the data locations when incorporating Lake Formation access controls.

-- Create external table in input CSV files. Replace the S3 path with your bucket name
CREATE EXTERNAL TABLE iceberg_lf_db.csv_input(
op string,
product_id bigint,
category string,
product_name string,
quantity_available bigint,
last_update_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://glue-iceberg-demo/raw-csv-input/'
TBLPROPERTIES (
'areColumnsQuoted'='false',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'typeOfData'='file');

-- Create output Iceberg table with partitioning. Replace the S3 bucket name with your bucket name
CREATE TABLE iceberg_lf_db.iceberg_table_lf (
product_id bigint,
category string,
product_name string,
quantity_available bigint,
last_update_time timestamp)
PARTITIONED BY (category, bucket(16,product_id))
LOCATION 's3://glue-iceberg-demo/iceberg_blog/iceberg-output/'
TBLPROPERTIES (
'table_type'='ICEBERG',
'format'='parquet',
'write_target_data_file_size_bytes'='536870912'
);

-- Validate the input data
SELECT * FROM iceberg_lf_db.csv_input;

SELECT * FROM iceberg_lf_db.iceberg_table_lf;

Alternatively, you can use an AWS Glue crawler to create the table definition for the input files.

Insert the data into the Iceberg table from Athena

Optionally, we can insert data into the Iceberg table through Athena using the following code:

insert into iceberg_lf_db.iceberg_table_lf (product_id,category,product_name,quantity_available,last_update_time) values (200,'Mobile','Mobile brand 1',25,cast('2023-01-19 09:51:40' as timestamp));
insert into iceberg_lf_db.iceberg_table_lf (product_id,category,product_name,quantity_available,last_update_time) values (201,'Laptop','Laptop brand 1',20,cast('2023-01-19 09:51:40' as timestamp));
insert into iceberg_lf_db.iceberg_table_lf (product_id,category,product_name,quantity_available,last_update_time) values (202,'Tablet','Kindle',30,cast('2023-01-19 09:51:41' as timestamp));
insert into iceberg_lf_db.iceberg_table_lf (product_id,category,product_name,quantity_available,last_update_time) values (203,'Speaker','Alexa',10,cast('2023-01-19 09:51:42' as timestamp));
insert into iceberg_lf_db.iceberg_table_lf (product_id,category,product_name,quantity_available,last_update_time) values (204,'Speaker','Alexa',50,cast('2023-01-19 09:51:43' as timestamp));

For this post, we load the data using an AWS Glue job. Complete the following steps to create the job:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose Create job.
  3. Select Visual with a blank canvas.
  4. Choose Create.
  5. Choose Edit script.
  6. Replace the script with the following script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

from pyspark.sql.functions import *
from awsglue.dynamicframe import DynamicFrame

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, max

from pyspark.conf import SparkConf

args = getResolvedOptions(sys.argv, ["JOB_NAME", "iceberg_job_catalog_warehouse"])
conf = SparkConf()

## Set the warehouse location for the Iceberg catalog.
## Make sure to pass the runtime argument
## --iceberg_job_catalog_warehouse with the S3 path as its value.
conf.set("spark.sql.catalog.job_catalog.warehouse", args['iceberg_job_catalog_warehouse'])
conf.set("spark.sql.catalog.job_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.job_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.job_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
conf.set("spark.sql.iceberg.handle-timestamp-without-timezone","true")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)


## Read the input table
## glueContext.create_data_frame.from_catalog can be more
## performant and can be used in place of
## create_dynamic_frame.from_catalog.

IncrementalInputDyF = glueContext.create_dynamic_frame.from_catalog(database = "iceberg_lf_db", table_name = "csv_input", transformation_ctx = "IncrementalInputDyF")
IncrementalInputDF = IncrementalInputDyF.toDF()

if not IncrementalInputDF.rdd.isEmpty():
    ## Apply de-duplication logic on the input data to pick up the latest record based on timestamp and operation
    IDWindowDF = Window.partitionBy(IncrementalInputDF.product_id).orderBy(IncrementalInputDF.last_update_time).rangeBetween(-sys.maxsize, sys.maxsize)

    # Add a new column to capture the latest timestamp per product_id
    inputDFWithTS = IncrementalInputDF.withColumn("max_op_date", max(IncrementalInputDF.last_update_time).over(IDWindowDF))

    # Filter out new records that are inserted, then select the latest record from existing records and merge both to get the deduplicated output
    NewInsertsDF = inputDFWithTS.filter("last_update_time=max_op_date").filter("op='I'")
    UpdateDeleteDf = inputDFWithTS.filter("last_update_time=max_op_date").filter("op IN ('U','D')")
    finalInputDF = NewInsertsDF.unionAll(UpdateDeleteDf)

    # Register the deduplicated input as a temporary view to use in Iceberg Spark SQL statements
    finalInputDF.createOrReplaceTempView("incremental_input_data")
    finalInputDF.show()

    ## Perform a merge operation on the incremental input data with MERGE INTO. This section uses Spark SQL to showcase Iceberg's expressive SQL approach to merges
    IcebergMergeOutputDF = spark.sql("""
    MERGE INTO job_catalog.iceberg_lf_db.iceberg_table_lf t
    USING (SELECT op, product_id, category, product_name, quantity_available, to_timestamp(last_update_time) as last_update_time FROM incremental_input_data) s
    ON t.product_id = s.product_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.quantity_available = s.quantity_available, t.last_update_time = s.last_update_time
    WHEN NOT MATCHED THEN INSERT (product_id, category, product_name, quantity_available, last_update_time) VALUES (s.product_id, s.category, s.product_name, s.quantity_available, s.last_update_time)
    """)

job.commit()
  1. On the Job details tab, specify the job name (iceberg-lf).
  2. For IAM Role, assign an AWS Identity and Access Management (IAM) role that has the required permissions to run an AWS Glue job and read and write to the S3 bucket.
  3. For Glue version, choose Glue 4.0 (Glue 3.0 is also supported).
  4. For Language, choose Python 3.
  5. Make sure Job bookmark has the default value of Enable.
  6. For Job parameters, add the following:
    1. Add the key --datalake-formats with the value iceberg.
    2. Add the key--iceberg_job_catalog_warehouse with the value as your S3 path (s3://<bucket-name>/<iceberg-warehouse-path>).
  7. Choose Save and then Run, which should write the input data to the Iceberg table with a MERGE statement.
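The job setup above can also be scripted. The following boto3 sketch mirrors those settings; the IAM role name and script location are placeholders you would replace with your own:

```python
# Placeholder role and script location; substitute your own values.
job_params = {
    "Name": "iceberg-lf",
    "Role": "GlueJobRole",                                               # assumed IAM role
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://glue-iceberg-demo/scripts/iceberg-lf.py", # assumed path
        "PythonVersion": "3",
    },
    "GlueVersion": "4.0",
    "DefaultArguments": {
        "--datalake-formats": "iceberg",
        "--iceberg_job_catalog_warehouse": "s3://glue-iceberg-demo/iceberg_blog/",
        "--job-bookmark-option": "job-bookmark-enable",
    },
}

def create_and_run_job(params):
    # boto3 is imported lazily so the sketch can be read without AWS credentials.
    import boto3
    glue = boto3.client("glue")
    glue.create_job(**params)
    return glue.start_job_run(JobName=params["Name"])["JobRunId"]

# To execute against AWS (requires credentials and permissions):
# run_id = create_and_run_job(job_params)
```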

Query the Iceberg table using Athena

After you have successfully run the AWS Glue job, you can validate the output in Athena with the following SQL query:

SELECT * FROM iceberg_lf_db.iceberg_table_lf limit 10;

The output of the query should match the input, with one difference: the Iceberg output table doesn’t have the op column.
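To run the same validation programmatically, a small boto3 helper can submit the query and poll Athena until it completes; the results output location below is an assumption:

```python
import time

def run_athena_query(sql, database, output_location):
    """Start an Athena query and block until it finishes, returning the final state."""
    # boto3 is imported lazily so the sketch can be read without AWS credentials.
    import boto3
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)

validation_sql = "SELECT * FROM iceberg_lf_db.iceberg_table_lf limit 10;"
# Requires AWS credentials; the output location is hypothetical:
# run_athena_query(validation_sql, "iceberg_lf_db", "s3://glue-iceberg-demo/athena-results/")
```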

Upload incremental (CDC) data for further processing

After we process the initial full load file, let’s upload an incremental file.

This file includes updated records on two items.

Run the AWS Glue job again to process incremental files

Because the AWS Glue job has bookmarks enabled, the job picks up the new incremental file and performs a MERGE operation on the Iceberg table.

To run the job again, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Select the job and choose Run.

For this post, we run the job manually, but you can configure your AWS Glue jobs to run as part of an AWS Glue workflow or via AWS Step Functions (for more information, see Manage AWS Glue Jobs with Step Functions).

Query the Iceberg table using Athena after incremental data processing

When the incremental data processing is complete, you can run the same SELECT statement again and validate that the quantity value is updated for items 200 and 201.

The following screenshot shows the output.

Define Lake Formation policies

For data governance, we use Lake Formation. Lake Formation is a fully managed service that simplifies data lake setup, supports centralized security management, and provides transactional access on top of your data lake. Moreover, it enables data sharing across accounts and organizations. There are two ways to share data resources in Lake Formation: named resource access control (NRAC) and tag-based access control (TBAC). NRAC uses AWS Resource Access Manager (AWS RAM) to share data resources across accounts using Lake Formation V3. Those are consumed via resource links that are based on created resource shares. Lake Formation tag-based access control (LF-TBAC) is another approach to share data resources in Lake Formation, which defines permissions based on attributes. These attributes are called LF-tags.

In this example, we create databases in the primary account. Our NRAC database is shared with a data domain via AWS RAM. Access to data tables that we register in this database will be handled through NRAC.
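The console grants shown in the following sections can also be issued through the Lake Formation API. This is a sketch with a placeholder consumer account ID:

```python
# Placeholder consumer account ID; substitute your secondary account.
CONSUMER_ACCOUNT = "111122223333"

database_grant = {
    "Principal": {"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT},
    "Resource": {"Database": {"Name": "iceberg_lf_db"}},
    "Permissions": ["DESCRIBE"],
}
table_grant = {
    "Principal": {"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT},
    "Resource": {"Table": {"DatabaseName": "iceberg_lf_db", "Name": "iceberg_table_lf"}},
    "Permissions": ["SELECT", "DESCRIBE"],
}

def apply_grants(*grants):
    # boto3 is imported lazily so the sketch can be read without AWS credentials.
    import boto3
    lf = boto3.client("lakeformation")
    for g in grants:
        lf.grant_permissions(**g)

# Requires AWS credentials and Lake Formation admin rights:
# apply_grants(database_grant, table_grant)
```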

Configure access controls in the primary account

In the primary account, complete the following steps to set up access controls using Lake Formation:

  1. On the Lake Formation console, choose Data lake locations in the navigation pane.
  2. Choose Register location.
  3. Enter the Iceberg table’s Amazon S3 location path, as shown in the following screenshot.

Grant access to the database to the secondary account

To grant database access to the external (secondary) account, complete the following steps:

  1. On the Lake Formation console, navigate to your database.
  2. On the Actions menu, choose Grant.
  3. Choose External accounts and enter the secondary account number.
  4. Select Named data catalog resources.
  5. Verify the database name.

The first grant should be at database level, and the second grant is at table level.

  1. For Database permissions, specify your permissions (for this post, we select Describe).
  2. Choose Grant.

Now you need to grant permissions at the table level.

  1. Select External accounts and enter the secondary account number.
  2. Select Named data catalog resources.
  3. Verify the table name.
  4. For Table permissions, specify the permissions you want to grant. For this post, we select Select and Describe.
  5. Choose Grant.

If you see the following error, you must revoke IAMAllowedPrincipals from the data lake permissions.

To do so, select IAMAllowedPrincipals and choose Revoke.

Choose Revoke again to confirm.

After you revoke the data permissions, the permissions should appear as shown in the following screenshot.
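The same revocation can be scripted; a minimal sketch using the Lake Formation API:

```python
# Revoke the default IAMAllowedPrincipals grants so Lake Formation
# permissions take effect on the database and table.
revocations = [
    {"Resource": {"Database": {"Name": "iceberg_lf_db"}},
     "Permissions": ["ALL"]},
    {"Resource": {"Table": {"DatabaseName": "iceberg_lf_db", "Name": "iceberg_table_lf"}},
     "Permissions": ["ALL"]},
]

def revoke_iam_allowed_principals(revocations):
    # boto3 is imported lazily so the sketch can be read without AWS credentials.
    import boto3
    lf = boto3.client("lakeformation")
    for r in revocations:
        lf.revoke_permissions(
            Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
            **r,
        )

# Requires AWS credentials and Lake Formation admin rights:
# revoke_iam_allowed_principals(revocations)
```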

Add AWS Glue IAM role permissions

Because IAMAllowedPrincipals was revoked, the AWS Glue IAM role that the AWS Glue job uses needs to be granted access explicitly, as shown in the following screenshot.

You need to repeat these steps for the AWS Glue IAM role at table level.

Verify the permissions granted to the AWS Glue IAM role on the Lake Formation console.

Access the shared Iceberg table in the secondary account

In the secondary account, complete the following steps to accept the resource share and access the shared Iceberg table.

  1. On the AWS RAM console, choose Resource shares in the navigation pane.
  2. Choose the resource shares invitation sent from the primary account.
  3. Choose Accept resource share.

The resource status should now be active.

Next, you need to create a resource link for the shared Iceberg table and access through Athena.

  1. On the Lake Formation console, choose Tables in the navigation pane.
  2. Select the Iceberg table (shared from the primary account).
  3. On the Actions menu, choose Create resource link.
  4. For Resource link name, enter a name (for this post, iceberg_table_lf_demo).
  5. For Database, choose your database and verify the shared table and database are automatically populated.
  6. Choose Create.
  7. Select your table and on the Actions menu, choose View data.

You’re redirected to the Athena console, where you can query the data.
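If you prefer the API, a resource link is a Glue table whose TableInput carries a TargetTable pointer at the shared table. A sketch with a placeholder producer account ID:

```python
PRODUCER_ACCOUNT = "444455556666"  # placeholder primary (producer) account ID

resource_link = {
    "DatabaseName": "iceberg_lf_db",  # local database in the secondary account
    "TableInput": {
        "Name": "iceberg_table_lf_demo",
        "TargetTable": {
            "CatalogId": PRODUCER_ACCOUNT,
            "DatabaseName": "iceberg_lf_db",
            "Name": "iceberg_table_lf",
        },
    },
}

def create_resource_link(params):
    # boto3 is imported lazily so the sketch can be read without AWS credentials.
    import boto3
    boto3.client("glue").create_table(**params)

# Requires AWS credentials in the secondary account:
# create_resource_link(resource_link)
```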

Grant column-based access in the primary account

For column-level restricted access, you need to grant access at the column level on the Iceberg table. Complete the following steps:

  1. On the Lake Formation console, navigate to your database.
  2. On the Actions menu, choose Grant.
  3. Select External accounts and enter the secondary account number.
  4. Select Named data catalog resources.
  5. Verify the table name.
  6. For Table permissions, choose the permissions you want to grant. For this post, we select Select.
  7. Under Data permissions, choose Column-based access.
  8. Select Include columns and choose your permission filters (for this post, Category and Quantity_available).
  9. Choose Grant.

Data with restricted columns can now be queried through the Athena console.
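In the Lake Formation API, a column-level grant like the one above uses the TableWithColumns resource. A hedged sketch with a placeholder account ID:

```python
CONSUMER_ACCOUNT = "111122223333"  # placeholder secondary account ID

column_grant = {
    "Principal": {"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT},
    "Resource": {
        "TableWithColumns": {
            "DatabaseName": "iceberg_lf_db",
            "Name": "iceberg_table_lf",
            # Only these columns are shared, matching the console steps above.
            "ColumnNames": ["category", "quantity_available"],
        }
    },
    "Permissions": ["SELECT"],
}

def grant_columns(grant):
    # boto3 is imported lazily so the sketch can be read without AWS credentials.
    import boto3
    boto3.client("lakeformation").grant_permissions(**grant)

# Requires AWS credentials and Lake Formation admin rights:
# grant_columns(column_grant)
```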

Clean up

To avoid incurring ongoing costs, complete the following steps to clean up your resources:

  1. In your secondary account, log in to the Lake Formation console.
  2. Drop the resource share table.
  3. In your primary account, log in to the Lake Formation console.
  4. Revoke the access you configured.
  5. Drop the AWS Glue tables and database.
  6. Delete the AWS Glue job.
  7. Delete the S3 buckets and any other resources that you created as part of the prerequisites for this post.

Conclusion

This post explains how you can use the Iceberg framework with AWS Glue and Lake Formation to define cross-account access controls and query data using Athena. It provides an overview of Iceberg and its features and integration approaches, and explains how you can ingest data, grant cross-account access, and query data through a step-by-step guide.

We hope this gives you a great starting point for using Iceberg to build your data lake platform along with AWS analytics services to implement your solution.


About the Authors

Vikram Sahadevan is a Senior Resident Architect on the AWS Data Lab team. He enjoys efforts that focus around providing prescriptive architectural guidance, sharing best practices, and removing technical roadblocks with joint engineering engagements between customers and AWS technical resources that accelerate data, analytics, artificial intelligence, and machine learning initiatives.

Suvendu Kumar Patra possesses 18 years of experience in infrastructure, database design, and data engineering, and he currently holds the position of Senior Resident Architect at Amazon Web Services. He is a member of the specialized focus group, AWS Data Lab, and his primary duties entail working with executive leadership teams of strategic AWS customers to develop their roadmaps for data, analytics, and AI/ML. Suvendu collaborates closely with customers to implement data engineering, data hub, data lake, data governance, and EDW solutions, as well as enterprise data strategy and data management.

How Morningstar used tag-based access controls in AWS Lake Formation to manage permissions for an Amazon Redshift data warehouse

Post Syndicated from Don Drake original https://aws.amazon.com/blogs/big-data/how-morningstar-used-tag-based-access-controls-in-aws-lake-formation-to-manage-permissions-for-an-amazon-redshift-data-warehouse/

This post was co-written by Ashish Prabhu, Stephen Johnston, and Colin Ingarfield at Morningstar and Don Drake, at AWS.

With “Empowering Investor Success” as the core motto, Morningstar aims at providing our investors and advisors with the tools and information they need to make informed investment decisions.

In this post, Morningstar’s Data Lake Team Leads discuss how they utilized tag-based access control in their data lake with AWS Lake Formation and enabled similar controls in Amazon Redshift.

The business challenge

At Morningstar, we built a data lake solution that allows our consumers to easily ingest data, make it accessible via the AWS Glue Data Catalog, and grant access to consumers to query the data via Amazon Athena. In this solution, we were required to ensure that the consumers could only query the data to which they had explicit access. To enforce our access permissions, we chose Lake Formation tag-based access control (TBAC). TBAC helps us categorize the data into a simple, broad level or a complex, more granular level using tags and then grant consumers access to those tags based on what group of data they need. Tag-based entitlements allow us to have a flexible and manageable entitlements system that solves our complex entitlements scenarios.

However, our consumers pushed us for better query performance and enhanced analytical capabilities. We realized we needed a data warehouse to cater to all of these consumer requirements, so we evaluated Amazon Redshift. Amazon Redshift provides us with features that we could use to work with our consumers and enable their analytical requirements:

  • Better performance for consumers’ analytical requirements
  • Ability to tune query performance with user-specified sort keys and distribution keys
  • Ability to have different representations of the same data via views and materialized views
  • Consistent query performance regardless of concurrency

Many new Amazon Redshift features helped solve and scale our analytical query requirements, specifically Amazon Redshift Serverless and Amazon Redshift data sharing.

Because our Lake Formation-enforced data lake is a central data repository for all our data, it makes sense for us to flow the data permissions from the data lake into Amazon Redshift. We utilize AWS Identity and Access Management (IAM) authentication and want to centralize the governance of permissions based on IAM roles and groups. For each AWS Glue database and table, we have a corresponding Amazon Redshift schema and table. Our goal was to ensure customers who have access to AWS Glue tables via Lake Formation also have access to the corresponding tables in Amazon Redshift.

However, we faced a problem with user-based entitlements as we moved to Amazon Redshift.

The entitlements problem

Even though we added Amazon Redshift as part of our overall solution, the entitlement requirements and challenges that came with it remained the same for our users consuming via Lake Formation. At the same time, we had to find a way to implement entitlements in our Amazon Redshift data warehouse with the same set of tags that we had already defined in Lake Formation. Amazon Redshift supports resource-based entitlements but doesn’t support tag-based entitlements. The challenge we had to overcome was how to map our existing tag-based entitlements in Lake Formation to the resource-based entitlements in Amazon Redshift.

The data in the AWS Glue Data Catalog also needed to be loaded into native tables in the Amazon Redshift data warehouse. This was necessary so that users get the familiar list of schemas and tables they are accustomed to seeing in the Data Catalog when accessing via Athena. This way, our existing data lake consumers could easily transition to Amazon Redshift.

The following diagram illustrates the structure of the AWS Glue Data Catalog mapped 1:1 with the structure of our Amazon Redshift data warehouse.

Shows mapping of Glue databases and tables to Redshift schemas and tables.
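A small helper can capture this 1:1 naming convention. This is an illustrative sketch, not Morningstar's actual code:

```python
import re

# Redshift identifiers that need no quoting: lowercase letters, digits, underscores.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")

def redshift_target(glue_database: str, glue_table: str) -> str:
    """Map a Glue database/table pair to its Amazon Redshift schema.table
    name under a 1:1 naming convention, quoting anything non-standard."""
    def quote(name: str) -> str:
        return name if _IDENT.match(name) else '"{}"'.format(name.replace('"', '""'))
    return f"{quote(glue_database)}.{quote(glue_table)}"
```

For example, a hypothetical Glue table sales_db.daily_positions maps to the Redshift relation sales_db.daily_positions, while names with characters outside Redshift's standard identifier set are double-quoted.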

We wanted to utilize the ontology of tags in Lake Formation to also be used on the datasets in Amazon Redshift so that consumers could be granted access to the same datasets in both places. This enabled us to have a single entitlement policy source API that would grant appropriate access to both our Amazon Redshift tables as well as the corresponding Lake Formation tables based on the Lake Formation tag-based policies.

Entitlement Policy Source is used by Lake Formation and Redshift

To solve this problem, we needed to build our own solution to convert the tag-based policies in Lake Formation into grants and revokes in the resource-based entitlements in Amazon Redshift.

Solution overview

To solve this mismatch, we wanted to synchronize our Lake Formation tag ontology and classifications to the Amazon Redshift permission model. To do this, we map Lake Formation tags and grants to Amazon Redshift grants with the following steps:

  1. Map all the resources (databases, schemas, tables, and more) in Lake Formation that are tagged to their equivalent Amazon Redshift tables.
  2. Translate each policy in Lake Formation on a tag expression to a set of Amazon Redshift table grants and revokes.

The net result is that when there is a tag or policy change in Lake Formation, a corresponding set of grants or revokes are made to the equivalent Amazon Redshift tables to keep our entitlements in sync.

Map all tagged resources in Lake Formation to Amazon Redshift equivalents

The tag-based access control of Lake Formation allowed us to apply multiple tags on a single resource (database and table) in the AWS Glue Data Catalog. Visualized as a mapping, multiple tags on a single table are flattened into individual entitlements on the corresponding Amazon Redshift tables.

Mapping of tags in Lake Formation to Redshift tables

Translate tags to Amazon Redshift grants and revokes

To enable the migration of the tag-based policy enforced in Lake Formation, the permissions can be converted into simple grants and revokes that can be done on a per-group level.

There are two fundamental parts to a tag policy: the principal_id and the tag expression (for example, “Access Level” = “Public”). Assuming we have an Amazon Redshift database group for each principal_id, the resources that match the tag expression can be permissioned accordingly. We plan on migrating from database groups to database roles in a future implementation.

mapping of tags to Redshift user group
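The translation step can be sketched in a few lines of Python. The data structures here are hypothetical simplifications of the mapper service described below, but they show the core idea: match tables against a tag expression and emit per-group grants:

```python
def translate_policy(principal_id, tag_expression, tagged_tables):
    """Translate one tag policy into Redshift GRANT statements.

    tag_expression: e.g. {"Access Level": "Public"}
    tagged_tables: maps a schema.table name to its tag dict.
    """
    grants = []
    for table, tags in sorted(tagged_tables.items()):
        # A table matches when every key/value in the expression is present.
        if all(tags.get(k) == v for k, v in tag_expression.items()):
            grants.append(f'GRANT SELECT ON {table} TO GROUP "{principal_id}";')
    return grants

# Hypothetical example data, not Morningstar's real tags or tables:
stmts = translate_policy(
    "analysts",
    {"Access Level": "Public"},
    {
        "funds.ratings": {"Access Level": "Public"},
        "funds.holdings": {"Access Level": "Restricted"},
    },
)
# stmts contains a grant for funds.ratings only; funds.holdings does not match.
```

A real implementation would also diff against previously issued grants to emit REVOKE statements when tags or policies change, which is why the configuration schema described later tracks what has already been applied.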

The solution implementation

The implementation of this solution led us to develop two components:

  • The mapper service
  • The Amazon Redshift data configuration

The mapper service can be thought of as a translation service. As the name suggests, it has the core business logic to map the tag and policy information into resource-based grants and revokes in Amazon Redshift. It needs to mimic the behavior of Lake Formation when handling the tag policy translation.

To do this translation, the mapper needs to understand and store the metadata at two levels:

  • Understanding what resource in Amazon Redshift is to be tagged with what value
  • Tracking the grants and revokes already performed so they can be updated with changes in the policy

To do this, we created a config schema in our Amazon Redshift cluster, which currently stores all the configurations.

As part of our implementation, we store the mapped (translated) information in Amazon Redshift. This allows us to incrementally update table grants as Lake Formation tags or policies changed. The following diagram illustrates this schema.

schema of configuration stored in Redshift

Business impact and value

The solution we put together has created key business impacts and values out of the current implementation and allows us greater flexibility in the future.

It allows us to get the data to our users faster with the tag policies applied in Lake Formation and translated directly to permissions in Amazon Redshift with immediate effect. It also allows us to have consistency in permissions applied in both Lake Formation and Amazon Redshift, based on the effective permissions derived from tag policies. And all this happens via a single source that grants and revokes permissions across the board, instead of managing them separately.

If we translate this into the business impact and business value that we generate, the solution improves the time to market of our data, but at the same time provides consistent entitlements across the business-driven categories that we define as tags.

The solution also opens up solutions to add more impact as our product scales both horizontally and vertically. There are potential solutions we could implement in terms of automation, users self-servicing their permissions, auditing, dashboards, and more. As our business scales, we expect to take advantage of these capabilities.

Conclusion

In this post, we shared how Morningstar utilized tag-based access control in our data lake with Lake Formation and enabled similar controls in Amazon Redshift. We developed two components that handle mapping of the tag-based access controls to Amazon Redshift permissions. This solution has improved the time to market for our data and provides consistent entitlements across different business-driven categories.

If you have any questions or comments, please leave them in the comments section.


About the Authors

Ashish Prabhu is a Senior Manager of Software Engineering at Morningstar, Inc. He focuses on solutioning and delivering the different aspects of the data lake and data warehouse for Morningstar’s Enterprise Data and Platform Team. In his spare time, he enjoys playing basketball, painting, and spending time with his family.

Stephen Johnston is a Distinguished Software Architect at Morningstar, Inc. His focus is on data lake and data warehousing technologies for Morningstar’s Enterprise Data Platform team.

Colin Ingarfield is a Lead Software Engineer at Morningstar, Inc. Based in Austin, Colin focuses on access control and data entitlements on Morningstar’s growing Data Lake platform.

Don Drake is a Senior Analytics Specialist Solutions Architect at AWS. Based in Chicago, Don helps Financial Services customers migrate workloads to AWS.

Interact with Apache Iceberg tables using Amazon Athena and cross account fine-grained permissions using AWS Lake Formation

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/big-data/interact-with-apache-iceberg-tables-using-amazon-athena-and-cross-account-fine-grained-permissions-using-aws-lake-formation/

We recently announced support for AWS Lake Formation fine-grained access control policies in Amazon Athena queries for data stored in any supported file format using table formats such as Apache Iceberg, Apache Hudi and Apache Hive. AWS Lake Formation allows you to define and enforce database, table, and column-level access policies to query Iceberg tables stored in Amazon S3. Lake Formation provides an authorization and governance layer on data stored in Amazon S3. This capability requires that you upgrade to Athena engine version 3.

Large organizations often have lines of businesses (LoBs) that operate with autonomy in managing their business data. It makes sharing data across LoBs non-trivial. These organizations have adopted a federated model, with each LoB having the autonomy to make decisions on their data. They use the publisher/consumer model with a centralized governance layer that is used to enforce access controls. If you are interested in learning more about data mesh architecture, visit Design a data mesh architecture using AWS Lake Formation and AWS Glue. With Athena engine version 3, customers can use the same fine-grained controls for open data frameworks such as Apache Iceberg, Apache Hudi, and Apache Hive.

In this post, we deep dive into a use case where you have a producer/consumer model with data sharing enabled to give restricted access to an Apache Iceberg table that the consumer can query. We discuss row filtering to restrict certain rows, column filtering to restrict column-level access, schema evolution, and time travel.

Solution overview

To illustrate the functionality of fine-grained permissions for Apache Iceberg tables with Athena and Lake Formation, we set up the following components:

  • In the producer account:
    • An AWS Glue Data Catalog to register the schema of a table in Apache Iceberg format
    • Lake Formation to provide fine-grained access to the consumer account
    • Athena to verify data from the producer account
  • In the consumer account:
    • AWS Resource Access Manager (AWS RAM) to create a handshake between the producer Data Catalog and consumer
    • Lake Formation to provide fine-grained access to the consumer account
    • Athena to verify data from producer account

The following diagram illustrates the architecture.

Cross-account fine-grained permissions architecture

Prerequisites

Before you get started, make sure you have the following:

Data producer setup

In this section, we present the steps to set up the data producer.

Create an S3 bucket to store the table data

We create a new S3 bucket to save the data for the table:

  1. On the Amazon S3 console, create an S3 bucket with unique name (for this post, we use iceberg-athena-lakeformation-blog).
  2. Create the producer folder inside the bucket to use for the table.

Amazon S3 bucket and folder creation

Register the S3 path storing the table using Lake Formation

We register the S3 full path in Lake Formation:

  1. Navigate to the Lake Formation console.
  2. If you’re logging in for the first time, you’re prompted to create an admin user.
  3. In the navigation pane, under Register and ingest, choose Data lake locations.
  4. Choose Register location, and provide the S3 bucket path that you created earlier.
  5. Choose AWSServiceRoleForLakeFormationDataAccess for IAM role.

For additional information about roles, refer to Requirements for roles used to register locations.

If you enabled encryption of your S3 bucket, you have to provide permissions for Lake Formation to perform encryption and decryption operations. Refer to Registering an encrypted Amazon S3 location for guidance.

  1. Choose Register location.

Register Lake Formation location

Create an Iceberg table using Athena

Now let’s create the table using Athena backed by Apache Iceberg format:

  1. On the Athena console, choose Query editor in the navigation pane.
  2. If you’re using Athena for the first time, under Settings, choose Manage and enter the S3 bucket location that you created earlier (iceberg-athena-lakeformation-blog/producer).
  3. Choose Save.
  4. In the query editor, enter the following query (replace the location with the S3 bucket that you registered with Lake Formation). Note that we use the default database, but you can use any other database.
CREATE TABLE consumer_iceberg (
customerid bigint,
customername string,
email string,
city string,
country string,
territory string,
contactfirstname string,
contactlastname string)
LOCATION 's3://YOUR-BUCKET/producer/' -- *** Change bucket name to your bucket***
TBLPROPERTIES ('table_type'='ICEBERG')
  1. Choose Run.

Athena query editor to create Iceberg table

Share the table with the consumer account

To illustrate functionality, we implement the following scenarios:

  • Provide access to selected columns
  • Provide access to selected rows based on a filter

Complete the following steps:

  1. On the Lake Formation console, in the navigation pane under Data catalog, choose Data filters.
  2. Choose Create new filter.
  3. For Data filter name, enter blog_data_filter.
  4. For Target database, enter lf-demo-db.
  5. For Target table, enter consumer_iceberg.
  6. For Column-level access, select Include columns.
  7. Choose the columns to share with the consumer: country, address, contactfirstname, city, customerid, and customername.
  8. For Row filter expression, enter the filter country='France'.
  9. Choose Create filter.

create data filter
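The same data filter can be created with the Lake Formation CreateDataCellsFilter API. The sketch below reuses the names from the steps above; the catalog (account) ID is a placeholder, and the column list mirrors the console choices:

```python
PRODUCER_ACCOUNT = "444455556666"  # placeholder producer account ID

data_filter = {
    "TableData": {
        "TableCatalogId": PRODUCER_ACCOUNT,
        "DatabaseName": "lf-demo-db",
        "TableName": "consumer_iceberg",
        "Name": "blog_data_filter",
        # Only rows for France are visible to the grantee.
        "RowFilter": {"FilterExpression": "country='France'"},
        # Columns shared with the consumer, as chosen in the console steps above.
        "ColumnNames": ["country", "contactfirstname", "city",
                        "customerid", "customername"],
    }
}

def create_filter(params):
    # boto3 is imported lazily so the sketch can be read without AWS credentials.
    import boto3
    boto3.client("lakeformation").create_data_cells_filter(**params)

# Requires AWS credentials and Lake Formation admin rights:
# create_filter(data_filter)
```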

Now let’s grant access to the consumer account on the consumer_iceberg table.

  1. In the navigation pane, choose Tables.
  2. Select the consumer_iceberg table, and choose Grant on the Actions menu.
    Grant access to consumer account on consumer_iceberg table
  3. Select External accounts.
  4. Enter the external account ID.
    Grant data permissions
  5. Select Named data catalog resources.
  6. Choose your database and table.
  7. For Data filters, choose the data filter you created.
    Add data filter
  8. For Data filter permissions and Grantable permissions, select Select.
  9. Choose Grant.

Permissions for creating grant

Data consumer setup

To set up the data consumer, we accept the resource share and create a table using AWS RAM and Lake Formation. Complete the following steps:

  1. Log in to the consumer account and navigate to the AWS RAM console.
  2. Under Shared with me in the navigation pane, choose Resource shares.
  3. Choose your resource share.
    Resource share in consumer account
  4. Choose Accept resource share.
  5. Note the name of the resource share to use in the next steps.
    Accept resource share
  6. Navigate to the Lake Formation console.
  7. If you’re logging in for the first time, you’re prompted to create an admin user.
  8. Choose Databases in the navigation pane, then choose your database.
  9. On the Actions menu, choose Create resource link.
    Create a resource link
  10. For Resource link name, enter the name of your resource link (for example, consumer_iceberg).
  11. Choose your database and shared table.
  12. Choose Create.
    Create table with resource link
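A resource link is a Data Catalog entry that points to the shared table. The console steps above can be approximated with the AWS Glue `CreateTable` API and its `TargetTable` field; the producer account ID here is a placeholder:

```python
# Sketch of creating a resource link in the consumer account that points to
# the table shared by the producer. The CatalogId is a placeholder.
resource_link = {
    "DatabaseName": "lf-demo-db",  # local database in the consumer account
    "TableInput": {
        "Name": "consumer_iceberg",  # resource link name chosen above
        "TargetTable": {
            "CatalogId": "111122223333",   # producer (sharing) account
            "DatabaseName": "lf-demo-db",
            "Name": "consumer_iceberg",    # the shared table
        },
    },
}

# import boto3
# boto3.client("glue").create_table(**resource_link)
```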

Validate the solution

Now we can run different operations on the tables to validate the fine-grained access controls.

Insert operation

Let’s insert data into the consumer_iceberg table in the producer account, and validate the data filtering works as expected in the consumer account.

  1. Log in to the producer account.
  2. On the Athena console, choose Query editor in the navigation pane.
  3. Use the following SQL statements to insert data into the Iceberg table. In the query editor, run one statement at a time by highlighting it and choosing Run (or Run again):
INSERT INTO consumer_iceberg VALUES (1, 'Land of Toys Inc.', '[email protected]',
'NYC','USA', 'NA', 'James', 'xxxx 118th NE');

INSERT INTO consumer_iceberg VALUES (2, 'Reims Collectables', '[email protected]',
'Reims','France', 'EMEA', 'Josephine', 'Darakjy');

INSERT INTO consumer_iceberg VALUES (3, 'Lyon Souveniers', '[email protected]',
'Paris', 'France', 'EMEA','Art', 'Venere');

Insert data into consumer_iceberg table in the producer account

  4. Use the following SQL to read and select data in the Iceberg table:
SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Run select query to validate rows were inserted

  5. Log in to the consumer account.
  6. In the Athena query editor, run the following SELECT query on the shared table:
SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Run same query in consumer account

Based on the filters, the consumer has visibility to a subset of columns, and rows where the country is France.
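To see why the consumer gets exactly this subset, here is a small, self-contained Python simulation of the data filter semantics (row filter country='France' plus column-level projection) applied to the three rows inserted earlier. This is illustrative only, not how Lake Formation is implemented; the "phone" field stands in for the columns excluded by the filter:

```python
# Sample rows matching the three INSERT statements above.
rows = [
    {"customerid": 1, "customername": "Land of Toys Inc.", "phone": "xxxx",
     "city": "NYC", "country": "USA", "contactfirstname": "James",
     "address": "xxxx 118th NE"},
    {"customerid": 2, "customername": "Reims Collectables", "phone": "xxxx",
     "city": "Reims", "country": "France", "contactfirstname": "Josephine",
     "address": "Darakjy"},
    {"customerid": 3, "customername": "Lyon Souveniers", "phone": "xxxx",
     "city": "Paris", "country": "France", "contactfirstname": "Art",
     "address": "Venere"},
]

ALLOWED_COLUMNS = {"country", "address", "contactfirstname",
                   "city", "customerid", "customername"}

def consumer_view(rows):
    # Row filter: country='France'; column filter: project allowed columns.
    return [{k: v for k, v in r.items() if k in ALLOWED_COLUMNS}
            for r in rows if r["country"] == "France"]

visible = consumer_view(rows)
print([r["customerid"] for r in visible])  # [2, 3]
```

Only the two France rows survive, and the excluded column is absent from every returned row, matching what the consumer sees in Athena.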

Update/Delete operations

Now let’s update one of the rows and delete one from the dataset shared with the consumer.

  1. Log in to the producer account.
  2. Update the row with city='Reims' to city='Paris', then delete the row with customerid=3. Run the following statements one at a time:
    UPDATE consumer_iceberg SET city='Paris' WHERE city='Reims';

    Run update query in producer account

DELETE FROM consumer_iceberg WHERE customerid =3;

Run delete query in producer account

  3. Verify the updated and deleted dataset:
SELECT * FROM consumer_iceberg;

Verify update and delete reflected in producer account

  4. Log in to the consumer account.
  5. In the Athena query editor, run the following SELECT query on the shared table:
SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Verify update and delete in consumer account

We can observe that only one row is available and the city is updated to Paris.

Schema evolution: Add a new column

Let’s update one of the rows and delete one from the dataset shared with the consumer.

  1. Log in to the producer account.
  2. Add a new column called geo_loc to the Iceberg table, insert a row, and query the table. In the query editor, run one statement at a time by highlighting it and choosing Run (or Run again):
ALTER TABLE consumer_iceberg ADD COLUMNS (geo_loc string);

INSERT INTO consumer_iceberg VALUES (5, 'Test_user', '[email protected]',
'Reims','France', 'EMEA', 'Test_user', 'Test_user', 'test_geo');

SELECT * FROM consumer_iceberg;

Add a new column in producer account

To provide visibility to the newly added geo_loc column, we need to update the Lake Formation data filter.

  1. On the Lake Formation console, choose Data filters in the navigation pane.
  2. Select your data filter and choose Edit.
    Update data filter
  3. Under Column-level access, add the new column (geo_loc).
  4. Choose Save.
    Add new column to data filter
  5. Log in to the consumer account.
  6. In the Athena query editor, run the following SELECT query on the shared table:
SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Validate new column appears in consumer account

The new column geo_loc is now visible, along with the additional row.

Schema evolution: Delete column

Let’s update one of the rows and delete one from the dataset shared with the consumer.

  1. Log in to the producer account.
  2. Alter the table to drop the address column from the Iceberg table, then query it. In the query editor, run one statement at a time by highlighting it and choosing Run (or Run again):
ALTER TABLE consumer_iceberg DROP COLUMN address;

SELECT * FROM consumer_iceberg;

Delete a column in producer account

We can observe that the column address is not present in the table.

  1. Log in to the consumer account.
  2. In the Athena query editor, run the following SELECT query on the shared table:
SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Validate column deletion in consumer account

The column address is not present in the table.

Time travel

We have now changed the Iceberg table multiple times. The Iceberg table keeps track of the snapshots. Complete the following steps to explore the time travel functionality:

  1. Log in to the producer account.
  2. Query the system table:
SELECT * FROM "lf-demo-db"."consumer_iceberg$snapshots" limit 10;

We can observe that we have generated multiple snapshots.

  3. Note down one of the committed_at values to use in the next steps (for this example, 2023-01-29 21:35:02.176 UTC).
    Time travel query in consumer account
  4. Use time travel to query the table as of that snapshot:
SELECT * FROM consumer_iceberg FOR TIMESTAMP
AS OF TIMESTAMP '2023-01-29 21:35:02.176 UTC';

Find table snapshot using time travel
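Conceptually, a FOR TIMESTAMP AS OF query reads the most recent snapshot committed at or before the requested timestamp. The following is a simplified, self-contained illustration of that resolution rule; the snapshot names and timestamps are made up, and this is not Iceberg's actual implementation:

```python
from datetime import datetime, timezone

# Hypothetical snapshot history: (committed_at, snapshot label).
snapshots = [
    (datetime(2023, 1, 29, 21, 30, tzinfo=timezone.utc), "snap-insert"),
    (datetime(2023, 1, 29, 21, 35, tzinfo=timezone.utc), "snap-update"),
    (datetime(2023, 1, 29, 21, 40, tzinfo=timezone.utc), "snap-delete"),
]

def snapshot_as_of(ts):
    # Pick the latest snapshot committed at or before the requested time.
    eligible = [s for s in snapshots if s[0] <= ts]
    return max(eligible)[1] if eligible else None

# A query "AS OF 21:36" reads the state after the update but before the delete.
print(snapshot_as_of(datetime(2023, 1, 29, 21, 36, tzinfo=timezone.utc)))
```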

Clean up

Complete the following steps to avoid incurring future charges:

  1. On the Amazon S3 console, delete the table storage bucket (for this post, iceberg-athena-lakeformation-blog).
  2. In the producer account on the Athena console, run the following commands to delete the tables you created:
DROP TABLE "lf-demo-db"."consumer_iceberg";
DROP DATABASE lf-demo-db;
  3. In the producer account on the Lake Formation console, revoke permissions to the consumer account.
    Clean up - Revoke permissions to consumer account
  4. Delete the S3 bucket used for the Athena query result location from the consumer account.

Conclusion

With support for cross-account, fine-grained access control policies on formats such as Iceberg, you have the flexibility to work with any format supported by Athena. The ability to perform CRUD operations against the data in your S3 data lake, combined with Lake Formation fine-grained access controls for all tables and formats supported by Athena, provides opportunities to innovate and simplify your data strategy. We'd love to hear your feedback!


About the authors

Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.

Jack Ye is a software engineer of the Athena Data Lake and Storage team at AWS. He is an Apache Iceberg Committer and PMC member.

Chris Olson is a Software Development Engineer at AWS.

Xiaoxuan Li is a Software Development Engineer at AWS.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

AWS Week in Review – March 20, 2023

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-march-20-2023/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week starts, and Spring is almost here! If you’re curious about AWS news from the previous seven days, I got you covered.

Last Week’s Launches
Here are the launches that got my attention last week:

Amazon S3 – Last week there was AWS Pi Day 2023 celebrating 17 years of innovation since Amazon S3 was introduced on March 14, 2006. For the occasion, the team released many new capabilities:

Amazon Linux 2023 – Our new Linux-based operating system is now generally available. Sébastien’s post is full of tips and info.

Application Auto Scaling – Now can use arithmetic operations and mathematical functions to customize the metrics used with Target Tracking policies. You can use it to scale based on your own application-specific metrics. Read how it works with Amazon ECS services.

AWS Data Exchange for Amazon S3 is now generally available – You can now share and find data files directly from S3 buckets, without the need to create or manage copies of the data.

Amazon Neptune – Now offers a graph summary API to help understand important metadata about property graphs (PG) and resource description framework (RDF) graphs. Neptune added support for Slow Query Logs to help identify queries that need performance tuning.

Amazon OpenSearch Service – The team introduced security analytics that provides new threat monitoring, detection, and alerting features. The service now supports OpenSearch version 2.5 that adds several new features such as support for Point in Time Search and improvements to observability and geospatial functionality.

AWS Lake Formation and Apache Hive on Amazon EMR – Introduced fine-grained access controls that allow data administrators to define and enforce fine-grained table and column level security for customers accessing data via Apache Hive running on Amazon EMR.

Amazon EC2 M1 Mac Instances – You can now update guest environments to a specific or the latest macOS version without having to tear down and recreate the existing macOS environments.

AWS Chatbot – Now Integrates With Microsoft Teams to simplify the way you troubleshoot and operate your AWS resources.

Amazon GuardDuty RDS Protection for Amazon Aurora – Now generally available to help profile and monitor access activity to Aurora databases in your AWS account without impacting database performance.

AWS Database Migration Service – Now supports validation to ensure that data is migrated accurately to S3 and can now generate an AWS Glue Data Catalog when migrating to S3.

AWS Backup – You can now back up and restore virtual machines running on VMware vSphere 8 and with multiple vNICs.

Amazon Kendra – There are new connectors to index documents and search for information across these new content sources: Confluence Server, Confluence Cloud, Microsoft SharePoint OnPrem, and Microsoft SharePoint Cloud. This post shows how to use the Amazon Kendra connector for Microsoft Teams.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more blog posts you might have missed:

Women founders Q&A – We’re talking to six women founders and leaders about how they’re making impacts in their communities, industries, and beyond.

What you missed at the 2023 IMAGINE: Nonprofit conference – Where hundreds of nonprofit leaders, technologists, and innovators gathered to learn and share how AWS can drive a positive impact for people and the planet.

Monitoring load balancers using Amazon CloudWatch anomaly detection alarms – The metrics emitted by load balancers provide crucial and unique insight into service health, service performance, and end-to-end network performance.

Extend geospatial queries in Amazon Athena with user-defined functions (UDFs) and AWS Lambda – Using a solution based on Uber’s Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally-sized hexagons.

How cities can use transport data to reduce pollution and increase safety – A guest post by Rikesh Shah, outgoing head of open innovation at Transport for London.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
Here are some opportunities to meet:

AWS Public Sector Day 2023 (March 21, London, UK) – An event dedicated to helping public sector organizations use technology to achieve more with less through the current challenging conditions.

Women in Tech at Skills Center Arlington (March 23, VA, USA) – Let’s celebrate the history and legacy of women in tech.

The AWS Summits season is warming up! You can sign up here to know when registration opens in your area.

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

Patterns for enterprise data sharing at scale

Post Syndicated from Venkata Sistla original https://aws.amazon.com/blogs/big-data/patterns-for-enterprise-data-sharing-at-scale/

Data sharing is becoming an important element of an enterprise data strategy. AWS services like AWS Data Exchange provide an avenue for companies to share or monetize their value-added data with other companies. Some organizations would like to have a data sharing platform where they can establish a collaborative and strategic approach to exchange data with a restricted group of companies in a closed, secure, and exclusive environment. For example, financial services companies and their auditors, or manufacturing companies and their supply chain partners. This fosters development of new products and services and helps improve their operational efficiency.

Data sharing is a team effort. In addition to establishing the right infrastructure, successful data sharing requires organizations to ensure that business owners sponsor data sharing initiatives and that the shared data is of high quality. Data platform owners and security teams should encourage proper data use and fix any privacy and confidentiality issues.

This blog discusses various data sharing options and common architecture patterns that organizations can adopt to set up their data sharing infrastructure based on AWS service availability and data compliance.

Data sharing options and data classification types

Organizations operate across a spectrum of security compliance constraints. For some organizations, it’s possible to use AWS services like AWS Data Exchange. However, organizations working in heavily regulated industries like federal agencies or financial services might be limited to the allow listed AWS service options. For example, if an organization is required to operate in a FedRAMP Moderate or FedRAMP High environment, their options to share data may be limited by the AWS services that are available and have been allow listed. Service availability is based on platform certification by AWS, and allow listing is based on the organizations defining their security compliance architecture and guidelines.

The kind of data that the organization wants to share with its partners may also have an impact on the method used for data sharing. Complying with data classification rules may further limit the data sharing options they can choose.

The following are some general data classification types:

  • Public data – Information that is often freely available for people to read, research, review, and store. It typically has the lowest level of data classification and security.
  • Private data – Information you might want to keep private like email inboxes, cell phone content, employee identification numbers, or employee addresses. If private data were shared, destroyed, or altered, it might pose a slight risk to an individual or the organization.
  • Confidential or restricted data – A limited group of individuals or parties can access sensitive information often requiring special clearance or special authorization. Confidential or restricted data access might involve aspects of identity and authorization management. Examples of confidential data include Social Security numbers and vehicle identification numbers.

The following is a sample decision tree that you can refer to when choosing your data sharing option based on service availability, classification type, and data format (structured or unstructured). Other factors like usability, multi-partner accessibility, data size, consumption patterns like bulk load/API access, and more may also affect the choice of data sharing pattern.

decisiontree

In the following sections, we discuss each pattern in more detail.

Pattern 1: Using AWS Data Exchange

AWS Data Exchange makes exchanging data easier, helping organizations lower costs, become more agile, and innovate faster. Organizations can choose to share data privately using AWS Data Exchange with their external partners. AWS Data Exchange offers perimeter controls that are applied at identity and resource levels. These controls decide which external identities have access to specific data resources. AWS Data Exchange provides multiple patterns for external parties to access data.

The following diagram illustrates an example architecture.

pattern1

With AWS Data Exchange, once the dataset to share (or sell) is configured, AWS Data Exchange automatically manages entitlements (and billing) between the producer and the consumer. The producer doesn’t have to manage policies, set up new access points, or create new Amazon Redshift data shares for each consumer, and access is automatically revoked if the subscription ends. This can significantly reduce the operational overhead in sharing data.

Pattern 2: Using AWS Lake Formation for centralized access management

You can use this pattern in cases where both the producer and consumer are on the AWS platform with an AWS account that is enabled to use AWS Lake Formation. This pattern provides a no-code approach to data sharing. The following diagram illustrates an example architecture.

pattern2

In this pattern, the central governance account has Lake Formation configured for managing access across the producer’s org accounts. Resource links from the production account Amazon Simple Storage Service (Amazon S3) bucket are created in Lake Formation. The producer grants Lake Formation permissions on an AWS Glue Data Catalog resource to an external account, or directly to an AWS Identity and Access Management (IAM) principal in another account. Lake Formation uses AWS Resource Access Manager (AWS RAM) to share the resource. If the grantee account is in the same organization as the grantor account, the shared resource is available immediately to the grantee. If the grantee account is not in the same organization, AWS RAM sends an invitation to the grantee account to accept or reject the resource grant. To make the shared resource available, the consumer administrator in the grantee account must use the AWS RAM console or AWS Command Line Interface (AWS CLI) to accept the invitation.
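The acceptance step described above can be scripted with the AWS RAM API. The following is a hedged sketch: a small helper filters pending invitations (using sample data with placeholder ARNs), and the commented-out calls show how the grantee admin would accept them:

```python
def pending_invitations(invitations):
    # Only PENDING invitations can be accepted; others are already
    # accepted, rejected, or expired.
    return [i for i in invitations if i["status"] == "PENDING"]

# Sample response shape from RAM's GetResourceShareInvitations; ARNs are
# placeholders, not real values.
sample = [
    {"resourceShareInvitationArn": "arn:aws:ram:...:invitation/1",
     "status": "PENDING"},
    {"resourceShareInvitationArn": "arn:aws:ram:...:invitation/2",
     "status": "ACCEPTED"},
]

# import boto3
# ram = boto3.client("ram")
# for inv in pending_invitations(
#         ram.get_resource_share_invitations()["resourceShareInvitations"]):
#     ram.accept_resource_share_invitation(
#         resourceShareInvitationArn=inv["resourceShareInvitationArn"])

print(len(pending_invitations(sample)))  # 1
```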

Authorized principals can share resources explicitly with an IAM principal in an external account. This feature is useful when the producer wants to have control over who in the external account can access the resources. The permissions the IAM principal receives are a union of direct grants and the account-level grants that are cascaded down to the principals. The data lake administrator of the recipient account can view the direct cross-account grants, but can’t revoke permissions.

Pattern 3: Using AWS Lake Formation from the producer external sharing account

The producer may have stringent security requirements where no external consumer should access their production account or their centralized governance account. They may also not have Lake Formation enabled on their production platform. In such cases, as shown in the following diagram, the producer production account (Account A) is dedicated to its internal organization users. The producer creates another account, the producer external sharing account (Account B), which is dedicated for external sharing. This gives the producer more latitude to create specific policies for specific organizations.

The following architecture diagram shows an overview of the pattern.

pattern3

The producer implements a process to create an asynchronous copy of data in Account B. The bucket can be configured for Same Region Replication (SRR) or Cross Region Replication (CRR) for objects that need to be shared. This facilitates automated refresh of data to the external account to the “External Published Datasets” S3 bucket without having to write any code.

Creating a copy of the data allows the producer to add another degree of separation between the external consumer and its production data. It also helps meet any compliance or data sovereignty requirements.
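The asynchronous copy above relies on standard S3 replication. The following is a hypothetical replication configuration for the producer's bucket; bucket names, the role ARN, and the prefix are placeholders, and both buckets must have versioning enabled:

```python
# Replication rule sketch for copying only objects meant for sharing into
# the "External Published Datasets" bucket in Account B. All ARNs and names
# are placeholders.
replication_config = {
    "Role": "arn:aws:iam::111122223333:role/replication-role",  # placeholder
    "Rules": [{
        "ID": "external-share",
        "Priority": 1,
        "Status": "Enabled",
        "Filter": {"Prefix": "published/"},  # replicate shared objects only
        "Destination": {"Bucket": "arn:aws:s3:::external-published-datasets"},
        "DeleteMarkerReplication": {"Status": "Disabled"},
    }],
}

# import boto3
# boto3.client("s3").put_bucket_replication(
#     Bucket="producer-production-bucket",  # placeholder
#     ReplicationConfiguration=replication_config)
```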

Lake Formation is set up on Account B, and the administrator creates resources links for the “External Published Datasets” S3 bucket in its account to grant access. The administrator follows the same process to grant access as described earlier.

Pattern 4: Using Amazon Redshift data sharing

This pattern is ideally suited for a producer who has most of their published data products on Amazon Redshift. This pattern also requires the producer’s external sharing account (Account B) and the consumer account (Account C) to have an encrypted Amazon Redshift cluster or Amazon Redshift Serverless endpoint that meets the prerequisites for Amazon Redshift data sharing.

The following architecture diagram shows an overview of the pattern.

pattern4

Two options are possible depending on the producer’s compliance constraints:

  • Option A – The producer enables data sharing directly on the production Amazon Redshift cluster.
  • Option B – The producer may have constraints with respect to sharing the production cluster. The producer creates a simple AWS Glue job that copies data from the Amazon Redshift cluster in the production Account A to the Amazon Redshift cluster in the external Account B. This AWS Glue job can be scheduled to refresh data as needed by the consumer. When the data is available in Account B, the producer can create multiple views and multiple data shares as needed.

In both options, the producer maintains complete control over what data is being shared, and the consumer admin maintains full control over who can access the data within their organization.

After both the producer and consumer admins approve the data sharing request, the consumer user can access this data as if it were part of their own account without having to write any additional code.
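On the producer side, the data share itself is defined with Redshift SQL, which could be issued through the Redshift Data API. This is a hedged sketch: the data share name, cluster identifier, database, user, and consumer account ID are all placeholders:

```python
# Example Redshift data sharing DDL for the pattern above. The consumer
# account ID and all identifiers are placeholders.
statements = [
    "CREATE DATASHARE sales_share;",
    "ALTER DATASHARE sales_share ADD SCHEMA public;",
    "ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public;",
    "GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '444455556666';",
]

# import boto3
# client = boto3.client("redshift-data")
# for sql in statements:
#     client.execute_statement(
#         ClusterIdentifier="external-sharing-cluster",  # placeholder
#         Database="dev", DbUser="awsuser", Sql=sql)

print(len(statements))  # 4
```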

Pattern 5: Sharing data securely and privately using APIs

You can adopt this pattern when the external partner doesn’t have a presence on AWS. You can also use this pattern when published data products are spread across various services like Amazon S3, Amazon Redshift, Amazon DynamoDB, and Amazon OpenSearch Service but the producer would like to maintain a single data sharing interface.

Here’s an example use case: Company A would like to share some of its log data in near-real time with its partner Company B, who uses this data to generate predictive insights for Company A. Company A stores this data in Amazon Redshift. The company wants to share this transactional information with its partner after masking the personally identifiable information (PII) in a cost-effective and secure way to generate insights. Company B doesn’t use the AWS platform.

Company A establishes a microbatch process using an AWS Lambda function or AWS Glue that queries Amazon Redshift to get incremental log data, applies the rules to redact the PII, and loads this data to the “Published Datasets” S3 bucket. This instantiates an SRR/CRR process that refreshes this data in the “External Sharing” S3 bucket.

The following diagram shows how the consumer can then use an API-based approach to access this data.

pattern5

The workflow contains the following steps:

  1. An HTTPS API request is sent from the API consumer to the API proxy layer.
  2. The HTTPS API request is forwarded from the API proxy to Amazon API Gateway in the external sharing AWS account.
  3. Amazon API Gateway calls the request receiver Lambda function.
  4. The request receiver function writes the status to a DynamoDB control table.
  5. A second Lambda function, the poller, checks the status of the results in the DynamoDB table.
  6. The poller function fetches results from Amazon S3.
  7. The poller function sends a presigned URL to download the file from the S3 bucket to the requestor via Amazon Simple Email Service (Amazon SES).
  8. The requestor downloads the file using the URL.
  9. The network perimeter AWS account only allows egress internet connection.
  10. The API proxy layer enforces both the egress security controls and perimeter firewall before the traffic leaves the producer’s network perimeter.
  11. The AWS Transit Gateway security egress VPC routing table only allows connectivity from the required producer’s subnet, while preventing internet access.
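The control-table handoff in steps 4 through 7 can be illustrated with a toy, in-memory sketch: a plain dict stands in for the DynamoDB control table, and the function and key names, like the presigned URL, are made up for illustration:

```python
control_table = {}  # stands in for the DynamoDB control table

def receive_request(request_id):
    # Step 4: the request receiver writes the initial status.
    control_table[request_id] = {"status": "PENDING", "result_key": None}

def complete_request(request_id, s3_key):
    # The backend finishes extracting data and records the S3 result key.
    control_table[request_id] = {"status": "READY", "result_key": s3_key}

def poll(request_id):
    # Steps 5-7: the poller checks status; when ready it would fetch the
    # object and email a presigned URL via Amazon SES. The URL here is a
    # placeholder.
    entry = control_table.get(request_id)
    if entry and entry["status"] == "READY":
        return f"https://example-presigned-url/{entry['result_key']}"
    return None

receive_request("req-1")
assert poll("req-1") is None  # still pending
complete_request("req-1", "logs/2023/03/part-0.parquet")
print(poll("req-1"))
```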

Pattern 6: Using Amazon S3 access points

Data scientists may need to work collaboratively on images, videos, and text documents. Legal and audit groups may want to share reports and statements with auditing agencies. This pattern discusses an approach to sharing such documents. The pattern assumes that the external partners are also on AWS. Amazon S3 access points allow the producer to share access with their consumer by setting up cross-account access without having to edit bucket policies.

Access points are named network endpoints that are attached to buckets that you can use to perform S3 object operations, such as GetObject and PutObject. Each access point has distinct permissions and network controls that Amazon S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy attached to the underlying bucket.

The following architecture diagram shows an overview of the pattern.

pattern6

The producer creates an S3 bucket and enables the use of access points. As part of the configuration, the producer specifies the consumer account, IAM role, and privileges for the consumer IAM role.

The consumer users with the IAM role in the consumer account can access the S3 bucket via the internet or restricted to an Amazon VPC via VPC endpoints and AWS PrivateLink.
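An access point policy for this setup might look like the following sketch. All account IDs, the role name, the Region, and the access point name are placeholders, not values from this post:

```python
# Hypothetical access point policy granting a consumer-account IAM role read
# access through the producer's access point. The GetObject resource uses the
# access point's /object/* ARN form; the ListBucket resource is the access
# point ARN itself.
access_point_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::444455556666:role/ConsumerRole"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:us-east-1:111122223333:accesspoint/shared-docs",
            "arn:aws:s3:us-east-1:111122223333:accesspoint/shared-docs/object/*",
        ],
    }],
}

# import boto3, json
# boto3.client("s3control").put_access_point_policy(
#     AccountId="111122223333", Name="shared-docs",
#     Policy=json.dumps(access_point_policy))
```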

Conclusion

Each organization has its unique set of constraints and requirements that it needs to fulfill to set up an efficient data sharing solution. In this post, we demonstrated various options and best practices available to organizations. The data platform owner and security team should work together to assess what works best for your specific situation. Your AWS account team is also available to help.

Related resources

For more information on related topics, refer to the following:


About the Authors


Venkata Sistla is a Cloud Architect – Data & Analytics at AWS. He specializes in building data processing capabilities and helping customers remove constraints that prevent them from leveraging their data to develop business insights.

Santosh Chiplunkar is a Principal Resident Architect at AWS. He has over 20 years of experience helping customers solve their data challenges. He helps customers develop their data and analytics strategy and provides them with guidance on how to make it a reality.

Introducing AWS Glue crawlers using AWS Lake Formation permission management

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-crawlers-using-aws-lake-formation-permission-management/

Data lakes provide a centralized repository that consolidates your data at scale and makes it available for different kinds of analytics. AWS Glue crawlers are a popular way to scan data in a data lake, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog. AWS Lake Formation enables you to centrally govern, secure, and share your data, and lets you scale permissions easily.

We are pleased to announce the integration of AWS Glue crawlers with Lake Formation. You can now use Lake Formation permissions for the crawler’s access to your Lake Formation managed data lakes, whether those are in your account or in other accounts. Before this release, you had to set up the AWS Glue crawler IAM role with Amazon Simple Storage Service (Amazon S3) permissions to crawl a data source on Amazon S3, and also establish Amazon S3 bucket policies on the source bucket for the crawler role to access the S3 data source. Now you can use the AWS Lake Formation permissions defined on the data lake for crawling the data, and you no longer need to configure dedicated Amazon S3 permissions for crawlers. AWS Lake Formation manages the crawler IAM role’s access to the various Amazon S3 buckets and prefixes through data location permissions, which simplifies security management. Furthermore, you can apply the same security model to crawlers, AWS Glue jobs, and Amazon Athena for centralized governance.

When you configure an AWS Glue crawler to use Lake Formation, by default, the crawler uses Lake Formation in the same account to obtain data access credentials. However, you can also configure the crawler to use Lake Formation of a different account by providing an account ID during creation. The cross-account capability allows you to perform permissions management from a central governance account. Customers prefer the central governance experience over writing bucket policies separately in each bucket-owning account. To build a data mesh architecture, you can author permissions in a single Lake Formation governance account to manage access to data locations and crawlers spanning multiple accounts in your data lake. You can refer to How to configure a crawler to use Lake Formation credentials for more information.
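Programmatically, this corresponds to the `LakeFormationConfiguration` field of the Glue `CreateCrawler` API. The following is a hedged sketch; the crawler name, role, database, S3 path, and account IDs are placeholders:

```python
# Sketch of creating a crawler that obtains credentials from Lake Formation
# instead of relying on S3 IAM policies. All names and IDs are placeholders.
crawler_request = {
    "Name": "lfcrawler-111122223333",
    "Role": "lfcrawler-role",  # role uses Lake Formation access, no S3 policies
    "DatabaseName": "lfcrawlerdb111122223333",
    "Targets": {"S3Targets": [{"Path": "s3://my-datalake-bucket/data/"}]},
    "LakeFormationConfiguration": {
        "UseLakeFormationCredentials": True,
        # For cross-account crawling, also set the governance account ID:
        # "AccountId": "999988887777",
    },
}

# import boto3
# boto3.client("glue").create_crawler(**crawler_request)
```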

In this post, we walk through a single in-account architecture that shows how to enable Lake Formation permissions on the data lake, configure an AWS Glue crawler with Lake Formation permission to scan and populate schema from an S3 data lake into the AWS Glue Data Catalog, and then use an analytical engine like Amazon Athena to query the data.

Solution overview

The AWS Glue crawler and Lake Formation integration supports in-account crawling as well as cross-account crawling. You can configure a crawler to use Lake Formation permissions to access an S3 data store or a Data Catalog table with an underlying S3 location within the same AWS account or another AWS account. You can configure an existing Data Catalog table as a crawler’s target if the crawler and the Data Catalog table reside in the same account. The following figure shows the in-account crawling architecture.

Prerequisites

Complete the following prerequisite steps:

  1. Sign in to the Lake Formation console as admin.
  2. If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
  3. In the navigation pane, under Data catalog, choose Settings.
  4. Deselect Use only IAM access control for new databases.
  5. Deselect Use only IAM access control for new tables in new databases.
  6. Keep Version 3 as the current cross-account version.
  7. Choose Save.

Set up your solution resources

We set up the solution resources using AWS CloudFormation. Complete the following steps:

  1. Log in to the AWS Management Console as IAM administrator.
  2. Choose Launch Stack to deploy a CloudFormation template:
  3. For LFBusinessAnalystUserName, keep as the default LFBusinessAnalyst.
  4. Create your stack.
  5. When the stack is complete, on the AWS CloudFormation console, navigate to the Resources tab of the stack.
  6. Note down the values of Databasename, DataLakeBucket, and GlueCrawlerName.
  7. Choose the LFBusinessAnalystUserCredentials value to navigate to the AWS Secrets Manager console.
  8. In the Secret value section, choose Retrieve secret value.
  9. Note down the secret value for the password for IAM user LFBusinessAnalyst.

Validate resources

In your account, validate the following resources created by template:

  • AWS Glue database – The Databasename value noted from the CloudFormation template.
  • S3 bucket for the data lake with sample data – The DataLakeBucket value noted from the CloudFormation template.
  • AWS Glue crawler and IAM role with required permission – The GlueCrawlerName value noted from the CloudFormation template.

The template registers the S3 bucket with Lake Formation as a data lake location. To verify this, on the Lake Formation console, choose Data lake locations under Register and ingest in the navigation pane.

The template also grants data location permission on the S3 bucket to the crawler role. To verify, choose Data locations under Permissions in the navigation pane.

Lastly, the template grants database permission to the crawler role. To verify, choose Data lake permissions under Permissions in the navigation pane.
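The registration and data location grant performed by the template correspond to the Lake Formation RegisterResource and GrantPermissions APIs. A minimal boto3 sketch follows; the bucket name and crawler role ARN are hypothetical placeholders, and the calls are commented out:

```python
# Register the data lake bucket with Lake Formation (placeholder ARN).
register_req = {
    "ResourceArn": "arn:aws:s3:::lf-datalake-111122223333-us-east-1",
    "UseServiceLinkedRole": True,
}
# Grant the crawler role data location access on that bucket.
grant_req = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/lf-crawler-role"
    },
    "Resource": {"DataLocation": {"ResourceArn": register_req["ResourceArn"]}},
    "Permissions": ["DATA_LOCATION_ACCESS"],
}
# lf = boto3.client("lakeformation")
# lf.register_resource(**register_req)
# lf.grant_permissions(**grant_req)
```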

Edit and run the AWS Glue crawler

To configure and run the AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Locate the crawler lfcrawler-<your-account-id> and edit it.
  3. Under Lake Formation configuration, select Use Lake Formation credentials for crawling S3 data source.
  4. Choose Next.
  5. Review and update the crawler settings.

Note that the crawler IAM role uses Lake Formation permission to access the data and doesn’t have any S3 policies.

  1. Run the crawler and verify that the crawler run is complete.
  2. In the AWS Glue database lfcrawlerdb<your-account-id>, verify that the table is created and the schema matches with what you have in the S3 bucket.

The crawler was able to crawl the S3 data source and successfully populate the schema using Lake Formation permissions.
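If you prefer to script the crawler change instead of using the console, the Lake Formation setting maps to the crawler's LakeFormationConfiguration. A hedged boto3 sketch with a placeholder crawler name (calls commented out):

```python
# Enable Lake Formation credentials for an existing crawler (placeholder name).
update_req = {
    "Name": "lfcrawler-111122223333",
    "LakeFormationConfiguration": {
        "UseLakeFormationCredentials": True,
        # "AccountId" is only required for cross-account crawling.
    },
}
# glue = boto3.client("glue")
# glue.update_crawler(**update_req)
# glue.start_crawler(Name=update_req["Name"])
```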

Grant access to the data analyst using Lake Formation

Now the data lake admin can delegate permissions on the database and table to the LFBusinessAnalyst user via the Lake Formation console.

Grant the LFBusinessAnalyst IAM user access to the database with Describe permissions.

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM user LFBusinessAnalyst.
  5. Under LF-Tags or catalog resources, choose lfcrawlerdb<your-accountid> for Databases.
  6. Select Describe for Database permissions.
  7. Choose Grant to apply the permissions.

Grant the LFBusinessAnalyst IAM user Select and Describe access to the table.

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM user LFBusinessAnalyst.
  5. Under LF-Tags or catalog resources, choose lfcrawlerdb<your-accountid> for Databases and lf_datalake_<your-accountid>_<region> for Tables.
  6. Select Select and Describe for Table permissions.
  7. Choose Grant to apply the permissions.
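The same table grant can be issued programmatically through the GrantPermissions API. In this sketch the account ID and Region embedded in the database and table names are placeholders:

```python
# Grant SELECT and DESCRIBE on the crawled table to the analyst user.
grant_req = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/LFBusinessAnalyst"
    },
    "Resource": {
        "Table": {
            "DatabaseName": "lfcrawlerdb111122223333",
            "Name": "lf_datalake_111122223333_us_east_1",
        }
    },
    "Permissions": ["SELECT", "DESCRIBE"],
}
# boto3.client("lakeformation").grant_permissions(**grant_req)
```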

Verify the tables using Athena

To verify the tables using Athena, complete the following steps:

  1. Log in as LFBusinessAnalyst using the password noted earlier through the CloudFormation stack.
  2. On the Athena console, choose lfconsumer-primary-workgroup as the Athena workgroup.
  3. Run the query to validate access as shown in the following screenshot.
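The validation query can also be submitted from code with the Athena StartQueryExecution API. The table name below is the hypothetical one created by the stack:

```python
# Submit a test query in the consumer workgroup (table name is a placeholder).
query_req = {
    "QueryString": (
        "SELECT * FROM lfcrawlerdb111122223333.lf_datalake_111122223333_us_east_1 "
        "LIMIT 10"
    ),
    "WorkGroup": "lfconsumer-primary-workgroup",
}
# boto3.client("athena").start_query_execution(**query_req)
```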

We have successfully crawled the Amazon S3 data store using the crawler with Lake Formation permissions and populated the metadata in the AWS Glue Data Catalog. We then granted Lake Formation permissions on the database and table to the consumer user and validated that user's access to the data using Athena.

Clean up

To avoid unwanted charges to your AWS account, you can delete the AWS resources:

  1. Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
  2. Delete the stack you created.

Summary

In this post, we showed how to use the new AWS Glue crawler integration with Lake Formation. Data lake admins can now share crawled tables with data analysts using Lake Formation, allowing analysts to use analytical services such as Athena. You can centrally manage all permissions in Lake Formation, making it easier to administer and protect data lakes.

Special thanks to everyone who contributed to this crawler feature launch: Anshuman Sharma, Jessica Cheng, Aditya K, and Sandya Krishnanand.

If you have questions or suggestions, submit them in the comments section.


About the authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

AWS Lake Formation 2022 year in review

Post Syndicated from Jason Berkowitz original https://aws.amazon.com/blogs/big-data/aws-lake-formation-2022-year-in-review/

Data governance is the collection of policies, processes, and systems that organizations use to ensure the quality and appropriate handling of their data throughout its lifecycle for the purpose of generating business value. Data governance is increasingly top-of-mind for customers as they recognize data as one of their most important assets. Effective data governance enables better decision-making by improving data quality, reducing data management costs, and ensuring secure access to data for stakeholders. In addition, data governance is required to comply with an increasingly complex regulatory environment with data privacy (such as GDPR and CCPA) and data residency regulations (such as in the EU, Russia, and China).

For AWS customers, effective data governance improves decision-making, increases business agility, provides a competitive advantage, and reduces the risk of fines due to non-compliance with regulatory obligations. We understand the unique opportunity to provide our customers a comprehensive end-to-end data governance solution that is seamlessly integrated into our portfolio of services, and AWS Lake Formation and the AWS Glue Data Catalog are key to solving these challenges.

In this post, we are excited to summarize the features that the AWS Glue Data Catalog, AWS Glue crawler, and Lake Formation teams delivered in 2022. We have collected some of the key talks and solutions on data governance, data mesh, and modern data architecture published and presented in AWS re:Invent 2022, and a few data lake solutions built by customers and AWS Partners for easy reference. Whether you are a data platform builder, data engineer, data scientist, or any technology leader interested in data lake solutions, this post is for you.

To learn more about how customers are securing and sharing data with Lake Formation, we recommend going deeper into GoDaddy’s decentralized data mesh, Novo Nordisk’s modern data architecture, and JPMorgan’s improvements to their Federated Data Lake, a governed data mesh implementation using Lake Formation. Also, you can learn how AWS Partners integrated with Lake Formation to help customers build unique data lakes, in Starburst’s data mesh solution, Informatica’s automated data sharing solution, Ahana’s Presto integration with Lake Formation, Ascending’s custom data governance system, how PBS used machine learning on their data lakes, and how hc1 provides personalized health insights for customers.

You can review how Lake Formation is used by customers to build modern data architectures in the following re:Invent 2022 talks:

The Lake Formation team listened to customer feedback and made improvements in the areas of cross-account data governance, expanding the source of data lakes, enabling unified data governance of a business data catalog, making secure business-to-business data sharing possible, and expanding the coverage area for fine-grained access controls to Amazon Redshift. In the rest of this post, we are happy to share the progress we made in 2022.

Enhancing cross-account governance

Lake Formation provides the foundation for customers to share data across accounts within their organization. You can share AWS Glue Data Catalog resources to AWS Identity and Access Management (IAM) principals within an account as well as other AWS accounts using two methods. The first one is called the named-resource method, where users can select the names of databases and tables and choose the type of permissions to share. The second method uses LF-Tags, where users can create and associate LF-Tags to databases and tables and grant permission to IAM principals using LF-Tag policies and expressions.

In November 2022, Lake Formation introduced version 3 of its cross-account sharing feature. With this new version, Lake Formation users can share catalog resources using LF-Tags at the AWS Organizations level. Sharing data using LF-tags helps scale permissions and reduces the admin work for data lake builders. The cross-account sharing version 3 also allows you to share resources to specific IAM principals in other accounts, providing data owners control over who can access their data in other accounts. Lastly, we have removed the overhead of writing and maintaining Data Catalog resource policies by introducing AWS Resource Access Manager (AWS RAM) invites with LF-Tags-based policies in the cross-account sharing version 3. We encourage you to further explore cross-account sharing in Lake Formation.
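As an illustration, a cross-account LF-tag grant to a specific IAM user in another account might look like the following boto3 sketch; the tag key, tag values, and ARNs are hypothetical, and the call is commented out:

```python
# Share all tables tagged domain=sales with an IAM user in another account.
grant_req = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::444455556666:user/lf-business-analysts"
    },
    "Resource": {
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "domain", "TagValues": ["sales"]}],
        }
    },
    "Permissions": ["SELECT", "DESCRIBE"],
    "PermissionsWithGrantOption": [],
}
# boto3.client("lakeformation").grant_permissions(**grant_req)
```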

Extending Lake Formation permissions to new data

Until re:Invent 2022, Lake Formation provided permissions management for IAM principals on Data Catalog resources with underlying data primarily on Amazon Simple Storage Service (Amazon S3). At re:Invent 2022, we introduced Lake Formation permissions management for Amazon Redshift data shares in preview mode. Amazon Redshift is a fully-managed, petabyte-scale data warehouse service in the AWS Cloud. The data sharing feature allows data owners to group databases, tables, and views in an Amazon Redshift cluster and share it with other Amazon Redshift clusters within or across AWS accounts. Data sharing reduces the need to keep multiple copies of the same data in different data warehouses to accelerate business decision-making across an organization. Lake Formation further enhances sharing data within Amazon Redshift data shares by providing fine-grained access control on tables and views.

For additional details on this feature, refer to AWS Lake Formation-managed Redshift datashares (preview) and How Redshift data share can be managed by Lake Formation.

Amazon EMR is a managed cluster platform to run big data applications using Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto at scale. You can use Amazon EMR to run batch and stream processing analytics jobs on your S3 data lakes. Starting with Amazon EMR release 6.7.0, we introduced Lake Formation permissions management on a runtime IAM role used with the EMR Steps API. This feature enables you to submit Apache Spark and Apache Hive applications to an EMR cluster through the EMR Steps API that enforces table-level and column-level permissions using Lake Formation to that IAM role submitting the application. This Lake Formation integration with Amazon EMR allows you to share an EMR cluster across multiple users in an organization with different permissions by isolating your applications through a runtime IAM role. We encourage you to check this feature in the Lake Formation workshop Integration with Amazon EMR using Runtime Roles. To explore a use case, see Introducing runtime roles for Amazon EMR steps: Use IAM roles and AWS Lake Formation for access control with Amazon EMR.
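A step submitted with a runtime role can be sketched as an AddJobFlowSteps request; the cluster ID, role ARN, and script location below are placeholders, and the call is commented out:

```python
# Submit a Spark step whose table- and column-level access is governed by
# Lake Formation permissions granted to the runtime role.
step_req = {
    "JobFlowId": "j-EXAMPLE12345",
    "ExecutionRoleArn": "arn:aws:iam::111122223333:role/emr-runtime-role",
    "Steps": [
        {
            "Name": "spark-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-bucket/app.py"],
            },
        }
    ],
}
# boto3.client("emr").add_job_flow_steps(**step_req)
```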

Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) that enables data scientists and developers to prepare data for building, training, tuning, and deploying models. Studio offers a native integration with Amazon EMR so that data scientists and data engineers can interactively prepare data at petabyte scale using open-source frameworks such as Apache Spark, Presto, and Hive using Studio notebooks. With the release of Lake Formation permissions management on a runtime IAM role, Studio now supports table-level and column-level access with Lake Formation. When users connect to EMR clusters from Studio notebooks, they can choose the IAM role (called the runtime IAM role) that they want to connect with. If data access is managed by Lake Formation, users can enforce table-level and column-level permissions using policies attached to the runtime role. For more details, refer to Apply fine-grained data access controls with AWS Lake Formation and Amazon EMR from Amazon SageMaker Studio.

Ingest and catalog varied data

A robust data governance model includes data from an organization’s many data sources and methods to discover and catalog those varied data assets. AWS Glue crawlers provide the ability to discover data from sources including Amazon S3, Amazon Redshift, and NoSQL databases, and populate the AWS Glue Data Catalog.

In 2022, we launched AWS Glue crawler support for Snowflake and AWS Glue crawler support for Delta Lake tables. These integrations allow AWS Glue crawlers to create and update Data Catalog tables based on these popular data sources. This makes it even easier to create extract, transform, and load (ETL) jobs with AWS Glue based on these Data Catalog tables as sources and targets.

In 2022, the AWS Glue crawlers UI was redesigned to offer a better user experience. One of the main enhancements delivered as part of this revision is the greater insights into AWS Glue crawler history. The crawler history UI provides an easy view of crawler runs, schedules, data sources, and tags. For each crawl, the crawler history offers a summary of changes in the database schema or Amazon S3 partition changes. Crawler history also provides detailed info about DPU hours and reduces the time spent analyzing and debugging crawler operations and costs. To explore the new functionalities added to the crawlers UI, refer to Set up and monitor AWS Glue crawlers using the enhanced AWS Glue UI and crawler history.

In 2022, we also extended support for crawlers based on Amazon S3 event notifications to support catalog tables. With this feature, incremental crawling can be offloaded from data pipelines to the scheduled AWS Glue crawler, reducing crawls to incremental S3 events. For more information, refer to Build incremental crawls of data lakes with existing Glue catalog tables.

More ways to share data beyond the data lake

During re:Invent 2022, we announced a preview of AWS Data Exchange for AWS Lake Formation, a new feature that enables data subscribers to find and subscribe to third-party datasets that are managed directly through Lake Formation. Until now, AWS Data Exchange subscribers could access third-party datasets by exporting providers’ files to their own S3 buckets, calling providers’ APIs through Amazon API Gateway, or querying producers’ Amazon Redshift data shares from their Amazon Redshift cluster. With the new Lake Formation integration, data providers curate AWS Data Exchange datasets using Lake Formation tags. Data subscribers are able to query and explore the databases and tables associated with those tags, just like any other AWS Glue Data Catalog resource. Organizations can apply resource-based Lake Formation permissions to share the licensed datasets within the same account or across accounts using AWS License Manager. AWS Data Exchange for Lake Formation streamlines data licensing and sharing operations by accelerating data onboarding, reducing the amount of ETL required for end-users to access third-party data, and centralizing governance and access controls for third-party data.

At re:Invent 2022, we also announced Amazon DataZone, a new data management service that makes it faster and easier for you to catalog, discover, share, and govern data stored across AWS, on-premises, and third-party sources. Amazon DataZone is a business data catalog service that supplements the technical metadata in the AWS Glue Data Catalog. Amazon DataZone is integrated with Lake Formation permissions management so that you can effectively manage and govern access to your data, and audit who is accessing what data and for what purpose. With the publisher-subscriber model of Amazon DataZone, data assets can be shared and accessed across Regions. For additional details about the service and its capabilities, refer to the Amazon DataZone FAQs and re:Invent launch.

Conclusion

Data is transforming every field and every business. However, with data growing faster than most companies can keep track of, collecting, securing, and getting value out of that data is a challenging thing to do. A modern data strategy can help you create better business outcomes with data. AWS provides the most complete set of services for the end-to-end data journey to help you unlock value from your data and turn it into insight.

At AWS, we work backward from customer requirements. From the Lake Formation team, we worked hard to deliver the features described in this post, and we invite you to check them out. With our continued focus to invent, we hope to play a key role in empowering organizations to build new data governance models that help you derive more business value at lightning speed.

You can get started with Lake Formation by exploring our hands-on workshop modules and Getting started tutorials. We look forward to hearing from you, our customers, on your data lake and data governance use cases. Please get in touch through your AWS account team and share your comments.


About the Authors

Jason Berkowitz is a Senior Product Manager with AWS Lake Formation. He comes from a background in machine learning and data lake architectures. He helps customers become data-driven.

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She enjoys building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs.

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

Post Syndicated from Vivek Shrivastava original https://aws.amazon.com/blogs/big-data/build-a-multi-region-and-highly-resilient-modern-data-architecture-using-aws-glue-and-aws-lake-formation/

AWS Lake Formation helps with enterprise data governance and is important for a data mesh architecture. It works with the AWS Glue Data Catalog to enforce data access and governance. Both services provide reliable data storage, but some customers want replicated storage, catalog, and permissions for compliance purposes.

This post explains how to create a design that automatically backs up Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, and Lake Formation permissions in different Regions and provides backup and restore options for disaster recovery. These mechanisms can be customized for your organization’s processes. The utility for cloning and experimentation is available in the open-sourced GitHub repository.

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication, S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process. This ensures that the data lake will still be functional in another Region if Lake Formation has an availability issue. The Data Catalog setup (tables, databases, resource links) and Lake Formation setup (permissions, settings) must also be replicated in the backup Region.

Solution overview

This post shows how to create a backup of the Lake Formation permissions and AWS Glue Data Catalog from one Region to another in the same account. The solution doesn’t create or modify AWS Identity and Access Management (IAM) roles, which are available in all Regions. There are three steps to creating a multi-Region data lake:

  1. Migrate Lake Formation data permissions.
  2. Migrate AWS Glue databases and tables.
  3. Migrate Amazon S3 data.

In the following sections, we look at each migration step in more detail.

Lake Formation permissions

In Lake Formation, there are two types of permissions: metadata access and data access.

Metadata access permissions allow users to create, read, update, and delete metadata databases and tables in the Data Catalog.

Data access permissions allow users to read and write data to specific locations in Amazon S3. Data access permissions are managed using data location permissions, which allow users to create and alter metadata databases and tables that point to specific Amazon S3 locations.

When data is migrated from one Region to another, only the metadata access permissions are replicated. This means that if data is moved from a bucket in the source Region to another bucket in the target Region, the data access permissions need to be reapplied in the target Region.

AWS Glue Data Catalog

The AWS Glue Data Catalog is a central repository of metadata about data stored in your data lake. It contains references to data that is used as sources and targets in AWS Glue ETL (extract, transform, and load) jobs, and stores information about the location, schema, and runtime metrics of your data. The Data Catalog organizes this information in the form of metadata tables and databases. A table in the Data Catalog is a metadata definition that represents the data in a data lake, and databases are used to organize these metadata tables.

Lake Formation permissions can only be applied to objects that already exist in the Data Catalog in the target Region. Therefore, in order to apply these permissions, the underlying Data Catalog databases and tables must already exist in the target Region. To meet this requirement, this utility migrates both the AWS Glue databases and tables from the source Region to the target Region.

Amazon S3 data

The data that underlies an AWS Glue table can be stored in an S3 bucket in any Region, so replication of the data itself isn’t necessary. However, if the data has already been replicated to the target Region, this utility has the option to update the table’s location to point to the replicated data in the target Region. If the location of the data is changed, the utility updates the S3 bucket name and keeps the rest of the prefix hierarchy unchanged.
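The bucket swap described above can be sketched as a small helper. This is an illustration of the behavior, not the utility's actual code:

```python
from urllib.parse import urlparse

def relocate_table_location(location: str, target_bucket: str) -> str:
    """Swap the bucket in an s3:// location, keeping the prefix unchanged."""
    parsed = urlparse(location)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 location: {location}")
    # parsed.path already carries the leading "/" of the key prefix.
    return f"s3://{target_bucket}{parsed.path}"

print(relocate_table_location("s3://src-bucket/sales/year=2022/", "dr-bucket"))
# → s3://dr-bucket/sales/year=2022/
```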

This utility doesn’t include the migration of data from the source Region to the target Region. Data migration must be performed separately using methods such as S3 replication, S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication.

This utility has two modes for replicating Lake Formation and Data Catalog metadata: on-demand and real-time. The on-demand mode is a batch replication that takes a snapshot of the metadata at a specific point in time and uses it to synchronize the metadata. The real-time mode replicates changes made to the Lake Formation permissions or Data Catalog in near-real time.

The on-demand mode of this utility is recommended for the initial replication of existing Lake Formation permissions and Data Catalog objects because it replicates a snapshot of the metadata. After the Lake Formation permissions and Data Catalog are synchronized, you can use real-time mode to replicate any ongoing changes. This creates a mirror image of the source Region in the target Region and keeps it up to date as changes are made in the source Region. These two modes can be used independently of each other, and the operations are idempotent.

The code for the on-demand and real-time modes is available in the GitHub repository. Let’s look at each mode in more detail.

On-demand mode

On-demand mode is used to copy the Lake Formation permissions and Data Catalog at a specific point in time. The code is deployed using the AWS Cloud Development Kit (AWS CDK). The following diagram shows the solution architecture for this mode.

The AWS CDK deploys an AWS Glue job to perform the replication. The job retrieves configuration information from a file stored in an S3 bucket. This file includes details such as the source and target Regions, an optional list of databases to replicate, and options for moving data to a different S3 bucket. More information about these options and deployment instructions is available in the GitHub repository.

The AWS Glue job retrieves the Lake Formation permissions and Data Catalog object metadata from the source Region and stores it in a JSON file in an S3 bucket. The same job then uses this file to create the Lake Formation permissions and Data Catalog databases and tables in the target Region.

This tool can be run on demand by running the AWS Glue job. It copies the Lake Formation permissions and Data Catalog object metadata from the source Region to the target Region. If you run the tool again after making changes to the target Region, the changes are replaced with the latest Lake Formation permissions and Data Catalog from the source Region.

This utility can detect any changes made to the Data Catalog metadata, databases, tables, and columns while replicating the Data Catalog from the source to the target Region. If a change is detected in the source Region, the latest version of the AWS Glue object is applied to the target Region. The utility reports the number of objects modified during its run.

The Lake Formation permissions are copied from the source to the target Region, so any new permissions are replicated in the target Region. If a permission is removed from the source Region, it is not removed from the target Region.

Real-time mode

Real-time mode replicates the Lake Formation permissions and Data Catalog at a regular interval. The default interval is 1 minute, but it can be modified during deployment. The code is deployed using the AWS CDK. The following diagram shows the solution architecture for this mode.

The AWS CDK deploys two AWS Lambda jobs and creates an Amazon DynamoDB table to store AWS CloudTrail events and an Amazon EventBridge rule to run the replication at a regular interval. The Lambda jobs retrieve the configuration information from a file stored in an S3 bucket. This file includes details such as the source and target Regions, options for moving data to a different S3 bucket, and the lookback period for CloudTrail in hours. More information about these options and deployment instructions is available in the GitHub repository.

The EventBridge rule triggers a Lambda job at a fixed interval. This job retrieves the configuration information and queries CloudTrail events related to the Data Catalog and Lake Formation that occurred in the past hour (the duration is configurable). All relevant events are then stored in a DynamoDB table.
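The CloudTrail query performed by that Lambda job can be sketched as a LookupEvents request; the event source filter shown here is an assumption about how the utility scopes events, and the call is commented out:

```python
from datetime import datetime, timedelta, timezone

lookback_hours = 1  # configurable, per the utility's configuration file
now = datetime.now(timezone.utc)
lookup_req = {
    "LookupAttributes": [
        {"AttributeKey": "EventSource",
         "AttributeValue": "lakeformation.amazonaws.com"}
    ],
    "StartTime": now - timedelta(hours=lookback_hours),
    "EndTime": now,
}
# boto3.client("cloudtrail").lookup_events(**lookup_req)
```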

After the event information is inserted into the DynamoDB table, another Lambda job is triggered. This job retrieves the configuration information and queries the DynamoDB table. It then applies all the changes to the target Region. If the tool is run again after making changes to the target Region, the changes are replaced with the latest Lake Formation permissions and Data Catalog from the source Region. Unlike on-demand mode, this utility also removes any Lake Formation permissions that were removed from the source Region from the target Region.

Limitations

This utility is designed to replicate permissions within a single account only. The on-demand mode replicates a snapshot and doesn’t remove existing permissions, so it doesn’t perform delete operations. The API currently doesn’t support replicating changes to row and column permissions.

Conclusion

In this post, we showed how you can use this utility to migrate the AWS Glue Data Catalog and Lake Formation permissions from one Region to another. It can also keep the source and target Regions synchronized if any changes are made to the Data Catalog or the Lake Formation permissions. Implementing it across Regions (multi-Region) is a good option if you are looking for the most separation and complete independence of your globally diverse data workloads. Also consider the trade-offs: implementing and operating this strategy, particularly multi-Region, can be more complicated and more expensive than other DR strategies.

To get started, check out the GitHub repo.


About the authors

Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 13 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finding areas for home automation.

Raza Hafeez is a Senior Data Architect within the Shared Delivery Practice of AWS Professional Services. He has over 12 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Nivas Shankar is a Principal Product Manager for AWS Lake Formation. He works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data lakes. He also leads several data and analytics initiatives within AWS, including support for data mesh.

Enable cross-account sharing with direct IAM principals using AWS Lake Formation Tags

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/enable-cross-account-sharing-with-direct-iam-principals-using-aws-lake-formation-tags/

With AWS Lake Formation, you can build data lakes with multiple AWS accounts in a variety of ways. For example, you could build a data mesh, implementing a centralized data governance model and decoupling data producers from the central governance. Such data lakes enable the data as an asset paradigm and unleash new possibilities with data discovery and exploration across organization-wide consumers. While enabling the power of data in decision-making across your organization, it’s also crucial to secure the data. With Lake Formation, sharing datasets across accounts only requires a few simple steps, and you can control what you share.

Lake Formation has launched Version 3 capabilities for sharing AWS Glue Data Catalog resources across accounts. When moving to Lake Formation cross-account sharing V3, you get several benefits. When moving from V1, you get more optimized usage of AWS Resource Access Manager (AWS RAM) to scale sharing of resources. When moving from V2, you get a few enhancements. First, you don’t have to maintain AWS Glue resource policies to share using LF-tags because Version 3 uses AWS RAM. Second, you can share with AWS Organizations using LF-tags. Third, you can share to individual AWS Identity and Access Management (IAM) users and roles in other accounts, thereby providing data owners control over which individuals can access their data.

Lake Formation tag-based access control (LF-TBAC) is an authorization strategy that defines permissions based on attributes called LF-tags. LF-tags are different from IAM resource tags and are associated only with Lake Formation databases, tables, and columns. LF-TBAC allows you to define the grant and revoke permissions policy by grouping Data Catalog resources, and therefore helps in scaling permissions across a large number of databases and tables. LF-tags are inherited from a database to all its tables and all the columns of each table.
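Creating an LF-tag and attaching it to a database maps to the CreateLFTag and AddLFTagsToResource APIs. A sketch with a hypothetical tag key and database name (calls commented out):

```python
# Define an LF-tag and attach it to a database; tables and columns
# inherit it, per the LF-TBAC model described above.
create_req = {"TagKey": "department", "TagValues": ["marketing", "finance"]}
associate_req = {
    "Resource": {"Database": {"Name": "tpc"}},
    "LFTags": [{"TagKey": "department", "TagValues": ["marketing"]}],
}
# lf = boto3.client("lakeformation")
# lf.create_lf_tag(**create_req)
# lf.add_lf_tags_to_resource(**associate_req)
```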

Version 3 offers the following benefits:

  • True central governance with cross-account sharing to specific IAM principals in the target account
  • Ease of use in not having to maintain an AWS Glue resource policy for LF-TBAC
  • Efficient reuse of AWS RAM shares
  • Ease of use in scaling to hundreds of accounts with LF-TBAC

In this post, we illustrate the new features of cross-account sharing Version 3 in a producer-consumer scenario using TPC datasets. We walk through the setup of using LF-TBAC to share data catalog resources from the data producer account to direct IAM users in the consumer account. We also go through the steps in the receiving account to accept the shares and query the data.

Solution overview

To demonstrate the Lake Formation cross-account Version 3 features, we use the TPC datasets available at s3://aws-data-analytics-workshops/shared_datasets/tpcparquet/. The solution consists of steps in both accounts.

In account A, complete the following steps:

  1. As a data producer, register the dataset with Lake Formation and create AWS Glue Data Catalog tables.
  2. Create LF-tags and associate them with the database and tables.
  3. Grant LF-tag based permissions on resources directly to personas in consumer account B.

The following steps take place in account B:

  1. The consumer account data lake admin reviews and accepts the AWS RAM invitations.
  2. The data lake admin gives CREATE DATABASE access to the IAM user lf_business_analysts.
  3. The data lake admin creates a database for the marketing team and grants CREATE TABLE access to lf_campaign_manager.
  4. The IAM users create resource links on the shared database and tables and query them in Amazon Athena.

The producer account A has the following personas:

  • Data lake admin – Manages the data lake in the producer account
  • lf-producersteward – Manages the data and user access

The consumer account B has the following personas:

  • Data lake admin – Manages the data lake in the consumer account
  • lf-business-analysts – Business analysts on the sales team who need access to non-PII data
  • lf-campaign-manager – The marketing team manager who needs access to data related to products and promotions

Prerequisites

You need the following prerequisites:

  • Two AWS accounts. For this demonstration of how AWS RAM invites are created and accepted, you should use two accounts that are not part of the same organization.
  • An admin IAM user in both accounts to launch the AWS CloudFormation stacks.
  • Lake Formation enabled in both the producer and consumer accounts, with cross-account sharing set to Version 3. For instructions, refer to Change the default permission model.

Lake Formation and AWS CloudFormation setup in account A

To keep the setup simple, we have an IAM admin registered as the data lake admin.

  1. Sign into the AWS Management Console in the us-east-1 Region.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  3. Under Data lake administrators, choose Choose administrators.
  4. In the Manage data lake administrators pop-up window, under IAM users and roles, choose your IAM admin user, then choose Save.
  5. Choose Launch Stack to deploy the CloudFormation template:
    BDB-2063-launch-cloudformation-stack
  6. Choose Next.
  7. Provide a name for the stack and choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation should take about 2–3 minutes. The stack establishes the producer setup as follows:

  • Creates an Amazon Simple Storage Service (Amazon S3) data lake bucket
  • Registers the data lake bucket with Lake Formation
  • Creates an AWS Glue database and tables
  • Creates an IAM user (lf-producersteward) who will act as producer steward
  • Creates LF-tags and assigns them to the created catalog resources as specified in the following table
Database | Table | LF-Tag Key | LF-Tag Value | Resource Tagged
lftpcdb | (the database itself) | Sensitivity | Public | DATABASE
lftpcdb | items | HasCampaign | true | TABLE
lftpcdb | promotions | HasCampaign | true | TABLE
lftpcdb | customers (columns c_last_name, c_first_name, c_email_address) | Sensitivity | Confidential | TABLECOLUMNS
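The tag assignments in the preceding table can be sketched as Lake Formation AddLFTagsToResource payloads. Note how the TableWithColumns resource carries the column-level Sensitivity=Confidential tag, which overrides the database-level Sensitivity=Public tag for those columns only. This is a sketch whose shapes mirror the boto3 API, not literally what the CloudFormation stack runs.

```python
# AddLFTagsToResource-shaped payloads for the assignments in the table above.

# Table-level tag on items (promotions is tagged the same way).
items_tag = {
    "Resource": {"Table": {"DatabaseName": "lftpcdb", "Name": "items"}},
    "LFTags": [{"TagKey": "HasCampaign", "TagValues": ["true"]}],
}

# Column-level tag on the three PII columns of customers; it overrides the
# Sensitivity=Public tag inherited from the database for just these columns.
customers_pii_tag = {
    "Resource": {
        "TableWithColumns": {
            "DatabaseName": "lftpcdb",
            "Name": "customers",
            "ColumnNames": ["c_last_name", "c_first_name", "c_email_address"],
        }
    },
    "LFTags": [{"TagKey": "Sensitivity", "TagValues": ["Confidential"]}],
}
```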

Verify permissions in account A

After the CloudFormation stack launches, complete the following steps in account A:

  1. On the AWS CloudFormation console, navigate to the Outputs tab of the stack.
  2. Choose the LFProducerStewardCredentials value to navigate to the AWS Secrets Manager console.
  3. In the Secret value section, choose Retrieve secret value.
  4. Note down the secret value for the password for IAM user lf-producersteward.

You need this to log in to the console later as the user lf-producersteward.

  5. On the Lake Formation console, choose Databases in the navigation pane.
  6. Open the database lftpcdb.
  7. Verify that the LF-tags on the database were created.
  8. Choose View tables, then choose the items table to verify its LF-tags.
  9. Repeat the previous step for the promotions and customers tables to verify the LF-tags assigned.
  10. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  11. Select the database lftpcdb and on the Actions menu, choose View permissions.
  12. Verify that there are no default permissions granted on the database lftpcdb for IAMAllowedPrincipals.
  13. If you find any, select the permission and choose Revoke to revoke the permission.
  14. On the AWS Management Console, choose the AWS CloudShell icon on the top menu.

This opens AWS CloudShell in another tab of the browser. Allow a few minutes for the CloudShell environment to set up.

  15. Run the following AWS Command Line Interface (AWS CLI) command, replacing {BUCKET_NAME} with the DataLakeBucket value from the stack output:

aws s3 cp s3://aws-data-analytics-workshops/shared_datasets/tpcparquet/ s3://${BUCKET_NAME}/tpcparquet/ --recursive

If CloudShell isn't available in your chosen Region, run the same command as the IAM admin user from your preferred AWS CLI environment.

  16. Verify that your S3 bucket has the dataset copied into it.
  17. Log out as the IAM admin user.
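Revoking the default IAMAllowedPrincipals permission can also be done programmatically. The following sketch is a payload shaped like the Lake Formation RevokePermissions API, targeting the IAMAllowedPrincipals virtual group on the lftpcdb database (the exact permissions present may differ in your account):

```python
# RevokePermissions-shaped payload that removes the default grant from the
# IAMAllowedPrincipals virtual group on the lftpcdb database.
revoke_request = {
    "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
    "Resource": {"Database": {"Name": "lftpcdb"}},
    "Permissions": ["ALL"],
}
```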

Grant permissions in account A

Next, we continue granting Lake Formation permissions to the dataset as a data steward within the producer account. The data steward grants the following LF-tag-based permissions to the consumer personas.

Consumer Persona | LF-Tag Policy
lf-business-analysts | Sensitivity=Public
lf-campaign-manager | HasCampaign=true

  1. Log in to account A as user lf-producersteward, using the password you noted from Secrets Manager earlier.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  3. Choose Grant.
  4. Under Principals, select External accounts.
  5. Enter the ARN of the IAM user in the consumer account (arn:aws:iam::<accountB_id>:user/lf-business-analysts) and press Enter.
  6. Under LF-Tags or catalog resources, select Resources matched by LF-Tags.
  7. Choose Add LF-Tag to add a new key-value pair.
  8. For the key, choose Sensitivity, and for the value, choose Public.
  9. Under Database permissions, select Describe, and under Table permissions, select Select and Describe.
  10. Choose Grant to apply the permissions.
  11. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  12. Choose Grant.
  13. Under Principals, select External accounts.
  14. Enter the ARN of the IAM user in the consumer account (arn:aws:iam::<accountB_id>:user/lf-campaign-manager) and press Enter.
  15. Under LF-Tags or catalog resources, select Resources matched by LF-Tags.
  16. Choose Add LF-Tag to add a new key-value pair.
  17. For the key, choose HasCampaign, and for the value, choose true.
  18. Under Database permissions, select Describe, and under Table permissions, select Select and Describe.
  19. Choose Grant to apply the permissions.
  20. Verify on the Data lake permissions tab that the permissions you granted appear correctly.
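The console grants above can also be expressed as Lake Formation GrantPermissions payloads. The following sketch shows the table-level, LF-tag-based grant to lf-campaign-manager; the account ID is a placeholder, and the shapes mirror the boto3 API:

```python
CONSUMER_ACCOUNT = "111122223333"  # placeholder for account B's ID

# GrantPermissions-shaped payload: grant SELECT and DESCRIBE on all tables
# matching HasCampaign=true directly to an IAM user in the consumer account.
grant_request = {
    "Principal": {
        "DataLakePrincipalIdentifier": f"arn:aws:iam::{CONSUMER_ACCOUNT}:user/lf-campaign-manager"
    },
    "Resource": {
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "HasCampaign", "TagValues": ["true"]}],
        }
    },
    "Permissions": ["SELECT", "DESCRIBE"],
}
```

The grant to lf-business-analysts would look the same with the expression Sensitivity=Public and that user's ARN.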

AWS CloudFormation setup in account B

Complete the following steps in the consumer account:

  1. Log in as an IAM admin user in account B and launch the CloudFormation stack:
    BDB-2063-launch-cloudformation-stack
  2. Choose Next.
  3. Provide a name for the stack, then choose Next.
  4. On the next page, choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

Stack creation should take about 2–3 minutes. The stack sets up the following resources in account B:

  • IAM users datalakeadmin1, lf-business-analysts, and lf-campaign-manager, with relevant IAM and Lake Formation permissions
  • A database called db_for_shared_tables, with CREATE_TABLE permission granted to the lf-campaign-manager user
  • An S3 bucket named lfblog-athenaresults-<your-accountB-id>-us-east-1, with ListBucket and write permissions granted to lf-business-analysts and lf-campaign-manager

Note down the stack output details.

Accept resource shares in account B

After you launch the CloudFormation stack, complete the following steps in account B:

  1. On the CloudFormation stack Outputs tab, choose the link for DataLakeAdminCredentials.

This takes you to the Secrets Manager console.

  2. On the Secrets Manager console, choose Retrieve secret value and copy the password for the datalakeadmin1 user.
  3. Use the ConsoleIAMLoginURL value from the CloudFormation template output to log in to account B with the data lake admin user name datalakeadmin1 and the password you copied from Secrets Manager.
  4. Open the AWS RAM console in another browser tab.
  5. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations.

You should see two resource share invitations from producer account A: one for the database-level share and one for the table-level share.

  6. Choose each resource share link, review the details, and choose Accept.

After you accept the invitations, the status of the resource shares changes from Pending to Active.
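The same acceptance can be scripted. The following sketch filters invitation records shaped like the output of the AWS RAM GetResourceShareInvitations API and picks the ones still pending, whose ARNs would then be passed to AcceptResourceShareInvitation; the ARNs here are placeholders.

```python
# Invitation records are shaped like the RAM GetResourceShareInvitations output.
def pending_invitation_arns(invitations):
    """Return ARNs of invitations that still need to be accepted."""
    return [
        inv["resourceShareInvitationArn"]
        for inv in invitations
        if inv["status"] == "PENDING"
    ]

sample_invitations = [
    {
        "resourceShareInvitationArn": "arn:aws:ram:us-east-1:111122223333:resource-share-invitation/database-share",
        "status": "PENDING",
    },
    {
        "resourceShareInvitationArn": "arn:aws:ram:us-east-1:111122223333:resource-share-invitation/old-share",
        "status": "ACCEPTED",
    },
]
```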

Grant permissions in account B

To grant permissions in account B, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  2. Under Database creators, choose Grant.
  3. Under IAM users and roles, choose lf-business-analysts.
  4. For Catalog permissions, select Create database.
  5. Choose Grant.
  6. Log out of the console as the data lake admin user.
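Granting database creator rights corresponds to a CREATE_DATABASE grant on the account's Data Catalog. A GrantPermissions-shaped sketch, with a placeholder account ID:

```python
# GrantPermissions-shaped payload: make lf-business-analysts a database
# creator by granting CREATE_DATABASE on the account's Data Catalog.
creator_grant = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::444455556666:user/lf-business-analysts"
    },
    "Resource": {"Catalog": {}},  # the grant applies catalog-wide
    "Permissions": ["CREATE_DATABASE"],
}
```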

Query the shared datasets as consumer users

To validate the lf-business-analysts user’s data access, perform the following steps:

  1. Log in to the console as lf-business-analysts, using the credentials noted from the CloudFormation stack output.
  2. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  3. Select the database lftpcdb and on the Actions menu, choose Create resource link.
  4. For Resource link name, enter rl_lftpcdb.
  5. Choose Create.
  6. After the resource link is created, select the resource link and choose View tables.

You can now see the four tables in the shared database.
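Creating the resource link corresponds to an AWS Glue CreateDatabase call whose TargetDatabase points at the shared database. A payload sketch, with the producer account ID as a placeholder:

```python
# CreateDatabase-shaped payload for a resource link: a local database name
# that points at the database shared from producer account A.
resource_link_request = {
    "DatabaseInput": {
        "Name": "rl_lftpcdb",
        "TargetDatabase": {
            "CatalogId": "111122223333",  # placeholder: producer account A
            "DatabaseName": "lftpcdb",
        },
    }
}
```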

  7. Open the Athena console in another browser tab and choose the lfblog-athenaresults-<your-accountB-id>-us-east-1 bucket as the query results location.
  8. Verify data access using the following query (for more information, refer to Running SQL queries using Amazon Athena):

SELECT * FROM rl_lftpcdb.customers LIMIT 10;

The following screenshot shows the query output.

Notice that account A shared the database lftpcdb to account B using the LF-tag expression Sensitivity=Public. Columns c_first_name, c_last_name, and c_email_address in the customers table were tagged Sensitivity=Confidential, overriding the database-level tag. Therefore, these three columns are not visible to the user lf-business-analysts.

You can preview the other tables from the database similarly to see the available columns and data.

  9. Log out of the console as lf-business-analysts.

Now we can validate the lf-campaign-manager user’s data access.

  1. Log in to the console as lf-campaign-manager using the credentials noted from the CloudFormation stack output.
  2. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  3. Verify that you can see the database db_for_shared_tables shared by the data lake admin.
  4. Under Data catalog in the navigation pane, choose Tables.

You should be able to see the two tables shared from account A using the LF-tag expression HasCampaign=true. The two tables show the Owner account ID as account A.

Because lf-campaign-manager received table-level shares, this user creates table-level resource links for querying in Athena.

  5. Select the promotions table, and on the Actions menu, choose Create resource link.
  6. For Resource link name, enter rl_promotions.
  7. Under Database, choose db_for_shared_tables as the database to contain the resource link.
  8. Choose Create.
  9. Repeat the resource link creation for the items table.

Notice that the resource links show account B as the owner, whereas the linked tables show account A as the owner.
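A table-level resource link corresponds to an AWS Glue CreateTable call with a TargetTable. A payload sketch for rl_promotions, with the producer account ID as a placeholder:

```python
# CreateTable-shaped payload for a table resource link: rl_promotions lives
# in the local db_for_shared_tables database and points at the shared
# promotions table owned by account A.
table_link_request = {
    "DatabaseName": "db_for_shared_tables",
    "TableInput": {
        "Name": "rl_promotions",
        "TargetTable": {
            "CatalogId": "111122223333",  # placeholder: producer account A
            "DatabaseName": "lftpcdb",
            "Name": "promotions",
        },
    },
}
```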

  10. Open the Athena console in another browser tab and choose the lfblog-athenaresults-<your-accountB-id>-us-east-1 bucket as the query results location.
  11. Query the tables using the resource links.

As shown in the following screenshot, all columns of both tables are accessible to lf-campaign-manager.

In summary, you have seen how LF-tags are used to share a database and select tables from one account to another account’s IAM users.

Clean up

To avoid incurring charges on the AWS resources created in this post, you can perform the following steps.

First, clean up resources in account A:

  1. Empty the S3 bucket created for this post by deleting the copied dataset objects.
  2. Delete the CloudFormation stack.

This deletes the S3 bucket, custom IAM roles, policies, and the LF database, tables, and permissions.

  1. Optionally, revert the Lake Formation settings on the Settings page of the Lake Formation console to restore default IAM access control.

Now complete the following steps in account B:

  1. Empty the S3 bucket lfblog-athenaresults-<your-accountB-id>-us-east-1 used as the Athena query results location.
  2. Revoke permission to lf-business-analysts as database creator.
  3. Delete the CloudFormation stack.

This deletes the IAM users, S3 bucket, Lake Formation database db_for_shared_tables, resource links, and all the permissions from Lake Formation.

If there are any resource links and permissions left, delete them manually in Lake Formation from both accounts.

Conclusion

In this post, we illustrated the benefits of Lake Formation cross-account sharing Version 3, using LF-tags to share with direct IAM principals, and showed how to receive the shared tables in the consumer account. We used a two-account scenario in which a data producer account shares a database and specific tables to individual IAM users in another account using LF-tags. In the receiving account, we showed the role played by a data lake admin versus the receiving IAM users. We also illustrated how to override column-level tags to mask PII data when sharing.

With Version 3 of the cross-account sharing features, Lake Formation enables more modern data mesh models, where a producer can share directly with an IAM principal in another account instead of with the entire account. Data mesh implementation becomes easier for data administrators and data platform owners because they can scale to hundreds of consumer accounts using LF-tag-based sharing to organization units or organization IDs.

We encourage you to upgrade your Lake Formation cross-account sharing to Version 3 and benefit from the enhancements. For more details, see Updating cross-account data sharing version settings.


About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

How Novo Nordisk built a modern data architecture on AWS

Post Syndicated from Jonatan Selsing original https://aws.amazon.com/blogs/big-data/how-novo-nordisk-built-a-modern-data-architecture-on-aws/

Novo Nordisk is a leading global pharmaceutical company, responsible for producing life-saving medicines that reach more than 34 million patients each day. They do this following their triple bottom line—that they must strive to be environmentally sustainable, socially sustainable, and financially sustainable. The combination of using AWS and data supports all these targets.

Data is pervasive throughout the entire value chain of Novo Nordisk, from foundational research, manufacturing lines, sales and marketing, clinical trials, and pharmacovigilance through patient-facing data-driven applications. Therefore, getting the foundation right for how data is stored, safeguarded, and used in a way that provides the most value is one of the central drivers of improved business outcomes.

Together with AWS Professional Services, we’re building a data and analytics solution using a modern data architecture. The collaboration between Novo Nordisk and AWS Professional Services is a strategic, long-term, close engagement, where developers from both organizations have worked together closely for years. The data and analytics environments are built around the core tenets of the data mesh—decentralized domain ownership of data, data as a product, self-service data infrastructure, and federated computational governance. This enables the users of the environment to work with data in the way that drives the best business outcomes. We have combined this with elements from evolutionary architectures that allow us to adapt functionality as AWS continuously develops new services and capabilities.

In this series of posts, you will learn how Novo Nordisk and AWS Professional Services built a data and analytics ecosystem to speed up innovation at petabyte scale:

  • In this first post, you will learn how the overall design has enabled the individual components to come together in a modular way. We dive deep into how we built a data management solution based on the data mesh architecture.
  • The second post discusses how we built a trust network between the systems that comprise the entire solution. We show how we use event-driven architectures, coupled with the use of attribute-based access controls, to ensure permission boundaries are respected at scale.
  • In the third post, we show how end-users can consume data from their tool of choice, without compromising data governance. This includes how to configure Okta, AWS Lake Formation, and Microsoft Power BI to enable SAML-based federated use of Amazon Athena for an enterprise business intelligence (BI) activity.

Pharma-compliant environment

Because it operates in the pharmaceutical industry, GxP compliance is a mandate for Novo Nordisk. GxP is a general abbreviation for the “Good x Practice” quality guidelines and regulations defined by regulators such as the European Medicines Agency, the U.S. Food and Drug Administration, and others. These guidelines are designed to ensure that medicinal products are safe and effective for their intended use. In the context of a data environment, GxP compliance involves implementing integrity controls for data used in decision-making and processes, and guides how change management processes are implemented to continuously ensure compliance over time.

Because this data environment supports teams across the whole organization, each individual data owner must retain accountability on their data. Features were designed to provide data owners autonomy and transparency when managing their data, enabling them to take this responsibility. This includes the capability to handle personally identifiable information (PII) data and other sensitive workloads. To provide traceability on the environment, audit capabilities were added, which we describe more in this post.

Solution overview

The full solution is a sprawling landscape of independent services that work together to enable data and analytics with a decentralized data governance model at petabyte scale. Schematically, it can be represented as in the following figure.

Novo Nordisk Modern Data Architecture on AWS

The architecture is split into three independent layers: data management, virtualization, and consumption. The end-user sits in the consumption layer and works with their tool of choice. The consumption layer is meant to abstract AWS-native resources into application primitives as much as possible, and it is integrated into the virtualization layer, which abstracts the access to data. The purpose of the virtualization layer is to translate between data consumption and data management solutions. The access to data is managed by what we refer to as data management solutions; we discuss one of our versatile data management solutions later in this post. Each layer in this architecture is independent of the others and relies only on well-defined interfaces.

Central to this architecture is that access is encapsulated in an AWS Identity and Access Management (IAM) role session. The data management layer focuses on providing the IAM role with the right permissions and governance, the virtualization layer provides access to the role, and the consumption layer abstracts the use of the roles in the tools of choice.
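As a minimal sketch of that role-session encapsulation, the following payload is shaped like the STS AssumeRole API, which a virtualization layer could use to mint a short-lived session for a data-access role. All names are illustrative, not Novo Nordisk's actual roles.

```python
# AssumeRole-shaped payload: the virtualization layer obtains a short-lived
# session for a data-access role; the consumption layer then uses the
# resulting credentials in the end-user's tool of choice.
assume_role_request = {
    "RoleArn": "arn:aws:iam::111122223333:role/data-domain-reader",  # illustrative
    "RoleSessionName": "analyst-session",
    "DurationSeconds": 3600,  # one-hour session
}
```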

Technical architecture

Each of the three layers in the overall architecture has a distinct responsibility, but no singular implementation. Think of them as abstract classes. They can be implemented in concrete classes, and in our case they rely on foundational AWS services and capabilities. Let’s go through each of the three layers.

Data management layer

The data management layer is responsible for providing access to and governance of data. As illustrated in the following diagram, a minimal construct in the data management layer is the combination of an Amazon Simple Storage Service (Amazon S3) bucket and an IAM role that gives access to the S3 bucket. This construct can be expanded to include granular permissions with Lake Formation, auditing with AWS CloudTrail, and security response capabilities from AWS Security Hub. The following diagram also shows that a single data management solution has no singular span. It can cross many AWS accounts and be comprised of any number of IAM role combinations.

Data Management Architecture
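The minimal construct described above can be sketched as an IAM policy document attached to the role, granting read access to the data bucket. The bucket name and statement contents are illustrative.

```python
bucket = "example-data-domain-bucket"  # illustrative bucket name

# Minimal IAM policy document for the role in the S3 bucket + IAM role pair.
read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",    # ListBucket applies to the bucket
                f"arn:aws:s3:::{bucket}/*",  # GetObject applies to its objects
            ],
        }
    ],
}
```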

We have purposely not illustrated the trust policy of these roles in this figure, because those are a collaborative responsibility between the virtualization layer and the data management layer. We go into detail of how that works in the next post in this series. Data engineering professionals often interface directly with the data management layer, where they curate and prepare data for consumption.

Virtualization layer

The purpose of the virtualization layer is to keep track of who can do what. It doesn’t have any capabilities in itself, but translates the requirements from the data management ecosystems to the consumption layers and vice versa. It enables end-users on the consumption layer to access and manipulate data on one or more data management ecosystems, according to their permissions. This layer abstracts from end-users the technical details of data access, such as the permission model, role assumptions, and storage location. It owns the interfaces to the other layers and enforces the logic of the abstraction. In the context of hexagonal architectures (see Developing evolutionary architecture with AWS Lambda), the virtualization layer plays the role of the domain logic, ports, and adapters, while the other two layers are actors. The data management layer communicates the state of the layer to the virtualization layer and conversely receives information about the service landscape to trust. The virtualization layer architecture is shown in the following diagram.

Virtualization Layer Architecture

Consumption layer

The consumption layer is where the end-users of the data products are sitting. This can be data scientists, business intelligence analysts, or any third party that generates value from consuming the data. It’s important for this type of architecture that the consumption layer has a hook-based sign-in flow, where the authorization into the application can be modified at sign-in time. This is to translate the AWS-specific requirement into the target applications. After the session in the client-side application has successfully been started, it’s up to the application itself to instrument for data layer abstraction, because this will be application specific. And this is an additional important decoupling, where some responsibility is pushed to the decentralized units. Many modern software as a service (SaaS) applications support these built-in mechanisms, such as Databricks or Domino Data Lab, whereas more traditional client-side applications like RStudio Server have more limited native support for this. In the case where native support is missing, a translation down to the OS user session can be done to enable the abstraction. The consumption layer is shown schematically in the following diagram.

Consumption Layer Architecture

When using the consumption layer as intended, the users don’t know that the virtualization layer exists. The following diagram illustrates the data access patterns.

Data Access Patterns

Modularity

One of the main advantages of adopting the hexagonal architecture pattern, and delegating both the consuming layer and the data management layer to primary and secondary actors, is that they can be changed or replaced as new functionalities are released that require new solutions. This gives a hub-and-spoke type pattern, where many different types of producer/consumer systems can be connected and work simultaneously in union. An example of this is that the current solution running in Novo Nordisk supports multiple, simultaneous data management solutions, which are exposed in a homogeneous way in the consuming layer. This includes a data lake, the data mesh solution presented in this post, and several independent data management solutions. And these are exposed to multiple types of consuming applications, from custom-managed, self-hosted applications to SaaS offerings.

Data management ecosystem

To scale the usage of the data and increase the freedom, Novo Nordisk, jointly with AWS Professional Services, built a data management and governance environment, named Novo Nordisk Enterprise DataHub (NNEDH). NNEDH implements a decentralized distributed data architecture, and data management capabilities such as an enterprise business data catalog and data sharing workflow. NNEDH is an example of a data management ecosystem in the conceptual framework introduced earlier.

Decentralized architecture: From a centralized data lake to a distributed architecture

Novo Nordisk’s centralized data lake consists of 2.3 PB of data from more than 30 business data domains worldwide, serving more than 2,000 internal users throughout the value chain. It has been running successfully for several years and is one of the data management ecosystems currently supported.

Within the centralized data architecture, data from each data domain is copied, stored, and processed in one central location: a central data lake hosted in one data storage. This pattern has challenges at scale because it retains the data ownership with the central team. At scale, this model slows down the journey toward a data-driven organization, because ownership of the data isn’t sufficiently anchored with the professionals closest to the domain.

The monolithic data lake architecture is shown in the following diagram.

Monolithic Data Lake Architecture

Within the decentralized distributed data architecture, the data from each domain is kept within the domain on its own data storage and compute account. In this case, the data is kept close to domain experts, because they’re the ones who know their own data best and are ultimately the owner of any data products built around their data. They often work closely with business analysts to build the data product and therefore know what good data means to consumers of their data products. In this case, the data responsibility is also decentralized, where each domain has its own data owner, putting the accountability onto the true owners of the data. Nevertheless, this model might not work at small scale, for example an organization with only one business unit and tens of users, because it would introduce more overhead on the IT team to manage the organization data. It better suits large organizations, or small and medium ones that would like to grow and scale.

The Novo Nordisk data mesh architecture is shown in the following diagram.

Novo Nordisk Data Mesh Architecture

Data domains and data assets

To enable the scalability of data domains across the organization, it’s mandatory to have a standard permission model and data access pattern. This standard must not be so restrictive that it blocks specific use cases, but it should be standardized enough to use the same interface between the data management and virtualization layers.

The data domains on NNEDH are implemented by a construct called an environment. An environment is composed of at least one AWS account and one AWS Region. It’s a workplace where data domain teams can work and collaborate to build data products. It links the NNEDH control plane to the AWS accounts where the data and compute of the domain reside. The data access permissions are also defined at the environment level, managed by the owner of the data domain. The environments have three main components: a data management and governance layer, data assets, and optional blueprints for data processing.

For data management and governance, the data domains rely on Lake Formation, AWS Glue, and CloudTrail. The deployment method and setup of these components is standardized across data domains. This way, the NNEDH control plane can provide connectivity and management to data domains in a standardized way.

The data assets of each domain residing in an environment are organized in a dataset, which is a collection of related data used for building a data product. It includes technical metadata such as data format, size, and creation time, and business metadata such as the producer, data classification, and business definition. A data product can use one or several datasets. It is implemented through managed S3 buckets and the AWS Glue Data Catalog.

Data processing can be implemented in different ways. NNEDH provides blueprints for data pipelines with predefined connectivity to data assets to speed up the delivery of data products. Data domain users have the freedom to use any other compute capability on their domain, for example using AWS services not predefined on the blueprints or accessing the datasets from other analytics tools implemented in the consumption layer, as mentioned earlier in this post.

Data domain personas and roles

On NNEDH, the permission levels on data domains are managed through predefined personas, for example data owner, data stewards, developers, and readers. Each persona is associated with an IAM role that has a predefined permission level. These permissions are based on the typical needs of users on these roles. Nevertheless, to give more flexibility to data domains, these permissions can be customized and extended as needed.

The permissions associated with each persona are related only to actions allowed on the AWS account of the data domain. For the accountability on data assets, the data access to the assets is managed by specific resource policies instead of IAM roles. Only the owner of each dataset, or data stewards delegated by the owner, can grant or revoke data access.

On the dataset level, a required persona is the data owner. Typically, they work closely with one or more data stewards, who act as data product managers. The data steward is the data subject matter expert of the data product domain, responsible for interpreting collected data and metadata to derive deep business insights and build the product. The data steward acts as the bridge between business users and technical teams on each data domain.

Enterprise business data catalog

To enable freedom and make the organization data assets discoverable, a web-based portal data catalog is implemented. It indexes in a single repository metadata from datasets built on data domains, breaking data silos across the organization. The data catalog enables data search and discovery across different domains, as well as automation and governance on data sharing.

The business data catalog implements data governance processes within the organization. It ensures data ownership: someone in the organization is responsible for the data origin, definition, business attributes, relationships, and dependencies.

The central construct of a business data catalog is a dataset. It’s the search unit within the business catalog, having both technical and business metadata. To collect technical metadata from structured data, it relies on AWS Glue crawlers to recognize and extract data structures from the most popular data formats, including CSV, JSON, Avro, and Apache Parquet. It provides information such as data type, creation date, and format. The metadata can be enriched by business users by adding a description of the business context, tags, and data classification.
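As a rough sketch of how such crawler-based metadata collection could be wired up with Boto3 (the crawler name, role ARN, database, path, schedule, and schema-change policy below are illustrative assumptions, not NNEDH's actual configuration):

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Request payload for glue.create_crawler (Boto3) that catalogs one
    dataset prefix. All values here are illustrative placeholders."""
    return {
        "Name": name,
        "Role": role_arn,                 # IAM role the crawler assumes
        "DatabaseName": database,         # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": "cron(0 2 * * ? *)",  # re-crawl nightly for new files
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }

# Applying it would be: boto3.client("glue").create_crawler(**request)
request = build_crawler_request(
    "sales-dataset-crawler",
    "arn:aws:iam::111122223333:role/GlueCrawlerRole",
    "sales_domain",
    "s3://example-domain-bucket/datasets/sales/",
)
```

The crawler then exposes data type, creation date, and format in the Data Catalog, where business users can enrich the entries as described above.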

The dataset definition and related metadata are stored in an Amazon Aurora Serverless database and Amazon OpenSearch Service, enabling you to run textual queries on the data catalog.

Data sharing

NNEDH implements a data sharing workflow, enabling peer-to-peer data sharing across AWS accounts using Lake Formation. The workflow is as follows:

  1. A data consumer requests access to the dataset.
  2. The data owner grants access by approving the access request. They can delegate the approval of access requests to the data steward.
  3. Upon the approval of an access request, a new permission is added to the specific dataset in Lake Formation of the producer account.
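The approval in step 3 boils down to a Lake Formation grant in the producer account. A minimal sketch of the request payload that Boto3's lakeformation.grant_permissions would consume (the account ID, database, and table names are illustrative; in NNEDH this call is driven by the control plane, not made by hand):

```python
def build_grant_request(consumer_account_id, database, table,
                        permissions=("SELECT",)):
    """Request payload for lakeformation.grant_permissions that shares one
    table with a consumer account. Account IDs and names are illustrative."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
        # Empty: the consumer cannot re-grant the access onward.
        "PermissionsWithGrantOption": [],
    }

# Applying it: boto3.client("lakeformation").grant_permissions(**request)
request = build_grant_request("444455556666", "sales_domain", "orders")
```

Revoking access on a rejected or expired request is the symmetric revoke_permissions call with the same payload shape.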

The data sharing workflow is shown schematically in the following figure.

Data Sharing Workflow

Security and audit

The data in the Novo Nordisk data mesh lies in AWS accounts owned by Novo Nordisk business accounts. The configuration and the states of the data mesh are stored in Amazon Relational Database Service (Amazon RDS). The Novo Nordisk security architecture is shown in the following figure.

Novo Nordisk Distributed Security and Audit Architecture

Access and edits to the data in NNEDH need to be logged for audit purposes. We need to be able to tell who modified data, when the modification happened, and what modifications were applied. In addition, we need to be able to answer why the modification was allowed by that person at that time.

To meet these requirements, we use the following components:

  • CloudTrail to log API calls. We specifically enable CloudTrail data event logging for S3 buckets and objects. With this logging active, we can trace any modification of any file in the data lake back to the person who made it. We enforce the use of source identity for IAM role sessions to ensure user traceability.
  • Amazon RDS to store the configuration of the data mesh. We log queries against the RDS database. Together with CloudTrail, this log allows us to answer why a modification of a file in Amazon S3 by a specific person at a specific time was possible.
  • Amazon CloudWatch to log activities across the mesh.
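Enabling the S3 data events mentioned above is a single cloudtrail.put_event_selectors call. A hedged sketch of the selector payload (trail and bucket names are placeholders, not NNEDH's real resources):

```python
def build_s3_data_event_selector(bucket_names):
    """EventSelectors payload for cloudtrail.put_event_selectors that turns
    on S3 object-level (data event) logging for the given buckets."""
    return [{
        "ReadWriteType": "All",            # capture both reads and writes
        "IncludeManagementEvents": True,   # keep management events too
        "DataResources": [{
            "Type": "AWS::S3::Object",
            # A bucket ARN with a trailing slash matches every object in it.
            "Values": [f"arn:aws:s3:::{bucket}/" for bucket in bucket_names],
        }],
    }]

# Applying it: boto3.client("cloudtrail").put_event_selectors(
#     TrailName="example-audit-trail", EventSelectors=selectors)
selectors = build_s3_data_event_selector(["example-datalake-bucket"])
```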

In addition to those logging mechanisms, the S3 buckets are created using the following properties:

  • The bucket is encrypted using server-side encryption with AWS Key Management Service (AWS KMS) and customer managed keys
  • Amazon S3 versioning is activated by default
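The two bucket properties map directly onto the S3 API. A minimal sketch of the configuration payloads, assuming a customer managed KMS key (the ARN is a placeholder):

```python
def bucket_security_config(kms_key_arn):
    """Payloads for s3.put_bucket_encryption and s3.put_bucket_versioning
    matching the two bucket properties above. The key ARN is a placeholder."""
    encryption = {
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",      # SSE-KMS, not SSE-S3
                "KMSMasterKeyID": kms_key_arn,  # customer managed key
            },
            "BucketKeyEnabled": True,  # fewer KMS calls per object
        }]
    }
    versioning = {"Status": "Enabled"}
    return encryption, versioning

# With credentials:
# s3 = boto3.client("s3")
# s3.put_bucket_encryption(Bucket=name, ServerSideEncryptionConfiguration=encryption)
# s3.put_bucket_versioning(Bucket=name, VersioningConfiguration=versioning)
```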

Access to the data in NNEDH is controlled at the group level instead of for individual users. Each group corresponds to a group defined in the Novo Nordisk directory. To keep track of the person who modified the data in the data lakes, we use the source identity mechanism explained in the post How to relate IAM role activity to corporate identity.

Conclusion

In this post, we showed how Novo Nordisk built a modern data architecture to speed up the delivery of data-driven use cases. It includes a distributed data architecture, to scale the usage to petabyte scale for over 2,000 internal users throughout the value chain, as well as a distributed security and audit architecture handling data accountability and traceability on the environment to meet their compliance requirements.

The next post in this series describes the implementation of distributed data governance and control at scale of Novo Nordisk’s modern data architecture.


About the Authors

Jonatan Selsing is former research scientist with a PhD in astrophysics that has turned to the cloud. He is currently the Lead Cloud Engineer at Novo Nordisk, where he enables data and analytics workloads at scale. With an emphasis on reducing the total cost of ownership of cloud-based workloads, while giving full benefit of the advantages of cloud, he designs, builds, and maintains solutions that enable research for future medicines.

Hassen Riahi is a Sr. Data Architect at AWS Professional Services. He holds a PhD in Mathematics & Computer Science on large-scale data management. He works with AWS customers on building data-driven solutions.

Anwar Rizal is a Senior Machine Learning consultant based in Paris. He works with AWS customers to develop data and AI solutions to sustainably grow their business.

Moses Arthur comes from a mathematics and computational research background and holds a PhD in Computational Intelligence specialized in Graph Mining. He is currently a Cloud Product Engineer at Novo Nordisk building GxP-compliant enterprise data lakes and analytics platforms for Novo Nordisk global factories producing digitalized medical products.

Alessandro Fior is a Sr. Data Architect at AWS Professional Services. With over 10 years of experience delivering data and analytics solutions, he is passionate about designing and building modern and scalable data platforms that accelerate companies to get value from their data.

Kumari Ramar is an Agile certified and PMP certified Senior Engagement Manager at AWS Professional Services. She delivers data and AI/ML solutions that speed up cross-system analytics and machine learning models, which enable enterprises to make data-driven decisions and drive new innovations.

Build an AWS Lake Formation permissions inventory dashboard using AWS Glue and Amazon QuickSight

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/build-an-aws-lake-formation-permissions-inventory-dashboard-using-aws-glue-and-amazon-quicksight/

AWS Lake Formation is an integrated data lake service that makes it easy for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and machine learning (ML). Lake Formation provides a single place to define fine-grained access control on catalog resources. These permissions are granted to the principals by a data lake admin, and integrated engines like Amazon Athena, AWS Glue, Amazon EMR, and Amazon Redshift Spectrum enforce the access controls defined in Lake Formation. It also allows principals to securely share data catalog resources across multiple AWS accounts and AWS organizations through a centralized approach.

As organizations adopt Lake Formation for scaling their permissions, there is a steady increase in the number of access policies established and managed within the enterprise. However, it becomes increasingly difficult to analyze and understand these permissions for auditing. Therefore, customers are looking for a simple way to collect, analyze, and visualize permissions data so that they can inspect and validate the policies. It also enables organizations to take actions that help them meet compliance requirements.

This solution consolidates the Lake Formation permissions registered in a given AWS account and Region into a central inventory. It provides a high-level view of the various permissions that Lake Formation manages and aims to answer questions like:

  • Who has select access on a given table?
  • Which tables have the delete permission granted?
  • Which databases or tables does a given principal have select access to?

In this post, we walk through how to set up and collect the permissions granted on resources in a given account using the Lake Formation API. AWS Glue makes it straightforward to set up and run jobs for collecting the permission data and creating an external table on the collected data. We use Amazon QuickSight to create a permissions dashboard using an Athena data source and dataset.

Overview of solution

The following diagram illustrates the architecture of this solution.

In this solution, we walk through the following tasks:

  1. Create an AWS Glue job to collect and store permissions data, and create external tables using Boto3.
  2. Verify the external tables created using Athena.
  3. Sign up for a QuickSight Enterprise account and enable Athena access.
  4. Create a dataset using an Athena data source.
  5. Use the datasets for analysis.
  6. Publish the analyses as a QuickSight dashboard.

The collected JSON data is flattened and written into an Amazon Simple Storage Service (Amazon S3) bucket as Parquet files partitioned by account ID, date, and resource type. After the data is stored in Amazon S3, external tables are created on them and filters are added for different types of resource permissions. These datasets can be imported into SPICE, an in-memory query engine that is part of QuickSight, or queried directly from QuickSight to create analyses. Later, you can publish these analyses as a dashboard and share it with other users.
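The flattening step turns each nested list_permissions entry into one row per granted permission. A hedged sketch of what that transformation could look like (the column names are this sketch's choice and may differ from the downloadable job script):

```python
def flatten_permission(entry, account_id):
    """Flatten one entry from lakeformation.list_permissions into rows
    suitable for a partitioned Parquet file."""
    resource = entry["Resource"]
    if "Database" in resource:
        lftype, db, table = "DATABASE", resource["Database"]["Name"], None
    elif "Table" in resource:
        lftype = "TABLE"
        db = resource["Table"]["DatabaseName"]
        table = resource["Table"].get("Name")
    elif "DataLocation" in resource:
        lftype, db, table = "DATA_LOCATION", None, None
    else:
        lftype, db, table = "OTHER", None, None
    # One output row per granted permission keeps the table easy to filter.
    return [{
        "catalog_id": account_id,
        "lftype": lftype,
        "principal": entry["Principal"]["DataLakePrincipalIdentifier"],
        "databasename": db,
        "tablename": table,
        "permission": perm,
    } for perm in entry["Permissions"]]
```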

Dashboards are created for the following use cases:

  • Database permissions
  • Table permissions
  • Principal permissions

Prerequisites

You should have the following prerequisites:

  • An S3 bucket to store the permissions inventory data
  • An AWS Glue database for permissions inventory metadata
  • An AWS Identity and Access Management (IAM) role for the AWS Glue job with access to the inventory AWS Glue database and S3 bucket and added as a data lake admin
  • A QuickSight account with access to Athena
  • An IAM role for QuickSight with access to the inventory AWS Glue database and S3 bucket

Set up and run the AWS Glue job

We create an AWS Glue job to collect Lake Formation permissions data for the account and Region provided as job parameters; the collected data is flattened before storage. Data is partitioned by account ID, date, and permissions type, and is stored as Parquet in an S3 bucket using Boto3. We create external tables on the data and add filters for different types of resource permissions.
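At its core, the collection step pages through the Lake Formation ListPermissions API. A minimal sketch of that loop (the client is passed in so the function can be exercised with a stub instead of a live account; this is not the exact code of the downloadable script):

```python
def collect_permissions(lf_client, catalog_id):
    """Page through lakeformation.list_permissions and gather every
    PrincipalResourcePermissions entry."""
    entries, token = [], None
    while True:
        kwargs = {"CatalogId": catalog_id, "MaxResults": 100}
        if token:
            kwargs["NextToken"] = token  # continue from the previous page
        page = lf_client.list_permissions(**kwargs)
        entries.extend(page.get("PrincipalResourcePermissions", []))
        token = page.get("NextToken")
        if not token:
            return entries

# With credentials:
# collect_permissions(boto3.client("lakeformation"), "111122223333")
```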

To create the AWS Glue job, complete the following steps:

  1. Download the Python script file to local.
  2. On the AWS Glue console, under Data Integration and ETL in the navigation pane, choose Jobs.
  3. Under Create job, select Python Shell script editor.
  4. For Options, select Upload and edit an existing script.
  5. For File upload, choose Choose file.
  6. Choose the downloaded file (lf-permissions-inventory.py).
  7. Choose Create.

GlueJob

  1. After the job is created, enter a name for the job (for this post, lf-inventory-builder) and choose Save.

Glue Job save

  1. Choose the Job details tab.
  2. For Name, enter a name for the job.
  3. For IAM Role, choose an IAM role that has access to the inventory S3 bucket and inventory schema and is registered as a data lake admin.
  4. For Type, choose Python Shell.
  5. For Python version, choose Python 3.9.

Glue Job Details

  1. You can leave the other property values at their default.
  2. Under Advanced properties, configure the following job parameters and values:
    1. catalog-id – the AWS account ID whose permissions data is collected.
    2. databasename – the AWS Glue database where the inventory-related schema objects are created.
    3. region – the Region where the job is configured and whose permissions data is collected.
    4. s3bucket – the S3 bucket where the collected permissions data is written.
    5. createtable – set to yes to enable external table creation on the data.

Job Parameters

  1. Choose Save to save the job settings.

Glue Job Save

  1. Choose Run to start the job.

When the job is complete, the run status changes to Succeeded. You can view the log messages in Amazon CloudWatch Logs.

Job Run

Permissions data is collected and stored in the S3 bucket (under lfpermissions-data) that you provided in the job parameters.

S3 Structure

The following external tables are created on the permissions data and can be queried using Athena:

  • lfpermissions – A summary of resource permissions
  • lfpermissionswithgrant – A summary of grantable resource permissions

For both tables, the schema structure is the same and the lftype column indicates what type of permissions the row applies to.

Athena Table Schema

Verify the tables using Athena

You can use Athena to verify the data using the following queries.

For more information, refer to Running SQL queries using Amazon Athena.

  • List the database permissions:
Select * from lfpermissions where lftype='DATABASE'
  • List the table permissions:
Select * from lfpermissions where lftype='TABLE'
  • List the data lake permissions:
Select * from lfpermissions where lftype='DATA_LOCATION'
  • List the grantable database permissions:
Select * from lfpermissionswithgrant where lftype='DATABASE'
  • List the grantable table permissions:
Select * from lfpermissionswithgrant where lftype='TABLE'
  • List the grantable data lake permissions:
Select * from lfpermissionswithgrant where lftype='DATA_LOCATION'
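Since the six queries differ only in the table name and the lftype value, they can be generated with a small helper (a convenience sketch, not part of the original solution):

```python
def lf_query(lftype, with_grant=False):
    """Build one of the six verification queries for Athena."""
    table = "lfpermissionswithgrant" if with_grant else "lfpermissions"
    return f"Select * from {table} where lftype='{lftype}'"

# e.g. run via boto3.client("athena").start_query_execution(
#     QueryString=lf_query("TABLE"), ...)
```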

As the next step, we create a QuickSight dashboard with three sheets, each focused on different sets of permissions (database, table, principal) to slice and dice the data.

Sign up for a QuickSight account

If you haven’t signed up for QuickSight, complete the following steps:

  1. Sign in to the AWS Management Console as Admin, search for QuickSight and choose Sign up for QuickSight.

QuickSight signup

  1. For Edition, select Enterprise.
  2. Choose Continue.
  3. For Authentication method, select Use IAM federated identities & QuickSight-managed users.
  4. Under QuickSight Region, choose the same Region as your inventory S3 bucket.
  5. Under Account info, enter a QuickSight account name and email address for notification.

QuickSight Form

  1. In the Quick access to AWS services section, for IAM Role, select Use QuickSight-managed role (default).
  2. Allow access to IAM, Athena, and Amazon S3.
  3. Specify the S3 bucket that contains the permissions data.
  4. Choose Finish to complete the signup process.

QuickSight configuration

Note: If the inventory bucket and database are managed by Lake Formation, grant database and table access to the created QuickSight IAM role. For instructions, refer to Granting and revoking permissions on Data Catalog resources.

Configure your dataset in QuickSight

QuickSight is configured with an Athena data source in the same Region as the S3 bucket. To set up your dataset, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.

Quicksight DataSet

  1. Choose Athena as your data source.

QuickSight Datasource

  1. Enter LF_DASHBOARD_DS as the name of your data source.
  2. Choose Create data source.
  3. For Catalog, leave it as AwsDataCatalog.
  4. For Database, choose the database name provided as a parameter to the job.
  5. For Tables, select lfpermissions.
  6. Choose Select.

QuickSight Catalog Info

  1. Select Directly query your data and choose Visualize to take you to the analysis.

Quicksight data mode

Create analyses

We create three sheets for our dashboard to view different levels of permissions.

Sheet 1: Database permission view

To view database permissions, complete the following steps:

  1. On the QuickSight console, choose the plus sign to create a new sheet.
  2. Choose Add, then choose Add title.

QuickSight Title

  1. Name the sheet Database Permissions.
  2. On the Add menu, choose Add parameter.
  3. Enter a name for the parameter.
  4. Leave the other values as default and choose Create.
  5. Repeat steps 2-4 to add the following parameters:
    • catalogid
    • databasename
    • permission
    • tablename
  6. Choose Insights in the navigation pane, then choose Add control.

QuickSight Control

  1. Add a control for each parameter:
    1. For each parameter, for Style, choose List, and for Values, select Link to a dataset field.
      QuickSight Dependency
    2. Provide additional information for each parameter according to the following table:
      • catalogid – display name AccountID, linked to the lfpermissions field catalog_id
      • databasename – display name DatabaseName, linked to the lfpermissions field databasename
      • permission – display name Permission, linked to the lfpermissions field permission
  1. To add a control dependency, on the Database control, choose the options menu and choose Edit.

QuickSight Dependency

  1. Under Format control, choose Control options.
  2. For the relevant values, choose AccountID, and choose Update.
  3. Similarly, under the Permission control, choose Control options.
  4. For the relevant values, choose AccountID, and choose Update.

We create two visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id and databasename into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop principal into Values and change the aggregation to Count distinct.

QuickSight Database View1

  1. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  2. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  3. Add a filter on databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  4. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.
  5. Choose Actions in the navigation pane.
  6. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check principal.

Now we add our second visual.

  1. Add a second visual and choose the table visual type.
  2. Drag and drop principal to Group by.
  3. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  4. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  5. Add a filter on databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  6. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.

Now the Database and Permission drop-down menus are populated based on the relevant attributes and change dynamically.

QuickSight Database View2

Sheet 2: Table permission view

Now that we have created the database permissions sheet, we can add a table permissions sheet.

  1. Choose the plus sign to add a new sheet.
  2. On the QuickSight console, choose Add, then choose Add title.
  3. Name the sheet Table Permissions.
  4. Choose Insights in the navigation pane, then choose Add control.
  5. Add a control for each parameter:
    1. For each parameter, for Style, choose List, and for Values, select Link to a dataset field.
    2. Provide the additional information for each parameter according to the following table:
      • catalogid – display name AccountID, linked to the lfpermissions field catalog_id
      • databasename – display name DatabaseName, linked to the lfpermissions field databasename
      • permission – display name Permission, linked to the lfpermissions field permission
      • tablename – display name TableName, linked to the lfpermissions field tablename

We create two visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id, databasename, and tablename into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop principal into Values and change the aggregation to Count distinct.
  5. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  6. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  7. Add a filter on the databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  8. Add a filter on tablename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose tablename.
  9. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.
  10. Choose Actions in the navigation pane.
  11. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check principal.

Now we add our second visual.

  1. Add a second visual and choose the table visual type.
  2. Drag and drop principal to Group by.
  3. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  4. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  5. Add a filter on the databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  6. Add a filter on tablename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose tablename.
  7. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.

Now the Databasename, Tablename, and Permission drop-down menus are populated based on the relevant attributes.

QuickSight Table Permissions

Sheet 3: Principal permission view

Now we add a third sheet for principal permissions.

  1. Choose the plus sign to add a new sheet.
  2. On the QuickSight console, choose Add, then choose Add title.
  3. Name the sheet Principal Permissions.
  4. Choose Insights in the navigation pane, then choose Add control.
  5. Add a control for the catalogid parameter:
    1. For Style, choose List, and for Values, select Link to a dataset field.
    2. Provide the additional information for the parameter according to the following table:
      • catalogid – display name AccountID, linked to the lfpermissions field catalog_id

We create four visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id and principal into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop databasename into Values and change the aggregation to Count distinct.
  5. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  6. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  7. Choose Actions in the navigation pane.
  8. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check Databasename.
  9. For the second visual, choose Visualize and choose table as the visual type.
  10. Drag and drop databasename into Group by.
  11. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  12. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  13. For the third visual, choose Visualize and choose pivot table as the visual type.
  14. Drag and drop catalog_id and principal into Rows.
  15. Drag and drop permission into Column.
  16. Drag and drop tablename into Values and change the aggregation to Count distinct.
  17. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  18. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  19. Choose Actions in the navigation pane.
  20. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check Tablename.
  21. For the final visual, choose Visualize and choose table as the visual type.
  22. Drag and drop tablename into Group by.
  23. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  24. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.

The following screenshot shows our sheet.

QuickSight Principal View

Create a dashboard

Now that the analysis is ready, you can publish it as a dashboard and share it with other users. For instructions, refer to the tutorial Create an Amazon QuickSight dashboard.

Clean up

To clean up the resources created in this post, complete the following steps:

  1. Delete the AWS Glue job lf-inventory-builder.
  2. Delete the data stored under the bucket provided as the value of the s3bucket job parameter.
  3. Drop the external tables created under the schema provided as the value of the databasename job parameter.
  4. If you signed up for QuickSight to follow along with this post, you can delete the account.
  5. For an existing QuickSight account, delete the following resources:
    1. lfpermissions dataset
    2. lfpermissions analysis
    3. lfpermissions dashboard

Conclusion

In this post, we provided the design and implementation steps for a solution to collect Lake Formation permissions in a given Region of an account and consolidate them for analysis. We also walked through the steps to create a dashboard using Amazon QuickSight. You can use other QuickSight visuals to create more sophisticated dashboards based on your requirements.

You can also expand this solution to consolidate permissions for a multi-account setup. You can use a shared bucket across organizations and accounts and configure an AWS Glue job in each account or organization to write their permission data. With this solution, you can maintain a unified dashboard view of all the Lake Formation permissions within your organization, thereby providing a central audit mechanism to comply with business requirements.
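In a multi-account setup, each account's job would write to the shared bucket under its own partition prefix. A hedged sketch of what the key layout could look like (the prefix and partition-column names are this sketch's assumptions, matching the account ID / date / resource type partitioning described earlier):

```python
from datetime import date

def inventory_key(bucket, account_id, lftype, run_date=None,
                  filename="part-0000.parquet"):
    """S3 location for one inventory partition. Hive-style key=value path
    segments let Athena prune partitions when querying the shared bucket."""
    d = run_date or date.today()
    return (f"s3://{bucket}/lfpermissions-data/"
            f"accountid={account_id}/date={d.isoformat()}/"
            f"lftype={lftype}/{filename}")
```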

Thanks for reading this post! If you have any comments or questions, please don’t hesitate to leave them in the comments section.


About the Author

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/centrally-manage-access-and-permissions-for-amazon-redshift-data-sharing-with-aws-lake-formation/

Today’s global, data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Amazon Redshift data warehouse with another Amazon Redshift data warehouse within the same AWS account, across accounts, and across Regions, without needing to copy or move data from one cluster to another.

Some customers share their data with 50–100 data warehouses in different accounts and do a lot of cross-sharing, making it difficult to track who is accessing what data. They have to navigate to an individual account’s Amazon Redshift console to retrieve the access information. Also, many customers have their data lake on Amazon Simple Storage Service (Amazon S3), which is shared within and across various business units. As the organization grows and democratizes the data, administrators want the ability to manage the datashare centrally for governance and auditing, and to enforce fine-grained access control.

Working backward from customer requests, we are announcing the preview of a new feature: Amazon Redshift data sharing integration with AWS Lake Formation, which enables Amazon Redshift customers to centrally manage access to their Amazon Redshift datashares using Lake Formation.

Lake Formation has been a popular choice for centrally governing data lakes backed by Amazon S3. Now, with Lake Formation support for Amazon Redshift data sharing, it opens up new design patterns, and broadens governance and security posture across data warehouses. With this integration, you can use Lake Formation to define fine-grained access control on tables and views being shared with Amazon Redshift data sharing for federated AWS Identity and Access Management (IAM) users and IAM roles.

Customers are using the data mesh approach, which provides a mechanism to share data across business units. Customers are also using a modern data architecture to share data from data lake stores and Amazon Redshift purpose-built data stores across business units. Lake Formation provides the ability to enforce data governance within and across business units, which enables secure data access and sharing, easy data discovery, and centralized audit for data access.

United Airlines is in the business of connecting people and uniting the world.

“As a data-driven enterprise, United is trying to create a unified data and analytics experience for our analytics community that will innovate and build modern data-driven applications. We believe we can achieve this by building a purpose-built data mesh architecture using a variety of AWS services like Athena, Aurora, Amazon Redshift, and Lake Formation to simplify management and governance around granular data access and collaboration.”

– Ashok Srinivas, Director of ML Engineering, and Sarang Bapat, Director of Data Engineering, United Airlines

In this post, we show how to centrally manage access and permissions for Amazon Redshift data sharing with Lake Formation.

Solution overview

In this solution, we demonstrate how integration of Amazon Redshift data sharing with Lake Formation for data governance can help you build data domains, and how you can use the data mesh approach to bring data domains together to enable data sharing and federation across business units. The following diagram illustrates our solution architecture.

solution architecture

The data mesh is a decentralized, domain-oriented architecture that emphasizes separating data producers from data consumers via a centralized, federated Data Catalog. Typically, the producers and consumers run within their own account. The details of these data mesh characteristics are as follows:

  • Data producers – Data producers own their data products and are responsible for building their data, maintaining its accuracy, and keeping their data product up to date. They determine what datasets can be published for consumption and share their datasets by registering them with the centralized data catalog in a central governance account. You might have a producer steward or administrator persona for managing the data products with the central governance steward or administrators team.
  • Central governance account – Lake Formation enables fine-grained access management on the shared dataset. The centralized Data Catalog offers consumers the ability to quickly find shared datasets, allows administrators to centrally manage access permissions on shared datasets, and provides security teams the ability to audit and track data product usage across business units.
  • Data consumers – The data consumer obtains access to shared resources from the central governance account. These resources are available inside the consumer’s AWS Glue Data Catalog, allowing fine-grained access on the database and table that can be managed by the consumer’s data stewards and administrators.

The following steps provide an overview of how Amazon Redshift data sharing can be governed and managed by Lake Formation in the central governance pattern of a data mesh architecture:

  1. In the producer account, data objects are created and maintained in the Amazon Redshift producer cluster. A data warehouse admin creates the Amazon Redshift datashare and adds datasets (tables, views) to the share.
  2. The data warehouse admin grants and authorizes access on the datashare to the central governance account’s Data Catalog.
  3. In the central governance account, the data lake admin accepts the datashare and creates the AWS Glue database that points to the Amazon Redshift datashare so that Lake Formation can manage it.
  4. The data lake admin shares the AWS Glue database and tables to the consumer account using Lake Formation cross-account sharing.
  5. In the consumer account, the data lake admin accepts the resource share invitation via AWS Resource Access Manager (AWS RAM) and can view the database listed in the account.
  6. The data lake admin defines the fine-grained access control and grants permissions on databases and tables to IAM users (for this post, consumer1 and consumer2) in the account.
  7. In the Amazon Redshift cluster, the data warehouse admin creates an Amazon Redshift database that points to the Glue database and authorizes usage on the created Amazon Redshift database to the IAM users.
  8. The data analyst as an IAM user can now use their preferred tools like the Amazon Redshift query editor to access the dataset based on the Lake Formation fine-grained permissions.

We use the following account setup for our example in this post:

  • Producer account – 123456789012
  • Central account – 112233445566
  • Consumer account – 665544332211

Prerequisites

Create the Amazon Redshift data share and add datasets

In the data producer account, create an Amazon Redshift cluster using the RA3 node type with encryption enabled. Complete the following steps:

  1. On the Amazon Redshift console, create a cluster subnet group.

For more information, refer to Managing cluster subnet groups using the console.

  2. Choose Create cluster.
  3. For Cluster identifier, provide the cluster name of your choice.
  4. For Preview track, choose preview_2022.

  5. For Node type, choose one of the RA3 node types.

This feature is only supported on the RA3 node type.

  6. For Number of nodes, enter the number of nodes that you need for your cluster.
  7. Under Database configurations, choose the admin user name and admin user password.
  8. Under Cluster permissions, you can select the IAM role and set it as the default.

For more information about the default IAM role, refer to Creating an IAM role as default for Amazon Redshift.

cluster permissions

  9. Turn off the Use defaults option next to Additional configurations to modify the default settings.
  10. Under Network and security, specify the following:
    1. For Virtual private cloud (VPC), choose the VPC you would like to deploy the cluster in.
    2. For VPC security groups, either leave as default or add the security groups of your choice.
    3. For Cluster subnet group, choose the cluster subnet group you created.

additional configurations

  11. Under Database configuration, in the Encryption section, select Use AWS Key Management Service (AWS KMS) or Use a hardware security module (HSM).

Encryption is disabled by default.

  12. For Choose an AWS KMS key, you can either choose an existing AWS Key Management Service (AWS KMS) key, or choose Create an AWS KMS key to create a new key.

For more information, refer to Creating keys.

database configurations

  13. Choose Create cluster.
  14. For this post, create tables and load data into the producer Amazon Redshift cluster using the following script.

Authorize the datashare

Install or update the latest AWS Command Line Interface (AWS CLI) version to run the AWS CLI to authorize the datashare. For instructions, refer to Installing or updating the latest version of the AWS CLI.
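Once the AWS CLI is configured, the authorization call takes the datashare ARN and a Data Catalog consumer identifier. As a sketch, the request parameters can be assembled like this; the ARN uses this post's placeholder namespace, and `build_authorize_request` is a hypothetical helper name, not part of any AWS SDK.

```python
# Sketch: parameters for `aws redshift authorize-data-share`, assembled in
# Python. Account IDs and the namespace placeholder come from this post.

def build_authorize_request(data_share_arn: str, consumer_account_id: str) -> dict:
    """Request parameters for authorizing a datashare to a Data Catalog."""
    return {
        "DataShareArn": data_share_arn,
        # A Data Catalog consumer is identified as DataCatalog/<account-id>
        "ConsumerIdentifier": f"DataCatalog/{consumer_account_id}",
    }

req = build_authorize_request(
    "arn:aws:redshift:us-east-1:123456789012:"
    "datashare:<producer-cluster-namespace>/demotahoeds",
    "112233445566",  # central governance account from this post
)
print(req["ConsumerIdentifier"])  # DataCatalog/112233445566
```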

Set up Lake Formation permissions

To use the AWS Glue Data Catalog in Lake Formation, complete the following steps in the central governance account to update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control:

  1. Sign in to the Lake Formation console as admin.
  2. In the navigation pane, under Data catalog, choose Settings.
  3. Deselect Use only IAM access control for new databases.
  4. Deselect Use only IAM access control for new tables in new databases.
  5. Choose Version 2 for Cross account version settings.
  6. Choose Save.

data catalog settings

Set up an IAM user as a data lake administrator

If you’re using an existing data lake administrator user or role, add the following managed policies if they’re not already attached, and skip the remaining setup steps:

  • AWSGlueServiceRole
  • AmazonRedshiftFullAccess

Otherwise, to set up an IAM user as a data lake administrator, complete the following steps:

  1. On the IAM console, choose Users in the navigation pane.
  2. Select the IAM user who you want to designate as the data lake administrator.
  3. Choose Add an inline policy on the Permissions tab.
  4. Replace <AccountID> with your own account ID and add the following policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:AWSServiceName": "lakeformation.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": ["iam:PutRolePolicy"],
            "Resource": "arn:aws:iam::<AccountID>:role/aws-service-role/lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ram:AcceptResourceShareInvitation",
                "ram:RejectResourceShareInvitation",
                "ec2:DescribeAvailabilityZones",
                "ram:EnableSharingWithAwsOrganization"
            ],
            "Resource": "*"
        }
    ]
}
  5. Provide a policy name.
  6. Review and save your settings.
  7. Choose Add permissions, and choose Attach existing policies directly.
  8. Add the following policies:
    1. AWSLakeFormationCrossAccountManager
    2. AWSGlueConsoleFullAccess
    3. AWSGlueServiceRole
    4. AWSLakeFormationDataAdmin
    5. AWSCloudShellFullAccess
    6. AmazonRedshiftFullAccess
  9. Choose Next: Review and add permissions.
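As a quick sanity check before attaching it, the inline policy above can be validated programmatically. This is a sketch that round-trips the policy document and confirms it carries the actions the data lake administrator needs; `<AccountID>` remains the placeholder from the policy.

```python
import json

# Sketch: sanity-check the data lake administrator inline policy before
# attaching it. <AccountID> is the placeholder used throughout this post.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Condition": {"StringEquals": {
                "iam:AWSServiceName": "lakeformation.amazonaws.com"}},
        },
        {
            "Effect": "Allow",
            "Action": ["iam:PutRolePolicy"],
            "Resource": ("arn:aws:iam::<AccountID>:role/aws-service-role/"
                         "lakeformation.amazonaws.com/"
                         "AWSServiceRoleForLakeFormationDataAccess"),
        },
        {
            "Effect": "Allow",
            "Action": [
                "ram:AcceptResourceShareInvitation",
                "ram:RejectResourceShareInvitation",
                "ec2:DescribeAvailabilityZones",
                "ram:EnableSharingWithAwsOrganization",
            ],
            "Resource": "*",
        },
    ],
}

# The policy must round-trip as valid JSON, and it must allow the RAM
# accept action the admin needs to receive cross-account shares.
doc = json.loads(json.dumps(POLICY))
actions = [a for s in doc["Statement"] for a in
           (s["Action"] if isinstance(s["Action"], list) else [s["Action"]])]
assert "ram:AcceptResourceShareInvitation" in actions
print(len(doc["Statement"]))  # 3
```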

Data consumer account setup

In the consumer account, follow the steps mentioned previously in the central governance account to set up Lake Formation and a data lake administrator.

  1. In the data consumer account, create an Amazon Redshift cluster using the RA3 node type with encryption (refer to the steps demonstrated to create an Amazon Redshift cluster in the producer account).
  2. Choose Launch stack to deploy an AWS CloudFormation template to create two IAM users with policies.

launch stack

The stack creates the following users under the data analyst persona:

  • consumer1
  • consumer2
  3. After the CloudFormation stack is created, navigate to the Outputs tab of the stack.
  4. Capture the ConsoleIAMLoginURL and LFUsersCredentials values.

createiamusers

  5. Choose the LFUsersCredentials value to navigate to the AWS Secrets Manager console.
  6. In the Secret value section, choose Retrieve secret value.

secret value

  7. Capture the secret value for the password.

Both consumer1 and consumer2 need to use this same password to log in to the AWS Management Console.

secret value

Configure an Amazon Redshift datashare using Lake Formation

Producer account

Create a datashare using the console

Complete the following steps to create an Amazon Redshift datashare in the data producer account and share it with Lake Formation in the central account:

  1. On the Amazon Redshift console, choose the cluster to create the datashare.
  2. On the cluster details page, navigate to the Datashares tab.
  3. Under Datashares created in my namespace, choose Connect to database.

connect to database

  4. Choose Create datashare.

create datashare

  5. For Datashare type, choose Datashare.
  6. For Datashare name, enter the name (for this post, demotahoeds).
  7. For Database name, choose the database from where to add datashare objects (for this post, dev).
  8. For Publicly accessible, choose Turn off (or choose Turn on to share the datashare with clusters that are publicly accessible).

datashare information

  9. Under Datashare objects, choose Add to add the schema to the datashare (for this post, the public schema).
  10. Under Tables and views, choose Add to add the tables and views to the datashare (for this post, we add the table customer and the view customer_view).

datashare objects

  11. Under Data consumers, choose Publish to AWS Data Catalog.
  12. For Publish to the following accounts, choose Other AWS accounts and provide the AWS account ID of the consumer account. For this post, we provide the AWS account ID of the Lake Formation central governance account. (To share within the same account, choose Local account instead.)
  13. Choose Create datashare.

data consumers

  14. After the datashare is created, you can verify it by returning to the Datashares tab and entering the datashare name in the search bar under Datashares created in my namespace.
  15. Choose the datashare name to view its details.
  16. Under Data consumers, you will see the consumer status of the consumer data catalog account as Pending Authorization.

data consumers

  17. Select the checkbox next to the consumer data catalog to enable the Authorize option.

authorize

  18. Choose Authorize to grant the consumer account's data catalog access to the datashare. The consumer status changes to Authorized.

authorized

Create a datashare using a SQL command

Complete the following steps to create a datashare in the data producer account and share it with Lake Formation in the central account:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor V2.
  2. Choose (right-click) the cluster name and choose Edit connection or Create connection.
  3. For Authentication, choose Temporary credentials.

Refer to Connecting to an Amazon Redshift database to learn more about the various authentication methods.

  4. For Database, enter a database name (for this post, dev).
  5. For Database user, enter the user authorized to access the database (for this post, awsuser).
  6. Choose Save to connect to the database.

Connecting to an Amazon Redshift database

  7. Run the following SQL commands to create the datashare and add the data objects to be shared:
CREATE DATASHARE demotahoeds;
ALTER DATASHARE demotahoeds ADD SCHEMA public;
ALTER DATASHARE demotahoeds ADD TABLE customer;
ALTER DATASHARE demotahoeds ADD TABLE customer_view;
  8. Run the following SQL command to share the producer datashare with the central governance account:
GRANT USAGE ON DATASHARE demotahoeds TO ACCOUNT '<central-aws-account-id>' VIA DATA CATALOG;

Run the following SQL command

  9. You can verify the created datashare and its shared objects by running the following SQL command:
DESC DATASHARE demotahoeds;

DESC DATASHARE demotahoeds

  10. Run the following command using the AWS CLI to authorize the datashare to the central Data Catalog so that Lake Formation can manage it:
aws redshift authorize-data-share \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-identifier DataCatalog/<central-aws-account-id>

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [{
        "ConsumerIdentifier": "DataCatalog/XXXXXXXXXXXX",
        "Status": "AUTHORIZED",
        "CreatedDate": "2022-11-09T21:10:30.507000+00:00",
        "StatusChangeDate": "2022-11-09T21:10:50.932000+00:00"
    }]
}

You can verify the datashare status on the console by following the steps outlined in the previous section.
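If you script the authorization, you can also confirm the association reached the AUTHORIZED state by inspecting the same fields shown in the example output above. The following is a sketch using a trimmed copy of that sample response; `is_authorized` is a hypothetical helper name.

```python
import json

# Sketch: check a response shaped like the authorize-data-share sample output.
sample = json.loads("""{
    "DataShareArn": "arn:aws:redshift:us-east-1:123456789012:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [{
        "ConsumerIdentifier": "DataCatalog/112233445566",
        "Status": "AUTHORIZED"
    }]
}""")

def is_authorized(resp: dict, consumer: str) -> bool:
    """True when the given consumer's association is AUTHORIZED."""
    return any(a["ConsumerIdentifier"] == consumer and a["Status"] == "AUTHORIZED"
               for a in resp.get("DataShareAssociations", []))

print(is_authorized(sample, "DataCatalog/112233445566"))  # True
```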

Central catalog account

The data lake admin accepts and registers the datashare with Lake Formation in the central governance account and creates an AWS Glue database for it. Complete the following steps:

  1. Sign in to the console as the data lake administrator IAM user or role.
  2. If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
  3. Under Data catalog in the navigation pane, choose Data sharing and view the Amazon Redshift datashare invitations on the Configuration tab.
  4. Select the datashare and choose Review Invitation.

AWS Lake Formation data sharing

A window pops up with the details of the invitation.

  5. Choose Accept to register the Amazon Redshift datashare to the AWS Glue Data Catalog.

accept reject invitation

  6. Provide a name for the AWS Glue database and choose Skip to Review and create.

Skip to Review and create

  7. Review the content and choose Create database.

create database

After the AWS Glue database is created on the Amazon Redshift datashare, you can view them under Shared Databases.

Shared Databases.

You can also use the AWS CLI to register the datashare and create the database. Use the following commands:

  1. Describe the Amazon Redshift datashare that is shared with the central account:
aws redshift describe-data-shares
  2. Accept and associate the Amazon Redshift datashare to the Data Catalog:
aws redshift associate-data-share-consumer \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-arn arn:aws:glue:<central-region>:<central-aws-account-id>:catalog

The following is an example output:

{
    "DataShareArn": "arn:aws:redshift:us-east-1:123456789012:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:123456789012:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [
        {
            "ConsumerIdentifier": "arn:aws:glue:us-east-1:112233445566:catalog",
            "Status": "ACTIVE",
            "ConsumerRegion": "us-east-1",
            "CreatedDate": "2022-11-09T23:25:22.378000+00:00",
            "StatusChangeDate": "2022-11-09T23:25:22.378000+00:00"
        }
    ]
}
  3. Register the Amazon Redshift datashare in Lake Formation:
aws lakeformation register-resource \
--resource-arn arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds
  4. Create the AWS Glue database that points to the accepted Amazon Redshift datashare:
aws glue create-database --region <central-catalog-region> --cli-input-json '{
    "CatalogId": "<central-aws-account-id>",
    "DatabaseInput": {
        "Name": "demotahoedb",
        "FederatedDatabase": {
            "Identifier": "arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds",
            "ConnectionName": "aws:redshift"
        }
    }
}'
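The essential part of the create-database call is the FederatedDatabase block, which ties the new AWS Glue database to the accepted datashare through the built-in aws:redshift connection. The following is a sketch of the same request assembled in Python; account, Region, and namespace values remain this post's placeholders, and `federated_database_input` is a hypothetical helper name.

```python
# Sketch: the create-database input used above, assembled in Python.
# Region, account IDs, and the namespace remain the post's placeholders.

def federated_database_input(name: str, datashare_arn: str) -> dict:
    """DatabaseInput pointing an AWS Glue database at a Redshift datashare."""
    return {
        "Name": name,
        "FederatedDatabase": {
            "Identifier": datashare_arn,        # the accepted datashare's ARN
            "ConnectionName": "aws:redshift",   # built-in federation connection
        },
    }

payload = {
    "CatalogId": "<central-aws-account-id>",
    "DatabaseInput": federated_database_input(
        "demotahoedb",
        "arn:aws:redshift:<producer-region>:<producer-aws-account-id>:"
        "datashare:<producer-cluster-namespace>/demotahoeds",
    ),
}
print(payload["DatabaseInput"]["FederatedDatabase"]["ConnectionName"])  # aws:redshift
```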

Now the data lake administrator of the central governance account can view and share access on both the database and tables to the data consumer account using the Lake Formation cross-account sharing feature.

Grant datashare access to the data consumer

To grant the data consumer account permissions on the shared AWS Glue database, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the data consumer account ID (for this post, 665544332211).
  5. Under LF_Tags or catalog resources, select Named data catalog resources.
  6. For Databases, choose the database demotahoedb.
  7. Select Describe for both Database permissions and Grantable permissions.
  8. Choose Grant to apply the permissions.

grant data permissions

To grant the data consumer account permissions on tables, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the consumer account (for this post, we use 665544332211).
  5. Under LF-Tags or catalog resources, select Named data catalog resources.
  6. For Databases, choose the database demotahoedb.
  7. For Tables, choose All tables.
  8. Select Describe and Select for both Table permissions and Grantable permissions.
  9. Choose Grant to apply the changes.

grant the data consumer account permissions on tables
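In API terms, the console grant above corresponds to a Lake Formation GrantPermissions request with a TableWildcard resource standing in for "All tables". The following is a sketch of that request shape using this post's account IDs, not a drop-in implementation.

```python
# Sketch: GrantPermissions request granting DESCRIBE/SELECT on all tables of
# demotahoedb to the consumer account, with grant option, as done above.
GRANT_ALL_TABLES = {
    "Principal": {"DataLakePrincipalIdentifier": "665544332211"},  # consumer
    "Resource": {
        "Table": {
            "CatalogId": "112233445566",   # central governance account
            "DatabaseName": "demotahoedb",
            "TableWildcard": {},           # corresponds to "All tables"
        }
    },
    "Permissions": ["DESCRIBE", "SELECT"],
    "PermissionsWithGrantOption": ["DESCRIBE", "SELECT"],
}
print(sorted(GRANT_ALL_TABLES["Permissions"]))  # ['DESCRIBE', 'SELECT']
```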

Consumer account

The consumer admin will receive the shared resources from the central governance account and delegate access to other users in the consumer account as shown in the following table.

IAM User    Object Access           Object Type    Access Level
consumer1   public.customer         Table          All
consumer2   public.customer_view    View           Specific columns: c_customer_id, c_birth_country, cd_gender, cd_marital_status, cd_education_status

In the data consumer account, follow these steps to accept the resources shared with the account:

  1. Sign in to the console as the data lake administrator IAM user or role.
  2. If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
  3. Sign in to the AWS RAM console.
  4. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations. You will receive two invitations.

Resource shares to view the pending invitations

  5. Choose the pending invitations and accept the resource share.

Choose the pending invitation and accept the resource share

  6. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases to view the cross-account shared database.

choose Databases to view the cross-account shared database

Grant access to the data analyst and IAM users using Lake Formation

Now the data lake admin in the data consumer account can delegate permissions on the shared database and tables to users in the consumer account.

Grant database permissions to consumer1 and consumer2

To grant the IAM users consumer1 and consumer2 database permissions, follow these steps:

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.

choose Grant database

  3. Under Principals, select IAM users and roles.
  4. Choose the IAM users consumer1 and consumer2.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. Select Describe for Database permissions.
  7. Choose Grant to apply the permissions.

Choose Grant to apply the permissions

Grant table permissions to consumer1

To grant the IAM user consumer1 permissions on table public.customer, follow these steps:

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose IAM user consumer1.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. For Tables, choose public.customer.
  7. Select Describe and Select for Table permissions.
  8. Choose Grant to apply the permissions.

Grant table permissions to consumer1

Grant column permissions to consumer2

To grant the IAM user consumer2 permissions on non-sensitive columns in public.customer_view, follow these steps:

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM user consumer2.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. For Tables, choose public.customer_view.

Grant column permissions to consumer2

  7. Select Select for Table permissions.
  8. Under Data Permissions, select Column-based access.
  9. Select Include columns and choose the non-sensitive columns (c_customer_id, c_birth_country, cd_gender, cd_marital_status, and cd_education_status).
  10. Choose Grant to apply the permissions.

table permissions
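The column-based grant to consumer2 maps to a TableWithColumns resource that lists only the allowed columns. The following is a sketch of the request shape; the IAM user ARN is illustrative for the consumer account.

```python
# Sketch: column-level SELECT grant for consumer2 on public.customer_view,
# mirroring the console steps above. The IAM user ARN is illustrative.
GRANT_COLUMNS = {
    "Principal": {"DataLakePrincipalIdentifier":
                  "arn:aws:iam::665544332211:user/consumer2"},
    "Resource": {
        "TableWithColumns": {
            "DatabaseName": "demotahoedb",
            "Name": "public.customer_view",
            # Only the non-sensitive columns chosen in the console
            "ColumnNames": [
                "c_customer_id", "c_birth_country", "cd_gender",
                "cd_marital_status", "cd_education_status",
            ],
        }
    },
    "Permissions": ["SELECT"],
}
print(len(GRANT_COLUMNS["Resource"]["TableWithColumns"]["ColumnNames"]))  # 5
```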

Consume the datashare from the data consumer account in the Amazon Redshift cluster

In the Amazon Redshift consumer data warehouse, log in as the admin user using Query Editor V2 and complete the following steps:

  1. Create the Amazon Redshift database from the shared catalog database using the following SQL command:
CREATE DATABASE demotahoedb FROM ARN 'arn:aws:glue:<central-region>:<central-aws-account-id>:database/demotahoedb' WITH DATA CATALOG SCHEMA demotahoedb;
  2. Run the following SQL commands to create IAM users and grant them usage on the Amazon Redshift database:
CREATE USER IAM:consumer1 PASSWORD DISABLE;
CREATE USER IAM:consumer2 PASSWORD DISABLE;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer1;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer2;

To use a federated identity to enforce Lake Formation permissions, complete the following steps to configure Query Editor V2:

  1. Choose the settings icon in the bottom left corner of Query Editor V2, then choose Account settings.

identity to enforce Lake Formation permissions

  2. Under Connection settings, select Authenticate with IAM credentials.
  3. Choose Save.

Authenticate with IAM credentials

Query the shared datasets as a consumer user

To validate that the IAM user consumer1 has datashare access from Amazon Redshift, perform the following steps:

  1. Sign in to the console as IAM user consumer1.
  2. On the Amazon Redshift console, choose Query Editor V2 in the navigation pane.
  3. To connect to the consumer cluster, choose the consumer cluster in the tree-view pane.
  4. When prompted, for Authentication, select Temporary credentials using your IAM identity.
  5. For Database, enter the database name (for this post, dev).
  6. The user name will be mapped to your current IAM identity (for this post, consumer1).
  7. Choose Save.

edit connection for redshift

  8. Once you’re connected to the database, you can validate the currently logged-in user with the following SQL command:
select current_user;

  9. To find the federated databases created on the consumer account, run the following SQL command:
SHOW DATABASES FROM DATA CATALOG [ACCOUNT '<id1>', '<id2>'] [LIKE 'expression'];

federated databases created on the consumer account

  10. To validate permissions for consumer1, run the following SQL command:
select * from demotahoedb.public.customer limit 10;

As shown in the following screenshot, consumer1 is able to successfully access the datashare customer object.

Now let’s validate that consumer2 doesn’t have access to the datashare table public.customer on the same consumer cluster.

  1. Log out of the console and sign in as IAM user consumer2.
  2. Follow the same steps to connect to the database using the query editor.
  3. Once connected, run the same query:
select * from demotahoedb.public.customer limit 10;

The user consumer2 should get a permission denied error, as in the following screenshot.

should get a permission denied error

Let’s validate the column-level access permissions of consumer2 on the public.customer_view view.

  1. Connect to Query Editor v2 as consumer2 and run the following SQL command:
select c_customer_id,c_birth_country,cd_gender,cd_marital_status from demotahoedb.public.customer_view limit 10;

In the following screenshot, you can see consumer2 is only able to access columns as granted by Lake Formation.

access columns as granted by Lake Formation

Conclusion

A data mesh approach provides a method by which organizations can share data across business units. Each domain is responsible for the ingestion, processing, and serving of their data. They are data owners and domain experts, and are responsible for data quality and accuracy. Using Amazon Redshift data sharing with Lake Formation for data governance helps build the data mesh architecture, enabling data sharing and federation across business units with fine-grained access control.

Special thanks to everyone who contributed to launch Amazon Redshift data sharing with AWS Lake Formation:

Debu Panda, Michael Chess, Vlad Ponomarenko, Ting Yan, Erol Murtezaoglu, Sharda Khubchandani, Rui Bi

About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Harshida Patel is an Analytics Specialist Principal Solutions Architect with AWS.

Ranjan Burman is an Analytics Specialist Solutions Architect with AWS.

Vikram Sahadevan is a Senior Resident Architect on the AWS Data Lab team. He enjoys efforts that focus around providing prescriptive architectural guidance, sharing best practices, and removing technical roadblocks with joint engineering engagements between customers and AWS technical resources that accelerate data, analytics, artificial intelligence, and machine learning initiatives.

Steve Mitchell is a Senior Solution Architect with a passion for analytics and data mesh. He enjoys working closely with customers as they transition to a modern data architecture.

How GoDaddy built a data mesh to decentralize data ownership using AWS Lake Formation

Post Syndicated from Ankit Jhalaria original https://aws.amazon.com/blogs/big-data/how-godaddy-built-a-data-mesh-to-decentralize-data-ownership-using-aws-lake-formation/

This is a guest post co-written with Ankit Jhalaria from GoDaddy.

GoDaddy is empowering everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their idea, build a professional website, attract customers, and manage their work.

GoDaddy is a data-driven company, and getting meaningful insights from data helps them drive business decisions to delight their customers. In 2018, GoDaddy began a large infrastructure revamp and partnered with AWS to innovate faster than ever before to meet the needs of its customer growth around the world. As part of this revamp, the GoDaddy Data Platform team wanted to set the company up for long-term success by creating a well-defined data strategy and setting goals to decentralize the ownership and processing of data.

In this post, we discuss how GoDaddy uses AWS Lake Formation to simplify security management and data governance at scale, and enable data as a service (DaaS) supporting organization-wide data accessibility with cross-account data sharing using a data mesh architecture.

The challenge

In the vast ocean of data, deriving useful insights is an art. Prior to the AWS partnership, GoDaddy had a shared Hadoop cluster on premises that various teams used to create and share datasets with other analysts for collaboration. As the teams grew, copies of data started to proliferate in the Hadoop Distributed File System (HDFS). Several teams started to build tooling to manage this challenge independently, duplicating effort. Managing permissions on these data assets became harder, and making data discoverable across a growing number of data catalogs and systems became a big challenge. Although storage these days is relatively inexpensive, when several copies of the same data asset exist, it's harder for analysts to use the data efficiently and reliably. Business analysts need robust pipelines on the key datasets they rely on to make business decisions.

Solution overview

In GoDaddy’s data mesh hub and spoke model, a central data catalog contains information about all the data products that exist in the company. In AWS terminology, this is the AWS Glue Data Catalog. The data platform team provides APIs, SDKs, and Airflow Operators as components that different teams use to interact with the catalog. Activities such as updating the metastore to reflect a new partition for a given data product, and occasionally running MSCK repair operations, are all handled in the central governance account, and Lake Formation is used to secure access to the Data Catalog.

The data platform team introduced a layer of data governance that ensures best practices for building data products are followed throughout the company. We provide the tooling to support data engineers and business analysts while leaving the domain experts to run their data pipelines. With this approach, we have well-curated data products that are intuitive and easy to understand for our business analysts.

A data product refers to an entity that powers insights for analytical purposes. In simple terms, this could refer to an actual dataset pointing to a location in Amazon Simple Storage Service (Amazon S3). Data producers are responsible for the processing of data and creating new snapshots or partitions depending on the business needs. In some cases, data is refreshed every 24 hours, and other cases, every hour. Data consumers come to the data mesh to consume data, and permissions are managed in the central governance account through Lake Formation. Lake Formation uses AWS Resource Access Manager (AWS RAM) to send resource shares to different consumer accounts to be able to access the data from the central governance account. We go into details about this functionality later in the post.

The following diagram illustrates the solution architecture.


Defining metadata with the central schema repository

Data is only useful if end-users can derive meaningful insights from it—otherwise, it’s just noise. As part of onboarding with the data platform, a data producer registers their schema with the data platform along with relevant metadata. This is reviewed by the data governance team that ensures best practices for creating datasets are followed. We have automated some of the most common data governance review items. This is also the place where producers define a contract about reliable data deliveries, often referred to as Service Level Objective (SLO). After a contract is in place, the data platform team’s background processes monitor and send out alerts when data producers fail to meet their contract or SLO.

When managing permissions with Lake Formation, you register the Amazon S3 location of different S3 buckets. Lake Formation uses AWS RAM to share the named resource.

When managing resources with AWS RAM, the central governance account creates AWS RAM shares. The data platform provides a custom AWS Service Catalog product to accept AWS RAM shares in consumer accounts.

Having consistent schemas with meaningful names and descriptions makes the discovery of datasets easy. Every data producer who is a domain expert is responsible for creating well-defined schemas that business users use to generate insights to make key business decisions. Data producers register their schemas along with additional metadata with the data lake repository. Metadata includes information about the team responsible for the dataset, such as their SLO contract, description, and contact information. This information gets checked into a Git repository where automation kicks in and validates the request to make sure it conforms to standards and best practices. We use AWS CloudFormation templates to provision resources. The following code is a sample of what the registration metadata looks like.

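The original sample is shown as an image in the post. As a hypothetical illustration only (field names are ours, not GoDaddy's actual schema), such a registration document might contain:

```python
# Hypothetical registration metadata for a data product.
# All field names and values are illustrative; GoDaddy's actual schema is not public.
registration = {
    "data_product": "orders_daily",
    "owner_team": "commerce-analytics",
    "contact": "commerce-analytics@example.com",
    "description": "Daily snapshot of completed orders.",
    "s3_location": "s3://example-producer-bucket/orders_daily/",
    "slo": {
        # Expected delivery by 06:00 UTC daily, used to create the alerting rule.
        "delivery_schedule": "cron(0 6 * * ? *)",
        "alert_channel": "#commerce-analytics-alerts",
    },
    "schema": [
        {"name": "order_id", "type": "string", "description": "Unique order identifier"},
        {"name": "order_total", "type": "decimal(10,2)", "description": "Order total in USD"},
        {"name": "order_date", "type": "date", "description": "Partition column"},
    ],
}
```

A document like this is what the Git-based automation would validate before provisioning resources with AWS CloudFormation.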

As part of the registration process, automation steps run in the background to take care of the following on behalf of the data producer:

  • Register the producer’s Amazon S3 location of the data with Lake Formation – This allows us to use Lake Formation for fine-grained access control on the table in the AWS Glue Data Catalog that refers to this location, as well as on the underlying data.
  • Create the underlying AWS Glue database and table – Based on the schema specified by the data producer along with the metadata, we create the underlying AWS Glue database and table in the central governance account. As part of this, we also use table properties of AWS Glue to store additional metadata to use later for analysis.
  • Define the SLO contract – Any business-critical dataset needs to have a well-defined SLO contract. As part of dataset registration, the data producer defines a contract with a cron expression that gets used by the data platform to create an event rule in Amazon EventBridge. This rule triggers an AWS Lambda function to watch for deliveries of the data and triggers an alert to the data producer’s Slack channel if they breach the contract.
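To illustrate the SLO step, here is a minimal sketch of how a producer's cron contract could be turned into an EventBridge rule definition. The rule name, description, and the checker Lambda are assumptions, not the platform's actual implementation:

```python
# Sketch: turn a producer's SLO cron expression into EventBridge rule parameters.
# Names are illustrative, not the data platform's real conventions.
def slo_rule_params(product_name: str, cron_expression: str) -> dict:
    """Build put_rule parameters for a data product's SLO delivery check."""
    return {
        "Name": f"slo-check-{product_name}",
        "ScheduleExpression": f"cron({cron_expression})",
        "State": "ENABLED",
        "Description": f"Fires when {product_name} is due; a Lambda checks delivery.",
    }

params = slo_rule_params("orders_daily", "0 6 * * ? *")
# A real implementation would then call:
#   boto3.client("events").put_rule(**params)
#   boto3.client("events").put_targets(Rule=params["Name"], Targets=[...Lambda target...])
```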

Consuming data from the data mesh catalog

When a data consumer belonging to a given line of business (LOB) identifies the data product that they’re interested in, they submit a request to the central governance team containing their AWS account ID that they use to query the data. The data platform provides a portal to discover datasets across the company. After the request is approved, automation runs to create an AWS RAM share with the consumer account covering the AWS Glue database and tables mapped to the data product registered in the AWS Glue Data Catalog of the central governance account.

The following screenshot shows an example of a resource share.


The consumer data lake admin needs to accept the AWS RAM share and create a resource link in Lake Formation to start querying the shared dataset within their account. We automated this process by building an AWS Service Catalog product that runs in the consumer’s account as a Lambda function that accepts shares on behalf of consumers.
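The core of such a share-accepting function can be sketched as follows. In a real Lambda, the invitation list would come from the AWS RAM API (get_resource_share_invitations), and each pending invitation would be passed to accept_resource_share_invitation; the sample data here is illustrative:

```python
# Sketch of the share-accepting Lambda's core logic.
def pending_invitation_arns(invitations: list) -> list:
    """Return the ARNs of RAM share invitations that still need to be accepted."""
    return [
        inv["resourceShareInvitationArn"]
        for inv in invitations
        if inv.get("status") == "PENDING"
    ]

# Illustrative response shape from ram.get_resource_share_invitations().
sample = [
    {"resourceShareInvitationArn":
         "arn:aws:ram:us-east-1:111111111111:resource-share-invitation/example-1",
     "status": "PENDING"},
    {"resourceShareInvitationArn":
         "arn:aws:ram:us-east-1:111111111111:resource-share-invitation/example-2",
     "status": "ACCEPTED"},
]
# Each ARN returned here would be passed to ram.accept_resource_share_invitation().
```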

When the resource-linked datasets are available in the consumer account, the consumer data lake admin grants access to IAM users and roles mapping to data consumers within the account. These consumers (application or user persona) can now query the datasets using the AWS analytics services of their choice, like Amazon Athena and Amazon EMR, based on the access privileges granted by the consumer data lake admin.

Day-to-day operations and metrics

Managing permissions using Lake Formation is one part of the overall ecosystem. After permissions have been granted, data producers create new snapshots of the data at a certain cadence that can vary from every 15 minutes to a day. Data producers are integrated with the data platform APIs that inform the platform about any new refreshes of the data. The data platform automatically writes a 0-byte _SUCCESS file for every dataset that gets refreshed, and notifies the subscribed consumer account via an Amazon Simple Notification Service (Amazon SNS) topic in the central governance account. Consumers use this as a signal to trigger their data pipelines and processes to start processing the newer version of the data using an event-driven approach.
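A minimal sketch of this refresh signal, with an illustrative bucket layout and message fields (the platform's actual contract isn't public):

```python
import json

def success_key(product: str, snapshot: str) -> str:
    """S3 key of the 0-byte _SUCCESS marker for a refreshed snapshot (illustrative layout)."""
    return f"{product}/{snapshot}/_SUCCESS"

def refresh_message(product: str, snapshot: str) -> str:
    """JSON payload published to the central SNS topic that consumers subscribe to."""
    return json.dumps({"data_product": product, "snapshot": snapshot, "status": "refreshed"})

# A real implementation would follow with:
#   boto3.client("s3").put_object(Bucket="<producer-bucket>",
#                                 Key=success_key("orders_daily", "dt=2022-10-01"),
#                                 Body=b"")
#   boto3.client("sns").publish(TopicArn="<central-topic-arn>",
#                               Message=refresh_message("orders_daily", "dt=2022-10-01"))
```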

There are over 2,000 data products built on the GoDaddy data mesh on AWS. Every day, there are thousands of updates to the AWS Glue metastore in the central data governance account. There are hundreds of data producers generating data every hour in a wide array of S3 buckets, and thousands of data consumers consuming data across a wide array of tools, including Athena, Amazon EMR, and Tableau from different AWS accounts.

Business outcomes

With the move to AWS, GoDaddy’s Data Platform team laid the foundations to build a modern data platform that has increased our velocity of building data products and delighting our customers. The data platform has successfully transitioned from a monolithic platform to a model where ownership of data has been decentralized. We accelerated the data platform adoption to over 10 lines of business and over 300 teams globally, and are successfully managing multiple petabytes of data spread across hundreds of accounts to help our business derive insights faster.

Conclusion

GoDaddy’s hub and spoke data mesh architecture built using Lake Formation simplifies security management and data governance at scale, to deliver data as a service supporting company-wide data accessibility. Our data mesh manages multiple petabytes of data across hundreds of accounts, enabling decentralized ownership of well-defined datasets with automation in place, which helps the business discover data assets quicker and derive business insights faster.

This post illustrates the use of Lake Formation to build a data mesh architecture that enables a DaaS model for a modernized enterprise data platform. For more information, see Design a data mesh architecture using AWS Lake Formation and AWS Glue.


About the Authors

Ankit Jhalaria is the Director Of Engineering on the Data Platform at GoDaddy. He has over 10 years of experience working in big data technologies. Outside of work, Ankit loves hiking, playing board games, building IoT projects, and contributing to open-source projects.

Harsh Vardhan is an AWS Solutions Architect, specializing in Analytics. He has over 6 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Kyle Tedeschi is a Principal Solutions Architect at AWS. He enjoys helping customers innovate, transform, and become leaders in their respective domains. Outside of work, Kyle is an avid snowboarder, car enthusiast, and traveler.

Use an event-driven architecture to build a data mesh on AWS

Post Syndicated from Jan Michael Go Tan original https://aws.amazon.com/blogs/big-data/use-an-event-driven-architecture-to-build-a-data-mesh-on-aws/

In this post, we take the data mesh design discussed in Design a data mesh architecture using AWS Lake Formation and AWS Glue, and demonstrate how to initialize data domain accounts to enable managed sharing; we also go through how we can use an event-driven approach to automate processes between the central governance account and data domain accounts (producers and consumers). We build a data mesh pattern from scratch as Infrastructure as Code (IaC) using AWS CDK and use an open-source self-service data platform UI to share and discover data between business units.

The key advantage of this approach is being able to add actions in response to data mesh events such as permission management, tag propagation, and search index management, and to automate different processes.

Before we dive into it, let’s look at AWS Analytics Reference Architecture, an open-source library that we use to build our solution.

AWS Analytics Reference Architecture

AWS Analytics Reference Architecture (ARA) is a set of analytics solutions put together as end-to-end examples. It brings together AWS best practices for designing, implementing, and operating analytics platforms through different purpose-built patterns, handling common requirements, and solving customers’ challenges.

ARA exposes reusable core components in an AWS CDK library, currently available in TypeScript and Python. This library contains AWS CDK constructs (L3) that can be used to quickly provision analytics solutions in demos, prototypes, proofs of concept, and end-to-end reference architectures.

The following table lists data mesh specific constructs in the AWS Analytics Reference Architecture library.

Construct Name – Purpose
CentralGovernance – Creates an Amazon EventBridge event bus for the central governance account that is used to communicate with data domain accounts (producer/consumer). Creates workflows to automate data product registration and sharing.
DataDomain – Creates an Amazon EventBridge event bus for a data domain account (producer/consumer) to communicate with the central governance account. It creates data lake storage (Amazon S3) and a workflow to automate data product registration. It also creates a workflow to populate AWS Glue Data Catalog metadata for a newly registered data product.

You can find AWS CDK constructs for the AWS Analytics Reference Architecture on Construct Hub.

In addition to ARA constructs, we also use an open-source self-service data platform (user interface). It is built using AWS Amplify, Amazon DynamoDB, AWS Step Functions, AWS Lambda, Amazon API Gateway, Amazon EventBridge, Amazon Cognito, and Amazon OpenSearch. The frontend is built with React. Through the self-service data platform, you can: 1) manage data domains and data products, and 2) discover and request access to data products.

Central Governance and data sharing

For the governance of our data mesh, we will use AWS Lake Formation. AWS Lake Formation is a fully managed service that simplifies data lake setup, supports centralized security management, and provides transactional access on top of your data lake. Moreover, it enables data sharing across accounts and organizations. This centralized approach has a number of key benefits, such as centralized auditing, centralized permission management, and centralized data discovery. More importantly, this allows organizations to gain the benefits of centralized governance while taking advantage of the inherent scaling characteristics of decentralized data product management.

There are two ways to share data resources in Lake Formation: 1) Named Resource Access Control (NRAC), and 2) Tag-Based Access Control (LF-TBAC). NRAC uses AWS Resource Access Manager (AWS RAM) to share data resources across accounts. Those are consumed via resource links that are based on created resource shares. Tag-Based Access Control (LF-TBAC) is another approach to share data resources in AWS Lake Formation that defines permissions based on attributes. These attributes are called LF-tags. You can read this blog to learn about LF-TBAC in the context of data mesh.

The following diagram shows how NRAC and LF-TBAC data sharing works. In this example, a data domain is registered as a node on the mesh, and therefore we create two databases in the central governance account. The NRAC database is shared with the data domain via AWS RAM. Access to data products that we register in this database will be handled through NRAC. The LF-TBAC database is tagged with the data domain N line of business (LOB) LF-tag: <LOB:N>. The LOB tag is automatically shared with the data domain N account, and therefore the database is available in that account. Access to data products in this database will be handled through LF-TBAC.


In our solution, we demonstrate both the NRAC and LF-TBAC approaches. With the NRAC approach, we build an event-based workflow that automatically accepts the RAM share in the data domain accounts and automates the creation of the necessary metadata objects (e.g., local database, resource links). With the LF-TBAC approach, we rely on permissions associated with the shared LF-tags to allow producer data domains to manage their data products, and to give consumer data domains read access to the relevant data products associated with the LF-tags that they requested access to.
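As a hedged illustration of the LF-TBAC side, a grant to a consumer account could be expressed as parameters to Lake Formation's grant_permissions API: read access on every table carrying the LOB tag. Account IDs and tag values here are made up:

```python
# Sketch: LF-TBAC grant giving a consumer account SELECT/DESCRIBE on all tables
# tagged <LOB:N>. Account IDs and tag values are illustrative.
grant_params = {
    "Principal": {"DataLakePrincipalIdentifier": "222222222222"},  # consumer account
    "Resource": {
        "LFTagPolicy": {
            "CatalogId": "111111111111",  # central governance account
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "LOB", "TagValues": ["N"]}],
        }
    },
    "Permissions": ["SELECT", "DESCRIBE"],
}
# A real implementation would call:
#   boto3.client("lakeformation").grant_permissions(**grant_params)
```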

We use the CentralGovernance construct from the ARA library to build the central governance account. It creates an EventBridge event bus to enable communication with data domain accounts that register as nodes on the mesh. For each registered data domain, specific event bus rules are created that route events towards that account. The central governance account has a central metadata catalog that allows data to be stored in different data domains, as opposed to a single central lake. For each registered data domain, we create two separate databases in the central governance catalog to demonstrate both NRAC and LF-TBAC data sharing. The CentralGovernance construct creates workflows for data product registration and data product sharing. We also deploy a self-service data platform UI to enable a good user experience to manage data domains and data products, and to simplify data discovery and sharing.


A data domain: producer and consumer

We use the DataDomain construct from the ARA library to build a data domain account that can be a producer, consumer, or both. Producers manage the lifecycle of their respective data products in their own AWS accounts. Typically, this data is stored in Amazon Simple Storage Service (Amazon S3). The DataDomain construct creates data lake storage with a cross-account bucket policy that enables the central governance account to access the data. Data is encrypted using AWS KMS, and the central governance account has permission to use the key. A config secret in AWS Secrets Manager contains all the necessary information to register the data domain as a node on the mesh in central governance. It includes: 1) the data domain name, 2) the S3 location that holds data products, and 3) the encryption key ARN. The DataDomain construct also creates data domain and crawler workflows to automate data product registration.
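A hypothetical shape for that config secret is sketched below; the exact field names used by the ARA construct may differ:

```python
import json

# Illustrative secret value stored in AWS Secrets Manager for one data domain.
# Field names are assumptions, not the construct's documented schema.
domain_config = json.dumps({
    "domainName": "sales-domain",
    "bucketName": "sales-domain-data-products",   # data lake storage for products
    "prefix": "data-products/",                   # S3 location that holds data products
    "kmsKeyId": "arn:aws:kms:us-east-1:222222222222:key/example-key-id",
})
```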


Creating an event-driven data mesh

Data mesh architectures typically require some level of communication and trust policy management to maintain least privilege for the relevant principals between the different accounts (for example, central governance to producer, central governance to consumer). We use an event-driven approach via EventBridge to securely forward events from an event bus in one account to an event bus in another account while maintaining least privilege access. When we register a data domain to the central governance account through the self-service data platform UI, we establish bi-directional communication between the accounts via EventBridge. The domain registration process also creates a database in the central governance catalog to hold data products for that particular domain. The registered data domain is now a node on the mesh, and we can register new data products.
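The bus-to-bus forwarding described above can be sketched as an EventBridge rule whose target is the domain account's event bus. All names, account IDs, and the event source string below are illustrative, not what the ARA constructs actually create:

```python
# Sketch: central bus rule that forwards a registered domain's events to that
# domain's own event bus. Names, IDs, and the "source" value are illustrative.
forward_rule = {
    "Name": "forward-to-domain-222222222222",
    "EventBusName": "central-mesh-bus",
    "EventPattern": '{"source": ["com.central.governance"], '
                    '"detail": {"domainAccount": ["222222222222"]}}',
}
forward_target = {
    "Rule": forward_rule["Name"],
    "EventBusName": forward_rule["EventBusName"],
    "Targets": [{
        "Id": "domain-bus",
        # Cross-account target: the domain account's event bus ARN.
        "Arn": "arn:aws:events:us-east-1:222222222222:event-bus/data-domain-bus",
        # Role the central account assumes to put events on the remote bus.
        "RoleArn": "arn:aws:iam::111111111111:role/central-mesh-forwarder",
    }],
}
# A real implementation would call events.put_rule(**forward_rule) and
# events.put_targets(**forward_target).
```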

The following diagram shows data product registration process:


  1. The central governance account starts the Register Data Product workflow, which creates an empty table (the schema is managed by the producers in their respective producer account). This workflow also grants a cross-account permission to the producer account that allows the producer to manage the schema of the table.
  2. When complete, this emits an event into the central event bus.
  3. The central event bus contains a rule that forwards the event to the producer’s event bus. This rule was created during the data domain registration process.
  4. When the producer’s event bus receives the event, it triggers the Data Domain workflow, which creates resource-links and grants permissions.
  5. Still in the producer account, the Crawler workflow is triggered when the Data Domain workflow state changes to Successful. This creates the crawler, runs it, waits and checks if the crawler is done, and deletes the crawler when it’s complete. This workflow is responsible for populating the tables’ schemas.
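The event emitted in step 2 might look like the following put_events entry; the detail fields are illustrative, not the ARA constructs' actual event schema:

```python
import json

# Sketch of a "data product registered" event put on the central event bus.
# Source, detail-type, and detail fields are assumptions for illustration.
event_entry = {
    "Source": "com.central.governance",
    "DetailType": "DataProductRegistered",
    "EventBusName": "central-mesh-bus",
    "Detail": json.dumps({
        "domainAccount": "222222222222",
        "database": "sales-domain-nrac",
        "table": "orders_daily",
        "s3Location": "s3://sales-domain-data-products/orders_daily/",
    }),
}
# A real implementation would call:
#   boto3.client("events").put_events(Entries=[event_entry])
```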

Now other data domains can find newly registered data products using the self-service data platform UI and request access. The sharing process works in the same way as product registration by sending events from the central governance account to consumer data domain, and triggering specific workflows.

Solution Overview

The following high-level solution diagram shows how everything fits together and how an event-driven architecture enables multiple accounts to form a data mesh. You can follow the workshop that we released to deploy the solution covered in this blog post. You can deploy multiple data domains and test both data registration and data sharing. You can also use the self-service data platform UI to search through data products and request access using both LF-TBAC and NRAC approaches.


Conclusion

Implementing a data mesh on top of an event-driven architecture provides both flexibility and extensibility. A data mesh by itself has several moving parts to support various functionalities, such as onboarding, search, access management and sharing, and more. With an event-driven architecture, we can implement these functionalities in smaller components to make them easier to test, operate, and maintain. Future requirements and applications can use the event stream to provide their own functionality, making the entire mesh much more valuable to your organization.

To learn more how to design and build applications based on event-driven architecture, see the AWS Event-Driven Architecture page. To dive deeper into data mesh concepts, see the Design a Data Mesh Architecture using AWS Lake Formation and AWS Glue blog.

If you’d like our team to run data mesh workshop with you, please reach out to your AWS team.


About the authors


Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.

Dzenan Softic is a Senior Solutions Architect at AWS. He works with startups to help them define and execute their ideas. His main focus is in data engineering and infrastructure.

David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling business to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.

Vincent Gromakowski is an Analytics Specialist Solutions Architect at AWS where he enjoys solving customers’ analytics, NoSQL, and streaming challenges. He has strong expertise in distributed data processing engines and resource orchestration platforms.

Integrate AWS IAM Identity Center (successor to AWS Single Sign-On) with AWS Lake Formation fine-grained access controls

Post Syndicated from Benon Boyadjian original https://aws.amazon.com/blogs/big-data/integrate-aws-iam-identity-center-successor-to-aws-single-sign-on-with-aws-lake-formation-fine-grained-access-controls/

Data lakes are a centralized repository for storing structured and unstructured data at scale. Data lakes enable you to create dashboards, perform big data processing and real-time analytics, and create machine learning (ML) models on your data to drive business decisions.

Many customers are choosing AWS Lake Formation as their data lake management solution. Lake Formation is an integrated data lake service that makes it simple for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and ML.

However, some companies require account authentication and authorization to be managed through AWS IAM Identity Center (successor to AWS Single Sign-On), which doesn’t have a built-in integration with Lake Formation.

Integrating Lake Formation with IAM Identity Center can help you manage data access at the organization level, consolidating AWS account and data lake authentication and authorization.

In this post, we walk through the steps to integrate IAM Identity Center with Lake Formation.

Solution overview

In this post, we configure IAM Identity Center with permission sets for your data lake personas. These are the permissions that allow your data lake users to access Lake Formation. When the permission sets are assigned to your data lake account, IAM Identity Center creates Identity and Access Management (IAM) roles in that account. The IAM roles are prefixed with AWSReservedSSO_<Permission Set Name>.

In Lake Formation, you can grant data resource permissions to IAM users and roles. To integrate with IAM Identity Center, you will grant data resource access to the IAM roles created by IAM Identity Center.

Now, when users access the data lake account through the IAM Identity Center portal, they assume an IAM role that has access to Lake Formation resources.

The following diagram illustrates this solution architecture.

To implement the solution, complete the following high-level steps:

  1. Create a permission set within IAM Identity Center
  2. Grant Users or Groups access to the data lake account in IAM Identity Center
  3. Assign an IAM Identity Center role as a Data Lake Administrator
  4. Grant IAM Identity Center generated IAM role data lake permissions in Lake Formation
  5. Grant IAM Identity Center generated IAM role data location permissions in Lake Formation

Prerequisites

For this walkthrough, you should have the following prerequisites: 

Create a permission set with IAM Identity Center

To create your permission set, complete the following steps:

  1. Sign in to the AWS Management Console with your management account and go to the Region where IAM Identity Center is configured.
  2. On the IAM Identity Center console, choose Permission sets in the navigation pane.
  3. Choose Create permission set.

  4. Select Custom permission set, then choose Next.

  5. Next, you must specify policies. The first permission set you create should have data lake admin privileges.
    AWS recommends granting data lake admins the following AWS managed policies: AWSGlueConsoleFullAccess, AWSLakeFormationCrossAccountManager, AWSLakeFormationDataAdmin, AmazonAthenaFullAccess, and CloudWatchLogsReadOnlyAccess. However, if these permissions are too permissive or not permissive enough, you may prefer using customer managed policies.
  6. Choose Next.
  7. Specify permission set details, then choose Next.

  8. Review your settings, then choose Create.

Repeat the steps to create a data analyst role to grant Lake Formation access. For this post, we created the role LakeFormationDataAnalyst with the policy AmazonAthenaFullAccess.

Grant users or groups access to the data lake account in IAM Identity Center

To grant access to users and groups, complete the following steps:

  1. On the IAM Identity Center console, choose AWS accounts in the navigation pane.
  2. Choose Assign users or groups.

  3. Select the user or group you want to assign the data lake account permissions to (DataLakeAdmin).
  4. Choose Next.

  5. Select the permission you created earlier.
  6. Choose Next.

  7. Review your settings, then choose Submit.

Verify your IAM Identity Center permissions have been successfully granted by visiting your IAM Identity Center Portal, choosing the data lake admin, and signing in to the console.

Assign an IAM Identity Center role as a data lake administrator

The following steps set up a data lake administrator with the IAM role created by IAM Identity Center. Administrators have full access to the Lake Formation console, and control the initial data configuration and access permissions. For all users and groups that don’t need to be data lake administrators, skip to the next series of steps.

  1. Sign in to the console as the data lake account with admin access.
  2. Open the Lake Formation console. A pop-up window appears, prompting you to define your administrators.
  3. Select Add other AWS users or roles.
  4. Choose the permission set you created earlier (starting with AWSReservedSSO_DataLakeAdmin).
  5. Choose Get started.
  6. On the Administrative roles and tasks page, under Database creators, choose Grant.
  7. Choose your data lake admin role.
  8. Select Create database under Catalog permissions and Grantable permissions.
  9. Choose Grant.

You now have an IAM Identity Center-generated IAM principal that is assigned as the data lake administrator and database creator.

Grant the IAM Identity Center role data lake permissions in Lake Formation

You now manage data lake permissions. For more information, refer to Managing Lake Formation permissions. 

Whether you’re managing permissions with LF-tags or named resources, the steps for granting access remain the same.

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Select IAM users and roles.
  4. Choose the AWSReservedSSO_LakeFormationDataAnalyst role.
  5. Grant access to database and table permissions as applicable, then choose Grant.
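For automation, the same grant can be expressed as parameters to Lake Formation's grant_permissions API. The role ARN, database, and table names below are illustrative; note the aws-reserved/sso.amazonaws.com path that IAM Identity Center uses for the roles it creates:

```python
# Sketch: grant the Identity Center-generated role SELECT/DESCRIBE on a table.
# The ARN suffix, database, and table names are illustrative.
grant = {
    "Principal": {
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::111111111111:role/aws-reserved/sso.amazonaws.com/"
            "AWSReservedSSO_LakeFormationDataAnalyst_0123456789abcdef"
    },
    "Resource": {
        "Table": {"DatabaseName": "sales_db", "Name": "orders"}
    },
    "Permissions": ["SELECT", "DESCRIBE"],
}
# A real implementation would call:
#   boto3.client("lakeformation").grant_permissions(**grant)
```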

You now have an IAM Identity Center-generated IAM principal with data permissions.

Grant the IAM Identity Center role data location permissions in Lake Formation

When granting access to data locations, the process remains the same.

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data locations.
  2. Choose Grant.
  3. Choose the AWSReservedSSO_LakeFormationDataAnalyst role.
  4. Complete the remaining fields and choose Grant.
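Likewise, the data location grant corresponds to a grant_permissions call with a DataLocation resource; the ARNs below are illustrative:

```python
# Sketch: grant the Identity Center-generated role access to an S3 data location.
# Role ARN and bucket name are illustrative.
location_grant = {
    "Principal": {
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::111111111111:role/aws-reserved/sso.amazonaws.com/"
            "AWSReservedSSO_LakeFormationDataAnalyst_0123456789abcdef"
    },
    "Resource": {
        "DataLocation": {"ResourceArn": "arn:aws:s3:::example-data-lake-bucket/sales/"}
    },
    "Permissions": ["DATA_LOCATION_ACCESS"],
}
# A real implementation would call:
#   boto3.client("lakeformation").grant_permissions(**location_grant)
```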

You now have an IAM Identity Center-generated IAM principal with Data location access.

Validate data access

We now validate data access for the IAM Identity Center principal.

  1. Sign in to the console through IAM Identity Center as the principal you granted access to. For this post, we’re logging in as the LakeFormationDataAnalyst role.

    To test data access, we run some queries in Amazon Athena.
  2. On the Athena console, choose Query editor.
  3. On the Settings tab, confirm that a query result location is set up.
  4. If you don’t have a query result location, choose Manage and configure your query result location and encryption.
  5. In the Athena query editor, on the Editor tab, choose the database that you granted access to. If the principal doesn’t have access to the Lake Formation table and data location, you won’t be able to view data in Athena.
  6. Choose the menu icon next to your table and choose Generate table DDL.

Confirm that the data appears on the Query results tab.

Conclusion

In this post, we demonstrated how to integrate IAM Identity Center with Lake Formation permissions. You can now grant IAM Identity Center identities administrator, database creation, database and table, and data location access in Lake Formation. Managing data lake permissions through IAM Identity Center allows you to control data access from your management account, helping to improve your scalability and security.

If you’re wondering how to adapt this solution to Tag-based access control, read Easily manage your data lake at scale using AWS Lake Formation Tag-based access control and apply the techniques you learned from this blog.


About the authors

Benon Boyadjian is a Private Equity Solutions Architect at AWS. He is passionate about helping customers understand the impact AWS can have on their businesses and guiding their AWS implementations. In his free time, he enjoys swimming, snowboarding, and playing with his cat Dirt.

Janakiraman Shanmugam is a Senior Data Architect at Amazon Web Services. He has a focus in Data & Analytics and enjoys helping customers solve big data and machine learning problems. Outside of the office, he loves to be with his friends and family and spend time outdoors.

Let’s Architect! Modern data architectures

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-modern-data-architectures/

With the rapid growth in data coming from data platforms and applications, and the continuous improvements in state-of-the-art machine learning algorithms, data are becoming key assets for companies.

Modern data architectures include data mesh—a recent style that represents a paradigm shift, in which data is treated as a product and data architectures are designed around business domains. This type of approach supports the idea of distributed data, where each business domain focuses on the quality of the data it produces and exposes to the consumers.

In this edition of Let’s Architect!, we focus on data mesh and how it is designed on AWS, plus other approaches to adopt modern architectural patterns.

Design a data mesh architecture using AWS Lake Formation and AWS Glue

Domain Driven Design (DDD) is a software design approach where a solution is divided into domains aligned with business capabilities, software, and organizational boundaries. Unlike software architectures, most data architectures are often designed around technologies rather than business domains.

In this blog, you can learn about data mesh, an architectural pattern that applies the principles of DDD to data architectures. Data are organized into domains and considered the product that each team owns and offers for consumption.

A data mesh design organizes around data domains. Each domain owns multiple data products with their own data and technology stacks

Building Data Mesh Architectures on AWS

In this video, discover how to use the data mesh approach in AWS. Specifically, how to implement certain design patterns for building a data mesh architecture with AWS services in the cloud.

This is a pragmatic presentation to get a quick understanding of data mesh fundamentals, the benefits/challenges, and the AWS services that you can use to build it. This video provides additional context to the aforementioned blog post and includes several examples on the benefits of modern data architectures.

This diagram demonstrates the pattern for sharing data catalogs between producer domains and consumer domains

Build a modern data architecture on AWS with Amazon AppFlow, AWS Lake Formation, and Amazon Redshift

In this blog, you can learn how to build a modern data strategy using AWS managed services to ingest data from sources like Salesforce. Also discussed is how to automatically create metadata catalogs and share data seamlessly between the data lake and data warehouse, plus creating alerts in the event of an orchestrated data workflow failure.

The second part of the post explains how a data warehouse can be built by using an agile data modeling pattern, as well as how ELT jobs were quickly developed, orchestrated, and configured to perform automated data quality testing.

A data platform architecture and the subcomponents used to build it

AWS Lake Formation Workshop

With a modern data architecture on AWS, architects and engineers can rapidly build scalable data lakes; use a broad and deep collection of purpose-built data services; and ensure compliance via unified data access, security, and governance. As data mesh is a modern architectural pattern, you can build it using a service like AWS Lake Formation.

Familiarize yourself with new technologies and services not only by learning how they work, but also by building prototypes and projects to gain hands-on experience. This workshop allows builders to become familiar with the features of AWS Lake Formation and its integrations with other AWS services.

A data catalog is a key component in a data mesh architecture. AWS Glue crawlers interact with data stores and other elements to populate the data catalog

See you next time!

Thanks for joining our discussion on data mesh! See you in a couple of weeks when we talk more about architectures and the challenges that we face every day while working with distributed systems.

Other posts in this series

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!