
Set up and monitor AWS Glue crawlers using the enhanced AWS Glue UI and crawler history

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/set-up-and-monitor-aws-glue-crawlers-using-the-enhanced-aws-glue-ui-and-crawler-history/

A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. Setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. AWS Glue and AWS Lake Formation make it easy to build, secure, and manage data lakes. As data from existing data stores is moved into the data lake, you need to catalog it to prepare it for analytics with services such as Amazon Athena.

AWS Glue crawlers are a popular way to populate the AWS Glue Data Catalog. AWS Glue crawlers are a key component that allows you to connect to data sources or targets, use different classifiers to determine the logical schema for the data, and create metadata in the Data Catalog. You can run crawlers on a schedule, on demand, or trigger them based on an Amazon Simple Storage Service (Amazon S3) event to ensure that the Data Catalog is up to date. Using S3 event notifications can reduce the cost and time a crawler needs to update large and frequently changing tables.

The AWS Glue crawlers UI has been redesigned to offer a better user experience, and new functionalities have been added. This new UI provides easier setup of crawlers across multiple sources, including Amazon S3, Amazon DynamoDB, Amazon Redshift, Amazon Aurora, Amazon DocumentDB (with MongoDB compatibility), Delta Lake, MariaDB, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, and MongoDB. A new AWS Glue crawler history feature has also been launched, which provides a convenient way to view crawler runs, their schedules, data sources, and tags. For each crawl, the crawler history offers a summary of data modifications such as changes in the database schema or Amazon S3 partition changes. Crawler history also reports DPU hours, which reduces the time needed to analyze and debug crawler operations and costs.

This post shows how to create an AWS Glue crawler that supports S3 event notification using the new UI. We also show how to navigate through the new crawler history section and get valuable insights.

Overview of solution

To demonstrate how to create an AWS Glue crawler using the new UI, we use the Toronto parking tickets dataset, specifically the data about parking tickets issued in the city of Toronto in 2017 and 2018. The goal is to create a crawler based on S3 events, run it, and explore the information shown in the UI about the run of this crawler.

As mentioned before, instead of crawling all the subfolders on Amazon S3, we use an S3 event-based approach. This helps improve the crawl time by using S3 events to identify the changes between two crawls, listing only the files from the subfolder that triggered the event instead of listing the full Amazon S3 target. For this post, we create an S3 event notification, an Amazon Simple Notification Service (Amazon SNS) topic, and an Amazon Simple Queue Service (Amazon SQS) queue.
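The CloudFormation stack you launch later in this post provisions this event wiring for you. For reference only, the following is a minimal AWS CLI sketch of the equivalent S3 notification configuration; the bucket name, topic name, and prefix are placeholders that correspond to the resources created by the stack:

    # Route object-created events under the torontotickets/ prefix to the SNS topic
    aws s3api put-bucket-notification-configuration \
      --bucket glue-crawler-blog-<YOUR ACCOUNT NUMBER> \
      --notification-configuration '{
        "TopicConfigurations": [{
          "TopicArn": "arn:aws:sns:us-east-1:<YOUR ACCOUNT NUMBER>:<SNS TOPIC NAME>",
          "Events": ["s3:ObjectCreated:*"],
          "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "torontotickets/"}]}}
        }]
      }'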

The following diagram illustrates our solution architecture.

Prerequisites

For this walkthrough, you should have the following prerequisites:

If the AWS account you use to follow this post uses Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.

Launch your CloudFormation stack

To create your resources for this use case, complete the following steps:

  1. Launch your CloudFormation stack in us-east-1:
  2. Under Parameters, enter a name for your S3 bucket (include your account number).
  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.
  5. Wait until the creation of the stack is complete, as shown on the AWS CloudFormation console.
  6. On the stack’s Outputs tab, take note of the SQS queue ARN—we use it during the crawler creation process.

Launching this stack creates AWS resources. You need the following resources from the Outputs tab for the next steps:

  • GlueCrawlerRole – The IAM role to run AWS Glue jobs
  • BucketName – The name of the S3 bucket to store solution-related files
  • GlueSNSTopic – The SNS topic, which we use as the target for the S3 event
  • SQSArn – The SQS queue ARN; this queue is going to be consumed by the AWS Glue crawler

Create an AWS Glue crawler

Let’s first create the dataset that is going to be used as the source of the AWS Glue crawler:

  1. Open AWS CloudShell.
  2. Run the following command:
    aws s3 cp s3://aws-bigdata-blog/artifacts/gluenewcrawlerui/sourcedata/year=2017/Parking_Tags_Data_2017_2.csv s3://glue-crawler-blog-<YOUR ACCOUNT NUMBER>/torontotickets/year=2017/Parking_Tags_Data_2017_2.csv


    This action triggers an S3 event that sends a message to the SNS topic that you created using the CloudFormation template. This message is consumed by an SQS queue that will be input for the AWS Glue crawler.

    Now, let’s create the AWS Glue crawler.

  3. On the AWS Glue console, choose Crawlers in the navigation pane.
  4. Choose Create crawler.
  5. For Name, enter a name (for example, BlogPostCrawler).
  6. Choose Next.
  7. For Is your data already mapped to Glue tables, select Not yet.
  8. In the Data sources section, choose Add data source.

    For this post, you use an S3 dataset as a source.
  9. For Data source, choose S3.
  10. For Location of S3 data, select In this account.
  11. For S3 path, enter the path to the S3 bucket you created with the CloudFormation template (s3://glue-crawler-blog-YOUR ACCOUNT NUMBER/torontotickets/).
  12. For Subsequent crawler runs, select Crawl based on events.
  13. Enter the SQS queue ARN you created earlier.
  14. Choose Add an S3 data source.
  15. Choose Next.
  16. For Existing IAM role, choose the role you created (GlueCrawlerBlogRole).
  17. Choose Next.

    Now let’s create an AWS Glue database.
  18. Under Target database, choose Add database.
  19. For Name, enter blogdb.
  20. For Location, choose the S3 bucket created by the CloudFormation template.
  21. Choose Create database.
  22. On the Set output and scheduling page, for Target database, choose the database you just created (blogdb).
  23. For Table name prefix, enter blog.
  24. For Maximum table threshold, you can optionally set a limit for the number of tables that this crawler can scan. For this post, we leave this option blank.
  25. For Frequency, choose On demand.
  26. Choose Next.
  27. Review the configuration and choose Create crawler.
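If you prefer to script this setup instead of clicking through the console, the following AWS CLI sketch creates an equivalent event-based crawler. It assumes the names used in this walkthrough; the S3 path and SQS queue ARN are the values from your CloudFormation stack outputs:

    # Create a crawler that recrawls based on S3 events delivered through SQS
    aws glue create-crawler \
      --name BlogPostCrawler \
      --role GlueCrawlerBlogRole \
      --database-name blogdb \
      --table-prefix blog \
      --recrawl-policy '{"RecrawlBehavior": "CRAWL_EVENT_MODE"}' \
      --targets '{"S3Targets": [{"Path": "s3://glue-crawler-blog-<YOUR ACCOUNT NUMBER>/torontotickets/", "EventQueueArn": "<SQS QUEUE ARN>"}]}'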

Run the AWS Glue crawler

To run the crawler, navigate to the crawler on the AWS Glue console.

Choose Run crawler.

On the Crawler runs tab, you can see the current run of the crawler.
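You can also start the crawler and check its state from the AWS CLI; a minimal sketch using the crawler name from this post:

    # Start the crawler on demand
    aws glue start-crawler --name BlogPostCrawler

    # Poll the crawler state (READY, RUNNING, or STOPPING)
    aws glue get-crawler --name BlogPostCrawler --query 'Crawler.State' --output text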

Explore the crawler run history data

When the crawler is complete, you can see the following details:

  • Duration – The exact duration time of the crawler run
  • DPU hours – The number of DPU hours spent during the crawler run, which is useful for calculating costs
  • Table changes – The changes applied to the table, like new columns or partitions

Choose Table changes to see the crawler run summary.

You can see the table blogtorontotickets was created, and also a 2017 partition.

Let’s add more data to the S3 bucket to see how the crawler processes this change.

  1. Open CloudShell.
  2. Run the following command:
    aws s3 cp s3://aws-bigdata-blog/artifacts/gluenewcrawlerui/sourcedata/year=2018/Parking_Tags_Data_2018_1.csv s3://glue-crawler-blog-<YOUR ACCOUNT NUMBER>/torontotickets/year=2018/Parking_Tags_Data_2018_1.csv

  3. Choose Run crawler to run the crawler one more time.

You can see the second run of the crawler listed.

Note that the DPU hours were reduced by more than half; this is because only one partition was scanned and added. Having an event-based crawler helps reduce runtime and cost.

You can choose the Table changes information of the second run to see more details.

Note under Partitions added, the 2018 partition was created.

Additional notes

Keep in mind the following considerations:

  • Crawler history is supported for crawls that have occurred since the launch date of the crawler history feature, and only retains up to 12 months of crawls. Older crawls will not be returned.
  • To set up a crawler using AWS CloudFormation, you can use the following template.
  • You can get all the crawls of a specified crawler by using the ListCrawls API (see the CLI sketch after this list).
  • You can update existing crawlers with a single Amazon S3 target to use this new feature. You can do this either via the AWS Glue console or by calling the update_crawler API.
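The following is a minimal AWS CLI sketch of both operations, using the crawler created earlier in this post; the S3 path and SQS queue ARN are placeholders:

    # Retrieve the crawl history of a crawler
    aws glue list-crawls --crawler-name BlogPostCrawler

    # Switch an existing single-target Amazon S3 crawler to S3 event mode
    aws glue update-crawler \
      --name BlogPostCrawler \
      --recrawl-policy '{"RecrawlBehavior": "CRAWL_EVENT_MODE"}' \
      --targets '{"S3Targets": [{"Path": "s3://<YOUR BUCKET>/torontotickets/", "EventQueueArn": "<SQS QUEUE ARN>"}]}'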

Clean up

To avoid incurring future charges, and to clean up unused roles and policies, delete the resources you created: the CloudFormation stack, S3 bucket, AWS Glue crawler, AWS Glue database, and AWS Glue table.

Conclusion

You can use AWS Glue crawlers to discover datasets, extract schema information, and populate the AWS Glue Data Catalog. AWS Glue crawlers now provide an easier-to-use UI workflow to set up crawlers and also provide metrics associated with past crawler runs to simplify monitoring and auditing. In this post, we provided a CloudFormation template to set up AWS Glue crawlers to use S3 event notifications, which reduces the time and cost needed to incrementally process table data updates in the AWS Glue Data Catalog. We also showed you how to monitor and understand the cost of crawlers.

Special thanks to everyone who contributed to the crawler history launch: Theo Xu, Jessica Cheng and Joseph Barlan.

Happy crawling!


About the authors

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Use an AD FS user and Tableau to securely query data in AWS Lake Formation

Post Syndicated from Jason Nicholls original https://aws.amazon.com/blogs/big-data/use-an-ad-fs-user-and-tableau-to-securely-query-data-in-aws-lake-formation/

Security-conscious customers often adopt a Zero Trust security architecture. Zero Trust is a security model centered on the idea that access to data shouldn’t be solely based on network location, but rather require users and systems to prove their identities and trustworthiness and enforce fine-grained identity-based authorization rules before granting access to applications, data, and other systems.

Some customers rely on third-party identity providers (IdPs) like Active Directory Federated Services (AD FS) as a system to manage credentials and prove identities and trustworthiness. Users can use their AD FS credentials to authenticate to various related yet independent systems, including the AWS Management Console (for more information, see Enabling SAML 2.0 federated users to access the AWS Management Console).

In the context of analytics, some customers extend Zero Trust to data stored in data lakes, which includes the various business intelligence (BI) tools used to access that data. A common data lake pattern is to store data in Amazon Simple Storage Service (Amazon S3) and query the data using Amazon Athena.

AWS Lake Formation allows you to define and enforce access policies at the database, table, and column level when using Athena queries to read data stored in Amazon S3. Lake Formation supports Active Directory and Security Assertion Markup Language (SAML) identity providers such as OKTA and Auth0. Furthermore, Lake Formation securely integrates with the AWS BI service Amazon QuickSight. QuickSight allows you to effortlessly create and publish interactive BI dashboards, and supports authentication via Active Directory. However, if you use alternative BI tools like Tableau, you may want to use your Active Directory credentials to access data stored in Lake Formation.

In this post, we show you how you can use AD FS credentials with Tableau to implement a Zero Trust architecture and securely query data in Amazon S3 and Lake Formation.

Solution overview

In this architecture, user credentials are managed by Active Directory, and not AWS Identity and Access Management (IAM). Although Tableau provides a connector to connect Tableau to Athena, the connector requires an AWS access key ID and an AWS secret access key normally used for programmatic access. Creating an IAM user with programmatic access for use by Tableau is a potential solution; however, some customers have made an architectural decision that access to AWS accounts is done via a federated process using Active Directory, and not an IAM user.

In this post, we show you how you can use the Athena ODBC driver in conjunction with AD FS credentials to query sample data in a newly created data lake. We simulate the environment by enabling federation to AWS using AD FS 3.0 and SAML 2.0. Then we guide you through setting up a data lake using Lake Formation. Finally, we show you how you can configure an ODBC driver for Tableau to securely query the data in the data lake using your AD FS credentials.

Prerequisites

The following prerequisites are required to complete this walkthrough:

Create your environment

To simulate the production environment, we created a standard VPC in Amazon Virtual Private Cloud (Amazon VPC) with one private subnet and one public subnet. You can do the same using the VPC wizard. Our Amazon Elastic Compute Cloud (Amazon EC2) instance running the Tableau client is located in a private subnet and accessible via an EC2 bastion host. For simplicity, connecting out to Amazon S3, AWS Glue, and Athena is done via the NAT gateway and internet gateway set up by the VPC wizard. Optionally, you can replace the NAT gateway with AWS PrivateLink endpoints (AWS Security Token Service (AWS STS), Amazon S3, Athena, and AWS Glue endpoints are required) to make sure traffic remains within the AWS network.

The following diagram illustrates our environment architecture.

After you create your VPC with its private and public subnets, you can continue to build out the other requirements, such as Active Directory and Lake Formation. Let’s begin with Active Directory.

Enable federation to AWS using AD FS 3.0 and SAML 2.0

AD FS 3.0, a component of Windows Server, supports SAML 2.0 and is integrated with IAM. This integration allows Active Directory users to federate to AWS using corporate directory credentials, such as a user name and password from Active Directory. Before you can complete this section, AD FS must be configured and running.

To set up AD FS, follow the instructions in Setting up trust between AD FS and AWS and using Active Directory credentials to connect to Amazon Athena with ODBC driver. The first section of that post explains in detail how to set up AD FS and establish the trust between AD FS and Active Directory. The post ends with setting up an ODBC driver for Athena, which you can skip. The post creates a group named ArunADFSTest. This group relates to a role in your AWS account, which you use later.

When you have successfully verified that you can log in using your IdP, you’re ready to configure your Windows environment ODBC driver to connect to Athena.

Set up a data lake using Lake Formation

Lake Formation is a fully managed service that makes it easy for you to build, secure, and manage data lakes. Lake Formation provides its own permissions model that augments the IAM permissions model. This centrally defined permissions model enables fine-grained access to data stored in data lakes through a simple grant/revoke mechanism. We use this permissions model to grant access to the AD FS role we created earlier.
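As a reference for the console steps that follow, the equivalent grant can also be issued with the AWS CLI. This is a minimal sketch, assuming the ArunADFSTest role and the sampledb database used in this walkthrough; the account ID is a placeholder:

    # Grant SELECT and DESCRIBE on all tables in sampledb to the AD FS role
    aws lakeformation grant-permissions \
      --principal DataLakePrincipalIdentifier=arn:aws:iam::<ACCOUNT NUMBER>:role/ArunADFSTest \
      --resource '{"Table": {"DatabaseName": "sampledb", "TableWildcard": {}}}' \
      --permissions SELECT DESCRIBE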

  1. On the Lake Formation console, you’re prompted with a welcome box the first time you access Lake Formation.
    The box asks you to select the initial administrative user and roles.
  2. Choose Add myself and choose Get Started.
    We use the sample database provided by Lake Formation, but you’re welcome to use your own dataset. For instructions on loading your own dataset, see Getting Started with Lake Formation. With Lake Formation configured, we must grant read access to the AD FS role (ArunADFSTest) we created in the previous step.
  3. In the navigation pane, choose Databases.
  4. Select the database sampledb.
  5. On the Actions menu, choose Grant.

    We grant the ArunADFSTest role access to sampledb.
  6. For Principals, select IAM users and roles.
  7. For IAM users and roles, choose the role ArunADFSTest.
  8. Select Named data catalog resources.
  9. For Databases, choose the database sampledb.
  12. For Tables, choose All tables.
  11. Set the table permissions to Select and Describe.
  12. For Data permissions, select All data access.
  13. Choose Grant.
    Our AD FS user assumes the role ArunADFSTest, which has been granted access to sampledb by Lake Formation. However, the ArunADFSTest role requires access to Lake Formation, Athena, AWS Glue, and Amazon S3. Following the practice of least privilege, AWS defines policies for specific Lake Formation personas. Our user fits the Data Analyst persona, which requires enough permissions to run queries.
  14. Add the AmazonAthenaFullAccess managed policy (for instructions, see Adding and removing IAM identity permissions) and the following inline policy to the ArunADFSTest role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "lakeformation:GetDataAccess",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:SearchTables",
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetPartitions",
                    "lakeformation:GetResourceLFTags",
                    "lakeformation:ListLFTags",
                    "lakeformation:GetLFTag",
                    "lakeformation:SearchTablesByLFTags",
                    "lakeformation:SearchDatabasesByLFTags"
                ],
                "Resource": "*"
            }
        ]
    }

    Each time Athena runs a query, it stores the results in an S3 bucket, which is configured as the query result location in Athena.

  15. Create an S3 bucket, and in this new bucket create a new folder called athena_results.
  16. Update the settings on the Athena console to use your newly created folder.
    Tableau uses Athena to run the query and read the results from Amazon S3, which means that the ArunADFSTest role requires access to your newly created S3 folder.
  17. Attach the following inline policy to the ArunADFSTest role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:PutObjectAcl"
                ],
                "Resource": "arn:aws:s3:::[BUCKET_NAME]/athena_results/*"
            }
        ]
    }
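If you script this step, the managed policy and the two inline policies shown above can be attached with the AWS CLI along the following lines; the inline policy names and file names are illustrative:

    # Attach the Athena managed policy
    aws iam attach-role-policy \
      --role-name ArunADFSTest \
      --policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess

    # Attach the Lake Formation and AWS Glue inline policy
    aws iam put-role-policy \
      --role-name ArunADFSTest \
      --policy-name LakeFormationDataAccess \
      --policy-document file://lakeformation-access.json

    # Attach the Athena results bucket inline policy
    aws iam put-role-policy \
      --role-name ArunADFSTest \
      --policy-name AthenaResultsAccess \
      --policy-document file://athena-results-access.json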

Our AD FS user can now assume a role that has enough privileges to query the sample database. The next step is to configure the ODBC driver on the client.

Configure an Athena ODBC driver

Athena is a serverless, interactive query service that allows you to analyze your data in Amazon S3 using standard Structured Query Language (SQL). You can use Athena to directly query data that is located in Amazon S3 or data that is registered with Lake Formation. Athena provides ODBC and JDBC drivers to integrate with your data analytics tools (such as Microsoft Power BI, Tableau, or SQL Workbench) so you can gain insights about your data in minutes.
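Before wiring up the ODBC driver, you can optionally sanity-check Athena access from the AWS CLI. The following minimal sketch assumes the sampledb database and elb_logs table used later in this post, and the athena_results folder created earlier; note that it runs with the credentials of whoever invokes the CLI, not the federated AD FS path used by the ODBC driver:

    # Submit a test query and capture its execution ID
    QUERY_ID=$(aws athena start-query-execution \
      --query-string "SELECT * FROM elb_logs LIMIT 10" \
      --query-execution-context Database=sampledb \
      --result-configuration OutputLocation=s3://[BUCKET_NAME]/athena_results/ \
      --query 'QueryExecutionId' --output text)

    # Fetch the results once the query completes
    aws athena get-query-results --query-execution-id "$QUERY_ID"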

To connect to our Lake Formation environment, we first need to install and configure the Athena ODBC driver on our Windows environment.

  1. Download the Athena ODBC driver relevant to your Windows environment.
  2. Install the driver by choosing the driver file you downloaded (in our case, Simba+Athena+1.1+64-bit.msi).
  3. Choose Next on the welcome page.
  4. Read the End-User License Agreement, and if you agree to it, select I Accept the terms in the License Agreement and choose Next.
  5. Leave the default installation location for the ODBC driver and choose Next.
  6. Choose Install to begin the installation.
  7. If the User Access Control pop-up appears, choose Yes to allow the driver installation to continue.
  8. When the driver installation is complete, choose Finish to close the installer.
  9. Open the Windows ODBC configuration application by selecting the Start bar and searching for ODBC.
  10. Open the version corresponding to the Athena ODBC version you installed, in our case 64 bit.
  11. On the User DSN tab, choose Add.
  12. Choose the Simba Athena ODBC driver and choose Finish.

Configure the ODBC driver to use AD FS authentication

We now need to configure the driver.

  1. Choose the driver on the Driver configuration page.
  2. For Data Source Name, enter sampledb.
  3. For Description, enter Lake Formation Sample Database.
  4. For AWS Region, enter eu-west-1 or the Region you used when configuring Lake Formation.
  5. For Metadata Retrieval Method, choose Auto.
  6. For S3 Output Location, enter s3://[BUCKET_NAME]/athena_results/.
  7. For Encryption Options, choose NOT_SET.
  8. Clear the rest of the options.
  9. Choose Authentication Options.
  10. For Authentication Type, choose ADFS.
  11. For User, enter [DOMAIN]\[USERNAME].
  12. For Password, enter your domain user password.
  13. For Preferred Role, enter arn:aws:iam::[ACCOUNT NUMBER]:role/ArunADFSTest.
    The preferred role is the same role configured in the previous section (ArunADFSTest).
  14. For IdP Host, enter the AD federation URL you configured during AD FS setup.
  15. For IdP Port, enter 443.
  16. Select SSL Insecure.
  17. Choose OK.
  18. Choose Test on the initial configuration page to test the connection.
  19. When you see a success confirmation, choose OK.

We can now connect to our Lake Formation sample database from our desktop environment using the Athena ODBC driver. The next step is to use Tableau to query our data using the ODBC connection.

Connect to your data using Tableau

To connect to your data, complete the following steps:

  1. Open your Tableau Desktop edition.
  2. Under To a Server, choose More.
  3. On the list of available Tableau installed connectors, choose Other Databases (ODBC).
  4. Choose the ODBC database you created earlier.
  5. Choose Connect.
  6. Choose Sign In.

    When the Tableau workbook opens, select the database, schema, and table that you want to query.
  7. For Database, choose the database as listed in the ODBC setup (for this post, AwsDataCatalog).
  8. For Schema, choose your schema (sampledb).
  9. For Table, search for and choose your table (elb_logs).
  10. Drag the table to the work area to start your query and further report development.

Clean up

AWS Lake Formation provides database-, table-, column-, and tag-based access controls, and cross-account sharing at no charge. Lake Formation charges a fee for transaction requests and for metadata storage. In addition to providing a consistent view of data and enforcing row-level and cell-level security, the Lake Formation Storage API scans data in Amazon S3 and applies row and cell filters before returning results to applications. There is a fee for this filtering. To make sure that you’re not charged for any of the services that you no longer need, stop any EC2 instances that you created. Remove any objects in Amazon S3 you no longer require, because you pay for objects stored in S3 buckets.

Lastly, delete any Active Directory instances you may have created.

Conclusion

Lake Formation makes it simple to set up a secure data lake and then use the data lake with your choice of analytics and machine learning services, including Tableau. In this post, we showed you how you can connect to your data lake using AD FS credentials in a simple and secure way by using the Athena ODBC driver. Your AD FS user is configured within the ODBC driver, which then assumes a role in AWS. This role is granted access to only the data you require via Lake Formation.

To learn more about Lake Formation, see the Lake Formation Developer Guide or follow the Lake Formation workshop.


About the Authors

Jason Nicholls is an Enterprise Solutions Architect at AWS. He’s passionate about building scalable web and mobile applications on AWS. He started coding on a Commodore VIC 20, which led to a career in software development. Jason holds an MSc in Computer Science with a specialization in coevolved genetic programming. He is based in Johannesburg, South Africa.

Francois van Rensburg is a Partner Management Solutions Architect at AWS. He has spent the last decade helping enterprise organizations successfully migrate to the cloud. He is passionate about networking and all things cloud. He started as a Cobol programmer and has built everything from software to data centers. He is based in Denver, Colorado.

Build a modern data architecture and data mesh pattern at scale using AWS Lake Formation tag-based access control

Post Syndicated from Nivas Shankar original https://aws.amazon.com/blogs/big-data/build-a-modern-data-architecture-and-data-mesh-pattern-at-scale-using-aws-lake-formation-tag-based-access-control/

Customers are exploring building a data mesh on their AWS platform using AWS Lake Formation and sharing their data lakes across the organization. A data mesh architecture empowers business units (organized into domains) to have high ownership and autonomy for the technologies they use, while providing technology that enforces data security policies both within and between domains through data sharing. Data consumers request access to these data products, which are approved by producer owners within a framework that provides decentralized governance, but centralized monitoring and auditing of the data sharing process. As the number of tables and users increase, data stewards and administrators are looking for ways to manage permissions on data lakes easily at scale. Customers are struggling with “role explosion” and need to manage hundreds or even thousands of user permissions to control data access. For example, for an account with 1,000 resources and 100 principals, the data steward would have to create and manage up to 100,000 policy statements. As new principals and resources get added or deleted, these policies have to be updated to keep the permissions current.

Lake Formation tag-based access control (TBAC) solves this problem by allowing data stewards to create LF-tags (based on their business needs) that are attached to resources. You can create policies on a smaller number of logical tags instead of specifying policies on named resources. LF-tags enable you to categorize and explore data based on taxonomies, which reduces policy complexity and scales permissions management. You can create and manage policies with tens of logical tags instead of the thousands of resources. Lake Formation TBAC decouples policy creation from resource creation, which helps data stewards manage permissions on many databases, tables, and columns by removing the need to update policies every time a new resource is added to the data lake. Finally, TBAC allows you to create policies even before the resources come into existence. All you have to do is tag the resource with the right LF-tag to make sure existing policies manage it.

This post focuses on managing permissions on data lakes at scale using LF-tags in Lake Formation across accounts. Data stewards within the producing accounts have functional ownership of the AWS Glue Data Catalog tables they manage and the Lake Formation permissions they administer, based on the functions they support, and can grant access to various consumers, external organizations, and accounts. You can now define LF-tags; associate them at the database, table, or column level; and then share controlled access across analytics, machine learning (ML), and extract, transform, and load (ETL) services for consumption. LF-tags let governance scale easily by replacing the policy definitions of thousands of resources with a few logical tags.

Solution overview

LF-tag access has three key components:

  • Tag ontology and classification – Data stewards can define an LF-tag ontology based on their business needs and grant access based on LF-tags to AWS Identity and Access Management (IAM) principals and SAML principals or groups
  • Tagging resources – Data engineers can easily create, automate, implement, and track all LF-tags and permissions against AWS Glue catalogs through the Lake Formation API
  • Policy evaluation – Lake Formation evaluates the effective permissions based on LF-tags at query time and allows access to data through consuming services such as Amazon Athena, AWS Glue, Amazon Redshift Spectrum, Amazon SageMaker Data Wrangler, and Amazon EMR Studio, based on the effective permissions granted across multiple accounts or organization-level data shares

The following diagram illustrates the relationship between the data producer, data consumer, and central governance accounts.

In the preceding diagram, the central governance account box shows the tag ontology and the associated tag colors. These tags are shared with both the producers and consumers, who use them to tag resources.

In this post, we consider two databases, as shown in the following figure, and show how you can set up Lake Formation tables and create Lake Formation tag-based policies.

The solution includes the following high-level steps:

  1. The data mesh owner defines the central tag ontology with LF-tags:
    1. LOB – Classified at the line of business (LOB) level (database)
    2. LOB:Function – Classified at the business function level (table)
    3. Classification – Classification of the functional data level (columns)
  2. The data mesh owner assigns respective permission levels to the product data steward to use centrally defined tags and associates permission to their database and tables with different LF-tags.
  3. The producer steward in the central account owns two databases: lob = card and lob = retail.
  4. The producer steward switches to the data producer account to add table metadata using an AWS Glue crawler.
  5. The producer steward associates column-level classifications Classification = Sensitive or Classification = Non-Sensitive to tables under the Card database in the central account.
  6. The producer steward associates table-level tags lob:retail = Customer and lob:retail = Reviews to tables under the Retail database in the central account.
  7. The consumer admin grants fine-grained access control to different data analysts.

With this configuration, the consumer analyst can focus on performing analysis with the right data.

Set up resources with AWS CloudFormation

We provide three AWS CloudFormation templates in this post: for the producer account, central account, and consumer account. Deploy the CloudFormation templates in the order of producer, central, and consumer, because there are dependencies between the templates.

The CloudFormation template for the central account generates the following resources:

  • Two IAM users:
    • DataMeshOwner
    • ProducerSteward
  • Grant DataMeshOwner as the Lake Formation admin
  • One IAM role:
    • LFRegisterLocationServiceRole
  • Two IAM policies:
    • ProducerStewardPolicy
    • S3DataLakePolicy
  • Create the databases “retail” and “cards” for ProducerSteward to manage in the Data Catalog
  • Share the data location permission with the producer account to manage the Data Catalog

The CloudFormation template for the producer account generates the following resources:

  • Two Amazon Simple Storage Service (Amazon S3) buckets:
    • RetailBucket, which holds two tables:
      • Customer_Info
      • Customer_Review
    • CardsBucket, which holds one table:
      • Credit_Card
  • Allow Amazon S3 bucket access for the central account Lake Formation service role.
  • Two AWS Glue crawlers
  • One AWS Glue crawler service role
  • Grant permissions on the S3 bucket locations tbac-cards-<ProducerAccountID>-<aws-region> and tbac-retail-<ProducerAccountID>-<aws-region> to the AWS Glue crawler role
  • One producer steward IAM user

The CloudFormation template for the consumer account generates the following resources:

  • One S3 bucket:
    • <AWS Account ID>-<aws-region>-athena-logs
  • One Athena workgroup:
    • consumer-workgroup
  • Three IAM users:
    • ConsumerAdmin
    • ConsumerAnalyst1
    • ConsumerAnalyst2

Launch the CloudFormation stack in the central account

To create resources in the central account, complete the following steps:

  1. Sign in to the central account’s AWS CloudFormation console in the target Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For Stack name, enter stack-central.
  5. For DataMeshOwnerUserPassword, enter the password you want for the data lake admin IAM user in the central account.
  6. For ProducerStewardUserPassword, enter the password you want for the producer steward IAM user in the producer account.
  7. For ProducerAWSAccount, enter the AWS <ProducerAccountID>.
  8. Choose Next.
  9. On the next page, choose Next.
  10. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  11. Choose Create stack.
  12. Collect the value for LFRegisterLocationServiceRole on the stack’s Outputs tab.

Launch the CloudFormation stack in the producer account

To set up resources in the producer account, complete the following steps:

  1. Sign in to the producer account’s AWS CloudFormation console in the target Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For Stack name, enter stack-producer.
  5. For CentralAccountID, enter the value of <CentralAccountID>.
  6. For CentralAccountLFServiceRole, enter the value of LFRegisterLocationServiceRole collected from the stack-central outputs.
  7. For LFDatabaseName, keep the default value of the tbac database name.
  8. For ProducerStewardUserPassword, enter the password you want for the data lake admin IAM user on the producer account.
  9. Choose Next.
  10. On the next page, choose Next.
  11. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  12. Choose Create stack.

Launch the CloudFormation stack in the consumer account

To create resources in the consumer account, complete the following steps:

  1. Sign in to the consumer account’s AWS CloudFormation console in the target Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For Stack name, enter stack-consumer.
  5. For ConsumerAdminUserName and ConsumerAdminUserPassword, enter the user name and password you want for the data lake admin IAM user.
  6. For ConsumerAnalyst1UserName and ConsumerAnalyst1UserPassword, enter the user name and password you want for the consumeranalyst1 IAM user.
  7. For ConsumerAnalyst2UserName and ConsumerAnalyst2UserPassword, enter the user name and password you want for the consumeranalyst2 IAM user.
  8. Choose Next.
  9. On the next page, choose Next.
  10. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  11. Choose Create stack.

Configure Lake Formation cross-account sharing

After you create your resources with AWS CloudFormation, you perform the following steps in the producer and central account to set up Lake Formation cross-account sharing.

Central governance account

In the central account, complete the following steps:

  1. Sign in to the Lake Formation console as admin.
  2. In the navigation pane, choose Permissions, then choose Administrative roles and tasks.

The CloudFormation template added the data mesh owner as the data lake administrator.

Next, we update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control.

  1. In the navigation pane, under Data catalog, choose Settings.
  2. Uncheck Use only IAM access control for new databases.
  3. Uncheck Use only IAM access control for new tables in new databases.
  4. Choose Save.

Next, we need to set up the AWS Glue Data Catalog resource policy to grant cross-account access to Data Catalog resources.

  1. Use the following policy, and replace the account number and Region with your own values:
  2. As described in Lake Formation Tag-Based Access Control Cross-Account Prerequisites, before you can use the tag-based access control method to grant cross-account access to resources, you must add the following JSON permissions object to the AWS Glue Data Catalog resource policy in the producer account. This gives the consumer account permission to access the Data Catalog when glue:EvaluatedByLakeFormationTags is true. The condition also evaluates to true for resources on which you granted permissions to the consumer account using Lake Formation LF-tags. This policy is required for every AWS account that you’re granting permissions to. We discuss the full IAM policy later in this post.
    {
       "PolicyInJson": "{\"Version\" : \"2012-10-17\",\"Statement\" : [ {\"Effect\" : \"Allow\",\"Principal\" : {\"AWS\" : [\"arn:aws:iam::<ProducerAccountID>:root\",\"arn:aws:iam::<ConsumerAccountID>:root\"]},\"Action\" : \"glue:*\",\"Resource\" : [ \"arn:aws:glue:<aws-region>:<CentralAccountID>:table/*\", \"arn:aws:glue:<aws-region>:<CentralAccountID>:database/*\", \"arn:aws:glue:<aws-region>:<CentralAccountID>:catalog\" ],\"Condition\" : {\"Bool\" : {\"glue:EvaluatedByLakeFormationTags\" : \"true\"}}}, {\"Effect\" : \"Allow\",\"Principal\" : {\"Service\" : \"ram.amazonaws.com\"},\"Action\" : \"glue:ShareResource\",\"Resource\" : [ \"arn:aws:glue:<aws-region>:<CentralAccountID>:table/*\", \"arn:aws:glue:<aws-region>:<CentralAccountID>:database/*\", \"arn:aws:glue:<aws-region>:<CentralAccountID>:catalog\" ]} ]}",
       "EnableHybrid": "TRUE"
    }

Replace the <aws-region>, <ProducerAccountID>, <ConsumerAccountID>, and <CentralAccountID> values in the preceding policy as appropriate and save it in a file called policy.json.

  1. Next, run the following AWS Command Line Interface (AWS CLI) command on AWS CloudShell.
aws glue put-resource-policy --region <aws-region> --cli-input-json file://policy.json

For more information about this policy, see put-resource-policy.

  1. Next, we verify the two source data S3 buckets are registered as data lake locations in the central account. This is completed by the CloudFormation template.
  2. Under Register and ingest in the navigation pane, choose Data lake locations.

You should see the two S3 buckets registered under the data lake locations.

Configure Lake Formation Data Catalog settings in the central account

After we complete all the prerequisites, we start the data mesh configuration. We log in as DataMeshOwner in the central account.

Define LF-tags

DataMeshOwner creates the tag ontology by defining LF-tags. Complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, under Administrative roles and tasks, choose LF-Tags.
  2. Choose Add LF-tags.
  3. For Key, enter LOB and for Values, choose Retail and Cards.
  4. Choose Add LF-tag.
  5. Repeat these steps to add the key LOB:Retail and values Customer and Reviews, and the key Classification with values Sensitive and Non-Sensitive.
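The same tag ontology can also be created programmatically. The following is a minimal AWS CLI sketch, run as DataMeshOwner in the central account:

    # Create the three LF-tag keys and their allowed values
    aws lakeformation create-lf-tag --tag-key LOB --tag-values Retail Cards
    aws lakeformation create-lf-tag --tag-key LOB:Retail --tag-values Customer Reviews
    aws lakeformation create-lf-tag --tag-key Classification --tag-values Sensitive Non-Sensitive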

Now we complete the configuration of the tag ontology.

Grant permissions

We grant ProducerSteward in the central accounts describe and associate permissions on the preceding tag ontology. This enables ProducerSteward to view the LF-tags and assign them to Data Catalog resources (databases, tables, and columns). ProducerSteward in the central account can further grant the permission to ProducerSteward in the producer account. For more information, see Granting, Revoking, and Listing LF-Tag Permissions. When you have multiple producers, grant the relevant tags to each steward.

  1. Under Permissions in the navigation pane, under Administrative roles and tasks, choose LF-tag permissions.
  2. Choose Grant.
  3. For IAM users and roles, choose the ProducerSteward user.
  4. In the LF-Tags section, add all three key-values:
    1. Key LOB with values Retail and Cards.
    2. Key LOB:Retail with values Customer and Reviews.
    3. Key Classification with values Sensitive and Non-Sensitive.
  5. For Permissions, select Describe and Associate for both LF-tag permissions and Grantable permissions.
  6. Choose Grant.

Next, we grant ProducerSteward tag-based data lake permissions. This enables ProducerSteward to create, alter, and drop tables in the databases with corresponding tags. ProducerSteward in the producer account can further grant the permission across accounts.

  1. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  2. For Principals, choose IAM users and roles, and choose ProducerSteward.
  3. For LF-tags or catalog resources, select Resources matched by LF-Tags (recommended).
  4. Choose Add LF-Tag.
  5. For Key, choose LOB and for Values, choose Cards.
  6. For Database permissions, select the Super permission because ProducerSteward owns the producer databases.

This permission allows a principal to perform every supported Lake Formation operation on the database. Use this admin permission when a principal is trusted with all operations.

  1. Select Super under Grantable permissions so the ProducerSteward user can grant database-level permissions to the producer and consumer accounts.
  2. For Table permissions, select Super.
  3. Select Super permission under Grantable permissions.
  4. Choose Grant.
  5. Repeat these steps for key LOB and value Retail.
  6. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  7. For Principals, choose IAM users and roles, and choose ProducerSteward.
  8. For LF-tags or catalog resources, select Resources matched by LF-Tags (recommended).
  9. Add the key LOB with value Cards, and the key Classification with values Sensitive and Non-Sensitive.
  10. For Database permissions, select Super.
  11. Select Super permission under Grantable permissions.
  12. For Table permissions, select Super.
  13. Select Super under Grantable permissions.
  14. Choose Grant.

This gives ProducerSteward fine-grained permission expression on columns with either Sensitive or Non-sensitive tags.

  1. Repeat these steps for key LOB and value Retail, and key LOB:Retail and values Reviews and Customer.

This gives ProducerSteward fine-grained permission expression on tables with either the Reviews or Customer tag.
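For reference, each of these tag-based grants maps to a single GrantPermissions call. The following is a minimal AWS CLI sketch of the LOB=Cards database grant; the account ID is a placeholder, and the Super permission is expressed as ALL in the API:

    # Grant Super (ALL) with grant option on resources tagged LOB=Cards
    aws lakeformation grant-permissions \
      --principal DataLakePrincipalIdentifier=arn:aws:iam::<CentralAccountID>:user/ProducerSteward \
      --resource '{"LFTagPolicy": {"ResourceType": "DATABASE", "Expression": [{"TagKey": "LOB", "TagValues": ["Cards"]}]}}' \
      --permissions ALL \
      --permissions-with-grant-option ALL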

Producer data steward actions in the central account

Next, we log in as the ProducerSteward user in the central account and create skeleton databases.

  1. Sign in to the Lake Formation console as ProducerSteward.
  2. In the navigation pane, under Data catalog, select Databases.
  3. Choose the cards database.
  4. On the Actions menu, choose Edit LF-tags.
  5. Choose Assign new LF-tag.
  6. For Assigned Keys, enter LOB and for Values, choose Cards.
  7. Choose Save.

This assigns the LOB=Cards tag to the Cards database.

  1. Repeat these steps for Retail database and assign the LOB=Retail tag to the Retail database.
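Assigning a tag to a database is a single API call. The following is a minimal AWS CLI sketch for the cards database; the retail database is tagged the same way with LOB=Retail:

    # Attach the LOB=Cards tag to the cards database
    aws lakeformation add-lf-tags-to-resource \
      --resource '{"Database": {"Name": "cards"}}' \
      --lf-tags '[{"TagKey": "LOB", "TagValues": ["Cards"]}]'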

Next, we share the LF-tags and data lake permissions with the producer account so that ProducerSteward in the producer account can run AWS Glue crawlers and generate tables in the preceding skeleton databases.

  1. Under Permissions in the navigation pane, under Administrative roles and tasks, choose LF-tag permissions.
  2. Choose Grant.
  3. For Principals, select External accounts.
  4. For AWS account or AWS organization, enter the account ID for the producer account.
  5. In the LF-Tags section, we only need to add database-level tags.
  6. For Key, enter LOB and for Values, choose Retail and Cards.
  7. For Permissions, choose Describe and Associate for both LF-tag permissions and Grantable permissions.
  8. Choose Grant.
  9. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  10. For Principals, select External accounts.
  11. For AWS account or AWS organization, enter the account ID for the producer account.
  12. For LF-tags or catalog resources, select Resources matched by LF-Tags (recommended).
  13. Choose Add LF-Tag.
  14. Choose the key LOB and value Cards.
  15. For Database permissions, select Create table and Describe because the ProducerSteward user in the producer account will add tables in the database.
  16. Select Create table and Describe under Grantable permissions so the ProducerSteward user can further grant the permission to the AWS Glue crawler.
  17. For Table permissions, select all the permissions.
  18. Select all the permissions under Grantable permissions.
  19. Choose Grant.
  20. Repeat these steps for LOB=Retail.

Now the Lake Formation administrator on the producer account side has the right permissions to add tables.

Crawl source tables in the producer account

Next, we log in as the ProducerSteward user in the producer account to crawl the source tables for the Cards and Retail databases.

  1. Sign in to the Lake Formation console as ProducerSteward.
  2. In the navigation pane, under Administrative Roles and Tasks, verify that ProducerSteward is configured as the data lake administrator.
  3. In the navigation pane, under Permissions, choose Administrative roles and tasks, then choose LF-Tags.

You can verify the root-level LOB tags that were shared with the producer account.

  1. In the navigation pane, under Data catalog, select Databases.

You can verify the two databases cards and retail that were shared with the producer account from the previous step.

Now, we create resource links in the producer account for these two databases. These links point to the shared databases and are used by the AWS Glue crawlers to create the tables. First, we create a resource link for the cards database.

  1. Select the cards database and on the Actions menu, choose Create resource link.
  2. For Resource link name, enter rl_cards.
  3. Choose Create.
  4. Repeat these steps to create a resource link for the retail database.

After the resource link creation, you should see both the resource link databases as shown in the following screenshot.

Next, we need to grant permissions to the AWS Glue crawler role so that the crawler can crawl the source bucket and create the tables.

  1. Select the rl_cards database and on the Actions menu, choose Grant.
  2. In the Grant data permissions section, select IAM users and roles, and choose the AWS Glue crawler role that was created by the CloudFormation template (for example, stack-producer-AWSGlueServiceRoleDefault-xxxxxx).
  3. For Databases, choose rl_cards.
  4. For Resource link permissions, select Describe.
  5. Choose Grant.
  6. Repeat these steps for rl_retail.
  7. Next, in the navigation pane, choose Data lake Permissions and choose Grant.
  8. For IAM users and roles, choose the role stack-producer-AWSGlueServiceRoleDefault-XXXX.
  9. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  10. Enter the key LOB and values Retail and Cards.
  11. For Database permissions, select Create table and Describe.
  12. For Table permissions, choose Select, Describe, and Alter.
  13. Choose Grant.

Next, we verify that permissions on the S3 bucket locations corresponding to the cards and retail producers have been granted to the AWS Glue crawler role. This is completed by the CloudFormation template.

In the navigation pane, under Permissions, choose Data locations; you should see both locations.

Now we’re ready to run the crawlers. We configure the crawlers that the CloudFormation template created, to point to these resource link databases.

  1. On the AWS Glue console, under Data catalog in the navigation pane, choose Crawlers.

The two crawlers you created should be listed.

  1. Select the crawler for the cards database CardsCrawler-xxxxxxxxxxxx and on the Action menu, choose Edit crawler.
  2. For the input data store, choose the S3 bucket for the cards producer.
  3. For IAM role, choose the AWS Glue service role created by the CloudFormation template.
  4. For Schedule, choose Run on demand.
  5. For the output database, choose the resource link database rl_cards corresponding to the cards database.
  6. Verify all the information and choose Save.
  7. Repeat these steps for the crawler corresponding to the retail producer.
  8. Select both crawlers and choose Run crawler.

When the crawlers finish, they create tables corresponding to each producer in their respective resource link databases. The table schemas are present in the shared database in the central account.

Configure Lake Formation tags in the central account

Next, we perform fine-grained access control for the tables that the crawlers created to support different consumption use cases using Lake Formation tags.

Tag columns

First, we tag the columns of the cards table in the cards database using the Classification tag that we created earlier.

  1. Log in to central account as IAM user ProducerSteward.
  2. On the Lake Formation console, in the navigation pane, choose Data catalog and then choose Tables.

You should see three tables: the cards table corresponding to cards database, and the reviews and customers tables corresponding to the retail database.

  1. Choose the cards table.
  2. Navigate to the Schema section and choose Edit schema.
  3. Select all the columns and choose Edit tags.
  4. Choose Assign new LF-Tag.
  5. For Assigned keys, enter Classification and for Values, choose Non-Sensitive.
  6. Choose Save.

Next, we selectively tag the sensitive columns.

  1. In the Edit schema section, select columns card number, card holder’s name, cvv/cvv2, and card pin.
  2. Choose Edit tags.
  3. For Assigned keys, enter Classification and for Values, choose Sensitive.
  4. Choose Save.
  5. Choose Save as new version to save the schema.

Tag tables

Next, we tag the reviews and customer tables under the retail database using the LOB:retail tag that we created earlier.

  1. On the Tables page, select the reviews table and on the Actions menu, choose Edit LF-tags.
  2. Choose Assign new LF-Tag.
  3. For Assigned keys, choose LOB:Retail and for Values, choose Reviews.
  4. Choose Save.
  5. Repeat the steps for the customer table. Choose LOB:Retail for the key and Customer for the value.

Grant tag permissions

Next, grant LF-tag permissions to the external consumer account.

  1. On the Lake Formation console, in the navigation pane, choose Permissions, then choose Administrative roles and tasks and choose LF-tag permissions.
  2. Choose Grant.
  3. For Principals, select External accounts.
  4. For AWS account or AWS organization, enter the AWS account number corresponding to the consumer account.
  5. For LF-Tags, choose Add LF-Tag.
  6. For Key, choose LOB and for Values, choose Retail and Cards.
  7. Repeat these steps for key Classification with values Non-Sensitive and Sensitive, and key LOB:Retail with values Reviews and Customer.
  8. For Permissions, choose Describe.
  9. For Grantable permissions, choose Describe.
  10. Choose Grant.
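The following is a minimal AWS CLI sketch of one of these cross-account tag grants; the consumer account ID is a placeholder and is passed directly as the principal:

    # Grant Describe on the LOB tag values to the consumer account
    aws lakeformation grant-permissions \
      --principal DataLakePrincipalIdentifier=<ConsumerAccountID> \
      --resource '{"LFTag": {"TagKey": "LOB", "TagValues": ["Retail", "Cards"]}}' \
      --permissions DESCRIBE \
      --permissions-with-grant-option DESCRIBE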

Next, we grant Lake Formation policy tag expression permissions to the external consumer account.

  1. In the navigation pane, choose Data lake permissions and choose Grant.
  2. In the Principals section, select External accounts.
  3. For AWS account or AWS organization, enter the AWS account number corresponding to the consumer account.
  4. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  5. Choose Add LF-Tag.
  6. For Key, choose LOB and for Values, choose Retail.
  7. For Database permissions, select Describe.
  8. For Grantable permissions, select Describe.
  9. Choose Grant.
  10. Repeat these steps to grant permissions on the policy tag expression LOB=Cards.

Next, we grant table permissions.

  1. In the navigation pane, choose Data lake permissions and choose Grant.
  2. For Principals, select External accounts.
  3. For AWS account or AWS organization, enter the AWS account number corresponding to the consumer account.
  4. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  5. Add key LOB with value Retail, and key LOB:Retail with values Reviews and Customer.
  6. For Table Permissions, select Select and Describe.
  7. For Grantable permissions, select Select and Describe.
  8. Choose Grant.
  9. Repeat these steps to grant permissions on the policy tag expressions LOB=Cards and Classification = (Non-Sensitive or Sensitive).

Share and consume tables in the consumer account

When you sign in to the Lake Formation console in the consumer account as ConsumerAdmin, you can see all the tags and the corresponding values that were shared by the producer.

In these next steps, we share and consume tables in the consumer account.

Create a resource link to the shared database

On the Databases page on the Lake Formation console, you can see all the databases that were shared to the consumer account. To create a resource link, complete the following steps:

  1. On the Databases page, select the cards database and on the Actions menu, choose Create resource link.
  2. Enter the resource link name as rl_cards.
  3. Leave the shared database and shared database’s owner ID as default.
  4. Choose Create.
  5. Follow the same process to create the rl_retail resource link.

Grant Describe permission to ConsumerAnalyst1

To grant Describe permissions on resource link databases to ConsumerAnalyst1, complete the following steps:

  1. On the Databases page, select the resource database rl_retail and on the Actions menu, choose Grant.
  2. In the Grant data permissions section, select IAM users and roles.
  3. Choose the role ConsumerAnalyst1.
  4. In the Resource link permissions section, select Describe.
  5. Choose Grant.
  6. Follow the same steps to grant rl_cards access to ConsumerAnalyst2.

Grant Tag permissions to ConsumerAnalyst1

To grant Tag permissions on the LOB:Retail Customer tag to ConsumerAnalyst1 to access the customers table, complete the following steps:

  1. On the Lake Formation console, on the Data permission page, choose Grant.
  2. In the Grant data permissions section, select IAM users and roles.
  3. Choose the role ConsumerAnalyst1.
  4. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  5. Add the key LOB with value Retail, and the key LOB:Retail with value Customer.
  6. For Table permissions, select Select and Describe.
  7. Choose Grant.

Access to the customers table inside the rl_retail database is granted to ConsumerAnalyst1.

Grant Tag permissions to ConsumerAnalyst2

To grant Tag permissions on the Classification:Sensitive tag to ConsumerAnalyst2 to access attributes tagged as Sensitive in the cards table, complete the following steps:

  1. On the Lake Formation console, on the Data permission page, choose Grant.
  2. In the Grant data permissions section, select IAM users and roles.
  3. Choose the role ConsumerAnalyst2.
  4. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  5. Add the key LOB with value Cards, and the key Classification with value Sensitive.
  6. For Table permissions, select Select and Describe.
  7. Choose Grant.

Access to attributes tagged as sensitive in the cards table inside the rl_cards database is granted to ConsumerAnalyst2.

Validate the access to ConsumerAnalyst1

To confirm ConsumerAnalyst1 access, complete the following steps:

  1. On the Athena console, for Workgroup, choose consumer-workgroup.
  2. Choose Acknowledge.
  3. Choose the database rl_retail.

You should be able to see the customers table and be able to query.

Validate the access to ConsumerAnalyst2

To confirm ConsumerAnalyst2 access, complete the following steps:

  1. On the Athena console, for Workgroup, choose consumer-workgroup.
  2. Choose Acknowledge.
  3. Choose the database rl_cards.

You should be able to see only the sensitive attributes from the cards table.

As a thought experiment, you can also check the Lake Formation tag-based access policy behavior on columns for which the user doesn’t have policy grants.

When an untagged column is selected from the table rl_cards.cards, Athena returns an error. For example, you can run the following query to select the untagged, non-sensitive column issuing_bank:

SELECT issuing_bank FROM "rl_cards"."cards" limit 10;

Conclusion

In this post, we explained how to create a tag-based access control policy in Lake Formation using an AWS public dataset. In addition, we explained how to query tables, databases, and columns that have Lake Formation tag-based access policies associated with them.

You can generalize these steps to share resources across accounts. You can also use these steps to grant permissions to SAML identities.

A data mesh approach provides a method by which organizations can share data across business units. Each domain is responsible for the ingestion, processing, and serving of their data. They are data owners and domain experts, and are responsible for data quality and accuracy. This is similar to how microservices turn a set of technical capabilities into a product that can be consumed by other microservices. Implementing a data mesh on AWS is made simple by using managed and serverless services such as AWS Glue, Lake Formation, Athena, and Redshift Spectrum to provide a well-understood, performant, scalable, and cost-effective solution to integrate, prepare, and serve data.


About the Authors

Nivas Shankar is a Principal Data Architect at Amazon Web Services. He helps and works closely with enterprise customers building data lakes and analytical applications on the AWS platform. He holds a master’s degree in physics and is highly passionate about theoretical physics concepts.

Dylan Qu is an AWS solutions architect responsible for providing architectural guidance across the full AWS stack with a focus on Data Analytics, AI/ML and DevOps.

Pavan Emani is a Data Lake Architect at AWS, specialized in big data and analytics solutions. He helps customers modernize their data platforms on the cloud. Outside of work, he likes reading about space and watching sports.

Prasanna Sridharan is a Senior Data & Analytics Architect with AWS. He is passionate about building the right big data solution for the AWS customers. He is specialized in the design and implementation of Analytics, Data Management and Big Data systems, mainly for Enterprise and FSI customers.

Build your data pipeline in your AWS modern data platform using AWS Lake Formation, AWS Glue, and dbt Core

Post Syndicated from Benjamin Menuet original https://aws.amazon.com/blogs/big-data/build-your-data-pipeline-in-your-aws-modern-data-platform-using-aws-lake-formation-aws-glue-and-dbt-core/

dbt has established itself as one of the most popular tools in the modern data stack, and is aiming to bring analytics engineering to everyone. The dbt tool makes it easy to develop and implement complex data processing pipelines, with mostly SQL, and it provides developers with a simple interface to create, test, document, evolve, and deploy their workflows. For more information, see docs.getdbt.com.

dbt primarily targets cloud data warehouses such as Amazon Redshift or Snowflake. Now, you can use dbt against AWS data lakes, thanks to the following two services:

  • AWS Glue interactive sessions, which the dbt-glue adapter uses as a serverless Apache Spark backend for processing data
  • AWS Lake Formation, which the adapter uses to perform structure manipulation such as creating databases, tables, and views

In this post, you’ll learn how to deploy a data pipeline in your modern data platform using the dbt-glue adapter built by the AWS Professional Services team in collaboration with dbtlabs.

With this new open-source, battle-tested dbt AWS Glue adapter, developers can now use dbt for their data lakes, paying for just the compute they need, with no need to shuffle data around. They still have access to everything that makes dbt great, including the local developer experience, documentation, tests, incremental data processing, Git integration, CI/CD, and more.

Solution overview

The following diagram shows the architecture of the solution.

The steps in this workflow are as follows:

  1. The data team configures a local Python virtual environment and creates a data pipeline with dbt.
  2. The dbt-glue adapter uses Lake Formation to perform all structure manipulation, such as creating databases, tables, or views.
  3. The dbt-glue adapter uses AWS Glue interactive sessions as the backend for processing your data.
  4. All data is stored in Amazon Simple Storage Service (Amazon S3) in the open Apache Parquet file format.
  5. The data team can now query all data stored in the data lake using Amazon Athena.

Walkthrough overview

For this post, you run a data pipeline that creates indicators based on NYC taxi data by following these steps:

  1. Deploy the provided AWS CloudFormation stack in Region us-east-1.
  2. Configure your Amazon CloudShell environment.
  3. Install dbt, the dbt CLI, and the dbt adaptor.
  4. Use CloudShell to clone the project and configure it to use your account’s configuration.
  5. Run dbt to implement the data pipeline.
  6. Query the data with Athena.

For our use case, we use the data from the New York City Taxi Records dataset. This dataset is available in the Registry of Open Data on AWS (RODA), which is a repository containing public datasets from AWS resources.

The CloudFormation template creates the nyctaxi database in your AWS Glue Data Catalog and a table (records) that points to the public dataset. You don’t need to host the data in your account.
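
If you want to confirm that the stack created these catalog entries, a quick check with the AWS CLI might look like the following; the profile name is a placeholder.

aws glue get-database --name nyctaxi --profile <YOUR_PROFILE>
aws glue get-tables --database-name nyctaxi --profile <YOUR_PROFILE>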

Prerequisites

The CloudFormation template used by this project configures the AWS Identity and Access Management (IAM) role GlueInteractiveSessionRole with all the mandatory permissions.

For more details on permissions for AWS Glue interactive sessions, refer to Securing AWS Glue interactive sessions with IAM.

Deploy resources with AWS CloudFormation

The CloudFormation stack deploys all the required infrastructure:

  • An IAM role with all the mandatory permissions to run an AWS Glue interactive session and the dbt-glue adapter.
  • An AWS Glue database and table to store the metadata related to the NYC taxi records dataset
  • An S3 bucket to use as output and store the processed data
  • An Athena configuration (a workgroup and S3 bucket to store the output) to explore the dataset
  • An AWS Lambda function as an AWS CloudFormation custom resource that updates all the partitions in the AWS Glue table

To create these resources, choose Launch Stack and follow the instructions:

Configure the CloudShell environment

To start working with the shell, complete the following steps:

  1. Sign in to the AWS Management Console and launch CloudShell using either one of the following two methods:
    1. Choose the CloudShell icon on the console navigation bar.
    2. Enter cloudshell in the Find Services box and then choose the CloudShell option.
  2. Because dbt and the dbt-glue adapter are compatible with Python versions 3.7, 3.8, and 3.9, check the version of Python:
    $ python3 --version

  3. Configure a Python virtual environment to isolate the package version and code dependencies:
    $ sudo yum install git -y
    $ python3 -m venv dbt_venv
    $ source dbt_venv/bin/activate
    $ python3 -m pip install --upgrade pip

  4. Configure the aws-glue-session package:
    $ sudo yum install gcc krb5-devel.x86_64 python3-devel.x86_64 -y
    $ pip3 install --no-cache-dir --upgrade boto3
    $ pip3 install --no-cache-dir --upgrade aws-glue-sessions

Install dbt, the dbt CLI, and the dbt adaptor

The dbt CLI is a command-line interface for running dbt projects. It’s free to use and available as an open-source project. Install dbt and the dbt CLI with the following code:

$ pip3 install --no-cache-dir dbt-core

For more information, refer to How to install dbt, What is dbt?, and Viewpoint.

Install the dbt adapter with the following code:

$ pip3 install --no-cache-dir dbt-glue

Clone the project

The dbt AWS Glue interactive session demo project contains an example of a data pipeline that produces metrics based on the NYC taxi dataset. Clone the project with the following code:

$ git clone https://github.com/aws-samples/dbtgluenyctaxidemo

This project comes with the following configuration example:

dbtgluenyctaxidemo/profile/profiles.yml

The following table summarizes the parameter options for the adaptor.

Option Description Mandatory
project_name The dbt project name. This must be the same as the one configured in the dbt project. yes
type The driver to use. yes
query-comment A string to inject as a comment in each query that dbt runs. no
role_arn The ARN of the interactive session role created as part of the CloudFormation template. yes
region The AWS Region where you run the data pipeline. yes
workers The number of workers of a defined workerType that are allocated when a job runs. yes
worker_type The type of predefined worker that is allocated when a job runs. Accepts a value of Standard, G.1X, or G.2X. yes
schema The schema used to organize data stored in Amazon S3. yes
database The database in Lake Formation. The database stores metadata tables in the Data Catalog. yes
session_provisioning_timeout_in_seconds The timeout in seconds for AWS Glue interactive session provisioning. yes
location The Amazon S3 location of your target data. yes
idle_timeout The AWS Glue session idle timeout in minutes. (The session stops after being idle for the specified amount of time.) no
glue_version The version of AWS Glue for this session to use. Currently, the only valid options are 2.0 and 3.0. The default value is 2.0. no
security_configuration The security configuration to use with this session. no
connections A comma-separated list of connections to use in the session. no

Run the dbt project

The objective of this sample project is to create the following four tables, which contain metrics based on the NYC taxi dataset:

  • silver_nyctaxi_avg_metrics – Basic metrics based on NYC Taxi Open Data for the year 2016
  • gold_nyctaxi_passengers_metrics – Metrics per passenger based on the silver metrics table
  • gold_nyctaxi_distance_metrics – Metrics per distance based on the silver metrics table
  • gold_nyctaxi_cost_metrics – Metrics per cost based on the silver metrics table
  1. To run the dbt project, you should be in the project folder:
    $ cd dbtgluenyctaxidemo

  2. The project requires you to set environment variables in order to run on the AWS account:
    $ export DBT_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query "Account" --output text):role/GlueInteractiveSessionRole"
    $ export DBT_S3_LOCATION="s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/"

  3. Make sure the profile is set up correctly from the command line:
    $ dbt debug --profiles-dir profile

  4. Run the models with the following code:
    $ dbt run --profiles-dir profile

  5. Generate documentation for the project:
    $ dbt docs generate --profiles-dir profile

  6. View the documentation for the project:
    $ dbt docs serve --profiles-dir profile

Query data via Athena

This section demonstrates how to query the target table using Athena. To query the data, complete the following steps:

  1. On the Athena console, switch the workgroup to athena-dbt-glue-aws-blog.
  2. If the Workgroup athena-dbt-glue-aws-blog settings dialog box appears, choose Acknowledge.
  3. Use the following query to explore the metrics created by the dbt project:
    SELECT cm.avg_cost_per_minute
    , cm.avg_cost_per_distance
    , dm.avg_distance_per_duration
    , dm.year
    , dm.month
    , dm.type
    FROM "dbt_nyc_metrics"."gold_nyctaxi_distance_metrics" dm
    LEFT JOIN "dbt_nyc_metrics"."gold_nyctaxi_cost_metrics" cm
    ON dm.type = cm.type
    AND dm.year = cm.year
    AND dm.month = cm.month
    WHERE dm.type = 'yellow'
    AND dm.year = '2016'
    AND dm.month = '6'

The following screenshot shows the results of this query.
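
If you prefer the command line over the Athena console, the following sketch runs a similar query with the AWS CLI; it assumes the workgroup created by the CloudFormation stack manages the query result location, and the profile name is a placeholder.

aws athena start-query-execution \
  --profile <YOUR_PROFILE> \
  --work-group athena-dbt-glue-aws-blog \
  --query-execution-context Database=dbt_nyc_metrics \
  --query-string "SELECT count(*) FROM gold_nyctaxi_distance_metrics WHERE year = '2016'"

aws athena get-query-results --query-execution-id <QUERY_EXECUTION_ID> --profile <YOUR_PROFILE>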

Clean Up

To clean up your environment, complete the following steps in CloudShell:

  1. Delete the database created by dbt:
    $ aws glue delete-database --name dbt_nyc_metrics

  2. Delete all generated data:
    $ aws s3 rm s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive
    $ aws s3 rm s3://aws-athena-dbt-glue-query-results-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive

  3. Delete the CloudFormation stack:
    $ aws cloudformation delete-stack --stack-name dbt-demo

Summary

This post demonstrates how AWS managed services are key enablers and accelerators to build a modern data platform at scale or take advantage of an existing one.

With the introduction of dbt and the dbt-glue adapter, data teams can access data stored in their modern data platform using SQL statements to extract value from it.

To report a bug or request a feature, please open an issue on GitHub. If you have any questions or suggestions, leave your feedback in the comment section. If you need further assistance to optimize your modern data platform, contact your AWS account team or a trusted AWS Partner.


About the Authors

Benjamin Menuet is a Data Architect with AWS Professional Services. He helps customers develop big data and analytics solutions to accelerate their business outcomes. Outside of work, Benjamin is a trail runner and has finished some mythic races like the UTMB.

Armando Segnini is a Data Architect with AWS Professional Services. He spends his time building scalable big data and analytics solutions for AWS Enterprise and Strategic customers. Armando also loves to travel with his family all around the world and take pictures of the places he visits.

Moshir Mikael is a Senior Practice Manager with AWS Professional Services. He led the development of large enterprise data platforms in EMEA and currently leads the Professional Services analytics teams in EMEA.

Anouar Zaaber is a Senior Engagement Manager in AWS Professional Services. He leads internal AWS teams, external partners, and customer teams to deliver AWS cloud services that enable customers to realize their business outcomes.

Insights for CTOs: Part 3 – Growing your business with modern data capabilities

Post Syndicated from Syed Jaffry original https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-3-growing-your-business-with-modern-data-capabilities/

This post was co-written with Jonathan Hwang, head of Foundation Data Analytics at Zendesk.


In my role as a Senior Solutions Architect, I have spoken to chief technology officers (CTOs) and executive leadership of large enterprises like big banks, software as a service (SaaS) businesses, mid-sized enterprises, and startups.

In this 6-part series, I share insights gained from various CTOs and engineering leaders during their cloud adoption journeys at their respective organizations. I have taken these lessons and summarized architecture best practices to help you build and operate applications successfully in the cloud. This series also covers building and operating cloud applications, security, cloud financial management, modern data and artificial intelligence (AI), cloud operating models, and strategies for cloud migration.

In Part 3, I’ve collaborated with the head of Foundation Analytics at Zendesk, Jonathan Hwang, to show how Zendesk incrementally scaled their data and analytics capabilities to effectively use the insights they collect from customer interactions. Read how Zendesk built a modern data architecture using Amazon Simple Storage Service (Amazon S3) for storage, Apache Hudi for row-level data processing, and AWS Lake Formation for fine-grained access control.

Why Zendesk needed to build and scale their data platform

Zendesk is a customer service platform that connects over 100,000 brands with hundreds of millions of customers via telephone, chat, email, messaging, social channels, communities, review sites, and help centers. They use data from these channels to make informed business decisions and create new and updated products.

In 2014, Zendesk’s data team built the first version of their big data platform in their own data center using Apache Hadoop for incubating their machine learning (ML) initiative. With that, they launched Answer Bot and Zendesk Benchmark report. These products were so successful they soon overwhelmed the limited compute resources available in the data center. By the end of 2017, it was clear Zendesk needed to move to the cloud to modernize and scale their data capabilities.

Incrementally modernizing data capabilities

Zendesk built and scaled their workload to use data lakes on AWS, but soon encountered new architecture challenges:

  • The General Data Protection Regulation (GDPR) “right to be forgotten” rule made it difficult and costly to maintain data lakes, because deleting a small piece of data required reprocessing large datasets.
  • Security and governance were harder to manage as the data lake scaled to a larger number of users.

The following sections show you how Zendesk is addressing GDPR rules by evolving from plain Apache Parquet files on Amazon S3 to Hudi datasets on Amazon S3 to enable row-level inserts, updates, and deletes. To address security and governance, Zendesk is migrating to AWS Lake Formation centralized security for fine-grained access control at scale.

Zendesk’s data platform

Figure 1 shows Zendesk’s current data platform. It consists of three data pipelines: “Data Hub,” “Data Lake,” and “Self Service.”

Figure 1. Zendesk data pipelines

Data Lake pipelines

The Data Lake and Data Hub pipelines cover the entire lifecycle of the data from ingestion to consumption.

The Data Lake pipelines consolidate the data from Zendesk’s highly distributed databases into a data lake for analysis.

Zendesk uses AWS Database Migration Service (AWS DMS) for change data capture (CDC) from over 1,800 Amazon Aurora MySQL databases in eight AWS Regions. It detects transaction changes and applies them to the data lake using Amazon EMR and Hudi.

Zendesk ticket data consists of over 10 billion events and petabytes of data. The data lake files in Amazon S3 are transformed and stored in Apache Hudi format and registered on the AWS Glue catalog to be available as data lake tables for analytics querying and consumption via Amazon Athena.

Data Hub pipelines

The Data Hub pipelines focus on real-time events and streaming analytics use cases with Apache Kafka. Any application at Zendesk can publish events to a global Kafka message bus. Apache Flink ingests these events into Amazon S3.

The Data Hub provides high-quality business data that is highly available and scalable.

Self-managed pipeline

The self-managed pipelines empower product engineering teams to use the data lake for those use cases that don’t fit into the standard integration patterns. All internal Zendesk product engineering teams can use standard tools such as Amazon EMR, Amazon S3, Athena, and AWS Glue to publish their own analytics datasets and share them with other teams.

A notable example of this is Zendesk’s fraud detection engineering team. They publish their fraud detection data and findings through the self-managed data lake platform and use Amazon QuickSight for visualization.

You need fine-grained security and compliance

Data lakes can accelerate growth through faster decision making and product innovation. However, they can also bring new security and compliance challenges:

  • Visibility and auditability. Who has access to what data? What level of access do people have and how/when and who is accessing it?
  • Fine-grained access control. How do you define and enforce least privilege access to subsets of data at scale without creating bottlenecks or key person/team dependencies?

Lake Formation helps address these concerns by auditing data access and offering row- and column-level security and a delegated access control model to create data stewards for self-managed security and governance.

Zendesk used Lake Formation to build a fine-grained access control model that uses row-level security. It detects personally identifiable information (PII) while scaling the data lake for self-managed consumption.

Some Zendesk customers opt out of having their data included in ML or market research. Zendesk uses Lake Formation to apply row-level security to filter out records associated with a list of customer accounts who have opted out of queries. They also help data lake users understand which data lake tables contain PII by automatically detecting and tagging columns in the data catalog using AWS Glue’s PII detection algorithm.
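
For illustration only, this kind of row-level restriction can be expressed in Lake Formation as a data cells filter. The following sketch uses hypothetical database, table, column, and account values rather than Zendesk’s actual environment.

cat > filter.json <<'EOF'
{
  "TableCatalogId": "<ACCOUNT_ID>",
  "DatabaseName": "tickets_db",
  "TableName": "tickets",
  "Name": "exclude_opted_out_accounts",
  "RowFilter": { "FilterExpression": "account_id NOT IN ('1001','1002')" },
  "ColumnWildcard": {}
}
EOF

aws lakeformation create-data-cells-filter --table-data file://filter.json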

The value of real-time data processing

When you process and consume data closer to the time of its creation, you can make faster decisions. Streaming analytics design patterns, implemented using services like Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Amazon Kinesis, create an enterprise event bus to exchange data between heterogeneous applications in near real time.

For example, it is common to use streaming to augment the traditional database CDC ingestion into the data lake with additional streaming ingestion of application events. CDC is a common data ingestion pattern, but the information can be too low level. This requires application context to be reconstructed in the data lake and business logic to be duplicated in two places, inside the application and in the data lake processing layer. This creates a risk of semantic misrepresentation of the application context.

Zendesk faced this challenge with their CDC data lake ingestion from their Aurora clusters. They created an enterprise event bus built with Apache Kafka to augment their CDC with higher-level application domain events to be exchanged directly between heterogeneous applications.

Zendesk’s streaming architecture

A CDC database ticket table schema can sometimes contain unnecessary and complex attributes that are application specific and do not capture the domain model of the ticket. This makes it hard for downstream consumers to understand and use the data. A ticket domain object may span several database tables when modeled in third normal form, which makes querying for analysts difficult downstream. This is also a brittle integration method because downstream data consumers can easily be impacted when the application logic changes, which makes it hard to derive a common data view.

To move towards event-based communication between microservices, Zendesk created the Platform Data Architecture (PDA) project, which uses a standard object model to represent a higher level, semantic view of their application data. Standard objects are domain objects designed for cross-domain communication and do not suffer from the lower level fragmented scope of database CDC. Ultimately, Zendesk aims to transition their data architecture from a collection of isolated products and data silos into a cohesive unified data platform.

Figure 2. An application view of Zendesk’s streaming architecture

Figure 3 shows how all Zendesk products and users integrate through common standard objects and standard events within the Data Hub. Applications publish and consume standard objects and events to/from the event bus.

For example, a complete ticket standard object will be published to the message bus whenever it is created, updated, or changed. On the consumption side, these events get used by product teams to enable platform capabilities such as search, data export, analytics, and reporting dashboards.

Summary

As Zendesk’s business grew, their data lake evolved from simple Parquet files on Amazon S3 to a modern Hudi-based incrementally updateable data lake. Now, their original coarse-grained IAM security policies use fine-grained access control with Lake Formation.

We have repeatedly seen this incremental architecture evolution achieve success because it reduces the business risk associated with the change and provides sufficient time for your team to learn and evaluate cloud operations and managed services.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Other posts in this series

How the Georgia Data Analytics Center built a cloud analytics solution from scratch with the AWS Data Lab

Post Syndicated from Kanti Chalasani original https://aws.amazon.com/blogs/big-data/how-the-georgia-data-analytics-center-built-a-cloud-analytics-solution-from-scratch-with-the-aws-data-lab/

This is a guest post by Kanti Chalasani, Division Director at Georgia Data Analytics Center (GDAC). GDAC is housed within the Georgia Office of Planning and Budget to facilitate governed data sharing between various state agencies and departments.

The Office of Planning and Budget (OPB) established the Georgia Data Analytics Center (GDAC) with the intent to provide data accountability and transparency in Georgia. GDAC strives to support the state’s government agencies, academic institutions, researchers, and taxpayers with their data needs. Georgia’s modern data analytics center will help to securely harvest, integrate, anonymize, and aggregate data.

In this post, we share how GDAC created an analytics platform from scratch using AWS services and how GDAC collaborated with the AWS Data Lab to accelerate this project from design to build in record time. The pre-planning sessions, technical immersions, pre-build sessions, and post-build sessions helped us focus on our objectives and tangible deliverables. We built a prototype with a modern data architecture and quickly ingested additional data into the data lake and the data warehouse. The purpose-built data and analytics services allowed us to quickly ingest additional data and deliver data analytics dashboards. It was extremely rewarding to officially release the GDAC public website within only 4 months.

A combination of clear direction from OPB executive stakeholders, input from the knowledgeable and driven AWS team, and the GDAC team’s drive and commitment to learning played a huge role in this success story. GDAC’s partner agencies helped tremendously through timely data delivery, data validation, and review.

We had a two-tiered engagement with the AWS Data Lab. In the first tier, we participated in a Design Lab to discuss our near-to-long-term requirements and create a best-fit architecture. We discussed the pros and cons of various services that can help us meet those requirements. We also had meaningful engagement with AWS subject matter experts from various AWS services to dive deeper into the best practices.

The Design Lab was followed by a Build Lab, where we took a smaller cross section of the bigger architecture and implemented a prototype in 4 days. During the Build Lab, we worked in GDAC AWS accounts, using GDAC data and GDAC resources. This not only helped us build the prototype, but also helped us gain hands-on experience in building it. This experience also helped us better maintain the product after we went live. We were able to continually build on this hands-on experience and share the knowledge with other agencies in Georgia.

Our Design and Build Lab experiences are detailed below.

Step 1: Design Lab

We wanted to stand up a platform that can meet the data and analytics needs for the Georgia Data Analytics Center (GDAC) and potentially serve as a gold standard for other government agencies in Georgia. Our objective with the AWS Data Design Lab was to come up with an architecture that meets initial data needs and provides ample scope for future expansion, as our user base and data volume increased. We wanted each component of the architecture to scale independently, with tighter controls on data access. Our objective was to enable easy exploration of data with faster response times using Tableau data analytics as well as build data capital for Georgia. This would allow us to empower our policymakers to make data-driven decisions in a timely manner and allow State agencies to share data and definitions within and across agencies through data governance. We also stressed on data security, classification, obfuscation, auditing, monitoring, logging, and compliance needs. We wanted to use purpose-built tools meant for specialized objectives.

Over the course of the 2-day Design Lab, we defined our overall architecture and picked a scaled-down version to explore. The following diagram illustrates the architecture of our prototype.

The architecture contains the following key components:

  • Amazon Simple Storage Service (Amazon S3) for raw data landing and curated data staging.
  • AWS Glue for extract, transform, and load (ETL) jobs to move data from the Amazon S3 landing zone to Amazon S3 curated zone in optimal format and layout. We used an AWS Glue crawler to update the AWS Glue Data Catalog.
  • AWS Step Functions for AWS Glue job orchestration.
  • Amazon Athena as a powerful tool for a quick and extensive SQL data analysis and to build a logical layer on the landing zone.
  • Amazon Redshift to create a federated data warehouse with conformed dimensions and star schemas for consumption by Tableau data analytics.

Step 2: Pre-Build Lab

We started with planning sessions to build foundational components of our infrastructure: AWS accounts, Amazon Elastic Compute Cloud (Amazon EC2) instances, an Amazon Redshift cluster, a virtual private cloud (VPC), route tables, security groups, encryption keys, access rules, internet gateways, a bastion host, and more. Additionally, we set up AWS Identity and Access Management (IAM) roles and policies, AWS Glue connections, dev endpoints, and notebooks. Files were ingested via secure FTP, or from a database to Amazon S3 using AWS Command Line Interface (AWS CLI). We crawled Amazon S3 via AWS Glue crawlers to build Data Catalog schemas and tables for quick SQL access in Athena.
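
As a simplified illustration of that ingestion flow, the AWS CLI steps might look like the following; the bucket, prefix, and crawler names are placeholders, not the actual GDAC resources.

# Land a file extract in the raw zone of the data lake
aws s3 cp ./extracts/revenue_2022.csv s3://<gdac-raw-landing-bucket>/agency_a/revenue/

# Crawl the landing zone to create or update Data Catalog tables for Athena
aws glue start-crawler --name <gdac-landing-zone-crawler>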

The GDAC team participated in Immersion Days for training in AWS Glue, AWS Lake Formation, and Amazon Redshift in preparation for the Build Lab.

We defined the following as the success criteria for the Build Lab:

  • Create ETL pipelines from source (Amazon S3 raw) to target (Amazon Redshift). These ETL pipelines should create and load dimensions and facts in Amazon Redshift.
  • Have a mechanism to test the accuracy of the data loaded through our pipelines.
  • Set up Amazon Redshift in a private subnet of a VPC, with appropriate users and roles identified.
  • Connect from AWS Glue to Amazon S3 to Amazon Redshift without going over the internet.
  • Set up row-level filtering in Amazon Redshift based on user login.
  • Data pipelines orchestration using Step Functions.
  • Build and publish Tableau analytics with connections to our star schema in Amazon Redshift.
  • Automate the deployment using AWS CloudFormation.
  • Set up column-level security for the data in Amazon S3 using Lake Formation. This allows for differential access to data based on user roles to users using both Athena and Amazon Redshift Spectrum.

Step 3: Four-day Build Lab

Following a series of implementation sessions with our architect, we formed the GDAC data lake and organized downstream data pulls for the data warehouse with governed data access. Data was ingested in the raw data landing lake and then curated into a staging lake, where data was compressed and partitioned in Parquet format.

It was empowering for us to build PySpark extract, transform, and load (ETL) AWS Glue jobs with our meticulous AWS Data Lab architect. We built reusable AWS Glue jobs for data ingestion and curation using the code snippets provided. The days were rigorous and long, but we were thrilled to see our centralized data repository come to fruition so rapidly. Cataloging data and using Athena queries proved to be a fast and cost-effective way for data exploration and data wrangling.

The serverless orchestration with Step Functions allowed us to put AWS Glue jobs into a simple readable data workflow. We spent time designing for performance and partitioning data to minimize cost and increase efficiency.
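
For example, starting one of these orchestrated pipelines can be as simple as the following sketch; the state machine name and input parameters are hypothetical placeholders.

aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:<REGION>:<ACCOUNT_ID>:stateMachine:<gdac-etl-workflow> \
  --input '{"load_type": "incremental", "target_schema": "star_budget"}'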

Database access from Tableau and SQL Workbench/J was set up for my team. Our excitement only grew as we began building data analytics and dashboards using our dimensional data models.

Step 4: Post-Build Lab

During our post-Build Lab session, we closed several loose ends and built additional AWS Glue jobs for initial and historic loads and append vs. overwrite strategies. These strategies were picked based on the nature of the data in various tables. We returned for a second Build Lab to work on building data migration tasks from Oracle Database via VPC peering, file processing using AWS Glue DataBrew, and AWS CloudFormation for automated AWS Glue job generation. If you have a team of 4–8 builders looking for a fast and easy foundation for a complete data analytics system, I would highly recommend the AWS Data Lab.

Conclusion

All in all, with a very small team we were able to set up a sustainable framework on AWS infrastructure with elastic scaling to handle future capacity without compromising quality. With this framework in place, we are moving rapidly with new data feeds. This would not have been possible without the assistance of the AWS Data Lab team throughout the project lifecycle. With this quick win, we decided to move forward and build AWS Control Tower with multiple accounts in our landing zone. We brought in professionals to help set up infrastructure and data compliance guardrails and security policies. We are thrilled to continually improve our cloud infrastructure, services and data engineering processes. This strong initial foundation has paved the pathway to endless data projects in Georgia.


About the Author

Kanti Chalasani serves as the Division Director for the Georgia Data Analytics Center (GDAC) at the Office of Planning and Budget (OPB). Kanti is responsible for GDAC’s data management, analytics, security, compliance, and governance activities. She strives to work with state agencies to improve data sharing, data literacy, and data quality through this modern data engineering platform. With over 26 years of experience in IT management, hands-on data warehousing, and analytics, she holds herself to a standard of excellence.

Vishal Pathak is an AWS Data Lab Solutions Architect. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey with AWS, Vishal helped customers implement BI, data warehousing, and data lake projects in the US and Australia.

Build a data sharing workflow with AWS Lake Formation for your data mesh

Post Syndicated from Jan Michael Go Tan original https://aws.amazon.com/blogs/big-data/build-a-data-sharing-workflow-with-aws-lake-formation-for-your-data-mesh/

A key benefit of a data mesh architecture is allowing different lines of business (LOBs) and organizational units to operate independently and offer their data as a product. This model not only allows organizations to scale, but also gives the end-to-end ownership of maintaining the product to data producers that are the domain experts of the data. This ownership entails maintaining the data pipelines, debugging ETL scripts, fixing data quality issues, and keeping the catalog entries up to date as the dataset evolves over time.

On the consumer side, teams can search the central catalog for relevant data products and request access. Access to the data is granted via the data sharing feature in AWS Lake Formation. As the number of data products grows and potentially more sensitive information is stored in an organization’s data lake, it’s important that the process and mechanism to request and grant access to specific data products are scalable and secure.

This post describes how to build a workflow engine that automates the data sharing process while including a separate approval mechanism for data products that are tagged as sensitive (for example, containing PII data). Both the workflow and approval mechanism are customizable and should be adapted to adhere to your company’s internal processes. In addition, we include an optional workflow UI to demonstrate how to integrate with the workflow engine. The UI is just one example of how the interaction works. In a typical large enterprise, you can also use ticketing systems to automatically trigger both the workflow and the approval process.

Solution overview

A typical data mesh architecture for analytics in AWS contains one central account that collates all the different data products from multiple producer accounts. Consumers can search the available data products in a single location. Sharing data products to consumers doesn’t actually make a separate copy, but instead just creates a pointer to the catalog item. This means any updates that producers make to their products are automatically reflected in the central account as well as in all the consumer accounts.

Building on top of this foundation, the solution contains several components, as depicted in the following diagram:

The central account includes the following components:

  • AWS Glue – Used for Data Catalog purposes.
  • AWS Lake Formation – Used to secure access to the data as well as provide the data sharing capabilities that enable the data mesh architecture.
  • AWS Step Functions – The actual workflow is defined as a state machine. You can customize this to adhere to your organization’s approval requirements.
  • AWS Amplify – The workflow UI uses the Amplify framework to secure access. It also uses Amplify to host the React-based application. On the backend, the Amplify framework creates two Amazon Cognito components to support the security requirements:
    • User pools – Provide a user directory functionality.
    • Identity pools – Provide federated sign-in capabilities using Amazon Cognito user pools as the location of the user details. The identity pools vend temporary credentials so the workflow UI can access AWS Glue and Step Functions APIs.
  • AWS Lambda – Contains the application logic orchestrated by the Step Functions state machine. It also provides the necessary application logic when a producer approves or denies a request for access.
  • Amazon API Gateway – Provides the API for producers to accept and deny requests.

The producer account contains the following components:

  • Amazon SNS – An SNS topic receives approval requests for data products that require the producer’s sign-off.
  • AWS Identity and Access Management (IAM) – The ProducerWorkflowRole has a trust relationship with the central account and allows publishing to the SNS topic.

The consumer account contains the following components:

  • AWS Glue – Used for Data Catalog purposes.
  • AWS Lake Formation – After the data has been made available, consumers can grant access to its own users via Lake Formation.
  • AWS Resource Access Manager (AWS RAM) – If the grantee account is in the same organization as the grantor account, the shared resource is available immediately to the grantee. If the grantee account is not in the same organization, AWS RAM sends an invitation to the grantee account to accept or reject the resource grant. For more details about Lake Formation cross-account access, see Cross-Account Access: How It Works.

The solution is split into multiple steps:

  1. Deploy the central account backend, including the workflow engine and its associated components.
  2. Deploy the backend for the producer accounts. You can repeat this step multiple times depending on the number of producer accounts that you’re onboarding into the workflow engine.
  3. Deploy the optional workflow UI in the central account to interact with the central account backend.

Workflow overview

The following diagram illustrates the workflow. In this particular example, the state machine checks if the table or database (depending on what is being shared) has the pii_flag parameter and if it’s set to TRUE. If both conditions are valid, it sends an approval request to the producer’s SNS topic. Otherwise, it automatically shares the product to the requesting consumer.

This workflow is the core of the solution, and can be customized to fit your organization’s approval process. In addition, you can add custom parameters to databases, tables, or even columns to attach extra metadata to support the workflow logic.

Prerequisites

The following are the deployment requirements:

You can clone the workflow UI and AWS CDK scripts from the GitHub repository.

Deploy the central account backend

To deploy the backend for the central account, go to the root of the project after cloning the GitHub repository and enter the following code:

yarn deploy-central --profile <PROFILE_OF_CENTRAL_ACCOUNT>

This deploys the following:

  • IAM roles used by the Lambda functions and Step Functions state machine
  • Lambda functions
  • The Step Functions state machine (the workflow itself)
  • An API Gateway

When the deployment is complete, it generates a JSON file in the src/cfn-output.json location. This file is used by the UI deployment script to generate a scoped-down IAM policy and workflow UI application to locate the state machine that was created by the AWS CDK script.

The actual AWS CDK scripts for the central account deployment are in infra/central/. This also includes the Lambda functions (in the infra/central/functions/ folder) that are used by both the state machine and the API Gateway.

Lake Formation permissions

The following table contains the minimum required permissions that the central account data lake administrator needs to grant to the respective IAM roles for the backend to have access to the AWS Glue Data Catalog.

Role Permission Grantable
WorkflowLambdaTableDetails Database: DESCRIBE; Tables: DESCRIBE N/A
WorkflowLambdaShareCatalog Tables: SELECT, DESCRIBE Tables: SELECT, DESCRIBE

Workflow catalog parameters

The workflow uses the following catalog parameters to provide its functionality.

Catalog Type Parameter Name Description
Database data_owner (Required) The account ID of the producer account that owns the data products.
Database data_owner_name A readable friendly name that identifies the producer in the UI.
Database pii_flag A flag (true/false) that determines whether the data product requires approval (based on the example workflow).
Column pii_flag A flag (true/false) that determines whether the data product requires approval (based on the example workflow). This is only applicable if requesting table-level access.

You can use UpdateDatabase and UpdateTable to add parameters at database-level and column-level granularity, respectively. Alternatively, you can use the AWS CLI for AWS Glue to add the relevant parameters.

Use the AWS CLI to run the following command to check the current parameters in your database:

aws glue get-database --name <DATABASE_NAME> --profile <PROFILE_OF_CENTRAL_ACCOUNT>

You get the following response:

{
  "Database": {
    "Name": "<DATABASE_NAME>",
    "CreateTime": "<CREATION_TIME>",
    "CreateTableDefaultPermissions": [],
    "CatalogId": "<CATALOG_ID>"
  }
}

To update the database with the parameters indicated in the preceding table, we first create the input JSON file, which contains the parameters that we want to update the database with. For example, see the following code:

{
  "Name": "<DATABASE_NAME>",
  "Parameters": {
    "data_owner": "<AWS_ACCOUNT_ID_OF_OWNER>",
    "data_owner_name": "<AWS_ACCOUNT_NAME_OF_OWNER>",
    "pii_flag": "true"
  }
}

Run the following command to update the Data Catalog:

aws glue update-database --name <DATABASE_NAME> --database-input file://<FILE_NAME>.json --profile <PROFILE_OF_CENTRAL_ACCOUNT>

Deploy the producer account backend

To deploy the backend for your producer accounts, go to the root of the project and run the following command:

yarn deploy-producer --profile <PROFILE_OF_PRODUCER_ACCOUNT> --parameters centralMeshAccountId=<central_account_account_id>

This deploys the following:

  • An SNS topic where approval requests get published.
  • The ProducerWorkflowRole IAM role with a trust relationship to the central account. This role allows Amazon SNS publish to the previously created SNS topic.

You can run this deployment script multiple times, each time pointing to a different producer account that you want to participate in the workflow.

To receive notification emails, subscribe your email in the SNS topic that the deployment script created. For example, our topic is called DataLakeSharingApproval. To get the full ARN, you can either go to the Amazon Simple Notification Service console or run the following command to list all the topics and get the ARN for DataLakeSharingApproval:

aws sns list-topics --profile <PROFILE_OF_PRODUCER_ACCOUNT>

After you have the ARN, you can subscribe your email by running the following command:

aws sns subscribe --topic-arn <TOPIC_ARN> --protocol email --notification-endpoint <EMAIL_ADDRESS> --profile <PROFILE_OF_PRODUCER_ACCOUNT>

You then receive a confirmation email via the email address that you subscribed. Choose Confirm subscription to receive notifications from this SNS topic.

Deploy the workflow UI

The workflow UI is designed to be deployed in the central account where the central data catalog is located.

To start the deployment, enter the following command:

yarn deploy-ui

This deploys the following:

  • Amazon Cognito user pool and identity pool
  • React-based application to interact with the catalog and request data access

The deployment command prompts you for the following information:

  • Project information – Use the default values.
  • AWS authentication – Use your profile for the central account. Amplify uses this profile to deploy the backend resources.

  • UI authentication – Use the default configuration and your username. Choose No, I am done when asked to configure advanced settings.

  • UI hosting – Use hosting with the Amplify console and choose manual deployment.

The script gives a summary of what is deployed. Entering Y triggers the resources to be deployed in the backend. The prompt looks similar to the following screenshot:

When the deployment is complete, the remaining prompt is for the initial user information such as user name and email. A temporary password is automatically generated and sent to the email provided. The user is required to change the password after the first login.

The deployment script grants IAM permissions to the user via an inline policy attached to the Amazon Cognito authenticated IAM role:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "glue:GetDatabase",
            "glue:GetTables",
            "glue:GetDatabases",
            "glue:GetTable"
         ],
         "Resource":"*"
      },
      {
         "Effect":"Allow",
         "Action":[
            "states:ListExecutions",
            "states:StartExecution"
         ],
         "Resource":[
            "arn:aws:states:<REGION>:<AWS_ACCOUNT_ID>:stateMachine:<STATE_MACHINE_NAME>"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "states:DescribeExecution"
         ],
         "Resource":[
            "arn:aws:states:<REGION>:<AWS_ACCOUNT_ID>:execution:<STATE_MACHINE_NAME>:*"
         ]
      }
   ]
}

The last remaining step is to grant Lake Formation permissions (DESCRIBE for both databases and tables) to the authenticated IAM role associated with the Amazon Cognito identity pool. You can find the IAM role by running the following command:

cat amplify/team-provider-info.json

The IAM role name is in the AuthRoleName property under the awscloudformation key. After you grant the required permissions, you can use the URL provided in your browser to open the workflow UI.
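
For reference, granting those DESCRIBE permissions from the AWS CLI might look like the following sketch. The role name comes from the AuthRoleName property, and the database name and account ID are placeholders; repeat the grants (or use a table wildcard, as shown) for each database the UI should be able to describe.

aws lakeformation grant-permissions \
  --profile <PROFILE_OF_CENTRAL_ACCOUNT> \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<CENTRAL_ACCOUNT_ID>:role/<AUTH_ROLE_NAME> \
  --permissions DESCRIBE \
  --resource '{"Database":{"Name":"<DATABASE_NAME>"}}'

aws lakeformation grant-permissions \
  --profile <PROFILE_OF_CENTRAL_ACCOUNT> \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<CENTRAL_ACCOUNT_ID>:role/<AUTH_ROLE_NAME> \
  --permissions DESCRIBE \
  --resource '{"Table":{"DatabaseName":"<DATABASE_NAME>","TableWildcard":{}}}'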

Your temporary password is emailed to you so you can complete the initial login, after which you’re asked to change your password.

The first page after logging in is the list of databases that consumers can access.

Choose Request Access to see the database details and the list of tables.

Choose Request Per Table Access and see more details at the table level.

Going back to the previous page, we request database-level access by entering the consumer account ID that receives the share request.

Because this database has been tagged with a pii_flag, the workflow needs to send an approval request to the product owner. To receive this approval request email, the product owner’s email needs to be subscribed to the DataLakeSharingApproval SNS topic in the producer account. The details should look similar to the following screenshot:

The email looks similar to the following screenshot:

The product owner chooses the Approve link to trigger the Step Functions state machine to continue running and share the catalog item to the consumer account.

For this example, the consumer account is not part of an organization, so the admin of the consumer account has to go to AWS RAM and accept the invitation.

After the resource share is accepted, the shared database appears in the consumer account’s catalog.

Clean up

If you no longer need to use this solution, use the provided cleanup scripts to remove the deployed resources.

Producer account

To remove the deployed resources in producer accounts, run the following command for each producer account that you deployed in:

yarn clean-producer --profile <PROFILE_OF_PRODUCER_ACCOUNT>

Central account

Run the following command to remove the workflow backend in the central account:

yarn clean-central --profile <PROFILE_OF_CENTRAL_ACCOUNT>

Workflow UI

The cleanup script for the workflow UI relies on an Amplify CLI command to initiate the teardown of the deployed resources, and on a custom script to remove the inline policy in the authenticated IAM role used by Amazon Cognito so that Amplify can fully clean up all the deployed resources. Run the following command to trigger the cleanup:

yarn clean-ui

This command doesn’t require the profile parameter because it uses the existing Amplify configuration to infer where the resources are deployed and which profile was used.

Conclusion

This post demonstrated how to build a workflow engine to automate an organization’s approval process to gain access to data products with varying degrees of sensitivity. Using a workflow engine enables data sharing in a self-service manner while codifying your organization’s internal processes to be able to safely scale as more data products and teams get onboarded.

The provided workflow UI demonstrated one possible integration scenario. Other possible integration scenarios include integration with your organization’s ticketing system to trigger the workflow as well as receive and respond to approval requests, or integration with business chat applications to further shorten the approval cycle.

Lastly, a high degree of customization is possible with the demonstrated approach. Organizations have complete control over the workflow, how data product sensitivity levels are defined, what gets auto-approved and what needs further approvals, the hierarchy of approvals (such as a single approver or multiple approvers), and how the approvals get delivered and acted upon. You can take advantage of this flexibility to automate your company’s processes to help them safely accelerate towards being a data-driven organization.


About the Author

Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.

How to Run Massively Scalable ADAS Simulation Workloads on CAEdge

Post Syndicated from Hendrik Schoeneberg original https://aws.amazon.com/blogs/architecture/how-to-run-massively-scalable-adas-simulation-workloads-on-caedge/

This post was co-written by Hendrik Schoeneberg, Sr. Global Big Data Architect, The An Binh Nguyen, Product Owner for Cloud Simulation at Continental, Autonomous Mobility – Engineering Platform, Rumeshkrishnan Mohan, Global Big Data Architect, and Junjie Tang, Principal Consultant at AWS Professional Services.

AV/ADAS simulations processing large-scale field sensor data such as radar, lidar, and high-resolution video come with many challenges. Typically, the simulation workloads are spiky with occasional, but high compute demands, so the platform must scale up or down elastically to match the compute requirements. The platform must be flexible enough to integrate specialized ADAS simulation software, use distributed computing or HPC frameworks, and leverage GPU accelerated compute resources where needed.

Continental created the Continental Automotive Edge (CAEdge) Framework to address these challenges. It is a modular multi-tenant hardware and software framework that connects the vehicle to the cloud. You can learn more about this in Developing a Platform for Software-defined Vehicles with Continental Automotive Edge (CAEdge) and Developing a platform for software-defined vehicles (re:Invent session-AUT304).

In this blog post, we’ll illustrate how the CAEdge Framework orchestrates ADAS simulation workloads with Amazon Managed Workflows for Apache Airflow (MWAA). We’ll show how it delegates the high-performance workloads to AWS Batch for elastic, highly scalable, and customizable compute needs. We’ll showcase the “bring your own software-in-the-loop” (BYO-SIL) pattern, detailing how to leverage specialized and proprietary simulation software in your workflow. We’ll also demonstrate how to integrate the simulation platform with tenant data in the CAEdge framework. This orchestrate and delegate pattern has previously been introduced in Field Notes: Deploying Autonomous Driving and ADAS Workloads at Scale with Amazon Managed Workflows for Apache Airflow and the Reference Architecture for Autonomous Driving Data Lake.

Solution overview for autonomous driving simulation

The following diagram shows a high-level overview of this solution.

Figure 1. Architecture diagram for autonomous driving simulation

It can be broken down into five major parts, illustrated in Figure 1.

  1. Simulation API: Amazon API Gateway (label 1) provides a REST API for authenticated users to schedule or monitor simulation runs on Amazon Managed Workflows for Apache Airflow (MWAA) using AWS Lambda (label 2).
  2. Simulation control plane: The simulation control plane (label 3) built on MWAA enables users to design and initiate workflows and integrate with AWS services.
  3. Scalable compute backend: We leverage the parallelization and elastic scalability capabilities of AWS Batch to distribute the workload (label 4). Additionally, we want to be able to run proprietary software components as part of the simulation workflows and use highly customizable Amazon EC2 compute environments. For example, we can use GPU acceleration or Graviton-based instances for workloads that must run on ARM, instead of x86 architectures.
  4. Autonomous Driving Data Lake (ADDL) integration: The simulations’ input and output data is stored in a data lake (label 5) on Amazon S3. To provide efficient read and write access, data gets copied before each simulation run onto ephemeral instance storage drives bundled as a RAID0 array. After the simulation, results are written back to the data lake and are ready for reporting and analytics. We use AWS Lake Formation for metadata storage, data cataloging, and permission handling.
  5. Automated deployment: The solution architecture’s deployment is fully automated using a CI/CD pipeline in AWS CodePipeline and AWS CodeBuild (label 6). We can then perform automated testing and deployment to multiple target environments.

In this blog we will focus on the simulation control plane, scalable compute backend, and ADDL integration.

Simulation control plane

In a typical simulation, there are tasks like data movement (copying input data and persisting the results). These steps can require high levels of parallelization or GPU support. In addition, they can involve third-party or proprietary software, specialized runtime environments, or architectures like ARM. To facilitate all task requirements, we will delegate task initiation to the AWS services that best match the requirements. Correct sequence of tasks is achieved by an orchestration layer. Decoupling the simulation orchestration from task initiation follows the key design pattern to separate concerns. This enables the architecture to be adapted to specific requirements.

For the high-level task orchestration, we’ll introduce a simulation control plane built on MWAA. Modeling all simulation tasks in a directed acyclic graph (DAG), MWAA performs scheduling and task initiation in the correct sequence. It also handles the integration with AWS services. You can intuitively monitor and debug the progress of a simulation run across different services and networks. For the workflow within CAEdge, we’ll use AWS Batch to run simulations that are packaged as Docker containers.

Parameterizable workflows: Airflow supports parameterization of DAGs by providing a JSON object when triggering a DAG run, which can be accessed at runtime. This enables us to create a simulation execution framework in which we model the simulation steps and specify the simulation parameters, such as which Docker container to run, the runtime configuration, and the input and output data locations (see Figures 2 and 3).

Figure 2. Specify simulations’ parameters in JSON

Figure 3. Read simulations’ parameters from JSON
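
The sketch below shows this pattern in a minimal Airflow DAG: the JSON object supplied when the run is triggered is read from dag_run.conf at runtime. The parameter names (image, input_prefix, output_prefix) and the task itself are illustrative assumptions, not the actual CAEdge workflow.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def read_simulation_parameters(**context):
    # The JSON object passed when triggering the DAG run, for example:
    # {"image": "...", "input_prefix": "s3://...", "output_prefix": "s3://..."}
    params = context["dag_run"].conf or {}
    image = params.get("image")
    input_prefix = params.get("input_prefix")
    output_prefix = params.get("output_prefix")
    print(f"Would run simulation container {image} "
          f"reading from {input_prefix} and writing to {output_prefix}")

with DAG(
    dag_id="simulation_workflow",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # triggered on demand with a JSON configuration
    catchup=False,
) as dag:
    PythonOperator(
        task_id="read_simulation_parameters",
        python_callable=read_simulation_parameters,
    )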

Scalable compute backend

Publishing the Docker image: To run a simulation, you must provide a Docker container image in Amazon Elastic Container Registry (Amazon ECR). You can push images to Amazon ECR manually using the AWS Command Line Interface (AWS CLI) or automatically as part of a CI/CD pipeline.

Choosing the compute option: Containers at AWS describes the container service options available to developers. Various factors can contribute to your decision. For the CAEdge platform, we want to run thousands of containers with fine-grained control over the underlying compute instances’ configuration, and we want to run containers in privileged mode, so AWS Batch on Amazon EC2 is a great match. Figure 4 outlines the compute backend’s architecture.

Figure 4. Compute backend architecture diagram

To run a Docker container on AWS Batch, we need three components: a job definition, a compute environment, and a job queue. The AWS Batch job definition specifies which job to run and how to run it; for example, we can define the Docker image to use and specify the vCPU and memory configuration. Jobs referencing the job definition are submitted to a job queue, which in turn is linked to one or more compute environments. The compute environment specifies the compute configuration used to run a containerized workload, such as the instance types and storage configuration. Creating a compute environment describes this in more detail. In the ADDL integration section, we describe how to select and configure the EC2 container instances to maximize network and disk I/O.

The AWS Batch array job feature can spin up thousands of independent, but similarly configured, jobs. Instead of submitting a separate job for each input file from the simulation control plane, we submit a single array job that receives the collection of input files as its input. AWS Batch then spins up a child job to process each entry of the collection in parallel, which drastically reduces operational overhead.

With these components, the simulation control plane can now submit jobs against a job definition to the job queue, and AWS Batch runs them on the corresponding compute environment.
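
As a rough sketch of this step (not the actual CAEdge resources), the following boto3 call submits an array job from the control plane; the job queue name, job definition name, and S3 prefixes are placeholders. Each child job can select its input based on the AWS_BATCH_JOB_ARRAY_INDEX environment variable that AWS Batch injects.

import boto3

batch = boto3.client("batch")

# Input recordings that the child jobs will process in parallel.
input_files = [
    "s3://my-addl-bucket/recordings/rec-001",
    "s3://my-addl-bucket/recordings/rec-002",
]

response = batch.submit_job(
    jobName="simulation-run-example",
    jobQueue="simulation-job-queue",            # placeholder job queue
    jobDefinition="simulation-job-definition",  # placeholder job definition
    arrayProperties={"size": len(input_files)},
    containerOverrides={
        "environment": [
            # Each child job reads this list and picks the entry matching
            # its AWS_BATCH_JOB_ARRAY_INDEX.
            {"name": "INPUT_FILES", "value": ",".join(input_files)},
        ]
    },
)
print(response["jobId"])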

ADDL integration

In the CAEdge platform, all simulation recordings and their metadata are stored and cataloged in the data lake. On average, single recordings are around 100 GB in size with simulation requests containing 100–300 recordings. A simulation request can therefore require 10–30 TB of data movement from S3 to the containers before the simulation starts. The containers’ performance will directly depend on I/O performance, as the input data is read and processed during execution. To provide the highest performance of data transfer and simulation workload, we need data storage options that maximize network and disk I/O throughput.

Choosing the storage option: For CAEdge’s simulation workloads, the storage solution acts as a temporary scratch space for the simulation containers’ input data. It should maximize I/O throughput. These requirements are met in the most cost-efficient way by choosing EC2 instances of the M5d family and their attached instance storage.

The M5d instances are general-purpose instances with NVMe-based SSDs that are physically connected to the host server and provide block-level storage coupled to the lifetime of the instance. As described in the previous section, we configured AWS Batch to create compute environments using EC2 M5d instances. While other storage options like Amazon Elastic File System (Amazon EFS) and Amazon Elastic Block Store (Amazon EBS) can scale to match even demanding throughput scenarios, the simulation benefits from a high-performance, temporary storage location that is directly attached to the host.

Bundling the volumes: M5d instances can have multiple NVMe drives attached, depending on their size and storage configuration. To provide a single, stable storage location for the simulation containers, we bundle all attached NVMe SSDs into a single RAID0 volume at instance launch, using a modified user data script. This provides a fixed storage location that the simulation containers can access. Additionally, I/O operations are distributed evenly across all disks in the RAID0 array, which improves read and write performance.

KPIs used for measurement

The SIL factor tells you how long the simulation takes compared to the duration of the recorded data.

SIL factor example:

  • For a recording with 60 minutes of data and a simulation duration of 120 minutes, the resulting SIL factor is 120 / 60 = 2. The simulation runs twice as long as real time.
  • For a recording with 60 minutes of data and a simulation duration of 60 minutes, the resulting SIL factor is 60 / 60 = 1. The simulation runs in real time.

Similarly, the aggregated SIL factor tells you how long the simulation for multiple recordings takes compared to the duration of the recorded data. It also factors in the horizontal scaling capabilities.

Aggregated SIL factor example:

  • For 3 recordings with 60 minutes of data each and an overall simulation duration of 120 minutes, the resulting aggregated SIL factor is 120 / (3 × 60) = 0.67. The overall simulation is a third faster than real time.
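
For reference, the calculations above can be expressed as the following small helper functions (a sketch for illustration, not part of the CAEdge codebase):

def sil_factor(simulation_minutes: float, recording_minutes: float) -> float:
    # How long the simulation takes relative to the duration of the recorded data.
    return simulation_minutes / recording_minutes

def aggregated_sil_factor(total_simulation_minutes: float,
                          recording_count: int,
                          minutes_per_recording: float) -> float:
    # Same ratio, but across all recordings processed in a single simulation run.
    return total_simulation_minutes / (recording_count * minutes_per_recording)

print(sil_factor(120, 60))                # 2.0 -> twice as long as real time
print(aggregated_sil_factor(120, 3, 60))  # approx. 0.67 -> faster than real time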

Performance results

The storage optimizations described in the preceding ADDL integration section improved the overall simulation duration by 50%. To benchmark the overall simulation platform’s performance, we created a simulation run with 15 recordings. The simulation run completed successfully with an aggregated SIL factor of 0.363; in other words, the simulation was roughly three times faster than real time. During production use, the platform will handle simulation runs with an average of 100–300 recordings. For these runs, the aggregated SIL factor is expected to be even smaller, as more simulations can be processed in parallel.

Conclusion

In this post, we showed how to design a platform for ADAS simulation workloads using the bring your own software-in-the-loop (BYO-SIL) concept. We covered the key components, including the simulation control plane, scalable compute backend, and ADDL integration, based on Continental Automotive Edge (CAEdge). We discussed the key performance benefits of horizontal scalability and how to choose an optimized storage integration pattern, which improved the overall simulation duration by 50%.

Centralize governance for your data lake using AWS Lake Formation while enabling a modern data architecture with Amazon Redshift Spectrum

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/centralize-governance-for-your-data-lake-using-aws-lake-formation-while-enabling-a-modern-data-architecture-with-amazon-redshift-spectrum/

Many customers are modernizing their data architecture using Amazon Redshift to enable access to all their data from a central data location. They are looking for a simpler, scalable, and centralized way to define and enforce access policies on their data lakes on Amazon Simple Storage Service (Amazon S3). They want access policies to allow their data lake consumers to use the analytics service of their choice, to best suit the operations they want to perform on the data. Although the existing method of using Amazon S3 bucket policies to manage access control is an option, managing bucket-level policies may not scale as the number of combinations of access levels and users increases.

AWS Lake Formation allows you to simplify and centralize access management. It allows organizations to manage access control for Amazon S3-based data lakes using familiar concepts of databases, tables, and columns (with more advanced options like row and cell-level security). Lake Formation uses the AWS Glue Data Catalog to provide access control for Amazon S3 data lake with most commonly used AWS analytics services, like Amazon Redshift (via Amazon Redshift Spectrum), Amazon Athena, AWS Glue ETL, and Amazon EMR (for Spark-based notebooks). These services honor the Lake Formation permissions model out of the box, which makes it easy for customers to simplify, standardize, and scale data security management for data lakes.

With Amazon Redshift, you can build a modern data architecture to seamlessly extend your data warehouse to your data lake and read all data – data in your data warehouse, and data in your data lake – without creating multiple copies of data. The Amazon Redshift Spectrum feature enables direct queries of your S3 data lake, and many customers are leveraging this to modernize their data platform. You can use Amazon Redshift managed storage for frequently accessed data and move less frequently accessed data to the Amazon S3 data lake, securely accessing it using Redshift Spectrum.

In this post, we discuss how you can use AWS Lake Formation to centralize data governance and data access management while using Amazon Redshift Spectrum to query your data lake. Lake Formation allows you to grant and revoke permissions on databases, tables, and column catalog objects created on top of Amazon S3 data lake. This is easier for customers, as it is similar to managing permissions on relational databases.

In the first post of this two-part series, we focus on resources within the same AWS account. In the second post, we extend the solution across AWS accounts using the Lake Formation data sharing feature.

Solution overview

The following diagram illustrates our solution architecture.

The solution workflow consists of the following steps:

  1. Data stored in an Amazon S3 data lake is crawled using an AWS Glue crawler.
  2. The crawler infers the metadata of data on Amazon S3 and stores it in the form of a database and tables in the AWS Glue Data Catalog.
  3. You register the Amazon S3 bucket as the data lake location with Lake Formation. It’s natively integrated with the Data Catalog.
  4. You use Lake Formation to grant permissions at the database, table, and column level to defined AWS Identity and Access Management (IAM) roles.
  5. You create external schemas within Amazon Redshift to manage access for marketing and finance teams.
  6. You provide the marketing and finance groups access to their respective external schemas and associate the appropriate IAM roles to be assumed. The admin role and admin group are limited to administration work.
  7. Marketing and finance users can now assume their respective IAM roles and query their external schemas inside Amazon Redshift using the SQL query editor.

Lake Formation default security settings

To maintain backward compatibility with AWS Glue, Lake Formation has the following initial security settings:

  • The super permission is granted to the group IAMAllowedPrincipals on all existing Data Catalog resources.
  • Settings to use only IAM access control are enabled for new Data Catalog resources.

To change security settings, see Changing the Default Security Settings for Your Data Lake.

Note: Leave the default settings as is until you’re ready to move completely to the Lake Formation permission model. You can update settings at a database level if you want permissions set by Lake Formation to take effect. For more details about upgrades, refer to Upgrading AWS Glue Data Permissions to the AWS Lake Formation Model.

We don’t recommend reverting back from the Lake Formation permission model to an IAM-only permission model. You may also want to first deploy the solution in a new test account.

Prerequisites

To set up this solution, you need basic familiarity with the AWS Management Console, an AWS account, and access to the following AWS services:

Create the data lake administrator

Data lake administrators are initially the only IAM users or roles that can grant Lake Formation permissions on data locations and Data Catalog resources to any principal.

To set up an IAM user as a data lake administrator, add the provided inline policy to the IAM user or IAM role you use to provision the resources for this blog solution. For more details, refer to Create a Data Lake Administrator.

  1. On the IAM console, choose Users, and choose the IAM user who you want to designate as the data lake administrator.
  2. Choose Add an inline policy on the Permissions tab and add the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "lakeformation:GetDataLakeSettings",
                    "lakeformation:PutDataLakeSettings"
                ],
                "Resource": "*"
            }
        ]
    }

  3. Provide a policy name.
  4. Review and save your settings.

Note: If you’re using an existing administrator user/role, you may have this already provisioned.

  1. Sign in to the AWS Management Console as the designated data lake administrator IAM user or role for this solution.

Note: The CloudFormation template doesn’t work if you skip the following step.

  1. If this is your first time on the Lake Formation console, select Add myself and choose Get started.

You can also add yourself as a data lake administrator by going to Administrative roles and tasks under Permissions, selecting Choose administrators, and adding yourself as an administrator if you missed this on the initial welcome screen.

Provision resources with CloudFormation

In this step, we create the solution resources using a CloudFormation template. The template performs the following actions:

  • Creates an S3 bucket to copy sample data files and SQL scripts
  • Registers the S3 data lake location with Lake Formation
  • Creates IAM roles and policies as needed for the environment
  • Assigns principals (IAM roles) to handle data lake settings
  • Creates Lambda and Step Functions resources to load necessary data
  • Runs AWS Glue crawler jobs to create Data Catalog tables
  • Configures Lake Formation permissions
  • Creates an Amazon Redshift cluster
  • Runs a SQL script to create the database group, database user, and external schemas for the admin, marketing, and finance groups

To create your resources, complete the following steps:

  1. Launch the provided template in AWS Region us-east-1.
  2. Choose Next.
  3. For Stack name, you can keep the default stack name or change it.
  4. For DbPassword, provide a secure password instead of using the default provided.
  5. For InboundTraffic, change the IP address range to your local machine’s IP address in CIDR format instead of using the default.
  6. Choose Next.
  7. Choose Next again until you get to the review page.
  8. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  9. Choose Create stack.

The stack takes approximately 10 minutes to deploy successfully. When it’s complete, you can view the outputs on the AWS CloudFormation console.

Update Lake Formation default settings

You also need to update the default settings at the Lake Formation database level. This makes sure that the Lake Formation permissions the CloudFormation template sets up during provisioning can take effect over the default settings.

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Choose the database you created with the CloudFormation template.
  3. Choose Edit.
  4. Deselect Use only IAM access control for new tables in the database.
  5. Choose Save.

This action is important because it removes the IAM access control model from this database and allows only Lake Formation to grant and revoke access to it. This step makes sure the other steps in this solution are successful.

  1. Choose Databases in the navigation pane.
  2. Select the same database.
  3. On the Actions menu, choose View permissions.

You can review the permissions enabled for this database.

  1. Select the IAMAllowedPrincipals group and choose Revoke to remove default permission settings for this individual database.

The IAMAllowedPrincipal row no longer appears in the list on the Permissions page.

Similarly, we need to remove the IAMAllowedPrincipal group at the table level. The CloudFormation template created six tables for this database. Let’s see how to use data lake permissions to remove access at the table level.

  1. On the Lake Formation console, choose Data lake permissions in the navigation pane.
  2. Filter by Principal:IAMAllowedPrincipals and Database:<<database name>>.

You can review all the tables we need to update permissions for.

  1. Select each table one by one and choose Revoke.

With these steps, we’ve made sure that the default settings at the Lake Formation account level are still in place, and only manually updated for the database and tables we’re going to work with in this post. When you’re ready to move completely to a Lake Formation permission model, you can update the settings at the account level instead of individually updating them. For more details, see Change the default permission model.
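
If you prefer to script these revocations instead of using the console, a minimal boto3 sketch could look like the following. The database and table names are placeholders; substitute the names the CloudFormation template created in your account.

import boto3

lakeformation = boto3.client("lakeformation")

database_name = "retail_datalake"  # placeholder database name
tables = ["store", "item", "store_sales", "customer_activity"]  # placeholder table names

# Revoke the default Super (ALL) permission from IAMAllowedPrincipals at the database level.
lakeformation.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
    Resource={"Database": {"Name": database_name}},
    Permissions=["ALL"],
)

# Do the same for each table in the database.
for table in tables:
    lakeformation.revoke_permissions(
        Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
        Resource={"Table": {"DatabaseName": database_name, "Name": table}},
        Permissions=["ALL"],
    )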

Validate the provisioned resources

The CloudFormation template provisions many resources automatically to create your environment. In this section, we check some of the key resources to understand them better.

Lake Formation resources

On the Lake Formation console, check that a new data lake location is registered with an IAM role on the Data lake locations page.

This is the IAM role any integrated service like Amazon Redshift assumes to access data on the registered Amazon S3 location. This integration happens out of the box when the right roles and policies are applied. For more details, see Requirements for Roles Used to Register Locations.

Check the Administrative roles and tasks page to confirm that the logged-in user is added as the data lake administrator and IAMAllowedPrincipals is added as a database creator.

Then check the tables that the AWS Glue crawlers created in the Data Catalog database. These tables are logical entities, because the data is in an Amazon S3 location. After you create these objects, you can access them via different services.

Lastly, check permissions set by the template using the Lake Formation permission model on the tables to be accessed by finance and marketing users from Amazon Redshift.

The following screenshot shows that the finance role has access to all columns for the store and item tables, but only the listed columns for the store_sales table.

Similarly, you can review access for the marketing role, which has access to all columns in the customer_activity and store_sales tables.

Amazon S3 resources

The CloudFormation template creates two S3 buckets:

  • data-lake – Contains the data used for this post
  • script – Contains the SQL scripts we use to create Amazon Redshift database objects

Open the script bucket to see the scripts. You can download and open them to view the SQL code used.

The setup_lakeformation_demo.sql script provides the SQL code to create the external database schemas and assign different roles for data governance purposes. The external schemas are for AWS Glue Data Catalog-based objects that point to data in the data lake. We then grant access to different database groups and users to manage security for the finance and marketing users.
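
As a hedged illustration of what such a script does (the exact schema, database, and role names in the template may differ), an external schema that points to a Data Catalog database is typically created with a statement like the one below. One way to run such a statement outside the template is through the Amazon Redshift Data API:

import boto3

redshift_data = boto3.client("redshift-data")

# Placeholder names; replace with the schema, catalog database, cluster, and IAM role
# used in your environment.
create_external_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS retail_datalake_marketing
FROM DATA CATALOG
DATABASE 'retail_datalake'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMarketingRole';
"""

redshift_data.execute_statement(
    ClusterIdentifier="redshift-cluster-1",  # placeholder cluster identifier
    Database="dev",
    DbUser="awsuser",
    Sql=create_external_schema_sql,
)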

The scripts run in the following order:

  1. sp_create_db_group.sql
  2. sp_create_db_user.sql
  3. setup_lakeformation_demo.sql

Amazon Redshift resources

On the Amazon Redshift console, choose Clusters in the navigation pane and choose the cluster you created with the CloudFormation template. Then choose the Properties tab.

The Cluster permissions section lists three attached roles. The template used the admin role to provision Amazon Redshift database-level objects. The finance role is attached to the finance schema in Amazon Redshift, and the marketing role is attached to the marketing schema.

Each of these roles are given permissions in such a way that they can use the Amazon Redshift query editor to query Data Catalog tables using Redshift Spectrum. For more details, see Using Redshift Spectrum with AWS Lake Formation and Query the Data in the Data Lake Using Amazon Redshift Spectrum.

Query the data

We use Amazon Redshift query editor v2 to query the external schema and Data Catalog tables (external tables). The external schema is already created as part of the CloudFormation template. When an external schema is created from the Data Catalog, the tables in that catalog database automatically become available through Amazon Redshift as external tables.

  1. On the Amazon Redshift console, choose Query editor v2.
  2. Choose Configure account.
  3. Choose the database cluster.
  4. For Database, enter dev.
  5. For User name, enter awsuser.
  6. For Authentication, select Temporary credentials.
  7. Choose Create connection.

When you’re connected and logged in as administrator user, you can see both local and external schemas and tables, as shown in the following screenshot.

Validate role-based Lake formation permissions in Amazon Redshift

Next, we validate how the Lake Formation security settings work for the marketing and finance users.

  1. In the query editor, choose (right-click) the database connection.
  2. Choose Edit connection.
  3. For User name, enter marketing_ro.
  4. Choose Edit connection.
  5. Once connected as marketing_ro, choose the dev database under the cluster and navigate to the customer_activity table.
  6. Choose the refresh icon.
  7. Repeat these steps to edit the connection and update the user to finance_ro.
  8. Try again to refresh the dev database.

As expected, this user only has access to the allowed schema and tables.

With this solution, you can segregate different users at the schema level and use Lake Formation to make sure they can only see the tables and columns their role allows.

Column-level security with Lake Formation permissions

Lake Formation also allows you to set which columns a principal can or can’t see within a table. For example, when you select store_sales as the marketing_ro user, you see many columns, like customer_purchase_estimate. However, as the finance_ro user, you don’t see these columns.

Manual access control via the Lake Formation console

In this post, we’ve been working with a CloudFormation template-based environment, which is an automated way to create environment templates and simplify operations.

In this section, we show how you can set up all the configurations through the console, and we use another table as an example to walk you through the steps.

As demonstrated in previous steps, the marketing user in this environment has all column access to the tables customer_activity and store_sales in the external schema retail_datalake_marketing. We change some of that manually to see how it works using the console.

  1. On the Lake Formation console, choose Data lake permissions.
  2. Filter by the principal RedshiftMarketingRole.
  3. Select the principal for the store_sales table and choose Revoke.
  4. Confirm by choosing Revoke again.

A success message appears, and the permission row is no longer listed.

  1. Choose Grant to configure a new permission level for the marketing user on the store_sales table at the column level.
  2. Select IAM users and roles and choose your role.
  3. In the LF-Tags or catalog resources section, select Named data catalog resources.
  4. For Databases, choose your database.
  5. For Tables, choose the store_sales table.
  6. For Table permissions¸ check Select.
  7. In the Data permissions section, select Simple column-based access.
  8. Select Exclude columns.
  9. Choose the columns as shown in the following screenshot.
  10. Choose Grant.

We now query the table from Amazon Redshift again to confirm that the effective changes match the controls placed by Lake Formation. In the following query, we select a column that isn’t authorized:

/* Selecting columns not authorized will result in error. */
select s_country, ss_net_profit from retail_datalake_marketing.store_sales;

As expected, we get an error.

Clean up

Clean up the resources created by the CloudFormation template to avoid unnecessary cost to your AWS account. You can delete the CloudFormation stack by selecting it on the AWS CloudFormation console and choosing Delete. This action deletes all the resources it provisioned. If you manually updated a template-provisioned resource, you may encounter some issues during cleanup that you need to resolve manually.

Summary

In this post, we showed how you can integrate Lake Formation with Amazon Redshift to seamlessly control access to your Amazon S3 data lake. We also demonstrated how to query your data lake using Redshift Spectrum and external tables. This is a powerful mechanism that helps you build a modern data architecture to easily query data in your data lake and data warehouses together. We also saw how you can use CloudFormation templates to automate resource creation with infrastructure as code. You can use this to simplify your operations, especially when you want to replicate the resource setup from development to production environments during your project cycles.

Finally, we covered how data lake administrators can manually manage Data Catalog objects and grant or revoke access at the database, table, and column level. We encourage you to try the steps we outlined in this post and use the CloudFormation template to set up security in Lake Formation to control data lake access from Redshift Spectrum.

In the second post of this series, we focus on how you can take this concept and apply it across accounts using a Lake Formation data-sharing feature in a hub-and-spoke topography.


About the Authors

Vaibhav Agrawal is an Analytics Specialist Solutions Architect at AWS. Throughout his career, he has focused on helping customers design and build well-architected analytics and decision support platforms.

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with over 13 years of data warehousing experience. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Rajesh Francis is a Senior Analytics Customer Experience Specialist at AWS. He specializes in Amazon Redshift and focuses on helping to drive AWS market and technical strategy for data warehousing and analytics services. Rajesh works closely with large strategic customers to help them adopt our new services and features, develop long-term partnerships, and feed customer requirements back to our product development teams to guide our product roadmap.

Securely share your data across AWS accounts using AWS Lake Formation

Post Syndicated from Yumiko Kanasugi original https://aws.amazon.com/blogs/big-data/securely-share-your-data-across-aws-accounts-using-aws-lake-formation/

Data lakes have become very popular with organizations that want a centralized repository to store all their structured and unstructured data at any scale. Because data is stored as is, there is no need to convert it to a predefined schema in advance. When you have new business use cases, you can easily build new types of analyses on top of the data lake at any time.

In real-world use cases, it’s common to have requirements to share data stored within the data lake with multiple companies, organizations, or business units. For example, you may want to provide your data to stakeholders in another company for a co-marketing campaign between the two companies. For any of these use cases, the producer party wants to share data in a secure and effective manner, without having to copy the entire database.

In August 2019, we announced the general availability of AWS Lake Formation, a fully managed service that makes it easy to set up a secure data lake in days. AWS Lake Formation permission management capabilities simplify securing and managing distributed data lakes across multiple AWS accounts through a centralized approach, providing fine-grained access control to the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3) locations.

There are two options to share your databases and tables with another account by using Lake Formation cross-account access control:

  • Lake Formation tag-based access control (recommended)
  • Lake Formation named resources

In this post, I explain the differences between these two options, and walk you through the steps to configure cross-account sharing.

Overview of tag-based access control

Lake Formation tag-based access control is an authorization strategy that defines permissions based on attributes. In Lake Formation, these attributes are called LF-tags. You can attach LF-tags to Data Catalog resources and Lake Formation principals. Data lake administrators can assign and revoke permissions on Lake Formation resources using these LF-tags. For more details about tag-based access control, refer to Easily manage your data lake at scale using AWS Lake Formation Tag-based access control.

The following diagram illustrates the architecture of this method.

We recommend tag-based access control for the following use cases:

  • You have a large number of tables and principals that the data lake administrator has to grant access to
  • You want to classify your data based on an ontology and grant permissions based on classification
  • The data lake administrator wants to assign permissions dynamically, in a loosely coupled way

You can also use tag-based access control to share Data Catalog resources (databases, tables, and columns) with external AWS accounts.

Overview of named resources

The Lake Formation named resource method is an authorization strategy that defines permissions for resources. Resources include databases, tables, and columns. Data lake administrators can assign and revoke permissions on Lake Formation resources. See Cross-Account Access: How It Works for details.

The following diagram illustrates the architecture for this method.

We recommend using named resources if the data lake administrator prefers granting permissions explicitly to individual resources.

When you use the named resource method to grant Lake Formation permissions on a Data Catalog resource to an external account, Lake Formation uses AWS Resource Access Manager (AWS RAM) to share the resource.

Now, let’s take a closer look at how to configure cross-account access with these two options. We refer to the account that has the source table as the producer account, and the account that needs access to the source table as the consumer account.

Configure Lake Formation Data Catalog settings in the producer account

Lake Formation provides its own permission management model. To maintain backward compatibility with the AWS Identity and Access Management (IAM) permission model, the Super permission is granted to the group IAMAllowedPrincipals on all existing AWS Glue Data Catalog resources by default. Also, the Use only IAM access control settings are enabled for new Data Catalog resources.

In this post, we apply fine-grained access control using Lake Formation permissions and use IAM policies for coarse-grained access control. See Methods for Fine-Grained Access Control for details. Therefore, before you use an AWS CloudFormation template for a quick setup, you need to change the Lake Formation Data Catalog settings in the producer account.

This setting affects all newly created databases and tables, so we strongly recommend completing this tutorial in a non-production or new account. Also, if you’re using a shared account (such as your company’s dev account), make sure the change doesn’t affect other users’ resources. If you prefer to keep the default security settings, you must complete an extra step when sharing resources to other accounts, in which you revoke the default Super permission from IAMAllowedPrincipals on the database or table. We discuss the details later in this post.

To configure Lake Formation Data Catalog settings in the producer account, complete the following steps:

  1. Sign in to the producer account as an admin user, or a user with Lake Formation PutDataLakeSettings API permission.
  2. On the Lake Formation console, in the navigation pane, under Data catalog, choose Settings.
  3. Deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases
  4. Choose Save.

Additionally, you can remove the CREATE_DATABASE permission for IAMAllowedPrincipals under Administrative roles and tasks > Database creators. Only then is the ability to create a new database governed entirely through Lake Formation permissions.

Set up resources with AWS CloudFormation

We provide two CloudFormation templates in this post: one for the producer account, and one for the consumer account.

The CloudFormation template for the producer account generates the following resources:

  • An S3 bucket to serve as our data lake.
  • A Lambda function (for Lambda-backed AWS CloudFormation custom resources). We use the function to copy sample data files from the public S3 bucket to your S3 bucket.
  • IAM users and policies:
    • DataLakeAdminProducer
  • An AWS Glue Data Catalog database, table, and partition. Because we introduce two options for sharing resources across AWS accounts, this template creates two separate sets of databases and tables.
  • Lake Formation data lake settings and permissions. This includes:

The CloudFormation template for the consumer account generates the following resources:

  • IAM users and policies:
    • DataLakeAdminConsumer
    • DataAnalyst
  • An AWS Glue Data Catalog database. We use this database for creating resource links to shared resources.

Launch the CloudFormation stack in the producer account

To launch the CloudFormation stack in the producer account, complete the following steps:

  1. Sign in to the producer account’s AWS CloudFormation console in the target Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For Stack name, enter a stack name, such as stack-producer.
  5. For ProducerDatalakeAdminUserName and ProducerDatalakeAdminUserPassword, enter the user name and password you want for the data lake admin IAM user.
  6. For DataLakeBucketName, enter the name of your data lake bucket. This name needs to be globally unique.
  7. For DatabaseName and TableName, leave the default values.
  8. Choose Next.
  9. On the next page, choose Next.
  10. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  11. Choose Create stack.

Launch the CloudFormation stack in the consumer account

To launch the CloudFormation stack in the consumer account, complete the following steps:

  1. Sign in to the consumer account’s AWS CloudFormation console in the target Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For Stack name, enter a stack name, such as stack-consumer.
  5. For ConsumerDatalakeAdminUserName and ConsumerDatalakeAdminUserPassword, enter the user name and password you want for the data lake admin IAM user.
  6. For DataAnalystUserName and DataAnalystUserPassword, enter the user name and password you want for the data analyst IAM user.
  7. For DatabaseName, leave the default values.
  8. For AthenaQueryResultS3BucketName, enter the name of the S3 bucket that stores Amazon Athena query results. If you don’t have one, create an S3 bucket.
  9. Choose Next.
  10. On the next page, choose Next.
  11. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  12. Choose Create stack.

Stack creation can take about 1 minute.

(Optional) AWS KMS server-side encryption

If the source S3 bucket is encrypted using server-side encryption with an AWS Key Management Service (AWS KMS) customer master key (CMK), make sure the IAM role that Lake Formation uses to access S3 data is registered as the key user for the KMS CMK. By default, the IAM role AWSServiceRoleForLakeFormationDataAccess is used, but you can choose other IAM roles when registering an S3 data lake location. To register the Lake Formation role as the KMS key user, you can use the AWS KMS console, or directly add the permission to the key policy using the KMS PutKeyPolicy API and the AWS Command Line Interface (AWS CLI).

You don’t have to add individual consumer accounts to the key policy. Only the role that Lake Formation uses is required. Also, this step isn’t necessary if the source S3 bucket is encrypted with Amazon S3 managed keys (SSE-S3) or an AWS managed key.

To add a Lake Formation role as the KMS key user via the console, complete the following steps:

  1. Sign in to the AWS KMS console as the key administrator.
  2. In the navigation pane, under Customer managed keys, choose the key that is used to encrypt the source S3 bucket.
  3. Under Key users, choose Add.
  4. Select AWSServiceRoleForLakeFormationDataAccess and choose Add.

To use the AWS CLI, enter the following command (replace <key-id>, <name-of-key-policy>, and <key-policy> with valid values):

aws kms put-key-policy --key-id <key-id> --policy-name <name-of-key-policy> --policy <key-policy>

For more information, see put-key-policy.

Lake Formation cross-account sharing prerequisites

Before sharing resources with Lake Formation, there are prerequisites for both the tag-based access control method and named resource method.

Tag-based access control cross-account sharing prerequisites

As described in Lake Formation Tag-Based Access Control Cross-Account Prerequisites, before you can use the tag-based access control method to grant cross-account access to resources, you must add the following JSON permissions object to the AWS Glue Data Catalog resource policy in the producer account. This gives the consumer account permission to access the Data Catalog when glue:EvaluatedByLakeFormationTags is true. This condition is true for resources on which you granted permissions to the consumer account using Lake Formation LF-tags. This policy is required for every AWS account that you’re granting permissions to.

The following policy must be within a Statement element. We discuss the full IAM policy later in this post.

{
    "Effect": "Allow",
    "Action": [
        "glue:*"
    ],
    "Principal": {
        "AWS": [
            "<consumer-account-id>"
        ]
    },
    "Resource": [
        "arn:aws:glue:<region>:<account-id>:table/*",
        "arn:aws:glue:<region>:<account-id>:database/*",
        "arn:aws:glue:<region>:<account-id>:catalog"
    ],
    "Condition": {
        "Bool": {
            "glue:EvaluatedByLakeFormationTags": true
        }
    }
}

Named resource method cross-account sharing prerequisites

As described in Managing Cross-Account Permissions Using Both AWS Glue and Lake Formation, if there is no Data Catalog resource policy in your account, the Lake Formation cross-account grants that you make proceed as usual. However, if a Data Catalog resource policy exists, you must add the following statement to it to permit your cross-account grants to succeed if they’re made with the named resource method. If you plan to use only the named resource method, or only the tag-based access control method, you can skip this step. In this post, we evaluate both methods, so we need to add the following policy.

The following policy must be within a Statement element. We discuss the full IAM policy in the next section.

{
    "Effect": "Allow",
    "Action": [
        "glue:ShareResource"
    ],
    "Principal": {
        "Service": [
            "ram.amazonaws.com"
        ]
    },
    "Resource": [
        "arn:aws:glue:<region>:<account-id>:table/*/*",
        "arn:aws:glue:<region>:<account-id>:database/*",
        "arn:aws:glue:<region>:<account-id>:catalog"
    ]
}

Add the AWS Glue Data Catalog resource policy using the AWS CLI

If we grant cross-account permissions by using both the tag-based access control method and named resource method, we must set the EnableHybrid argument to ‘true’ when adding the preceding policies. Because this option isn’t currently supported on the console, we must use the glue:PutResourcePolicy API and AWS CLI.

First, create a policy document (such as policy.json) and add the preceding two policies. Replace <consumer-account-id> with the account ID of the AWS account receiving the grant, <region> with the Region of the Data Catalog containing the databases and tables that you are granting permissions on, and <account-id> with the producer AWS account ID.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ram.amazonaws.com"
            },
            "Action": "glue:ShareResource",
            "Resource": [
                "arn:aws:glue:<region>:<account-id>:table/*/*",
                "arn:aws:glue:<region>:<account-id>:database/*",
                "arn:aws:glue:<region>:<account-id>:catalog"
            ]
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<consumer-account-id>"
            },
            "Action": "glue:*",
            "Resource": [
                "arn:aws:glue:<region>:<account-id>:table/*/*",
                "arn:aws:glue:<region>:<account-id>:database/*",
                "arn:aws:glue:<region>:<account-id>:catalog"
            ],
            "Condition": {
                "Bool": {
                    "glue:EvaluatedByLakeFormationTags": "true"
                }
            }
        }
    ]
}

Enter the following AWS CLI command. Replace <glue-resource-policy> with the correct values (such as file://policy.json).

aws glue put-resource-policy --policy-in-json <glue-resource-policy> --enable-hybrid TRUE

For more information, see put-resource-policy.

Implement the Lake Formation tag-based access control method

In this section, we walk through the following high-level steps:

  1. Define an LF-tag.
  2. Assign the LF-tag to the target resource.
  3. Grant LF-tag permissions to the consumer account.
  4. Grant data permissions to the consumer account.
  5. Optionally, revoke permissions for IAMAllowedPrincipals on the database, tables, and columns.
  6. Create a resource link to the shared table.
  7. Create an LF-tag and assign it to the target database.
  8. Grant LF-tag data permissions to the consumer account.

Define an LF-tag

If you’re signed in to your producer account, sign out before completing the following steps.

  1. Sign in as the producer account data lake administrator. Use the producer account ID, IAM user name (the default is DatalakeAdminProducer), and password that you specified during CloudFormation stack creation.
  2. On the Lake Formation console, in the navigation pane, under Permissions, and under Administrative roles and tasks, choose LF-tags.
  3. Choose Add LF-tag.
  4. Specify the key and values. In this post, we create an LF-tag where the key is Confidentiality and the values are private, sensitive, and public.
  5. Choose Add LF-tag.
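
If you’d rather script this step, the equivalent boto3 call is roughly the following, run with the producer account data lake administrator’s credentials:

import boto3

lakeformation = boto3.client("lakeformation")

# Create the LF-tag used in this walkthrough.
lakeformation.create_lf_tag(
    TagKey="Confidentiality",
    TagValues=["private", "sensitive", "public"],
)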

Assign the LF-tag to the target resource

As a data lake administrator, you can attach tags to resources. If you plan to use a separate role, you may have to grant the Describe and Associate LF-tag permissions to that role.

  1. In the navigation pane, under Data catalog, select Databases.
  2. Select the target database (lakeformation_tutorial_cross_account_database_tbac) and on the Actions menu, choose Edit LF-tags.

For this post, we assign an LF-tag to a database, but you can also assign LF-tags to tables and columns.

  1. Choose Assign new LF-Tag.
  2. Add the key Confidentiality and value public.
  3. Choose Save.
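
Programmatically, attaching the tag to the database could look like the following sketch; the database name matches the one the CloudFormation template creates:

import boto3

lakeformation = boto3.client("lakeformation")

# Attach the Confidentiality=public LF-tag to the producer database.
lakeformation.add_lf_tags_to_resource(
    Resource={
        "Database": {"Name": "lakeformation_tutorial_cross_account_database_tbac"}
    },
    LFTags=[{"TagKey": "Confidentiality", "TagValues": ["public"]}],
)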

Grant LF-tag permission to the consumer account

Still in the producer account, we grant permissions to the consumer account to access the LF-tag.

  1. In the navigation pane, under Permissions, Administrative roles and tasks, LF-tag permissions, choose Grant.
  2. For Principals, choose External accounts.
  3. Enter the target AWS account ID.

AWS accounts within the same organization appear automatically. Otherwise, you have to manually enter the AWS account ID. As of this writing, Lake Formation tag-based access control doesn’t support granting permissions to organizations or organizational units.

  1. For LF-Tags, choose the key and values of the LF-tag that is being shared with the consumer account (key Confidentiality and value public).
  2. For Permissions, select Describe for LF-tag permissions.

LF-tag permissions are permissions given to the consumer account. Grantable permissions are permissions that the consumer account can grant to other principals.

  1. Choose Grant.

At this point, the consumer data lake administrator should be able to find the policy tag being shared via the consumer account Lake Formation console, under Permissions, Administrative roles and tasks, LF-tags.

Grant data permission to the consumer account

We will now provide data access to the consumer account by specifying an LF-Tag expression and granting the consumer account access to any table or database that matches the expression.

  1. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  2. For Principals, choose External accounts, and enter the consumer AWS account ID.
  3. For LF-tags or catalog resources, under Resources matched by LF-Tags (recommended), choose Add LF-Tag.
  4. Select the key and values of the tag that is being shared with the consumer account (key Confidentiality and value public).
  5. For Database permissions, select Describe under Database permissions to grant access permissions at the database level.
  6. Select Describe under Grantable permissions so the consumer account can grant database-level permissions to its users.
  7. For Table and column permissions, select Select and Describe under Table permissions.
  8. Select Select and Describe under Grantable permissions.
  9. Choose Grant.
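
A boto3 sketch of this tag-based, cross-account grant follows; the consumer account ID is a placeholder. The same call with ResourceType set to DATABASE and Permissions set to DESCRIBE covers the database-level grant.

import boto3

lakeformation = boto3.client("lakeformation")

consumer_account_id = "111122223333"  # placeholder consumer AWS account ID

# Grant SELECT and DESCRIBE on all tables matching the LF-tag expression, and
# allow the consumer account to pass those permissions on to its own principals.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": consumer_account_id},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "Confidentiality", "TagValues": ["public"]}],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
)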

Revoke permission for IAMAllowedPrincipals on the database, tables, and columns (Optional)

At the very beginning of this tutorial, we changed the Lake Formation Data Catalog settings. If you skipped that part, this step is required. If you changed your Lake Formation Data Catalog settings, you can skip this step.

In this step, we have to revoke the default Super permission from IAMAllowedPrincipals on the database or table. See Secure Existing Data Catalog Resources for details.

Before revoking permission for IAMAllowedPrincipals, make sure that you granted existing IAM principals with necessary permission through Lake Formation. This includes two steps:

  1. Add an IAM permission policy to the target IAM user or role that allows the Lake Formation GetDataAccess action.
  2. Grant the target IAM user or role the necessary Lake Formation data permissions (alter, select, and so on).

Only then should you revoke permissions for IAMAllowedPrincipals; otherwise, existing IAM principals may no longer be able to access the target database or catalog.

Revoking Super permission for IAMAllowedPrincipals is required when you want to apply the Lake Formation permission model (instead of the IAM policy model) to manage user access within a single account or among multiple accounts using the Lake Formation permission model. You don’t have to revoke permission of IAMAllowedPrincipals for other tables where you want to keep the traditional IAM policy model.

At this point, the consumer account data lake administrator should be able to find the database and table being shared via the consumer account Lake Formation console, under Data catalog, Databases. If not, confirm if the following are properly configured:

  • Make sure the correct policy tag and values are assigned to the target databases and tables
  • Make sure the correct tag permission and data permission are assigned to the consumer account
  • Revoke the default super permission from IAMAllowedPrincipals on the database or table

Create a resource link to the shared table

When a resource is shared between accounts, the shared resources are not put in the consumer account’s Data Catalog. To make them available, and to query the underlying data of a shared table using services like Athena, we need to create a resource link to the shared table. A resource link is a Data Catalog object that is a link to a local or shared database or table. By creating a resource link, you can:

  • Assign a different name to a database or table that aligns with your Data Catalog resource naming policies
  • Use services such as Athena and Amazon Redshift Spectrum to query shared databases or tables

To create a resource link, complete the following steps:

  1. If you’re signed in to your consumer account, sign out.
  2. Sign in as the consumer account data lake administrator. Use the consumer account ID, IAM user name (default DatalakeAdminConsumer) and password that you specified during CloudFormation stack creation.
  3. On the Lake Formation console, in the navigation pane, under Data catalog, Databases, choose the shared database lakeformation_tutorial_cross_account_database_tbac.

If you don’t see the database, revisit the previous steps to see if everything is properly configured.

  1. Choose View tables.
  2. Choose the shared table amazon_reviews_table_tbac.
  3. On the Actions menu, choose Create resource link.
  4. For Resource link name, enter a name (for this post, amazon_reviews_table_tbac_resource_link).
  5. Under Database, select the database that the resource link is created in (for this post, the CloudFormation stack created the database lakeformation_tutorial_cross_account_database_consumer).
  6. Choose Create.

The resource link appears under Data catalog, Tables.
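
Resource links can also be created programmatically with the AWS Glue CreateTable API. The following is a sketch using the names from this walkthrough, with the producer account ID as a placeholder:

import boto3

glue = boto3.client("glue")

producer_account_id = "444455556666"  # placeholder producer AWS account ID

# Create a resource link in the consumer database that points to the shared table.
glue.create_table(
    DatabaseName="lakeformation_tutorial_cross_account_database_consumer",
    TableInput={
        "Name": "amazon_reviews_table_tbac_resource_link",
        "TargetTable": {
            "CatalogId": producer_account_id,
            "DatabaseName": "lakeformation_tutorial_cross_account_database_tbac",
            "Name": "amazon_reviews_table_tbac",
        },
    },
)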

Create an LF-tag and assign it to the target database

Lake Formation tags reside in the same catalog as the resources. This means that tags created in the producer account aren’t available to use when granting access to the resource links in the consumer account. You need to create a separate set of LF-tags in the consumer account to use LF tag-based access control when sharing the resource links in the consumer account. Let’s first create the LF-tag. Refer to the previous sections for full instructions.

  1. Define the LF-tag in the consumer account. For this post, we use key Division and values sales, marketing, and analyst.
  2. Assign the LF-tag key Division and value analyst to the database lakeformation_tutorial_cross_account_database_consumer, in which the resource link is created.

Grant LF-tag data permission to the consumer

As a final step, we grant LF-tag data permission to the consumer.

  1. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  2. For Principals, choose IAM users and roles, and choose the user DataAnalyst.
  3. For LF-tags or catalog resources, choose Resources matched by LF-tags (recommended).
  4. Choose key Division and value analyst.
  5. For Database permissions, select Describe under Database permissions.
  6. For Table and column permissions, select Select and Describe under Table permissions.
  7. Choose Grant.
  8. Repeat these steps for user DataAnalyst, where the LF-tag key is Confidentiality and value is public.

At this point, the data analyst user in the consumer account should be able to find the database and resource link, and query the shared table via the Athena console.

If not, confirm if the following are properly configured:

  • Make sure the resource link is created for the shared table
  • Make sure you granted the user access to the LF-tag shared by the producer account
  • Make sure you granted the user access to the LF-tag associated to the resource link and database that the resource link is created in
  • Check if you assigned the correct LF-tag to the resource link, and to the database that the resource link is created in

Implement the Lake Formation named resource method

To use the named resource method, we walk through the following high-level steps:

  1. Optionally, revoke permission for IAMAllowedPrincipals on the database, tables, and columns.
  2. Grant data permission to the consumer account.
  3. Accept a resource share from AWS RAM.
  4. Create a resource link for the shared table.
  5. Grant data permission for the shared table to the consumer.
  6. Grant data permission for the resource link to the consumer.

Revoke permission for IAMAllowedPrincipals on the database, tables, and columns (Optional)

At the very beginning of this tutorial, we changed Lake Formation Data Catalog settings. If you skipped that part, this step is required. For instructions, see the optional step in the previous section.

Grant data permission to the consumer account

If you’re signed in to producer account as another user, sign out first.

  1. Sign in as the producer account data lake administrator using the AWS account ID, IAM user name (default is DatalakeAdminProducer), and password specified during CloudFormation stack creation.
  2. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  3. For Principals, choose External accounts, and enter one or more AWS account IDs or AWS Organizations IDs.

Organizations that the producer account belongs to and AWS accounts within the same organization appear automatically. Otherwise, manually enter the account ID or organization ID.

  1. For LF-tags or catalog resources, choose Named data catalog resources.
  2. Under Databases, choose the database lakeformation_tutorial_cross_account_database_named_resource.
  3. Under Tables, choose All tables.
  4. For Table and column permissions, select Select and Describe under Table permissions.
  5. Select Select and Describe under Grantable permissions.
  6. Optionally, for Data permissions, select Simple column-based access if column-level permission management is required.
  7. Choose Grant.

If you haven’t revoked permission for IAMAllowedPrincipals, you get a Grant permissions failed error.

At this point, you should see the target table being shared via AWS RAM with the consumer account under Permissions, Data permissions.
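
A boto3 sketch of this named resource, cross-account grant for all tables in the database is shown below; the consumer account ID is a placeholder.

import boto3

lakeformation = boto3.client("lakeformation")

consumer_account_id = "111122223333"  # placeholder consumer AWS account ID

# Grant SELECT and DESCRIBE on all tables in the shared database, with the
# option for the consumer account to re-grant those permissions.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": consumer_account_id},
    Resource={
        "Table": {
            "DatabaseName": "lakeformation_tutorial_cross_account_database_named_resource",
            "TableWildcard": {},
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
)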

Accept a resource share from AWS RAM

This step is required only for account ID-based sharing, not for organization-based sharing.

  1. Sign in as the consumer account data lake administrator using the IAM user name (default is DatalakeAdminConsumer) and password specified during CloudFormation stack creation.
  2. On the AWS RAM console, in the navigation pane, under Shared with me, Resource shares, choose the shared Lake Formation resource.

The Status should be Pending.

  1. Confirm the resource details, and choose Accept resource share.

At this point, the consumer account data lake administrator should be able to find the shared resource on the Lake Formation console under Data catalog, Databases.
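
To accept the invitation programmatically instead of using the console, a sketch like the following could be run in the consumer account:

import boto3

ram = boto3.client("ram")

# Find pending resource share invitations for this account and accept them.
invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
for invitation in invitations:
    if invitation["status"] == "PENDING":
        ram.accept_resource_share_invitation(
            resourceShareInvitationArn=invitation["resourceShareInvitationArn"]
        )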

Create a resource link for the shared table

Follow the instructions detailed earlier to create a resource link for a shared table. Name the resource link amazon_reviews_table_named_resource_resource_link. Create the resource link in the database lakeformation_tutorial_cross_account_database_consumer.

Grant data permission for the shared table to the consumer

To grant data permission for the shared table to the consumer, complete the following steps:

  1. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  2. For Principals, choose IAM users and roles, and choose the user DataAnalyst.
  3. For LF-tags or catalog resources, choose Named data catalog resources.
  4. Under Databases, choose the database lakeformation_tutorial_cross_account_database_named_resource.

If you don’t see the database on the drop-down list, choose Load more.

  1. Under Tables, choose the table amazon_reviews_table_named_resource.
  2. For Table and column permissions, select Select and Describe under Table permissions.
  3. Choose Grant.

Grant data permission for the resource link to the consumer

In addition to granting the data lake user permission to access the shared table, you also need to grant the data lake user permission to access the resource link.

  1. In the navigation pane, under Permissions, Data lake permissions, choose Grant.
  2. For Principals, choose IAM users and roles, and choose the user DataAnalyst.
  3. For LF-tags or catalog resources, choose Named data catalog resources.
  4. Under Databases, choose the database lakeformation_tutorial_cross_account_database_consumer.
  5. Under Tables, choose the table amazon_reviews_table_named_resource_resource_link.
  6. For Resource link permissions, select Describe under Resource link permissions.
  7. Choose Grant.

At this point, the data analyst user in the consumer account should be able to find the database and resource link, and query the shared table via the Athena console.

If not, confirm that the following are properly configured:

  • Make sure the resource link is created for the shared table
  • Make sure you granted the user access to the table shared by the producer account
  • Make sure you granted the user access to the resource link and database that the resource link is created in
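
As a quick check, the DataAnalyst user can also query the resource link programmatically. The following sketch uses the AWS SDK for pandas (awswrangler); the Athena workgroup shown is the default one and may differ in your setup.

import awswrangler as wr

# Query the shared table through the resource link in the consumer database.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM amazon_reviews_table_named_resource_resource_link LIMIT 10",
    database="lakeformation_tutorial_cross_account_database_consumer",
    ctas_approach=False,  # read results directly; avoids needing CTAS create permissions
    workgroup="primary",
)
print(df.head())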

Clean up

To clean up the resources created within this tutorial, delete or change the following resources:

  • Producer account:
    • AWS RAM resource share
    • Lake Formation tags
    • CloudFormation stack
    • Lake Formation settings
    • AWS Glue Data Catalog settings
  • Consumer account:
    • Lake Formation tags
    • CloudFormation stack

Summary

Lake Formation cross-account sharing enables you to share data across AWS accounts without copying the underlying data, and it gives both the producer and the consumer flexible control over data permissions. In this post, we introduced two options for referencing catalog data in another account using the cross-account access features of Lake Formation:

  • Tag-based access control
  • Named resource

The tag-based access control method is recommended when many resources and entities are involved. Although it requires more setup steps, it lets data lake administrators dynamically manage the relationship between each user and table through tags. The named resource method gives the data lake administrator a more straightforward way to manage catalog permissions. Choose the method that best fits your requirements.


About the author

Yumiko Kanasugi is a Solutions Architect with Amazon Web Services Japan, helping digital native business customers utilize AWS.

Build a modern data architecture on AWS with Amazon AppFlow, AWS Lake Formation, and Amazon Redshift

Post Syndicated from Dr. Yannick Misteli original https://aws.amazon.com/blogs/big-data/build-a-modern-data-architecture-on-aws-with-amazon-appflow-aws-lake-formation-and-amazon-redshift/

This is a guest post written by Dr. Yannick Misteli, who leads cloud platform and ML engineering in global product strategy (GPS) at Roche.

Roche recently launched the Roche Data Insights (RDI) initiative to realize its vision of building shared, interoperable data and insights with federated governance through new ways of working and collaborating, and to establish a simplified, integrated data landscape that empowers insights communities. One of the first domains to engage in this program is the Go-to-Market (GTM) area, which comprises sales, marketing, medical access, and market affairs at Roche. The GTM domain enables Roche to understand customers and ultimately create and deliver valuable services that meet their needs. GTM extends beyond health care professionals (HCPs) to a larger healthcare ecosystem consisting of patients, communities, health authorities, payers, providers, academia, and competitors. Therefore, data and analytics are key to supporting internal and external stakeholders in their decision-making processes through actionable insights.

Roche GTM built a modern data and machine learning (ML) platform on AWS while utilizing DevOps best practices. The mantra of everything as code (EaC) was key in building a fully automated, scalable data lake and data warehouse on AWS.

In this post, you learn about how Roche used AWS products and services such as Amazon AppFlow, AWS Lake Formation, and Amazon Redshift to provision and populate their data lake; how they sourced, transformed, and loaded data into the data warehouse; and how they realized best practices in security and access control.

In the following sections, we dive deep into the scalable, secure, and automated modern data platform that Roche has built. We demonstrate how to automate data ingestion, enforce security standards, and apply DevOps best practices to ease management of your modern data platform on AWS.

Data platform architecture

The following diagram illustrates the data platform architecture.

The architecture contains the components described in the following sections.

Lake Formation security

We use Lake Formation to secure all data as it lands in the data lake. Separating each data lake layer into distinct S3 buckets and prefixes enables fine-grained access control policies that Lake Formation implements. This concept also extends to locking down access to specific rows and columns and applying policies to specific IAM roles and users. Governance and access to data lake resources is difficult to manage, but Lake Formation simplifies this process for administrators.

To secure access to the data lake using Lake Formation, the following steps are automated using the AWS CDK with customized constructs:

  1. Register the S3 data buckets and prefixes, and corresponding AWS Glue databases with Lake Formation.
  2. Add data lake administrators (GitLab runner IAM deployment role and administrator IAM role).
  3. Grant the AWS Glue job IAM roles access to the specific AWS Glue databases.
  4. Grant the AWS Lambda IAM role access to the Amazon AppFlow databases.
  5. Grant the listed IAM roles access to the corresponding tables in the AWS Glue databases.
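
A minimal sketch of what such a customized construct might look like with the AWS CDK v2 in Python follows; the role ARNs, bucket name, and database name are placeholders, not Roche's actual values.

from aws_cdk import Stack, aws_lakeformation as lakeformation
from constructs import Construct

class LakeFormationSecurityStack(Stack):
    """Sketch of automating steps 1-3 above; all identifiers are placeholders."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # 1. Register the raw-zone bucket with Lake Formation.
        lakeformation.CfnResource(
            self, "RawZoneLocation",
            resource_arn="arn:aws:s3:::example-raw-zone-bucket",
            use_service_linked_role=True,
        )

        # 2. Add data lake administrators (deployment role shown; ARN is a placeholder).
        lakeformation.CfnDataLakeSettings(
            self, "DataLakeSettings",
            admins=[
                lakeformation.CfnDataLakeSettings.DataLakePrincipalProperty(
                    data_lake_principal_identifier="arn:aws:iam::111122223333:role/gitlab-runner-deploy-role"
                )
            ],
        )

        # 3. Grant an AWS Glue job role access to a specific AWS Glue database.
        lakeformation.CfnPermissions(
            self, "GlueJobDbPermissions",
            data_lake_principal=lakeformation.CfnPermissions.DataLakePrincipalProperty(
                data_lake_principal_identifier="arn:aws:iam::111122223333:role/glue-job-role"
            ),
            resource=lakeformation.CfnPermissions.ResourceProperty(
                database_resource=lakeformation.CfnPermissions.DatabaseResourceProperty(
                    name="raw_zone_db"
                )
            ),
            permissions=["CREATE_TABLE", "ALTER", "DESCRIBE"],
        )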

AWS Glue Data Catalog

The AWS Glue Data Catalog is the centralized registration and access point for all databases and tables that are created in both the data lake and in Amazon Redshift. This provides centralized transparency to all resources along with their schemas and the location of all data that is referenced. This is a critical aspect for any data operations performed within the lake house platform.

Data sourcing and ingestion

Data is sourced and loaded into the data lake through the use of AWS Glue jobs and Amazon AppFlow. The ingested data is made available in the Amazon Redshift data warehouse through Amazon Redshift Spectrum using external schemas and tables. The process of creating the external schemas and linking them to the Data Catalog is outlined later in this post.

Amazon AppFlow Salesforce ingestion

Amazon AppFlow is a fully managed integration service that allows you to pull data from sources such as Salesforce, SAP, and Zendesk. Roche integrates with Salesforce to load Salesforce objects securely into their data lake without needing to write any custom code. Roche also pushes ML results back to Salesforce using Amazon AppFlow to facilitate the process.

Salesforce objects are first fully loaded into Amazon S3 and then are flipped to a daily incremental load to capture deltas. The data lands in the raw zone bucket in Parquet format using the date as a partition. The Amazon AppFlow flows are created through the use of a YAML configuration file (see the following code). This configuration is consumed by the AWS CDK deployment to create the corresponding flows.

appflow:
  flow_classes:
    salesforce:
      source: salesforce
      destination: s3
      incremental_load: 1
      schedule_expression: "rate(1 day)"
      s3_prefix: na
      connector_profile: roche-salesforce-connector-profile1,roche-salesforce-connector-profile2
      description: appflow flow from Salesforce
      environment: all
  flows:  # per-object flow definitions (key name assumed to keep the YAML valid)
    - name: Account
      incremental_load: 1
      bookmark_col: appflow_date_str
    - name: CustomSalesforceObject
      pii: 0
      bookmark_col: appflow_date_str
      upsert_field_list: upsertField
      s3_prefix: prefix
      source: s3
      destination: salesforce
      schedule_expression: na
      connector_profile: roche-salesforce-connector-profile

The YAML configuration makes it easy to select whether data should be loaded from an S3 bucket back to Salesforce or from Salesforce to an S3 bucket. This configuration is subsequently read by the AWS CDK app and corresponding stacks to translate into Amazon AppFlow flows.

The following options are specified in the preceding YAML configuration file:

  • source – The location to pull data from (Amazon S3, Salesforce)
  • destination – The destination to put data to (Amazon S3, Salesforce)
  • object_name – The name of the Salesforce object to interact with
  • incremental_load – A Boolean specifying if the load should be incremental or full (0 means full, 1 means incremental)
  • schedule_expression – The cron or rate expression to run the flow (na makes it on demand)
  • s3_prefix – The prefix to push or pull the data from in the S3 bucket
  • connector_profile – The Amazon AppFlow connector profile name to use when connecting to Salesforce (can be a CSV list)
  • environment – The environment to deploy this Amazon AppFlow flow to (all means deploy to dev and prod, dev means development environment, prod means production environment)
  • upsert_field_list – The set of Salesforce object fields (can be a CSV list) to use when performing an upsert operation back to Salesforce (only applicable when loading data from an S3 bucket back to Salesforce)
  • bookmark_col – The name of the column to use in the Data Catalog for registering the daily load date string partition

Register Salesforce objects to the Data Catalog

Complete the following steps to register data loaded into the data lake with the Data Catalog and link it to Amazon Redshift:

  1. Gather Salesforce object fields and corresponding data types.
  2. Create a corresponding AWS Glue database in the Data Catalog.
  3. Run a query against Amazon Redshift to create an external schema that links to the AWS Glue database.
  4. Create tables and partitions in the AWS Glue database and tables.

Data is accessible via the Data Catalog and the Amazon Redshift cluster.

Amazon AppFlow dynamic field gathering

To construct the schema of a loaded Salesforce object in the data lake, you invoke the following Python function. The code uses an Amazon AppFlow client from Boto3 to dynamically gather the Salesforce object's fields and their data types.

import boto3

client = boto3.client('appflow')

def get_salesforce_object_fields(object_name: str, connector_profile: str):
    """
    Gathers the Salesforce object and its corresponding fields.

    Parameters:
        object_name (str) = the name of the Salesforce object to consume.
        connector_profile (str) = the name of the Amazon AppFlow connector profile.

    Returns:
        object_fields (list) = a list of dictionaries describing the object's fields and data types.
    """
    print("Gathering Object Fields")

    object_fields = []

    response = client.describe_connector_entity(
        connectorProfileName=connector_profile,
        connectorEntityName=object_name,
        connectorType='Salesforce'
    )

    for obj in response['connectorEntityFields']:
        object_fields.append(
            {'field': obj['identifier'], 'data_type': obj['supportedFieldTypeDetails']['v1']['fieldType']})

    return object_fields

We use the function for both the creation of the Amazon AppFlow flow via the AWS CDK deployment and for creating the corresponding table in the Data Catalog in the appropriate AWS Glue database.

Create an Amazon CloudWatch Events rule, AWS Glue table, and partition

To add new tables (one per Salesforce object loaded into Amazon S3) and partitions to the Data Catalog automatically, you create an Amazon CloudWatch Events rule. This automation enables you to query the data in both AWS Glue and Amazon Redshift.

After the Amazon AppFlow flow is complete, it emits an event that triggers the CloudWatch Events rule, which invokes a corresponding Lambda function to either create a new table in AWS Glue or add a new partition with the corresponding date string for the current day. The CloudWatch Events rule looks like the following screenshot.
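
The rule can also be created programmatically. The following boto3 sketch is illustrative only: the detail type shown is the event Amazon AppFlow emits when a flow run completes (verify it in your account), and the Lambda function ARN is a placeholder.

import json
import boto3

events = boto3.client("events")

# Match AppFlow flow-completion events (detail type assumed; verify in your account).
events.put_rule(
    Name="appflow-flow-complete",
    EventPattern=json.dumps({
        "source": ["aws.appflow"],
        "detail-type": ["AppFlow End Flow Run Report"],
    }),
    State="ENABLED",
)

# Invoke the catalog-registration Lambda function (the ARN is a placeholder).
events.put_targets(
    Rule="appflow-flow-complete",
    Targets=[{
        "Id": "register-catalog-partition",
        "Arn": "arn:aws:lambda:eu-west-1:111122223333:function:appflow-catalog-registration",
    }],
)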

The invoked Lambda function uses the AWS Data Wrangler (awswrangler) Python package to interact with the Data Catalog. Using the preceding function definition, the object fields and their data types are accessible to pass to the following function call:

import awswrangler as wr

def create_external_parquet_table(
    database_name: str,
    table_name: str,
    s3_path: str,
    columns_map: list,
    partition_map: dict
):
    """
    Creates a new external table in Parquet format.

    Parameters:
        database_name (str) = the name of the database to create the table in.
        table_name (str) = the name of the table to create.
        s3_path (str) = the S3 path to the data set.
        columns_map (list) = a list of dictionaries describing the columns and their data types, as returned by appflow_utility.get_salesforce_object_fields
        partition_map (dict) = a map of the partitions for the parquet table as {'column_name': 'column_type'}

    Returns:
        table_metadata (dict) = metadata about the table that was created.
    """

    column_type_map = {}

    # Map each field returned by get_salesforce_object_fields to its data type.
    for field in columns_map:
        column_type_map[field['field']] = field['data_type']

    return wr.catalog.create_parquet_table(
        database=database_name,
        table=table_name,
        path=s3_path,
        columns_types=column_type_map,
        partitions_types=partition_map,
        description=f"AppFlow ingestion table for {table_name} object"
    )

If the table already exists, the Lambda function creates a new partition to account for the date in which the flow completed (if it doesn’t already exist):

import awswrangler as wr
from botocore.exceptions import ClientError

def create_parquet_table_date_partition(
    database_name: str,
    table_name: str,
    s3_path: str,
    year: str,
    month: str,
    day: str
):
    """
    Creates a new partition by the date (YYYYMMDD) on an existing parquet table.

    Parameters:
        database_name (str) = the name of the database to create the table in.
        table_name (str) = the name of the table to create.
        s3_path (str) = the S3 path to the data set.
        year (str) = the current year for the partition (YYYY format).
        month (str) = the current month for the partition (MM format).
        day (str) = the current day for the partition (DD format).

    Returns:
        table_metadata (dict) = metadata about the table that has a new partition.
    """

    date_str = f"{year}{month}{day}"

    return wr.catalog.add_parquet_partitions(
        database=database_name,
        table=table_name,
        partitions_values={
            f"{s3_path}/{year}/{month}/{day}": [date_str]
        }
    )

def table_exists(
    database_name: str,
    table_name: str
):
    """
    Checks if a table exists in the Glue catalog.

    Parameters:
        database_name (str) = the name of the Glue database where the table should be.
        table_name (str) = the name of the table.

    Returns:
        exists (bool) = returns True if the table exists and False if it does not exist.
    """

    try:
        wr.catalog.table(database=database_name, table=table_name)
        return True
    except ClientError:
        return False
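
The following is a minimal sketch of how these helpers might be wired together in the Lambda handler; all database, table, bucket, and connector profile values are placeholders, not Roche's actual code.

# Assumes the helper functions defined above are in scope; all values are placeholders.
fields = get_salesforce_object_fields(
    object_name="Account",
    connector_profile="roche-salesforce-connector-profile1",
)

if not table_exists("appflow_salesforce_db", "account"):
    create_external_parquet_table(
        database_name="appflow_salesforce_db",
        table_name="account",
        s3_path="s3://example-raw-zone-bucket/salesforce/account",
        columns_map=fields,
        partition_map={"appflow_date_str": "string"},
    )

create_parquet_table_date_partition(
    database_name="appflow_salesforce_db",
    table_name="account",
    s3_path="s3://example-raw-zone-bucket/salesforce/account",
    year="2021",
    month="11",
    day="15",
)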

Amazon Redshift external schema query

An AWS Glue database is created for each Amazon AppFlow connector profile that is present in the preceding configuration. The objects that are loaded from Salesforce into Amazon S3 are registered as tables in the Data Catalog under the corresponding database. To link the database in the Data Catalog with an external Amazon Redshift schema, run the following query:

CREATE EXTERNAL SCHEMA ${appflow_connector_profile_name}_ext from data catalog
database '${appflow_connector_profile_name}'
iam_role 'arn:aws:iam::${AWS_ACCOUNT_ID}:role/RedshiftSpectrumRole'
region 'eu-west-1';

The specified iam_role value must be an IAM role created ahead of time and must have the appropriate access policies specified to query the Amazon S3 location.

Now all the tables available in the Data Catalog can be queried with SQL from Amazon Redshift through Amazon Redshift Spectrum.

Amazon AppFlow Salesforce destination

Roche trains and invokes ML models using data found in the Amazon Redshift data warehouse. After the ML models are complete, the results are pushed back into Salesforce. Through the use of Amazon AppFlow, we can achieve the data transfer without writing any custom code. The schema of the results must match the schema of the corresponding Salesforce object, and the results must be written in either JSON Lines or CSV format in order to be written back into Salesforce.

AWS Glue Jobs

To source on-premises data feeds into the data lake, Roche has built a set of AWS Glue jobs in Python. Various external sources, including databases and APIs, are loaded directly into the raw zone S3 bucket. The AWS Glue jobs run on a daily basis to load new data. The loaded data follows a YYYYMMDD partitioning scheme and is converted into Parquet format for more efficient storage and querying.
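
The following sketch shows this general pattern using the AWS Data Wrangler (awswrangler) package; the bucket, database, and table names are placeholders, not Roche's actual job code.

import awswrangler as wr
import pandas as pd

# In a real job, df would come from the external database or API extract.
df = pd.DataFrame({"id": [1, 2], "amount": [10.5, 20.0], "load_date": ["20211115", "20211115"]})

# Write Parquet to the raw zone, partitioned by the YYYYMMDD load date,
# and register/update the table in the Data Catalog (names are placeholders).
wr.s3.to_parquet(
    df=df,
    path="s3://example-raw-zone-bucket/source_system/table_name/",
    dataset=True,
    partition_cols=["load_date"],
    mode="append",
    database="raw_zone_db",
    table="source_table",
)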

Amazon EKS and Kubeflow

To deploy ML models on Amazon EKS, Roche uses Kubeflow on Amazon EKS. The use of Amazon EKS as the backbone infrastructure makes it easy to build, train, test, and deploy ML models and interact with Amazon Redshift as a data source.

Firewall Manager

As an added layer of security, Roche takes extra precautions through the use of Firewall Manager. This allows Roche to explicitly deny or allow inbound and outbound traffic through the use of stateful and stateless rule sets. It also enables Roche to allow certain outbound access to external websites and deny websites that resources inside their Amazon VPC should not have access to. This is especially critical when dealing with sensitive datasets, to ensure that data is secured and can't be moved externally.

CI/CD

All the infrastructure outlined in the architecture diagram was automated and deployed to multiple AWS Regions using a continuous integration and continuous delivery (CI/CD) pipeline with GitLab Runners as the orchestrator. The GitFlow model was used for branching and invoking automated deployments to the Roche AWS accounts.

Infrastructure as code and AWS CDK

Infrastructure as code (IaC) best practices were used to facilitate the creation of all infrastructure. The Roche team uses the Python AWS CDK to deploy, version, and maintain any changes that occur to the infrastructure in their AWS account.

AWS CDK project structure

The top level of the project structure in GitLab is organized into folders (though not limited to those described below) in order to keep infrastructure and application code in one location.

To organize the various resources that are created in the Roche account, the deployment was broken into the following AWS CDK apps, each of which encompasses multiple stacks:

  • core
  • data_lake
  • data_warehouse

The core app contains all the stacks related to account setup and account bootstrapping, such as:

  • VPC creation
  • Initial IAM roles and policies
  • Security guardrails

The data_lake app contains all the stacks related to creating the AWS data lake, such as:

  • Lake Formation setup and registration
  • AWS Glue database creation
  • S3 bucket creation
  • Amazon AppFlow flow creation
  • AWS Glue job setup

The data_warehouse app contains all the stacks related to setting up the data warehouse infrastructure, such as:

  • Amazon Redshift cluster
  • Load balancer to Amazon Redshift cluster
  • Logging

This AWS CDK project structure was chosen to logically group stacks that rely on each other and to keep the deployment flexible: each part of the provisioning can be broken out by function and deployed only when it's actually required.
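
A minimal sketch of what the entry point of one of these apps might look like follows; the module paths, stack names, and cross-stack references are hypothetical, not the actual Roche project layout.

# data_lake/app.py - sketch of one CDK app (the data_lake app) wiring its stacks together.
import aws_cdk as cdk

from data_lake.buckets_stack import DataLakeBucketsStack            # hypothetical modules
from data_lake.lake_formation_stack import LakeFormationSecurityStack
from data_lake.appflow_stack import AppFlowStack

app = cdk.App()

buckets = DataLakeBucketsStack(app, "data-lake-buckets")
LakeFormationSecurityStack(app, "data-lake-lakeformation")           # hypothetical stack names
AppFlowStack(app, "data-lake-appflow")

app.synth()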

AWS CDK project configuration

Project configurations are flexible and abstracted away into YAML configuration files. For example, Roche has simplified the process of creating a new Amazon AppFlow flow and can add or remove flows as needed simply by adding a new entry to their YAML configuration. The next time the GitLab runner deployment occurs, it picks up the changes during AWS CDK synthesis and generates a new change set with the new set of resources. This configuration and setup keeps things dynamic and flexible while decoupling configuration from code.

Network architecture

The following diagram illustrates the network architecture.

We can break down the architecture into the following:

  • All AWS services are deployed in two Availability Zones (except Amazon Redshift)
  • Only private subnets have access to the on-premises Roche environment
  • Services are deployed in backend subnets
  • Perimeter protection using AWS Network Firewall
  • A network load balancer publishes services to the on-premises environment

Network security configurations

Infrastructure, configuration, and security are defined as code in AWS CDK, and Roche uses a CI/CD pipeline to manage and deploy them. Roche has an AWS CDK application to deploy the core services of the project: VPC, VPN connectivity, and AWS security services (AWS Config, Amazon GuardDuty, and AWS Security Hub). The VPC contains four network layers deployed in two Availability Zones, and they have VPC endpoints to access AWS services like Amazon S3, Amazon DynamoDB, and Amazon Simple Queue Service (Amazon SQS). They limit internet access using AWS Network Firewall.

The infrastructure is defined as code and the configuration is kept separate. Roche performed the VPC setup by running the CI/CD pipeline to deploy their infrastructure. The configuration lives in a specific external file; if Roche wants to change any value of the VPC, they simply modify this file and run the pipeline again, without writing any new lines of code. This makes changes simple to roll out to the environment, keeps them transparent and easy to configure, and improves traceability of the configuration, which simplifies change approval.

The following code is an example of the VPC configuration:

"test": {
        "vpc": {
            "name": "",
            "cidr_range": "192.168.40.0/21",
            "internet_gateway": True,
            "flow_log_bucket": shared_resources.BUCKET_LOGGING,
            "flow_log_prefix": "vpc-flow-logs/",
        },
        "subnets": {
            "private_subnets": {
                "private": ["192.168.41.0/25", "192.168.41.128/25"],
                "backend": ["192.168.42.0/23", "192.168.44.0/23"],
            },
            "public_subnets": {
                "public": {
                    "nat_gateway": True,
                    "publics_ip": True,
                    "cidr_range": ["192.168.47.64/26", "192.168.47.128/26"],
                }
            },
            "firewall_subnets": {"firewall": ["192.168.47.0/28", "192.168.47.17/28"]},
        },
        ...
         "vpc_endpoints": {
            "subnet_group": "backend",
            "services": [
                "ec2",
                "ssm",
                "ssmmessages",
                "sns",
                "ec2messages",
                "glue",
                "athena",
                "secretsmanager",
                "ecr.dkr",
                "redshift-data",
                "logs",
                "sts",
            ],
            "gateways": ["dynamodb", "s3"],
            "subnet_groups_allowed": ["backend", "private"],
        },
        "route_53_resolvers": {
            "subnet": "private",
        ...

The advantages of this approach are as follows:

  • No need to modify the AWS CDK constructor and build new code to change VPC configuration
  • Central point to manage VPC configuration
  • Traceability of changes and history of the configuration through Git
  • Redeploy all the infrastructure in a matter of minutes in other Regions or accounts

Operations and alerting

Roche has developed an automated alerting system that triggers if any part of the end-to-end architecture encounters an issue, with a focus on failures when loading data from AWS Glue or Amazon AppFlow. All logging is published to CloudWatch by default for debugging purposes.

The operational alerts have been built for the following workflow:

  1. AWS Glue jobs and Amazon AppFlow flows ingest data.
  2. If a job fails, it emits an event to a CloudWatch Events rule.
  3. The rule is triggered and invokes a Lambda function to send failure details to an Amazon Simple Notification Service (Amazon SNS) topic.
  4. The SNS topic has a Lambda subscriber that gets invoked (a minimal sketch of this function follows the list):
    1. The Lambda function reads out specific webhook URLs from AWS Secrets Manager.
    2. The function fires off an alert to the specific external systems.
  5. The external systems receive the message and the appropriate parties are notified of the issue with details.
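
The following is a minimal sketch of the SNS-subscribed Lambda function in step 4, assuming the webhook URL is stored as a JSON secret; the secret name and payload format are placeholders.

import json
import boto3
import urllib3

secrets = boto3.client("secretsmanager")
http = urllib3.PoolManager()

def handler(event, context):
    # Read the external webhook URL from Secrets Manager (secret name is a placeholder).
    secret = secrets.get_secret_value(SecretId="alerting/webhook-url")
    webhook_url = json.loads(secret["SecretString"])["url"]

    # Forward each SNS failure message to the external alerting system.
    for record in event["Records"]:
        message = record["Sns"]["Message"]
        http.request(
            "POST",
            webhook_url,
            body=json.dumps({"text": message}),
            headers={"Content-Type": "application/json"},
        )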

The following architecture outlines the alerting mechanisms built for the lake house platform.

Conclusion

The Go-to-Market (GTM) domain has been successful in enabling its business stakeholders, data engineers, and data scientists by providing a platform that is extendable to the many use cases Roche faces. It is a key enabler and accelerator for the GTM organization in Roche. Through a modern data platform, Roche is now able to better understand customers and ultimately create and deliver valuable services that meet their needs. Its reach extends beyond health care professionals (HCPs) to a larger healthcare ecosystem. The platform and infrastructure in this blog help to support and accelerate both internal and external stakeholders in their decision-making processes through actionable insights.

The steps in this post can help you plan to build a similar modern data strategy using AWS managed services to ingest data from sources like Salesforce, automatically create metadata catalogs and share data seamlessly between the data lake and data warehouse, and create alerts in the event of an orchestrated data workflow failure. In part 2 of this post, you learn about how the data warehouse was built using an agile data modeling pattern and how ELT jobs were quickly developed, orchestrated, and configured to perform automated data quality testing.

Special thanks go to the Roche team: Joao Antunes, Krzysztof Slowinski, Krzysztof Romanowski, Bartlomiej Zalewski, Wojciech Kostka, Patryk Szczesnowicz, Igor Tkaczyk, Kamil Piotrowski, Michalina Mastalerz, Jakub Lanski, Chun Wei Chan, Andrzej Dziabowski for their project delivery and support with this post.


About The Authors

Dr. Yannick Misteli, Roche – Dr. Yannick Misteli is leading cloud platform and ML engineering teams in global product strategy (GPS) at Roche. He is passionate about infrastructure and operationalizing data-driven solutions, and he has broad experience in driving business value creation through data analytics.

Simon Dimaline, AWS – Simon Dimaline has specialised in data warehousing and data modelling for more than 20 years. He currently works for the Data & Analytics team within AWS Professional Services, accelerating customers’ adoption of AWS analytics services.

Matt Noyce, AWS – Matt Noyce is a Senior Cloud Application Architect in Professional Services at Amazon Web Services. He works with customers to architect, design, automate, and build solutions on AWS for their business needs.

Chema Artal Banon, AWS – Chema Artal Banon is a Security Consultant at AWS Professional Services, and he works with AWS customers to design, build, and optimize their security to drive business. He specializes in helping companies accelerate their journey to the AWS Cloud in the most secure manner possible by helping them build confidence and technical capability.

A special Thank You goes out to the following people whose expertise made this post possible from AWS:

  • Thiyagarajan Arumugam – Principal Analytics Specialist Solutions Architect
  • Taz Sayed – Analytics Tech Leader
  • Glenith Paletta – Enterprise Service Manager
  • Mike Murphy – Global Account Manager
  • Natacha Maheshe – Senior Product Marketing Manager
  • Derek Young – Senior Product Manager
  • Jamie Campbell – Amazon AppFlow Product Manager
  • Kamen Sharlandjiev – Senior Solutions Architect – Amazon AppFlow
  • Sunil Jethwani – Principal Customer Delivery Architect
  • Vinay Shukla – Amazon Redshift Principal Product Manager
  • Nausheen Sayed – Program Manager

AWS Lake Formation – General Availability of Cell-Level Security and Governed Tables with Automatic Compaction

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-lake-formation-general-availability-of-cell-level-security-and-governed-tables-with-automatic-compaction/

A data lake can help you break down data silos and combine different types of analytics into a centralized repository. You can store all of your structured and unstructured data in this repository. However, setting up and managing data lakes involve a lot of manual, complicated, and time-consuming tasks. AWS Lake Formation makes it easy to set up a secure data lake in days instead of weeks or months.

Today, I am excited to share the general availability of some new features that simplify even further loading data, optimizing storage, and managing access to a data lake:

  • Governed Tables – A new type of Amazon Simple Storage Service (Amazon S3) table that makes it simple and reliable to ingest and manage data at any scale. Governed tables support ACID transactions that let multiple users concurrently and reliably insert and delete data across multiple governed tables. ACID transactions also let you run queries that return consistent and up-to-date data. In case of errors in your extract, transform, and load (ETL) processes, or during an update, changes are not committed and will not be visible.
  • Storage Optimization with Automatic Compaction for governed tables – When this option is enabled, Lake Formation automatically compacts small S3 objects in your governed tables into larger objects to optimize access via analytics engines, such as Amazon Athena and Amazon Redshift Spectrum. By using automatic compaction, you don’t have to implement custom ETL jobs that read, merge, and compress data into new files, and then replace the original files.
  • Granular Access Control with Row and Cell-Level Security – You can control access to specific rows and columns in query results and within AWS Glue ETL jobs based on the identity of who is performing the action. In this way, you don’t have to create (and keep updated) subsets of your data for different roles and legislations. This works for both governed and traditional S3 tables.

Using Governed Tables, ACID Transactions, and Automatic Compaction
In the Lake Formation console, I can enable governed data access and management at table creation. Automatic compaction is enabled by default, and it can be disabled using the AWS Command Line Interface (CLI) or AWS SDKs.

Console screenshot.

Governed tables have a manifest that tracks the S3 objects that are part of the table’s data. I can use the UpdateTableObjects API to keep the manifest updated when adding new objects to the table, and I can call it using the AWS CLI and SDKs. This API is implicitly used by the AWS Glue ETL library.

Moreover, I have access to new Lake Formation APIs to start, commit, or cancel a transaction. I can use these APIs to wrap data loading and data transformation steps so that they output consistent and up-to-date data.
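
The following boto3 sketch illustrates how these transaction APIs fit together; the database, table, and S3 object details are placeholders.

import boto3

lf = boto3.client("lakeformation")

# Start a read/write transaction.
tx = lf.start_transaction(TransactionType="READ_AND_WRITE")["TransactionId"]

try:
    # Register a newly written S3 object in the governed table's manifest
    # (URI, ETag, and size are placeholders).
    lf.update_table_objects(
        DatabaseName="analytics_db",
        TableName="orders_governed",
        TransactionId=tx,
        WriteOperations=[{
            "AddObject": {
                "Uri": "s3://example-bucket/orders/part-0001.parquet",
                "ETag": "0123456789abcdef0123456789abcdef",
                "Size": 1048576,
            }
        }],
    )
    lf.commit_transaction(TransactionId=tx)
except Exception:
    # On error, changes are not committed and remain invisible to readers.
    lf.cancel_transaction(TransactionId=tx)
    raise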

Using Row and Cell-Level Security
There are many use cases where, for a table, you want to restrict access to specific columns, rows, or a combination that depends on the role of the user accessing the data. For example, a company with offices in the US, Germany, and France can create a filter for analysts based in the European Union (EU) to limit access to EU-based customers.

Console screenshot.

The filter can enforce that some columns, such as date of birth (dob) and phone, are not accessible to those analysts. Moreover, access to individual rows can be filtered by using filter expressions. You can configure row filter expressions with a SQL-compatible syntax based on the open-source PartiQL language. In this case, only rows with country equal to Germany or France (country='DE' OR country='FR') are visible.

Console screenshot.
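
The same filter can also be defined programmatically. The following boto3 sketch mirrors the example above; the account ID, database, and table names are placeholders.

import boto3

lf = boto3.client("lakeformation")

# Row filter: only EU-based customers; column filter: hide dob and phone.
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",   # placeholder account ID
        "DatabaseName": "customers_db",     # placeholder database
        "TableName": "customers",           # placeholder table
        "Name": "eu-analysts-filter",
        "RowFilter": {"FilterExpression": "country='DE' OR country='FR'"},
        "ColumnWildcard": {"ExcludedColumnNames": ["dob", "phone"]},
    }
)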

Availability and Pricing
These new features are available today in the following AWS Regions: US East (N. Virginia), US West (Oregon), Europe (Ireland), US East (Ohio), and Asia Pacific (Tokyo).

When querying governed tables, or tables secured with row and cell-level security, you pay by the amount of data scanned (with a 10MB minimum). When using governed tables, transaction metadata is charged by the number of S3 objects tracked, and you pay for the number of transaction requests. Automatic compaction is charged based on the data processed. For more information, see the AWS Lake Formation pricing page.

While implementing these features, we introduced a new Lake Formation Storage API that is integrated with tools such as AWS Glue, Amazon Athena, Amazon Redshift Spectrum, and Amazon QuickSight. You can use this storage API directly in your applications to query tables with a SQL-like syntax (joins are not supported) and get the benefits of governed tables and cell-level security.

See the detailed blog series published during the preview to learn more:

Effective data lakes using AWS Lake Formation

Take advantage of these new features to simplify the creation and management of your data lake.

Danilo

How Parametric Built Audit Surveillance using AWS Data Lake Architecture

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/how-parametric-built-audit-surveillance-using-aws-data-lake-architecture/

Parametric Portfolio Associates (Parametric), a wholly owned subsidiary of Morgan Stanley, is a registered investment adviser. Parametric provides investment advisory services to individual and institutional investors around the world. Parametric manages over 100,000 client portfolios with assets under management exceeding $400B (as of 9/30/21).

As a registered investment adviser, Parametric is subject to numerous regulatory requirements. The Parametric Compliance team conducts regular reviews on the firm’s portfolio management activities. To accomplish this, the organization needs both active and archived audit data to be readily available.

Parametric’s on-premises data lake solution was based on an MS-SQL server. They used an Apache Hadoop platform for their data storage, data management, and analytics. Significant gaps existed with the on-premises solution, which complicated audit processes. They were spending a large amount of effort on system maintenance, operational management, and software version upgrades, which required expensive consulting services and made it challenging to keep up with maintenance windows. This limited their agility and impacted their ability to derive more insights and value from their data. In an environment of rapid growth, adoption of more sophisticated analytics tools and processes was slower to evolve.

In this blog post, we will show how Parametric implemented their Audit Surveillance Data Lake on AWS with purpose-built, fully managed analytics services. With this solution, Parametric was able to respond to various audit requests within hours rather than days or weeks. The move resulted in a 5x cost savings at current data volumes, and the new system can seamlessly support 10x data growth.

Audit surveillance platform

The Parametric data management office (DMO) was previously running their data workloads using an on-premises data lake, which ran on the Hortonworks data platform of Apache Hadoop. This platform wasn’t up to date, and Parametric’s hardware was reaching end-of-life. Parametric was faced with a decision to either reinvest in their on-premises infrastructure or modernize their infrastructure using a modern data analytics platform on AWS. After doing a detailed cost/benefit analysis, the DMO calculated a 5x cost savings by using AWS. They decided to move forward and modernize with AWS due to these cost benefits, in addition to elasticity and security features.

The PPA compliance team asked the DMO to provide an enterprise data service to consume data from a data lake. This data was destined for downstream applications and ad-hoc data querying capabilities. It was accessed via standard JDBC tools and user-friendly business intelligence dashboards. The goal was to ensure that seven years of audit data would be readily available.

The DMO team worked with AWS to conceptualize an audit surveillance data platform architecture and help accelerate the implementation. They attended a series of AWS Immersion Days focusing on AWS fundamentals, Data Lakes, Devops, Amazon EMR, and serverless architectures. They later were involved in a four-day AWS Data Lab with AWS SMEs to create a data lake. The first use case in this Lab was creating the Audit Surveillance system on AWS.

Audit surveillance architecture on AWS

The following diagram shows the Audit Surveillance data lake architecture on AWS by using AWS purpose-built analytics services.

Figure 1. Audit Surveillance data lake architecture diagram

Figure 1. Audit Surveillance data lake architecture diagram

Architecture flow

  1. User personas: As first step, the DMO team identified three user personas for the Audit Surveillance system on AWS.
    • Data service compliance users who would like to consume audit surveillance data from the data lake into their respective applications through an enterprise data service.
    • Business users who would like to create business intelligence dashboards using a BI tool to audit data for compliance needs.
    • Compliance IT users who would like to perform ad-hoc queries on the data lake to perform analytics using an interactive query tool.
  2. Data ingestion: Data is ingested into Amazon Simple Storage Service (S3) from different on-premises data sources by using AWS Lake Formation blueprints. AWS Lake Formation provides workflows that define the data source and schedule to import data into the data lake. It is a container for AWS Glue crawlers, jobs, and triggers that are used to orchestrate the process to load and update the data lake.
  3. Data storage: Parametric used Amazon S3 as a data storage to build an Audit Surveillance data lake, as it has unmatched 11 nines of durability and 99.99% availability. The existing Hadoop storage was replaced with Amazon S3. The DMO team created a drop zone (raw), an analytics zone (transformed), and curated (enriched) storage layers for their data lake on AWS.
  4. Data cataloging: AWS Glue Data Catalog was the central catalog used to store and manage metadata for all datasets hosted in the Audit Surveillance data lake. The existing Hadoop metadata store was replaced with AWS Glue Data Catalog. AWS services such as AWS Glue, Amazon EMR, and Amazon Athena, natively integrate with AWS Glue Data Catalog.
  5. Data processing: Amazon EMR and AWS Glue process the raw data and place it into the analytics zone (transformed) and curated zone (enriched) S3 buckets. Amazon EMR was used for big data processing and AWS Glue for standard ETL processes. AWS Lambda and AWS Step Functions were used to initiate monitoring and ETL processes.
  6. Data consumption: After Audit Surveillance data was transformed and enriched, the data was consumed by various personas within the firm as follows:
    • AWS Lambda and Amazon API Gateway were used to support consumption for data service compliance users.
    • Amazon QuickSight was used to create business intelligence dashboards for compliance business users.
    • Amazon Athena was used to query transformed and enriched data for compliance IT users.
  7. Security: AWS Key Management Service (KMS) customer managed keys were used for encryption at rest, and TLS for encryption in transit. Access to the encryption keys is controlled using AWS Identity and Access Management (IAM) and is monitored through detailed audit trails in AWS CloudTrail. Amazon CloudWatch was used for monitoring, and thresholds were created to determine when to send alerts.
  8. Governance: AWS IAM roles were attached to compliance users that permitted the administrator to grant access. This was only given to approved users or programs that went through authentication and authorization through AWS SSO. Access is logged and permissions can be granted or denied by the administrator. AWS Lake Formation is used for fine-grained access controls to grant/revoke permissions at the database, table, or column-level access.

Conclusion

The Parametric DMO team successfully replaced their on-premises Audit Surveillance Data Lake. They now have a modern, flexible, highly available, and scalable data platform on AWS, with purpose-built analytics services.

This change resulted in a 5x cost savings, and provides for a 10x data growth. There are now fast responses to internal and external audit requests (hours rather than days or weeks). This migration has given the company access to a wider breadth of AWS analytics services, which offers greater flexibility and options.

Maintaining the on-premises data lake would have required significant investment in hardware upgrades as well as annual licensing and vendor consulting fees for upgrades. Parametric’s decision to migrate their on-premises data lake has yielded proven cost benefits. And it has introduced new functions, services, and capabilities that were previously unavailable to the Parametric DMO.

You may also achieve similar efficiencies and increase scalability by migrating on-premises data platforms into AWS. Read more and get started on building Data Lakes on AWS.

Integral Ad Science secures self-service data lake using AWS Lake Formation

Post Syndicated from Mat Sharpe original https://aws.amazon.com/blogs/big-data/integral-ad-science-secures-self-service-data-lake-using-aws-lake-formation/

This post is co-written with Mat Sharpe, Technical Lead, AWS & Systems Engineering from Integral Ad Science.

Integral Ad Science (IAS) is a global leader in digital media quality. The company’s mission is to be the global benchmark for trust and transparency in digital media quality for the world’s leading brands, publishers, and platforms. IAS does this through data-driven technologies with actionable real-time signals and insight.

In this post, we discuss how IAS uses AWS Lake Formation and Amazon Athena to efficiently manage governance and security of data.

The challenge

IAS processes over 100 billion web transactions per day. With strong growth and changing seasonality, IAS needed a solution to reduce cost, eliminate idle capacity during low utilization periods, and maximize data processing speeds during peaks to ensure timely insights for customers.

In 2020, IAS deployed a data lake in AWS, storing data in Amazon Simple Storage Service (Amazon S3), cataloging its metadata in the AWS Glue Data Catalog, ingesting and processing using Amazon EMR, and using Athena to query and analyze the data. IAS wanted to create a unified data platform to meet its business requirements. Additionally, IAS wanted to enable self-service analytics for customers and users across multiple business units, while maintaining critical controls over data privacy and compliance with regulations such as GDPR and CCPA. To accomplish this, IAS needed to securely ingest and organize real-time and batch datasets, as well as secure and govern sensitive customer data.

To meet the dynamic nature of IAS’s data and use cases, the team needed a solution that could define access controls by attribute, such as classification of data and job function. IAS processes significant volumes of data and this continues to grow. To support the volume of data, IAS needed the governance solution to scale in order to create and secure many new daily datasets. This meant IAS could enable self-service access to data from different tools, such as development notebooks, the AWS Management Console, and business intelligence and query tools.

To address these needs, IAS evaluated several approaches, including a manual ticket-based onboarding process to define permissions on new datasets, many different AWS Identity and Access Management (IAM) policies, and an AWS Lambda based approach to automate defining Lake Formation table and column permissions triggered by changes in security requirements and the arrival of new datasets.

Although these approaches worked, they were complex and didn’t support the self-service experience that IAS data analysts required.

Solution overview

IAS selected Lake Formation, Athena, and Okta to solve this challenge. The following architectural diagram shows how the company chose to secure its data lake.

The solution needed to support data producers and consumers in multiple AWS accounts. For brevity, this diagram shows a central data lake producer that includes a set of S3 buckets for raw and processed data. Amazon EMR is used to ingest and process the data, and all metadata is cataloged in the data catalog. The data lake consumer account uses Lake Formation to define fine-grained permissions on datasets shared by the producer account; users logging in through Okta can run queries using Athena and be authorized by Lake Formation.

Lake Formation enables column-level control, and all Amazon S3 access is provisioned via a Lake Formation data access role in the query account, ensuring only that service can access the data. Each business unit with access to the data lake is provisioned with an IAM role that only allows limited access to:

  • That business unit’s Athena workgroup
  • That workgroup’s query output bucket
  • The lakeformation:GetDataAccess API

Because Lake Formation manages all the data access and permissions, the configuration of the user’s role policy in IAM becomes very straightforward. By defining an Athena workgroup per business unit, IAS also takes advantage of assigning per-department billing tags and query limits to help with cost management.

Define a tag strategy

IAS commonly deals with two types of data: data generated by the company and data from third parties. The latter usually includes contractual stipulations on privacy and use.

Some data sets require even tighter controls, and defining a tag strategy is one key way that IAS ensures compliance with data privacy standards. With the tag-based access controls in Lake Formation, IAS can define a set of tags within an ontology that is assigned to tables and columns. This ensures users understand available data and whether or not they have access. It also helps IAS manage privacy permissions across numerous tables with new ones added every day.

At a simplistic level, we can define policy tags for class with private and non-private, and for owner with internal and partner.
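
As a minimal sketch, this simple ontology could be defined through the Lake Formation API as follows (tag keys and values taken from the example above):

import boto3

lf = boto3.client("lakeformation")

# Define the two policy tags and their allowed values.
lf.create_lf_tag(TagKey="class", TagValues=["private", "non-private"])
lf.create_lf_tag(TagKey="owner", TagValues=["internal", "partner"])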

As we progressed, our tagging ontology evolved to include individual data owners and data sources within our product portfolio.

Apply tags to data assets

After IAS defined the tag ontology, the team applied tags at the database, table, and column level to manage permissions. Tags are inherited, so they only need to be applied at the highest level. For example, IAS applied the owner and class tags at the database level and relied on inheritance to propagate the tags to all the underlying tables and columns. The following diagram shows how IAS applied a tagging strategy to distinguish between internal and partner datasets, while classifying sensitive information within these datasets.

Only a small number of columns contain sensitive information; IAS relied on inheritance to apply a non-private tag to the majority of the database objects and then overrode it with a private tag on a per-column basis.

The following screenshot shows the tags applied to a database on the Lake Formation console.
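
The following boto3 sketch illustrates this pattern of tagging at the database level and overriding a single sensitive column; the database, table, and column names are placeholders, not IAS's actual resources.

import boto3

lf = boto3.client("lakeformation")

# Tag the whole database; tables and columns inherit these values.
lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "partner_dataset_db"}},   # placeholder database
    LFTags=[
        {"TagKey": "owner", "TagValues": ["partner"]},
        {"TagKey": "class", "TagValues": ["non-private"]},
    ],
)

# Override the class tag on a sensitive column (table and column names are placeholders).
lf.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "partner_dataset_db",
            "Name": "events",
            "ColumnNames": ["ip_address"],
        }
    },
    LFTags=[{"TagKey": "class", "TagValues": ["private"]}],
)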

With its global scale, IAS needed a way to automate how tags are applied to datasets. The team experimented with various options including string matching on column names, but the results were unpredictable in situations where unexpected column names are used (ipaddress vs. ip_address, for example). Ultimately, IAS incorporated metadata tagging into its existing infrastructure as code (IaC) process, which gets applied as part of infrastructure updates.

Define fine-grained permissions

The final piece of the puzzle was to define permission rules to associate with tagged resources. The initial data lake deployment involved creating permission rules for every database and table, with column exclusions as necessary. Although these were generated programmatically, it added significant complexity when the team needed to troubleshoot access issues. With Lake Formation tag-based access controls, IAS reduced hundreds of permission rules down to precisely two rules, as shown in the following screenshot.

When using multiple tags, the expressions are logically ANDed together. The preceding statements permit access only to data tagged class=non-private and owner=internal.
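
As a sketch, an equivalent grant through the API would look like the following; the IAM role ARN is a placeholder.

import boto3

lf = boto3.client("lakeformation")

# Permit SELECT/DESCRIBE only on tables tagged class=non-private AND owner=internal.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst-role"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "class", "TagValues": ["non-private"]},
                {"TagKey": "owner", "TagValues": ["internal"]},
            ],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)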

Tags allowed IAS to simplify permission rules, making it easy to understand, troubleshoot, and audit access. The ability to easily audit which datasets include sensitive information and who within the organization has access to them made it easy to comply with data privacy regulations.

Benefits

This solution provides self-service analytics to IAS data engineers, analysts, and data scientists. Internal users can query the data lake with their choice of tools, such as Athena, while maintaining strong governance and auditing. The new approach using Lake Formation tag-based access controls reduces the integration code and manual controls required. The solution provides the following additional benefits:

  • Meets security requirements by providing column-level controls for data
  • Significantly reduces permission complexity
  • Reduces time to audit data security and troubleshoot permissions
  • Deploys data classification using existing IaC processes
  • Reduces the time it takes to onboard data users including engineers, analysts, and scientists

Conclusion

When IAS started this journey, the company was looking for a fully managed solution that would enable self-service analytics while meeting stringent data access policies. Lake Formation provided IAS with the capabilities needed to deliver on this promise for its employees. With tag-based access controls, IAS optimized the solution by reducing the number of permission rules from hundreds down to a few, making it even easier to manage and audit. IAS continues to analyze data using more tools governed by Lake Formation.


About the Authors

Mat Sharpe is the Technical Lead, AWS & Systems Engineering at IAS where he is responsible for the company’s AWS infrastructure and guiding the technical teams in their cloud journey. He is based in New York.

Brian Maguire is a Solution Architect at Amazon Web Services, where he is focused on helping customers build their ideas in the cloud. He is a technologist, writer, teacher, and student who loves learning. Brian is the co-author of the book Scalable Data Streaming with Amazon Kinesis.

Danny Gagne is a Solutions Architect at Amazon Web Services. He has extensive experience in the design and implementation of large-scale high-performance analysis systems, and is the co-author of the book Scalable Data Streaming with Amazon Kinesis. He lives in New York City.

Implement anti-money laundering solutions on AWS

Post Syndicated from Yomi Abatan original https://aws.amazon.com/blogs/big-data/implement-anti-money-laundering-solutions-on-aws/

The detection and prevention of financial crime continues to be an important priority for banks. Over the past 10 years, the level of activity in financial crimes compliance in financial services has expanded significantly, with regulators around the globe taking scores of enforcement actions and levying $36 billion in fines. Apart from the fines, the overall cost of compliance for global financial services companies is estimated to have reached $181 billion in 2020. For most banks, know your customer (KYC) and anti-money laundering (AML) constitute the largest area of concern within broader financial crime compliance. In light of this, there is an urgent need for effective AML systems that are scalable and fit for purpose in order to manage the risk of money laundering as well as the risk of non-compliance by the banks. Addressing money laundering at a high level covers the following areas:

  • Client screening and identity
  • Transaction monitoring
  • Extended customer risk profile
  • Reporting of suspicious transactions

In this post, we focus on transaction monitoring by looking at the general challenges with implementing transaction monitoring (TM) solutions and how AWS services can be leveraged to build a solution in the cloud from the perspectives of data analytics, risk management, and ad hoc analysis. The following diagram is a conceptual architecture for a transaction monitoring solution on the AWS Cloud.

Current challenges

Due to growing digital channels for facilitating financial transactions, increasing access to financial services for more people, and the growth in global payments, capturing and processing data related to TM is now considered a big data challenge. The big data challenges and observations include:

  • The volume of data continues to prove to be too expansive for effective processing in a traditional on-premises data center solution.
  • The velocity of banking transactions continues to rise despite the economic challenges of COVID-19.
  • The variety of the data that needs to be processed for TM platforms continues to increase as more data sources with unstructured data become available. These data sources require techniques such as optical character recognition (OCR) and natural language processing (NLP) to automate the process of getting value out of such data without excessive manual effort.
  • Finally, due to the layered nature of complex transactions involved in TM solutions, having data aggregated from multiple financial institutions provides a more comprehensive insight into the flow of financial transactions. Such an aggregation is usually less viable in a traditional on-premises solution.

Data Analytics

The first challenge with implementing TM solutions is having the tools and services to ingest data into a central store (often called a data lake) that is secure and scalable. Not only does this data lake need to capture terabytes or even petabytes of data, but it also needs to facilitate the process of moving data in and out of purpose-built data stores for time series, graph, data marts, and machine learning (ML) processing. In AWS, we refer to a data architecture that covers data lakes, purpose-built data stores, and the data movement across those stores as a lake house architecture.

The following diagram illustrates a TM architecture on the AWS Cloud. This is a more detailed sample architecture of the lake house approach.

Ingestion of data into the lake house typically comes from a client’s data center (if the client is not already on the cloud), from different client AWS accounts that host transaction systems, or from external sources. For clients with transaction systems still on premises, we notice that although several AWS services can be used to transfer data from on premises to the AWS Cloud, a number of clients with batch requirements utilize AWS Transfer Family, which provides fully managed support for secure file transfers directly into and out of Amazon Simple Storage Service (Amazon S3) or Amazon Elastic File System (Amazon EFS). For real-time requirements, we see the use of Amazon Managed Streaming for Apache Kafka (Amazon MSK), a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data. One other way to bring in reference data or external data like politically exposed persons (PEP) lists, watch lists, or stop lists for the AML process is via AWS Data Exchange, which makes it easy to find, subscribe to, and use third-party data in the cloud.

In this architecture, the ingestion process always stores the raw data in Amazon S3, which offers industry-leading scalability, data availability, security, and performance. For those clients already on the AWS Cloud, it’s very likely your data is already stored in Amazon S3.

For TM, the ingestion of the data comes from KYC systems, customer account stores, and transaction repositories. Data from KYC systems needs to have the entity information, which can relate to a company or individual. For corporate entities, information on the ultimate beneficial owners (UBOs), the natural persons who directly or indirectly own or control a certain percentage of the company, is also required. Before we discuss the data pipeline (the flow of data from the landing zone to the curated data layer) in detail, it’s important to address some of the security and audit requirements of the sensitive data classes typically used in AML processing.

According to Gartner, “Data governance is the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption, and control of data and analytics.” From an AML perspective, the specification and accountability framework mentioned in this definition require several enabling components.

The first is a data catalog, which is sometimes grouped into technical, process, and business catalogs. On the AWS platform, this catalog is provided either directly through AWS Glue or indirectly through AWS Lake Formation. Although the catalog implemented by AWS Glue is fundamentally a technical catalog, you can still extend it to add process and business relevant attributes.

The second enabling component is data lineage. This service should be flexible enough to support the different types of data lineage, namely vertical, horizontal, and physical. From an AML perspective, vertical lineage can provide a trace from AML regulation, which requires the collection of certain data classes, all the way to the data models captured in the technical catalog. Horizontal and physical lineage provide a trace of the data from source to eventual suspicious activity reporting for suspected transactions. Horizontal lineage provides lineage at the metadata level, whereas physical lineage captures the trace at the physical level.

The third enabling component of data governance is data security. This covers several aspects of dealing with requirements for encryption of data at rest and in transit, but also de-identification of data during processing. This area requires a range of de-identification techniques depending on the context of use. Some of the techniques include tokenization, encryption, generalization, masking, perturbation, redaction, and even substitution of personally identifiable information (PII) or sensitive data, usually at the attribute level. It’s important to use the right de-identification technique to enforce the right level of privacy while ensuring the data retains sufficient inference signals for use in ML. You can use Amazon Macie, a fully managed data security and data privacy service that uses ML and pattern matching to discover and protect sensitive data, to automate PII discovery prior to applying the right de-identification technique.
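
As an illustration of attribute-level de-identification, the following sketch pseudonymizes PII columns with a keyed hash (a simple form of tokenization) so records remain joinable without exposing raw identifiers. The column names and the use of an environment variable for the key are assumptions for the example; in practice the key would come from AWS KMS or a secrets store.

    import hashlib
    import hmac
    import os

    # Hypothetical secret; in production this would be retrieved from AWS KMS
    # or AWS Secrets Manager rather than an environment variable.
    TOKEN_KEY = os.environ.get("TOKENIZATION_KEY", "change-me").encode("utf-8")

    # Hypothetical PII attributes in a transaction/KYC record.
    PII_FIELDS = ["customer_name", "national_id", "phone_number"]


    def tokenize(value: str) -> str:
        """Deterministically pseudonymize a value with an HMAC-SHA256 keyed hash."""
        return hmac.new(TOKEN_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


    def de_identify(record: dict) -> dict:
        """Return a copy of the record with PII attributes replaced by tokens."""
        out = dict(record)
        for field in PII_FIELDS:
            if field in out and out[field] is not None:
                out[field] = tokenize(str(out[field]))
        return out


    # Example usage
    print(de_identify({"customer_name": "Jane Doe", "amount": 99.5}))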

Moving data from the landing zone (raw data) all the way to curated data involves several steps of processing, including data quality validation, compression, transformation, enrichment, de-duplication, entity resolution, and entity aggregation. Such processing is usually referred to as extract, transform, and load (ETL). In this architecture, we have a choice of using a serverless approach based on AWS Glue (using the Scala or Python programming languages) or Amazon EMR (a cloud big data platform for processing large datasets using open-source tools such as Apache Spark and Hadoop). Amazon EMR provides the flexibility to run these ETL workloads on Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Kubernetes Service (Amazon EKS) clusters, and AWS Outposts.
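
The following is a minimal PySpark sketch of an AWS Glue job that reads raw transactions from the Data Catalog, de-duplicates them, and writes curated Parquet back to Amazon S3. The database, table, column, and bucket names are hypothetical, and a production job would add the other processing steps listed above.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read raw events from a hypothetical Data Catalog database/table.
    raw = glue_context.create_dynamic_frame.from_catalog(
        database="aml_raw", table_name="transactions"
    )

    # Basic cleansing: drop exact duplicates and null transaction IDs.
    df = raw.toDF().dropDuplicates(["transaction_id"]).filter("transaction_id IS NOT NULL")

    # Write curated, partitioned Parquet back to the data lake (hypothetical bucket).
    (
        df.write.mode("overwrite")
        .partitionBy("transaction_date")
        .parquet("s3://example-aml-curated/transactions/")
    )

    job.commit()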

Risk management framework

The risk management framework part of the architecture contains the rules, thresholds, algorithms, models, and control policies that govern the process of detecting and reporting suspicious transactions. Traditionally, most TM solutions have relied solely on rule-based controls to implement AML requirements. However, these rule-based implementations quickly become complex and difficult to maintain as criminals find new and sophisticated ways to circumvent existing AML controls. Apart from the complexity and maintenance, rule-based approaches usually result in a large number of false positives: transactions that are flagged as suspicious but turn out not to be. Some of the numbers here are quite remarkable, with some studies revealing that less than 2% of flagged cases actually turn out to be suspicious. The implication is the operational cost of the teams required to investigate these false positives. Another implication that sometimes gets overlooked is the customer experience, in which a service such as payment or clearing of transactions is delayed or declined due to false positives, leading to a less than satisfactory customer experience. Despite the number of false positives, AML failings and subsequent fines are hardly out of the news; in one case, the Financial Conduct Authority (FCA) in the United Kingdom took the unprecedented step of bringing criminal proceedings against a bank over failed AML processes.

In light of some of the shortcomings of a rule-based AML approach, a lot of research by financial services customers, including RegTechs, has focused on applying ML to detect suspicious transactions. One comprehensive study on the use of ML techniques in suspicious transaction detection is a paper published by Z. Chen et al. The paper was published in 2018 (which in ML terms is a lifetime ago), but the concepts and findings are still relevant. It highlights some of the common algorithms and challenges with using ML for AML. AML data is a high-dimensional space that usually requires dimensionality reduction through the use of algorithms like Principal Component Analysis (PCA) or autoencoders (neural networks used to learn efficient data encodings in an unsupervised manner). As part of feature engineering, most algorithms require the value of transactions (debits and credits) aggregated by time intervals: daily, weekly, and monthly. Clustering algorithms like k-means or its variants are used to create clusters for customer or transaction profiles. There is also the need to deal with the class imbalance usually found in AML datasets.
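
The sketch below illustrates the kind of feature engineering and clustering described above: aggregating debits and credits per customer and time window, reducing dimensionality with PCA, and clustering profiles with k-means. It uses pandas and scikit-learn on a synthetic DataFrame; the column names and parameters are assumptions, not a prescribed schema.

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Synthetic transactions: one row per transaction (hypothetical schema).
    rng = np.random.default_rng(42)
    txns = pd.DataFrame({
        "account_id": rng.integers(0, 500, 20_000),
        "amount": rng.lognormal(mean=4, sigma=1, size=20_000),
        "is_credit": rng.integers(0, 2, 20_000),
        "timestamp": pd.to_datetime("2021-01-01")
                     + pd.to_timedelta(rng.integers(0, 90 * 24 * 3600, 20_000), unit="s"),
    })

    # Aggregate debit/credit value per account per week (a common AML feature set).
    txns["week"] = txns["timestamp"].dt.to_period("W")
    features = txns.pivot_table(index="account_id", columns=["is_credit", "week"],
                                values="amount", aggfunc="sum", fill_value=0.0)

    # Reduce the high-dimensional weekly profile, then cluster customer profiles.
    scaled = StandardScaler().fit_transform(features)
    components = PCA(n_components=5, random_state=0).fit_transform(scaled)
    clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(components)

    profile = pd.DataFrame({"account_id": features.index, "cluster": clusters})
    print(profile["cluster"].value_counts())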

All of the algorithms referenced in the Z. Chen et al paper are supported by Amazon SageMaker. SageMaker is a fully managed ML service that allows data scientists and developers to easily build, train, and deploy ML models for AML. You can also implement other categories of algorithms that support AML, such as behavioral modelling, risk scoring, and anomaly detection, with SageMaker. You can use a wide range of algorithms to address AML challenges, including supervised, semi-supervised, and unsupervised models. Additional factors that determine the suitability of algorithms include high recall and precision of the models, and the ability to use approaches such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) to explain the model output. Amazon SageMaker Clarify can detect bias and increase the transparency of ML models.
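
To make the explainability point concrete, the following sketch trains a small gradient-boosted classifier on synthetic features and explains its predictions with the open-source shap library (a TreeExplainer). The feature names, labels, and data are synthetic assumptions; SageMaker Clarify offers a managed alternative to running SHAP yourself.

    import numpy as np
    import pandas as pd
    import shap  # pip install shap
    from sklearn.ensemble import GradientBoostingClassifier

    # Synthetic AML-style features and labels (hypothetical names and values).
    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        "weekly_credit_value": rng.lognormal(6, 1, 5_000),
        "weekly_debit_value": rng.lognormal(6, 1, 5_000),
        "txn_count_7d": rng.integers(0, 200, 5_000),
        "new_counterparties_30d": rng.integers(0, 20, 5_000),
    })
    # Toy label: flag unusually high credit turnover combined with many new counterparties.
    y = ((X["weekly_credit_value"] > X["weekly_credit_value"].quantile(0.95))
         & (X["new_counterparties_30d"] > 10)).astype(int)

    model = GradientBoostingClassifier(random_state=0).fit(X, y)

    # SHAP values attribute each prediction to the contributing features.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X.iloc[:100])
    print(pd.DataFrame(shap_values, columns=X.columns).abs().mean().sort_values(ascending=False))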

Algorithms that focus on risk scoring enable a risk profile that spans various data classes: core customer attributes such as industry, geography, bank product, business size, and complex ownership structures for entities, as well as transactions (debits and credits) and the frequency of those transactions. In addition, external data such as PEP lists, various stop lists and watch lists, and in some cases media coverage related to suspected fraud or corruption can also be weighted into a customer’s risk profile.

Rule-based and ML approaches aren’t mutually exclusive, but it’s likely that rules will play an increasingly peripheral role as better algorithms are researched and implemented. One of the reasons why the development of algorithms for AML has been sluggish is the limited availability of reliable datasets that include outcome data indicating when a correct suspicious activity report (SAR) was filed for a given scenario. Unlike other areas of ML in which findings have been openly shared for further research, with AML, a lot of the progress first appears in commercial products belonging to vendors who are protective of their intellectual property.

Ad hoc analysis and reporting

The final part of the architecture includes support for case or event management tooling and a reporting service for the eventual SAR. These services can be AWS Marketplace solutions or developed from scratch using AWS services such as Amazon EKS or Amazon ECS. This part of the architecture also provides support for a very important aspect of AML: network analytics. Network or link analysis has three main components:

  • Clustering – The construction of graphs and the representation of money flow. Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets (see the traversal sketch after this list).
  • Statistical analysis – Used to assist with finding metrics such as centrality (including eigenvector centrality), normality, and clustering.
  • Data visualization – An interactive and extensible data visualization platform to support exploratory data analysis. Findings from the network analytics can also feed into customer risk profiles and supervised ML algorithms.
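
As an example of the clustering and link-analysis component, the following sketch uses the open-source gremlinpython client against an Amazon Neptune endpoint to find accounts within two hops of a flagged account. The endpoint, vertex label (account), edge label (transfers_to), and property names are hypothetical.

    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.process.graph_traversal import __

    # Hypothetical Neptune cluster endpoint.
    NEPTUNE_ENDPOINT = "wss://my-neptune-cluster.cluster-xyz.eu-west-1.neptune.amazonaws.com:8182/gremlin"

    connection = DriverRemoteConnection(NEPTUNE_ENDPOINT, "g")
    g = traversal().withRemote(connection)

    # Accounts reachable within two hops of a flagged account via money transfers.
    linked_accounts = (
        g.V().has("account", "accountId", "ACC-12345")
        .repeat(__.both("transfers_to").simplePath())
        .times(2)
        .dedup()
        .values("accountId")
        .toList()
    )
    print(linked_accounts)

    connection.close()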

Conclusion

None of the services or architecture layers described in this post are tightly coupled; different layers and services can be swapped with AWS Marketplace solutions or other FinTech or RegTech solutions that support cloud-based deployment. This means the AWS Cloud has a powerful ecosystem of native services and third-party solutions that can be deployed on the foundation of a Lake House Architecture on AWS to build a modern TM solution in the cloud. To find out more information about key parts of the architecture described in this post, refer to the following resources:

For seeding data into a data lake (including taking advantage of ACID compliance):

For using Amazon EMR for data pipeline processing and some of the recent updates to Amazon EMR:

For taking advantage of SageMaker to support financial crime use cases:

Please contact AWS if you need help developing a full-scale AML solution (covering client screening and identity, transaction monitoring, extended customer risk profile and reporting of suspicious transactions) on AWS.


About the Author

Yomi Abatan is a Sr. Solution Architect based in London, United Kingdom. He works with financial services organisations, architecting, designing, and implementing various large-scale IT solutions. He currently helps established financial services AWS customers embark on digital transformations using the AWS Cloud as an accelerator. Before joining AWS, he worked in various architecture roles with several tier-one investment banks.

How MOIA built a fully automated GDPR compliant data lake using AWS Lake Formation, AWS Glue, and AWS CodePipeline

Post Syndicated from Leonardo Pêpe original https://aws.amazon.com/blogs/big-data/how-moia-built-a-fully-automated-gdpr-compliant-data-lake-using-aws-lake-formation-aws-glue-and-aws-codepipeline/

This is a guest blog post co-written by Leonardo Pêpe, a Data Engineer at MOIA.

MOIA is an independent company of the Volkswagen Group with locations in Berlin and Hamburg, and operates its own ride pooling services in Hamburg and Hanover. The company was founded in 2016 and develops mobility services independently or in partnership with cities and existing transport systems. MOIA’s focus is on ride pooling and the holistic development of the software and hardware for it. In October 2017, MOIA started a pilot project in Hanover to test a ride pooling service, which was brought into public operation in July 2018. MOIA covers the entire value chain in the area of ride pooling, and has developed a ridesharing system to avoid individual car traffic and use the road infrastructure more efficiently.

In this post, we discuss how MOIA uses AWS Lake Formation, AWS Glue, and AWS CodePipeline to store and process gigabytes of data on a daily basis to serve 20 different teams with individual user data access and implement fine-grained control of the data lake to comply with General Data Protection Regulation (GDPR) guidelines. This involves controlling access to data at a granular level. The solution enables MOIA’s fast pace of innovation to automatically adapt user permissions to new tables and datasets as they become available.

Background

Each MOIA vehicle can carry six passengers. Customers interact with the MOIA app to book a trip, cancel a trip, and give feedback. The highly distributed system prepares multiple offers to reach the customer’s destination, with different pickup points and prices. Customers select an option and are picked up from their chosen location. All interactions between the customers and the app, as well as all the interactions between internal components and systems (the backend’s and vehicle’s IoT components), are sent to MOIA’s data lake.

Data from the vehicle, app, and backend must be centralized to keep planned and completed trips in sync and to collect passenger feedback. Providing different pricing and routing options also requires centralized data. MOIA therefore decided to build and secure its Amazon Simple Storage Service (Amazon S3)-based data lake using AWS Lake Formation.

Different MOIA teams, including data analysts, data scientists, and data engineers, need to access centralized data from different sources to develop and operate the application workloads. Controlling the access to, and format of, the data exposed to these teams is a legal requirement. The app development team needs to understand customer feedback in an anonymized way, pricing-related data must be accessed only by the business analytics team, vehicle data is meant to be used only by the vehicle maintenance team, and the routing team needs access to customer location and destination.

Architecture overview

The following diagram illustrates MOIA’s solution architecture.

The solution has the following components:

  1. Data input – The apps, backend, and IoT devices continuously stream event messages as nested JSON via Amazon Kinesis Data Streams.
  2. Data transformation – When the raw data is received, the data pipeline orchestrator (Apache Airflow) starts the data transformation jobs to meet requirements for respective data formats. Amazon EMR, AWS Lambda, or AWS Fargate cleans, transforms, and loads the events in the S3 data lake.
  3. Persistence layer – After the data is transformed and normalized, AWS Glue crawlers classify the data to determine the format, schema, and associated properties. They group data into tables or partitions and write metadata to the AWS Glue Data Catalog based on the structured partitions created in Amazon S3. The data is written on Amazon S3 partitioned by its event type and timestamp so it can be easily accessed via Amazon Athena and the correct permissions can be applied using Lake Formation. The crawlers are run at a fixed interval in accordance with the batch jobs, so no manual runs are required.
  4. Governance layer – Access on the data lake is managed using the Lake Formation governance layer. MOIA uses Lake Formation to centrally define security, governance, and auditing policies in one place. These policies are consistently implemented, which eliminates the need to manually configure them across security services like AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), storage services like Amazon S3, or analytics and machine learning (ML) services like Amazon Redshift, Athena, Amazon EMR for Apache Spark. Lake Formation reduces the effort in configuring policies across services and provides consistent assistance for GDPR requirements and compliance. When a user makes a request to access Data Catalog resources or underlying data from the data lake, for the request to succeed, it must pass permission checks by both IAM and Lake Formation. Lake Formation permissions control access to Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. User permissions to the data on the table, row, or column level are defined in Lake Formation, so that users only access the data if they’re authorized.
  5. Data access layer – After the flattened and structured data is available to be accessed by the data analysts, scientists, and engineers they can use Amazon Redshift or Athena to perform SQL analysis, or AWS Data Wrangler and Apache Spark to perform programmatic analysis and build ML use cases.
  6. Infrastructure orchestration and data pipeline – After the data is ingested and transformed, the infrastructure orchestration layer (CodePipeline and Lake Formation) creates or updates the governance layer with the predefined permissions every hour (see more about the automation in the governance layer later in this post, and the orchestration sketch after this list).
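
As an illustration of how this orchestration could look, the following is a minimal Apache Airflow sketch that triggers the AWS Glue crawler after the transformation step and then starts the CodePipeline run that refreshes Lake Formation permissions. The crawler and pipeline names are hypothetical; MOIA’s production DAGs are more involved.

    from datetime import datetime

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    GLUE_CRAWLER_NAME = "datalake-events-crawler"        # hypothetical crawler name
    PERMISSIONS_PIPELINE = "lakeformation-permissions"   # hypothetical pipeline name


    def start_crawler():
        """Kick off the Glue crawler that registers new partitions/tables."""
        boto3.client("glue").start_crawler(Name=GLUE_CRAWLER_NAME)


    def start_permissions_pipeline():
        """Trigger the CodePipeline that regenerates Lake Formation permissions."""
        boto3.client("codepipeline").start_pipeline_execution(name=PERMISSIONS_PIPELINE)


    with DAG(
        dag_id="datalake_governance",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        crawl = PythonOperator(task_id="run_glue_crawler", python_callable=start_crawler)
        grant = PythonOperator(
            task_id="refresh_lake_formation_permissions",
            python_callable=start_permissions_pipeline,
        )
        crawl >> grant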

Governance challenge

MOIA wants to continuously evolve their ML models for routing, demand prediction, and business models. This requires MOIA to constantly review and update models; therefore, power users such as data administrators and engineers frequently redesign the table schemas in the AWS Glue Data Catalog as part of the data engineering workflow. This highly dynamic metadata transformation requires an equally dynamic governance layer pipeline that can assign the right user permissions to all tables and adapt to these changes transparently without disruptions to end users. Due to GDPR requirements, there is no room for error, and manually assigning GDPR-compliant permissions on the tables of a data lake with many terabytes of data would require many developers and introduce human error into the workflows, which is not acceptable. Manual administration and access management isn’t a scalable solution. MOIA needed to innovate faster, stay GDPR-compliant, and do so with a small team of developers.

Data schema and data structure often changes at MOIA, resulting in new tables being created in the data lake. To guarantee that new tables inherit the permissions granted to the same group of users who already have access to the entire database, MOIA uses an automated process that grants Lake Formation permission to newly created tables, columns, and databases, as described in the next section.

Solution overview

The following diagram illustrates the continuous deployment loop using AWS CloudFormation.

The workflow contains the following steps:

  1. MOIA data scientists and engineers create or modify new tables to evolve the business and ML models. This results in new tables or changes in the existing schema of the table.
  2. The data pipeline is orchestrated via Apache Airflow, which controls the whole cycle of ingestion, cleansing, and data transformation using Spark jobs (as shown in Step 1). When the data is in its desired format and state, Apache Airflow triggers the AWS Glue crawler job to discover the data schema in the S3 bucket and add the new partitions to the Data Catalog. Apache Airflow also triggers CodePipeline to assign the right permissions on all the tables.
  3. Apache Airflow triggers CodePipeline every hour, which uses MOIA’s libraries to discover the new tables, columns, and databases, and grants Lake Formation permissions in the AWS CloudFormation template for all tables, including new and modified. This orchestration layer ensures Lake Formation permissions are granted. Without this step, nobody can access new tables, and new data isn’t visible to data scientists, business analysts, and others who need it.
  4. MOIA uses stacker and troposphere to identify all the tables, assign the tables the right permissions, and deploy the CloudFormation stack. Troposphere is the infrastructure as code (IaC) framework that renders the CloudFormation templates, and stacker helps with the parameterization and deployment of the templates on AWS. Stacker also helps produce reusable templates in the form of blueprints as pure Python code, which can be easily parameterized using YAML files. The Python code uses Boto3 clients to look up the Data Catalog and search for databases and tables (see the sketch after this list).
  5. The stacker blueprint libraries developed by MOIA (which internally use Boto3 and troposphere) are used to discover newly created databases or tables in the Data Catalog.
  6. A permissions configuration file allows MOIA to predefine data lake tables (which can be a wildcard indicating all tables available in one database) and the specific personas who have access to these tables. These roles are split into different categories called lake personas, and have specific access levels. Example lake personas include data domain engineers, data domain technical users, data administrators, and data analysts. Users and their permissions are defined in the file. If access to personally identifiable information (PII) data is granted, the GDPR officer ensures that a record of processing activities is in place and the usage of personal data is documented according to the law.
  7. MOIA uses stacker to read the predefined permissions file, and uses the customized stacker blueprints library containing the logic to assign permissions to lake personas for each of the tables discovered in Step 5.
  8. The discovered and modified tables are matched with predefined permissions in Step 6 and Step 7. Stacker uses YAML format as the input for the CloudFormation template parameters to define users and data access permissions. These parameters are related to the user’s roles and define access to the tables. Based on this, MOIA creates a customized stacker that creates AWS CloudFormation resources dynamically. The following are the example stacks in AWS CloudFormation:
    1. data-lake-domain-database – Holds all resources that are relevant during the creation of the database, which includes lake permissions at the database level that allow the domain database admin personas to perform operations in the database.
    2. data-lake-domain-crawler – Scans Amazon S3 constantly to discover and create new tables and updates.
    3. data-lake-lake-table-permissions – Uses Boto3 combined with troposphere to list the available tables and grant lake permissions to the domain roles that access it.
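
A heavily simplified sketch of the discovery-and-grant idea follows. It uses Boto3 to list the tables in a Data Catalog database and grants Lake Formation SELECT/DESCRIBE permission to a lake persona role; the database name and role ARN are hypothetical. MOIA’s actual implementation renders these grants as CloudFormation resources via troposphere and stacker blueprints rather than calling the API directly.

    import boto3

    glue = boto3.client("glue")
    lakeformation = boto3.client("lakeformation")

    DATABASE = "events_curated"  # hypothetical Data Catalog database
    ANALYST_ROLE_ARN = "arn:aws:iam::123456789012:role/lake-persona-data-analyst"  # hypothetical


    def discover_tables(database: str):
        """List all table names currently registered in the Data Catalog database."""
        paginator = glue.get_paginator("get_tables")
        for page in paginator.paginate(DatabaseName=database):
            for table in page["TableList"]:
                yield table["Name"]


    def grant_select(database: str, table: str, principal_arn: str):
        """Grant Lake Formation SELECT/DESCRIBE on one table to one lake persona."""
        lakeformation.grant_permissions(
            Principal={"DataLakePrincipalIdentifier": principal_arn},
            Resource={"Table": {"DatabaseName": database, "Name": table}},
            Permissions=["SELECT", "DESCRIBE"],
        )


    for table_name in discover_tables(DATABASE):
        grant_select(DATABASE, table_name, ANALYST_ROLE_ARN)

In MOIA’s pipeline, the equivalent grants are emitted as Lake Formation permission resources in a generated CloudFormation template, which is what makes each change reviewable and auditable before deployment.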

This way, new CloudFormation templates are created, containing permissions for new or modified tables. This process guarantees a fully automated governance layer for the data lake. The generated CloudFormation template contains Lake Formation permission resources for each table, database, or column. The process of managing Lake Formation permissions on Data Catalog databases, tables, and columns is simplified by granting Data Catalog permissions using the Lake Formation tag-based access control method. The advantage of generating a CloudFormation template is auditability. When the new version of the CloudFormation stack is prepared with an access control set on new or modified tables, administrators can compare that stack with the older version to discover newly prepared and modified tables. MOIA can view the differences via the AWS CloudFormation console before new stack deployment.

  9. Either the CloudFormation stack is deployed in the account with the right permissions for all users, or a stack deployment failure notice is sent to the developers via a Slack channel (Step 10). This ensures visibility into the deployment process and into failed permission-assignment pipelines.
  10. Whenever the pipeline fails, MOIA receives a notification via Slack so the engineers can promptly fix the errors. This loop is very important to guarantee that whenever the schema changes, those changes are reflected in the data lake without manual intervention.
  11. Following this automated pipeline, all the tables are assigned the right permissions.

Benefits

This solution delivers the following benefits:

  • Automated enforcement of data access permissions allows people to access the data without manual interventions in a scalable way. With automation, 1,000 hours of manual work are saved every year.
  • The GDPR team does the assessment internally every month. The GDPR team constantly provides guidance and approves each change in permissions. This audit trail for records of processing activities is automated with the help of Lake Formation (otherwise it needs a dedicated human resource).
  • The automated workflow of permission assignment is integrated into existing CI/CD processes, resulting in faster onboarding of new teams, features, and dataset releases. An average of 48 releases are done each month (including major and minor version releases and new event types). Onboarding new teams and forming internal new teams is very easy now.
  • This solution enables you to create a data lake using IaC processes that are easy and GDPR-compliant.

Conclusion

MOIA has created scalable, automated, and versioned permissions for a GDPR-compliant, governed data lake using Lake Formation. This solution helps them bring new features and models to market faster, and reduces administrative and repetitive tasks. MOIA can focus on an average of 48 releases every month, contributing to a great customer experience and new data insights.


About the Authors

Leonardo Pêpe is a Data Engineer at MOIA. With a strong background in infrastructure and application support and operations, he is immersed in the DevOps philosophy. He’s helping MOIA build automated solutions for its data platform and enabling the teams to be more data-driven and agile. Outside of MOIA, Leonardo enjoys nature, Jiu-Jitsu and martial arts, and explores the good of life with his family.

 

Sushant Dhamnekar is a Solutions Architect at AWS. As a trusted advisor, Sushant helps automotive customers to build highly scalable, flexible, and resilient cloud architectures, and helps them follow the best practices around advanced cloud-based solutions. Outside of work, Sushant enjoys hiking, food, travel, and CrossFit workouts.

 

 

Shiv Narayanan is Global Business Development Manager for Data Lakes and Analytics solutions at AWS. He works with AWS customers across the globe to strategize, build, develop and deploy modern data platforms. Shiv loves music, travel, food and trying out new tech.

Address Modernization Tradeoffs with Lake House Architecture

Post Syndicated from Sukhomoy Basak original https://aws.amazon.com/blogs/architecture/address-modernization-tradeoffs-with-lake-house-architecture/

Many organizations are modernizing their applications to reduce costs and become more efficient. They must adapt to modern application requirements such as 24×7 global access. The ability to scale up or down quickly to meet demand and to process a large volume of data is critical, and is challenging to achieve while maintaining strict performance and availability requirements. For many companies, modernization includes decomposing a monolith application into a set of independently developed, deployed, and managed microservices. The decoupled nature of a microservices environment allows each service to evolve agilely and independently. While there are many benefits to moving to a microservices-based architecture, there can be some tradeoffs. As your application monolith evolves into independent microservices, you must consider the implications for your data architecture.

In this blog post, we provide example use cases and show how Lake House Architecture on AWS can streamline your microservices architecture. A Lake House Architecture embraces the decentralized nature of microservices by facilitating data movement. These transfers can be between data stores, from data stores to the data lake, and from the data lake to data stores (Figure 1).

Figure 1. Integrating data lake, data warehouse, and all purpose-built stores into a coherent whole

Health and wellness application challenges

Our fictitious health and wellness customer has an application architecture comprised of several microservices backed by purpose-built data stores. User profiles, assessments, surveys, fitness plans, health preferences, and insurance claims are maintained in an Amazon Aurora MySQL-Compatible relational database. The event service monitors the number of steps walked, sleep pattern, pulse rate, and other behavioral data in Amazon DynamoDB, a NoSQL database (Figure 2).

Figure 2. Microservices architecture for health and wellness company

With this microservices architecture, it’s common to have data spread across various data stores. This is because each microservice uses a purpose-built data store suited to its usage patterns and performance requirements. While this provides agility, it also presents challenges to deriving needed insights.

Here are four challenges that different users might face:

  1. As a health practitioner, how do I efficiently combine the data from multiple data stores to give personalized recommendations that improve patient outcomes?
  2. As a sales and marketing professional, how do I get a 360-degree view of my customer when data lives in multiple data stores? Profile and fitness data are in a relational data store, but important behavioral and clickstream data are in NoSQL data stores. It’s hard for me to run targeted marketing campaigns, which can lead to revenue loss.
  3. As a product owner, how do I optimize healthcare costs when designing wellbeing programs for patients?
  4. As a health coach, how do I find patients and help them with their wellness goals?

Our remaining subsections highlight AWS Lake House Architecture capabilities and features that allow data movement and the integration of purpose-built data stores.

1. Patient care use case

In this scenario, a health practitioner is interested in historical patient data to estimate the likelihood of a future outcome. To get the necessary insights and identify patterns, the health practitioner needs event data from Amazon DynamoDB and patient profile data from Aurora MySQL-Compatible. Our health practitioner will use Amazon Athena to run an ad-hoc analysis across these data stores.

Amazon Athena provides an interactive query service for both structured and unstructured data. The federated query functionality in Amazon Athena helps with running SQL queries across data stored in relational, NoSQL, and custom data sources. Amazon Athena uses Lambda-based data source connectors to run federated queries. Figure 3 illustrates the federated query architecture.

Figure 3. Amazon Athena federated query

The patient care team could use an Amazon Athena federated query to find out if a patient needs urgent care. It is able to detect anomalies in the combined datasets from claims processing, device data, and electronic health records (EHR), as shown in Figure 4.
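
A minimal sketch of running such a federated query with Boto3 follows. It assumes Lambda-based data source connectors have already been registered as catalogs named mysql_catalog and dynamo_catalog, and the table, column, and bucket names are hypothetical placeholders.

    import time

    import boto3

    athena = boto3.client("athena")

    # Join patient profiles (Aurora MySQL connector) with device events (DynamoDB connector).
    QUERY = """
    SELECT p.patient_id, p.age, e.pulse_rate, e.event_time
    FROM "mysql_catalog"."clinic"."patient_profile" p
    JOIN "dynamo_catalog"."default"."device_events" e
      ON p.patient_id = e.patient_id
    WHERE e.pulse_rate > 120
    """

    execution = athena.start_query_execution(
        QueryString=QUERY,
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # hypothetical bucket
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes, then fetch the first page of results.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        print(rows[:5])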

Figure 4. Federated query result by combining data from claim, device, and EHR stores

Healthcare data from various sources, including EHRs and genetic data, helps improve personalized care. Machine learning (ML) is able to harness big data and perform predictive analytics. This creates opportunities for researchers to develop personalized treatments for various diseases, including cancer and depression.

To achieve this, you must move all the related data into a centralized repository such as an Amazon S3 data lake. For specific use cases, you also must move the data between the purpose-built data stores. Finally, you must build an ML solution that can predict the outcome. Amazon Redshift ML, combined with its federated query processing capabilities, enables data analysts and database developers to create a platform to detect patterns (Figure 5). With this platform, health practitioners are able to make more accurate, data-driven decisions.
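
As a sketch of what that could look like, the following uses the Amazon Redshift Data API to submit a Redshift ML CREATE MODEL statement that trains a binary classifier from historical patient data. The cluster, database, table, role, and bucket names are hypothetical, and the SQL is illustrative rather than a tested schema.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # Illustrative Redshift ML statement: train a model that predicts an adverse outcome
    # from features assembled in a curated table (hypothetical names throughout).
    CREATE_MODEL_SQL = """
    CREATE MODEL patient_outcome_model
    FROM (SELECT age, avg_pulse_rate, steps_per_day, prior_claims, adverse_outcome
          FROM curated.patient_features)
    TARGET adverse_outcome
    FUNCTION predict_adverse_outcome
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
    SETTINGS (S3_BUCKET 'example-redshift-ml-artifacts');
    """

    response = redshift_data.execute_statement(
        ClusterIdentifier="health-analytics-cluster",  # hypothetical cluster
        Database="analytics",
        DbUser="awsuser",
        Sql=CREATE_MODEL_SQL,
    )
    print(response["Id"])  # statement ID; poll describe_statement() for completion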

Figure 5. Amazon Redshift federated query with Amazon Redshift ML

2. Sales and marketing use case

To run marketing campaigns, the sales and marketing team must search customer data from a relational database together with event data in a non-relational data store. We will move the data from Aurora MySQL-Compatible and Amazon DynamoDB to Amazon Elasticsearch Service (Amazon ES) to meet this requirement.

AWS Database Migration Service (AWS DMS) helps move the change data from Aurora MySQL-Compatible to Amazon ES using change data capture (CDC). AWS Lambda can be used to move the change data from DynamoDB Streams to Amazon ES, as shown in Figure 6.
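
A minimal sketch of that Lambda function is shown below. It indexes each new or updated item into an Amazon ES domain; the domain endpoint and index name are hypothetical, and request signing (SigV4) is omitted for brevity, so a real implementation would add it or rely on fine-grained access control.

    import json
    import os

    import urllib3

    http = urllib3.PoolManager()

    # Hypothetical Amazon ES domain endpoint and index.
    ES_ENDPOINT = os.environ.get("ES_ENDPOINT", "https://search-health-events-xyz.eu-west-1.es.amazonaws.com")
    INDEX = "patient-events"


    def handler(event, context):
        """Index new/updated DynamoDB items into Elasticsearch for search."""
        for record in event.get("Records", []):
            if record["eventName"] not in ("INSERT", "MODIFY"):
                continue
            image = record["dynamodb"]["NewImage"]
            # Flatten the DynamoDB attribute-value format ({"S": "..."}) into plain values.
            doc = {k: list(v.values())[0] for k, v in image.items()}
            doc_id = doc.get("event_id", record["dynamodb"]["SequenceNumber"])
            http.request(
                "PUT",
                f"{ES_ENDPOINT}/{INDEX}/_doc/{doc_id}",
                body=json.dumps(doc),
                headers={"Content-Type": "application/json"},
            )
        return {"indexed": len(event.get("Records", []))}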

Figure 6. Moving and combining data from Aurora MySQL-Compatible and Amazon DynamoDB to Amazon Elasticsearch Service

The sales and marketing team can now run targeted marketing campaigns by querying data from Amazon Elasticsearch Service (see Figure 7). They can improve sales operations by visualizing data with Amazon QuickSight.

Figure 7. Personalized search experience for ad-tech marketing team

3. Healthcare product owner use case

In this scenario, the product owner must define the care delivery value chain. They must develop process maps for patient activity and estimate the cost of patient care. They must analyze these datasets by building business intelligence (BI) reporting and dashboards with a tool like Amazon QuickSight. Amazon Redshift, a cloud-scale data warehousing platform, is a good choice for this. Figure 8 illustrates this pattern.

Figure 8. Moving data from Amazon Aurora and Amazon DynamoDB to Amazon Redshift

The product owners can use integrated business intelligence reports with Amazon Redshift to analyze their data. This way they can make more accurate and appropriate decisions (see Figure 9).

Figure 9. Business intelligence for patient care processes

4. Health coach use case

In this scenario, the health coach must find a patient based on certain criteria. They would then send personalized communication to connect with the patient to ensure they are following the proposed health plan. This proactive approach contributes to a positive patient outcome. It can also reduce healthcare costs incurred by insurance companies.

To be able to search patient records with multiple data points, it is important to move data from Amazon DynamoDB to Amazon ES. This will also provide a fast and personalized search experience. The health coaches can be notified and will have the information they need to provide guidance to their patients. Figure 10 illustrates this pattern.

Figure 10. Moving data from Amazon DynamoDB to Amazon ES

The health coaches can use Elasticsearch to search users based on specific criteria. This helps them with counseling and other health plans, as shown in Figure 11.

Figure 11. Simplified personalized search using patient device data

Summary

In this post, we highlight how Lake House Architecture on AWS helps with the challenges and tradeoffs of modernization. A Lake House architecture on AWS can help streamline the movement of data between the microservices data stores. This offers new capabilities for various analytics use cases.

For further reading on architectural patterns, and walkthroughs for building Lake House Architecture, see the following resources:

Easily manage your data lake at scale using AWS Lake Formation Tag-based access control

Post Syndicated from Nivas Shankar original https://aws.amazon.com/blogs/big-data/easily-manage-your-data-lake-at-scale-using-tag-based-access-control-in-aws-lake-formation/

Thousands of customers are building petabyte-scale data lakes on AWS. Many of these customers use AWS Lake Formation to easily build and share their data lakes across the organization. As the number of tables and users increase, data stewards and administrators are looking for ways to manage permissions on data lakes easily at scale. Customers are struggling with “role explosion” and need to manage hundreds or even thousands of user permissions to control data access. For example, for an account with 1,000 resources and 100 principals, the data steward would have to create and manage up to 100,000 policy statements. Furthermore, as new principals and resources get added or deleted, these policies have to be updated to keep the permissions current.

Lake Formation tag-based access control solves this problem by allowing data stewards to create LF-tags (based on their data classification and ontology) that can then be attached to resources. You can create policies on a smaller number of logical tags instead of specifying policies on named resources. LF-tags enable you to categorize and explore data based on taxonomies, which reduces policy complexity and scales permissions management. You can create and manage policies with tens of logical tags instead of thousands of resources. LF-tag-based access control decouples policy creation from resource creation, which helps data stewards manage permissions on a large number of databases, tables, and columns by removing the need to update policies every time a new resource is added to the data lake. Finally, LF-tag-based access control allows you to create policies even before the resources come into existence. All you have to do is tag the resource with the right LF-tags to ensure it is managed by existing policies.

This post focuses on managing permissions on data lakes at scale using LF-tags in Lake Formation. When it comes to managing data lake catalog tables from AWS Glue and administering permissions in Lake Formation, data stewards within the producing accounts have functional ownership based on the functions they support, and can grant access to various consumers, external organizations, and accounts. You can now define LF-tags; associate them at the database, table, or column level; and then share controlled access across analytic, machine learning (ML), and extract, transform, and load (ETL) services for consumption. LF-tags ensure that governance can be scaled easily by replacing the policy definitions of thousands of resources with a small number of logical tags.

LF-tag-based access control has three main components:

  • Tag ontology and classification – Data stewards can define an LF-tag ontology based on data classification and grant access based on LF-tags to AWS Identity and Access Management (IAM) principals and SAML principals or groups
  • Tagging resources – Data engineers can easily create, automate, implement, and track all LF-tags and permissions against AWS Glue catalogs through the Lake Formation API (see the API sketch after this list)
  • Policy evaluation – Lake Formation evaluates the effective permissions based on LF-tags at query time and allows access to data through consuming services such as Amazon Athena, Amazon Redshift Spectrum, Amazon SageMaker Data Wrangler, and Amazon EMR Studio, based on the effective permissions granted across multiple accounts or organization-level data shares
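
A minimal Boto3 sketch of those API operations is shown below: creating an LF-tag, attaching it to a table, and granting permissions on an LF-tag expression. The database, table, and role names are hypothetical placeholders that mirror the console walkthrough later in this post.

    import boto3

    lakeformation = boto3.client("lakeformation")

    # 1. Tag ontology: the data steward defines an LF-tag and its allowed values.
    lakeformation.create_lf_tag(TagKey="Confidential", TagValues=["True", "False"])

    # 2. Tagging resources: the data engineer attaches the LF-tag to a table.
    lakeformation.add_lf_tags_to_resource(
        Resource={"Table": {"DatabaseName": "tag_database", "Name": "source_data"}},
        LFTags=[{"TagKey": "Confidential", "TagValues": ["True"]}],
    )

    # 3. Policy: grant SELECT/DESCRIBE on every table matching Confidential=True
    #    to a hypothetical analyst role.
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/lf-data-analyst"},
        Resource={
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": "Confidential", "TagValues": ["True"]}],
            }
        },
        Permissions=["SELECT", "DESCRIBE"],
    )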

Solution overview

The following diagram illustrates the architecture of the solution described in this post.

In this post, we demonstrate how you can set up a Lake Formation table and create Lake Formation tag-based policies using a single account with multiple databases. We walk you through the following high-level steps:

  1. The data steward defines the tag ontology with two LF-tags: Confidential and Sensitive. Data with “Confidential = True” has tighter access controls. Data with “Sensitive = True” requires specific analysis from the analyst.
  2. The data steward assigns different permission levels to the data engineer to build tables with different LF-tags.
  3. The data engineer builds two databases: tag_database and col_tag_database. All tables in tag_database are configured with “Confidential = True”. All tables in the col_tag_database are configured with “Confidential = False”. Some columns of the table in col_tag_database are tagged with “Sensitive = True” for specific analysis needs.
  4. The data engineer grants read permission to the analyst for tables matching the expression “Confidential = True” and for tables matching “Confidential = False and Sensitive = True”.
  5. With this configuration, the data analyst can focus on performing analysis with the right data.

Provision your resources

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. The template creates three different personas to perform this exercise and copies the nyc-taxi-data dataset to an Amazon Simple Storage Service (Amazon S3) bucket in your account.

To create these resources, complete the following steps:

  1. Sign in to the AWS CloudFormation console in the us-east-1 Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. In the User Configuration section, enter passwords for the three personas: DataStewardUserPassword, DataEngineerUserPassword, and DataAnalystUserPassword.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

The stack takes up to 5 minutes and creates all the required resources, including:

  • An S3 bucket
  • The appropriate Lake Formation settings
  • The appropriate Amazon Elastic Compute Cloud (Amazon EC2) resources
  • Three user personas with user ID credentials:
    • Data steward (administrator) – The lf-data-steward user has the following access:
      • Read access to all resources in the Data Catalog
      • Can create LF-tags and associate them with the data engineer role, with grantable permissions for other principals
    • Data engineer – The lf-data-engineer user has the following access:
      • Full read, write, and update access to all resources in the Data Catalog
      • Data location permissions in the data lake
      • Can associate LF-tags with resources in the Data Catalog
      • Can attach LF-tags to resources, which provides access to principals based on any policies created by data stewards
    • Data analyst – The lf-data-analyst user has the following access:
      • Fine-grained access to resources shared by Lake Formation Tag-based access policies

Register your data location and create an LF-tag ontology

We perform this first step as the data steward user (lf-data-steward) to verify the data in Amazon S3 and the Data Catalog in Lake Formation.

  1. Sign in to the Lake Formation console as lf-data-steward with the password used while deploying the CloudFormation stack.
  2. In the navigation pane, under Permissions, choose Administrative roles and tasks.
  3. For IAM users and roles, choose the user lf-data-steward.
  4. Choose Save to add lf-data-steward as a Lake Formation admin.

    Next, we update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control.
  5. In the navigation pane, under Data catalog¸ choose Settings.
  6. Uncheck Use only IAM access control for new databases.
  7. Uncheck Use only IAM access control for new tables in new databases.
  8. Choose Save.

    Next, we need to register the data location for the data lake.
  9. In the navigation pane, under Register and ingest, choose Data lake locations.
  10. For Amazon S3 path, enter s3://lf-tagbased-demo-<<Account-ID>>.
  11. For IAM role, leave it as the default value AWSServiceRoleForLakeFormationDataAccess.
  12. Choose Register location.
    Next, we create the ontology by defining a LF-tag.
  13. Under Permissions in the navigation pane, under Administrative roles, choose LF-Tags.
  14. Choose Add LF-tags.
  15. For Key, enter Confidential.
  16. For Values, add True and False.
  17. Choose Add LF-tag.
  18. Repeat the steps to create the LF-tag Sensitive with the value True.
    You have created all the necessary LF-tags for this exercise. Next, we give specific IAM principals the ability to attach newly created LF-tags to resources.
  19. Under Permissions in the navigation pane, under Administrative roles, choose LF-tag permissions.
  20. Choose Grant.
  21. Select IAM users and roles.
  22. For IAM users and roles, search for and choose the lf-data-engineer role.
  23. In the LF-tag permission scope section, add the key Confidential with values True and False, and the key Sensitive with value True.
  24. Under Permissions, select Describe and Associate for LF-tag permissions and Grantable permissions.
  25. Choose Grant.

    Next, we grant permissions to lf-data-engineer to create databases in our catalog and on the underlying S3 bucket created by AWS CloudFormation.
  26. Under Permissions in the navigation pane, choose Administrative roles.
  27. In the Database creators section, choose Grant.
  28. For IAM users and roles, choose the lf-data-engineer role.
  29. For Catalog permissions, select Create database.
  30. Choose Grant.

    Next, we grant permissions on the S3 bucket (s3://lf-tagbased-demo-<<Account-ID>>) to the lf-data-engineer user.
  31. In the navigation pane, choose Data locations.
  32. Choose Grant.
  33. Select My account.
  34. For IAM users and roles, choose the lf-data-engineer role.
  35. For Storage locations, enter the S3 bucket created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>).
  36. Choose Grant.
    Next, we grant lf-data-engineer grantable permissions on resources associated with the LF-tag expression Confidential=True.
  37. In the navigation pane, choose Data permissions.
  38. Choose Grant.
  39. Select IAM users and roles.
  40. Choose the role lf-data-engineer.
  41. In the LF-tag or catalog resources section, select Resources matched by LF-tags.
  42. Choose Add LF-Tag.
  43. Add the key Confidential with the value True.
  44. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  45. In the Table and column permissions section, select Describe, Select, and Alter for both Table permissions and Grantable permissions.
  46. Choose Grant.
    Next, we grant lf-data-engineer grantable permissions on resources associated with the LF-tag expression Confidential=False.
  47. In the navigation pane, choose Data permissions.
  48. Choose Grant.
  49. Select IAM users and roles.
  50. Choose the role lf-data-engineer.
  51. Select Resources matched by LF-tags.
  52. Choose Add LF-tag.
  53. Add the key Confidential with the value False.
  54. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  55. In the Table and column permissions section, do not select anything.
  56. Choose Grant.
    Next, we grant lf-data-engineer grantable permissions on resources associated with the LF-tag expression Confidential=False and Sensitive=True.
  57. In the navigation pane, choose Data permissions.
  58. Choose Grant.
  59. Select IAM users and roles.
  60. Choose the role lf-data-engineer.
  61. Select Resources matched by LF-tags.
  62. Choose Add LF-tag.
  63. Add the key Confidential with the value False.
  64. Choose Add LF-tag.
  65. Add the key Sensitive with the value True.
  66. In the Database permissions section, select Describe for Database permissions and Grantable permissions.
  67. In the Table and column permissions section, select Describe, Select, and Alter for both Table permissions and Grantable permissions.
  68. Choose Grant.

Create the Lake Formation databases

Now, sign in as lf-data-engineer with the password used while deploying the CloudFormation stack. We create two databases and attach LF-tags to the databases and specific columns for testing purposes.

Create your database and table for database-level access

We first create the database tag_database, the table source_data, and attach appropriate LF-tags.

  1. On the Lake Formation console, choose Databases.
  2. Choose Create database.
  3. For Name, enter tag_database.
  4. For Location, enter the S3 location created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>/tag_database/).
  5. Deselect Use only IAM access control for new tables in this database.
  6. Choose Create database.

Next, we create a new table within tag_database.

  1. On the Databases page, select the database tag_database.
  2. Choose View tables, then choose Create table.
  3. For Name, enter source_data.
  4. For Database, choose the database tag_database.
  5. For Data is located in, select Specified path in my account.
  6. For Include path, enter the path to tag_database created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>/tag_database/).
  7. For Data format, select CSV.
  8. Under Upload schema, enter the following schema JSON:
    [
        {"Name": "vendorid", "Type": "string"},
        {"Name": "lpep_pickup_datetime", "Type": "string"},
        {"Name": "lpep_dropoff_datetime", "Type": "string"},
        {"Name": "store_and_fwd_flag", "Type": "string"},
        {"Name": "ratecodeid", "Type": "string"},
        {"Name": "pulocationid", "Type": "string"},
        {"Name": "dolocationid", "Type": "string"},
        {"Name": "passenger_count", "Type": "string"},
        {"Name": "trip_distance", "Type": "string"},
        {"Name": "fare_amount", "Type": "string"},
        {"Name": "extra", "Type": "string"},
        {"Name": "mta_tax", "Type": "string"},
        {"Name": "tip_amount", "Type": "string"},
        {"Name": "tolls_amount", "Type": "string"},
        {"Name": "ehail_fee", "Type": "string"},
        {"Name": "improvement_surcharge", "Type": "string"},
        {"Name": "total_amount", "Type": "string"},
        {"Name": "payment_type", "Type": "string"}
    ]

  9. Choose Upload.

After uploading the schema, the table schema should look like the following screenshot.

  1. Choose Submit.

Now we’re ready to attach LF-tags at the database level.

  1. On the Databases page, find and select tag_database.
  2. On the Actions menu, choose Edit LF-tags.
  3. Choose Assign new LF-tag.
  4. For Assigned keys, choose the Confidential LF-tag you created earlier.
  5. For Values, choose True.
  6. Choose Save.

This completes the LF-tag assignment to the tag_database database.

Create your database and table for column-level access

Now we repeat these steps to create the database col_tag_database and table source_data_col_lvl, and attach LF-tags at the column level.

  1. On the Databases page, choose Create database.
  2. For Name, enter col_tag_database.
  3. For Location, enter the S3 location created by the CloudFormation template (s3://lf-tagbased-demo-<<Account-ID>>/col_tag_database/).
  4. Deselect Use only IAM access control for new tables in this database.
  5. Choose Create database.
  6. On the Databases page, select your new database (col_tag_database).
  7. Choose View tables, then choose Create table.
  8. For Name, enter source_data_col_lvl.
  9. For Database, choose your new database (col_tag_database).
  10. For Data is located in, select Specified path in my account.
  11. Enter the S3 path for col_tag_database (s3://lf-tagbased-demo-<<Account-ID>>/col_tag_database/).
  12. For Data format, select CSV.
  13. Under Upload schema, enter the following schema JSON:
    [
        {"Name": "vendorid", "Type": "string"},
        {"Name": "lpep_pickup_datetime", "Type": "string"},
        {"Name": "lpep_dropoff_datetime", "Type": "string"},
        {"Name": "store_and_fwd_flag", "Type": "string"},
        {"Name": "ratecodeid", "Type": "string"},
        {"Name": "pulocationid", "Type": "string"},
        {"Name": "dolocationid", "Type": "string"},
        {"Name": "passenger_count", "Type": "string"},
        {"Name": "trip_distance", "Type": "string"},
        {"Name": "fare_amount", "Type": "string"},
        {"Name": "extra", "Type": "string"},
        {"Name": "mta_tax", "Type": "string"},
        {"Name": "tip_amount", "Type": "string"},
        {"Name": "tolls_amount", "Type": "string"},
        {"Name": "ehail_fee", "Type": "string"},
        {"Name": "improvement_surcharge", "Type": "string"},
        {"Name": "total_amount", "Type": "string"},
        {"Name": "payment_type", "Type": "string"}
    ]

  14. Choose Upload.

After uploading the schema, the table schema should look like the following screenshot.

  1. Choose Submit to complete the creation of the table. (A boto3 sketch of the equivalent CreateTable call follows this step.)
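
If you prefer to script the table creation, the console flow roughly corresponds to the AWS Glue CreateTable API. The following is a minimal boto3 sketch under stated assumptions: the same S3 location (with the <<Account-ID>> placeholder), a CSV classification, and the LazySimpleSerDe for comma-delimited files. Only a few columns are shown; the rest follow the same pattern as the schema JSON above.

import boto3

# A minimal sketch of creating source_data_col_lvl; Region, SerDe, and
# format classes are assumptions typical for CSV tables.
glue = boto3.client("glue", region_name="us-east-1")

columns = [
    {"Name": "vendorid", "Type": "string"},
    {"Name": "lpep_pickup_datetime", "Type": "string"},
    {"Name": "fare_amount", "Type": "string"},
    # ...remaining columns from the schema JSON follow the same pattern
]

glue.create_table(
    DatabaseName="col_tag_database",
    TableInput={
        "Name": "source_data_col_lvl",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": columns,
            "Location": "s3://lf-tagbased-demo-<<Account-ID>>/col_tag_database/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)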

Now you associate the Sensitive=True LF-tag with the columns vendorid and fare_amount.

  1. On the Tables page, select the table you created (source_data_col_lvl).
  2. On the Actions menu, choose Edit Schema.
  3. Select the column vendorid and choose Edit LF-tags.
  4. For Assigned keys, choose Sensitive.
  5. For Values, choose True.
  6. Choose Save.

Repeat these steps to assign the Sensitive LF-tag to the fare_amount column.

  1. Select the column fare_amount and choose Edit LF-tags.
  2. Add the Sensitive key with value True.
  3. Choose Save.
  4. Choose Save as new version to save the new schema version with the tagged columns. The following screenshot shows the column properties with the LF-tags updated.
    Next, we associate the Confidential=False LF-tag with col_tag_database. This is required for lf-data-analyst to be able to describe the database col_tag_database when logged in from Athena. (A scripted equivalent of these LF-tag assignments follows these steps.)
  5. On the Databases page, find and select col_tag_database.
  6. On the Actions menu, choose Edit LF-tags.
  7. Choose Assign new LF-tag.
  8. For Assigned keys, choose the Confidential LF-tag you created earlier.
  9. For Values, choose False.
  10. Choose Save.
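
These assignments can also be made through the same AddLFTagsToResource API. The following is a minimal boto3 sketch of both assignments, assuming a principal that is allowed to assign LF-tags and the Region used in this walkthrough.

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Sensitive=True on the vendorid and fare_amount columns
lf.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "col_tag_database",
            "Name": "source_data_col_lvl",
            "ColumnNames": ["vendorid", "fare_amount"],
        }
    },
    LFTags=[{"TagKey": "Sensitive", "TagValues": ["True"]}],
)

# Confidential=False on the col_tag_database database
lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "col_tag_database"}},
    LFTags=[{"TagKey": "Confidential", "TagValues": ["False"]}],
)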

Grant table permissions

Now we grant the data analyst permissions to consume the database tag_database and the table source_data_col_lvl in the database col_tag_database.

  1. Sign in to the Lake Formation console as lf-data-engineer.
  2. On the Permissions page, choose Data permissions.
  3. Choose Grant.
  4. Under Principals, select IAM users and roles.
  5. For IAM users and roles, choose lf-data-analyst.
  6. Select Resources matched by LF-tags.
  7. Choose Add LF-tag.
  8. For Key, choose Confidential.
  9. For Values, choose True.
  10. For Database permissions, select Describe.
  11. For Table permissions, choose Select and Describe.
  12. Choose Grant.

    This grants the lf-data-analyst user Describe permission on databases and Select and Describe permissions on tables associated with the LF-tag Confidential=True (the database tag_database and its tables). Next, we repeat the steps to grant the data analyst permissions for the LF-tag expression Confidential=False. This expression is used to describe col_tag_database and the table source_data_col_lvl when logged in as lf-data-analyst from Athena, so we grant only Describe access through it. (A boto3 sketch of these grants follows the console steps.)
  13. Sign in to the Lake Formation console as lf-data-engineer.
  14. On the Databases page, select the database col_tag_database.
  15. Choose Action and Grant.
  16. Under Principals, select IAM users and roles.
  17. For IAM users and roles, choose lf-data-analyst.
  18. Select Resources matched by LF-tags.
  19. Choose Add LF-tag.
  20. For Key, choose Confidential.
  21. For Values, choose False.
  22. For Database permissions, select Describe.
  23. For Table permissions, do not select anything.
  24. Choose Grant.

    Next, we repeat the steps to grant permissions to the data analyst for the LF-tag expression Confidential=False and Sensitive=True. This expression is used to access the Sensitive=True columns of the table source_data_col_lvl in col_tag_database (column-level access) when logged in as lf-data-analyst from Athena.
  25. Sign in to the Lake Formation console as lf-data-engineer.
  26. On the Databases page, select the database col_tag_database.
  27. Choose Action and Grant.
  28. Under Principals, select IAM users and roles.
  29. For IAM users and roles, choose lf-data-analyst.
  30. Select Resources matched by LF-tags.
  31. Choose Add LF-tag.
  32. For Key, choose Confidential.
  33. For Values, choose False.
  34. Choose Add LF-tag.
  35. For Key, choose Sensitive.
  36. For Values, choose True.
  37. For Database permissions, select Describe.
  38. For Table permissions, select Select and Describe.
  39. Choose Grant.
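
These grants correspond to the GrantPermissions API with an LFTagPolicy resource. The following is a minimal boto3 sketch of the first grant, under the assumption that the analyst user lives in the same account (shown with the same <<Account-ID>> placeholder as above); a console grant with both database and table permissions maps to two calls, one per resource type.

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")
analyst = {
    "DataLakePrincipalIdentifier": "arn:aws:iam::<<Account-ID>>:user/lf-data-analyst"
}

# Describe on databases tagged Confidential=True
lf.grant_permissions(
    Principal=analyst,
    Resource={
        "LFTagPolicy": {
            "ResourceType": "DATABASE",
            "Expression": [{"TagKey": "Confidential", "TagValues": ["True"]}],
        }
    },
    Permissions=["DESCRIBE"],
)

# Select and Describe on tables tagged Confidential=True
lf.grant_permissions(
    Principal=analyst,
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "Confidential", "TagValues": ["True"]}],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)

# The Confidential=False and Confidential=False/Sensitive=True grants follow
# the same pattern with different Expression lists and Permissions.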

Run a query in Athena to verify the permissions

For this step, we sign in to the Athena console as lf-data-analyst and run SELECT queries against the two tables (source_data and source_data_col_lvl). We use our S3 path as the query result location (s3://lf-tagbased-demo-<<Account-ID>>/athena-results/).

  1. In the Athena query editor, choose tag_database in the left panel.
  2. Choose the additional menu options icon (three vertical dots) next to source_data and choose Preview table.
  3. Choose Run query.

The query should take a few minutes to run. The following screenshot shows our query results.

The first query displays all the columns in the output because the LF-tag is associated at the database level and the source_data table automatically inherited the LF-tag from the database tag_database.

  1. Run another query using col_tag_database and source_data_col_lvl.

The second query returns just the two columns that were tagged Sensitive=True (vendorid and fare_amount), which match the Confidential=False and Sensitive=True LF-tag expression.

As a thought experiment, you can also check how the Lake Formation tag-based access policy behaves for columns on which the user doesn’t have policy grants.

When an untagged column is selected from the table source_data_col_lvl, Athena returns an error. For example, you can run the following query to select the untagged column geolocationid:

SELECT geolocationid FROM "col_tag_database"."source_data_col_lvl" limit 10;

Extend the solution to cross-account scenarios

You can extend this solution to share catalog resources across accounts. The following diagram illustrates a cross-account architecture.

We describe this in more detail in a subsequent post.

Clean up

To help prevent unwanted charges to your AWS account, delete the AWS resources that you used for this walkthrough. (A boto3 sketch for emptying and deleting the S3 buckets follows these steps.)

  1. Sign in as lf-data-engineer and delete the databases tag_database and col_tag_database.
  2. Sign in as lf-data-steward and clean up all the LF-tag permissions, data permissions, and data location permissions that were granted above to lf-data-engineer and lf-data-analyst.
  3. Sign in to the Amazon S3 console as the account owner (the IAM credentials you used to deploy the CloudFormation stack).
  4. Delete the following buckets:
    1. lf-tagbased-demo-accesslogs-<acct-id>
    2. lf-tagbased-demo-<acct-id>
  5. On the AWS CloudFormation console, delete the stack you created.
  6. Wait for the stack status to change to DELETE_COMPLETE.
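
If you prefer to empty and delete the buckets programmatically, the following is a minimal boto3 sketch, assuming account-owner credentials and the bucket names from this walkthrough (with the <acct-id> placeholder filled in for your account).

import boto3

s3 = boto3.resource("s3")

for name in ["lf-tagbased-demo-accesslogs-<acct-id>", "lf-tagbased-demo-<acct-id>"]:
    bucket = s3.Bucket(name)
    bucket.objects.all().delete()          # a bucket must be empty before deletion
    bucket.object_versions.all().delete()  # only needed if versioning is enabled
    bucket.delete()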

Conclusion

In this post, we explained how to create a Lake Formation tag-based access control policy using an AWS public dataset. In addition, we showed how to query tables, databases, and columns that have LF-tag-based access policies associated with them.

You can generalize these steps to share resources across accounts. You can also use these steps to grant permissions to SAML identities. In subsequent posts, we highlight these use cases in more detail.


About the Authors

Sanjay Srivastava is a principal product manager for AWS Lake Formation. He is passionate about building products, in particular products that help customers get more out of their data. During his spare time, he loves to spend time with his family and engage in outdoor activities including hiking, running, and gardening.

Nivas Shankar is a Principal Data Architect at Amazon Web Services. He helps and works closely with enterprise customers building data lakes and analytical applications on the AWS platform. He holds a master’s degree in physics and is highly passionate about theoretical physics concepts.

Pavan Emani is a Data Lake Architect at AWS, specialized in big data and analytics solutions. He helps customers modernize their data platforms on the cloud. Outside of work, he likes reading about space and watching sports.

Effective data lakes using AWS Lake Formation, Part 5: Securing data lakes with row-level access control

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/effective-data-lakes-using-aws-lake-formation-part-5-secure-data-lakes-with-row-level-access-control/

Increasingly, customers are looking at data lakes as a core part of their strategy to democratize data access across the organization. Data lakes enable you to handle petabytes and exabytes of data coming from a multitude of sources in varying formats, and give users the ability to access it from their choice of analytics and machine learning tools. Fine-grained access controls are needed to ensure data is protected and access is granted to only those who require it.

AWS Lake Formation is a fully managed service that helps you build, secure, and manage data lakes, and provide access control for data in the data lake. Lake Formation row-level permissions allow you to restrict access to specific rows based on data compliance and governance policies. Lake Formation also provides centralized auditing and compliance reporting by identifying which principals accessed what data, when, and through which services.


This post demonstrates how row-level access controls work in Lake Formation, and how to set them up.

If you have large fact tables storing billions of records, you need a way to enable different users and teams to access only the data they’re allowed to see. Row-level access control is a simple and performant way to protect data, while giving users access to the data they need to perform their job. In the retail industry for instance, you may want individual departments to only see their own transactions, but allow regional managers access to transactions from every department.

Traditionally you can achieve row-level access control in a data lake through two common approaches:

  • Duplicate the data, redact sensitive information, and grant coarse-grained permissions on the redacted dataset
  • Load data into a database or a data warehouse, create a view with a WHERE clause to select only specific records, and grant permission on the resulting view

These solutions work well when dealing with a small number of tables, principals, and permissions. However, they make it difficult to audit and maintain because access controls are spread across multiple systems and methods. To make it easier to manage and enforce fine-grained access controls in a data lake, we announced a preview of Lake Formation row-level access controls. With this preview feature, you can create row-level filters and attach them to tables to restrict access to data for AWS Identity and Access Management (IAM) and SAMLv2 federated identities.

How data filters work for row-level security

Granting permissions on a table with row-level security (row filtering) restricts access to only specific rows in the table. The filtering is based on the values of one or more columns. For example, a salesperson analyzing sales opportunities should only be allowed to see those opportunities in their assigned territory and not others. We can define row-level filters to restrict access where the value of the territory column matches the assigned territory of the user.

With row-level security, we introduced the concept of data filters. Data filters make it simpler to manage and assign a large number of fine-grained permissions. You can specify the row filter expression using the WHERE clause syntax described in the PartiQL dialect.

Example use case

In this post, a fictional ecommerce company sells many different products, like books, videos, and toys. Customers can leave reviews and star ratings for each product, so other customers can make informed decisions about what they should buy. We use the Amazon Customer Reviews Dataset, which includes different products and customer reviews.

To illustrate the different roles and responsibilities of a data owner and a data consumer, we assume two personas: a data lake administrator and a data analyst. The administrator is responsible for setting up the data lake, creating data filters, and granting permissions to data analysts. Data analysts residing in different countries (for our use case, the US and Japan) can only analyze product reviews for customers located in their own country and for compliance reasons, shouldn’t be able to see data for customers located in other countries. We have two data analysts: one responsible for the US marketplace and another for the Japanese marketplace. Each analyst uses Amazon Athena to analyze customer reviews for their specific marketplace only.

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • An AWS Lambda function (for Lambda-backed AWS CloudFormation custom resources). We use the function to copy sample data files from the public S3 bucket to your Amazon Simple Storage Service (Amazon S3) bucket.
  • An S3 bucket to serve as our data lake.
  • IAM users and policies:
    • DataLakeAdmin
    • DataAnalystUS
    • DataAnalystJP
  • An AWS Glue Data Catalog database, table, and partition.
  • Lake Formation data lake settings and permissions.

When following the steps in this section, use either the us-east-1 or us-west-2 Region (where the preview functionality is currently available).

Before launching the CloudFormation template, make sure you have disabled Use only IAM access control for new databases and tables by completing the following steps (a scripted equivalent follows the list):

  1. Sign in to the Lake Formation console in the us-east-1 or us-west-2 Region.
  2. Under Data catalog, choose Settings.
  3. Deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  4. Choose Save.
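
If you prefer to script this change, the check boxes correspond to the data lake settings’ default database and table permissions for IAM_ALLOWED_PRINCIPALS. The following is a minimal boto3 sketch, assuming data lake administrator credentials in your chosen preview Region; it reads the current settings, clears the defaults, and writes them back.

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Clearing the default IAM_ALLOWED_PRINCIPALS grants is the API-level
# equivalent of deselecting the two "Use only IAM access control" check boxes.
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings["CreateDatabaseDefaultPermissions"] = []
settings["CreateTableDefaultPermissions"] = []
lf.put_data_lake_settings(DataLakeSettings=settings)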

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the CloudFormation console in the same Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For DatalakeAdminUserName and DatalakeAdminUserPassword, enter the user name and password you want for the data lake admin IAM user.
  5. For DataAnalystUsUserName and DataAnalystUsUserPassword, enter the user name and password you want for the data analyst user who is responsible for the US marketplace.
  6. For DataAnalystJpUserName and DataAnalystJpUserPassword, enter the user name and password you want for the data analyst user who is responsible for the Japanese marketplace.
  7. For DataLakeBucketName, enter the name of your data lake bucket.
  8. For DatabaseName and TableName, leave as the default.
  9. Choose Next.
  10. On the next page, choose Next.
  11. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  12. Choose Create.

Stack creation can take about 1 minute.

Query without data filters

After you set up the environment, you can query the product reviews table. Let’s first query the table without row-level access controls to make sure we can see the data. If you’re running queries in Athena for the first time, you need to configure the query result location.

Sign in to the Athena console using the DatalakeAdmin user, and run the following query:

SELECT * 
FROM lakeformation_tutorial_row_security.amazon_reviews
LIMIT 10

The following screenshot shows the query result. This table has only one partition, product_category=Video, so each record is a review comment for a video product.

Let’s run an aggregation query to retrieve the total number of records per marketplace:

SELECT marketplace, count(*) as total_count
FROM lakeformation_tutorial_row_security.amazon_reviews
GROUP BY marketplace

The following screenshot shows the query result. The marketplace column has five different values. In the subsequent steps, we set up row-based filters using the marketplace column.

Set up data filters

Let’s start by creating two different data filters, one for the analyst responsible for the US marketplace, and another for the analyst responsible for the Japanese marketplace. Then we grant the users their respective permissions.

Create a filter for the US marketplace data

Let’s first set up a filter for the US marketplace data.

  1. As the DatalakeAdmin user, open the Lake Formation console.
  2. Choose Data filters.
  3. Choose Create new filter.
  4. For Data filter name, enter amazon_reviews_US.
  5. For Target database, choose the database lakeformation_tutorial_row_security.
  6. For Target table, choose the table amazon_reviews.
  7. For Column-level access, leave as the default.
  8. For Row filter expression, enter marketplace='US'.
  9. Choose Create filter.

Create a filter for the Japanese marketplace data

Let’s create another data filter to restrict access to the Japanese marketplace data. (A boto3 sketch for creating both filters follows these steps.)

  1. On the Data filters page, choose Create new filter.
  2. For Data filter name, enter amazon_reviews_JP.
  3. For Target database, choose the database lakeformation_tutorial_row_security.
  4. For Target table, choose the table amazon_reviews.
  5. For Column-level access, leave as the default.
  6. For Row filter expression, enter marketplace='JP'.
  7. Choose Create filter.
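
Both filters can also be created programmatically with the CreateDataCellsFilter API. The following is a minimal boto3 sketch, assuming data lake administrator credentials and one of the preview Regions; the empty ColumnWildcard keeps the default of all columns.

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")
account_id = boto3.client("sts").get_caller_identity()["Account"]

for name, expression in [
    ("amazon_reviews_US", "marketplace='US'"),
    ("amazon_reviews_JP", "marketplace='JP'"),
]:
    lf.create_data_cells_filter(
        TableData={
            "TableCatalogId": account_id,
            "DatabaseName": "lakeformation_tutorial_row_security",
            "TableName": "amazon_reviews",
            "Name": name,
            "RowFilter": {"FilterExpression": expression},
            "ColumnWildcard": {},  # all columns (default column-level access)
        }
    )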

Grant permissions to the US data analyst

Now we have two data filters. Next, we need to grant permissions using these data filters to our analysts. We start by granting permissions to the DataAnalystUS user.

  1. On the Data permissions page, choose Grant.
  2. For Principals, choose IAM users and roles, and choose the user DataAnalystUS.
  3. For Policy tags or catalog resources, choose Named data catalog resources.
  4. For Database, choose the database lakeformation_tutorial_row_security.
  5. For Table, choose the table amazon_reviews.
  6. For Table permissions, select Select.
  7. For Data permissions, select Advanced cell-level filters.
  8. Select the filter amazon_reviews_US.
  9. Choose Grant.

The following screenshot shows the available data filters you can attach to a table when configuring permissions.

Grant permissions to the Japanese data analyst

Next, complete the following steps to configure permissions for the user DataAnalystJP (a scripted equivalent of both grants follows these steps):

  1. On the Data permissions page, choose Grant.
  2. For Principals, choose IAM users and roles, and choose the user DataAnalystJP.
  3. For Policy tags or catalog resources, choose Named data catalog resources.
  4. For Database, choose the database lakeformation_tutorial_row_security.
  5. For Table, choose the table amazon_reviews.
  6. For Table permissions, select Select.
  7. For Data permissions, select Advanced cell-level filters.
  8. Select the filter amazon_reviews_JP.
  9. Choose Grant.
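
These grants correspond to the GrantPermissions API with a DataCellsFilter resource. The following is a minimal boto3 sketch of both grants, assuming the analyst users were created in the same account and the Region used in this walkthrough.

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")
account_id = boto3.client("sts").get_caller_identity()["Account"]

for user, filter_name in [
    ("DataAnalystUS", "amazon_reviews_US"),
    ("DataAnalystJP", "amazon_reviews_JP"),
]:
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:user/{user}"
        },
        Resource={
            "DataCellsFilter": {
                "TableCatalogId": account_id,
                "DatabaseName": "lakeformation_tutorial_row_security",
                "TableName": "amazon_reviews",
                "Name": filter_name,
            }
        },
        Permissions=["SELECT"],
    )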

Query with data filters

With the data filters attached to the product reviews table, we’re ready to run some queries and see how permissions are enforced by Lake Formation. Because row-level security is in preview as of this writing, we need to create a special Athena workgroup named AmazonAthenaLakeFormationPreview, and switch to using it. For more information, see Managing Workgroups.

Sign in to the Athena console using the DataAnalystUS user and switch to the AmazonAthenaLakeFormationPreview workgroup. Run the following query to retrieve a few records, which are filtered based on the row-level permissions we defined:

SELECT * 
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
LIMIT 10

Note the prefix of lakeformation. before the database name; this is required for the preview only.
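
You can also run the same query programmatically through the Athena API, as long as you use the preview workgroup. The following is a minimal boto3 sketch, assuming the DataAnalystUS user’s credentials; the query result location (<your-query-results-bucket>) is a hypothetical placeholder for a bucket you own.

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT *
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
LIMIT 10
"""

execution = athena.start_query_execution(
    QueryString=query,
    WorkGroup="AmazonAthenaLakeFormationPreview",
    ResultConfiguration={"OutputLocation": "s3://<your-query-results-bucket>/"},
)

# Poll until the query finishes, then fetch the row-filtered results
query_id = execution["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(len(rows) - 1, "rows returned")  # the first row is the header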

The following screenshot shows the query result.

Similarly, run a query to count the total number of records per marketplace:

SELECT marketplace, count(*) as total_count
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
GROUP BY marketplace 

The following screenshot shows the query result. Only the marketplace US shows in the results. This is because our user is only allowed to see rows where the marketplace column value is equal to US.

Switch to the DataAnalystJP user and run the same query:

SELECT * 
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
LIMIT 10

The following screenshot shows the query result. All of the records belong to the JP marketplace.

Run the query to count the total number of records per marketplace:

SELECT marketplace, count(*) as total_count
FROM lakeformation.lakeformation_tutorial_row_security.amazon_reviews
GROUP BY marketplace

The following screenshot shows the query result. Again, only the row belonging to the JP marketplace is returned.

Clean up

Now to the final step, cleaning up the resources.

  1. Delete the CloudFormation stack.
  2. Delete the Athena workgroup AmazonAthenaLakeFormationPreview.

Conclusion

In this post, we covered how row-level security in Lake Formation enables you to control data access without needing to duplicate it or manage complicated alternatives such as views. We demonstrated how Lake Formation data filters can make creating, managing, and enforcing row-level permissions simple and easy.

When you want to grant permissions on specific cells, you can include or exclude columns in the data filter in addition to the row filter expression. You can learn more about cell filters in Part 4: Implementing cell-level and row-level security.

You can get started with Lake Formation today by visiting the AWS Lake Formation product page. If you want to try out row-level security, as well as the other exciting new features like ACID transactions and acceleration currently available for preview in the US East (N. Virginia) and the US West (Oregon) Regions, sign up for the preview.


About the Authors

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues and helping guide customer architectures.

Sanjay Srivastava is a Principal Product Manager for AWS Lake Formation. He is passionate about building products, in particular products that help customers get more out of their data. During his spare time, he loves to spend time with his family and engage in outdoor activities including hiking, running, and gardening.