All posts by Praveen Kumar

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

Post Syndicated from Praveen Kumar original https://aws.amazon.com/blogs/big-data/implement-tag-based-access-control-for-your-data-lake-and-amazon-redshift-data-sharing-with-aws-lake-formation/

Data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. Many organizations have distributed tools and infrastructure across various business units, which leads to data being spread across many data warehouse and data lake instances in separate AWS accounts as part of a modern data architecture.

Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Amazon Redshift data warehouse with another Redshift data warehouse within the same AWS account, across accounts, and across Regions, without needing to copy or move data from one cluster to another. Customers want to be able to manage their permissions in a central place across all of their assets. Previously, the management of Redshift datashares was limited to only within Amazon Redshift, which made it difficult to manage your data lake permissions and Amazon Redshift permissions in a single place. For example, you had to navigate to an individual account to view and manage access information for Amazon Redshift and the data lake on Amazon Simple Storage Service (Amazon S3). As an organization grows, administrators want a mechanism to effectively and centrally manage data sharing across data lakes and data warehouses for governance and auditing, and to enforce fine-grained access control.

We recently announced the integration of Amazon Redshift data sharing with AWS Lake Formation. With this feature, Amazon Redshift customers can now manage sharing, apply access policies centrally, and effectively scale permissions using LF-Tags.

Lake Formation has been a popular choice for centrally governing data lakes backed by Amazon S3. Lake Formation support for Amazon Redshift data sharing now opens up new design patterns and broadens the governance and security posture across data warehouses. With this integration, you can use Lake Formation to define fine-grained access control on tables and views being shared with Amazon Redshift data sharing for federated AWS Identity and Access Management (IAM) users and IAM roles. Lake Formation also provides tag-based access control (TBAC), which can be used to simplify and scale governance of data catalog objects such as databases and tables.

In this post, we discuss this new feature and how to implement TBAC for your data lake and Amazon Redshift data sharing on Lake Formation.

Solution overview

Lake Formation tag-based access control (LF-TBAC) allows you to group similar AWS Glue Data Catalog resources together and define the grant or revoke permissions policy by using an LF-Tag expression. LF-Tags are hierarchical: when a database is tagged with an LF-Tag, all tables in that database inherit the tag, and when an LF-Tag is applied to a table, all the columns within that table inherit the tag. Inherited tags can be overridden if needed. You can then create access policies within Lake Formation that grant principals access to tagged resources using an LF-Tag expression. See Managing LF-Tags for metadata access control for more details.
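
For illustration, the two LF-Tags used later in this post (department and classification) could be created with the AWS CLI as follows. This is only a sketch; the CloudFormation stack used later in this post includes Lake Formation configuration, which may already create these tags for you:

    # Create the LF-Tags and their allowed values (skip if they already exist in your account)
    aws lakeformation create-lf-tag --tag-key department --tag-values customer marketing
    aws lakeformation create-lf-tag --tag-key classification --tag-values private pii-sensitive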

To demonstrate LF-TBAC with central data access governance capability, we use the scenario where two separate business units own particular datasets and need to share data across teams.

We have a customer care team that manages and owns the customer information database, including customer demographics data, and a marketing team that owns a customer leads dataset, which includes information on prospective customers and contact leads.

To be able to run effective campaigns, the marketing team needs access to the customer data. In this post, we demonstrate the process of sharing this data that is stored in the data warehouse and giving the marketing team access. Furthermore, there are personally identifiable information (PII) columns within the customer dataset that should only be accessed by a subset of power users on a need-to-know basis. This way, data analysts within marketing can only see non-PII columns to be able to run anonymous customer segment analysis, but a group of power users can access PII columns (for example, customer email address) to be able to run campaigns or surveys for specific groups of customers.

The following diagram shows the structure of the datasets that we work with in this post and a tagging strategy to provide fine-grained column-level access.

Beyond our tagging strategy on the data resources, the following table gives an overview of how we should grant permissions to our two personas via tags.

IAM Role | Persona | Resource Type | Permission | LF-Tag Expression
marketing-analyst | A data analyst in the marketing team | Database | DESCRIBE | (department:marketing OR department:customer) AND classification:private
marketing-analyst | A data analyst in the marketing team | Table | SELECT | (department:marketing OR department:customer) AND classification:private
marketing-poweruser | A privileged user in the marketing team | Database | DESCRIBE | (department:marketing OR department:customer) AND classification:private
marketing-poweruser | A privileged user in the marketing team | Table (columns) | SELECT | (department:marketing OR department:customer) AND (classification:private OR classification:pii-sensitive)

The following diagram gives a high-level overview of the setup that we deploy in this post.

The following is a high-level overview of how to use Lake Formation to control datashare permissions:

Producer Setup:

  1. In the producer's AWS account, the Amazon Redshift administrator that owns the customer database creates a Redshift datashare on the producer cluster and grants usage to the AWS Glue Data Catalog in the same account.
  2. The producer cluster administrator authorizes the Lake Formation account to access the datashare.
  3. In Lake Formation, the Lake Formation administrator discovers and registers the datashares. They must discover the AWS Glue ARNs they have access to and associate the datashares with an AWS Glue Data Catalog ARN. If you’re using the AWS Command Line Interface (AWS CLI), you can discover and accept datashares with the Redshift CLI operations describe-data-shares and associate-data-share-consumer. To register a datashare, use the Lake Formation CLI operation register-resource.
  4. The Lake Formation administrator creates a federated database in the AWS Glue Data Catalog; assigns tags to the databases, tables, and columns; and configures Lake Formation permissions to control user access to objects within the datashare. For more information about federated databases in AWS Glue, see Managing permissions for data in an Amazon Redshift datashare.

Consumer Setup:

  1. On the consumer side (marketing), the Amazon Redshift administrator discovers the AWS Glue database ARNs they have access to, creates an external database in the Redshift consumer cluster using an AWS Glue database ARN, and grants usage to database users authenticated with IAM credentials to start querying the Redshift database.
  2. Database users can use the views SVV_EXTERNAL_TABLES and SVV_EXTERNAL_COLUMNS to find all the tables or columns within the AWS Glue database that they have access to; then they can query the AWS Glue database’s tables.
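
For example, after the external database is created, a database user could run queries like the following against these system views; this is a sketch, and the columns shown are only a subset of what the views expose:

    -- List the shared tables visible to the current user
    SELECT schemaname, tablename FROM svv_external_tables;

    -- List the shared columns and their types
    SELECT schemaname, tablename, columnname, external_type FROM svv_external_columns;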

When the producer cluster administrator decides to no longer share the data with the consumer cluster, the producer cluster administrator can revoke usage, deauthorize, or delete the datashare from Amazon Redshift. The associated permissions and objects in Lake Formation are not automatically deleted.
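
For example, deauthorizing the Data Catalog consumer could look like the following (a sketch; substitute your own datashare ARN and account ID):

    aws redshift deauthorize-data-share \
        --data-share-arn 'arn:aws:redshift:<aws-region>:<aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds' \
        --consumer-identifier DataCatalog/<aws-account-id>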

Prerequisites:

To follow the steps in this post, you must satisfy the following prerequisites:

Deploy environment including producer and consumer Redshift clusters

To follow along with the steps outlined in this post, deploy the following AWS CloudFormation stack, which includes the necessary resources to demonstrate the subject of this post:

  1. Choose Launch stack to deploy a CloudFormation template.
  2. Provide an IAM role that you have already configured as a Lake Formation administrator.
  3. Complete the steps to deploy the template and leave all settings as default.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.

This CloudFormation stack creates the following resources:

  • Producer Redshift cluster – Owned by the customer care team and has customer and demographic data on it.
  • Consumer Redshift cluster – Owned by the marketing team and is used to analyze data across data warehouses and data lakes.
  • S3 data lake – Contains the web activity and leads datasets.
  • Other necessary resources to demonstrate the process of sharing data – For example, IAM roles, Lake Formation configuration, and more. For a full list of resources created by the stack, examine the CloudFormation template.

After you deploy this CloudFormation template, the resources created will incur costs in your AWS account. At the end of the process, make sure that you clean up the resources to avoid unnecessary charges.

After the CloudFormation stack is deployed successfully (status shows as CREATE_COMPLETE), take note of the following items on the Outputs tab:

  • Marketing analyst role ARN
  • Marketing power user role ARN
  • URL for Amazon Redshift admin password stored in AWS Secrets Manager

Create a Redshift datashare and add relevant tables

On the AWS Management Console, switch to the role that you nominated as Lake Formation admin when deploying the CloudFormation template, then go to Query Editor v2. If this is the first time you're using Query Editor v2 in your account, follow these steps to configure your AWS account.

The first step in the query editor is to log in to the customer Redshift cluster using the database admin credentials so that your IAM admin role becomes a DB admin on the database.

  1. Choose the options menu (three dots) next to the lfunified-customer-dwh cluster and choose Create connection.

  2. Select Database user name and password.
  3. Leave Database as dev.
  4. For User name, enter admin.
  5. For Password, complete the following steps:
    1. Go to the console URL, which is the value of the RedShiftClusterPassword CloudFormation output from the previous step. The URL opens the Secrets Manager console for this password.
    2. Scroll down to the Secret value section and choose Retrieve secret value.
    3. Take note of the password to use later when connecting to the marketing Redshift cluster.
    4. Enter this value for Password.
  6. Choose Create connection.

Create a datashare using a SQL command

Complete the following steps to create a datashare in the data producer cluster (customer care) and share it with Lake Formation:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor V2.
  2. Choose (right-click) the cluster name and choose Edit connection or Create connection.
  3. For Authentication, select Temporary credentials using your IAM identity.

Refer to Connecting to an Amazon Redshift database to learn more about the various authentication methods.

  4. For Database, enter a database name (for this post, dev).
  5. Choose Create connection to connect to the database.
  6. Run the following SQL commands to create the datashare and add the data objects to be shared:
    CREATE DATASHARE customer_ds;
    ALTER DATASHARE customer_ds ADD SCHEMA PUBLIC;
    ALTER DATASHARE customer_ds ADD TABLE customer;

  7. Run the following SQL command to share the customer datashare to the current account via the AWS Glue Data Catalog:
    GRANT USAGE ON DATASHARE customer_ds TO ACCOUNT '<aws-account-id>' VIA DATA CATALOG;

  8. Verify that the datashare was created and the objects were shared by running the following SQL command:
    DESC DATASHARE customer_ds;

Take note of the datashare producer cluster namespace and account ID, which will be used in the following steps. You can complete the following actions on the console, but for simplicity, we use AWS CLI commands.
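
If you need to look up the datashare ARN (which embeds the producer cluster namespace), you can list the datashares visible to your account; for example:

    # The DataShareArn in the output contains the producer cluster namespace
    aws redshift describe-data-shares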

  9. Go to CloudShell or your AWS CLI and run the following AWS CLI command to authorize the datashare to the Data Catalog so that Lake Formation can manage it:
    aws redshift authorize-data-share \
    --data-share-arn 'arn:aws:redshift:<aws-region>:<aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds' \
    --consumer-identifier DataCatalog/<aws-account-id>

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-2:<aws-account-id>:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/customer_ds",
    "ProducerArn": "arn:aws:redshift:us-east-2:<aws-account-id>:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [{
        "ConsumerIdentifier": "DataCatalog/<aws-account-id>XX",
        "Status": "AUTHORIZED",
        "CreatedDate": "2022-11-09T21:10:30.507000+00:00",
        "StatusChangeDate": "2022-11-09T21:10:50.932000+00:00"
    }]
}

Take note of the datashare ARN that you used in this command; you will need it in the next steps.

Accept the datashare in the Lake Formation catalog

To accept the datashare, complete the following steps:

  1. Run the following AWS CLI command to accept and associate the Amazon Redshift datashare to the AWS Glue Data Catalog:
    aws redshift associate-data-share-consumer --data-share-arn 'arn:aws:redshift:<aws-region>:<aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds' \
    --consumer-arn arn:aws:glue:<aws-region>:<aws-account-id>:catalog

The following is an example output:

{
 "DataShareArn": "arn:aws:redshift:us-east-2:<aws-account-id>:datashare:cfd5fcbd-3492-42b5-9507-dad5d87f7427/customer_ds",
 "ProducerArn": "arn:aws:redshift:us-east-2:<aws-account-id>:namespace:cfd5fcbd-3492-42b5-9507-dad5d87f7427",
 "AllowPubliclyAccessibleConsumers": false,
 "DataShareAssociations": [
 {
 "ConsumerIdentifier": "arn:aws:glue:us-east-2:<aws-account-id>:catalog",
 "Status": "ACTIVE",
 "ConsumerRegion": "us-east-2",
 "CreatedDate": "2023-05-18T12:25:11.178000+00:00",
 "StatusChangeDate": "2023-05-18T12:25:11.178000+00:00"
 }
 ]
}
  2. Register the datashare in Lake Formation:
    aws lakeformation register-resource \
     --resource-arn arn:aws:redshift:<aws-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds

  3. Create the AWS Glue database that points to the accepted Redshift datashare:
    aws glue create-database --region <aws-region> --cli-input-json '{
        "CatalogId": "<aws-account-id>",
        "DatabaseInput": {
            "Name": "customer_db_shared",
            "FederatedDatabase": {
                "Identifier": "arn:aws:redshift:<aws-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/customer_ds",
                "ConnectionName": "aws:redshift"
            }
        }
    }'

  4. To verify, go to the Lake Formation console and check that the database customer_db_shared is created.

Now the data lake administrator can view and grant access on both the database and tables to the data consumer team (marketing) personas using Lake Formation TBAC.

Assign Lake Formation tags to resources

Before we grant appropriate access to the IAM principals of the data analyst and power user within the marketing team, we have to assign LF-Tags to the tables and columns of the customer_db_shared database. We then grant these principals permissions on the appropriate LF-Tags.
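
Before assigning tags, you can confirm which LF-Tags exist in the account; for example (the department and classification tags should already be present if they were created by the stack's Lake Formation configuration):

    aws lakeformation list-lf-tags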

To assign LF-tags, follow these steps:

  1. Assign the department and classification LF-Tags to customer_db_shared (the Redshift datashare) based on the tagging strategy table in the solution overview. You can perform the following actions on the console, but for this post, we use the following AWS CLI command:
    aws lakeformation add-lf-tags-to-resource --cli-input-json '{
        "CatalogId": "<aws-account-id>",
        "Resource": {
            "Database": {
                "CatalogId": "<aws-account-id>",
                "Name": "customer_db_shared"
            }
        },
        "LFTags": [
            {
                "CatalogId": "<aws-account-id>",
                "TagKey": "department",
                "TagValues": ["customer"]
            },
            {
                "CatalogId": "<aws-account-id>",
                "TagKey": "classification",
                "TagValues": ["private"]
            }
        ]
    }'

If the command is successful, you should get a response like the following:

{
"Failures": []
}
  2. Assign the appropriate department and classification LF-Tags to the data lake database lfunified_marketing_dl_db (the marketing database on the S3 data lake):
    aws lakeformation add-lf-tags-to-resource --cli-input-json '{
        "CatalogId": "<aws-account-id>",
        "Resource": {
            "Database": {
                "CatalogId": "<aws-account-id>",
                "Name": "lfunified_marketing_dl_db"
            }
        },
        "LFTags": [
            {
                "CatalogId": "<aws-account-id>",
                "TagKey": "department",
                "TagValues": ["marketing"]
            },
            {
                "CatalogId": "<aws-account-id>",
                "TagKey": "classification",
                "TagValues": ["private"]
            }
        ]
    }'

Note that although you only assign the department and classification tags at the database level, they are inherited by the tables and columns within that database.

  3. Assign the classification LF-Tag value pii-sensitive to the PII columns of the customer table to override the value inherited from the database level:
    aws lakeformation add-lf-tags-to-resource --cli-input-json '{
        "CatalogId": "<aws-account-id>",
        "Resource": {
            "TableWithColumns": {
                "CatalogId": "<aws-account-id>",
                "DatabaseName": "customer_db_shared",
                "Name": "public.customer",
                "ColumnNames": ["c_first_name", "c_last_name", "c_email_address"]
            }
        },
        "LFTags": [
            {
                "CatalogId": "<aws-account-id>",
                "TagKey": "classification",
                "TagValues": ["pii-sensitive"]
            }
        ]
    }'
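
To verify the tag assignments, including the tags inherited from the database and the pii-sensitive override on the PII columns, you can fetch the LF-Tags on the shared table; the following is a sketch:

    aws lakeformation get-resource-lf-tags \
        --show-assigned-lf-tags \
        --resource '{"Table": {"DatabaseName": "customer_db_shared", "Name": "public.customer"}}'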

Grant permission based on LF-tag association

Run the following two AWS CLI commands to allow the marketing data analyst access to the customer table excluding the pii-sensitive (PII) columns. Replace the value for DataLakePrincipalIdentifier with the MarketingAnalystRoleARN that you noted from the outputs of the CloudFormation stack:

aws lakeformation grant-permissions --cli-input-json '{
    "CatalogId": "<aws-account-id>",
    "Principal": {"DataLakePrincipalIdentifier" : "<MarketingAnalystRoleARN-from-CloudFormation-Outputs>"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "DATABASE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private"]}]
    }
    },
    "Permissions": [
    "DESCRIBE"
    ],
    "PermissionsWithGrantOption": []
}'
aws lakeformation grant-permissions --cli-input-json '{
    "CatalogId": "<aws-account-id>",
    "Principal": {"DataLakePrincipalIdentifier" : "<MarketingAnalystRoleARN-from-CloudFormation-Outputs>"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "TABLE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private"]}]
    }
    },
    "Permissions": [
    "SELECT"
    ],
    "PermissionsWithGrantOption": []
}'

We have now granted the marketing analysts access to the customer database and table, excluding the columns tagged pii-sensitive.

To allow marketing power users access to table columns with restricted LF-tag (PII columns), run the following AWS CLI command:

aws lakeformation grant-permissions --cli-input-json '{
    "CatalogId": "<aws-account-id>",
    "Principal": {"DataLakePrincipalIdentifier" : "<MarketingPowerUserRoleARN-from-CloudFormation-Outputs>"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "DATABASE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private"]}]
    }
    },
    "Permissions": [
    "DESCRIBE"
    ],
    "PermissionsWithGrantOption": []
}'
aws lakeformation grant-permissions --cli-input-json '{
    "CatalogId": "<aws-account-id>",
    "Principal": {"DataLakePrincipalIdentifier" : "<MarketingPowerUserRoleARN-from-CloudFormation-Outputs>"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "TABLE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private", "pii-sensitive"]}]
    }
    },
    "Permissions": [
    "SELECT"
    ],
    "PermissionsWithGrantOption": []
}'

Alternatively, we can combine these grants into a single batch-grant-permissions call. Replace the principal ARNs in the following example with the role ARNs from your CloudFormation outputs:

aws lakeformation batch-grant-permissions --region us-east-1 --cli-input-json '{
    "CatalogId": "<aws-account-id>",
 "Entries": [
 {  "Id": "1",
    "Principal": {"DataLakePrincipalIdentifier" : "arn:aws:iam:: <aws-account-id>:role/Blog-MarketingAnalystRole-1CYV6JSNN14E3"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "DATABASE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private"]}]
    }
    },
    "Permissions": [
    "DESCRIBE"
    ],
    "PermissionsWithGrantOption": []
    },
    {  "Id": "2",
    "Principal": {"DataLakePrincipalIdentifier" : "arn:aws:iam:: <aws-account-id>:role/Blog-MarketingAnalystRole-1CYV6JSNN14E3"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "TABLE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private"]}]
    }
    },
    "Permissions": [
    "SELECT"
    ],
    "PermissionsWithGrantOption": []
    },
     {  "Id": "3",
    "Principal": {"DataLakePrincipalIdentifier" : "arn:aws:iam:: <aws-account-id>:role/Blog-MarketingPoweruserRole-RKKM0TWQBP0W"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "DATABASE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private", "pii-sensitive"]}]
    }
    },
    "Permissions": [
    "DESCRIBE"
    ],
    "PermissionsWithGrantOption": []
    },
    {  "Id": "4",
    "Principal": {"DataLakePrincipalIdentifier" : "arn:aws:iam:: <aws-account-id>:role/Blog-MarketingPoweruserRole-RKKM0TWQBP0W"},
    "Resource": {
    "LFTagPolicy": {
    "CatalogId": "<aws-account-id>",
    "ResourceType": "TABLE",
    "Expression": [{"TagKey": "department","TagValues": ["marketing","customer"]},{"TagKey": "classification","TagValues": ["private", "pii-sensitive"]}]
    }
    },
    "Permissions": [
    "SELECT"
    ],
    "PermissionsWithGrantOption": []
    }
    ]
 }'
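
To confirm what has been granted, you can list the Lake Formation permissions and filter the output for a principal; for example, the following sketch uses a JMESPath filter on the marketing analyst role ARN from the CloudFormation outputs:

    aws lakeformation list-permissions --max-results 100 \
        --query "PrincipalResourcePermissions[?Principal.DataLakePrincipalIdentifier=='<MarketingAnalystRoleARN-from-CloudFormation-Outputs>']"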

Validate the solution

In this section, we go through the steps to test the scenario.

Consume the datashare in the consumer (marketing) data warehouse

To enable the consumers (marketing team) to access the customer data shared with them via the datashare, first we have to configure Query Editor v2. This configuration is to use IAM credentials as the principal for the Lake Formation permissions. Complete the following steps:

  1. Sign in to the console using the admin role you nominated when running the CloudFormation template.
  2. On the Amazon Redshift console, go to Query Editor v2.
  3. Choose the gear icon in the navigation pane, then choose Account settings.
  4. Under Connection settings, select Authenticate with IAM credentials.
  5. Choose Save.

Now let’s connect to the marketing Redshift cluster and make the customer database available to the marketing team.

  1. Choose the options menu (three dots) next to the Serverless:lfunified-marketing-wg cluster and choose Create connection.
  2. Select Database user name and password.
  3. Leave Database as dev.
  4. For User name, enter admin.
  5. For Password, enter the same password you retrieved from Secrets Manager in an earlier step.
  6. Choose Create connection.
  7. Once successfully connected, choose the plus sign and choose Editor to open a new Query Editor tab.
  8. Make sure that you specify the Serverless: lfunified-marketing-wg workgroup and dev database.
  9. To create the Redshift database from the shared catalog database, run the following SQL command on the new tab:
    CREATE DATABASE ext_customerdb_shared FROM ARN 'arn:aws:glue:<aws-region>:<aws-account-id>:database/customer_db_shared' WITH DATA CATALOG SCHEMA "customer_db_shared"

  10. Run the following SQL commands to create and grant usage on the Redshift database to the IAM roles for the power users and data analyst. You can get the IAM role names from the CloudFormation stack outputs:
    CREATE USER IAMR:"lf-redshift-ds-MarketingAnalystRole-XXXXXXXXXXXX" password disable;
    GRANT USAGE ON DATABASE ext_customerdb_shared to IAMR:"lf-redshift-ds-MarketingAnalystRole-XXXXXXXXXXXX";
    
    CREATE USER IAMR:"lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY" password disable;
    GRANT USAGE ON DATABASE ext_customerdb_shared to IAMR:"lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY";

Create the data lake schema in AWS Glue and allow the marketing power role to query the lead and web activity data

Run the following SQL commands to make the lead data in the S3 data lake available to the marketing team:

create external schema datalake from data catalog
database 'lfunified_marketing_dl_db' 
iam_role 'SESSION'
catalog_id '<aws-account-id>';
GRANT USAGE ON SCHEMA datalake TO IAMR:"lf-redshift-ds-MarketingAnalystRole-XXXXXXXXXXXX";
GRANT USAGE ON SCHEMA datalake TO IAMR:"lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY";

Query the shared dataset as a marketing analyst user

To validate that the marketing team analysts (IAM role lf-redshift-ds-MarketingAnalystRole-XXXXXXXXXXXX) have access to the shared database, perform the following steps:

  1. Sign in to the console (for convenience, you can use a different browser) and switch your role to lf-redshift-ds-MarketingAnalystRole-XXXXXXXXXXXX.
  2. On the Amazon Redshift console, go to Query Editor v2.
  3. To connect to the consumer cluster, choose the Serverless: lfunified-marketing-wg consumer data warehouse in the navigation pane.
  4. When prompted, for Authentication, select Federated user.
  5. For Database, enter the database name (for this post, dev).
  6. Choose Save.
  7. Once you’re connected to the database, you can validate the current logged-in user with the following SQL command:
    select current_user;

  8. To find the federated databases created on the consumer account, run the following SQL command:
    SHOW DATABASES FROM DATA CATALOG ACCOUNT '<aws-account-id>';

  9. To validate permissions for the marketing analyst role, run the following SQL command:
    select * from ext_customerdb_shared.public.customer limit 10;

As you can see in the following screenshot, the marketing analyst is able to successfully access the customer data but only the non-PII attributes, which was our intention.

  10. Now let's validate that the marketing analyst doesn't have access to the PII columns of the same table. Because these columns are tagged pii-sensitive and the analyst role was only granted the private classification, the following query should return a permission error:
    select c_email_address from ext_customerdb_shared.public.customer limit 10;

Query the shared datasets as a marketing power user

To validate that the marketing power users (IAM role lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY) have access to the pii-sensitive columns in the shared database, perform the following steps:

  1. Sign in to the console (for convenience, you can use a different browser) and switch your role to lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY.
  2. On the Amazon Redshift console, go to Query Editor v2.
  3. To connect to the consumer cluster, choose the Serverless: lfunified-marketing-wg consumer data warehouse in the navigation pane.
  4. When prompted, for Authentication, select Federated user.
  5. For Database, enter the database name (for this post, dev).
  6. Choose Save.
  7. Once you’re connected to the database, you can validate the current logged-in user with the following SQL command:
    select current_user;

  8. Now let's validate that the marketing power role has access to the PII columns of the customer table:
    select c_customer_id, c_first_name, c_last_name, c_email_address from ext_customerdb_shared.public.customer limit 10;

  9. Validate that the power users within the marketing team can now run a query to combine data across different datasets that they have access to in order to run effective campaigns:
    SELECT
        emailaddress as emailAddress,  customer.c_first_name as firstName, customer.c_last_name as lastName, leadsource, contactnotes, usedpromo
    FROM
        "dev"."datalake"."lead" as lead
    JOIN ext_customerdb_shared.public.customer as customer
    ON lead.emailaddress = customer.c_email_address
    WHERE lead.donotreachout = 'false'

Clean up

After you complete the steps in this post, to clean up resources, delete the CloudFormation stack:

  1. On the AWS CloudFormation console, select the stack you deployed in the beginning of this post.
  2. Choose Delete and follow the prompts to delete the stack.

Conclusion

In this post, we showed how you can use LF-Tags to manage permissions for your data lake and Amazon Redshift data sharing with Lake Formation. Using Lake Formation LF-TBAC for data governance helps you manage your data lake and Amazon Redshift data sharing permissions at scale. It also enables data sharing across business units with fine-grained access control. Managing access to your data lake and Redshift datashares in a single place enables better governance, helping with data security and compliance.

If you have questions or suggestions, submit them in the comments section.

For more information on Lake Formation managed Amazon Redshift data sharing and tag-based access control, refer to Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation and Easily manage your data lake at scale using AWS Lake Formation Tag-based access control.


About the Authors

Praveen Kumar is an Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-native services. His areas of interests are serverless technology, modern cloud data warehouses, streaming, and ML applications.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Paul Villena is an Analytics Solutions Architect in AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interests are infrastructure as code, serverless technologies, and coding in Python.

Mostafa Safipour is a Solutions Architect at AWS based out of Sydney. He works with customers to realize business outcomes using technology and AWS. Over the past decade, he has helped many large organizations in the ANZ region build their data, digital, and enterprise workloads on AWS.

Persist and analyze metadata in a transient Amazon MWAA environment

Post Syndicated from Praveen Kumar original https://aws.amazon.com/blogs/big-data/persist-and-analyze-metadata-in-a-transient-amazon-mwaa-environment/

Customers can harness sophisticated orchestration capabilities through the open-source tool Apache Airflow. Airflow can be installed on Amazon EC2 instances or can be dockerized and deployed as a container on AWS container services. Alternatively, customers can also opt to leverage Amazon Managed Workflows for Apache Airflow (MWAA).

Amazon MWAA is a fully managed service that enables customers to focus more of their efforts on high-impact activities such as programmatically authoring data pipelines and workflows, as opposed to maintaining or scaling the underlying infrastructure. Amazon MWAA offers auto-scaling capabilities where it can respond to surges in demand by scaling the number of Airflow workers out and back in.

With Amazon MWAA, there are no upfront commitments and you only pay for what you use based on instance uptime, additional auto-scaling capacity, and storage of the Airflow back-end metadata database. This database is provisioned and managed by Amazon MWAA and contains the necessary metadata to support the Airflow application.  It hosts key data points such as historical execution times for tasks and workflows and is valuable in understanding trends and behaviour of your data pipelines over time. Although the Airflow console does provide a series of visualisations that help you analyse these datasets, these are siloed from other Amazon MWAA environments you might have running, as well as the rest of your business data.

Data platforms encompass multiple environments. Typically, non-production environments are not subject to the same orchestration demands and schedule as those of production environments. In most instances, these non-production environments are idle outside of business hours and can be spun down to realise further cost-efficiencies. Unfortunately, terminating Amazon MWAA instances results in the purging of that critical metadata.

In this post, we discuss how to export, persist, and analyse Airflow metadata in Amazon S3, enabling you to perform pipeline monitoring and analysis. In doing so, you can spin down Airflow instances without losing operational metadata.

Benefits of Airflow metadata

Persisting the metadata in the data lake enables customers to perform pipeline monitoring and analysis in a more meaningful manner:

  • Airflow operational logs can be joined and analysed across environments
  • Trend analysis can be conducted to explore how data pipelines are performing over time, which specific stages are taking the most time, and how performance is affected as data scales
  • Airflow operational data can be joined with business data for improved record level lineage and audit capabilities

These insights can help customers understand the performance of their pipelines over time and guide focus towards which processes need to be optimised.

The technique described below to extract metadata is applicable to any Airflow deployment type, but we will focus on Amazon MWAA in this blog.

Solution Overview

The below diagram illustrates the solution architecture. Please note, Amazon QuickSight is NOT included as part of the CloudFormation stack and is not covered in this tutorial. It has been placed in the diagram to illustrate that metadata can be visualised using a business intelligence tool.

As part of this tutorial, you will be performing the below high-level tasks:

  • Run CloudFormation stack to create all necessary resources
  • Trigger Airflow DAGs to perform sample ETL workload and generate operational metadata in back-end database
  • Trigger Airflow DAG to export operational metadata into Amazon S3
  • Perform analysis with Amazon Athena

This post comes with an AWS CloudFormation stack that automatically provisions the necessary AWS resources and infrastructure, including an active Amazon MWAA instance, for this solution. The entire code is available in the GitHub repository.

The Amazon MWAA instance will already have three directed-acyclic graphs (DAGs) imported:

  1. glue-etl – This ETL workflow leverages AWS Glue to perform transformation logic on a CSV file (customer_activity.csv). This file will be loaded as part of the CloudFormation template into the s3://<DataBucket>/raw/ prefix.

The first task, glue_csv_to_parquet, converts the ‘raw’ data to Parquet format and stores the data in the location s3://<DataBucket>/optimised/. By converting the data to Parquet format, you can achieve faster query performance and lower query costs.

The second task, glue_transform, runs an aggregation over the newly created Parquet data and stores the aggregated data in the location s3://<DataBucket>/conformed/.

  2. db_export_dag – This DAG consists of one task, export_db, which exports the data from the back-end Airflow database into Amazon S3 in the location s3://<DataBucket>/export/.

Please note that you may experience time-out issues when extracting large amounts of data. On busy Airflow instances, our recommendation is to set up frequent extracts in small chunks.

  3. run-simple-dag – This DAG does not perform any data transformation or manipulation. It is used in this blog for the purposes of populating the back-end Airflow database with sufficient operational data.

Prerequisites

To implement the solution outlined in this blog, you will need the following:

Steps to run a data pipeline using Amazon MWAA and save metadata to Amazon S3:

  1. Choose Launch Stack:
  2. Choose Next.
  3. For Stack name, enter a name for your stack.
  4. Choose Next.
  5. Keep the default settings on the ‘Configure stack options’ page, and choose Next.
  6. Acknowledge that the template may create AWS Identity and Access Management (IAM) resources.
  7. Choose Create stack. The stack can take up to 30 mins to complete.

The CloudFormation template generates the following resources:

    • VPC infrastructure that uses public routing over the internet.
    • Amazon S3 buckets required to support Amazon MWAA, detailed below:
      • The Data Bucket, referred to in this blog as s3://<DataBucket>, holds the data that will be optimised and transformed for further analytical consumption. This bucket will also hold the data from the Airflow back-end metadata database once extracted.
      • The Environment Bucket, referred to in this blog as s3://<EnvironmentBucket>, stores your DAGs, as well as any custom plugins and Python dependencies you may have.
    • Amazon MWAA environment that's associated with the s3://<EnvironmentBucket>/dags location.
    • AWS Glue jobs for data processing that help generate Airflow metadata.
    • AWS Lambda-backed custom resources to upload to Amazon S3 the sample data, AWS Glue scripts, and DAG configuration files.
    • AWS Identity and Access Management (IAM) users, roles, and policies.
  8. Once the stack creation is successful, navigate to the Outputs tab of the CloudFormation stack and make note of the DataBucket and EnvironmentBucket names. The environment bucket is where your Apache Airflow Directed Acyclic Graphs (DAGs), custom plugins (in a plugins.zip file), and Python dependencies (in a requirements.txt file) are stored.
  9. Open the Environments page on the Amazon MWAA console.
  10. Choose the environment created above (the environment name will include the stack name) and click Open Airflow UI.
  11. Choose the glue-etl DAG, unpause it using the toggle next to the name of the DAG, and click the play button on the right-hand side to trigger the DAG. It may take up to a minute for the DAG to appear.
  12. Leave Configuration JSON empty and hit Trigger.
  13. Choose the run-simple-dag DAG, unpause it, and click Trigger DAG.
  14. Once both DAG executions have completed, select the db_export_dag DAG, unpause it, and click Trigger DAG. Leave Configuration JSON empty and hit Trigger.

This step extracts the DAG and task metadata to an S3 location. This is a sample list of tables, and more tables can be added as required. The exported metadata will be located in the s3://<DataBucket>/export/ folder.
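
To confirm that the export completed, you can list the exported objects; for example:

    # Each exported table lands under its own prefix (for example, export/dagrun/ and export/taskinstance/)
    aws s3 ls s3://<DataBucket>/export/ --recursive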

Visualise using Amazon QuickSight and Amazon Athena

Amazon Athena is a serverless interactive query service that can be used to run exploratory analysis on data stored in Amazon S3.

If you are using Amazon Athena for the first time, please find the steps here to set up the query result location. We can use Amazon Athena to explore and analyse the metadata generated from Airflow DAG runs.

  1. Navigate to the Athena console and choose Explore the query editor.
  2. Hit View Settings.
  3. Click Manage.
  4. Replace the query result location with s3://<DataBucket>/logs/athena/. Once completed, return to the query editor.
  5. Before we can perform our pipeline analysis, we need to create the below DDLs. Replace the <DataBucket> as part of the LOCATION clause with the parameter value as defined in the CloudFormation stack (noted in Step 8 above).
    CREATE EXTERNAL TABLE default.airflow_metadata_dagrun (
            sa_instance_state STRING,
            dag_id STRING,
            state STRING,
            start_date STRING,
            run_id STRING,
            external_trigger STRING,
            conf_name STRING,
            dag_hash STRING,
             id STRING,
            execution_date STRING,
            end_date STRING,
            creating_job_id STRING,
            run_type STRING,
            last_scheduling_decision STRING
       )
    PARTITIONED BY (dt string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION 's3://<DataBucket>/export/dagrun/'
    TBLPROPERTIES ("skip.header.line.count"="1");
    MSCK REPAIR TABLE default.airflow_metadata_dagrun;
    
    CREATE EXTERNAL TABLE default.airflow_metadata_taskinstance (
            sa_instance_state STRING,
            start_date STRING,
            job_id STRING,
            pid STRING,
            end_date STRING,
            pool STRING,
            executor_config STRING,
            duration STRING,
            pool_slots STRING,
            external_executor_id STRING,
            state STRING,
            queue STRING,
            try_number STRING,
            max_tries STRING,
            priority_weight STRING,
            task_id STRING,
            hostname STRING,
            operator STRING,
            dag_id STRING,
            unixname STRING,
            queued_dttm STRING,
            execution_date STRING,
            queued_by_job_id STRING,
            test_mode STRING
       )
    PARTITIONED BY (dt string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION 's3://<DataBucket>/export/taskinstance/'
    TBLPROPERTIES ("skip.header.line.count"="1");
    MSCK REPAIR TABLE default.airflow_metadata_taskinstance;

  6. You can preview the table in the query editor of Amazon Athena.
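
For example, a quick preview of the DAG run metadata could look like the following (a sketch against the airflow_metadata_dagrun table created above):

    -- Preview the exported DAG run metadata
    SELECT dag_id, run_id, state, start_date, end_date
    FROM default.airflow_metadata_dagrun
    LIMIT 10;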

  7. With the metadata persisted, you can perform pipeline monitoring and derive some powerful insights on the performance of your data pipelines over time. As an example to illustrate this, execute the below SQL query in Athena.

This query returns pertinent metrics at a monthly grain, which include the number of executions of the DAG in that month, the success rate, the minimum/maximum/average duration for the month, and the variation compared to the previous month's average.

Through the below SQL query, you will be able to understand how your data pipelines are performing over time.

select dag_run_prev_month_calcs.*
        , avg_duration - prev_month_avg_duration as var_duration
from
    (
select dag_run_monthly_calcs.*
            , lag(avg_duration, 1, avg_duration) over (partition by dag_id order by year_month) as prev_month_avg_duration
    from
        (
            select dag_id
                    , year_month
                    , sum(counter) as num_executions
                    , sum(success_ind) as num_success
                    , sum(failed_ind) as num_failed
                    , (cast(sum(success_ind) as double)/ sum(counter))*100 as success_rate
                    , min(duration) as min_duration
                    , max(duration) as max_duration
                    , avg(duration) as avg_duration
            from
                (
                    select dag_id
                            , 1 as counter
                            , case when state = 'success' then 1 else 0 end as success_ind
                            , case when state = 'failed' then 1 else 0 end as failed_ind
                            , date_parse(start_date,'%Y-%m-%d %H:%i:%s.%f+00:00') as start_date
                            , date_parse(end_date,'%Y-%m-%d %H:%i:%s.%f+00:00') as end_date
                            , date_parse(end_date,'%Y-%m-%d %H:%i:%s.%f+00:00') - date_parse(start_date,'%Y-%m-%d %H:%i:%s.%f+00:00') as duration
                            , date_format(date_parse(start_date,'%Y-%m-%d %H:%i:%s.%f+00:00'), '%Y-%m') as year_month
                    from "default"."airflow_metadata_dagrun"
                    where state <> 'running'
                )  dag_run_counters
            group by dag_id, year_month
        ) dag_run_monthly_calcs
    ) dag_run_prev_month_calcs
order by dag_id, year_month

  8. You can also visualise this data using your BI tool of choice. While step-by-step details of creating a dashboard are not covered in this blog, please refer to the below dashboard built on Amazon QuickSight as an example of what can be built based on the metadata extracted above. If you are using Amazon QuickSight for the first time, please find the steps here on how to get started.

Through QuickSight, we can quickly visualise and derive that our data pipelines are completing successfully, but on average are taking a longer time to complete over time.

Clean up the environment

  1. Navigate to the S3 console and click on the <DataBucket> noted in step 8 above.
  2. Click on Empty bucket.
  3. Confirm the selection.
  4. Repeat this step for bucket <EnvironmentBucket> (noted in step 8 above) and Empty bucket.
  5. Run the below statements in the query editor to drop the two Amazon Athena tables. Run statements individually.
    DROP TABLE default.airflow_metadata_dagrun;
    DROP TABLE default.airflow_metadata_taskinstance;

  6. On the AWS CloudFormation console, select the stack you created and choose Delete.

Summary

In this post, we presented a solution to further optimise the costs of Amazon MWAA by tearing down instances whilst preserving the metadata. Storing this metadata in your data lake enables you to better perform pipeline monitoring and analysis. This process can be scheduled and orchestrated programmatically and is applicable to all Airflow deployments, such as Amazon MWAA, Apache Airflow installed on Amazon EC2, and even on-premises installations of Apache Airflow.

To learn more, please visit Amazon MWAA and Getting Started with Amazon MWAA.


About the Authors

Praveen Kumar is a Specialist Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-native services. His areas of interests are serverless technology, streaming applications, and modern cloud data warehouses.

Avnish Jain is a Specialist Solution Architect in Analytics at AWS with experience designing and implementing scalable, modern data platforms on the cloud for large scale enterprises. He is passionate about helping customers build performant and robust data-driven solutions and realise their data & analytics potential.