Use an AD FS user and Tableau to securely query data in AWS Lake Formation

Post Syndicated from Jason Nicholls original https://aws.amazon.com/blogs/big-data/use-an-ad-fs-user-and-tableau-to-securely-query-data-in-aws-lake-formation/

Security-conscious customers often adopt a Zero Trust security architecture. Zero Trust is a security model centered on the idea that access to data shouldn’t be solely based on network location, but rather require users and systems to prove their identities and trustworthiness and enforce fine-grained identity-based authorization rules before granting access to applications, data, and other systems.

Some customers rely on third-party identity providers (IdPs) like Active Directory Federated Services (AD FS) as a system to manage credentials and prove identities and trustworthiness. Users can use their AD FS credentials to authenticate to various related yet independent systems, including the AWS Management Console (for more information, see Enabling SAML 2.0 federated users to access the AWS Management Console).

In the context of analytics, some customers extend Zero Trust to data stored in data lakes, which includes the various business intelligence (BI) tools used to access that data. A common data lake pattern is to store data in Amazon Simple Storage Service (Amazon S3) and query the data using Amazon Athena.

AWS Lake Formation allows you to define and enforce access policies at the database, table, and column level when using Athena queries to read data stored in Amazon S3. Lake Formation supports Active Directory and Security Assertion Markup Language (SAML) identity providers such as OKTA and Auth0. Furthermore, Lake Formation securely integrates with the AWS BI service Amazon QuickSight. QuickSight allows you to effortlessly create and publish interactive BI dashboards, and supports authentication via Active Directory. However, if you use alternative BI tools like Tableau, you may want to use your Active Directory credentials to access data stored in Lake Formation.

In this post, we show you how you can use AD FS credentials with Tableau to implement a Zero Trust architecture and securely query data in Amazon S3 and Lake Formation.

Solution overview

In this architecture, user credentials are managed by Active Directory, and not Amazon Identity and Access Management (IAM). Although Tableau provides a connector to connect Tableau to Athena, the connector requires an AWS access key ID and an AWS secret access key normally used for programmatic access. Creating an IAM user with programmatic access for use by Tableau is a potential solution, however some customers have made an architectural decision that access to AWS accounts is done via a federated process using Active Directory, and not an IAM user.

In this post, we show you how you can use the Athena ODBC driver in conjunction with AD FS credentials to query sample data in a newly created data lake. We simulate the environment by enabling federation to AWS using AD FS 3.0 and SAML 2.0. Then we guide you through setting up a data lake using Lake Formation. Finally, we show you how you can configure an ODBC driver for Tableau to securely query your data in the lake data using your AD FS credentials.

Prerequisites

The following prerequisites are required to complete this walkthrough:

Create your environment

To simulate the production environment, we created a standard VPC in Amazon Virtual Private Cloud (Amazon VPC) with one private subnet and one public subnet. You can do the same using the VPC wizard. Our Amazon Elastic Compute Cloud (Amazon EC2) instance running the Tableau client is located in a private subnet and accessible via an EC2 bastion host. For simplicity, connecting out to Amazon S3, AWS Glue, and Athena is done via the NAT gateway and internet gateway set up by the VPC wizard. Optionally, you can replace the NAT gateway with AWS PrivateLink endpoints (AWS Security Token Service (AWS STS), Amazon S3, Athena, and AWS Glue endpoints are required) to make sure traffic remains within the AWS network.

The following diagram illustrates our environment architecture.

After you create your VPC with its private and public subnets, you can continue to build out the other requirements, such as Active Directory and Lake Formation. Let’s begin with Active Directory.

Enable federation to AWS using AD FS 3.0 and SAML 2.0

AD FS 3.0, a component of Windows Server, supports SAML 2.0 and is integrated with IAM. This integration allows Active Directory users to federate to AWS using corporate directory credentials, such as a user name and password from Active Directory. Before you can complete this section, AD FS must be configured and running.

To set up AD FS, follow the instructions in Setting up trust between AD FS and AWS and using Active Directory credentials to connect to Amazon Athena with ODBC driver. The first section of the post explains in detail how to set up AD FS and establish the trust between AD FS and Active Directory. The post ends with setting up an ODBC driver for Athena, which you can skip. The post creates a group name called ArunADFSTest. This group relates to a role in your AWS account, which you use later.

When you have successfully verified that you can log in using your IdP, you’re ready to configure your Windows environment ODBC driver to connect to Athena.

Set up a data lake using Lake Formation

Lake Formation is a fully managed service that makes it easy for you to build, secure, and manage data lakes. Lake Formation provides its own permissions model that augments the IAM permissions model. This centrally defined permissions model enables fine-grained access to data stored in data lakes through a simple grant/revoke mechanism. We use this permissions model to grant access to the AD FS role we created earlier.

  1. On the Lake Formation console, you’re prompted with a welcome box the first time you access Lake formation.
    The box asks you to select the initial administrative user and roles.
  2. Choose Add myself and choose Get Started.
    We use the sample database provided by Lake Formation, but you’re welcome to use your own dataset. For instructions on loading your own dataset, see Getting Started with Lake Formation.With Lake Formation configured, we must grant read access to the AD FS role (ArunADFSTest) we created in the previous step.
  3. In the navigation pane, choose Databases.
  4. Select the database sampledb.
  5. On the Actions menu, choose Grant.

    We grant the SamlOdbcAccess role access to sampledb.
  6. For Principals, select IAM users and roles.
  7. For IAM users and roles, choose the role ArunADFSTest.
  8. Select Named data catalog resources.
  9. For Databases, choose the database sampledb.
  10. For Tables¸ choose All tables.
  11. Set the table permissions to Select and Describe.
  12. For Data permissions, select All data access.
  13. Choose Grant.
    Our AD FS user assumes the role ArunADFSTest, which has been granted access to sampledb by Lake Formation. However, the ArunADFSTest role requires access to Lake Formation, Athena, AWS Glue, and Amazon S3. Following the practice of least privilege, AWS defines policies for specific Lake Formation personas. Our user fits the Data Analyst persona, which requires enough permissions to run queries.
  14. Add the AmazonAthenaFullAccess managed policy (for instructions, see Adding and removing IAM identity permissions) and the following inline policy to the ArunADFSTest role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "lakeformation:GetDataAccess",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:SearchTables",
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetPartitions",
                    "lakeformation:GetResourceLFTags",
                    "lakeformation:ListLFTags",
                    "lakeformation:GetLFTag",
                    "lakeformation:SearchTablesByLFTags",
                    "lakeformation:SearchDatabasesByLFTags"
                ],
                "Resource": "*"
            }
        ]
    }

    Each time Athena runs a query, it stores the results in an S3 bucket, which is configured as the query result location in Athena.

  15. Create an S3 bucket, and in this new bucket create a new folder called athena_results.
  16. Update the settings on the Athena console to use your newly created folder.
    Tableau uses Athena to run the query and read the results from Amazon S3, which means that the ArunADFSTest role requires access to your newly created S3 folder.
  17. Attach the following inline policy to the ArunADFSTest role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:PutObjectAcl"
                ],
                "Resource": "arn:aws:s3:::[BUCKET_NAME]/athena_results/*"
            }
        ]
    }

Our AD FS user can now assume a role that has enough privileges to query the sample database. The next step is to configure the ODBC driver on the client.

Configure an Athena ODBC driver

Athena is a managed serverless and interactive query service that allows you to analyze your data in Amazon S3 using standard Structured Query Language (SQL). You can use Athena to directly query data that is located in Amazon S3 or data that is registered with Lake Formation. Athena provides you with ODBC and JDBC drivers to effortlessly integrate with your data analytics tools (such as Microsoft Power BI, Tableau, or SQL Workbench) to seamlessly gain insights about your data in minutes.

To connect to our Lake Formation environment, we first need to install and configure the Athena ODBC driver on our Windows environment.

  1. Download the Athena ODBC driver relevant to your Windows environment.
  2. Install the driver by choosing the driver file you downloaded (in our case, Simba+Athena+1.1+64-bit.msi).
  3. Choose Next on the welcome page.
  4. Read the End-User License Agreement, and if you agree to it, select I Accept the terms in the License Agreement and choose Next.
  5. Leave the default installation location for the ODBC driver and choose Next.
  6. Choose Install to begin the installation.
  7. If the User Access Control pop-up appears, choose Yes to allow the driver installation to continue.
  8. When the driver installation is complete, choose Finish to close the installer.
  9. Open the Windows ODBC configuration application by selecting the Start bar and searching for ODBC.
  10. Open the version corresponding to the Athena ODBC version you installed, in our case 64 bit.
  11. On the User DSN tab, choose Add.
  12. Choose the Simba Athena ODBC driver and choose Finish.

Configure the ODBC driver to AD FS authentication

We now need to configure the driver.

  1. Choose the driver on the Driver configuration page.
  2. For Data Source Name, enter sampledb.
  3. For Description, enter Lake Formation Sample Database.
  4. For AWS Region, enter eu-west-1 or the Region you used when configuring Lake Formation.
  5. For Metadata Retrieval Method, choose Auto.
  6. For S3 Output Location, enter s3://[BUCKET_NAME]/athena_results/.
  7. For Encryption Options, choose NOT_SET.
  8. Clear the rest of the options.
  9. Choose Authentication Options.
  10. For Authentication Type, choose ADFS.
  11. For User, enter [DOMAIN]\[USERNAME].
  12. For Password, enter your domain user password.
  13. For Preferred Role, enter aws:iam::[ACCOUNT NUMBER]:role/ArunADFSTest.
    The preferred role is the same role configured in the previous section (ArunADFSTest).
  14. For IdP Host, enter the AD federation URL you configured during AD FS setup.
  15. For IdP Port, enter 443.
  16. Select SSL Insecure.
  17. Choose OK.
  18. Choose Test on the initial configuration page to test the connection.
  19. When you see a success confirmation, choose OK.

We can now connect to our Lake Formation sample database from our desktop environment using the Athena ODBC driver. The next step is to use Tableau to query our data using the ODBC connection.

Connect to your data using Tableau

To connect to your data, complete the following steps:

  1. Open your Tableau Desktop edition.
  2. Under To a Server, choose More.
  3. On the list of available Tableau installed connectors, choose Other Databases (ODBC).
  4. Choose the ODBC database you created earlier.
  5. Choose Connect.
  6. Choose Sign In.

    When the Tableau workbook opens, select the database, schema, and table that you want to query.
  7. For Database, choose the database as listed in the ODBC setup (for this post, AwsDataCatalog).
  8. For Schema, choose your schema (sampledb).
  9. For Table, search for and choose your table (elb_logs).
  10. Drag the table to the work area to start your query and further report development.

Clean up

AWS Lake Formation provides database-, table-, column-, and tag-based access controls, and cross-account sharing at no charge. Lake Formation charges a fee for transaction requests and for metadata storage. In addition to providing a consistent view of data and enforcing row-level and cell-level security, the Lake Formation Storage API scans data in Amazon S3 and applies row and cell filters before returning results to applications. There is a fee for this filtering. To make sure that you’re not charged for any of the services that you no longer need, stop any EC2 instances that you created. Remove any objects in Amazon S3 you no longer require, because you pay for objects stored in S3 buckets.

Lastly, delete any Active Directory instances you may have created.

Conclusion

Lake Formation makes it simple to set up a secure data lake and then use the data lake with your choice of analytics and machine learning services, including Tableau. In this post, we showed you how you can connect to your data lake using AD FS credentials in a simple and secure way by using the Athena ODBC driver. Your AD FS user is configured within the ODBC driver, which then assumes a role in AWS. This role is granted access to only the data you require via Lake Formation.

To learn more about Lake Formation, see the Lake Formation Developer Guide or follow the Lake Formation workshop.


About the Authors

Jason Nicholls is an Enterprise Solutions Architect at AWS. He’s passionate about building scalable web and mobile applications on AWS. He started coding on a Commodore VIC 20, which lead to a career in software development. Jason holds a MSc in Computer Science with specialization in coevolved genetic programming. He is based in Johannesburg, South Africa.

Francois van Rensburg is a Partner Management Solutions Architect at AWS. He has spent the last decade helping enterprise organizations successfully migrate to the cloud. He is passionate about networking and all things cloud. He started as a Cobol programmer and has built everything from software to data centers. He is based in Denver, Colorado.