Implement fine-grained access control in Amazon SageMaker Studio and Amazon EMR using Apache Ranger and Microsoft Active Directory

Post Syndicated from Rahul Sarda original https://aws.amazon.com/blogs/big-data/implement-fine-grained-access-control-in-amazon-sagemaker-studio-and-amazon-emr-using-apache-ranger-and-microsoft-active-directory/

Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) that enables data scientists and developers to perform every step of the ML workflow, from preparing data to building, training, tuning, and deploying models. SageMaker Studio comes with built-in integration with Amazon EMR, enabling data scientists to interactively prepare data at petabyte scale using frameworks such as Apache Spark, Hive, and Presto right from SageMaker Studio notebooks. With Amazon SageMaker, developers, data scientists, and SageMaker Studio users can easily access both raw data stored in Amazon Simple Storage Service (Amazon S3) and tabular data cataloged in a Hive metastore. SageMaker Studio's support for Apache Ranger creates a simple mechanism for applying fine-grained access control to the raw and cataloged data, with grant and revoke policies administered from a friendly web interface.

In this post, we show how you can authenticate into SageMaker Studio using an existing Active Directory (AD), with authorized access to both Amazon S3 and Hive cataloged data using AD entitlements via Apache Ranger integration and AWS IAM Identity Center (successor to AWS Single Sign-On). With this solution, you can manage access to multiple SageMaker environments and SageMaker Studio notebooks using a single set of credentials. Subsequently, Apache Spark jobs created from SageMaker Studio notebooks will access only the data and resources permitted by Apache Ranger policies attached to the AD credentials, inclusive of table and column-level access.

With this capability, multiple SageMaker Studio users can connect to the same EMR cluster, gaining access only to data granted to their user or group, with audit records captured and visible in Amazon CloudWatch. This multi-tenant environment is possible through user session isolation that prevents users from accessing datasets and cluster resources allocated to other users. Ultimately, organizations can provision fewer clusters, reduce administrative overhead, and increase cluster utilization, saving staff time and cloud costs.

Solution overview

We demonstrate this solution with an end-to-end use case using a sample ecommerce dataset. The dataset is available within provided AWS CloudFormation templates and consists of transactional ecommerce data (products, orders, customers) cataloged in a Hive metastore.

The solution utilizes two data analyst personas, Alex and Tina, each tasked with different analysis requiring fine-grained limitations on dataset access:

  • Tina, a data scientist on the marketing team, is tasked with building a model for customer lifetime value. Data access should only be permitted to non-sensitive customer, product, and orders data.
  • Alex, a data scientist on the sales team, is tasked with generating a product demand forecast, requiring access to product and orders data. No customer data is required.

The following figure illustrates our desired fine-grained access.

The following diagram illustrates the solution architecture.

The architecture is implemented as follows:

  • Microsoft Active Directory – Used to manage user authentication, select AWS application access, and user and group membership for Apache Ranger secured data authorization
  • Apache Ranger – Used to monitor and manage comprehensive data security across the Hadoop and Amazon EMR platform
  • Amazon EMR – Used to retrieve, prepare, and analyze data from the Hive metastore using Spark
  • SageMaker Studio – An integrated IDE with purpose-built tools to build AI/ML models

The following sections walk through the setup of the architectural components for this solution using the CloudFormation stack.

Prerequisites

Before you get started, make sure you have the necessary prerequisites in place, including an AWS account with access to the services used in this post.

Create resources with AWS CloudFormation

To build the solution within your environment, use the provided CloudFormation templates to create the required AWS resources.

Note that running these CloudFormation templates and the following configuration steps will create AWS resources that may incur charges. Additionally, all the steps should be run in the same Region.

Template 1

This first template creates the following resources and takes approximately 15 minutes to complete:

  • A Multi-AZ, multi-subnet VPC infrastructure, with managed NAT gateways in the public subnet for each Availability Zone
  • S3 VPC endpoints and Elastic Network Interfaces
  • A Windows Active Directory domain controller using Amazon Elastic Compute Cloud (Amazon EC2) with cross-realm trust
  • A Linux Bastion host (Amazon EC2) in an auto scaling group

To deploy this template, complete the following steps:

  1. Sign in to the AWS Management Console.
  2. On the Amazon EC2 console, create an EC2 key pair.
  3. Choose Launch Stack.
  4. Select the target Region.
  5. Verify the stack name and provide the following parameters:
    1. The name of the key pair you created.
    2. Passwords for cross-realm trust, the Windows domain admin, LDAP bind, and default AD user. Be sure to record these passwords to use in future steps.
    3. Select a minimum of three Availability Zones based on the chosen Region.
  6. Review the remaining parameters. No changes are required for the solution, but you may change parameter values if desired.
  7. Choose Next and then choose Next again.
  8. Review the parameters.
  9. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  10. Choose Submit.
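For repeatable deployments, the console steps above can also be scripted with boto3. The following sketch only assembles the arguments for a `create_stack` call, including the capability acknowledgements granted in step 9; the stack name, template URL, parameter keys, and Availability Zones are placeholders (assumptions), so substitute the values from your Launch Stack link, key pair, and Region before uncommenting the call.

```python
# Sketch: deploying the first template programmatically instead of through
# the console. All names and keys below are illustrative placeholders.
def build_stack_request(stack_name, template_url, key_pair, azs):
    """Assemble create_stack arguments, mirroring the console wizard.

    The Capabilities list is the programmatic equivalent of selecting both
    acknowledgement boxes (named IAM resources and CAPABILITY_AUTO_EXPAND).
    """
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": "KeyPairName", "ParameterValue": key_pair},
            {"ParameterKey": "AvailabilityZones", "ParameterValue": ",".join(azs)},
        ],
        "Capabilities": ["CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
    }

request = build_stack_request(
    "ad-emr-ranger-base",                                      # placeholder
    "https://example-bucket.s3.amazonaws.com/template1.yaml",  # placeholder
    "my-ec2-keypair",
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)
# import boto3
# cfn = boto3.client("cloudformation", region_name="us-east-1")
# cfn.create_stack(**request)
```

Without the capabilities in place, CloudFormation rejects the request with an `InsufficientCapabilities` error, which is why the console asks for the acknowledgements in step 9.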

Template 2

The second template creates the EMR cluster, the Apache Ranger server, and their supporting resources, and takes approximately 30–60 minutes to complete.

To deploy this template, complete the following steps:

  1. Choose Launch Stack.
  2. Select the target Region.
  3. Verify the stack name and provide the following parameters:
    1. Key pair name (created earlier).
    2. LDAPHostPrivateIP address, which can be found in the output section of the Windows AD CloudFormation stack.
    3. Passwords for the Windows domain admin, cross-realm trust, AD domain user, and LDAP bind. Use the same passwords as you did for the first CloudFormation template.
    4. Passwords for the RDS for MySQL database and KDC admin. Record these passwords; they may be needed in future steps.
    5. Log directory for the EMR cluster.
    6. The VPC (its name contains the name of the first CloudFormation stack).
    7. Subnet details (align the subnet name with the parameter name).
    8. Set AppsEMR to Hadoop, Spark, Hive, Livy, Hue, and Trino.
    9. Leave RangerAdminPassword as is.
  4. Review the remaining parameters. No changes are required beyond what is mentioned, but you may change parameter values if desired.
  5. Choose Next, then choose Next again.
  6. Review the parameters.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  8. Choose Submit.

Integrate Active Directory with AWS accounts using IAM Identity Center

To enable users to sign in to SageMaker with Active Directory credentials, a connection between IAM Identity Center and Active Directory must be established.

To connect to Microsoft Active Directory, we set up AWS Directory Service using AD Connector.

  1. On the Directory Service console, choose Directories in the navigation pane.
  2. Choose Set up directory.
  3. For Directory types, select AD Connector.
  4. Choose Next.
  5. For Directory size, select the appropriate size for AD Connector. For this post, we select Small.
  6. Choose Next.
  7. Choose the VPC and private subnets where the Windows AD domain controller resides.
  8. Choose Next.
  9. In the Active Directory information section, enter the following details (this information can be retrieved on the Outputs tab of the first CloudFormation template):
    1. For Directory DNS Name, enter awsemr.com.
    2. For Directory NetBIOS name, enter awsemr.
    3. For DNS IP addresses, enter the private IPv4 address of the AD domain controller.
    4. Enter the service account user name and password that you provided during stack creation.
  10. Choose Next.
  11. Review the settings and choose Create directory.
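The same AD Connector setup can be expressed as a boto3 `ds.connect_directory` call. The sketch below only builds the request arguments; the VPC ID, subnet IDs, and domain controller IP are placeholders to be replaced with the values from the first template's Outputs tab, and the service account password should come from a secret store rather than source code.

```python
# Sketch: AD Connector creation mirroring the console wizard above.
# vpc_id, subnet_ids, dc_ip, and credentials are illustrative placeholders.
def build_ad_connector_request(vpc_id, subnet_ids, dc_ip, svc_user, svc_password):
    """Assemble connect_directory arguments for the awsemr.com directory."""
    return {
        "Name": "awsemr.com",        # Directory DNS name
        "ShortName": "awsemr",       # Directory NetBIOS name
        "Password": svc_password,    # Service account password
        "Size": "Small",             # Directory size selected above
        "ConnectSettings": {
            "VpcId": vpc_id,
            "SubnetIds": subnet_ids,  # Private subnets hosting the AD controller
            "CustomerDnsIps": [dc_ip],  # AD controller private IPv4 address
            "CustomerUserName": svc_user,
        },
    }

req = build_ad_connector_request(
    "vpc-0abc1234", ["subnet-01", "subnet-02"], "10.0.1.10", "awsadmin", "REDACTED"
)
# import boto3
# ds = boto3.client("ds", region_name="us-east-1")
# ds.connect_directory(**req)
```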

After the directory is created, you will see its status as Active on the Directory Services console.

Set up AWS Organizations

IAM Identity Center can be enabled in only one Region at a time. To enable IAM Identity Center in this Region, you must first delete any IAM Identity Center configuration created in another Region. Do not delete an existing IAM Identity Center configuration unless you are sure it will not negatively impact existing workloads.

  1. Navigate to the IAM Identity Center console.
  2. If IAM Identity Center has not been activated previously, choose Enable. If an organization does not exist, an alert appears to create one.
  3. Choose Create AWS organization.
  4. Choose Settings in the navigation pane.
  5. On the Identity source tab, on the Actions menu, choose Change identity source.
  6. For Choose identity source, select Active Directory.
  7. Choose Next.
  8. For Existing Directories, choose AWSEMR.COM.
  9. Choose Next.
  10. To confirm the change, enter ACCEPT in the confirmation input box, then choose Change identity source. Upon completion, you will be redirected to Settings, where you receive the alert Configurable AD sync paused.
  11. Choose Resume sync.
  12. Choose Settings in the navigation pane.
  13. On the Identity source tab, on the Actions menu, choose Manage sync.
  14. Choose Add users and groups to specify the users and groups to sync from Active Directory to IAM Identity Center.
  15. On the Users tab, enter tina and choose Add.
  16. Enter alex and choose Add.
  17. Choose Submit.
  18. On the Groups tab, enter datascience and choose Add.
  19. Choose Submit.

After your users and groups are synced to IAM Identity Center, you can see them by choosing Users or Groups in the navigation pane on the IAM Identity Center console. When they’re available, you can assign them access to AWS accounts and cloud applications. The initial sync may take up to 5 minutes.

Set up a SageMaker domain using IAM Identity Center

To set up a SageMaker domain, complete the following steps:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose Create domain.
  3. Choose Standard setup, then choose Configure.
  4. For Domain Name, enter a unique name for your domain.
  5. For Authentication, choose AWS IAM Identity Center.
  6. Choose Create a new role for the default execution role.
  7. In the Create an IAM Role popup, choose Any S3 bucket.
  8. Choose Create role.
  9. Copy the role details to be used in next section for adding a policy for EMR cluster access.
  10. In the Network and storage section, specify the following:
    1. Choose the VPC that you created using the first CloudFormation template.
    2. Choose a private subnet in an Availability Zone supported by SageMaker.
    3. Use the default security group (sg-XXXX).
    4. Choose VPC only.

Note that there is a public domain called AWSEMR.COM that will conflict with the one created for this solution if Public internet only is selected.

  11. Leave all other options as default and choose Next.
  12. In the Studio settings section, accept the defaults and choose Next.
  13. In the RStudio settings section, accept the defaults and choose Next.
  14. In the Canvas settings section, accept the defaults and choose Submit.

Add a policy to provide SageMaker Studio access to the EMR cluster

Complete the following steps to give SageMaker Studio access to the EMR cluster:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Search for and choose the role you copied earlier (<AmazonSageMaker-ExecutionRole-XXXXXXXXXXXXXXX>).
  3. On the Permissions tab, choose Add permissions and Attach policy.
  4. Search for and choose the policy AmazonEMRFullAccessPolicy_v2.
  5. Choose Add permissions.
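Attaching the policy can also be done with a single boto3 `iam.attach_role_policy` call. The sketch below builds the arguments; the role name is a placeholder for the execution role you copied when creating the domain. Because AmazonEMRFullAccessPolicy_v2 is an AWS managed policy, its ARN uses the shared `aws` account namespace rather than your account ID.

```python
# Sketch: attaching the EMR access policy to the SageMaker execution role.
def build_attach_request(role_name):
    """attach_role_policy arguments for the AWS managed EMR policy."""
    return {
        "RoleName": role_name,
        # Managed policies live under the shared 'aws' namespace
        "PolicyArn": "arn:aws:iam::aws:policy/AmazonEMRFullAccessPolicy_v2",
    }

req = build_attach_request("AmazonSageMaker-ExecutionRole-XXXXXXXXXXXXXXX")
# import boto3
# iam = boto3.client("iam")
# iam.attach_role_policy(**req)
```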

Add users and groups to access the domain

Complete the following steps to give users and groups access to the domain:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain you created earlier.
  3. On the Domain details page, choose Assign users and groups.
  4. On the Users tab, select the users tina and alex.
  5. On the Groups tab, select the group datascience.
  6. Choose Assign users and groups.

Configure Spark data access rights in Apache Ranger

Now that the AWS environment is set up, we configure Hive dataset security using Apache Ranger.

To start, collect the Apache Ranger URL details to access the Ranger admin console:

  1. On the Amazon EC2 console, under Resources, choose Instances (running).
  2. Choose the Ranger server EC2 instance and copy the private IP DNS name (IPv4 only).
    Next, connect to the Windows domain controller, which resides in the connected VPC, to access the Ranger admin console. This is done by logging in to the Windows server and launching a web browser there.
  3. Install the Remote Desktop Services client on your computer to connect with Windows Server.
  4. Authorize inbound traffic from your computer to the Windows AD domain controller EC2 instance.
  5. On the Amazon EC2 console, under Resources, choose Instances (running).
  6. Choose the Windows domain controller (DC1) EC2 instance and copy the public IP DNS name (IPv4 only).
  7. Use Microsoft Remote Desktop to log in to the Windows domain controller:
    1. Computer – Use the public IP DNS name (IPv4 only).
    2. Username – Enter awsadmin.
    3. Password – Use the password you set during the first CloudFormation template setup.
  8. Disable the Enhanced Security Configuration for Internet Explorer.
  9. Launch Internet Explorer and navigate to the Ranger admin console using the private IP DNS name (IPv4 only) associated with the Ranger server noted earlier and port 6182 (for example, https://<RangerServer Private IP DNS name>:6182).
  10. Choose Continue to this website (not recommended) if you receive a security alert.
  11. Log in using the default user name and password. During the first logon, you should modify your password and store it securely.
  12. In the top Ranger banner, choose Settings and Users/Groups/Roles.
  13. Confirm Tina and Alex are listed as users with a User Source of External.
  14. Confirm the datascience group is listed as a group with Group Source of External.

If the Tina or Alex users aren’t listed, follow the Apache Ranger troubleshooting instructions in the appendix at the end of this post.

Dataset policies

The Apache Ranger access policy model consists of two major components: the specification of the resources a policy applies to (such as files and directories, databases, tables, columns, and services) and the specification of access conditions (permissions) for specific users and groups.
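These two components show up directly in the JSON body that the Ranger admin REST API accepts (POST to `/service/public/v2/api/policy`), which is useful if you later want to manage these policies as code. The following sketch mirrors the Data Science Policy configured in the console steps below; treat the exact field names as an assumption to verify against your Ranger version.

```python
# Sketch: a Ranger policy as a REST payload. Field names follow the
# Ranger public v2 API; verify against your Ranger release before use.
data_science_policy = {
    "service": "amazonemrspark",
    "name": "Data Science Policy",
    # Component 1: the resources the policy applies to
    "resources": {
        "database": {"values": ["staging", "default"]},
        "table": {"values": ["products", "orders"]},
        "column": {"values": ["*"]},
    },
    # Component 2: access conditions (permissions) for users and groups
    "policyItems": [
        {
            "users": ["tina", "alex"],
            "accesses": [
                {"type": "select", "isAllowed": True},
                {"type": "read", "isAllowed": True},
            ],
        }
    ],
}
# The payload could then be POSTed to the Ranger admin endpoint, e.g.:
# requests.post("https://<RangerServer>:6182/service/public/v2/api/policy",
#               json=data_science_policy, auth=(user, password), verify=False)
```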

Configure your dataset policy with the following steps:

  1. On the Ranger admin console, choose the Ranger icon in the top banner to return to the main page.
  2. Choose the service name amazonemrspark inside AMAZON-EMR-SPARK.
  3. Choose Add New Policy and add a new policy with the following parameters:
    1. For Policy Name, enter Data Science Policy.
    2. For Database, enter staging and default.
    3. For EMR Spark Table, enter products and orders.
    4. For EMR Spark Column, enter *.
    5. In the Allow Conditions section, for Select User, enter tina and alex, and for Permissions, enter select and read.
  4. Choose Add.
    When using Internet Explorer and adding a new policy, you may receive the error SCRIPT438: Object doesn't support property or method 'assign'. In this case, install and use an alternate browser such as Firefox or Chrome.
  5. Choose Add New Policy and add a new policy for tina:
    1. For Policy Name, enter Customer Demographics Policy.
    2. For Database, enter staging.
    3. For EMR Spark Table, enter customers.
    4. For EMR Spark Column, choose customer_id, first_name, last_name, region, and state.
    5. In the Allow Conditions section, for Select User, enter tina, and for Permissions, enter select and read.
  6. Choose Add.

Configure Amazon S3 data access rights in Apache Ranger

Complete the following steps to configure Amazon S3 data access rights:

  1. On the Ranger admin console, choose the Ranger icon in the top banner to return to the main page.
  2. Choose the service name amazonemrs3 inside AMAZON-EMR-EMRFS.
  3. Choose Add New Policy and add a policy for the datascience group as follows:
    1. For Policy Name, enter Data Science S3 Policy.
    2. For S3 resource, enter the following:
      • aws-bigdata-blog/artifacts/aws-blog-emr-ranger/data/staging/products
      • aws-bigdata-blog/artifacts/aws-blog-emr-ranger/data/staging/orders
    3. In the Allow Conditions section, for Select User, enter tina and alex, and for Permissions, enter GetObject and ListObjects.
  4. Choose Add.
  5. Choose Add New Policy and add a new policy for tina:
    1. For Policy Name, enter Customer Demographics S3 Policy.
    2. For S3 resource, enter aws-bigdata-blog/artifacts/aws-blog-emr-ranger/data/staging/customers.
    3. In the Allow Conditions section, for Select User, enter tina, and for Permissions, enter GetObject and ListObjects.
  6. Choose Add.

Configure Amazon S3 user working folders

While working with data, users often require data storage for interim results. To provide each user with a private working directory, complete the following steps:

  1. On the Ranger admin console, choose the Ranger icon in the top banner to return to the main page.
  2. Choose the service name amazonemrs3 inside AMAZON-EMR-EMRFS.
  3. Choose Add New Policy and add a policy for {USER} as follows:
    1. For Policy Name, enter User Directory S3 Policy.
    2. For S3 resource, enter <Bucket Name>/data/{USER} (use a bucket within the account).
    3. Enable Recursive.
    4. In the Allow Conditions section, for Select User, enter {USER}, and for Permissions, enter GetObject, ListObjects, PutObject, and DeleteObject.
  4. Choose Add.
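The reason a single policy is enough here is Ranger's `{USER}` macro: at evaluation time, the authenticated user's name is substituted into the S3 resource path, so each user is scoped to their own prefix. A minimal sketch of that substitution (the bucket name is a placeholder):

```python
# Sketch: how the {USER} macro scopes the working-folder policy.
# One policy template yields a private S3 prefix per authenticated user.
def resolve_user_resource(resource_template, user):
    """Expand the {USER} placeholder the way the policy resource is scoped."""
    return resource_template.replace("{USER}", user)

template = "my-example-bucket/data/{USER}"
print(resolve_user_resource(template, "tina"))  # my-example-bucket/data/tina
print(resolve_user_resource(template, "alex"))  # my-example-bucket/data/alex
```

Because Recursive is enabled, the grant covers all objects under each user's resolved prefix, which is what allows Tina to write interim results under `data/tina/` later in this post.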

Use the user access login URL

Users attempting to access shared AWS applications via IAM Identity Center need to first log in to the AWS environment with a custom link using their Active Directory user name and password. The link needed can be found on the IAM Identity Center console.

  1. On the IAM Identity Center console, choose Settings in the navigation pane.
  2. On the Identity source tab, locate the user login link under AWS access portal URL.

Test role-based data access

To review, data scientist Tina needs to build a customer lifetime value model, which requires access to orders, product, and non-sensitive customer data. Data scientist Alex only needs access to orders and product data to build a product demand model.

In this section, we test the data access levels for each role.

Data scientist Tina

Complete the following steps:

  1. Log in using the URL you located in the previous step.
  2. Enter Microsoft AD user tina@awsemr.com and your password.
  3. Choose the Amazon SageMaker Studio tile.
  4. In the SageMaker Studio UI, start a notebook:
    1. Choose File, New, and Notebook.
    2. For Image, choose SparkMagic.
    3. For Kernel, choose PySpark.
    4. For Instance Type, choose ml.t3.medium.
    5. Choose Select.
  5. When the notebook kernel starts, connect to the EMR cluster by running the following code:
    %load_ext sagemaker_studio_analytics_extension.magics
    %sm_analytics emr connect --cluster-id <EMR Cluster ID> --auth-type Kerberos --language python

The EMR cluster ID details can be found on the Outputs tab of the EMR cluster CloudFormation stack created with the second template.

  6. Enter Microsoft AD user tina@awsemr.com and your password. (Note that the user name is case-sensitive.)
  7. Choose Connect.

Now we can test Tina’s data access.

  8. In a new cell, enter the following query and run the cell:
    %%sql
    show tables from staging

Returned data will indicate the table objects accessible to Tina.

  9. In a new cell, run the following:
    %%sql
    select * from staging.customers limit 5

Returned data will include only the columns Tina has been granted access to.

Let’s test Tina’s access to sensitive customer data.

  10. In a new cell, run the following:
    %%sql
    select customer_id, education_level, first_name, last_name, marital_status, region, state from staging.customers limit 15

The preceding query will result in an Access Denied error due to the inclusion of sensitive data columns.

During ad hoc analysis and model building, it’s common for users to create temporary datasets that need to be persisted for a short period. Let’s test Tina’s ability to create a working dataset and store the results in her private working directory.

  11. In a new cell, run the following:
    join_order_to_customer = spark.sql("select orders.*, first_name, last_name, region, state from staging.orders, staging.customers where orders.customer_id = customers.customer_id")

  12. Before running the following code, update the S3 path variable <bucket name> to correspond to an S3 location within your local account:
    join_order_to_customer.write.mode("overwrite").format("parquet").option("path", "s3://<bucket name>/data/tina/order_and_product/").save()

The preceding code writes the created dataset as Parquet files to the S3 bucket specified.

Data scientist Alex

Complete the following steps:

  1. Log in using the URL you located in the previous step.
  2. Enter Microsoft AD user alex@awsemr.com and your password.
  3. Choose the Amazon SageMaker Studio tile.
  4. In the SageMaker Studio UI, start a notebook:
    1. Choose File, New, and Notebook.
    2. For Image, choose SparkMagic.
    3. For Kernel, choose PySpark.
    4. For Instance Type, choose ml.t3.medium.
    5. Choose Select.
  5. When the notebook kernel starts, connect to the EMR cluster by running the following code:
    %load_ext sagemaker_studio_analytics_extension.magics
    %sm_analytics emr connect --cluster-id <EMR Cluster ID> --auth-type Kerberos --language python

  6. Enter Microsoft AD user alex@awsemr.com and your password (note that the user name is case-sensitive).
  7. Choose Connect. Now we can test Alex’s data access.
  8. In a new cell, enter the following query and run the cell:
    %%sql
    show tables from staging

    Returned data will indicate the table objects accessible to Alex. Note that the customers table is missing.

  9. In a new cell, run the following:
    %%sql
    select * from staging.orders limit 5

Returned data will include only the columns Alex has been granted access to.

Let’s test Alex’s access to customer data.

  10. In a new cell, run the following:
    %%sql
    select * from staging.customers limit 5

The preceding query will result in an Access Denied error because Alex doesn’t have access to customers.

We can verify Ranger is responsible for the denial by looking at the CloudWatch logs.
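One way to inspect those audit records is to query CloudWatch Logs with boto3. The sketch below only builds the `filter_log_events` arguments; the log group name and the filter pattern are assumptions, so check the log group your EMR/Ranger stack actually writes to and the exact wording of its denial records.

```python
# Sketch: querying CloudWatch Logs for Ranger audit denials.
# The log group name and filter terms are illustrative placeholders.
def build_audit_query(log_group, user):
    """filter_log_events arguments for denied requests by one user."""
    return {
        "logGroupName": log_group,
        # Ranger audit records include the user and the access result;
        # adjust the terms to match your audit record format.
        "filterPattern": f'"{user}" "DENIED"',
    }

query = build_audit_query("/aws/emr/ranger/audit", "alex")
# import boto3
# logs = boto3.client("logs")
# for event in logs.filter_log_events(**query)["events"]:
#     print(event["message"])
```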

Now that you can successfully access data, feel free to interactively explore, visualize, prepare, and model the data using the different user personas.

Clean up

When you’re finished experimenting with this solution, clean up your resources:

  1. Shut down and delete SageMaker Studio and the Studio apps. Ensure that all apps created as part of this post are deleted before deleting the stack.
  2. Change the identity source for IAM Identity Center back to Identity Center Directory.
  3. Delete the directory AWSEMR.COM from Directory Services.
  4. Empty the S3 buckets created by the CloudFormation stacks.
  5. On the AWS CloudFormation console, delete the non-nested stacks in reverse order of creation.

Conclusion

This post showed how you can implement fine-grained access control in SageMaker Studio and Amazon EMR using Apache Ranger and Microsoft Active Directory. We also demonstrated how multiple SageMaker Studio users can connect to the same EMR cluster and access different tables and columns using Apache Ranger, wherein each user is scoped with permissions matching their individual level of access to data. In addition, we demonstrated how the individual users can access separate S3 folders for storing their intermediate data. We detailed the steps required to set up the integration and provided CloudFormation templates to set up the base infrastructure from end to end.

To learn more about using Amazon EMR with SageMaker Studio, refer to Prepare Data using Amazon EMR. We encourage you to try out this new functionality, and connect with the Machine Learning & AI community if you have any questions or feedback!

Appendix: Apache Ranger troubleshooting

The sync between Active Directory and Apache Ranger is set for every 24 hours. To force a sync, complete the following steps:

  1. Connect to the Apache Ranger server using SSH. This can be done directly, with Session Manager (a capability of AWS Systems Manager), or through AWS Cloud9.
  2. Once connected, issue the following commands:
    sudo /usr/bin/ranger-usersync stop || true
    sudo /usr/bin/ranger-usersync start
    sudo chkconfig ranger-usersync on

  3. To confirm the sync, open the Ranger console as an admin.
  4. Choose Audit in the top banner.
  5. Choose the User Sync tab and confirm the event time.

About the Authors

Rahul Sarda is a Senior Analytics & ML Specialist at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys spending time with his family, staying healthy, running, and road cycling.

Varun Rao Bhamidimarri is a Senior Manager on the AWS Analytics Specialist Solutions Architect team. His focus is helping customers adopt cloud-enabled analytics solutions to meet their business requirements. Outside of work, he loves spending time with his wife and two kids, staying healthy, and meditating, and he recently picked up gardening during the lockdown.