All posts by Behram Irani

Enable fine-grained data access in Zeppelin Notebook with AWS Lake Formation

Post Syndicated from Behram Irani original

This post explores how you can use the AWS Lake Formation integration with Amazon EMR (currently in beta) to implement fine-grained, column-level access controls while using Spark in a Zeppelin notebook.

My previous post, Extracting data using AWS Glue and analyzing with Amazon Athena, showed you a simple use case: extracting Salesforce object data using AWS Glue and Apache Spark, saving it to Amazon Simple Storage Service (Amazon S3), cataloging it in the AWS Glue Data Catalog, and querying it with Amazon Athena.

Preparing your data

For simplicity of setup, and to build on the concept of fine-grained access control of data, you use the same data that you extracted from the Salesforce account object in the post Extracting data using AWS Glue and analyzing with Amazon Athena. Follow all the steps from that post to create a table called sfdc_output, which you can query in Athena to see all the fields of the account object.

In the following sections, you see how to restrict access to only a select set of columns in this table for a user who queries this data using Spark SQL in Zeppelin Notebook.

Setting up Lake Formation

Lake Formation aims to simplify and accelerate the creation of data lakes. Amazon EMR integrates with Lake Formation and its security model to allow fine-grained access control on databases, tables, and columns defined in the Data Catalog for data stored in Amazon S3. Users authenticate against third-party identity providers (IdPs) through SAML, and the principal is used to determine if the user has the appropriate access to the columns within a table and partitions in the Data Catalog.

Lake Formation provides its own permissions model that augments the AWS Identity and Access Management (IAM) permissions model. This centrally defined permissions model enables fine-grained access to data stored in data lakes through a simple grant and revoke mechanism.

The following diagram illustrates the workflow.

In the preceding flow, you still authenticate the principal at the IdP, and use the IAM policy to authorize access to AWS resources. Additionally, you use Lake Formation to authorize data access. When a principal attempts to run a query in Amazon EMR against a table set up with Lake Formation, Amazon EMR requests temporary credentials for data access from Lake Formation. Lake Formation returns temporary credentials and allows data access.

For more information about setting up Lake Formation, see Setting Up AWS Lake Formation. For this use case, you want to enable the integration of Amazon EMR with Lake Formation so you can use Zeppelin Notebook to see the fine-grained data access controls in action. For this post, I’ve configured the authentication module using the third-party SAML provider Auth0. You can also use Okta or Active Directory Federation Services (ADFS) to set up authentication with the IdP of your choice. For more information about setting up IdP and to launch Amazon EMR with an AWS CloudFormation stack, see Integration with Amazon EMR.

Granting fine-grained access with Lake Formation

For the purpose of demonstrating fine-grained data access, I created a user called developer in Auth0. Suppose that for your table sfdc_output, you don’t want to give this user access to certain billing-related fields. Complete the following steps:

  1. On the Lake Formation console, choose Data permissions.
  2. For Active Directory users and groups, enter the ARN for the user developer.
  3. For Database, choose default.
  4. For Table, choose sfdc_output.
  5. For Column, choose Include columns.
  6. For Include columns, choose sic, name, accountnumber, and type.

These are the specific columns from this table that you want the developer user to have access to.

  7. For Table permissions, select Select.
  8. Choose Grant.
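The console grant above can also be scripted with the AWS CLI. The following sketch assumes a placeholder account ID and a SAML provider named Auth0; adjust the principal ARN to match your own IdP setup:

```shell
# Grant SELECT on only four columns of sfdc_output to the federated developer user
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:saml-provider/Auth0:user/developer \
  --permissions SELECT \
  --resource '{"TableWithColumns": {"DatabaseName": "default", "Name": "sfdc_output", "ColumnNames": ["sic", "name", "accountnumber", "type"]}}'
```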

After you grant specific permissions to the developer user, you have to remove passthrough access by revoking the Super permission from the IAMAllowedPrincipals group. For backward compatibility, Lake Formation passes through IAM permissions for all existing Data Catalog tables; revoking Super access lets Lake Formation enforce its own grants instead.

  1. On the Lake Formation console, choose Permissions.
  2. Select IAMAllowedPrincipals for sfdc_output.
  3. Choose Revoke.
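This revoke can also be scripted; in the CLI, IAM_ALLOWED_PRINCIPALS is the identifier for that group, and the ALL permission corresponds to Super in the console:

```shell
# Revoke the passthrough Super (ALL) permission from the IAMAllowedPrincipals group
aws lakeformation revoke-permissions \
  --principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
  --permissions ALL \
  --resource '{"Table": {"DatabaseName": "default", "Name": "sfdc_output"}}'
```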

You also need to register the Amazon S3 location in Lake Formation where the table data resides.

  1. On the Lake Formation console, choose Data lake locations.
  2. Select the Amazon S3 path for your table.
  3. Choose Register location.
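A CLI sketch of the registration, with a placeholder bucket name; the --use-service-linked-role flag lets Lake Formation access the location through its service-linked role:

```shell
# Register the table's S3 location with Lake Formation
aws lakeformation register-resource \
  --resource-arn arn:aws:s3:::replace-with-your-s3-bucket/sfdc-output \
  --use-service-linked-role
```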

Running a query

You can now test the restrictions by running a query.

Log in to the Zeppelin console using its URL. To access Zeppelin Notebook, you must first ensure that your cluster's master security group allows access to the proxy agent (port 8442) from your desktop only. Do not open your EMR master node to the public ( or ::/0). The URL redirects you to the IdP for login and authentication with the developer user credentials. After authentication is complete, create a new notebook and run a Spark SQL query against the sfdc_output table.
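In a Zeppelin paragraph, a query like the following (using the sql interpreter) exercises the column-level grants; the LIMIT is only there to keep the output small:

```sql
%sql
-- Request every column; Lake Formation returns only the granted ones
SELECT * FROM sfdc_output LIMIT 10
```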

The following screenshot shows that even though the developer user queried for the full table, only the columns that you granted them access to in Lake Formation are visible.

As an additional exercise, you can create another user in IdP and give a different set of column access in Lake Formation to that user. Query the table again by logging in as that user and observe the corresponding security mechanism being applied.

Querying with Jupyter Notebook

If you want to use Jupyter Notebook, you can spin up an Amazon EMR notebook, attach it to the running cluster, and run the same query in it. The following screenshot shows that the results are the same.


This post showed how Lake Formation provides fine-grained, column-level access to tables in the Data Catalog when you query them with Spark. It also enables federated single sign-on to Apache Zeppelin or Amazon EMR notebooks from any enterprise identity system that is compatible with SAML 2.0.

You can try this solution for your own use cases, and if you have comments or feedback, please leave them below.


About the Authors

Behram Irani is a Senior Data Architect at Amazon Web Services.





Rahul Sonawane is a Senior Consultant, Big Data at Amazon Web Services.


Extracting data using AWS Glue and analyzing with Amazon Athena

Post Syndicated from Behram Irani original

Salesforce is a popular and widely used customer relationship management (CRM) platform. It lets you store and manage prospect and customer information—like contact info, accounts, leads, and sales opportunities—in one central location. You can derive a lot of useful information by combining the prospect information stored in Salesforce with other structured and unstructured data in your data lake.

In this post, I show you how to use AWS Glue to extract data from a Salesforce account object and save it to Amazon S3. You then use Amazon Athena to generate a report by joining the account object data with the orders data from a separate order management system.

Preparing your data

I signed up for a free Salesforce developer account, which comes with a handful of sample records that populate many of the standard objects. You can use your organization's developer account and pull data from multiple objects at the same time by modifying the SOQL query in your AWS Glue code. To keep the query simple, this post extracts data from the Account object only.

To demonstrate joining data with data from another system using Amazon Athena, you create a sample data file showing orders coming from an order management system.

Setting up an AWS Glue job

Use the open-source springml library to connect Apache Spark with Salesforce. The library comes with plenty of handy features that allow you to read, write, and update Salesforce objects using the Apache Spark framework.

You can compile the JAR files from the springml GitHub repo, or download them with dependencies from the Maven repository. Upload these JAR files to your S3 bucket and make a note of the full path for each.


In the AWS Management Console, open AWS Glue in the Region where you want to run the service. Choose Jobs, then Add job, and follow the wizard, filling in the necessary details.

Under the Security configuration, script libraries, and job parameters (optional) section, for Dependent jars path, list the paths for the JAR files that you uploaded, separated by commas.

For this job, I allocated Maximum capacity as “2.” This field defines the number of AWS Glue data processing units (DPUs) that the system can allocate when this job runs. A DPU is a relative measure of processing power that consists of four vCPUs of compute capacity and 16 GB of memory. When you specify an Apache Spark ETL job, you can allocate 2–100 DPUs. The default is 10 DPUs.

Running the AWS Glue job to extract data from the Account object

The following Scala code extracts a few fields from the Account object in Salesforce and writes them as a table to Amazon S3 in Apache Parquet file format.

import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters.mapAsJavaMapConverter

object SfdcExtractData {
  def main(sysArgs: Array[String]) {
    val sparkContext: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sparkContext)
    val sparkSession = glueContext.getSparkSession
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Query a few fields from the Salesforce Account object
    val soql = "select name, accountnumber, industry, type, billingaddress, sic from account"
    val df = sparkSession.read.format("com.springml.spark.salesforce").option("soql", soql).option("username", "username").option("password", "password+securitytoken").load()

    // Convert the DataFrame to a Glue DynamicFrame and write it to S3 as Parquet, partitioned by Industry
    val datasource0 = DynamicFrame(df, glueContext).withName("datasource0").withTransformationContext("datasource0")
    val datasink1 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions(Map("path" -> "s3://replace-with-your-s3-bucket/sfdc-output", "partitionKeys" -> Seq("Industry"))), format = "parquet", transformationContext = "datasink1").writeDynamicFrame(datasource0)

    Job.commit()
  }
}

This code relies on a few key components:

val df = sparkSession.read.format("com.springml.spark.salesforce").option("soql", soql).option("username", "username").option("password", "password+securitytoken").load()

This code example establishes a connection to Salesforce, submits a SOQL-compatible query for the Account object, and loads the returned records into a Spark DataFrame. Don't forget to replace username with your Salesforce username, and password with a combination of your password and your profile's security token.

Best practices suggest storing and retrieving the password using AWS Secrets Manager instead of hardcoding it. For simplicity, I left it hardcoded in this example.

Keep in mind that this query is simple and returns only a handful of records. For large volumes of data, you might want to limit the results returned by your query, or use other techniques like bulk query and chunking. Check the springml page to learn more about the functionality that the library supports.
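As a sketch of the bulk-query approach, the springml connector exposes options for the Salesforce Bulk API; verify the option names against the version of the library you use, and treat the chunk size as an illustrative value:

```scala
// Hypothetical tuning for large extracts via the Salesforce Bulk API
// (check these option names against the springml spark-salesforce docs)
val dfBulk = sparkSession.read.format("com.springml.spark.salesforce")
  .option("username", "username")
  .option("password", "password+securitytoken")
  .option("soql", soql)
  .option("bulk", "true")          // use the Bulk API instead of the REST API
  .option("pkChunking", "true")    // split large tables by primary key ranges
  .option("chunkSize", "250000")   // records per chunk (illustrative)
  .load()
```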

val datasink1 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions(Map("path" -> "s3://replace-with-your-s3-bucket/sfdc-output", "partitionKeys" -> Seq("Industry"))), format = "parquet", transformationContext = "datasink1").writeDynamicFrame(datasource0)  

This code does all the writing to your S3 bucket. In this example, you want to aggregate data by Industry segments. Because of that, you should partition the data by the Industry field.

Also, the code writes in Parquet format. Athena charges you by the amount of data scanned per query. You can save on costs and get better performance when you partition the data, compress data, or convert it to columnar formats like Parquet.

After you run this code in AWS Glue, you can go to your S3 bucket where the sink points and find something like the following structure:

Query the data with Athena

After the code drops your data into your S3 bucket with the correct partition and format, AWS Glue can crawl the dataset. It creates the appropriate schema in the AWS Glue Data Catalog. Wait for AWS Glue to create the table. Then, Athena can query the table and join with other tables in the catalog.

First, use the AWS Glue crawler to discover the Account data that you previously stored in the S3 bucket. For details about how to use the crawler, see Populating the AWS Glue Data Catalog.

In this example, point the crawler to the S3 output prefix where you stored your Account data, and run it. The crawler creates a new catalog table before it finally stops.

The AWS Glue Data Catalog table automatically captures all the column names, types, and partition column used, and stores everything in your S3 bucket in Parquet file format. You can now query this table with Athena. A simple SELECT query on that table shows the results of scanning the data from the S3 bucket.

Now your data is ready for Athena to query. For this example, join this data with the sample orders file from the order management system that you stored in S3 earlier. After your AWS Glue crawler finishes cataloging the sample orders data, Athena can query it.

Finally, use Athena to join both tables in an aggregation query.
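The aggregation might look like the following; the orders table and its column names are assumptions about the sample order management data, so substitute your own:

```sql
-- Orders per industry, joining Salesforce account data with order management data
SELECT a.industry,
       COUNT(o.order_id)  AS order_count,
       SUM(o.order_total) AS total_sales
FROM sfdc_output a
JOIN sample_orders o
  ON a.accountnumber = o.accountnumber
GROUP BY a.industry
ORDER BY total_sales DESC;
```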


In this post, I showed a simple example of extracting Salesforce object data using AWS Glue and Apache Spark and saving it to Amazon S3. You can then catalog your S3 data in the AWS Glue Data Catalog, allowing Athena to query it. With this mechanism in place, you can easily incorporate Salesforce data into your AWS-based data lake.

If you have comments or feedback, please leave them below.


About the Author


Behram Irani is a Data Architect at Amazon Web Services.