Post Syndicated from Raymond Lai original https://aws.amazon.com/blogs/big-data/enforce-fine-grained-access-control-on-open-table-formats-via-amazon-emr-integrated-with-aws-lake-formation/
With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. This allows you to simplify security and governance over transactional data lakes by applying table-, column-, and row-level permissions to your Apache Spark jobs. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making. You can build a lake house architecture using Amazon EMR integrated with Lake Formation for FGAC. This combination of services allows you to conduct data analysis on your transactional data lake while ensuring secure and controlled access.
The Amazon EMR record server component supports table-, column-, row-, cell-, and nested attribute-level data filtering. It extends support to Hive, Apache Hudi, Apache Iceberg, and Delta Lake formats for both read operations (including time travel and incremental queries) and write operations (DML statements such as INSERT). Additionally, with version 6.15, Amazon EMR introduces access control protection for its application web interfaces, such as the on-cluster Spark History Server, YARN Timeline Server, and YARN Resource Manager UI.
In this post, we demonstrate how to implement FGAC on Apache Hudi tables using Amazon EMR integrated with Lake Formation.
Transaction data lake use case
Amazon EMR customers often use Open Table Formats to support their ACID transaction and time travel needs in a data lake. By preserving historical versions, data lake time travel provides benefits such as auditing and compliance, data recovery and rollback, reproducible analysis, and data exploration at different points in time.
Another popular transaction data lake use case is incremental query. Incremental query refers to a query strategy that focuses on processing and analyzing only the new or updated data within a data lake since the last query. The key idea behind incremental queries is to use metadata or change tracking mechanisms to identify the new or modified data since the last query. By identifying these changes, the query engine can optimize the query to process only the relevant data, significantly reducing the processing time and resource requirements.
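To make the idea concrete, here is a minimal, engine-agnostic sketch of an incremental query in plain Python. It is not how Hudi or Amazon EMR implement the feature; the `Record` class, the sample rows, and the timestamp format are made up for illustration. It only shows the core mechanism: each record carries a commit time, and a query keeps the records committed after the previous checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    commit_time: str  # sortable timestamp string (illustrative format)


def incremental(records, last_checkpoint):
    """Return only the records committed after the previous checkpoint."""
    return [r for r in records if r.commit_time > last_checkpoint]


def next_checkpoint(records, last_checkpoint):
    """The newest commit time seen so far becomes the next checkpoint."""
    return max([r.commit_time for r in records] + [last_checkpoint])


# Made-up change log: c1 is inserted, then c2 is inserted, then c1 is updated.
log = [
    Record("c1", "v1", "20240101000000"),
    Record("c2", "v1", "20240102000000"),
    Record("c1", "v2", "20240103000000"),
]

changed = incremental(log, "20240101000000")
print([(r.key, r.value) for r in changed])  # [('c2', 'v1'), ('c1', 'v2')]
print(next_checkpoint(log, "20240101000000"))  # 20240103000000
```

The second query would start from the returned checkpoint, so it scans nothing unless new commits have arrived.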
Solution overview
In this post, we demonstrate how to implement FGAC on Apache Hudi tables using Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) integrated with Lake Formation. Apache Hudi is an open source transactional data lake framework that greatly simplifies incremental data processing and the development of data pipelines. The new FGAC feature supports all OTFs; this post demonstrates it with Hudi, and we will cover the other OTFs in follow-up posts. We use notebooks in Amazon SageMaker Studio to read and write Hudi data with different user access permissions through an EMR cluster. This reflects real-world data access scenarios: for example, an engineering user may need full data access to troubleshoot a data platform, whereas data analysts may only need access to a subset of that data that doesn’t contain personally identifiable information (PII). Integrating with Lake Formation via the Amazon EMR runtime role further improves your data security posture and simplifies data control management for Amazon EMR workloads. This solution ensures a secure and controlled environment for data access, meeting the diverse needs and security requirements of different users and roles in an organization.
The following diagram illustrates the solution architecture.

We conduct a data ingestion process to upsert (update and insert) a Hudi dataset to an Amazon Simple Storage Service (Amazon S3) bucket, and persist or update the table schema in the AWS Glue Data Catalog. With zero data movement, we can query the Hudi table governed by Lake Formation via various AWS services, such as Amazon Athena, Amazon EMR, and Amazon SageMaker.
When users submit a Spark job through any EMR cluster endpoint (EMR Steps, Livy, EMR Studio, or SageMaker), Lake Formation validates their privileges and instructs the EMR cluster to filter out sensitive data such as PII.
This solution has three different types of users with different levels of permissions to access the Hudi data:
- hudi-db-creator-role – This is used by the data lake administrator, who has privileges to carry out DDL operations such as creating, modifying, and deleting database objects. They can define data filtering rules in Lake Formation for row-level and column-level data access control. These FGAC rules ensure that the data lake is secured and complies with the applicable data privacy regulations.
- hudi-table-pii-role – This is used by engineering users. Engineering users can run time travel and incremental queries on both Copy-on-Write (COW) and Merge-on-Read (MOR) tables. They also have the privilege to access PII data as of any timestamp.
- hudi-table-non-pii-role – This is used by data analysts. Data analysts’ access rights are governed by FGAC rules controlled by data lake administrators. They do not have visibility into columns containing PII data such as names and addresses. Additionally, they can’t access rows of data that don’t fulfill certain conditions. For example, the users can only access data rows that belong to their country.
Prerequisites
You can download the three notebooks used in this post from the GitHub repo.
Before you deploy the solution, make sure you have the following:
- An AWS account
- An AWS Identity and Access Management (IAM) user with administrator permission
Complete the following steps to set up your permissions:
- Log in to your AWS account with your admin IAM user.
Make sure you are in the us-east-1 Region.
- Create an S3 bucket in the us-east-1 Region (for example, emr-fgac-hudi-us-east-1-<ACCOUNT ID>).
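If you prefer to script the bucket creation instead of using the console, the following is a minimal sketch using boto3 (assumed to be installed and configured with administrator credentials). The account ID is a placeholder. Note that `us-east-1` is the one Region where `create_bucket` must be called without a `CreateBucketConfiguration`.

```python
def bucket_name(account_id):
    """Bucket name following the example naming used in this post."""
    return f"emr-fgac-hudi-us-east-1-{account_id}"


def create_bucket_kwargs(name, region="us-east-1"):
    """Build the arguments for S3 create_bucket.

    us-east-1 is the default location and must not be passed as a
    LocationConstraint; any other Region requires one.
    """
    kwargs = {"Bucket": name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(account_id, region="us-east-1"):
    """Create the bucket (requires boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket_name(account_id), region))


# "123456789012" is a placeholder account ID.
print(create_bucket_kwargs(bucket_name("123456789012")))
```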
Next, we enable Lake Formation by changing the default permission model.
- Sign in to the Lake Formation console as the administrator user.
- Choose Data Catalog settings under Administration in the navigation pane.
- Under Default permissions for newly created databases and tables, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
- Choose Save.

Alternatively, if you started Lake Formation with the default option, you need to revoke IAMAllowedPrincipals on the resources (databases and tables) that were already created.
Next, we create a key pair for Amazon EMR.
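The console change above can also be sketched with the Lake Formation API. This is an illustration under the assumption that boto3 is available and the caller is a data lake administrator: clearing `CreateDatabaseDefaultPermissions` and `CreateTableDefaultPermissions` is the API-side equivalent of deselecting the two "Use only IAM access control" options. The sample settings dictionary is made up to show the shape of the data.

```python
import copy


def disable_iam_only_defaults(settings):
    """Return a copy of DataLakeSettings with the IAM-only defaults cleared."""
    updated = copy.deepcopy(settings)
    updated["CreateDatabaseDefaultPermissions"] = []
    updated["CreateTableDefaultPermissions"] = []
    return updated


def apply(region="us-east-1"):
    """Read, modify, and write back the settings (requires boto3 and credentials)."""
    import boto3

    lf = boto3.client("lakeformation", region_name=region)
    current = lf.get_data_lake_settings()["DataLakeSettings"]
    lf.put_data_lake_settings(DataLakeSettings=disable_iam_only_defaults(current))


# Illustrative settings, shaped like the default IAM-only configuration.
iam_only = [{
    "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
    "Permissions": ["ALL"],
}]
sample = {
    "DataLakeAdmins": [{"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/admin"}],
    "CreateDatabaseDefaultPermissions": iam_only,
    "CreateTableDefaultPermissions": iam_only,
}
print(disable_iam_only_defaults(sample))
```

Reading the current settings first matters because `put_data_lake_settings` replaces the whole settings object, including the list of data lake administrators.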
- On the Amazon EC2 console, choose Key pairs in the navigation pane.
- Choose Create key pair.
- For Name, enter a name (for example, emr-fgac-hudi-keypair).
- Choose Create key pair.

The generated key pair (for this post, emr-fgac-hudi-keypair.pem) is saved to your local computer.
Next, we create an AWS Cloud9 interactive development environment (IDE).
- On the AWS Cloud9 console, choose Environments in the navigation pane.
- Choose Create environment.
- For Name, enter a name (for example, emr-fgac-hudi-env).
- Keep the other settings as default.

- Choose Create.
- When the IDE is ready, choose Open to open it.

- In the AWS Cloud9 IDE, on the File menu, choose Upload Local Files.

- Upload the key pair file (emr-fgac-hudi-keypair.pem).
- Choose the plus sign and choose New Terminal.

- In the terminal, input the following command lines:
Note that the example code is a proof of concept for demonstration purposes only. For production systems, use a trusted certification authority (CA) to issue certificates. Refer to Providing certificates for encrypting data in transit with Amazon EMR encryption for details.
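The exact commands from the original post are not reproduced here. As a rough sketch of what this step typically involves, the Python below builds an `openssl` command line for a self-signed certificate and packages the PEM files into the .zip archive that Amazon EMR in-transit encryption expects. The file names `privateKey.pem`, `certificateChain.pem`, and `trustedCertificates.pem` follow the Amazon EMR documentation; the key size, the validity period, the common name, and the archive name `my-certs.zip` are assumptions for illustration.

```python
import zipfile
from pathlib import Path

# File names that Amazon EMR looks for at the root of the certificate archive.
CERT_FILES = ("certificateChain.pem", "privateKey.pem", "trustedCertificates.pem")


def openssl_self_signed_argv(common_name, days=365):
    """Command line for a self-signed certificate (proof of concept only).

    common_name is an assumption: it should be a wildcard that matches the
    private DNS names of your cluster nodes.
    """
    return [
        "openssl", "req", "-x509", "-newkey", "rsa:2048",
        "-keyout", "privateKey.pem", "-out", "certificateChain.pem",
        "-days", str(days), "-nodes", "-subj", f"/CN={common_name}",
    ]


def package_certs(cert_dir, zip_path):
    """Zip the three PEM files so they sit at the root of the archive."""
    cert_dir = Path(cert_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in CERT_FILES:
            zf.write(cert_dir / name, arcname=name)
    return Path(zip_path)


# Hypothetical common name, shown only to illustrate the shape of the command.
print(" ".join(openssl_self_signed_argv("*.example.internal")))
```

For a self-signed setup, `trustedCertificates.pem` can be a copy of `certificateChain.pem`. After packaging, upload the archive to the S3 bucket you created earlier; its S3 URI is what the S3CertsZip stack parameter refers to.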
Deploy the solution via AWS CloudFormation
We provide an AWS CloudFormation template that automatically sets up the following services and components:
- An S3 bucket for the data lake. It contains the sample TPC-DS dataset.
- An EMR cluster with security configuration and public DNS enabled.
- EMR runtime IAM roles with Lake Formation fine-grained permissions:
  - <STACK-NAME>-hudi-db-creator-role – This role is used to create the Apache Hudi database and tables.
  - <STACK-NAME>-hudi-table-pii-role – This role provides permission to query all columns of Hudi tables, including columns with PII.
  - <STACK-NAME>-hudi-table-non-pii-role – This role provides permission to query Hudi tables from which Lake Formation has filtered out the PII columns.
- SageMaker Studio execution roles that allow the users to assume their corresponding EMR runtime roles.
- Networking resources such as VPC, subnets, and security groups.
Complete the following steps to deploy the resources:
- Choose Quick create stack to launch the CloudFormation stack.
- For Stack name, enter a stack name (for example, rsv2-emr-hudi-blog).
- For Ec2KeyPair, enter the name of your key pair.
- For IdleTimeout, enter an idle timeout for the EMR cluster to avoid paying for the cluster when it’s not being used.
- For InitS3Bucket, enter the S3 bucket name you created to save the Amazon EMR encryption certificate .zip file.
- For S3CertsZip, enter the S3 URI of the Amazon EMR encryption certificate .zip file.

- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Create stack.
The CloudFormation stack deployment takes around 10 minutes.
Set up Lake Formation for Amazon EMR integration
Complete the following steps to set up Lake Formation:
- On the Lake Formation console, choose Application integration settings under Administration in the navigation pane.
- Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
- Choose Amazon EMR for Session tag values.
- Enter your AWS account ID for AWS account IDs.
- Choose Save.

- Choose Databases under Data Catalog in the navigation pane.
- Choose Create database.
- For Name, enter default.
- Choose Create database.

- Choose Data lake permissions under Permissions in the navigation pane.
- Choose Grant.
- Select IAM users and roles.
- Choose your IAM roles.
- For Databases, choose default.
- For Database permissions, select Describe.
- Choose Grant.
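The console steps above map onto a handful of Lake Formation and AWS Glue API calls. The following is a hedged sketch of that mapping with boto3 (assumed installed, and run by a data lake administrator). The account ID and role ARN are placeholders; the settings field names come from the Lake Formation `DataLakeSettings` structure.

```python
def enable_external_filtering(settings, account_id):
    """Allow Amazon EMR to filter data in locations registered with Lake Formation."""
    updated = dict(settings)
    updated["AllowExternalDataFiltering"] = True
    updated["ExternalDataFilteringAllowList"] = [
        {"DataLakePrincipalIdentifier": account_id}
    ]
    updated["AuthorizedSessionTagValueList"] = ["Amazon EMR"]
    return updated


def describe_database_grant(role_arn, database="default"):
    """Arguments for grant_permissions: Describe on one database."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["DESCRIBE"],
    }


def apply(account_id, role_arns, region="us-east-1"):
    """Run all three steps (requires boto3 and valid AWS credentials)."""
    import boto3

    lf = boto3.client("lakeformation", region_name=region)
    glue = boto3.client("glue", region_name=region)

    current = lf.get_data_lake_settings()["DataLakeSettings"]
    lf.put_data_lake_settings(
        DataLakeSettings=enable_external_filtering(current, account_id)
    )
    glue.create_database(DatabaseInput={"Name": "default"})
    for arn in role_arns:
        lf.grant_permissions(**describe_database_grant(arn))


# Placeholder role ARN, for illustration only.
print(describe_database_grant("arn:aws:iam::123456789012:role/example-role"))
```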

Copy Hudi JAR file to Amazon EMR HDFS
To use Hudi with Jupyter notebooks, you need to complete the following steps for the EMR cluster, which includes copying a Hudi JAR file from the Amazon EMR local directory to its HDFS storage, so that you can configure a Spark session to use Hudi:
- Authorize inbound SSH traffic (port 22).
- Copy the value for Primary node public DNS (for example, ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com) from the EMR cluster Summary section.

- Go back to the previous AWS Cloud9 terminal, in the environment where you uploaded the EC2 key pair.
- Run the following command to SSH into the EMR primary node. Replace the placeholder with your EMR DNS hostname:
- Run the following command to copy the Hudi JAR file to HDFS:
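The commands for these two steps are not shown above. As a sketch only, the helpers below assemble what they usually look like on Amazon EMR: an SSH login as the `hadoop` user, followed by `hdfs dfs` commands run on the primary node. The local JAR path `/usr/lib/hudi/hudi-spark-bundle.jar` and the HDFS directory `/apps/hudi/lib` follow the Amazon EMR Hudi documentation and are assumptions with respect to this post.

```python
import shlex


def ssh_argv(key_file, primary_dns, user="hadoop"):
    """SSH command for logging in to the EMR primary node."""
    return ["ssh", "-i", key_file, f"{user}@{primary_dns}"]


def copy_hudi_jar_commands(
    local_jar="/usr/lib/hudi/hudi-spark-bundle.jar",  # assumed EMR install path
    hdfs_dir="/apps/hudi/lib",                        # assumed HDFS target directory
):
    """Commands to run on the primary node to place the Hudi bundle in HDFS."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-copyFromLocal", local_jar, f"{hdfs_dir}/hudi-spark-bundle.jar"],
    ]


# The DNS name is the placeholder from this post; replace it with your own.
print(shlex.join(ssh_argv("emr-fgac-hudi-keypair.pem",
                          "ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com")))
for argv in copy_hudi_jar_commands():
    print(shlex.join(argv))
```

The resulting HDFS path is what a Spark session would later reference through `spark.jars`.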
Create the Hudi database and tables in Lake Formation
Now we’re ready to create the Hudi database and tables with FGAC enabled by the EMR runtime role. The EMR runtime role is an IAM role that you can specify when you submit a job or query to an EMR cluster.
Grant database creator permission
First, let’s grant the Lake Formation database creator permission to <STACK-NAME>-hudi-db-creator-role:
- Log in to your AWS account as an administrator.
- On the Lake Formation console, choose Administrative roles and tasks under Administration in the navigation pane.
- Confirm that your AWS login user has been added as a data lake administrator.
- In the Database creator section, choose Grant.
- For IAM users and roles, choose <STACK-NAME>-hudi-db-creator-role.
- For Catalog permissions, select Create database.
- Choose Grant.
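For reference, the database creator grant corresponds to a catalog-level `CREATE_DATABASE` permission in the Lake Formation API. A minimal sketch with boto3 follows; the account ID is a placeholder, and the stack name is the example used in this post.

```python
def role_arn(account_id, stack_name, suffix):
    """ARN of a runtime role created by the stack, e.g. <STACK-NAME>-hudi-db-creator-role."""
    return f"arn:aws:iam::{account_id}:role/{stack_name}-{suffix}"


def create_database_grant(principal_arn):
    """Arguments for grant_permissions: CREATE_DATABASE on the Data Catalog."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Catalog": {}},
        "Permissions": ["CREATE_DATABASE"],
    }


def apply(principal_arn, region="us-east-1"):
    """Issue the grant (requires boto3 and data lake administrator credentials)."""
    import boto3

    lf = boto3.client("lakeformation", region_name=region)
    lf.grant_permissions(**create_database_grant(principal_arn))


# Placeholder account ID; stack name taken from the example in this post.
creator = role_arn("123456789012", "rsv2-emr-hudi-blog", "hudi-db-creator-role")
print(create_database_grant(creator))
```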
Register the data lake location
Next, let’s register the S3 data lake location in Lake Formation:
- On the Lake Formation console, choose Data lake locations under Administration in the navigation pane.
- Choose Register location.
- For Amazon S3 path, choose Browse and choose the data lake S3 bucket (<STACK_NAME>-s3bucket-XXXXXXX) created by the CloudFormation stack.
- For IAM role, choose <STACK-NAME>-hudi-db-creator-role.
- For Permission mode, select Lake Formation.
- Choose Register location.

Grant data location permission
Next, we need to grant <STACK-NAME>-hudi-db-creator-role the data location permission:
- On the Lake Formation console, choose Data locations under Permissions in the navigation pane.
- Choose Grant.
- For IAM users and roles, choose <STACK-NAME>-hudi-db-creator-role.
- For Storage locations, enter the S3 bucket (<STACK_NAME>-s3bucket-XXXXXXX).
- Choose Grant.
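Registering the location and granting data location access can be sketched as two Lake Formation API calls. As before, this assumes boto3 and administrator credentials; the bucket name and role ARN below are placeholders standing in for the resources created by your stack.

```python
def bucket_arn(bucket):
    return f"arn:aws:s3:::{bucket}"


def register_location_kwargs(bucket, role_arn):
    """Arguments for register_resource, using a custom role rather than the service-linked role."""
    return {
        "ResourceArn": bucket_arn(bucket),
        "UseServiceLinkedRole": False,
        "RoleArn": role_arn,
    }


def data_location_grant(bucket, role_arn):
    """Arguments for grant_permissions: DATA_LOCATION_ACCESS on the bucket."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"DataLocation": {"ResourceArn": bucket_arn(bucket)}},
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }


def apply(bucket, role_arn, region="us-east-1"):
    """Register the location, then grant access to it (requires boto3)."""
    import boto3

    lf = boto3.client("lakeformation", region_name=region)
    lf.register_resource(**register_location_kwargs(bucket, role_arn))
    lf.grant_permissions(**data_location_grant(bucket, role_arn))


# Placeholder values for illustration.
example_bucket = "example-stack-s3bucket-abc123"
example_role = "arn:aws:iam::123456789012:role/example-stack-hudi-db-creator-role"
print(register_location_kwargs(example_bucket, example_role))
print(data_location_grant(example_bucket, example_role))
```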

Connect to the EMR cluster
Now, let’s use a Jupyter notebook in SageMaker Studio to connect to the EMR cluster with the database creator EMR runtime role:
- On the SageMaker console, choose Domains in the navigation pane.
- Choose the domain <STACK-NAME>-Studio-EMR-LF-Hudi.
- On the Launch menu next to the user profile <STACK-NAME>-hudi-db-creator, choose Studio.

- Download the notebook rsv2-hudi-db-creator-notebook.
- Choose the upload icon.

- Choose the downloaded Jupyter notebook and choose Open.
- Open the uploaded notebook.
- For Image, choose SparkMagic.
- For Kernel, choose PySpark.
- Leave the other configurations as default and choose Select.

- Choose Cluster to connect to the EMR cluster.

- Choose the EMR on EC2 cluster (<STACK-NAME>-EMR-Cluster) created with the CloudFormation stack.
- Choose Connect.
- For EMR execution role, choose <STACK-NAME>-hudi-db-creator-role.
- Choose Connect.
Create database and tables
Now you can follow the steps in the notebook to create the Hudi database and tables. The major steps are as follows:
- When you start the notebook, configure "spark.sql.catalog.spark_catalog.lf.managed":"true" to inform Spark that spark_catalog is protected by Lake Formation.
- Create the Hudi tables using Spark SQL.
- Insert data from the source table to the Hudi tables.
- Insert data again into the Hudi tables.
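The notebook contains the authoritative code for these steps. As an approximation of what they involve, the sketch below builds a SparkMagic `%%configure` payload and the Spark SQL statements as strings. The `lf.managed` setting comes from this post; the other Spark settings are the standard Hudi Spark SQL configuration, and the JAR path, S3 location, source table, column types, and the `ts` precombine column are assumptions made up for illustration.

```python
import json


def sparkmagic_configure(jar="hdfs:///apps/hudi/lib/hudi-spark-bundle.jar"):
    """Session configuration to paste after `%%configure -f` in the first cell."""
    return {"conf": {
        "spark.jars": jar,  # assumed HDFS path of the Hudi bundle
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
        # Tells Spark that spark_catalog is protected by Lake Formation.
        "spark.sql.catalog.spark_catalog.lf.managed": "true",
    }}


def create_table_sql(table, columns, location, table_type, primary_key, precombine_field):
    """Hudi DDL; table_type is 'cow' or 'mor'."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) USING hudi "
        f"LOCATION '{location}' "
        f"TBLPROPERTIES (type = '{table_type}', primaryKey = '{primary_key}', "
        f"preCombineField = '{precombine_field}')"
    )


def insert_sql(table, source_table, columns):
    """Load (or reload) rows from the source table into the Hudi table."""
    return f"INSERT INTO {table} SELECT {', '.join(columns)} FROM {source_table}"


# Reduced, hypothetical schema: column types and the `ts` column are assumptions.
columns = {"c_customer_id": "string", "c_last_name": "string",
           "c_email_address": "string", "c_birth_country": "string", "ts": "timestamp"}
table = "rsv2_blog_hudi_db_1.rsv2_blog_hudi_mor_sql_dl_customer_1"

print(json.dumps(sparkmagic_configure(), indent=2))
print(create_table_sql(table, columns, "s3://<bucket>/<prefix>/", "mor", "c_customer_id", "ts"))
print(insert_sql(table, "<source_table>", list(columns)))
```

In a notebook cell, each SQL string would be passed to `spark.sql(...)` or run in a `%%sql` cell. Running the insert a second time produces a second commit, which is what later makes incremental and time travel queries meaningful.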
Query the Hudi tables via Lake Formation with FGAC
After you create the Hudi database and tables, you’re ready to query the tables using fine-grained access control with Lake Formation. We have created two types of Hudi tables: Copy-On-Write (COW) and Merge-On-Read (MOR). The COW table stores data in a columnar format (Parquet), and each update creates a new version of files during a write. This means that for every update, Hudi rewrites the entire file, which can be more resource-intensive but provides faster read performance. MOR, on the other hand, is introduced for cases where COW may not be optimal, particularly for write- or change-heavy workloads. In a MOR table, each time there is an update, Hudi writes only the row for the changed record, which reduces cost and enables low-latency writes. However, the read performance might be slower compared to COW tables.
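The trade-off between the two table types can be illustrated with a toy model in plain Python. This is a conceptual sketch, not Hudi's storage layout: the classes, the row counts, and the dictionary-based "files" are invented to show why COW pays on write while MOR pays on read.

```python
class CowTable:
    """Copy-on-Write: every update rewrites the whole base file."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.rows_written = len(self.base)

    def upsert(self, key, value):
        new_base = dict(self.base)  # rewrite the entire file
        new_base[key] = value
        self.base = new_base
        self.rows_written += len(new_base)

    def read(self):
        return dict(self.base)  # no merge needed at read time


class MorTable:
    """Merge-on-Read: updates are appended to a log and merged when reading."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.log = []
        self.rows_written = len(self.base)

    def upsert(self, key, value):
        self.log.append((key, value))  # write only the changed row
        self.rows_written += 1

    def read_optimized(self):
        return dict(self.base)  # base file only; may be stale

    def read_real_time(self):
        merged = dict(self.base)
        for key, value in self.log:  # merge cost is paid at read time
            merged[key] = value
        return merged


rows = {"c1": "v1", "c2": "v1", "c3": "v1"}
cow, mor = CowTable(rows), MorTable(rows)
cow.upsert("c1", "v2")
mor.upsert("c1", "v2")

print(cow.rows_written, mor.rows_written)  # 6 4
print(mor.read_optimized()["c1"], mor.read_real_time()["c1"])  # v1 v2
```

The last line also shows why a MOR table offers two kinds of reads: the read-optimized view skips the log and can lag behind, while the real-time view merges the log and returns the latest data.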
Grant table access permission
We use the IAM role <STACK-NAME>-hudi-table-pii-role to query the Hudi COW and MOR tables containing PII columns. We first grant the table access permission via Lake Formation:
- On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
- Choose Grant.
- Choose <STACK-NAME>-hudi-table-pii-role for IAM users and roles.
- Choose the rsv2_blog_hudi_db_1 database for Databases.
- For Tables, choose the four Hudi tables you created in the Jupyter notebook.

- For Table permissions, select Select.
- Choose Grant.

Query PII columns
Now you’re ready to run the notebook to query the Hudi tables. Let’s follow similar steps to the previous section to run the notebook in SageMaker Studio:
- On the SageMaker console, navigate to the <STACK-NAME>-Studio-EMR-LF-Hudi domain.
- On the Launch menu next to the <STACK-NAME>-hudi-table-reader user profile, choose Studio.
- Upload the downloaded notebook rsv2-hudi-table-pii-reader-notebook.
- Open the uploaded notebook.
- Repeat the notebook setup steps and connect to the same EMR cluster, but use the role <STACK-NAME>-hudi-table-pii-role.
Currently, an FGAC-enabled EMR cluster must query Hudi’s commit time column to perform incremental queries and time travel. It does not support Spark’s “timestamp as of” syntax or Spark.read(). We are actively working on incorporating support for both in future Amazon EMR releases with FGAC enabled.
You can now follow the steps in the notebook. The following are some highlighted steps:
- Run a snapshot query.
- Run an incremental query.
- Run a time travel query.
- Run MOR read-optimized and real-time table queries.
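Because these queries are expressed through the commit time column, they can be sketched as plain Spark SQL strings. The helper below is an illustration, not the notebook's code: `_hoodie_commit_time` is Hudi's built-in metadata column, the table name is taken from this post, and the timestamp value is made up.

```python
def snapshot_sql(table):
    """Latest committed state of the table."""
    return f"SELECT * FROM {table}"


def incremental_sql(table, begin_time):
    """Rows whose latest version was committed after begin_time."""
    return f"SELECT * FROM {table} WHERE _hoodie_commit_time > '{begin_time}'"


def time_travel_sql(table, as_of_time):
    """Rows whose latest version was committed at or before as_of_time."""
    return f"SELECT * FROM {table} WHERE _hoodie_commit_time <= '{as_of_time}'"


table = "rsv2_blog_hudi_db_1.rsv2_blog_hudi_mor_sql_dl_customer_1"
commit = "20240101000000000"  # made-up commit timestamp

for sql in (snapshot_sql(table), incremental_sql(table, commit), time_travel_sql(table, commit)):
    print(sql)
```

In a notebook, each string would be passed to `spark.sql(...)`. One caveat of filtering the current snapshot this way: a row that was updated after `as_of_time` is excluded rather than shown in its earlier version, so the result approximates a point-in-time view.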
Query the Hudi tables with column-level and row-level data filters
We use the IAM role <STACK-NAME>-hudi-table-non-pii-role to query Hudi tables. This role is not allowed to query any columns containing PII. We use the Lake Formation column-level and row-level data filters to implement fine-grained access control:
- On the Lake Formation console, choose Data filters under Data Catalog in the navigation pane.
- Choose Create new filter.
- For Data filter name, enter customer-pii-filter.
- Choose rsv2_blog_hudi_db_1 for Target database.
- Choose rsv2_blog_hudi_mor_sql_dl_customer_1 for Target table.
- Select Exclude columns and choose the c_customer_id, c_email_address, and c_last_name columns.
- Enter c_birth_country != 'HONG KONG' for Row filter expression.
- Choose Create filter.

- Choose Data lake permissions under Permissions in the navigation pane.
- Choose Grant.
- Choose <STACK-NAME>-hudi-table-non-pii-role for IAM users and roles.
- Choose rsv2_blog_hudi_db_1 for Databases.
- Choose rsv2_blog_hudi_mor_sql_dl_tpc_customer_1 for Tables.
- Choose customer-pii-filter for Data filters.
- For Data filter permissions, select Select.
- Choose Grant.
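The filter and its grant can also be expressed through the Lake Formation API. The sketch below assumes boto3 and administrator credentials; the account ID and role ARN are placeholders, while the filter name, excluded columns, and row filter expression are the ones entered in the console steps above. The table name is passed as a parameter so you can supply the name of the table you actually created.

```python
def data_cells_filter(account_id, database, table, name, excluded_columns, row_filter):
    """TableData argument for create_data_cells_filter."""
    return {
        "TableCatalogId": account_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        "RowFilter": {"FilterExpression": row_filter},
        "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
    }


def filter_grant(role_arn, cells_filter):
    """Arguments for grant_permissions: SELECT through the data filter only."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"DataCellsFilter": {
            "TableCatalogId": cells_filter["TableCatalogId"],
            "DatabaseName": cells_filter["DatabaseName"],
            "TableName": cells_filter["TableName"],
            "Name": cells_filter["Name"],
        }},
        "Permissions": ["SELECT"],
    }


def apply(role_arn, cells_filter, region="us-east-1"):
    """Create the filter, then grant through it (requires boto3)."""
    import boto3

    lf = boto3.client("lakeformation", region_name=region)
    lf.create_data_cells_filter(TableData=cells_filter)
    lf.grant_permissions(**filter_grant(role_arn, cells_filter))


pii_filter = data_cells_filter(
    account_id="123456789012",  # placeholder account ID
    database="rsv2_blog_hudi_db_1",
    table="rsv2_blog_hudi_mor_sql_dl_customer_1",
    name="customer-pii-filter",
    excluded_columns=["c_customer_id", "c_email_address", "c_last_name"],
    row_filter="c_birth_country != 'HONG KONG'",
)
print(filter_grant("arn:aws:iam::123456789012:role/example-non-pii-role", pii_filter))
```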

Let’s follow similar steps to run the notebook in SageMaker Studio:
- On the SageMaker console, navigate to the domain <STACK-NAME>-Studio-EMR-LF-Hudi.
- On the Launch menu for the <STACK-NAME>-hudi-table-reader user profile, choose Studio.
- Upload the downloaded notebook rsv2-hudi-table-non-pii-reader-notebook and choose Open.
- Repeat the notebook setup steps and connect to the same EMR cluster, but select the role <STACK-NAME>-hudi-table-non-pii-role.
You can now follow the steps in the notebook. From the query results, you can see that FGAC via the Lake Formation data filter has been applied. The role can’t see the PII columns c_customer_id, c_last_name, and c_email_address. Also, the rows from HONG KONG have been filtered out.
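To visualize the effect of the data filter, here is a small local simulation in plain Python. It does not call Lake Formation or Spark; the sample rows are made up, and the function only mimics the two rules of the filter: drop the excluded columns and keep rows where `c_birth_country != 'HONG KONG'`. As in SQL, a missing (NULL) country does not satisfy the comparison, so such rows are dropped too.

```python
EXCLUDED = {"c_customer_id", "c_email_address", "c_last_name"}


def passes_row_filter(row):
    """Mimic c_birth_country != 'HONG KONG' with SQL NULL semantics."""
    country = row.get("c_birth_country")
    return country is not None and country != "HONG KONG"


def apply_filter(rows):
    """Drop filtered rows, then drop the excluded columns from what remains."""
    return [
        {k: v for k, v in row.items() if k not in EXCLUDED}
        for row in rows
        if passes_row_filter(row)
    ]


# Made-up sample rows for illustration only.
rows = [
    {"c_customer_id": "A1", "c_first_name": "Ana", "c_last_name": "Lee",
     "c_email_address": "ana@example.com", "c_birth_country": "CHILE"},
    {"c_customer_id": "B2", "c_first_name": "Bo", "c_last_name": "Chan",
     "c_email_address": "bo@example.com", "c_birth_country": "HONG KONG"},
    {"c_customer_id": "C3", "c_first_name": "Cy", "c_last_name": "Roy",
     "c_email_address": "cy@example.com", "c_birth_country": None},
]

print(apply_filter(rows))  # [{'c_first_name': 'Ana', 'c_birth_country': 'CHILE'}]
```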

Clean up
After you’re done experimenting with the solution, we recommend cleaning up resources with the following steps to avoid unexpected costs:
- Shut down the SageMaker Studio apps for the user profiles.
The EMR cluster is terminated automatically after the idle timeout elapses.
- Delete the Amazon Elastic File System (Amazon EFS) volume created for the domain.
- Empty the S3 buckets created by the CloudFormation stack.
- On the AWS CloudFormation console, delete the stack.
Conclusion
In this post, we used Apache Hudi, one type of OTF table, to demonstrate the new feature that enforces fine-grained access control on Amazon EMR. You can define granular permissions in Lake Formation for OTF tables and apply them via Spark SQL queries on EMR clusters. You can also use transactional data lake features such as snapshot queries, incremental queries, time travel, and DML queries. Note that this new feature covers all OTF tables.
This feature is available starting with Amazon EMR release 6.15 in all Regions where Amazon EMR is available. With the Amazon EMR integration with Lake Formation, you can confidently manage and process big data, unlocking insights and facilitating informed decision-making while upholding data security and governance.
To learn more, refer to Enable Lake Formation with Amazon EMR, and feel free to contact your AWS Solutions Architects, who can assist you along your data journey.
About the Authors
Raymond Lai is a Senior Solutions Architect who specializes in catering to the needs of large enterprise customers. His expertise lies in assisting customers with migrating intricate enterprise systems and databases to AWS, constructing enterprise data warehousing and data lake platforms. Raymond excels in identifying and designing solutions for AI/ML use cases, and he has a particular focus on AWS Serverless solutions and Event Driven Architecture design.
Bin Wang, PhD, is a Senior Analytic Specialist Solutions Architect at AWS, boasting over 12 years of experience in the ML industry, with a particular focus on advertising. He possesses expertise in natural language processing (NLP), recommender systems, diverse ML algorithms, and ML operations. He is deeply passionate about applying ML/DL and big data techniques to solve real-world problems.
Aditya Shah is a Software Development Engineer at AWS. He is interested in Databases and Data warehouse engines and has worked on performance optimisations, security compliance and ACID compliance for engines like Apache Hive and Apache Spark.
Melody Yang is a Senior Big Data Solution Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.


















