Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-1-effective-data-lakes-using-aws-lake-formation-part-1-getting-started-with-governed-tables/
Thousands of customers are building their data lakes on Amazon Simple Storage Service (Amazon S3). You can use AWS Lake Formation to build your data lakes easily—in a matter of days as opposed to months. However, there are still some difficult challenges to address with your data lakes:
- Supporting streaming updates and deletes in your data lakes, for example, database replication, and supporting privacy regulations such as GDPR and CCPA
- Achieving fine-grained secure sharing not only with table- or column-level access control, but with row-level access control
- Optimizing the layout of various tables and files on Amazon S3 to improve analytics performance
We announced Lake Formation transactions, row-level security, and acceleration for preview at AWS re:Invent 2020. These capabilities are available via new, open, and public update and access APIs for data lakes. These APIs extend the governance capabilities of Lake Formation with row-level security, and provide transaction semantics on data lakes.
In this series of posts, we provide step-by-step instructions for using these new Lake Formation features. In this post, we focus on the first step: setting up governed tables.
Lake Formation transactions, row-level security, and acceleration are currently available for preview in the US East (N. Virginia) AWS Region. To get early access to these capabilities, sign up for the preview. You need to be approved for the preview to gain access to these features.
Governed Table
The Data Catalog supports a new type of metadata table: the governed table. Governed tables are unique to Lake Formation. They are a new Amazon S3 table type that supports atomic, consistent, isolated, and durable (ACID) transactions. Lake Formation transactions simplify ETL script and workflow development, and allow multiple users to concurrently and reliably insert, delete, and modify rows across multiple governed tables. Lake Formation automatically compacts and optimizes storage of governed tables in the background to improve query performance. When you create a table, you can specify whether or not the table is governed.
Setting up resources with AWS CloudFormation
In this post, I demonstrate how you can create a new governed table using existing data on Amazon S3. We use the Amazon Customer Reviews Dataset, which is stored in a public S3 bucket as sample data. You don’t need to copy the data to your bucket or worry about Amazon S3 storage costs. You can just set up a governed table pointing to this existing public data to see how it works.
This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. If you prefer setting up resources on the AWS Management Console rather than AWS CloudFormation, see the instructions in the appendix at the end of this post.
The CloudFormation template generates the following resources:
- AWS Identity and Access Management (IAM) users, roles, and policies
- AWS Lake Formation data lake settings and permissions
To create your resources, complete the following steps:
- Sign in to the CloudFormation console in the us-east-1 Region.
- Choose Launch Stack:
- Choose Next.
- For DatalakeAdminUserName and DatalakeAdminUserPassword, enter the IAM user name and password for the data lake admin user.
- For DataAnalystUserName and DataAnalystUserPassword, enter the IAM user name and password for the data analyst user.
- For DatabaseName, leave as the default.
- Choose Next.
- On the next page, choose Next.
- Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create.
Stack creation can take up to 2 minutes.
Setting up a governed table
Now you can create and configure your first governed table in AWS Lake Formation.
Creating a governed table
To create your governed table, complete the following steps:
- Sign in to the Lake Formation console in the us-east-1 Region using the DatalakeAdmin1 user.
- Choose Tables.
- Choose Create table.
- For Name, enter amazon_reviews_governed.
- For Database, enter lakeformation_tutorial_amazon_reviews.
- Select Enable governed data access and management.
- Select Enable row based permissions.
- For Data is located in, choose Specified path in another account.
- Enter the path s3://amazon-reviews-pds/parquet/.
- For Classification, choose PARQUET.
- Choose Upload Schema.
- Enter the following JSON array into the text box:
- Choose Upload.
- Choose Add column.
- For Column name, enter product_category.
- For Data type, choose String.
- Select Partition Key.
- Choose Add.
- Choose Submit.
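The JSON array for the Upload Schema step is not reproduced in the steps above. As a rough sketch only, the following Python snippet generates an array in the Name/Type shape used for Data Catalog column definitions. The column names and types are assumptions based on the public documentation of the Amazon Customer Reviews Dataset, not on this post, so verify them against the dataset before you use them. The product_category column is left out on purpose because the steps above add it separately as a partition key.

```python
import json

# ASSUMPTION: columns of the Parquet version of the Amazon Customer Reviews
# Dataset. These names and types are not taken from this post; verify them.
ASSUMED_COLUMNS = [
    ("marketplace", "string"),
    ("customer_id", "string"),
    ("review_id", "string"),
    ("product_id", "string"),
    ("product_parent", "string"),
    ("product_title", "string"),
    ("star_rating", "int"),
    ("helpful_votes", "int"),
    ("total_votes", "int"),
    ("vine", "string"),
    ("verified_purchase", "string"),
    ("review_headline", "string"),
    ("review_body", "string"),
    ("review_date", "bigint"),
    ("year", "int"),
]


def build_schema(columns):
    """Return a list of {"Name": ..., "Type": ...} column definitions."""
    return [{"Name": name, "Type": col_type} for name, col_type in columns]


schema_json = json.dumps(build_schema(ASSUMED_COLUMNS), indent=4)
print(schema_json)
```

The printed array can be pasted into the schema text box once the columns have been checked.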
Now you can see that the new governed table has been created.
When you choose the table name, you can see the details of the governed table, including Governance: Enabled. This means that it's a Lake Formation governed table. Any other existing tables show Governance: Disabled because they are not governed tables.
You can also see lakeformation.aso.status: true under Table properties. This means that automatic compaction is enabled for this table. For this post, we use a read-only table and don't utilize automatic compaction. To disable automatic compaction, complete the following steps:
- Choose Edit table.
- Deselect Automatic compaction.
- Choose Save.
Currently, no data and no partitions are registered to this governed table. In the next step, we register existing S3 objects to the governed table using Lake Formation manifest APIs.
Even if you locate your data in the table location of the governed table, the data isn’t recognized yet. To make the governed table aware of the data, you need to make a Lake Formation API call, or use an AWS Glue job with Lake Formation transactions.
Configuring Lake Formation permissions
You need to grant Lake Formation permissions for your governed table. Complete the following steps:
Table-level permissions
- Sign in to the Lake Formation console in the us-east-1 Region using the DatalakeAdmin1 user.
- Under Permissions, choose Data permissions.
- Under Data permission, choose Grant.
- For Database, choose lakeformation_tutorial_amazon_reviews.
- For Table, choose amazon_reviews_governed.
- For IAM users and roles, choose the role LFRegisterLocationServiceRole-<CloudFormation stack name> and the user DatalakeAdmin1.
- Select Table permissions.
- Under Table permissions, select Alter, Insert, Drop, Delete, Select, and Describe.
- Choose Grant.
- Under Data permission, choose Grant.
- For Database, choose lakeformation_tutorial_amazon_reviews.
- For Table, choose amazon_reviews_governed.
- For IAM users and roles, choose the user DataAnalyst1.
- Under Table permissions, select Select and Describe.
- Choose Grant.
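The same table-level grants can also be written down as API request parameters. The sketch below only builds dictionaries in the shape accepted by the Lake Formation GrantPermissions API (grant_permissions in boto3) and does not call AWS. The account ID and the resulting user ARNs are hypothetical placeholders, and the preview service may expect a different client or endpoint than the one mentioned in the comment.

```python
ACCOUNT_ID = "123456789012"  # hypothetical placeholder account ID
DATABASE = "lakeformation_tutorial_amazon_reviews"
TABLE = "amazon_reviews_governed"


def table_grant(user_name, permissions):
    """Build keyword arguments for one table-level grant to an IAM user."""
    return {
        "Principal": {
            "DataLakePrincipalIdentifier": (
                f"arn:aws:iam::{ACCOUNT_ID}:user/{user_name}"
            )
        },
        "Resource": {"Table": {"DatabaseName": DATABASE, "Name": TABLE}},
        "Permissions": permissions,
    }


# Mirrors the console steps: full table permissions for the admin user,
# read-only permissions for the analyst user.
admin_grant = table_grant(
    "DatalakeAdmin1",
    ["ALTER", "INSERT", "DROP", "DELETE", "SELECT", "DESCRIBE"],
)
analyst_grant = table_grant("DataAnalyst1", ["SELECT", "DESCRIBE"])

# With credentials configured, each dictionary could be passed on like this
# (assumption: the standard boto3 Lake Formation client is usable here):
#   boto3.client("lakeformation").grant_permissions(**analyst_grant)
print(analyst_grant)
```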
Row-level permissions
- Under Permissions, choose Data permissions.
- Under Data permission, choose Grant.
- For Database, choose lakeformation_tutorial_amazon_reviews.
- For Table, choose amazon_reviews_governed.
- For IAM users and roles, choose the role LFRegisterLocationServiceRole-<CloudFormation stack name> and the users DatalakeAdmin1 and DataAnalyst1.
- Select Row-based permissions.
- For Filter name, enter allowAll.
- For Choose filter type, select Allow access to all rows.
- Choose Grant.
Adding table objects into the governed table
To register S3 objects to a governed table, you need to call the UpdateTableObjects API for the objects. You can call it using the AWS Command Line Interface (AWS CLI), the SDK, or the AWS Glue ETL library (which calls the API implicitly). For this post, we use the AWS CLI to explain the behavior at the API level. If you don't have the AWS CLI, see Installing, updating, and uninstalling the AWS CLI. You also need to install the service model file provided in the Lake Formation preview program. Run the following commands using the DatalakeAdmin1 user's credentials (or an IAM role or user that has been granted sufficient permissions).
First, begin a new transaction with the BeginTransaction API:
Now you can register any files in the location. For this post, we choose one sample partition, product_category=Camera, from the amazon-reviews-pds bucket, and choose one file under this partition. Uri, ETag, and Size are required for the following steps, so you need to copy them.
Create a new file named write-operations1.json and enter the following JSON: (replace Uri, ETag, and Size with the values you copied.)
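As a minimal sketch of what such a file can contain, the Python snippet below prints a JSON array with a single AddObject write operation. The AddObject and PartitionValues field names reflect my understanding of the UpdateTableObjects request shape and are not taken from this post, so treat them as assumptions. The file name, ETag, and size are placeholders that you must replace with the values you copied.

```python
import json


def add_object_operation(uri, etag, size, partition_values):
    """Build one AddObject write operation (field names are assumptions)."""
    return {
        "AddObject": {
            "Uri": uri,
            "ETag": etag,
            "Size": size,
            "PartitionValues": partition_values,
        }
    }


# Placeholder values: replace them with the Uri, ETag, and Size you copied.
operations = [
    add_object_operation(
        uri="s3://amazon-reviews-pds/parquet/product_category=Camera/<file-name>",
        etag="<etag>",
        size=0,
        partition_values=["Camera"],
    )
]

write_operations_json = json.dumps(operations, indent=4)
print(write_operations_json)
```

Saving the printed output as write-operations1.json gives a file that can be passed to the API call in the next step.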
Let’s register an existing object in the bucket to the governed table by making an UpdateTableObjects API call using the write-operations1.json file you created. (Replace <transaction-id> with the transaction ID you got from the begin-transaction command.)
You can verify the change before the transaction is committed by making the GetTableObjects API call with the same transaction ID: (replace <transaction-id> with the transaction ID you got from the begin-transaction command.)
To make this data available to other transactions, you need to call the CommitTransaction API: (replace <transaction-id> with the transaction ID you got from the begin-transaction command.)
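The visibility rule behind these calls can be illustrated with a toy in-memory model. The class below is purely hypothetical and is not how Lake Formation is implemented. It only mimics the behavior described above: a transaction can read its own uncommitted objects, while other transactions see them only after the commit.

```python
class ToyGovernedTable:
    """Toy model of transactional object registration (illustration only)."""

    def __init__(self):
        self._committed = []  # objects visible to every transaction
        self._pending = {}    # transaction ID -> uncommitted objects
        self._next_id = 0

    def begin_transaction(self):
        self._next_id += 1
        transaction_id = f"txn-{self._next_id}"
        self._pending[transaction_id] = []
        return transaction_id

    def update_table_objects(self, transaction_id, uris):
        self._pending[transaction_id].extend(uris)

    def get_table_objects(self, transaction_id):
        # A transaction sees committed objects plus its own pending ones.
        return self._committed + self._pending[transaction_id]

    def commit_transaction(self, transaction_id):
        self._committed.extend(self._pending.pop(transaction_id))


table = ToyGovernedTable()
writer = table.begin_transaction()
table.update_table_objects(writer, ["product_category=Camera/file-1.parquet"])

other = table.begin_transaction()
seen_by_writer = table.get_table_objects(writer)  # includes the new object
seen_by_other = table.get_table_objects(other)    # empty before the commit

table.commit_transaction(writer)
later = table.begin_transaction()
seen_after_commit = table.get_table_objects(later)  # includes the new object

print(seen_by_writer, seen_by_other, seen_after_commit)
```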
Let’s add one more partition into this table. This time we add one file per partition, and add only two partitions as an example. For actual usage, you need to add all the files under all the partitions that you need.
Add the partitions with the following commands:
- Call the BeginTransaction API to start another Lake Formation transaction:
- List the Amazon S3 objects located in the amazon-reviews-pds bucket to choose another sample file:
- Call the HeadObject API against one sample file in order to copy ETag and Size:
- Create a new file named write-operations2.json and enter the following JSON: (replace Uri, ETag, and Size with the values you copied.)
- Call the UpdateTableObjects API using write-operations2.json: (replace <transaction-id> with the transaction ID you got from the begin-transaction command.)
- Call the CommitTransaction API so that the newly registered file becomes visible to other transactions: (replace <transaction-id> with the transaction ID you got from the begin-transaction command.)
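Copying ETag and Size by hand is easy to get wrong, so the conversion can be scripted. The following sketch assumes that you saved the JSON output of the aws s3api head-object command as a string. It reads the ETag and ContentLength fields, which S3 returns for HeadObject, and strips the double quotes that surround the ETag value. Whether Lake Formation expects the ETag without quotes is an assumption here, and the sample response and file name are made up for illustration.

```python
import json


def add_object_from_head(uri, head_object_json, partition_values):
    """Turn head-object JSON output into an AddObject write operation."""
    head = json.loads(head_object_json)
    return {
        "AddObject": {
            "Uri": uri,
            # S3 returns the ETag wrapped in double quotes; strip them
            # (ASSUMPTION: the unquoted form is what the API expects).
            "ETag": head["ETag"].strip('"'),
            "Size": head["ContentLength"],
            "PartitionValues": partition_values,
        }
    }


# Made-up head-object output, for illustration only.
sample_head = json.dumps(
    {"ETag": '"0123456789abcdef0123456789abcdef-1"', "ContentLength": 1024}
)

operation = add_object_from_head(
    "s3://amazon-reviews-pds/parquet/product_category=Books/<file-name>",
    sample_head,
    ["Books"],
)
print(json.dumps([operation], indent=4))
```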
Querying the governed table using Amazon Athena
Now your governed table is ready! Let’s start querying the governed table using Amazon Athena. Sign in to the Athena console in the us-east-1 Region using the DataAnalyst1 user.
If it’s your first time running queries on Athena, you need to configure a query result location. For more information, see Specifying a Query Result Location.
To utilize Lake Formation preview features, you need to create a special workgroup named AmazonAthenaLakeFormationPreview, and join the workgroup. For more information, see Managing Workgroups.
Running a simple query
Sign in to the Athena console in the us-east-1 Region using the DataAnalyst1 user. First, let’s preview 10 records stored in the governed table:
The following screenshot shows the query results.
Running an analytic query
Next, let’s run an analytic query with aggregation for simulating real-world use cases:
The following screenshot shows the results. This query returned the total number of reviews and average rating per product category.
Running an analytic query with time travel
Each governed table maintains a versioned manifest of the Amazon S3 objects that it comprises. You can use previous versions of the manifest for time travel queries. Your queries against governed tables in Athena can include a timestamp to indicate that you want to discover the state of the data at a particular date and time.
To submit a time travel query in Athena, add a WHERE clause that sets the column __asOfDate to the epoch time (long integer) representation of the required date and time. Let’s run the time travel query: (replace <epoch-milliseconds> with a timestamp from right after you made the first UpdateTableObjects call. To retrieve the epoch milliseconds, see the tips introduced after the screenshots in this post.)
The following screenshot shows the query results. The result only includes the record of product_category=Camera. This is because the file under product_category=Books was added after this timestamp (1612267920000 ms = 2021/02/02 12:12:00 UTC), which was specified in the time travel column __asOfDate.
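To show how the time travel variant differs from the plain analytic query, the sketch below builds both SQL strings in Python. The only difference is the extra WHERE clause on __asOfDate. The star_rating column and the output column aliases are assumptions based on the description of the result (number of reviews and average rating per product category), not the exact query used in this post.

```python
TABLE = "lakeformation_tutorial_amazon_reviews.amazon_reviews_governed"


def review_summary_query(as_of_epoch_ms=None):
    """Build the per-category aggregation, optionally as a time travel query."""
    where = ""
    if as_of_epoch_ms is not None:
        # int() keeps non-numeric input out of the SQL text.
        where = f"WHERE __asOfDate = {int(as_of_epoch_ms)}\n"
    return (
        "SELECT product_category,\n"
        "       count(*) AS total_reviews,\n"
        "       avg(star_rating) AS average_rating\n"  # star_rating is assumed
        f"FROM {TABLE}\n"
        f"{where}"
        "GROUP BY product_category"
    )


current_query = review_summary_query()
time_travel_query = review_summary_query(1612267920000)
print(time_travel_query)
```

Passing no timestamp yields the query against the current state of the table, and passing the epoch milliseconds yields the query against the manifest version that was current at that moment.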
To retrieve the epoch time on the command line, you can run the following commands.
The following command is for Linux (GNU date command):
The following command is for OSX (BSD date command):
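If neither date variant is at hand, the same value can be computed in Python. This is a small sketch using only the standard library, with a hypothetical helper name. It reproduces the timestamp used in the time travel example above.

```python
from datetime import datetime, timezone


def epoch_milliseconds(moment):
    """Return milliseconds since the Unix epoch for a timezone-aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time surprises")
    return int(moment.timestamp() * 1000)


example = epoch_milliseconds(datetime(2021, 2, 2, 12, 12, 0, tzinfo=timezone.utc))
print(example)  # 1612267920000
```

For the current time, call epoch_milliseconds(datetime.now(timezone.utc)).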
Cleaning up
Now to the final step, cleaning up the resources.
- Delete the CloudFormation stack. The governed table you created is automatically deleted with the stack.
- Delete the Athena workgroup AmazonAthenaLakeFormationPreview.
Conclusion
In this blog post, we explained how to create a Lake Formation governed table with existing data in an AWS public dataset. In addition, we explained how to query against governed tables and how to run time travel queries for governed tables. With Lake Formation governed tables, you can achieve transactions, row-level security, and query acceleration. In Part 2 of this series, we show you how to create a governed table for streaming data sources and demonstrate how Lake Formation transactions work.
Lake Formation transactions, row-level security, and acceleration are currently available for preview in the US East (N. Virginia) AWS Region. To get early access to these capabilities, please sign up for the preview.
Appendix: Setting up resources via the console
When following the steps in this section, use the us-east-1 Region because, as of this writing, this Lake Formation preview feature is available only in us-east-1.
Configuring IAM roles and IAM users
First, you need to set up two IAM roles: one for AWS Glue ETL jobs and another for the Lake Formation data lake location.
IAM policies
To create your policies, complete the following steps:
- On the IAM console, create a new policy for Amazon S3.
- Save the policy as S3DataLakePolicy as follows:
- Create a new IAM policy named LFLocationPolicy with the following statements:
- Create a new IAM policy named LFQueryPolicy with the following statements:
IAM role for AWS Lake Formation
To create your IAM role for the Lake Formation data lake location, complete the following steps:
- Create a new Lake Formation role called LFRegisterLocationServiceRole with a Lake Formation trust relationship:
- Attach the customer managed policies S3DataLakePolicy and LFLocationPolicy you created in the previous step.
This role is used to register locations with Lake Formation, which in turn performs credential vending for Athena at query time.
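The trust relationship document itself is not reproduced above. As a sketch only, the snippet below prints a trust policy in the standard IAM shape that lets the Lake Formation service principal assume the role. The original post may list additional service principals, so treat this as an assumption and not as the exact policy.

```python
import json

# ASSUMPTION: a minimal trust policy for the Lake Formation service principal.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lakeformation.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=4))
```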
IAM users
To create your users, complete the following steps:
- Create an IAM user named DatalakeAdmin.
- Attach the following AWS managed policies: AWSLakeFormationDataAdmin, AmazonAthenaFullAccess, and IAMReadOnlyAccess.
- Attach the customer managed policy LFQueryPolicy.
- Create an IAM user named DataAnalyst that can use Athena to query data.
- Attach the AWS managed policy AmazonAthenaFullAccess.
- Attach the customer managed policy LFQueryPolicy.
Configuring Lake Formation
If you’re new to Lake Formation, you can follow the steps below to get started with AWS Lake Formation.
- On the Lake Formation console, under Permissions, choose Admins and database creators.
- In the Data lake administrators section, choose Grant.
- For IAM users and roles, choose your IAM user DatalakeAdmin.
- Choose Save.
- In the Database creators section, choose Grant.
- For IAM users and roles, choose the LFRegisterLocationServiceRole.
- Select Create Database.
- Choose Grant.
- Under Register and ingest, choose Data lake locations.
- Choose Register location.
- For Amazon S3 path, enter your Amazon S3 path to the bucket where your data is stored. This needs to be the same bucket you listed in LFLocationPolicy. Lake Formation uses this role to vend temporary Amazon S3 credentials to query services that need read/write access to the bucket and all prefixes under it.
- For IAM role, choose the LFRegisterLocationServiceRole.
- Choose Register location.
- Under Data catalog, choose Settings.
- Make sure that both check boxes for Use only IAM access control for new databases and Use only IAM access control for new tables in new databases are deselected.
- Under Data catalog, choose Databases.
- Choose Create database.
- Select Database.
- For Name, enter lakeformation_tutorial_amazon_reviews.
- Choose Create database.
About the Author
Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue & Lake Formation. His passion is for implementing software artifacts for building data lakes more effectively and easily. During his spare time, he loves to spend time with his family, especially hunting bugs—not software bugs, but bugs like butterflies, pill bugs, snails, and grasshoppers.