Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started

Post Syndicated from Akira Ajisaka original https://aws.amazon.com/blogs/big-data/part-1-getting-started-introducing-native-support-for-apache-hudi-delta-lake-and-apache-iceberg-on-aws-glue-for-apache-spark/

AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. AWS Glue provides an extensible architecture that supports users with diverse data processing use cases.

A common use case is building data lakes on Amazon Simple Storage Service (Amazon S3) using AWS Glue extract, transform, and load (ETL) jobs. Data lakes free you from proprietary data formats defined by business intelligence (BI) tools and the limited capacity of proprietary storage. In addition, data lakes help you break down data silos to maximize end-to-end data insights. As data lakes have grown in size and matured in usage, a significant amount of effort can be spent keeping the data up to date by ensuring files are updated in a transactionally consistent manner.

AWS Glue customers can now use the following open-source data lake storage frameworks: Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg. These data lake frameworks help you store data and interface it with your applications and frameworks. Although popular data file formats such as Apache Parquet, CSV, and JSON can store big data, data lake frameworks bundle distributed big data files into tabular structures that are otherwise hard to manage. This makes data lake table frameworks the building blocks of databases on data lakes.

We announced the general availability of native support for Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark. This feature removes the need to install a separate connector or associated dependencies and manage versions, and it simplifies the configuration steps required to use these frameworks in AWS Glue for Apache Spark. With these open-source data lake frameworks, you can simplify incremental data processing in data lakes built on Amazon S3 by using ACID (atomicity, consistency, isolation, durability) transactions, upserts, and deletes.

This post demonstrates how AWS Glue for Apache Spark works with Hudi, Delta, and Iceberg dataset tables, and describes typical use cases on an AWS Glue Studio notebook.

Enable Hudi, Delta, and Iceberg in AWS Glue for Apache Spark

You can use Hudi, Delta, or Iceberg by specifying a new job parameter --datalake-formats. For example, if you want to use Hudi, you need to specify the key as --datalake-formats and the value as hudi. If the option is set, AWS Glue automatically adds the required JAR files into the runtime Java classpath, and that’s all you need. You don’t need to build and configure the required libraries or install a separate connector. You can use the following library versions with this option.

AWS Glue version | Hudi   | Delta Lake | Iceberg
AWS Glue 3.0     | 0.10.1 | 1.0.0      | 0.13.1
AWS Glue 4.0     | 0.12.1 | 2.1.0      | 1.0.0
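
As a minimal sketch, you might set this parameter when creating a job with the AWS SDK for Python (Boto3); the job name, role ARN, and script location below are hypothetical placeholders.

import boto3

glue = boto3.client("glue")

# Hypothetical job definition: the name, role, and script location are placeholders.
# The --datalake-formats default argument enables the Hudi libraries for this job.
glue.create_job(
    Name="hudi-example-job",
    Role="arn:aws:iam::123456789012:role/YourGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/hudi_example.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    DefaultArguments={"--datalake-formats": "hudi"},
)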

If you want to use other versions of the preceding libraries, you can choose either of the following options:

If you choose either of the preceding options, you need to make sure the --datalake-formats job parameter is unspecified. For more information, see Process Apache Hudi, Delta Lake, Apache Iceberg datasets at scale, part 1: AWS Glue Studio Notebook.

Prerequisites

To continue this tutorial, you need to create the following AWS resources in advance:

  • An S3 bucket for storing data
  • An IAM role for your AWS Glue notebook job

Process Hudi, Delta, and Iceberg datasets on an AWS Glue Studio notebook

AWS Glue Studio notebooks provide serverless notebooks with minimal setup, so data engineers and developers can quickly and interactively explore and process their datasets. You can start using Hudi, Delta, or Iceberg in an AWS Glue Studio notebook by specifying the parameter via the %%configure magic and setting the AWS Glue version to 3.0 as follows:

# Use Glue version 3.0
%glue_version 3.0

# Configure '--datalake-formats' Job parameter
%%configure
{
  "--datalake-formats": "your_comma_separated_formats"
}

For more information, refer to the example notebooks available in the GitHub repository.

For this post, we use an Iceberg DataFrame as an example.

The following sections explain how to use an AWS Glue Studio notebook to create an Iceberg table and append records to the table.

Launch a Jupyter notebook to process Iceberg tables

Complete the following steps to launch an AWS Glue Studio notebook:

  1. Download the Jupyter notebook file.
  2. On the AWS Glue console, choose Jobs in the navigation pane.
  3. Under Create job, select Jupyter Notebook.

  4. Select Upload and edit an existing notebook.
  5. Upload native_iceberg_dataframe.ipynb through Choose file under File upload.

  6. Choose Create.
  7. For Job name, enter native_iceberg_dataframe.
  8. For IAM Role, choose your IAM role.
  9. Choose Start notebook job.

Prepare and configure SparkSession with Iceberg configuration

Complete the following steps to configure SparkSession to process Iceberg tables:

  1. Run the following cell.
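
Following the configuration pattern shown earlier, this cell sets the AWS Glue version and the --datalake-formats parameter to iceberg:

%glue_version 3.0
%%configure
{
  "--datalake-formats": "iceberg"
}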

You can see --datalake-formats iceberg is set by the %%configure Jupyter magic command. For more information about Jupyter magics, refer to Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks.

  2. Provide your S3 bucket name and bucket prefix for your Iceberg table location in the following cell, and run it.
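
A minimal sketch of this cell; the values are hypothetical placeholders, and the warehouse_path variable is introduced here for illustration.

# Replace these placeholder values with your own S3 bucket name and prefix.
bucket_name = "your-s3-bucket-name"
bucket_prefix = "your-prefix"
warehouse_path = f"s3://{bucket_name}/{bucket_prefix}"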

  3. Run the following cells to initialize SparkSession.
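
The initialization might look like the following minimal sketch. The catalog name glue_catalog and the warehouse_path variable are assumptions carried over from the previous step; the configuration properties follow the Iceberg settings documented for AWS Glue.

from pyspark.sql import SparkSession

# Build a SparkSession with the Iceberg extensions and a catalog backed by
# the AWS Glue Data Catalog, storing table data under warehouse_path on S3.
spark = SparkSession.builder \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse", warehouse_path) \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()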

  4. Optionally, if you previously ran the notebook, run the following cell to clean up existing resources.
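
A hypothetical cleanup cell; the database and table names (iceberg_db.products) are assumptions used throughout this example.

# Drop the example table, if it exists, and purge its underlying data files.
spark.sql("DROP TABLE IF EXISTS glue_catalog.iceberg_db.products PURGE")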

Now you’re ready to create Iceberg tables using the notebook.

Create an Iceberg table

Complete the following steps to create an Iceberg table using the notebook:

  1. Run the following cell to create a DataFrame (df_products) to write.
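
A minimal sketch of this cell; the column names and sample rows are placeholders.

# Build a small products DataFrame to write to the Iceberg table.
df_products = spark.createDataFrame(
    [
        (1, "Laptop", 900.00),
        (2, "Monitor", 250.00),
        (3, "Keyboard", 40.00),
    ],
    ["product_id", "product_name", "price"],
)
df_products.show()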

If successful, you can see the contents of the DataFrame displayed as a table.

  2. Run the following cell to create an Iceberg table using the DataFrame.
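
A minimal sketch of this cell, assuming the glue_catalog catalog and the hypothetical iceberg_db.products names from the earlier steps.

# Create the target database in the Glue Data Catalog if it does not exist,
# then write the DataFrame as a new Iceberg table.
spark.sql("CREATE NAMESPACE IF NOT EXISTS glue_catalog.iceberg_db")
df_products.writeTo("glue_catalog.iceberg_db.products").create()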

  3. Now you can read data from the Iceberg table by running the following cell.
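
A minimal sketch of the read, using the same assumed names.

# Read the Iceberg table back into a DataFrame and display its contents.
df = spark.table("glue_catalog.iceberg_db.products")
df.show()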

Append records to the Iceberg table

Complete the following steps to append records to the Iceberg table:

  1. Run the following cell to create a DataFrame (df_products_appends) to append.
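
A minimal sketch of this cell; the sample rows are placeholders.

# Build a DataFrame of additional records to append to the Iceberg table.
df_products_appends = spark.createDataFrame(
    [
        (4, "Mouse", 20.00),
        (5, "Headset", 60.00),
    ],
    ["product_id", "product_name", "price"],
)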

  2. Run the following cell to append the records to the table.
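
For example, using the DataFrameWriterV2 API with the same assumed table name.

# Append the new records to the existing Iceberg table.
df_products_appends.writeTo("glue_catalog.iceberg_db.products").append()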

  3. Run the following cell to confirm that the preceding records are successfully appended to the table.
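
Reading the table again (with the same assumed names) should now show the appended records.

# Display the table contents, including the newly appended rows.
spark.table("glue_catalog.iceberg_db.products").show()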

Clean up

To avoid incurring ongoing charges, clean up your resources:

  1. Run step 4 in the Prepare and configure SparkSession with Iceberg configuration section in this post to delete the table and underlying S3 objects.
  2. On the AWS Glue console, choose Jobs in the navigation pane.
  3. Select your job and on the Actions menu, choose Delete job(s).
  4. Choose Delete to confirm.

Considerations

With this capability, you have three different options to access Hudi, Delta, and Iceberg tables:

  • Spark DataFrames, for example spark.read.format("hudi").load("s3://path_to_data")
  • SparkSQL, for example SELECT * FROM table
  • GlueContext, for example create_data_frame.from_catalog, write_data_frame.from_catalog, getDataFrame, and writeDataFrame
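
As a minimal sketch, the GlueContext option might look like the following in a job script; the database and table names are assumptions.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Create a GlueContext and read a Data Catalog table into a Spark DataFrame.
glue_context = GlueContext(SparkContext.getOrCreate())
df = glue_context.create_data_frame.from_catalog(
    database="iceberg_db",
    table_name="products",
)
df.show()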

Learn more in Using the Hudi framework in AWS Glue, Using the Delta Lake framework in AWS Glue, and Using the Iceberg framework in AWS Glue.

The Delta Lake native integration works with catalog tables created from native Delta Lake tables by AWS Glue crawlers, and it does not depend on manifest files. For more information, refer to Introducing native Delta Lake table support with AWS Glue crawlers.

Conclusion

This post demonstrated how to process Apache Hudi, Delta Lake, and Apache Iceberg datasets using AWS Glue for Apache Spark. You can easily integrate your data using these data lake formats without struggling with library dependency management.

In subsequent posts in this series, we'll show how to use AWS Glue Studio to visually author ETL jobs with simpler configuration and setup for these data lake formats, and how to use AWS Glue workflows to orchestrate data pipelines and automate ingestion into your data lakes on Amazon S3 with AWS Glue jobs. Stay tuned!

If you have comments or feedback, please leave them in the comments.


About the authors

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open-source software and distributed systems. In his spare time, he enjoys playing both arcade and console games.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Savio Dsouza is a Software Development Manager on the AWS Glue team. His teams work on building and innovating in distributed compute systems and frameworks, namely on Apache Spark.