Process Apache Hudi, Delta Lake, Apache Iceberg datasets at scale, part 2: Using AWS Glue Studio Visual Editor

Transactional data lake technologies such as Apache Hudi, Delta Lake, Apache Iceberg, and AWS Lake Formation governed tables are evolving rapidly and gaining popularity. These technologies simplify data processing pipelines significantly and provide useful capabilities such as upserts, rollbacks, and time travel queries.

In the first post of this series, we went through how to process Apache Hudi, Delta Lake, and Apache Iceberg datasets using AWS Glue connectors. AWS Glue simplifies reading and writing data in those data lake formats and building data lakes on top of those technologies. By running the sample notebooks in AWS Glue Studio notebooks, you can interactively develop and run your code and immediately see the results. The notebooks let you explore how those technologies work if you have coding experience.

This second post focuses on customers who prefer visual job authoring without writing custom code. Even without coding experience, you can build transactional data lakes in the AWS Glue Studio visual editor and take advantage of these technologies. You can also use Amazon Athena to query data stored in Hudi and Iceberg tables. This tutorial demonstrates how to read and write each format in the AWS Glue Studio visual editor, and then how to query the results from Athena.

Prerequisites

To continue this tutorial, you must create the following AWS resources in advance: an S3 bucket for your table data and job output, an IAM role for your AWS Glue jobs, and the connectors and connections for each data lake format, set up as described in the first post of this series (for example, hudi-0101-byoc-connection, delta-100-byoc-connection, and iceberg-0131-byoc-connection). You can use either the marketplace connectors or custom connectors, depending on your requirements.

Reads and writes using the connectors on the AWS Glue Studio visual editor

In the following sections, you read and write data in each data lake format on the AWS Glue Studio visual editor, using the connectors and connections above. For each format, there are three main configurations to set: the connection, the connection options, and the job parameters. Note that no code is included in this tutorial. Let's see how it works.

Apache Hudi writes

Complete the following steps to write into an Apache Hudi table using the connector (a PySpark sketch of a comparable write follows the steps):

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose Amazon S3.
  5. For Target, choose hudi-0101-byoc-connector.
  6. Choose Create.
  7. Under Visual, choose Data source – S3 bucket.
  8. Under Node properties, for S3 source type, choose S3 location.
  9. For S3 URL, enter s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/.
  10. Choose Data target – Connector.
  11. Under Node properties, for Connection, choose hudi-0101-byoc-connection.
  12. For Connection options, enter the following pairs of Key and Value (choose Add new option to enter a new pair).
    1. Key: path, Value: <Your S3 path for the Hudi table location>
    2. Key: hoodie.table.name, Value: test
    3. Key: hoodie.datasource.write.storage.type, Value: COPY_ON_WRITE
    4. Key: hoodie.datasource.write.operation, Value: upsert
    5. Key: hoodie.datasource.write.recordkey.field, Value: location
    6. Key: hoodie.datasource.write.precombine.field, Value: date
    7. Key: hoodie.datasource.write.partitionpath.field, Value: iso_code
    8. Key: hoodie.datasource.write.hive_style_partitioning, Value: true
    9. Key: hoodie.datasource.hive_sync.enable, Value: true
    10. Key: hoodie.datasource.hive_sync.database, Value: hudi
    11. Key: hoodie.datasource.hive_sync.table, Value: test
    12. Key: hoodie.datasource.hive_sync.partition_fields, Value: iso_code
    13. Key: hoodie.datasource.hive_sync.partition_extractor_class, Value: org.apache.hudi.hive.MultiPartKeysValueExtractor
    14. Key: hoodie.datasource.hive_sync.use_jdbc, Value: false
    15. Key: hoodie.datasource.hive_sync.mode, Value: hms
  13. Under Job details, for IAM Role, choose your IAM role.
  14. Under Advanced properties, for Job parameters, choose Add new parameter.
  15. For Key, enter --conf.
  16. For Value, enter spark.serializer=org.apache.spark.serializer.KryoSerializer.
  17. Choose Save.
  18. Choose Run.
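
For reference, the connection options above map one-to-one to the Hudi write options you would set in code. The following is a minimal PySpark sketch of a comparable write, not the code generated by the visual job; it assumes the Hudi connector is on the classpath and the KryoSerializer job parameter from steps 14-16 is set, and the table path is a placeholder.

    from pyspark.sql import SparkSession

    # Assumes the Hudi connector JARs are available and
    # --conf spark.serializer=org.apache.spark.serializer.KryoSerializer is set.
    spark = SparkSession.builder.getOrCreate()

    # Same public source dataset used in the visual job
    df = spark.read.json("s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/")

    # The same key-value pairs entered as connection options in step 12
    hudi_options = {
        "hoodie.table.name": "test",
        "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "location",
        "hoodie.datasource.write.precombine.field": "date",
        "hoodie.datasource.write.partitionpath.field": "iso_code",
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": "hudi",
        "hoodie.datasource.hive_sync.table": "test",
        "hoodie.datasource.hive_sync.partition_fields": "iso_code",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.mode": "hms",
    }

    # Placeholder path: replace with your S3 path for the Hudi table location
    df.write.format("hudi").options(**hudi_options).mode("append").save("s3://your_s3_bucket/hudi/test/")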

Apache Hudi reads

Complete the following steps to read from the Apache Hudi table that you created in the previous section using the connector (a matching read sketch follows the steps):

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose hudi-0101-byoc-connector.
  5. For Target, choose Amazon S3.
  6. Choose Create.
  7. Under Visual, choose Data source – Connection.
  8. Under Node properties, for Connection, choose hudi-0101-byoc-connection.
  9. For Connection options, choose Add new option.
  10. For Key, enter path. For Value, enter the S3 path for the Hudi table that you created in the previous section.
  11. Choose Transform – ApplyMapping, and choose Remove.
  12. Choose Data target – S3 bucket.
  13. Under Data target properties, for Format, choose JSON.
  14. For S3 Target type, choose S3 location.
  15. For S3 Target Location, enter your S3 path for the output location.
  16. Under Job details, for IAM Role, choose your IAM role.
  17. Choose Save.
  18. Choose Run.
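
As with the write, the following is a minimal PySpark sketch of a comparable read, assuming the Hudi connector is on the classpath; both paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the Hudi table back from its S3 location (placeholder path)
    df = spark.read.format("hudi").load("s3://your_s3_bucket/hudi/test/")

    # Mirror the S3 data target by writing the records out as JSON (placeholder output path)
    df.write.mode("overwrite").json("s3://your_s3_bucket/output/hudi/")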

Delta Lake writes

Complete the following steps to write into a Delta Lake table using the connector (an equivalent PySpark sketch follows the steps):

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose Amazon S3.
  5. For Target, choose delta-100-byoc-connector.
  6. Choose Create.
  7. Under Visual, choose Data source – S3 bucket.
  8. Under Node properties, for S3 source type, choose S3 location.
  9. For S3 URL, enter s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/.
  10. Choose Data target – Connector.
  11. Under Node properties, for Connection, choose your delta-100-byoc-connection.
  12. For Connection options, choose Add new option.
  13. For Key, enter path. For Value, enter your S3 path for the Delta table location. Choose Add new option.
  14. For Key, enter partitionKeys. For Value, enter iso_code.
  15. Under Job details, for IAM Role, choose your IAM role.
  16. Under Advanced properties, for Job parameters, choose Add new parameter.
  17. For Key, enter --conf.
  18. For Value, enter spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog.
  19. Choose Save.
  20. Choose Run.
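
The same write can be expressed in a few lines of PySpark. This is a minimal sketch of a comparable write, assuming the Delta Lake connector is on the classpath and the --conf settings from steps 16-18 are applied; the table path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.json("s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/")

    # partitionBy mirrors the partitionKeys connection option; replace the path with
    # your S3 path for the Delta table location
    (df.write.format("delta")
        .partitionBy("iso_code")
        .mode("overwrite")
        .save("s3://your_s3_bucket/delta/test/"))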

Delta Lake reads

Complete the following steps to read from the Delta Lake table that you created in the previous section using the connector (a read sketch follows the steps):

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose delta-100-byoc-connector.
  5. For Target, choose Amazon S3.
  6. Choose Create.
  7. Under Visual, choose Data source – Connection.
  8. Under Node properties, for Connection, choose delta-100-byoc-connection.
  9. For Connection options, choose Add new option.
  10. For Key, enter path. For Value, enter your S3 path for the Delta table that you created in the previous section. Choose Add new option.
  11. For Key, enter partitionKeys. For Value, enter iso_code.
  12. Choose Transform – ApplyMapping, and choose Remove.
  13. Choose Data target – S3 bucket.
  14. Under Data target properties, for Format, choose JSON.
  15. For S3 Target type, choose S3 location.
  16. For S3 Target Location, enter your S3 path for the output location.
  17. Under Job details, for IAM Role, choose your IAM role.
  18. Under Advanced properties, for Job parameters, choose Add new parameter.
  19. For Key, enter --conf.
  20. For Value, enter spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog.
  21. Choose Save.
  22. Choose Run.
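
The following is a minimal PySpark sketch of a comparable read, under the same classpath and --conf assumptions as the write; both paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the Delta table from its S3 location and write the records out as JSON
    df = spark.read.format("delta").load("s3://your_s3_bucket/delta/test/")
    df.write.mode("overwrite").json("s3://your_s3_bucket/output/delta/")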

Apache Iceberg writes

Complete the following steps to write into an Apache Iceberg table using the connector (an equivalent PySpark sketch follows the steps):

  1. Open the AWS Glue console.
  2. Choose Databases.
  3. Choose Add database.
  4. For database name, enter iceberg, and choose Create.
  5. Open AWS Glue Studio.
  6. Choose Jobs.
  7. Choose Visual with a source and target.
  8. For Source, choose Amazon S3.
  9. For Target, choose iceberg-0131-byoc-connector.
  10. Choose Create.
  11. Under Visual, choose Data source – S3 bucket.
  12. Under Node properties, for S3 source type, choose S3 location.
  13. For S3 URL, enter s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/.
  14. Choose Data target – Connector.
  15. Under Node properties, for Connection, choose iceberg-0131-byoc-connection.
  16. For Connection options, choose Add new option.
  17. For Key, enter path. For Value, enter glue_catalog.iceberg.test.
  18. Choose SQL under Transform to create a new AWS Glue Studio node.
  19. Under Node properties, for Node parents, choose ApplyMapping.
  20. Under Transform, for SQL alias, verify that myDataSource is entered.
  21. For SQL query, enter CREATE TABLE glue_catalog.iceberg.test AS SELECT * FROM myDataSource WHERE 1=2. This creates a table definition with no records, because the Iceberg target requires a table definition to exist before data ingestion.
  22. Under Job details, for IAM Role, choose your IAM role.
  23. Under Advanced properties, for Job parameters, choose Add new parameter.
  24. For Key, enter --conf.
  25. For Value, enter the following value (replace the placeholder your_s3_bucket with your S3 bucket name): spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://your_s3_bucket/iceberg/warehouse --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=iceberg_lock --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  26. Choose Save.
  27. Choose Run.
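
The SQL transform plus connector target above roughly correspond to the following minimal PySpark sketch of a comparable write. It assumes the Iceberg connector is on the classpath and the glue_catalog settings from steps 23-25 are passed as the --conf job parameter.

    from pyspark.sql import SparkSession

    # The glue_catalog catalog is defined through the --conf job parameter in step 25
    spark = SparkSession.builder.getOrCreate()

    df = spark.read.json("s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/")
    df.createOrReplaceTempView("myDataSource")

    # Create an empty table definition first (the Iceberg target requires it), then append the data
    spark.sql("CREATE TABLE glue_catalog.iceberg.test AS SELECT * FROM myDataSource WHERE 1=2")
    df.writeTo("glue_catalog.iceberg.test").append()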

Apache Iceberg reads

Complete the following steps to read from the Apache Iceberg table that you created in the previous section using the connector (a read sketch follows the steps):

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose Apache Iceberg Connector for AWS Glue 3.0.
  5. For Target, choose Amazon S3.
  6. Choose Create.
  7. Under Visual, choose Data source – Connection.
  8. Under Node properties, for Connection, choose your Iceberg connection name.
  9. For Connection options, choose Add new option.
  10. For Key, enter path. For Value, enter glue_catalog.iceberg.test.
  11. Choose Transform – ApplyMapping, and choose Remove.
  12. Choose Data target – S3 bucket.
  13. Under Data target properties, for Format, choose JSON.
  14. For S3 Target type, choose S3 location.
  15. For S3 Target Location, enter your S3 path for the output location.
  16. Under Job details, for IAM Role, choose your IAM role.
  17. Under Advanced properties, for Job parameters, choose Add new parameter.
  18. For Key, enter --conf.
  19. For Value, enter the following value (replace the placeholder your_s3_bucket with your S3 bucket name): spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://your_s3_bucket/iceberg/warehouse --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=iceberg_lock --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  20. Choose Save.
  21. Choose Run.
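
The following is a minimal PySpark sketch of a comparable read, under the same catalog assumptions as the write; the output path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the Iceberg table through the configured glue_catalog
    # (spark.table("glue_catalog.iceberg.test") also works)
    df = spark.read.format("iceberg").load("glue_catalog.iceberg.test")

    # Mirror the S3 data target by writing the records out as JSON
    df.write.mode("overwrite").json("s3://your_s3_bucket/output/iceberg/")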

Query from Athena

The Hudi and Iceberg tables created with the preceding instructions are also queryable from Athena.

  1. Open the Athena console.
  2. Run the following SQL to query the Hudi table:
    SELECT * FROM "hudi"."test" LIMIT 10

  3. Run the following SQL to query the Iceberg table:
    SELECT * FROM "iceberg"."test" LIMIT 10

If you want to query the Delta table from Athena, follow Presto, Trino, and Athena to Delta Lake integration using manifests.
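
That integration relies on symlink manifest files. As a hedged sketch of its first step, you could generate the manifests for the Delta table from a Spark session that has the Delta Lake connector configured (the table path is a placeholder); the linked documentation then shows how to define the Athena table on top of the generated _symlink_format_manifest folder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Generate symlink manifests under <table path>/_symlink_format_manifest/
    spark.sql("GENERATE symlink_format_manifest FOR TABLE delta.`s3://your_s3_bucket/delta/test/`")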

Conclusion

This post summarized how to use Apache Hudi, Delta Lake, and Apache Iceberg on the AWS Glue platform, and demonstrated how each format works with the AWS Glue Studio visual editor. You can start using these data lake formats easily with AWS Glue DynamicFrames, Spark DataFrames, and Spark SQL in AWS Glue jobs, AWS Glue Studio notebooks, and the AWS Glue Studio visual editor.


About the Author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys collaborating with different teams to deliver results like this post. In his spare time, he enjoys playing video games with his family.