All posts by Noritaka Sekiyama

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics: Part 2

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-2-enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics/

Monitoring data pipelines in real time is critical for catching issues early and minimizing disruptions. AWS Glue has made this more straightforward with the launch of AWS Glue job observability metrics, which provide valuable insights into your data integration pipelines built on AWS Glue. However, you might need to track key performance indicators across multiple jobs. In this case, a dashboard that can visualize the same metrics with the ability to drill down into individual issues is an effective solution to monitor at scale.

This post walks through how to integrate AWS Glue job observability metrics with Grafana using Amazon Managed Grafana. We discuss the types of metrics and charts available to surface key insights, along with two use cases on monitoring error classes and throughput of your AWS Glue jobs.

Solution overview

Grafana is an open source visualization tool that allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. With Grafana, you can create, explore, and share visually rich, data-driven dashboards. The new AWS Glue job observability metrics can be effortlessly integrated with Grafana for real-time monitoring purposes. Metrics like worker utilization, skewness, I/O rate, and errors are captured and visualized in easy-to-read Grafana dashboards. The integration with Grafana provides a flexible way to build custom views of pipeline health tailored to your needs. Observability metrics open up monitoring capabilities for AWS Glue that weren’t possible before. Companies relying on AWS Glue for critical data integration pipelines can have greater confidence that their pipelines are running efficiently.

AWS Glue job observability metrics are emitted as Amazon CloudWatch metrics. You can provision and manage Amazon Managed Grafana, and configure the CloudWatch plugin for the given metrics. The following diagram illustrates the solution architecture.
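Before building the dashboard, you can confirm that the observability metrics are arriving in the custom Glue namespace. The following is a minimal sketch using the AWS SDK for Python (Boto3); it assumes default credentials and Region and simply lists the available metrics and their dimensions.

import boto3

# List the AWS Glue observability metrics available in the custom "Glue"
# namespace, along with their dimensions.
cloudwatch = boto3.client("cloudwatch")
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="Glue"):
    for metric in page["Metrics"]:
        dimensions = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
        print(metric["MetricName"], dimensions)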

Implement the solution

Complete the following steps to set up the solution:

  1. Set up an Amazon Managed Grafana workspace.
  2. Sign in to your workspace.
  3. Choose Administration.
  4. Choose Add new data source.
  5. Choose CloudWatch.
  6. For Default Region, select your preferred AWS Region.
  7. For Namespaces of Custom Metrics, enter Glue.
  8. Choose Save & test.

Now the CloudWatch data source has been registered.

  1. Copy the data source ID from the URL https://g-XXXXXXXXXX.grafana-workspace.<region>.amazonaws.com/datasources/edit/<data-source-ID>/.

The next step is to prepare the JSON template file.

  1. Download the Grafana template.
  2. Replace <data-source-id> in the JSON file with your Grafana data source ID.
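If you prefer to script the replacement in step 2, the following minimal sketch performs the substitution; the file name glue-observability-dashboard.json and the data source ID value are placeholders.

# Replace the <data-source-id> placeholder in the downloaded Grafana template
# with your CloudWatch data source ID.
data_source_id = "your-grafana-data-source-id"  # placeholder
template_path = "glue-observability-dashboard.json"  # placeholder file name

with open(template_path) as f:
    template = f.read()

with open(template_path, "w") as f:
    f.write(template.replace("<data-source-id>", data_source_id))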

Lastly, configure the dashboard.

  1. On the Grafana console, choose Dashboards.
  2. Choose Import on the New menu.
  3. Upload your JSON file, and choose Import.

The Grafana dashboard visualizes AWS Glue observability metrics, as shown in the following screenshots.

The sample dashboard has the following charts:

  • [Reliability] Job Run Errors Breakdown
  • [Throughput] Bytes Read & Write
  • [Throughput] Records Read & Write
  • [Resource Utilization] Worker Utilization
  • [Job Performance] Skewness
  • [Resource Utilization] Disk Used (%)
  • [Resource Utilization] Disk Available (GB)
  • [Executor OOM] OOM Error Count
  • [Executor OOM] Heap Memory Used (%)
  • [Driver OOM] OOM Error Count
  • [Driver OOM] Heap Memory Used (%)

Analyze the causes of job failures

Let’s try analyzing the causes of job run failures of the job iot_data_processing.

First, look at the pie chart [Reliability] Job Run Errors Breakdown. This pie chart quickly identifies which errors are most common.

Then filter with the job name iot_data_processing to see the common errors for this job.

We can observe that the majority (75%) of failures were due to glue.error.DISK_NO_SPACE_ERROR.

Next, look at the line chart [Resource Utilization] Disk Used (%) to understand the disk usage during the job runs. For this job, the green line shows the driver’s disk usage, and the yellow line shows the average of the executors’ disk usage.

We can observe that there were three times when 100% of disk was used in executors.

Next, look at the line chart [Throughput] Records Read & Write to see whether the data volume changed and whether it impacted disk usage.

The chart shows that around four billion records were read at the beginning of this range; however, around 63 billion records were read at the peak. This means that the incoming data volume increased significantly, which caused a local disk space shortage in the worker nodes. For such cases, you can increase the number of workers, enable auto scaling, or choose larger worker types.

After implementing those suggestions, we can see lower disk usage and a successful job run.

(Optional) Configure cross-account setup

We can optionally configure a cross-account setup. Cross-account metrics depend on CloudWatch cross-account observability. In this setup, we expect the following environment:

  • AWS accounts are not managed in AWS Organizations
  • You have two accounts: one account is used as the monitoring account where Grafana is located, and the other account is used as the source account where the AWS Glue-based data integration pipeline is located

To configure a cross-account setup for this environment, complete the following steps for each account.

Monitoring account

Complete the following steps to configure your monitoring account:

  1. Sign in to the AWS Management Console using the account you will use for monitoring.
  2. On the CloudWatch console, choose Settings in the navigation pane.
  3. Under Monitoring account configuration, choose Configure.
  4. For Select data, choose Metrics.
  5. For List source accounts, enter the AWS account ID of the source account that this monitoring account will view.
  6. For Define a label to identify your source account, choose Account name.
  7. Choose Configure.

Now the account is successfully configured as a monitoring account.

  1. Under Monitoring account configuration, choose Resources to link accounts.
  2. Choose Any account to get a URL for setting up individual accounts as source accounts.
  3. Choose Copy URL.

You will use the copied URL from the source account in the next steps.

Source account

Complete the following steps to configure your source account:

  1. Sign in to the console using your source account.
  2. Open the URL that you copied from the monitoring account.

You can see the CloudWatch settings page, with some information filled in.

  1. For Select data, choose Metrics.
  2. Do not change the ARN in Enter monitoring account configuration ARN.
  3. The Define a label to identify your source account section is pre-filled with the label choice from the monitoring account. Optionally, choose Edit to change it.
  4. Choose Link.
  5. Enter Confirm in the box and choose Confirm.

Now your source account has been configured to link to the monitoring account. The metrics emitted in the source account will show on the Grafana dashboard in the monitoring account.
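If you prefer to automate this cross-account setup, the CloudWatch Observability Access Manager (OAM) APIs provide the same wiring. The following is a minimal sketch using Boto3; the account ID, sink name, and label template are placeholders, and each part would run with credentials for the respective account.

import json
import boto3

# Monitoring account: create a sink and allow the source account to link its metrics.
oam_monitoring = boto3.client("oam")  # credentials for the monitoring account
sink = oam_monitoring.create_sink(Name="glue-observability-sink")
oam_monitoring.put_sink_policy(
    SinkIdentifier=sink["Arn"],
    Policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "111122223333"},  # source account ID (placeholder)
            "Action": ["oam:CreateLink", "oam:UpdateLink"],
            "Resource": "*",
        }],
    }),
)

# Source account: link to the monitoring account's sink for CloudWatch metrics.
oam_source = boto3.client("oam")  # credentials for the source account
oam_source.create_link(
    LabelTemplate="$AccountName",
    ResourceTypes=["AWS::CloudWatch::Metric"],
    SinkIdentifier=sink["Arn"],
)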

To learn more, see CloudWatch cross-account observability.

Considerations

The following are some considerations when using this solution:

  • The Grafana integration is designed for real-time monitoring. If you have a basic understanding of your jobs, it will be straightforward for you to monitor performance, errors, and more on the Grafana dashboard.
  • Amazon Managed Grafana depends on AWS IAM Identity Center. This means you need to manage single sign-on (SSO) users separately, not just AWS Identity and Access Management (IAM) users and roles. It also requires another sign-in step from the AWS console. The Amazon Managed Grafana pricing model depends on an active user license per workspace. More users can cause more charges.
  • Graph lines are visualized per job. If you want to see the lines across all the jobs, you can choose ALL in the control.

Conclusion

AWS Glue job observability metrics offer a powerful new capability for monitoring data pipeline performance in real time. By streaming key metrics to CloudWatch and visualizing them in Grafana, you gain more fine-grained visibility that wasn’t possible before. This post showed how straightforward it is to enable observability metrics and integrate the data with Grafana using Amazon Managed Grafana. We explored the different metrics available and how to build customized Grafana dashboards to surface actionable insights.

Observability is now an essential part of robust data orchestration on AWS. With the ability to monitor data integration trends in real time, you can optimize costs, performance, and reliability.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Xiaoxi Liu is a Software Development Engineer on the AWS Glue team. Her passion is building scalable distributed systems for efficiently managing big data on the cloud, and her concentrations are distributed system, big data, and cloud computing.

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open source software and distributed systems. In his spare time, he enjoys playing arcade games.

Shenoda Guirguis is a Senior Software Development Engineer on the AWS Glue team. His passion is in building scalable and distributed data infrastructure and processing systems. When he gets a chance, Shenoda enjoys reading and playing soccer.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18-year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems to enable customers with interactive and simple to use interfaces to efficiently manage and transform petabytes of data seamlessly across data lakes on Amazon S3, databases and data-warehouses on cloud.

New Amazon CloudWatch log class to cost-effectively scale your AWS Glue workloads

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/new-amazon-cloudwatch-log-class-to-cost-effectively-scale-your-aws-glue-workloads/

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores.

One of the most common questions we get from customers is how to effectively optimize costs on AWS Glue. Over the years, we have built multiple features and tools to help customers manage their AWS Glue costs. For example, AWS Glue Auto Scaling and AWS Glue Flex can help you reduce the compute cost associated with processing your data. AWS Glue interactive sessions and notebooks can help you reduce the cost of developing your ETL jobs. For more information about cost-saving best practices, refer to Monitor and optimize cost on AWS Glue for Apache Spark. Additionally, to understand data transfer costs, refer to the Cost Optimization Pillar defined in AWS Well-Architected Framework. For data storage, you can apply general best practices defined for each data source. For a cost optimization strategy using Amazon Simple Storage Service (Amazon S3), refer to Optimizing storage costs using Amazon S3.

In this post, we tackle the remaining piece—the cost of logs written by AWS Glue.

Before we get into the cost analysis of logs, let’s understand the reasons to enable logging for your AWS Glue job and the current options available. When you start an AWS Glue job, it sends real-time logging information to Amazon CloudWatch (every 5 seconds and before each executor stops) while the Spark application is running. You can view the logs on the AWS Glue console or the CloudWatch console dashboard. These logs provide you with insights into your job runs and help you optimize and troubleshoot your AWS Glue jobs. AWS Glue offers a variety of filters and settings to reduce the verbosity of your logs. As the number of job runs increases, so does the volume of logs generated.

To optimize CloudWatch Logs costs, AWS recently announced a new log class for infrequently accessed logs called Amazon CloudWatch Logs Infrequent Access (Logs IA). This new log class offers a tailored set of capabilities at a lower cost for infrequently accessed logs, enabling you to consolidate all your logs in one place in a cost-effective manner. This class provides a more cost-effective option for ingesting logs that only need to be accessed occasionally for auditing or debugging purposes.

In this post, we explain what the Logs IA class is, how it can help reduce costs compared to the standard log class, and how to configure your AWS Glue resources to use this new log class. By routing logs to Logs IA, you can achieve significant savings in your CloudWatch Logs spend without sacrificing access to important debugging information when you need it.

CloudWatch log groups used by AWS Glue job continuous logging

When continuous logging is enabled, AWS Glue for Apache Spark writes Spark driver/executor logs and progress bar information into the following log group:

/aws-glue/jobs/logs-v2

If a security configuration is enabled for CloudWatch logs, AWS Glue for Apache Spark will create a log group named as follows for continuous logs:

<Log-Group-Name>-<Security-Configuration-Name>

The default and custom log groups will be as follows:

  • The default continuous log group will be /aws-glue/jobs/logs-v2-<Security-Configuration-Name>
  • The custom continuous log group will be <custom-log-group-name>-<Security-Configuration-Name>

You can provide a custom log group name through the job parameter --continuous-log-logGroup.

Getting started with the new Infrequent Access log class for AWS Glue workloads

To gain the benefits from Logs IA for your AWS Glue workloads, you need to complete the following two steps:

  1. Create a new log group using the new Logs IA class.
  2. Configure your AWS Glue job to point to the new log group.

Complete the following steps to create a new log group using the new Infrequent Access log class:

  1. On the CloudWatch console, choose Log groups under Logs in the navigation pane.
  2. Choose Create log group.
  3. For Log group name, enter /aws-glue/jobs/logs-v2-infrequent-access.
  4. For Log class, choose Infrequent Access.
  5. Choose Create.
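If you prefer to create the log group programmatically, the following minimal sketch calls the CloudWatch Logs CreateLogGroup API with the Infrequent Access log class; it assumes default credentials and Region.

import boto3

# Create a log group that uses the Infrequent Access log class.
logs = boto3.client("logs")
logs.create_log_group(
    logGroupName="/aws-glue/jobs/logs-v2-infrequent-access",
    logGroupClass="INFREQUENT_ACCESS",
)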

Complete the following steps to configure your AWS Glue job to point to the new log group:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose your job.
  3. On the Job details tab, choose Add new parameter under Job parameters.
  4. For Key, enter --continuous-log-logGroup.
  5. For Value, enter /aws-glue/jobs/logs-v2-infrequent-access.
  6. Choose Save.
  7. Choose Run to trigger the job.

New log events are written into the new log group.
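You can also set the parameter when you start the job from the SDK. The following minimal sketch overrides the continuous logging log group for a single run; the job name is a placeholder, and to make the change permanent you would set the same key in the job’s DefaultArguments instead.

import boto3

# Start a job run that writes continuous logs to the Infrequent Access log group.
glue = boto3.client("glue")
glue.start_job_run(
    JobName="your-glue-job",  # placeholder
    Arguments={"--continuous-log-logGroup": "/aws-glue/jobs/logs-v2-infrequent-access"},
)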

View the logs with the Infrequent Access log class

Now you’re ready to view the logs with the Infrequent Access log class. Open the log group /aws-glue/jobs/logs-v2-infrequent-access on the CloudWatch console.

When you choose one of the log streams, you will notice that it redirects you to the CloudWatch console Logs Insight page with a pre-configured default command and your log stream selected by default. By choosing Run query, you can view the actual log events on the Logs Insights page.
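You can also query the log group from the SDK through CloudWatch Logs Insights. The following is a minimal sketch that runs a simple query against the new log group and prints the results; the time range and query string are examples.

import time
import boto3

logs = boto3.client("logs")

# Run a Logs Insights query over the last hour of the Infrequent Access log group.
query = logs.start_query(
    logGroupName="/aws-glue/jobs/logs-v2-infrequent-access",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | sort @timestamp desc | limit 20",
)

# Poll until the query finishes, then print each result row.
results = logs.get_query_results(queryId=query["queryId"])
while results["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    results = logs.get_query_results(queryId=query["queryId"])

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})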

Considerations

Keep in mind the following considerations:

  • You cannot change the log class of a log group after it’s created. You need to create a new log group to configure the Infrequent Access class.
  • The Logs IA class offers a subset of CloudWatch Logs capabilities, including managed ingestion, storage, cross-account log analytics, and encryption with a lower ingestion price per GB. For example, you can’t view log events through the standard CloudWatch Logs console. To learn more about the features offered across both log classes, refer to Log Classes.

Conclusion

This post provided step-by-step instructions to guide you through enabling Logs IA for your AWS Glue job logs. If your AWS Glue ETL jobs generate large volumes of log data that makes it a challenge as you scale your applications, the best practices demonstrated in this post can help you cost-effectively scale while centralizing all your logs in CloudWatch Logs. Start using the Infrequent Access class with your AWS Glue workloads today and enjoy the cost benefits.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Abeetha Bala is a Senior Product Manager for Amazon CloudWatch, primarily focused on logs. Being customer obsessed, she solves observability challenges through innovative and cost-effective ways.

Kinshuk Pahare is a leader in AWS Glue’s product management team. He drives efforts on the platform, developer experience, and big data processing frameworks like Apache Spark, Ray, and Python Shell.

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/build-and-manage-your-modern-data-stack-using-dbt-and-aws-glue-through-dbt-glue-the-new-trusted-dbt-adapter/

dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt focuses on the transform layer of extract, load, transform (ELT) or extract, transform, load (ETL) processes across data warehouses and databases through specific engine adapters to achieve extract and load functionality. It enables data engineers, data scientists, and analytics engineers to define the business logic with SQL select statements and eliminates the need to write boilerplate data manipulation language (DML) and data definition language (DDL) expressions. dbt lets data engineers quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, continuous integration and continuous delivery (CI/CD), and documentation.

dbt is predominantly used by customers of data warehouses (such as Amazon Redshift) who are looking to keep their data transform logic separate from storage and engine. We have seen strong customer demand to expand its scope to cloud-based data lakes because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities.

In 2022, AWS published a dbt adapter called dbt-glue—the open source, battle-tested dbt AWS Glue adapter that allows data engineers to use dbt for cloud-based data lakes along with data warehouses and databases, paying for just the compute they need. The dbt-glue adapter democratized access for dbt users to data lakes, and enabled many users to effortlessly run their transformation workloads on the cloud with the serverless data integration capability of AWS Glue. Since the launch of the adapter, AWS has continued investing in dbt-glue to cover more requirements.

Today, we are pleased to announce that the dbt-glue adapter is now a trusted adapter based on our strategic collaboration with dbt Labs. Trusted adapters are adapters not maintained by dbt Labs, but that dbt Labs is comfortable recommending to users for use in production.

The key capabilities of the dbt-glue adapter are as follows:

  • Runs SQL as Spark SQL on AWS Glue interactive sessions
  • Manages table definitions on the AWS Glue Data Catalog
  • Supports open table formats such as Apache Hudi, Delta Lake, and Apache Iceberg
  • Supports AWS Lake Formation permissions for fine-grained access control

In addition to those capabilities, the dbt-glue adapter is designed to optimize resource utilization with several techniques on top of AWS Glue interactive sessions.

This post demonstrates how the dbt-glue adapter helps your workload, and how you can build a modern data stack using dbt and AWS Glue using the dbt-glue adapter.

Common use cases

One common use case for using dbt-glue is if a central analytics team at a large corporation is responsible for monitoring operational efficiency. They ingest application logs into raw Parquet tables in an Amazon Simple Storage Service (Amazon S3) data lake. Additionally, they extract organized data from operational systems capturing the company’s organizational structure and costs of diverse operational components that they stored in the raw zone using Iceberg tables to maintain the original schema, facilitating easy access to the data. The team uses dbt-glue to build a transformed gold model optimized for business intelligence (BI). The gold model joins the technical logs with billing data and organizes the metrics per business unit. The gold model uses Iceberg’s ability to support data warehouse-style modeling needed for performant BI analytics in a data lake. The combination of Iceberg and dbt-glue allows the team to efficiently build a data model that’s ready to be consumed.

Another common use case is when an analytics team in a company that has an S3 data lake creates a new data product in order to enrich its existing data from its data lake with medical data. Let’s say that this company is located in Europe and the data product must comply with the GDPR. For this, the company uses Iceberg to meet needs such as the right to be forgotten and the deletion of data. The company uses dbt to model its data product on its existing data lake due to its compatibility with AWS Glue and Iceberg and the simplicity that the dbt-glue adapter brings to the use of this storage format.

How dbt and dbt-glue work

The following are key dbt features:

  • Project – A dbt project enforces a top-level structure on the staging, models, permissions, and adapters. A project can be checked into a GitHub repo for version control.
  • SQL – dbt relies on SQL select statements for defining data transformation logic. Instead of raw SQL, dbt offers templatized SQL (using Jinja) that allows code modularity. Instead of having to copy/paste SQL in multiple places, data engineers can define modular transforms and call those from other places within the project. Having a modular pipeline helps data engineers collaborate on the same project.
  • Models – dbt models are primarily written as a SELECT statement and saved as a .sql file. Data engineers define dbt models for their data representations. To learn more, refer to About dbt models.
  • Materializations – Materializations are strategies for persisting dbt models in a warehouse. There are five types of materializations built into dbt: table, view, incremental, ephemeral, and materialized view. To learn more, refer to Materializations and Incremental models.
  • Data lineage – dbt tracks data lineage, allowing you to understand the origin of data and how it flows through different transformations. dbt also supports impact analysis, which helps identify the downstream effects of changes.

The high-level data flow is as follows:

  1. Data engineers ingest data from data sources to raw tables and define table definitions for the raw tables.
  2. Data engineers write dbt models with templatized SQL.
  3. The dbt adapter converts dbt models to SQL statements compatible with the target data warehouse.
  4. The data warehouse runs the SQL statements to create intermediate tables or final tables, views, or materialized views.

The following diagram illustrates the architecture.

dbt-glue works with the following steps:

  1. The dbt-glue adapter converts dbt models to SQL statements compatible with Spark SQL.
  2. AWS Glue interactive sessions run the SQL statements to create intermediate tables or final tables, views, or materialized views.
  3. dbt-glue supports csv, parquet, hudi, delta, and iceberg as file formats.
  4. On the dbt-glue adapter, table and incremental are commonly used for materializations at the destination. There are three strategies for incremental materialization. The merge strategy requires hudi, delta, or iceberg. With the other two strategies, append and insert_overwrite, you can use csv, parquet, hudi, delta, or iceberg.

The following diagram illustrates this architecture.

Example use case

In this post, we use the data from the New York City Taxi Records dataset. This dataset is available in the Registry of Open Data on AWS (RODA), which is a repository containing public datasets from AWS resources. The raw Parquet table records in this dataset stores trip records.

The objective is to create the following three tables, which contain metrics based on the raw table:

  • silver_avg_metrics – Basic metrics based on NYC Taxi Open Data for the year 2016
  • gold_passengers_metrics – Metrics per passenger based on the silver metrics table
  • gold_cost_metrics – Metrics per cost based on the silver metrics table

The final goal is to create two well-designed gold tables that store already aggregated results in Iceberg format for ad hoc queries through Amazon Athena.

Prerequisites

The instructions require the following prerequisites:

  • An AWS Identity and Access Management (IAM) role with all the mandatory permissions to run an AWS Glue interactive session and the dbt-glue adapter
  • An AWS Glue database and table to store the metadata related to the NYC taxi records dataset
  • An S3 bucket to use as output and store the processed data
  • An Athena configuration (a workgroup and S3 bucket to store the output) to explore the dataset
  • An AWS Lambda function (created as an AWS CloudFormation custom resource) that updates all the partitions in the AWS Glue table

With these prerequisites, we simulate the situation that data engineers have already ingested data from data sources to raw tables, and defined table definitions for the raw tables.

For ease of use, we prepared a CloudFormation template. This template deploys all the required infrastructure. To create these resources, choose Launch Stack in the us-east-1 Region, and follow the instructions:

Install dbt, the dbt CLI, and the dbt adapter

The dbt CLI is a command line interface for running dbt projects. It’s free to use and available as an open source project. Install dbt and the dbt CLI with the following code:

$ pip3 install --no-cache-dir dbt-core

For more information, refer to How to install dbt, What is dbt?, and Viewpoint.

Install the dbt adapter with the following code:

$ pip3 install --no-cache-dir dbt-glue

Create a dbt project

Complete the following steps to create a dbt project:

  1. Run the dbt init command to create and initialize a new empty dbt project:
    $ dbt init

  2. For the project name, enter dbt_glue_demo.
  3. For the database, choose glue.

Now the empty project has been created. The directory structure is shown as follows:

$ cd dbt_glue_demo 
$ tree .
.
├── README.md
├── analyses
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── seeds
├── snapshots
└── tests

Create a source

The next step is to create a source table definition. We add models/source_tables.yml with the following contents:

version: 2

sources:
  - name: data_source
    schema: nyctaxi

    tables:
      - name: records

This source definition corresponds to the AWS Glue table nyctaxi.records, which we created in the CloudFormation stack.

Create models

In this step, we create a dbt model that represents the average values for trip duration, passenger count, trip distance, and total amount of charges. Complete the following steps:

  1. Create the models/silver/ directory.
  2. Create the file models/silver/silver_avg_metrics.sql with the following contents:
    WITH source_avg as ( 
        SELECT avg((CAST(dropoff_datetime as LONG) - CAST(pickup_datetime as LONG))/60) as avg_duration 
        , avg(passenger_count) as avg_passenger_count 
        , avg(trip_distance) as avg_trip_distance 
        , avg(total_amount) as avg_total_amount
        , year
        , month 
        , type
        FROM {{ source('data_source', 'records') }} 
        WHERE year = "2016"
        AND dropoff_datetime is not null 
        GROUP BY year, month, type
    ) 
    SELECT *
    FROM source_avg

  3. Create the file models/silver/schema.yml with the following contents:
    version: 2
    
    models:
      - name: silver_avg_metrics
        description: This table has basic metrics based on NYC Taxi Open Data for the year 2016
    
        columns:
          - name: avg_duration
            description: The average duration of a NYC Taxi trip
    
          - name: avg_passenger_count
            description: The average number of passengers per NYC Taxi trip
    
          - name: avg_trip_distance
            description: The average NYC Taxi trip distance
    
          - name: avg_total_amount
            description: The average amount of a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip 
    
          - name: type
            description: The type of the NYC Taxi 

  4. Create the models/gold/ directory.
  5. Create the file models/gold/gold_cost_metrics.sql with the following contents:
    {{ config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=["year", "month", "type"],
        file_format='iceberg',
        iceberg_expire_snapshots='False',
        table_properties={'format-version': '2'}
    ) }}
    SELECT (avg_total_amount/avg_trip_distance) as avg_cost_per_distance
    , (avg_total_amount/avg_duration) as avg_cost_per_minute
    , year
    , month 
    , type 
    FROM {{ ref('silver_avg_metrics') }}

  6. Create the file models/gold/gold_passengers_metrics.sql with the following contents:
    {{ config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=["year", "month", "type"],
        file_format='iceberg',
        iceberg_expire_snapshots='False',
        table_properties={'format-version': '2'}
    ) }}
    SELECT (avg_total_amount/avg_passenger_count) as avg_cost_per_passenger
    , (avg_duration/avg_passenger_count) as avg_duration_per_passenger
    , (avg_trip_distance/avg_passenger_count) as avg_trip_distance_per_passenger
    , year
    , month 
    , type 
    FROM {{ ref('silver_avg_metrics') }}

  7. Create the file models/gold/schema.yml with the following contents:
    version: 2
    
    models:
      - name: gold_cost_metrics
        description: This table has metrics per cost based on NYC Taxi Open Data
    
        columns:
          - name: avg_cost_per_distance
            description: The average cost per distance of a NYC Taxi trip
    
          - name: avg_cost_per_minute
            description: The average cost per minute of a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip
    
          - name: type
            description: The type of the NYC Taxi
    
      - name: gold_passengers_metrics
        description: This table has metrics per passenger based on NYC Taxi Open Data
    
        columns:
          - name: avg_cost_per_passenger
            description: The average cost per passenger for a NYC Taxi trip
    
          - name: avg_duration_per_passenger
            description: The average duration per passenger of a NYC Taxi trip
    
          - name: avg_trip_distance_per_passenger
            description: The average NYC Taxi trip distance per passenger
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip 
    
          - name: type
            description: The type of the NYC Taxi

  8. Remove the models/example/ folder, because it’s just an example created in the dbt init command.

Configure the dbt project

dbt_project.yml is a key configuration file for dbt projects. It contains the following code:

models:
  dbt_glue_demo:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

We configure dbt_project.yml to replace the preceding code with the following:

models:
  dbt_glue_demo:
    silver:
      +materialized: table

This is because we want to materialize the models under silver as Parquet tables.

Configure a dbt profile

A dbt profile is a configuration that specifies how to connect to a particular database. The profiles are defined in the profiles.yml file within a dbt project.

Complete the following steps to configure a dbt profile:

  1. Create the profiles directory.
  2. Create the file profiles/profiles.yml with the following contents:
    dbt_glue_demo:
      target: dev
      outputs:
        dev:
          type: glue
          query-comment: demo-nyctaxi
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 5
          worker_type: G.1X
          schema: "dbt_glue_demo_nyc_metrics"
          database: "dbt_glue_demo_nyc_metrics"
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"

  3. Create the profiles/iceberg/ directory.
  4. Create the file profiles/iceberg/profiles.yml with the following contents:
    dbt_glue_demo:
      target: dev
      outputs:
        dev:
          type: glue
          query-comment: demo-nyctaxi
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 5
          worker_type: G.1X
          schema: "dbt_glue_demo_nyc_metrics"
          database: "dbt_glue_demo_nyc_metrics"
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"
          datalake_formats: "iceberg"
          conf: --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse="{{ env_var('DBT_S3_LOCATION') }}warehouse/" --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

The last two lines are added for setting Iceberg configurations on AWS Glue interactive sessions.

Run the dbt project

Now it’s time to run the dbt project. Complete the following steps:

  1. To run the dbt project, you need to be in the project folder:
    $ cd dbt_glue_demo

  2. The project requires you to set environment variables in order to run on the AWS account:
    $ export DBT_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query "Account" --output text):role/GlueInteractiveSessionRole"
    $ export DBT_S3_LOCATION="s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1"

  3. Make sure the profile is set up correctly from the command line:
    $ dbt debug --profiles-dir profiles
    ...
    05:34:22 Connection test: [OK connection ok]
    05:34:22 All checks passed!

If you see any failures, check if you provided the correct IAM role ARN and S3 location in Step 2.

  1. Run the models with the following code:
    $ dbt run -m silver --profiles-dir profiles
    $ dbt run -m gold --profiles-dir profiles/iceberg/

Now the tables are successfully created in the AWS Glue Data Catalog, and the data is materialized in the Amazon S3 location.

You can verify those tables by opening the AWS Glue console, choosing Databases in the navigation pane, and opening dbt_glue_demo_nyc_metrics.
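If you prefer to verify the results from the SDK instead of the console, the following minimal sketch lists the tables that the dbt run registered in the Data Catalog; it assumes default credentials and the us-east-1 Region used in this post.

import boto3

# List the tables created by dbt in the target database.
glue = boto3.client("glue", region_name="us-east-1")
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="dbt_glue_demo_nyc_metrics"):
    for table in page["TableList"]:
        print(table["Name"])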

Query materialized tables through Athena

Let’s query the target table using Athena to verify the materialized tables. Complete the following steps:

  1. On the Athena console, switch the workgroup to athena-dbt-glue-aws-blog.
  2. If the workgroup athena-dbt-glue-aws-blog settings dialog box appears, choose Acknowledge.
  3. Use the following query to explore the metrics created by the dbt project:
    SELECT cm.avg_cost_per_minute
        , cm.avg_cost_per_distance
        , pm.avg_cost_per_passenger
        , cm.year
        , cm.month
        , cm.type
    FROM "dbt_glue_demo_nyc_metrics"."gold_passengers_metrics" pm
    LEFT JOIN "dbt_glue_demo_nyc_metrics"."gold_cost_metrics" cm
        ON cm.type = pm.type
        AND cm.year = pm.year
        AND cm.month = pm.month
    WHERE cm.type = 'yellow'
        AND cm.year = '2016'
        AND cm.month = '6'

The following screenshot shows the results of this query.

Review dbt documentation

Complete the following steps to review your documentation:

  1. Generate the following documentation for the project:
    $ dbt docs generate --profiles-dir profiles/iceberg
    11:41:51  Running with dbt=1.7.1
    11:41:51  Registered adapter: glue=1.7.1
    11:41:51  Unable to do partial parsing because profile has changed
    11:41:52  Found 3 models, 1 source, 0 exposures, 0 metrics, 478 macros, 0 groups, 0 semantic models
    11:41:52  
    11:41:53  Concurrency: 1 threads (target='dev')
    11:41:53  
    11:41:53  Building catalog
    11:43:32  Catalog written to /Users/username/Documents/workspace/dbt_glue_demo/target/catalog.json

  2. Run the following command to open the documentation on your browser:
    $ dbt docs serve --profiles-dir profiles/iceberg

  3. In the navigation pane, choose gold_cost_metrics under dbt_glue_demo/models/gold.

You can see the detailed view of the model gold_cost_metrics, as shown in the following screenshot.

  1. To see the lineage graph, choose the circle icon at the bottom right.

Clean up

To clean up your environment, complete the following steps:

  1. Delete the database created by dbt:
    $ aws glue delete-database --name dbt_glue_demo_nyc_metrics

  2. Delete all generated data:
    $ aws s3 rm s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive
    $ aws s3 rm s3://aws-athena-dbt-glue-query-results-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive

  3. Delete the CloudFormation stack:
    $ aws cloudformation delete-stack --stack-name dbt-demo

Conclusion

This post demonstrated how the dbt-glue adapter helps your workload, and how you can build a modern data stack using dbt and AWS Glue using the dbt-glue adapter. You learned the end-to-end operations and data flow for data engineers to build and manage a data stack using dbt and the dbt-glue adapter. To report issues or request a feature enhancement, feel free to open an issue on GitHub.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team at Amazon Web Services. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Benjamin Menuet is a Senior Data Architect on the AWS Professional Services team at Amazon Web Services. He helps customers develop data and analytics solutions to accelerate their business outcomes. Outside of work, Benjamin is a trail runner and has finished some iconic races like the UTMB.

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open source software and distributed systems. In his spare time, he enjoys playing arcade games.

Kinshuk Pahare is a Principal Product Manager on the AWS Glue team at Amazon Web Services.

Jason Ganz is the manager of the Developer Experience (DX) team at dbt Labs.

Introducing Apache Hudi support with AWS Glue crawlers

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-apache-hudi-support-with-aws-glue-crawlers/

Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. Data engineers use Apache Hudi for streaming workloads as well as to create efficient incremental data pipelines. Hudi provides tables, transactions, efficient upserts and deletes, advanced indexes, streaming ingestion services, data clustering and compaction optimizations, and concurrency control, all while keeping your data in open source file formats. Hudi’s advanced performance optimizations make analytical workloads faster with any of the popular query engines including Apache Spark, Presto, Trino, Hive, and so on.

Many AWS customers adopted Apache Hudi on their data lakes built on top of Amazon S3 using AWS Glue, a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. AWS Glue Crawler is a component of AWS Glue, which allows you to create table metadata from data content automatically without requiring manual definition of the metadata.

AWS Glue crawlers now support Apache Hudi tables, simplifying the adoption of the AWS Glue Data Catalog as the catalog for Hudi tables. One typical use case is to register Hudi tables that do not yet have catalog table definitions. Another typical use case is migration from other Hudi catalogs, such as Hive metastore. When migrating from other Hudi catalogs, you can create and schedule an AWS Glue crawler and provide one or more Amazon S3 paths where the Hudi table files are located. You have the option to provide the maximum depth of Amazon S3 paths that the AWS Glue crawler can traverse. With each run, AWS Glue crawlers will extract schema and partition information and update the AWS Glue Data Catalog with the schema and partition changes. AWS Glue crawlers also update the latest metadata file location in the AWS Glue Data Catalog, which AWS analytical engines can directly use.

With this launch, you can create and schedule an AWS Glue crawler to register Hudi tables in AWS Glue Data Catalog. You can then provide one or multiple Amazon S3 paths where the Hudi tables are located. You have the option to provide the maximum depth of Amazon S3 paths that crawlers can traverse. With each crawler run, the crawler inspects each of the S3 paths and catalogs the schema information, such as new tables, deletes, and updates to schemas in the AWS Glue Data Catalog. Crawlers inspect partition information and add newly added partitions to AWS Glue Data Catalog. Crawlers also update the latest metadata file location in the AWS Glue Data Catalog that AWS analytical engines can directly use.
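As an alternative to the console steps later in this post, you can also define a Hudi crawler through the CreateCrawler API. The following is a minimal sketch using Boto3; the role ARN, bucket name, and traversal depth are placeholders.

import boto3

# Create a crawler with a Hudi target pointing at the table path on Amazon S3.
glue = boto3.client("glue")
glue.create_crawler(
    Name="hudi_cow_crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    DatabaseName="hudi_crawler_blog",
    Targets={
        "HudiTargets": [
            {
                "Paths": ["s3://your_s3_bucket/data/sample_hudi_cow_table/"],
                "MaximumTraversalDepth": 10,  # placeholder depth
            }
        ]
    },
)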

This post demonstrates how this new capability to crawl Hudi tables works.

How AWS Glue crawler works with Hudi tables

Hudi tables have two categories, with specific implications for each:

  • Copy on write (CoW) – Data is stored in a columnar format (Parquet), and each update creates a new version of files during a write.
  • Merge on read (MoR) – Data is stored using a combination of columnar (Parquet) and row-based (Avro) formats. Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files.

With CoW datasets, each time there is an update to a record, the file that contains the record is rewritten with the updated values. With a MoR dataset, each time there is an update, Hudi writes only the row for the changed record. MoR is better suited for write- or change-heavy workloads with fewer reads. CoW is better suited for read-heavy workloads on data that change less frequently.

Hudi provides three query types for accessing the data:

  • Snapshot queries – Queries that see the latest snapshot of the table as of a given commit or compaction action. For MoR tables, snapshot queries expose the most recent state of the table by merging the base and delta files of the latest file slice at the time of the query.
  • Incremental queries – Queries only see new data written to the table, since a given commit or compaction. This effectively provides change streams to enable incremental data pipelines.
  • Read optimized queries – For MoR tables, queries see the latest data compacted. For CoW tables, queries see the latest data committed.

For copy-on-write tables, crawlers create a single table in the AWS Glue Data Catalog with the ReadOptimized Serde  org.apache.hudi.hadoop.HoodieParquetInputFormat.

For merge-on-read tables, crawlers create two tables in AWS Glue Data Catalog for the same table location:

  • A table with suffix _ro, which uses the ReadOptimized Serde org.apache.hudi.hadoop.HoodieParquetInputFormat
  • A table with suffix _rt, which uses the RealTime Serde allowing for Snapshot queries: org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat

During each crawl, for each Hudi path provided, crawlers make an Amazon S3 list API call, filter based on the .hoodie folders, and find the most recent metadata file under that Hudi table metadata folder.

Crawl a Hudi CoW table using AWS Glue crawler

In this section, let’s go through how to crawl a Hudi CoW using AWS Glue crawlers.

Prerequisites

Here are the prerequisites for this tutorial:

  1. Install and configure AWS Command Line Interface (AWS CLI).
  2. Create your S3 bucket if you do not have it.
  3. Create your IAM role for AWS Glue if you do not have it. You need s3:GetObject for s3://your_s3_bucket/data/sample_hudi_cow_table/.
  4. Run the following command to copy the sample Hudi table into your S3 bucket. (Replace your_s3_bucket with your S3 bucket name.)
$ aws s3 sync s3://aws-bigdata-blog/artifacts/hudi-crawler/product_cow/ s3://your_s3_bucket/data/sample_hudi_cow_table/

These instructions guide you through copying sample data, but you can create any Hudi table easily using AWS Glue. Learn more in Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor.

Create a Hudi crawler

In this instruction, create the crawler through the console. Complete the following steps to create a Hudi crawler:

  1. On the AWS Glue console, choose Crawlers.
  2. Choose Create crawler.
  3. For Name, enter hudi_cow_crawler. Choose Next.
  4. Under Data source configuration,  choose Add data source.
    1. For Data source, choose Hudi.
    2. For Include hudi table paths, enter s3://your_s3_bucket/data/sample_hudi_cow_table/. (Replace your_s3_bucket with your S3 bucket name.)
    3. Choose Add Hudi data source.
  5. Choose Next.
  6. For Existing IAM role, choose your IAM role, then choose Next.
  7. For Target database, choose Add database, then the Add database dialog appears. For Database name, enter hudi_crawler_blog, then choose Create. Choose Next.
  8. Choose Create crawler.

Now a new Hudi crawler has been successfully created. The crawler can be triggered to run through the console, or through the SDK or AWS CLI using the StartCrawler API. It can also be scheduled through the console to run at specific times. In this instruction, run the crawler through the console.

  1. Choose Run crawler.
  2. Wait for the crawler to complete.
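If you want to trigger the crawler from the SDK instead, the following minimal sketch starts it and waits until it returns to the READY state.

import time
import boto3

glue = boto3.client("glue")

# Start the crawler and poll until the crawl finishes.
glue.start_crawler(Name="hudi_cow_crawler")
while glue.get_crawler(Name="hudi_cow_crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)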

After the crawler has run, you can see the Hudi table definition in the AWS Glue console:

You have successfully crawled the Hudi CoW table with data on Amazon S3 and created an AWS Glue Data Catalog table with the schema populated. After you create the table definition on AWS Glue Data Catalog, AWS analytics services such as Amazon Athena are able to query the Hudi table.

Complete the following steps to start queries on Athena:

  1. Open the Amazon Athena console.
  2. Run the following query.
SELECT * FROM "hudi_crawler_blog"."sample_hudi_cow_table" limit 10;

The following screenshot shows our output:

Crawl a Hudi MoR table using AWS Glue crawler with AWS Lake Formation data permissions

In this section, let’s go through how to crawl a Hudi MoR table using AWS Glue. This time, you use AWS Lake Formation data permission for crawling Amazon S3 data sources instead of IAM and Amazon S3 permission. This is optional, but it simplifies permission configurations when your data lake is managed by AWS Lake Formation permissions.

Prerequisites

Here are the prerequisites for this tutorial:

  1. Install and configure AWS Command Line Interface (AWS CLI).
  2. Create your S3 bucket if you do not have it.
  3. Create your IAM role for AWS Glue if you do not have it. You need lakeformation:GetDataAccess. But you do not need s3:GetObject for s3://your_s3_bucket/data/sample_hudi_mor_table/ because we use Lake Formation data permission to access the files.
  4. Run the following command to copy the sample Hudi table into your S3 bucket. (Replace your_s3_bucket with your S3 bucket name.)
$ aws s3 sync s3://aws-bigdata-blog/artifacts/hudi-crawler/product_mor/ s3://your_s3_bucket/data/sample_hudi_mor_table/

In addition to the preceding steps, complete the following steps to update the AWS Glue Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control:

  1. Sign in to the Lake Formation console as a data lake administrator.
    1. If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
  2. Under Administration, choose Data catalog settings.
  3. For Default permissions for newly created databases and tables, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  4. For Cross account version setting, choose Version 3.
  5. Choose Save.

The next step is to register your S3 bucket in Lake Formation data lake locations:

  1. On the Lake Formation console, choose Data lake locations, and choose Register location.
  2. For Amazon S3 path, enter s3://your_s3_bucket/. (Replace your_s3_bucket with your S3 bucket name.)
  3. Choose Register location.

Then, grant the AWS Glue crawler role access to the data location so that the crawler can use Lake Formation permissions to access the data and create tables in the location:

  1. On the Lake Formation console, choose Data locations and choose Grant.
  2. For IAM users and roles, select the IAM role you used for the crawler.
  3. For Storage location, enter s3://your_s3_bucket/data/. (Replace your_s3_bucket with your S3 bucket name.)
  4. Choose Grant.

Then, grant the crawler role permission to create tables under the database hudi_crawler_blog:

  1. On the Lake Formation console, choose Data lake permissions.
  2. Choose Grant.
  3. For Principals, choose IAM users and roles, and choose the crawler role.
  4. For LF tags or catalog resources, choose Named data catalog resources.
  5. For Database, choose the database hudi_crawler_blog.
  6. Under Database permissions, select Create table.
  7. Choose Grant.
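The same grants can be issued through the Lake Formation API. The following is a minimal sketch using Boto3; the crawler role ARN and bucket name are placeholders.

import boto3

lakeformation = boto3.client("lakeformation")
crawler_role_arn = "arn:aws:iam::123456789012:role/GlueCrawlerRole"  # placeholder

# Grant data location access on the S3 prefix registered with Lake Formation.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": crawler_role_arn},
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::your_s3_bucket/data"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)

# Grant permission to create tables in the target database.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": crawler_role_arn},
    Resource={"Database": {"Name": "hudi_crawler_blog"}},
    Permissions=["CREATE_TABLE"],
)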

Create a Hudi crawler with Lake Formation data permissions

Complete the following steps to create a Hudi crawler:

  1. On the AWS Glue console, choose Crawlers.
  2. Choose Create crawler.
  3. For Name, enter hudi_mor_crawler. Choose Next.
  4. Under Data source configuration,  choose Add data source.
    1. For Data source, choose Hudi.
    2. For Include hudi table paths, enter s3://your_s3_bucket/data/sample_hudi_mor_table/. (Replace your_s3_bucket with your S3 bucket name.)
    3. Choose Add Hudi data source.
  5. Choose Next.
  6. For Existing IAM role, choose your IAM role.
  7. Under Lake Formation configuration – optional, select Use Lake Formation credentials for crawling S3 data source.
  8. Choose Next.
  9. For Target database, choose hudi_crawler_blog. Choose Next.
  10. Choose Create crawler.

Now a new Hudi crawler has been successfully created. The crawler uses Lake Formation credentials for crawling Amazon S3 files. Let’s run the new crawler:

  1. Choose Run crawler.
  2. Wait for the crawler to complete.

After the crawler has run, you can see two tables of the Hudi table definition in the AWS Glue console:

  • sample_hudi_mor_table_ro (read optimized table)
  • sample_hudi_mor_table_rt (real time table)

You registered the data lake bucket with Lake Formation and enabled crawling access to the data lake using Lake Formation permissions. You have successfully crawled the Hudi MoR table with data on Amazon S3 and created an AWS Glue Data Catalog table with the schema populated. After you create the table definitions on AWS Glue Data Catalog, AWS analytics services such as Amazon Athena are able to query the Hudi table.

Complete the following steps to start queries on Athena:

  1. Open the Amazon Athena console.
  2. Run the following query.
    SELECT * FROM "hudi_crawler_blog"."sample_hudi_mor_table_rt" limit 10;

The following screenshot shows our output:

  1. Run the following query.
    SELECT * FROM "hudi_crawler_blog"."sample_hudi_mor_table_ro" limit 10;

The following screenshot shows our output:

Fine-grained access control using AWS Lake Formation permissions

To apply fine-grained access control on the Hudi table, you can benefit from AWS Lake Formation permissions. Lake Formation permissions allow you to restrict access to specific tables, columns, or rows and then query the Hudi tables through Amazon Athena with fine-grained access control. Let’s configure Lake Formation permission for the Hudi MoR table.

Prerequisites

Here are the prerequisites for this tutorial:

  1. Complete the previous section Crawl a Hudi MoR table using AWS Glue crawler with AWS Lake Formation data permissions.
  2. Create an IAM user DataAnalyst, who has AWS managed policy AmazonAthenaFullAccess.

Create a Lake Formation data cell filter

Let’s first set up a filter for the MoR read optimized table.

  1. Sign in to the Lake Formation console as a data lake administrator.
  2. Choose Data filters.
  3. Choose Create new filter.
  4. For Data filter name, enter exclude_product_price.
  5. For Target database, choose the database hudi_crawler_blog.
  6. For Target table, choose the table sample_hudi_mor_table_ro.
  7. For Column-level access, select Exclude columns, and choose the column price.
  8. For Row filter expression, enter true.
  9. Choose Create filter.
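You can also create the same data cell filter through the Lake Formation API. The following minimal sketch uses Boto3; the account ID is a placeholder for your AWS account ID, which serves as the table catalog ID.

import boto3

# Create a data cell filter that hides the price column and keeps all rows.
lakeformation = boto3.client("lakeformation")
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",  # placeholder AWS account ID
        "DatabaseName": "hudi_crawler_blog",
        "TableName": "sample_hudi_mor_table_ro",
        "Name": "exclude_product_price",
        "RowFilter": {"FilterExpression": "true"},
        "ColumnWildcard": {"ExcludedColumnNames": ["price"]},
    }
)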

Grant Lake Formation permissions to the DataAnalyst user

Complete the following steps to grant Lake Formation permissions to the DataAnalyst user:

  1. On the Lake Formation console, choose Data lake permissions.
  2. Choose Grant.
  3. For Principals, choose IAM users and roles, and choose the user DataAnalyst.
  4. For LF tags or catalog resources, choose Named data catalog resources.
  5. For Database, choose the database hudi_crawler_blog.
  6. For Table – optional, choose the table sample_hudi_mor_table_ro.
  7. For Data filters – optional, select exclude_product_price.
  8. For Data filter permissions, select Select.
  9. Choose Grant.

You granted Lake Formation permission on the database hudi_crawler_blog and the table sample_hudi_mor_table_ro, excluding the column price to the DataAnalyst user. Now let’s validate user access to the data using Athena.

  1. Sign in to the Athena console as a DataAnalyst user.
  2. On the query editor, run the following query:
    SELECT * FROM "hudi_crawler_blog"."sample_hudi_mor_table_ro" limit 10;

The following screenshot shows our output:

Now you validated that the column price is not shown, but the other columns product_id, product_name, update_at, and category are shown.

Clean up

To avoid unwanted charges to your AWS account, delete the following AWS resources:

  1. Delete AWS Glue database hudi_crawler_blog.
  2. Delete AWS Glue crawlers hudi_cow_crawler and hudi_mor_crawler.
  3. Delete Amazon S3 files under s3://your_s3_bucket/data/sample_hudi_cow_table/ and s3://your_s3_bucket/data/sample_hudi_mor_table/.

Conclusion

This post demonstrated how AWS Glue crawlers work for Hudi tables. With the support for Hudi crawler, you can quickly move to using AWS Glue Data Catalog as your primary Hudi table catalog. You can start building your serverless transactional data lake using Hudi on AWS using AWS Glue, AWS Glue Data Catalog, and Lake Formation fine-grained access controls for tables and formats supported by AWS analytical engines.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Kyle Duong is a Software Development Engineer on the AWS Glue and Lake Formation team. He is passionate about building big data technologies and distributed systems.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics/

For any modern data-driven company, having smooth data integration pipelines is crucial. These pipelines pull data from various sources, transform it, and load it into destination systems for analytics and reporting. When running properly, they provide timely and trustworthy information. However, without vigilance, varying data volumes, characteristics, and application behavior can cause data pipelines to become inefficient and problematic. Performance can slow down or pipelines can become unreliable. Undetected errors result in bad data and impact downstream analysis. That’s why robust monitoring and troubleshooting for data pipelines is essential across the following four areas:

  • Reliability
  • Performance
  • Throughput
  • Resource utilization

Together, these four aspects of monitoring provide end-to-end visibility and control over a data pipeline and its operations.

Today we are pleased to announce a new class of Amazon CloudWatch metrics reported with your pipelines built on top of AWS Glue for Apache Spark jobs. The new metrics provide aggregate and fine-grained insights into the health and operations of your job runs and the data being processed. In addition to providing insightful dashboards, the metrics provide classification of errors, which helps with root cause analysis of performance bottlenecks and error diagnosis. With this analysis, you can evaluate and apply the recommended fixes and best practices for architecting your jobs and pipelines. As a result, you gain the benefit of higher availability, better performance, and lower cost for your AWS Glue for Apache Spark workload.

This post demonstrates how the new enhanced metrics help you monitor and debug AWS Glue jobs.

Enable the new metrics

The new metrics can be configured through the job parameter --enable-observability-metrics.

The new metrics are enabled by default on the AWS Glue Studio console. To configure the metrics on the AWS Glue Studio console, complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Under Your jobs, choose your job.
  3. On the Job details tab, expand Advanced properties.
  4. Under Job observability metrics, select Enable the creation of additional observability CloudWatch metrics when this job runs.

To enable the new metrics in the AWS Glue CreateJob and StartJobRun APIs, set the following parameters in the DefaultArguments property:

  • Key: --enable-observability-metrics
  • Value: true

To enable the new metrics in the AWS Command Line Interface (AWS CLI), set the same job parameters in the --default-arguments argument.
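As a reference, the following is a minimal boto3 sketch that enables the metrics for a single job run by passing the same parameter as a run argument; the job name is a placeholder.

import boto3

glue = boto3.client("glue")

# Enable job observability metrics for one run of an existing job.
# The job name is a placeholder.
response = glue.start_job_run(
    JobName="your-glue-job-name",
    Arguments={"--enable-observability-metrics": "true"},
)
print(response["JobRunId"])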

Use case

A typical workload for AWS Glue for Apache Spark jobs is to load data from a relational database to a data lake with SQL-based transformations. The following is a visual representation of an example job where the number of workers is 10.

When the example job ran, the workerUtilization metrics showed the following trend.

Note that workerUtilization showed values between 0.20 (20%) and 0.40 (40%) for the entire duration. This typically happens when the job capacity is over-provisioned and many Spark executors are idle, resulting in unnecessary cost. To improve resource utilization efficiency, it’s a good idea to enable AWS Glue Auto Scaling. The following screenshot shows the same workerUtilization metrics graph when AWS Glue Auto Scaling is enabled for the same job.

workerUtilization showed 1.0 in the beginning because of AWS Glue Auto Scaling and it trended between 0.75 (75%) and 1.0 (100%) based on the workload requirements.

Query and visualize metrics in CloudWatch

Complete the following steps to query and visualize metrics on the CloudWatch console:

  1. On the CloudWatch console, choose All metrics in the navigation pane.
  2. Under Custom namespaces, choose Glue.
  3. Choose Observability Metrics (or Observability Metrics Per Source, or Observability Metrics Per Sink).
  4. Search for and select the specific metric name, job name, job run ID, and observability group.
  5. On the Graphed metrics tab, configure your preferred statistic, period, and so on.

Query metrics using the AWS CLI

Complete the following steps for querying using the AWS CLI (for this example, we query the worker utilization metric):

  1. Create a metric definition JSON file (provide your AWS Glue job name and job run ID):
    $ cat multiplequeries.json
    [
      {
        "Id": "avgWorkerUtil_0",
        "MetricStat" : {
          "Metric" : {
            "Namespace": "Glue",
            "MetricName": "glue.driver.workerUtilization",
            "Dimensions": [
              {
                  "Name": "JobName",
                  "Value": "<your-Glue-job-name-A>"
              },
              {
                "Name": "JobRunId",
                "Value": "<your-Glue-job-run-id-A>"
              },
              {
                "Name": "Type",
                "Value": "gauge"
              },
              {
                "Name": "ObservabilityGroup",
                "Value": "resource_utilization"
              }
            ]
          },
          "Period": 1800,
          "Stat": "Minimum",
          "Unit": "None"
        }
      },
      {
          "Id": "avgWorkerUtil_1",
          "MetricStat" : {
          "Metric" : {
            "Namespace": "Glue",
            "MetricName": "glue.driver.workerUtilization",
            "Dimensions": [
               {
                 "Name": "JobName",
                 "Value": "<your-Glue-job-name-B>"
               },
               {
                 "Name": "JobRunId",
                 "Value": "<your-Glue-job-run-id-B>"
               },
               {
                 "Name": "Type",
                 "Value": "gauge"
               },
               {
                 "Name": "ObservabilityGroup",
                 "Value": "resource_utilization"
               }
            ]
          },
          "Period": 1800,
          "Stat": "Minimum",
          "Unit": "None"
        }
      }
    ]

  2. Run the get-metric-data command:
    $ aws cloudwatch get-metric-data --metric-data-queries file://multiplequeries.json \
         --start-time '2023-10-28T18:20' \
         --end-time '2023-10-28T19:10'  \
         --region us-east-1
    {
        "MetricDataResults": [
          {
             "Id": "avgWorkerUtil_0",
             "Label": "<your label A>",
             "Timestamps": [
                   "2023-10-28T18:20:00+00:00"
             ], 
             "Values": [
                   0.06718750000000001
             ],
             "StatusCode": "Complete"
          },
          {
             "Id": "avgWorkerUtil_1",
             "Label": "<your label B>",
             "Timestamps": [
                  "2023-10-28T18:20:00+00:00"
              ],
              "Values": [
                  0.5959183673469387
              ],
              "StatusCode": "Complete"
           }
        ],
        "Messages": []
    }

Create a CloudWatch alarm

You can create static threshold-based alarms for the different metrics. For instructions, refer to Create a CloudWatch alarm based on a static threshold.

For example, for skewness, you can set an alarm on skewness.stage with a threshold of 1.0, and on skewness.job with a threshold of 0.5. These thresholds are just recommendations; you can adjust them based on your specific use case (for example, some jobs are expected to be skewed, and that isn’t an issue worth alarming on). Our recommendation is to evaluate the metric values of your job runs for some time to understand which values are anomalous before configuring the alarm thresholds.
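As a minimal sketch, the following boto3 call creates a static threshold alarm on the worker utilization metric used earlier in this post; the job name, job run ID, and SNS topic ARN are placeholders, and you can swap in other observability metrics and thresholds in the same way.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when worker utilization stays below 30% for two consecutive 30-minute periods.
# Job name, job run ID, and the SNS topic ARN are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="glue-low-worker-utilization",
    Namespace="Glue",
    MetricName="glue.driver.workerUtilization",
    Dimensions=[
        {"Name": "JobName", "Value": "your-glue-job-name"},
        {"Name": "JobRunId", "Value": "your-glue-job-run-id"},
        {"Name": "Type", "Value": "gauge"},
        {"Name": "ObservabilityGroup", "Value": "resource_utilization"},
    ],
    Statistic="Average",
    Period=1800,
    EvaluationPeriods=2,
    Threshold=0.3,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:your-sns-topic"],
)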

Other enhanced metrics

For a full list of other enhanced metrics available with AWS Glue jobs, refer to Monitoring with AWS Glue Observability metrics. These metrics allow you to capture the operational insights of your jobs, such as resource utilization (memory and disk), normalized error classes such as compilation and syntax, user or service errors, and throughput for each source or sink (records, files, partitions, and bytes read or written).

Job observability dashboards

You can further simplify observability for your AWS Glue jobs with dashboards built on these metrics: use Amazon Managed Grafana for real-time monitoring, and Amazon QuickSight for visualization and analysis of trends.

Conclusion

This post demonstrated how the new enhanced CloudWatch metrics help you monitor and debug AWS Glue jobs. With these enhanced metrics, you can more easily identify and troubleshoot issues in real time. This results in AWS Glue jobs that experience higher uptime, faster processing, and reduced expenditures. The end benefit for you is more effective and optimized AWS Glue for Apache Spark workloads. The metrics are available in all AWS Glue supported Regions. Check it out!


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Shenoda Guirguis is a Senior Software Development Engineer on the AWS Glue team. His passion is in building scalable and distributed Data Infrastructure/Processing Systems. When he gets a chance, Shenoda enjoys reading and playing soccer.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18+ year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems to enable customers with interactive and simple to use interfaces to efficiently manage and transform petabytes of data seamlessly across data lakes on Amazon S3, databases and data-warehouses on cloud.

Introducing AWS Glue serverless Spark UI for better monitoring and troubleshooting

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-studio-serverless-spark-ui-for-better-monitoring-and-troubleshooting/

In AWS, hundreds of thousands of customers use AWS Glue, a serverless data integration service, to discover, combine, and prepare data for analytics and machine learning. When you have complex datasets and demanding Apache Spark workloads, you may experience performance bottlenecks or errors during Spark job runs. Troubleshooting these issues can be difficult and delay getting jobs working in production. Customers often use the Apache Spark Web UI, a popular debugging tool that is part of open source Apache Spark, to help fix problems and optimize job performance. AWS Glue supports Spark UI in two different ways, but you need to set it up yourself, which requires time and effort to manage networking and Amazon EC2 instances, or trial and error with Docker containers.

Today, we are pleased to announce serverless Spark UI built into the AWS Glue console. You can now use Spark UI easily as it’s a built-in component of the AWS Glue console, enabling you to access it with a single click when examining the details of any given job run. There’s no infrastructure setup or teardown required. AWS Glue serverless Spark UI is a fully-managed serverless offering and generally starts up in a matter of seconds. Serverless Spark UI makes it significantly faster and easier to get jobs working in production because you have ready access to low level details for your job runs.

This post describes how the AWS Glue serverless Spark UI helps you to monitor and troubleshoot your AWS Glue job runs.

Getting started with serverless Spark UI

You can access the serverless Spark UI for a given AWS Glue job run by navigating from your job’s page on the AWS Glue console.

  1. On the AWS Glue console, choose ETL jobs.
  2. Choose your job.
  3. Choose the Runs tab.
  4. Select the job run you want to investigate, then choose Spark UI.

The Spark UI will display in the lower pane, as shown in the following screen capture:

Alternatively, you can get to the serverless Spark UI for a specific job run by navigating from Job run monitoring in AWS Glue.

  1. On the AWS Glue console, choose Job run monitoring under ETL jobs.
  2. Select your job run, and choose View run details.

Scroll down to the bottom to view the Spark UI for the job run.

Prerequisites

Complete the following prerequisite steps:

  1. Enable Spark UI event logs for your job runs. This is enabled by default on the AWS Glue console. Once enabled, Spark event log files are created during the job run and stored in your S3 bucket. The serverless Spark UI parses a Spark event log file generated in your S3 bucket to visualize detailed information for both running and completed job runs. A progress bar shows the percentage to completion, with a typical parsing time of less than a minute.
  2. When the logs are parsed, you can use the built-in Spark UI to debug, troubleshoot, and optimize your jobs.

For more information about Apache Spark UI, refer to Web UI in Apache Spark.
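If you create or run jobs through the API instead of the console, you control the event logs with job parameters. The following boto3 sketch passes the documented parameters --enable-spark-ui and --spark-event-logs-path for a single run; the job name and S3 path are placeholders.

import boto3

glue = boto3.client("glue")

# Enable Spark UI event logs for one run and write them to the given S3 path.
# The job name and the S3 path are placeholders.
glue.start_job_run(
    JobName="your-glue-job-name",
    Arguments={
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": "s3://your_s3_bucket/sparkHistoryLogs/",
    },
)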

Monitor and Troubleshoot with Serverless Spark UI

A typical workload for AWS Glue for Apache Spark jobs is loading data from relational databases to S3-based data lakes. This section demonstrates how to monitor and troubleshoot an example job run for this workload with the serverless Spark UI. The sample job reads data from a MySQL database and writes to Amazon S3 in Parquet format. The source table has approximately 70 million records.

The following screen capture shows a sample visual job authored in AWS Glue Studio visual editor. In this example, the source MySQL table has already been registered in the AWS Glue Data Catalog in advance. It can be registered through AWS Glue crawler or AWS Glue catalog API. For more information, refer to Data Catalog and crawlers in AWS Glue.

Now it’s time to run the job! The first job run finished in 30 minutes and 10 seconds as shown:

Let’s use the Spark UI to optimize the performance of this job run. Open the Spark UI tab on the Job runs page. When you drill down to Stages and view the Duration column, you will notice that Stage Id=0 took 27.41 minutes to run, and the stage had only one Spark task in the Tasks:Succeeded/Total column. That means there was no parallelism when loading data from the source MySQL database.

To optimize the data load, introduce parameters called hashfield and hashpartitions to the source table definition. For more information, refer to Reading from JDBC tables in parallel. On the AWS Glue Data Catalog table, add two properties under Table properties: hashfield=emp_no and hashpartitions=18.
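If you prefer to set these table properties programmatically instead of through the console, the following is a minimal boto3 sketch; the database and table names are placeholders, and the set of fields copied into TableInput is trimmed to the writable ones.

import boto3

glue = boto3.client("glue")

database = "your_database"    # placeholder
table = "your_mysql_table"    # placeholder

# Fetch the current table definition, keep only the fields that UpdateTable accepts,
# and add the JDBC parallel-read properties.
current = glue.get_table(DatabaseName=database, Name=table)["Table"]
writable = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}
table_input = {k: v for k, v in current.items() if k in writable}
table_input.setdefault("Parameters", {})
table_input["Parameters"].update({"hashfield": "emp_no", "hashpartitions": "18"})

glue.update_table(DatabaseName=database, TableInput=table_input)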

With these properties, new job runs parallelize the data load from the source MySQL table.

Let’s try running the same job again! This time, the job run finished in 9 minutes and 9 seconds. It saved 21 minutes from the previous job run.

As a best practice, view the Spark UI and compare the job runs before and after the optimization. Drilling down to Completed stages, you will notice that there was one stage with 18 tasks instead of a single task.

In the first job run, AWS Glue automatically shuffled data across multiple executors before writing to the destination because there were too few tasks. On the other hand, in the second job run, there was only one stage because there was no need to do extra shuffling, and there were 18 tasks loading data in parallel from the source MySQL database.

Considerations

Keep in mind the following considerations:

  • Serverless Spark UI is supported in AWS Glue 3.0 and later
  • Serverless Spark UI will be available for jobs that ran after November 20, 2023, due to a change in how AWS Glue emits and stores Spark logs
  • Serverless Spark UI can visualize Spark event logs that are up to 1 GB in size
  • There is no retention limit because serverless Spark UI scans the Spark event log files in your S3 bucket
  • Serverless Spark UI is not available for Spark event logs stored in an S3 bucket that can only be accessed from your VPC

Conclusion

This post described how the AWS Glue serverless Spark UI helps you monitor and troubleshoot your AWS Glue jobs. By providing instant access to the Spark UI directly within the AWS Management Console, you can now inspect the low-level details of job runs to identify and resolve issues. With the serverless Spark UI, there is no infrastructure to manage—the UI spins up automatically for each job run and tears down when no longer needed. This streamlined experience saves you time and effort compared to manually launching Spark UIs yourself.

Give the serverless Spark UI a try today. We think you’ll find it invaluable for optimizing performance and quickly troubleshooting errors. We look forward to hearing your feedback as we continue improving the AWS Glue console experience.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Alexandra Tello is a Senior Front End Engineer with the AWS Glue team in New York City. She is a passionate advocate for usability and accessibility. In her free time, she’s an espresso enthusiast and enjoys building mechanical keyboards.

Matt Sampson is a Software Development Manager on the AWS Glue team. He loves working with his other Glue team members to make services that our customers benefit from. Outside of work, he can be found fishing and maybe singing karaoke.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytic services. In his spare time, he enjoys skiing and gardening.

Load data incrementally from transactional data lakes to data warehouses

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/load-data-incrementally-from-transactional-data-lakes-to-data-warehouses/

Data lakes and data warehouses are two of the most important data storage and management technologies in a modern data architecture. Data lakes store all of an organization’s data, regardless of its format or structure. An open table format such as Apache Hudi, Delta Lake, or Apache Iceberg is widely used to build data lakes on Amazon Simple Storage Service (Amazon S3) in a transactionally consistent manner for use cases including record-level upserts and deletes, change data capture (CDC), time travel queries, and more. Data warehouses, on the other hand, store data that has been cleaned, organized, and structured for analysis. Depending on your use case, it’s common to have a copy of the data between your data lake and data warehouse to support different access patterns.

When the data becomes very large and unwieldy, it can be difficult to keep the copy of the data between data lakes and data warehouses in sync and up to date in an efficient manner.

In this post, we discuss different architecture patterns to keep data in sync and up to date between data lakes built on open table formats and data warehouses such as Amazon Redshift. We also discuss the benefits of incremental loading and the techniques for implementing the architecture using AWS Glue, which is a serverless, scalable data integration service that helps you discover, prepare, move, and integrate data from multiple sources. Various data stores are supported in AWS Glue; for example, AWS Glue 4.0 supports an enhanced Amazon Redshift connector to read from and write to Amazon Redshift, and also supports a built-in Snowflake connector to read from and write to Snowflake. Moreover, Apache Hudi, Delta Lake, and Apache Iceberg are natively supported in AWS Glue.

Architecture patterns

Generally, there are three major architecture patterns to keep your copy of data between data lakes and data warehouses in sync and up to date:

  • Dual writes
  • Incremental queries
  • Change data capture

Let’s discuss each of the architecture patterns and the techniques to achieve them.

Dual writes

When initially ingesting data from its raw source into the data lake and data warehouse, a single batch process is configured to write to both. We call this pattern dual writes. Although this architecture pattern (see the following diagram) is straightforward and easy to implement, it can become error-prone because there are two separate write transactions, and each can fail independently, causing inconsistencies between the data lake and data warehouse when a write succeeds in one but not the other.

Incremental queries

An incremental query architectural pattern is designed to ingest data first into the data lake with an open table format, and then load the newly written data from the data lake into the data warehouse. Open table formats such as Apache Hudi and Apache Iceberg support incremental queries based on their respective transaction logs. You can capture records inserted or updated with the incremental queries, and then merge the captured records into the destination data warehouses.

Apache Hudi supports incremental queries, which allow you to retrieve all records written during a specific time range.
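As a reference, a minimal PySpark sketch of a Hudi incremental query looks like the following; the table path and begin commit time are placeholders, and it assumes Hudi support is enabled in the Spark session (for example, through the AWS Glue --datalake-formats job parameter).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the records committed after the given Hudi commit time.
# The S3 path and the begin instant time are placeholders.
incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20231001000000")
    .load("s3://your_s3_bucket/prefix/hudi_incremental/ghcn/")
)
incremental_df.show()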

Delta Lake doesn’t have a separate concept of incremental queries; this capability is provided by its change data feed, which is explained in the next section.

Apache Iceberg supports incremental read, which allows you to read appended data incrementally. As of this writing, Iceberg gets incremental data only from the append operation; other operations such as replace, overwrite, and delete aren’t supported by incremental read.
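A minimal PySpark sketch of an Iceberg incremental read is shown below; the snapshot IDs and the catalog and table names are placeholders, and it assumes the Spark session is already configured with an Iceberg catalog.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the data appended between two snapshots of an Iceberg table.
# Snapshot IDs and the catalog/database/table names are placeholders.
incremental_df = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "1234567890123456789")
    .option("end-snapshot-id", "9876543210987654321")
    .load("glue_catalog.your_database.your_iceberg_table")
)
incremental_df.show()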

For merging the records into Amazon Redshift, you can use the MERGE SQL command, which was released in April 2023. AWS Glue supports the Redshift MERGE SQL command within its data integration jobs. To learn more, refer to Exploring new ETL and ELT capabilities for Amazon Redshift from the AWS Glue Studio visual editor.

Incremental queries are useful to capture changed records; however, incremental queries can’t capture deletes and only return the latest version of each record. If you need to handle delete operations in the source data lake, you will need to use a CDC-based approach.

The following diagram illustrates an incremental query architectural pattern.

Change data capture

Change data capture (CDC) is a well-known technique to capture all mutating operations in a source database system and relay those operations to another system. CDC keeps all the intermediate changes, including the deletes. With this architecture pattern, you capture not only inserts and updates, but also deletes committed to the data lake, and then merge those captured changes into the data warehouses.

Apache Hudi 0.13.0 or later supports change data capture as an experimental feature, which is only available for Copy-on-Write (CoW) tables. Merge-on-Read tables (MoR) do not support CDC as of this writing.

Delta Lake 2.0.0 or later supports a change data feed, which allows Delta tables to track record-level changes between table versions.
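A minimal PySpark sketch of reading the change data feed looks like the following; the table path and starting version are placeholders, and it assumes the table was written with the change data feed property enabled, as shown later in this post.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read row-level changes (including the _change_type, _commit_version, and
# _commit_timestamp columns) starting from a given table version.
# The S3 path and the starting version are placeholders.
changes_df = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .load("s3://your_s3_bucket/prefix/delta_incremental/ghcn/")
)
changes_df.show()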

Apache Iceberg 1.2.1 or later supports change data capture through its create_changelog_view procedure. When you run this procedure, a new view that contains the changes from a given table is created.

The following diagram illustrates a CDC architecture.

Example scenario

To demonstrate the end-to-end experience, this post uses the Global Historical Climatology Network Daily (GHCN-D) dataset. The data is publicly accessible through an S3 bucket. For more information, see the Registry of Open Data on AWS. You can also learn more in Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight.

The Amazon S3 location s3://noaa-ghcn-pds/csv/by_year/ has all of the observations from 1763 to the present organized in CSV files, one file for each year. The following block shows an example of what the records look like:

ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME
AE000041196,20220101,TAVG,204,H,,S,
AEM00041194,20220101,TAVG,211,H,,S,
AEM00041217,20220101,TAVG,209,H,,S,
AEM00041218,20220101,TAVG,207,H,,S,
AE000041196,20220102,TAVG,226,H,,S,
...
AE000041196,20221231,TMAX,243,,,S,
AE000041196,20221231,PRCP,0,D,,S,
AE000041196,20221231,TAVG,202,H,,S,

The records have fields including ID, DATE, ELEMENT, and more. Each combination of ID, DATE, and ELEMENT represents a unique record in this dataset. For example, the record with ID as AE000041196, ELEMENT as TAVG, and DATE as 20220101 is unique. We use this dataset in the following examples and simulate record-level updates and deletes as sample operations.

Prerequisites

To continue with the examples in this post, you need to create (or already have) the following AWS resources:

For the first tutorial (loading from Apache Hudi to Amazon Redshift), you also need the following:

For the second tutorial (loading from Delta Lake to Snowflake), you need the following:

  • A Snowflake account.
  • An AWS Glue connection named snowflake for Snowflake access. For more information, refer to Configuring Snowflake connections.
  • An AWS Secrets Manager secret named snowflake_credentials with the following key pairs:
    • Key sfUser with value <Your Snowflake username>
    • Key sfPassword with value <Your Snowflake password>

These tutorials are interchangeable, so you can easily apply the same pattern to any combination of source and destination, for example, Hudi to Snowflake, or Delta to Amazon Redshift.

Load data incrementally from Apache Hudi table to Amazon Redshift using a Hudi incremental query

This tutorial uses Hudi incremental queries to load data from a Hudi table and then merge the changes to Amazon Redshift.

Ingest initial data to a Hudi table

Complete the following steps:

  1. Open AWS Glue Studio.
  2. Choose ETL jobs.
  3. Choose Visual with a source and target.
  4. For Source and Target, choose Amazon S3, then choose Create.

A new visual job configuration appears. The next step is to configure the data source to read an example dataset.

  1. Name this new job hudi-data-ingestion.
  2. Under Visual, choose Data source – S3 bucket.
  3. Under Node properties, for S3 source type, select S3 location.
  4. For S3 URL, enter s3://noaa-ghcn-pds/csv/by_year/2022.csv.

The data source is configured. The next step is to configure the data target to ingest data in Apache Hudi on your S3 bucket.

  1. Choose Data target – S3 bucket.
  2. Under Data target properties – S3, for Format, choose Apache Hudi.
  3. For Hudi Table Name, enter ghcn_hudi.
  4. For Hudi Storage Type, choose Copy on write.
  5. For Hudi Write Operation, choose Upsert.
  6. For Hudi Record Key Fields, choose ID.
  7. For Hudi Precombine Key Field, choose DATE.
  8. For Compression Type, choose GZIP.
  9. For S3 Target location, enter s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_incremental/ghcn/. (Provide your S3 bucket name and prefix.)
  10. For Data Catalog update options, select Do not update the Data Catalog.

Now your data integration job is authored in the visual editor completely. Let’s add one remaining setting about the IAM role, then run the job.

  1. Under Job details, for IAM Role, choose your IAM role.
  2. Choose Save, then choose Run.

You can track the progress on the Runs tab. It finishes in several minutes.

Load data from the Hudi table to a Redshift table

In this step, we assume that the files are updated with new records every day, and want to store only the latest record per the primary key (ID and ELEMENT) to make the latest snapshot data queryable. One typical approach is to do an INSERT for all the historical data, and calculate the latest records in queries; however, this can introduce additional overhead in all the queries. When you want to analyze only the latest records, it’s better to do an UPSERT (update and insert) based on the primary key and DATE field rather than just an INSERT in order to avoid duplicates and maintain a single updated row of data.

Complete the following steps to load data from the Hudi table to a Redshift table:

  1. Download the file hudi2redshift-incremental-load.ipynb.
  2. In AWS Glue Studio, choose Jupyter Notebook, then choose Create.
  3. For Job name, enter hudi-ghcn-incremental-load-notebook.
  4. For IAM Role, choose your IAM role.
  5. Choose Start notebook.

Wait for the notebook to be ready.

  1. Run the first cell to set up an AWS Glue interactive session.
  2. Replace the parameters with yours and run the cell under Configure your resource.
  3. Run the cell under Initialize SparkSession and GlueContext.
  4. Run the cell under Determine target time range for incremental query.
  5. Run the cells under Run query to load data updated during a given timeframe.
  6. Run the cells under Merge changes into destination table.

You can see the exact query that runs right after the data is ingested into a temporary table and merged into the Redshift table.

  1. Run the cell under Update the last query end time.

Validate initial records in the Redshift table

Complete the following steps to validate the initial records in the Redshift table:

  1. On the Amazon Redshift console, open Query Editor v2.
  2. Run the following query:
    SELECT * FROM "dev"."public"."ghcn" WHERE ID = 'AE000041196'

The query returns the following result set.

The original source file 2022.csv has historical records for record ID='AE000041196' from 20220101 to 20221231; however, the query result shows only four records, one record per ELEMENT at the latest snapshot of the day 20221230 or 20221231. Because we used the UPSERT write option when writing data, we configured the ID field as a Hudi record key field, the DATE field as a Hudi precombine field, and the ELEMENT field as the partition key field. When two records have the same key value, Hudi picks the one with the largest value for the precombine field. When the job ingested data, it compared all the values in the DATE field for each pair of ID and ELEMENT, and then picked the record with the largest value in the DATE field. We use the current state of this table as the initial state.

Ingest updates to a Hudi table

Complete the following steps to simulate ingesting more records into the Hudi table:

  1. On AWS Glue Studio, choose the job hudi-data-ingestion.
  2. On the Data target – S3 bucket node, change the S3 location from s3://noaa-ghcn-pds/csv/by_year/2022.csv to s3://noaa-ghcn-pds/csv/by_year/2023.csv.
  3. Run the job.

Because this job uses the DATE field as a Hudi precombine field, the records included in the new source file have been upserted into the Hudi table.

Load data incrementally from the Hudi table to the Redshift table

Complete the following steps to load the ingested records incrementally to the Redshift table:

  1. On AWS Glue Studio, choose the job hudi-ghcn-incremental-load-notebook.
  2. Run all the cells again.

In the cells under Run query, you will notice that the records shown this time have DATE in 2023. Only newly ingested records are shown here.

In the cells under Merge changes into destination table, the newly ingested records are merged into the Redshift table. The generated MERGE query statement in the notebook is as follows:

MERGE INTO public.ghcn USING public.ghcn_tmp ON 
    public.ghcn.ID = public.ghcn_tmp.ID AND 
    public.ghcn.ELEMENT = public.ghcn_tmp.ELEMENT
WHEN MATCHED THEN UPDATE SET 
    _hoodie_commit_time = public.ghcn_tmp._hoodie_commit_time,
    _hoodie_commit_seqno = public.ghcn_tmp._hoodie_commit_seqno,
    _hoodie_record_key = public.ghcn_tmp._hoodie_record_key,
    _hoodie_partition_path = public.ghcn_tmp._hoodie_partition_path,
    _hoodie_file_name = public.ghcn_tmp._hoodie_file_name, 
    ID = public.ghcn_tmp.ID, 
    DATE = public.ghcn_tmp.DATE, 
    ELEMENT = public.ghcn_tmp.ELEMENT, 
    DATA_VALUE = public.ghcn_tmp.DATA_VALUE, 
    M_FLAG = public.ghcn_tmp.M_FLAG, 
    Q_FLAG = public.ghcn_tmp.Q_FLAG, 
    S_FLAG = public.ghcn_tmp.S_FLAG, 
    OBS_TIME = public.ghcn_tmp.OBS_TIME 
WHEN NOT MATCHED THEN INSERT VALUES (
    public.ghcn_tmp._hoodie_commit_time, 
    public.ghcn_tmp._hoodie_commit_seqno, 
    public.ghcn_tmp._hoodie_record_key, 
    public.ghcn_tmp._hoodie_partition_path, 
    public.ghcn_tmp._hoodie_file_name, 
    public.ghcn_tmp.ID, 
    public.ghcn_tmp.DATE, 
    public.ghcn_tmp.ELEMENT, 
    public.ghcn_tmp.DATA_VALUE, 
    public.ghcn_tmp.M_FLAG, 
    public.ghcn_tmp.Q_FLAG, 
    public.ghcn_tmp.S_FLAG, 
    public.ghcn_tmp.OBS_TIME
);

The next step is to verify the result on the Redshift side.

Validate updated records in the Redshift table

Complete the following steps to validate the updated records in the Redshift table:

  1. On the Amazon Redshift console, open Query Editor v2.
  2. Run the following query:
    SELECT * FROM "dev"."public"."ghcn" WHERE ID = 'AE000041196'

The query returns the following result set.

Now you can see that the four records have been updated with the new records from 2023. If more records arrive in the future, this approach works well to upsert new records based on the primary keys.

Load data incrementally from a Delta Lake table to Snowflake using a Delta change data feed

This tutorial uses a Delta change data feed to load data from a Delta table, and then merge the changes to Snowflake.

Ingest initial data to a Delta table

Complete the following steps:

  1. Open AWS Glue Studio.
  2. Choose ETL jobs.
  3. Choose Visual with a source and target.
  4. For Source and Target, choose Amazon S3, then choose Create.

A new visual job configuration appears. The next step is to configure the data source to read an example dataset.

  1. Name this new job delta-data-ingestion.
  2. Under Visual, choose Data source – S3 bucket.
  3. Under Node properties, for S3 source type, select S3 location.
  4. For S3 URL, enter s3://noaa-ghcn-pds/csv/by_year/2022.csv.

The data source is configured. The next step is to configure the data target to ingest data in Delta Lake format on your S3 bucket.

  1. Choose Data target – S3 bucket.
  2. Under Data target properties – S3, for Format, choose Delta Lake.
  3. For Compression Type, choose Snappy.
  4. For S3 Target location, enter s3://<Your S3 bucket name>/<Your S3 bucket prefix>/delta_incremental/ghcn/. (Provide your S3 bucket name and prefix.)
  5. For Data Catalog update options, select Do not update the Data Catalog.

Now your data integration job is authored in the visual editor completely. Let’s add an additional detail about the IAM role and job parameters, and then run the job.

  1. Under Job details, for IAM Role, choose your IAM role.
  2. Under Job parameters, for Key, enter --conf and for Value, enter spark.databricks.delta.properties.defaults.enableChangeDataFeed=true.
  3. Choose Save, then choose Run.

Load data from the Delta table to a Snowflake table

Complete the following steps to load data from the Delta table to a Snowflake table:

  1. Download the file delta2snowflake-incremental-load.ipynb.
  2. On AWS Glue Studio, choose Jupyter Notebook, then choose Create.
  3. For Job name, enter delta-ghcn-incremental-load-notebook.
  4. For IAM Role, choose your IAM role.
  5. Choose Start notebook.

Wait for the notebook to be ready.

  1. Run the first cell to start an AWS Glue interactive session.
  2. Replace the parameters with yours and run the cell under Configure your resource.
  3. Run the cell under Initialize SparkSession and GlueContext.
  4. Run the cell under Determine target time range for CDC.
  5. Run the cells under Run query to load data updated during a given timeframe.
  6. Run the cells under Merge changes into destination table.

You can see the exact query that runs right after the data is ingested into a temporary table and merged into the Snowflake table.

  1. Run the cell under Update the last query end time.

Validate initial records in the Snowflake warehouse

Run the following query in Snowflake:

SELECT * FROM "dev"."public"."ghcn" WHERE ID = 'AE000041196'

The query should return the following result set:

There are three records returned in this query.

Update and delete a record on the Delta table

Complete the following steps to update and delete a record on the Delta table as sample operations:

  1. Return to the AWS Glue notebook job.
  2. Run the cells under Update the record and Delete the record.

Load data incrementally from the Delta table to the Snowflake table

Complete the following steps to load the changes incrementally to the Snowflake table:

  1. On AWS Glue Studio, choose the job delta-ghcn-incremental-load-notebook.
  2. Run all the cells again.

When you run the cells under Run query, you will notice that there are only three records, which correspond to the update and delete operation performed in the previous step.

In the cells under Merge changes into destination table, the changes are merged into the Snowflake table. The generated MERGE query statement in the notebook is as follows:

MERGE INTO public.ghcn USING public.ghcn_tmp ON 
    public.ghcn.ID = public.ghcn_tmp.ID AND 
    public.ghcn.DATE = public.ghcn_tmp.DATE AND 
    public.ghcn.ELEMENT = public.ghcn_tmp.ELEMENT 
WHEN MATCHED AND public.ghcn_tmp._change_type = 'update_postimage' THEN UPDATE SET 
    ID = public.ghcn_tmp.ID, 
    DATE = public.ghcn_tmp.DATE, 
    ELEMENT = public.ghcn_tmp.ELEMENT, 
    DATA_VALUE = public.ghcn_tmp.DATA_VALUE, 
    M_FLAG = public.ghcn_tmp.M_FLAG, 
    Q_FLAG = public.ghcn_tmp.Q_FLAG, 
    S_FLAG = public.ghcn_tmp.S_FLAG, 
    OBS_TIME = public.ghcn_tmp.OBS_TIME, 
    _change_type = public.ghcn_tmp._change_type, 
    _commit_version = public.ghcn_tmp._commit_version, 
    _commit_timestamp = public.ghcn_tmp._commit_timestamp 
WHEN MATCHED AND public.ghcn_tmp._change_type = 'delete' THEN DELETE 
WHEN NOT MATCHED THEN INSERT VALUES (
    public.ghcn_tmp.ID, 
    public.ghcn_tmp.DATE, 
    public.ghcn_tmp.ELEMENT, 
    public.ghcn_tmp.DATA_VALUE, 
    public.ghcn_tmp.M_FLAG, 
    public.ghcn_tmp.Q_FLAG, 
    public.ghcn_tmp.S_FLAG, 
    public.ghcn_tmp.OBS_TIME, 
    public.ghcn_tmp._change_type, 
    public.ghcn_tmp._commit_version, 
    public.ghcn_tmp._commit_timestamp
);

The next step is to verify the result on the Snowflake side.

Validate updated records in the Snowflake table

Complete the following steps to validate the updated and deleted records in the Snowflake table:

  1. On Snowflake, run the following query:
    SELECT * FROM ghcn WHERE ID = 'AE000041196' AND DATE = '20221231'

The query returns the following result set:

You will notice that the query only returns two records. The value of DATA_VALUE of the record ELEMENT=PRCP has been updated from 0 to 12345. The record ELEMENT=TMAX has been deleted. This means that your update and delete operations on the source Delta table have been successfully replicated to the target Snowflake table.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the following AWS Glue jobs:
    • hudi-data-ingestion
    • hudi-ghcn-incremental-load-notebook
    • delta-data-ingestion
    • delta-ghcn-incremental-load-notebook
  2. Clean up your S3 bucket.
  3. If needed, delete the Redshift cluster or the Redshift Serverless workgroup.

Conclusion

This post discussed architecture patterns to keep a copy of your data between data lakes using open table formats and data warehouses in sync and up to date. We also discussed the benefits of incremental loading and the techniques for achieving the use case using AWS Glue. We covered two use cases: incremental load from a Hudi table to Amazon Redshift, and from a Delta table to Snowflake.


About the author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/end-to-end-development-lifecycle-for-data-engineers-to-build-a-data-integration-pipeline-using-aws-glue/

Data is a key enabler for your business. Many AWS customers have integrated their data across multiple data sources using AWS Glue, a serverless data integration service, in order to make data-driven business decisions. To grow the power of data at scale for the long term, it’s highly recommended to design an end-to-end development lifecycle for your data integration pipelines. The following are common asks from our customers:

  • Is it possible to develop and test AWS Glue data integration jobs on my local laptop?
  • Are there recommended approaches to provisioning components for data integration?
  • How can we build a continuous integration and continuous delivery (CI/CD) pipeline for our data integration pipeline?
  • What is the best practice to move from a pre-production environment to production?

To tackle these asks, this post defines the development lifecycle for data integration and demonstrates how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, using a sample baseline template.

End-to-end development lifecycle for a data integration pipeline

Today, it’s common to define not only data integration jobs but also all the data components in code. This means that you can rely on standard software best practices to build your data integration pipeline. The software development lifecycle on AWS defines the following six phases: Plan, Design, Implement, Test, Deploy, and Maintain.

In this section, we discuss each phase in the context of data integration pipeline.

Plan

In the planning phase, developers collect requirements from stakeholders such as end-users to define a data requirement. This could be what the use cases are (for example, ad hoc queries, dashboard, or troubleshooting), how much data to process (for example, 1 TB per day), what kinds of data, how many different data sources to pull from, how much data latency to accept to make it queryable (for example, 15 minutes), and so on.

Design

In the design phase, you analyze requirements and identify the best solution to build the data integration pipeline. In AWS, you need to choose the right services to achieve the goal and come up with the architecture by integrating those services and defining dependencies between components. For example, you may choose AWS Glue jobs as a core component for loading data from different sources, including Amazon Simple Storage Service (Amazon S3), then integrating them and preprocessing and enriching data. Then you may want to chain multiple AWS Glue jobs and orchestrate them. Finally, you may want to use Amazon Athena and Amazon QuickSight to present the enriched data to end-users.

Implement

In the implementation phase, data engineers code the data integration pipeline. They analyze the requirements to identify coding tasks to achieve the final result. The code includes the following:

  • AWS resource definition
  • Data integration logic

When using AWS Glue, you can define the data integration logic in a job script, which can be written in Python or Scala. You can use your preferred IDE to implement AWS resource definition using the AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation, and also the business logic of AWS Glue job scripts for data integration. To learn more about how to implement your AWS Glue job scripts locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.

Test

In the testing phase, you check the implementation for bugs. Quality analysis includes testing the code for errors and checking if it meets the requirements. Because many teams immediately test the code they write, the testing phase often runs in parallel with the development phase. There are different types of testing:

  • Unit testing
  • Integration testing
  • Performance testing

For unit testing, even for data integration, you can rely on a standard testing framework such as pytest and ScalaTest. To learn more about how to achieve unit testing locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.
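To illustrate the pattern, the following is a minimal pytest sketch for a hypothetical transformation function; filter_element is not part of the baseline template and is shown only to outline how Spark logic can be tested with a local SparkSession fixture.

import pytest
from pyspark.sql import SparkSession


def filter_element(df, element):
    # Hypothetical transform under test: keep only rows for the given ELEMENT.
    return df.filter(df["ELEMENT"] == element)


@pytest.fixture(scope="module")
def spark():
    # Local SparkSession shared across the tests in this module.
    return SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()


def test_filter_element(spark):
    df = spark.createDataFrame(
        [("AE000041196", "20220101", "TAVG", 204),
         ("AE000041196", "20220101", "PRCP", 0)],
        ["ID", "DATE", "ELEMENT", "DATA_VALUE"],
    )
    result = filter_element(df, "TAVG")
    assert result.count() == 1
    assert result.first()["DATA_VALUE"] == 204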

Deploy

When data engineers develop a data integration pipeline, you code and test on a different copy of the product than the one that the end-users have access to. The environment that end-users use is called production, whereas other copies are said to be in the development or the pre-production environment.

Having separate build and production environments ensures that you can continue to use the data integration pipeline even while it’s being changed or upgraded. The deployment phase includes several tasks to move the latest build copy to the production environment, such as packaging, environment configuration, and installation.

The following components are deployed through the AWS CDK or AWS CloudFormation:

  • AWS resources
  • Data integration job scripts for AWS Glue

AWS CodePipeline helps you to build a mechanism to automate deployments among different environments, including development, pre-production, and production. When you commit your code to AWS CodeCommit, CodePipeline automatically provisions AWS resources based on the CloudFormation templates included in the commit and uploads script files included in the commit to Amazon S3.

Maintain

Even after you deploy your solution to a production environment, it’s not the end of your project. You need to monitor the data integration pipeline continuously and keep maintaining and improving it. More specifically, you also need to fix bugs, resolve customer issues, and manage software changes. In addition, you need to monitor the overall system performance, security, and user experience to identify new ways to improve the existing data integration pipeline.

Solution overview

Typically, you have multiple accounts to manage and provision resources for your data pipeline. In this post, we assume the following three accounts:

  • Pipeline account – This hosts the end-to-end pipeline
  • Dev account – This hosts the integration pipeline in the development environment
  • Prod account – This hosts the data integration pipeline in the production environment

If you want, you can use the same account and the same Region for all three.

To start applying this end-to-end development lifecycle model to your data platform easily and quickly, we prepared the baseline template aws-glue-cdk-baseline using the AWS CDK. The template is built on top of AWS CDK v2 and CDK Pipelines. It provisions two kinds of stacks:

  • AWS Glue app stack – This provisions the data integration pipeline: one in the dev account and one in the prod account
  • Pipeline stack – This provisions the Git repository and CI/CD pipeline in the pipeline account

The AWS Glue app stack provisions the data integration pipeline, including the following resources:

  • AWS Glue jobs
  • AWS Glue job scripts

The following diagram illustrates this architecture.

At the time of publishing of this post, the AWS CDK has two versions of the AWS Glue module: @aws-cdk/aws-glue and @aws-cdk/aws-glue-alpha, containing L1 constructs and L2 constructs, respectively. The sample AWS Glue app stack is defined using aws-glue-alpha, the L2 construct for AWS Glue, because it’s straightforward to define and manage AWS Glue resources. If you want to use the L1 construct, refer to Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines.

The pipeline stack provisions the entire CI/CD pipeline, including the following resources:

The following diagram illustrates the pipeline workflow.

Every time the business requirement changes (such as adding data sources or changing data transformation logic), you make changes on the AWS Glue app stack and re-provision the stack to reflect your changes. This is done by committing your changes in the AWS CDK template to the CodeCommit repository, then CodePipeline reflects the changes on AWS resources using CloudFormation change sets.

In the following sections, we present the steps to set up the required environment and demonstrate the end-to-end development lifecycle.

Prerequisites

You need the following resources:

Initialize the project

To initialize the project, complete the following steps:

  1. Clone the baseline template to your workplace:
    $ git clone git@github.com:aws-samples/aws-glue-cdk-baseline.git
    $ cd aws-glue-cdk-baseline

  2. Create a Python virtual environment specific to the project on the client machine:
    $ python3 -m venv .venv

We use a virtual environment in order to isolate the Python environment for this project and not install software globally.

  1. Activate the virtual environment according to your OS:
    • On MacOS and Linux, use the following command:
      $ source .venv/bin/activate

    • On a Windows platform, use the following command:
      % .venv\Scripts\activate.bat

After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.

  1. Install the required dependencies described in requirements.txt to the virtual environment:
    $ pip install -r requirements.txt
    $ pip install -r requirements-dev.txt

  2. Edit the configuration file default-config.yaml based on your environments (replace each account ID with your own):
    pipelineAccount:
      awsAccountId: 123456789101
      awsRegion: us-east-1

    devAccount:
      awsAccountId: 123456789102
      awsRegion: us-east-1

    prodAccount:
      awsAccountId: 123456789103
      awsRegion: us-east-1

  3. Run pytest to initialize the snapshot test files by running the following command:
    $ python3 -m pytest --snapshot-update

Bootstrap your AWS environments

Run the following commands to bootstrap your AWS environments:

  1. In the pipeline account, replace PIPELINE-ACCOUNT-NUMBER, REGION, and PIPELINE-PROFILE with your own values:
    $ cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess

  2. In the dev account, replace PIPELINE-ACCOUNT-NUMBER, DEV-ACCOUNT-NUMBER, REGION, and DEV-PROFILE with your own values:
    $ cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    --trust <PIPELINE-ACCOUNT-NUMBER>

  3. In the prod account, replace PIPELINE-ACCOUNT-NUMBER, PROD-ACCOUNT-NUMBER, REGION, and PROD-PROFILE with your own values:
    $ cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    --trust <PIPELINE-ACCOUNT-NUMBER>

When you use only one account for all environments, you can just run the cdk bootstrap command one time.

Deploy your AWS resources

Run the command using the pipeline account to deploy the resources defined in the AWS CDK baseline template:

$ cdk deploy --profile <PIPELINE-PROFILE>

This creates the pipeline stack in the pipeline account and the AWS Glue app stack in the development account.

When the cdk deploy command is completed, let’s verify the pipeline using the pipeline account.

On the CodePipeline console, navigate to GluePipeline. Then verify that GluePipeline has the following stages: Source, Build, UpdatePipeline, Assets, DeployDev, and DeployProd. Also verify that the stages Source, Build, UpdatePipeline, Assets, DeployDev have succeeded, and DeployProd is pending. It can take about 15 minutes.

Now that the pipeline has been created successfully, you can also verify the AWS Glue app stack resource on the AWS CloudFormation console in the dev account.

At this step, the AWS Glue app stack is deployed only in the dev account. You can try to run the AWS Glue job ProcessLegislators to see how it works.

Configure your Git repository with CodeCommit

In an earlier step, you cloned the Git repository from GitHub. Although it’s possible to configure the AWS CDK template to work with GitHub, GitHub Enterprise, or Bitbucket, for this post, we use CodeCommit. If you prefer those third-party Git providers, configure the connections and edit pipeline_stack.py to define the variable source to use the target Git provider using CodePipelineSource.
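For example, a minimal sketch of pointing the pipeline source at a GitHub repository through an AWS CodeStar connection might look like the following in pipeline_stack.py; the repository, branch, and connection ARN are placeholders.

from aws_cdk import pipelines

# Use a GitHub repository through a CodeStar connection instead of CodeCommit.
# The repository name, branch, and connection ARN are placeholders.
source = pipelines.CodePipelineSource.connection(
    "your-org/aws-glue-cdk-baseline",
    "main",
    connection_arn="arn:aws:codestar-connections:us-east-1:123456789012:connection/your-connection-id",
)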

Because you already ran the cdk deploy command, the CodeCommit repository has already been created with all the required code and related files. The first step is to set up access to CodeCommit. The next step is to clone the repository from CodeCommit to your local machine. Run the following commands:

$ mkdir aws-glue-cdk-baseline-codecommit
$ cd aws-glue-cdk-baseline-codecommit
$ git clone ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/aws-glue-cdk-baseline

In the next step, we make changes in this local copy of the CodeCommit repository.

End-to-end development lifecycle

Now that the environment has been successfully created, you’re ready to start developing a data integration pipeline using this baseline template. Let’s walk through end-to-end development lifecycle.

When you want to define your own data integration pipeline, you need to add more AWS Glue jobs and implement job scripts. For this post, let’s assume the use case to add a new AWS Glue job with a new job script to read multiple S3 locations and join them.

Implement and test in your local environment

First, implement and test the AWS Glue job and its job script in your local environment using Visual Studio Code.

Set up your development environment by following the steps in Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container. The following steps are required in the context of this post:

  1. Start Docker.
  2. Pull the Docker image that has the local development environment using the AWS Glue ETL library:
    $ docker pull public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01

  3. Run the following command to define the AWS named profile name:
    $ PROFILE_NAME="<DEV-PROFILE>"

  4. Run the following command to make it available with the baseline template:
    $ cd aws-glue-cdk-baseline/
    $ WORKSPACE_LOCATION=$(pwd)

  5. Run the Docker container:
    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true \
    --rm -p 4040:4040 -p 18080:18080 \
    --name glue_pyspark public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark

  6. Start Visual Studio Code.
  7. Choose Remote Explorer in the navigation pane, then choose the arrow icon of the workspace folder in the container public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01.

If the workspace folder is not shown, choose Open folder and select /home/glue_user/workspace.

Then you will see a view similar to the following screenshot.

Optionally, you can install the AWS Toolkit for Visual Studio Code and start Amazon CodeWhisperer to enable code recommendations powered by a machine learning model. For example, in aws_glue_cdk_baseline/job_scripts/process_legislators.py, you can enter a comment like "# Write a DataFrame in Parquet format to S3", press Enter, and CodeWhisperer will recommend a code snippet similar to the following:

CodeWhisperer on Visual Studio Code

Now install the required dependencies described in requirements.txt in the container environment.

  1. Run the following commands in the terminal in Visual Studio Code:
    $ pip install -r requirements.txt
    $ pip install -r requirements-dev.txt

  2. Implement the code.

Now let’s make the required changes for a new AWS Glue job here.

  1. Edit the file aws_glue_cdk_baseline/glue_app_stack.py. Let’s add the following new code block after the existing job definition of ProcessLegislators in order to add the new AWS Glue job JoinLegislators:
            self.new_glue_job = glue.Job(self, "JoinLegislators",
                executable=glue.JobExecutable.python_etl(
                    glue_version=glue.GlueVersion.V4_0,
                    python_version=glue.PythonVersion.THREE,
                    script=glue.Code.from_asset(
                        path.join(path.dirname(__file__), "job_scripts/join_legislators.py")
                    )
                ),
                description="a new example PySpark job",
                default_arguments={
                    "--input_path_orgs": config[stage]['jobs']['JoinLegislators']['inputLocationOrgs'],
                    "--input_path_persons": config[stage]['jobs']['JoinLegislators']['inputLocationPersons'],
                    "--input_path_memberships": config[stage]['jobs']['JoinLegislators']['inputLocationMemberships']
                },
                tags={
                    "environment": self.environment,
                    "artifact_id": self.artifact_id,
                    "stack_id": self.stack_id,
                    "stack_name": self.stack_name
                }
            )

Here, you added three job parameters for different S3 locations using the variable config, which is the dictionary generated from default-config.yaml. In this baseline template, we use this central config file to manage parameters for all the AWS Glue jobs, in the structure <stage name>/jobs/<job name>/<parameter name>. In the following steps, you provide those locations through the AWS Glue job parameters.
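
As a minimal sketch (not the baseline template’s exact code), the stack can resolve these per-job values by loading default-config.yaml into a nested dictionary, for example:

# Sketch only: how per-job parameters could be resolved from default-config.yaml.
# Assumes PyYAML is available in the CDK app environment.
import yaml

with open("default-config.yaml") as f:
    config = yaml.safe_load(f)

stage = "dev"
job_params = config[stage]["jobs"]["JoinLegislators"]
print(job_params["inputLocationOrgs"])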

  2. Create a new job script called aws_glue_cdk_baseline/job_scripts/join_legislators.py:
    
    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import Join
    from awsglue.utils import getResolvedOptions
    
    
    class JoinLegislators:
        def __init__(self):
            params = []
            if '--JOB_NAME' in sys.argv:
                params.append('JOB_NAME')
                params.append('input_path_orgs')
                params.append('input_path_persons')
                params.append('input_path_memberships')
            args = getResolvedOptions(sys.argv, params)
    
            self.context = GlueContext(SparkContext.getOrCreate())
            self.job = Job(self.context)
    
            if 'JOB_NAME' in args:
                jobname = args['JOB_NAME']
                self.input_path_orgs = args['input_path_orgs']
                self.input_path_persons = args['input_path_persons']
                self.input_path_memberships = args['input_path_memberships']
            else:
                jobname = "test"
                self.input_path_orgs = "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
                self.input_path_persons = "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
                self.input_path_memberships = "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
            self.job.init(jobname, args)
        
        def run(self):
            dyf = join_legislators(self.context, self.input_path_orgs, self.input_path_persons, self.input_path_memberships)
            df = dyf.toDF()
            df.printSchema()
            df.show()
            print(df.count())
    
    def read_dynamic_frame_from_json(glue_context, path):
        return glue_context.create_dynamic_frame.from_options(
            connection_type='s3',
            connection_options={
                'paths': [path],
                'recurse': True
            },
            format='json'
        )
    
    def join_legislators(glue_context, path_orgs, path_persons, path_memberships):
        orgs = read_dynamic_frame_from_json(glue_context, path_orgs)
        persons = read_dynamic_frame_from_json(glue_context, path_persons)
        memberships = read_dynamic_frame_from_json(glue_context, path_memberships)
        orgs = orgs.drop_fields(['other_names', 'identifiers']).rename_field('id', 'org_id').rename_field('name', 'org_name')
        dynamicframe_joined = Join.apply(orgs, Join.apply(persons, memberships, 'id', 'person_id'), 'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
        return dynamicframe_joined
    
    if __name__ == '__main__':
        JoinLegislators().run()

  3. Create a new unit test script for the new AWS Glue job called aws_glue_cdk_baseline/job_scripts/tests/test_join_legislators.py:
    import pytest
    import sys
    import join_legislators
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    
    @pytest.fixture(scope="module", autouse=True)
    def glue_context():
        sys.argv.append('--JOB_NAME')
        sys.argv.append('test_count')
    
        args = getResolvedOptions(sys.argv, ['JOB_NAME'])
        context = GlueContext(SparkContext.getOrCreate())
        job = Job(context)
        job.init(args['JOB_NAME'], args)
    
        yield(context)
    
    def test_counts(glue_context):
        dyf = join_legislators.join_legislators(glue_context, 
            "s3://awsglue-datasets/examples/us-legislators/all/organizations.json",
            "s3://awsglue-datasets/examples/us-legislators/all/persons.json", 
            "s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
        assert dyf.toDF().count() == 10439

  4. In default-config.yaml, add the following under prod and dev:
     JoinLegislators:
          inputLocationOrgs: "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
          inputLocationPersons: "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
          inputLocationMemberships: "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"

  5. Add the following under "jobs" in the variable config in tests/unit/test_glue_app_stack.py, tests/unit/test_pipeline_stack.py, and tests/snapshot/test_snapshot_glue_app_stack.py (no need to replace S3 locations):
    ,
                "JoinLegislators": {
                    "inputLocationOrgs": "s3://path_to_data_orgs",
                    "inputLocationPersons": "s3://path_to_data_persons",
                    "inputLocationMemberships": "s3://path_to_data_memberships"
                }

  6. Choose Run at the top right to run the individual job scripts.

If the Run button is not shown, install the Python extension into the container through Extensions in the navigation pane.

  7. For local unit testing, run the following command in the terminal in Visual Studio Code:
    $ cd aws_glue_cdk_baseline/job_scripts/
    $ python3 -m pytest

Then you can verify that the newly added unit test passed successfully.

  8. Initialize the snapshot test files by running the following command:
    $ cd ../../
    $ python3 -m pytest --snapshot-update

Deploy to the development environment

Complete the following steps to deploy the AWS Glue app stack to the development environment and run integration tests there:

  1. Set up access to CodeCommit.
  2. Commit and push your changes to the CodeCommit repo:
    $ git add .
    $ git commit -m "Add the second Glue job"
    $ git push

You can see that the pipeline is successfully triggered.

Integration test

There is nothing extra required to run the integration test for the newly added AWS Glue job. The integration test script integ_test_glue_app_stack.py runs all the jobs that include a specific tag, then verifies each job’s state and duration. If you want to change the condition or the threshold, you can edit the assertions at the end of the integ_test_glue_job method.
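
As a hedged illustration only (the template’s actual assertions and names may differ), the kind of check made at the end of such a method typically inspects the job run state and execution time returned by the AWS Glue API:

# Illustrative sketch: the function name and threshold are examples, not the template's exact code.
def assert_job_run_succeeded(job_run: dict, max_duration_seconds: int = 3600) -> None:
    # job_run is the "JobRun" element returned by the AWS Glue GetJobRun API (for example, via boto3)
    assert job_run["JobRunState"] == "SUCCEEDED"
    assert job_run["ExecutionTime"] <= max_duration_seconds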

Deploy to the production environment

Complete the following steps to deploy the AWS Glue app stack to the production environment:

  1. On the CodePipeline console, navigate to GluePipeline.
  2. Choose Review under the DeployProd stage.
  3. Choose Approve.

Wait for the DeployProd stage to complete, then you can verify the AWS Glue app stack resource in the prod account.

Clean up

To clean up your resources, complete the following steps:

  1. Run the following command using the pipeline account:
    $ cdk destroy --profile <PIPELINE-PROFILE>

  2. Delete the AWS Glue app stack in the dev account and prod account.

Conclusion

In this post, you learned how to define the development lifecycle for data integration and how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, through a sample AWS CDK template. You can get started building your own end-to-end development lifecycle for your workload using AWS Glue.


About the author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Build data integration jobs with AI companion on AWS Glue Studio notebook powered by Amazon CodeWhisperer

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/build-data-integration-jobs-with-ai-companion-on-aws-glue-studio-notebook-powered-by-amazon-codewhisperer/

Data is essential for businesses to make informed decisions, improve operations, and innovate. Integrating data from different sources can be a complex and time-consuming process. AWS offers AWS Glue to help you integrate your data from multiple sources on serverless infrastructure for analysis, machine learning (ML), and application development. AWS Glue provides different authoring experiences for you to build data integration jobs. One of the most common options is the notebook. Data scientists tend to run queries interactively and retrieve results immediately to author data integration jobs. This interactive experience can accelerate building data integration pipelines.

Recently, AWS announced general availability of Amazon CodeWhisperer. Amazon CodeWhisperer is an AI coding companion that uses foundational models under the hood to improve developer productivity. This works by generating code suggestions in real time based on developers’ comments in natural language and prior code in their integrated development environment (IDE). AWS also announced the Amazon CodeWhisperer Jupyter extension to help Jupyter users by generating real-time, single-line, or full-function code suggestions for Python notebooks on Jupyter Lab and Amazon SageMaker Studio.

Today, we are excited to announce that AWS Glue Studio notebooks now support Amazon CodeWhisperer for AWS Glue users to improve your experience and help boost development productivity. Now, in your AWS Glue Studio notebook, you can write a comment in natural language (in English) that outlines a specific task, such as “Create a Spark DataFrame from a json file.” Based on this information, CodeWhisperer recommends one or more code snippets directly in the notebook that can accomplish the task. You can quickly accept the top suggestion, view more suggestions, or continue writing your own code.

This post demonstrates how the user experience on AWS Glue Studio notebook has been changed with the Amazon CodeWhisperer integration.

Prerequisites

Before going forward with this tutorial, you need to complete the following prerequisites:

  1. Set up AWS Glue Studio.
  2. Configure an AWS Identity and Access Management (IAM) role to interact with Amazon CodeWhisperer. Attach the following policy to your IAM role for the AWS Glue Studio notebook:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CodeWhispererPermissions",
                "Effect": "Allow",
                "Action": [
                    "codewhisperer:GenerateRecommendations"
                ],
                "Resource": "*"
            }
        ]
    }

Getting Started

Let’s get started. Create a new AWS Glue Studio notebook job by completing the following steps:

  1. On the AWS Glue console, choose Notebooks under ETL jobs in the navigation pane.
  2. Select Jupyter Notebook and choose Create.
  3. For Job name, enter codewhisperer-demo.
  4. For IAM Role, select your IAM role that you configured as a prerequisite.
  5. Choose Start notebook.

A new notebook is created with sample cells.

At the bottom, there is a menu named CodeWhisperer. By choosing this menu, you can see the shortcuts and several options, including disabling auto-suggestions.

Let’s try your first recommendation by Amazon CodeWhisperer. Note that this post contains examples of recommendations, but you may see different code snippets recommended by Amazon CodeWhisperer.

Add a new cell and enter your comment to describe what you want to achieve. After you press Enter, the recommended code is shown.

If you press Tab, the suggested code is accepted. If you press the arrow keys, you can browse other recommendations. You can learn more in User actions.

Now let’s read a JSON file from Amazon Simple Storage Service (Amazon S3). Enter the following code comment into a notebook cell and press Enter:

# Create a Spark DataFrame from a json file

CodeWhisperer will recommend a code snippet similar to the following:

def create_spark_df_from_json(spark, file_path):
    return spark.read.json(file_path)

Now use this method to utilize the suggested code snippet:

df = create_spark_df_from_json(spark, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
df.show()

The preceding code returns the following output:

+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+
|birth_date|     contact_details|death_date|family_name|gender|given_name|                  id|         identifiers|               image|              images|               links|              name|         other_names|       sort_name|
+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+
|1944-10-15|                null|      null|    Collins|  male|   Michael|0005af3a-9471-4d1...|[{C000640, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|       Mac Collins|[{bar, Mac Collin...|Collins, Michael|
|1969-01-31|[{fax, 202-226-07...|      null|   Huizenga|  male|      Bill|00aa2dc0-bfb6-441...|[{Bill Huizenga, ...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Bill Huizenga|[{da, Bill Huizen...|  Huizenga, Bill|
|1959-09-28|[{phone, 202-225-...|      null|    Clawson|  male|    Curtis|00aca284-9323-495...|[{C001102, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...|      Curt Clawson|[{bar, Curt Claws...| Clawson, Curtis|
|1930-08-14|                null|2001-10-26|    Solomon|  male|    Gerald|00b73df5-4180-441...|[{S000675, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|    Gerald Solomon|[{null, Gerald B....| Solomon, Gerald|
|1960-05-28|[{fax, 202-225-42...|      null|     Rigell|  male|    Edward|00bee44f-db04-4a7...|[{R000589, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|   E. Scott Rigell|[{null, Scott Rig...|  Rigell, Edward|
|1951-05-20|[{twitter, MikeCr...|      null|      Crapo|  male|   Michael|00f8f12d-6e27-4a2...|[{Mike Crapo, bal...|https://theunited...|[{https://theunit...|[{Wikipedia (da),...|        Mike Crapo|[{da, Mike Crapo,...|  Crapo, Michael|
|1926-05-12|                null|      null|      Hutto|  male|      Earl|015d77c8-6edb-4ed...|[{H001018, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|        Earl Hutto|[{null, Earl Dewi...|     Hutto, Earl|
|1937-11-07|                null|2015-11-19|      Ertel|  male|     Allen|01679bc3-da21-482...|[{E000208, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|       Allen Ertel|[{null, Allen E. ...|    Ertel, Allen|
|1916-09-01|                null|2007-11-24|     Minish|  male|    Joseph|018247d0-2961-423...|[{M000796, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Joseph Minish|[{bar, Joseph Min...|  Minish, Joseph|
|1957-08-04|[{phone, 202-225-...|      null|    Andrews|  male|    Robert|01b100ac-192e-4b5...|[{A000210, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...| Robert E. Andrews|[{null, Rob Andre...| Andrews, Robert|
|1957-01-10|[{fax, 202-225-57...|      null|     Walden|  male|      Greg|01bc21bf-8939-487...|[{Greg Walden, ba...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...|       Greg Walden|[{bar, Greg Walde...|    Walden, Greg|
|1919-01-17|                null|1987-11-29|      Kazen|  male|   Abraham|02059c1e-0bdf-481...|[{K000025, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|Abraham Kazen, Jr.|[{null, Abraham K...|  Kazen, Abraham|
|1960-01-11|[{fax, 202-225-67...|      null|     Turner|  male|   Michael|020aa7dd-54ef-435...|[{Michael R. Turn...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...| Michael R. Turner|[{null, Mike Turn...| Turner, Michael|
|1942-06-28|                null|      null|      Kolbe|  male|     James|02141651-eca2-4aa...|[{K000306, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|         Jim Kolbe|[{ca, Jim Kolbe, ...|    Kolbe, James|
|1941-03-08|[{fax, 202-225-79...|      null|  Lowenthal|  male|      Alan|0231c6ef-6e92-49b...|[{Alan Lowenthal,...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...| Alan S. Lowenthal|[{null, Alan Lowe...| Lowenthal, Alan|
|1952-01-09|[{fax, 202-225-93...|      null|    Capuano|  male|   Michael|0239032f-be5c-4af...|[{Michael Capuano...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|Michael E. Capuano|[{null, Mike Capu...|Capuano, Michael|
|1951-10-19|[{fax, 202-225-56...|      null|   Schrader|  male|      Kurt|0263f619-eff8-4e1...|[{Kurt Schrader, ...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Kurt Schrader|[{bar, Kurt Schra...|  Schrader, Kurt|
|1947-06-13|[{fax, 202-225-69...|      null|     Nadler|  male|   Jerrold|029e793d-ec40-4a1...|[{N000002, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|    Jerrold Nadler|[{ca, Jerrold Nad...| Nadler, Jerrold|
|1970-02-03|[{fax, 202-225-82...|      null|     Graves|  male|       Tom|02b621fc-0523-449...|[{Tom Graves, bal...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|        Tom Graves|[{bar, Tom Graves...|     Graves, Tom|
|1932-05-09|                null|      null|   McMillan|  male|      John|03018f7c-f866-419...|[{M000566, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     John McMillan|[{null, Alex McMi...|  McMillan, John|
+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+
only showing top 20 rows

As you can see from the result, you can quickly utilize the code snippet recommended by Amazon CodeWhisperer.

Examples of code recommendations

In this section, we provide additional examples of code recommendations. Note that these are just our examples, and different code snippets may be suggested by Amazon CodeWhisperer.

Add a column with a calculation

In extract, transform, and load (ETL) use cases, it’s common to derive new columns from existing columns. When we need to add columns to our Spark DataFrame, we can describe to Amazon CodeWhisperer, in detail, what type of column we need added and its respective attributes:

# Add age column to a given person DataFrame
# age is calculated from current date and birth_date. When death_date is not null, then use death_date to calculate age

Amazon CodeWhisperer will recommend a code snippet similar to the following:

def add_age_column(df):
    # Use current date to calculate age
    current_date = current_timestamp()
    # Use birth_date column to calculate age
    df = df.withColumn("age", datediff(current_date, df.birth_date) / 365)
    # Use death_date column to calculate age
    df = df.withColumn("age", when(df.death_date.isNull(), df.age).otherwise(datediff(df.death_date, df.birth_date) / 365))
    return df
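
Note that the recommended snippets in this section assume the relevant PySpark functions are already imported in the notebook; for example:

from pyspark.sql.functions import current_timestamp, datediff, desc, when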

It can be utilized as follows:

df = add_age_column(df)
df.show()

The preceding code returns the following output:

+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+------------------+--------------------+
|birth_date|     contact_details|death_date|family_name|gender|given_name|                  id|         identifiers|               image|              images|               links|              name|         other_names|       sort_name|               age|        current_date|
+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+------------------+--------------------+
|1944-10-15|                null|      null|    Collins|  male|   Michael|0005af3a-9471-4d1...|[{C000640, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|       Mac Collins|[{bar, Mac Collin...|Collins, Michael| 78.71506849315068|2023-06-14 06:12:...|
|1969-01-31|[{fax, 202-226-07...|      null|   Huizenga|  male|      Bill|00aa2dc0-bfb6-441...|[{Bill Huizenga, ...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Bill Huizenga|[{da, Bill Huizen...|  Huizenga, Bill|  54.4027397260274|2023-06-14 06:12:...|
|1959-09-28|[{phone, 202-225-...|      null|    Clawson|  male|    Curtis|00aca284-9323-495...|[{C001102, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...|      Curt Clawson|[{bar, Curt Claws...| Clawson, Curtis| 63.75342465753425|2023-06-14 06:12:...|
|1930-08-14|                null|2001-10-26|    Solomon|  male|    Gerald|00b73df5-4180-441...|[{S000675, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|    Gerald Solomon|[{null, Gerald B....| Solomon, Gerald| 71.24931506849315|2023-06-14 06:12:...|
|1960-05-28|[{fax, 202-225-42...|      null|     Rigell|  male|    Edward|00bee44f-db04-4a7...|[{R000589, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|   E. Scott Rigell|[{null, Scott Rig...|  Rigell, Edward|63.087671232876716|2023-06-14 06:12:...|
|1951-05-20|[{twitter, MikeCr...|      null|      Crapo|  male|   Michael|00f8f12d-6e27-4a2...|[{Mike Crapo, bal...|https://theunited...|[{https://theunit...|[{Wikipedia (da),...|        Mike Crapo|[{da, Mike Crapo,...|  Crapo, Michael| 72.11780821917809|2023-06-14 06:12:...|
|1926-05-12|                null|      null|      Hutto|  male|      Earl|015d77c8-6edb-4ed...|[{H001018, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|        Earl Hutto|[{null, Earl Dewi...|     Hutto, Earl| 97.15616438356165|2023-06-14 06:12:...|
|1937-11-07|                null|2015-11-19|      Ertel|  male|     Allen|01679bc3-da21-482...|[{E000208, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|       Allen Ertel|[{null, Allen E. ...|    Ertel, Allen| 78.08493150684932|2023-06-14 06:12:...|
|1916-09-01|                null|2007-11-24|     Minish|  male|    Joseph|018247d0-2961-423...|[{M000796, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Joseph Minish|[{bar, Joseph Min...|  Minish, Joseph|  91.2904109589041|2023-06-14 06:12:...|
|1957-08-04|[{phone, 202-225-...|      null|    Andrews|  male|    Robert|01b100ac-192e-4b5...|[{A000210, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...| Robert E. Andrews|[{null, Rob Andre...| Andrews, Robert|  65.9041095890411|2023-06-14 06:12:...|
|1957-01-10|[{fax, 202-225-57...|      null|     Walden|  male|      Greg|01bc21bf-8939-487...|[{Greg Walden, ba...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...|       Greg Walden|[{bar, Greg Walde...|    Walden, Greg| 66.46849315068494|2023-06-14 06:12:...|
|1919-01-17|                null|1987-11-29|      Kazen|  male|   Abraham|02059c1e-0bdf-481...|[{K000025, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|Abraham Kazen, Jr.|[{null, Abraham K...|  Kazen, Abraham| 68.91232876712328|2023-06-14 06:12:...|
|1960-01-11|[{fax, 202-225-67...|      null|     Turner|  male|   Michael|020aa7dd-54ef-435...|[{Michael R. Turn...|https://theunited...|[{https://theunit...|[{Wikipedia (comm...| Michael R. Turner|[{null, Mike Turn...| Turner, Michael|63.465753424657535|2023-06-14 06:12:...|
|1942-06-28|                null|      null|      Kolbe|  male|     James|02141651-eca2-4aa...|[{K000306, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|         Jim Kolbe|[{ca, Jim Kolbe, ...|    Kolbe, James| 81.01643835616439|2023-06-14 06:12:...|
|1941-03-08|[{fax, 202-225-79...|      null|  Lowenthal|  male|      Alan|0231c6ef-6e92-49b...|[{Alan Lowenthal,...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...| Alan S. Lowenthal|[{null, Alan Lowe...| Lowenthal, Alan| 82.32328767123288|2023-06-14 06:12:...|
|1952-01-09|[{fax, 202-225-93...|      null|    Capuano|  male|   Michael|0239032f-be5c-4af...|[{Michael Capuano...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|Michael E. Capuano|[{null, Mike Capu...|Capuano, Michael| 71.47671232876712|2023-06-14 06:12:...|
|1951-10-19|[{fax, 202-225-56...|      null|   Schrader|  male|      Kurt|0263f619-eff8-4e1...|[{Kurt Schrader, ...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Kurt Schrader|[{bar, Kurt Schra...|  Schrader, Kurt|  71.7013698630137|2023-06-14 06:12:...|
|1947-06-13|[{fax, 202-225-69...|      null|     Nadler|  male|   Jerrold|029e793d-ec40-4a1...|[{N000002, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|    Jerrold Nadler|[{ca, Jerrold Nad...| Nadler, Jerrold| 76.05479452054794|2023-06-14 06:12:...|
|1970-02-03|[{fax, 202-225-82...|      null|     Graves|  male|       Tom|02b621fc-0523-449...|[{Tom Graves, bal...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|        Tom Graves|[{bar, Tom Graves...|     Graves, Tom|53.394520547945206|2023-06-14 06:12:...|
|1932-05-09|                null|      null|   McMillan|  male|      John|03018f7c-f866-419...|[{M000566, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     John McMillan|[{null, Alex McMi...|  McMillan, John| 91.15890410958905|2023-06-14 06:12:...|
+----------+--------------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------------+------------------+--------------------+
only showing top 20 rows

Sort and extract records

You can use Amazon CodeWhisperer for sorting data and extracting records within a Spark DataFrame as well:

# Show top 5 oldest persons from DataFrame
# Use age column

Amazon CodeWhisperer will recommend a code snippet similar to the following:

def get_oldest_person(df):
    return df.orderBy(desc("age")).limit(5)

It can be utilized as follows:

get_oldest_person(df).show()

The preceding code returns the following output:

+----------+---------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+--------------------+---------------+------------------+--------------------+
|birth_date|contact_details|death_date|family_name|gender|given_name|                  id|         identifiers|               image|              images|               links|           name|         other_names|      sort_name|               age|        current_date|
+----------+---------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+--------------------+---------------+------------------+--------------------+
|1919-08-22|           null|      null|       Winn|  male|    Edward|942d20ed-d838-436...|[{W000636, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|Larry Winn, Jr.|[{null, Larry Win...|   Winn, Edward|103.88219178082191|2023-06-14 06:13:...|
|1920-03-23|           null|      null|      Smith|  male|      Neal|84a9cbe4-651b-46d...|[{S000596, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|     Neal Smith|[{null, Neal Edwa...|    Smith, Neal| 103.2958904109589|2023-06-14 06:13:...|
|1920-09-17|           null|      null|       Holt|female|  Marjorie|8bfb671a-3147-4bc...|[{H000747, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...|  Marjorie Holt|[{bar, Marjorie H...| Holt, Marjorie| 102.8082191780822|2023-06-14 06:13:...|
|1921-03-05|           null|      null|     Bedell|  male|   Berkley|896f0ce3-afe4-4ea...|[{B000298, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (de),...| Berkley Bedell|[{ca, Berkley Bed...|Bedell, Berkley|102.34520547945205|2023-06-14 06:13:...|
|1921-06-23|           null|      null|    Findley|  male|      Paul|2811f793-1108-4fb...|[{F000123, biogui...|https://theunited...|[{https://theunit...|[{Wikipedia (azb)...|   Paul Findley|[{azb, پاول فایند...|  Findley, Paul|102.04383561643836|2023-06-14 06:13:...|
+----------+---------------+----------+-----------+------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+--------------------+---------------+------------------+--------------------+

Generate sample datasets in a Spark DataFrame

Amazon CodeWhisperer is powerful enough to generate sample Spark DataFrames as well, which can be done like so:

# Generate sample Spark DataFrame of country name and country code
# First column name is country_name, and second column name is country_code

Amazon CodeWhisperer will recommend a code snippet similar to the following:

def get_country_code_df(spark):
    return spark.createDataFrame(
        [("United States", "US"), ("United Kingdom", "UK"), ("Canada", "CA")],
        ["country_name", "country_code"]
    )

It can be utilized as follows:

df = get_country_code_df(spark)
df.show()

The preceding code returns the following output:

+--------------+------------+
|  country_name|country_code|
+--------------+------------+
| United States|          US|
|United Kingdom|          UK|
|        Canada|          CA|
+--------------+------------+

Generate transformations in SQL

We can also use Amazon CodeWhisperer to create a code snippet for transformation in SQL and create a new table from the SQL query results (CTAS) like so:

# Generate CTAS query by selecting all the records in a table with grouping by a given column

Amazon CodeWhisperer will recommend a code snippet similar to the following:

def generate_ctas_query_with_group_by(table_name, group_by_col):
    ctas_query = "CREATE TABLE " + table_name + " AS SELECT * FROM " + table_name + " GROUP BY " + group_by_col
    return ctas_query
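
As generated, this query reuses the source table name and selects every column alongside a GROUP BY, which most SQL engines reject. As a hedged example of how you might adapt the suggestion in practice (the table and column names are illustrative), you could target a new table and aggregate:

def generate_ctas_query_with_group_by(source_table, target_table, group_by_col):
    # Write aggregated results into a separate target table
    return (
        f"CREATE TABLE {target_table} AS "
        f"SELECT {group_by_col}, COUNT(*) AS record_count "
        f"FROM {source_table} "
        f"GROUP BY {group_by_col}"
    )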

Conclusion

In this post, we demonstrated how AWS Glue Studio notebook integration with Amazon CodeWhisperer helps you build data integration jobs faster. This integration is available today in US East (N. Virginia). You can start using the AWS Glue Studio notebook with Amazon CodeWhisperer to accelerate building your data integration jobs. To get started with AWS Glue, visit AWS Glue.

Learn more

To learn more about using AWS Glue notebooks and Amazon CodeWhisperer, check out the following video.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Gal blog picGal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering, and BI, and is based in California. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products. In her spare time, she enjoys playing card games.

Scale your AWS Glue for Apache Spark jobs with new larger worker types G.4X and G.8X

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/scale-your-aws-glue-for-apache-spark-jobs-with-new-larger-worker-types-g-4x-and-g-8x/

Hundreds of thousands of customers use AWS Glue, a serverless data integration service, to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue for Apache Spark jobs work with your code and configuration of the number of data processing units (DPU). Each DPU provides 4 vCPU, 16 GB memory, and 64 GB disk. AWS Glue manages running Spark and adjusts workers to achieve the best price performance. For workloads such as data transforms, joins, and queries, you can use G.1X (1 DPU) and G.2X (2 DPU) workers, which offer a scalable and cost-effective way to run most jobs. With exponentially growing data sources and data lakes, customers want to run more data integration workloads, including their most demanding transforms, aggregations, joins, and queries. These workloads require higher compute, memory, and storage per worker.

Today we are pleased to announce the general availability of AWS Glue G.4X (4 DPU) and G.8X (8 DPU) workers, the next series of AWS Glue workers for the most demanding data integration workloads. G.4X and G.8X workers offer increased compute, memory, and storage, making it possible for you to vertically scale and run intensive data integration jobs, such as memory-intensive data transforms, skewed aggregations, and entity detection checks involving petabytes of data. Larger worker types not only benefit the Spark executors, but also help in cases where the Spark driver needs larger capacity, for instance, when the job’s query plan is quite large.

This post demonstrates how AWS Glue G.4X and G.8X workers help you scale your AWS Glue for Apache Spark jobs.

G.4X and G.8X workers

AWS Glue G.4X and G.8X workers give you more compute, memory, and storage to run your most demanding jobs. G.4X workers provide 4 DPU, with 16 vCPU, 64 GB memory, and 256 GB of disk per node. G.8X workers provide 8 DPU, with 32 vCPU, 128 GB memory, and 512 GB of disk per node. You can enable G.4X and G.8X workers with a single parameter change in the API, AWS Command Line Interface (AWS CLI), or visually in AWS Glue Studio. Regardless of the worker used, all AWS Glue jobs have the same capabilities, including auto scaling and interactive job authoring via notebooks. G.4X and G.8X workers are available with AWS Glue 3.0 and 4.0.

The following table shows compute, memory, disk, and Spark configurations per worker type in AWS Glue 3.0 or later.

AWS Glue Worker Type   DPU per Node   vCPU   Memory (GB)   Disk (GB)   Spark Executors per Node   Cores per Spark Executor
G.1X                   1              4      16            64          1                          4
G.2X                   2              8      32            128         1                          8
G.4X (new)             4              16     64            256         1                          16
G.8X (new)             8              32     128           512         1                          32

To use G.4X and G.8X workers on an AWS Glue job, change the setting of the worker type parameter to G.4X or G.8X. In AWS Glue Studio, you can choose G 4X or G 8X under Worker type.

In the AWS API or AWS SDK, you can specify G.4X or G.8X in the WorkerType parameter. In the AWS CLI, you can use the --worker-type parameter in a create-job command.
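
For example, a minimal AWS CLI sketch might look like the following (the job name, role ARN, and script location are placeholders):

$ aws glue create-job \
    --name my-g4x-job \
    --role <GlueJobRoleArn> \
    --glue-version 4.0 \
    --worker-type G.4X \
    --number-of-workers 10 \
    --command Name=glueetl,ScriptLocation=s3://<your-bucket>/scripts/my_script.py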

To use G.4X and G.8X on an AWS Glue Studio notebook or interactive sessions, set G.4X or G.8X in the %worker_type magic:
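
For example (the Glue version and number of workers shown here are illustrative):

%glue_version 4.0
%worker_type G.8X
%number_of_workers 10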

Performance characteristics using the TPC-DS benchmark

In this section, we use the TPC-DS benchmark to showcase performance characteristics of the new G.4X and G.8X worker types. We used AWS Glue version 4.0 jobs.

G.2X, G.4X, and G.8X results with the same number of workers

Compared to the G.2X worker type, the G.4X worker has 2 times the DPUs and the G.8X worker has 4 times the DPUs. We ran over 100 TPC-DS queries against the 3 TB TPC-DS dataset with the same number of workers but on different worker types. The following table shows the results of the benchmark.

Worker Type   Number of Workers   Number of DPUs   Duration (minutes)   Cost at $0.44/DPU-hour ($)
G.2X          30                  60               537.4                $236.46
G.4X          30                  120              264.6                $232.85
G.8X          30                  240              122.6                $215.78

When running jobs on the same number of workers, the new G.4X and G.8X workers achieved roughly linear vertical scalability.

G.2X, G.4X, and G.8X results with the same number of DPUs

We ran over 100 TPC-DS queries against the 10 TB TPC-DS dataset with the same number of DPUs but on different worker types. The following table shows the results of the experiments.

Worker Type   Number of Workers   Number of DPUs   Duration (minutes)   Cost at $0.44/DPU-hour ($)
G.2X          40                  80               1323                 $776.16
G.4X          20                  80               1191                 $698.72
G.8X          10                  80               1190                 $698.13

When running jobs on the same number of total DPUs, the job performance stayed mostly the same with new worker types.

Example: Memory-intensive transformations

Data transformations are an essential step to preprocess and structure your data into an optimal form. Some transformations, such as aggregations, joins, and custom logic in user-defined functions (UDFs), consume a large memory footprint. The new G.4X and G.8X workers enable you to run larger memory-intensive transformations at scale.

The following example reads large JSON files compressed in GZIP from an input Amazon Simple Storage Service (Amazon S3) location, performs groupBy, calculates groups based on K-means clustering using a Pandas UDF, then shows the results. Note that this UDF-based K-means is used just for illustration purposes; it’s recommended to use native K-means clustering for production purposes.
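
The following is a minimal sketch of that pattern rather than the exact job we ran: it substitutes a tiny inline DataFrame for the large JSON input, uses hypothetical column names, and assumes scikit-learn is available in the job environment.

import pandas as pd
from sklearn.cluster import KMeans
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny inline dataset standing in for the large GZIP-compressed JSON input
df = spark.createDataFrame(
    [("a", 1.0, 1.1), ("a", 1.2, 0.9), ("a", 8.0, 8.1),
     ("b", 3.0, 3.2), ("b", 9.0, 9.1), ("b", 9.2, 8.9)],
    ["group_id", "x", "y"],
)

def assign_clusters(pdf: pd.DataFrame) -> pd.DataFrame:
    # Cluster the rows of one group; this is where the memory-intensive work happens
    pdf["cluster"] = KMeans(n_clusters=2, n_init=10).fit_predict(pdf[["x", "y"]]).astype("int64")
    return pdf

# groupBy followed by a Pandas UDF (applyInPandas), then show the results
result = df.groupBy("group_id").applyInPandas(
    assign_clusters, schema="group_id string, x double, y double, cluster long"
)
result.show()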

With G.2X workers

When the AWS Glue job ran on 12 G.2X workers (24 DPU), it failed due to a No space left on device error. On the Spark UI, the Stages tab for the failed stage shows that there were multiple failed tasks in the AWS Glue job due to the error.

The Executors tab shows the failed tasks per executor.

Generally, G.2X workers can process memory-intensive workloads well. This time, we used a special Pandas UDF that consumes a significant amount of memory, and it caused a failure due to a large amount of shuffle writes.

With G.8X workers

When the same AWS Glue job ran on 3 G.8X workers (24 DPU), it succeeded without any failures, as shown on the Spark UI’s Jobs tab.

The Executors tab also shows that there were no failed tasks.

From this result, we observed that G.8X workers processed the same workload without failures.

Conclusion

In this post, we demonstrated how AWS Glue G.4X and G.8X workers can help you vertically scale your AWS Glue for Apache Spark jobs. G.4X and G.8X workers are available today in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm). You can start using the new G.4X and G.8X worker types to scale your workload from today. To get started with AWS Glue, visit AWS Glue.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Tomohiro Tanaka is a Senior Cloud Support Engineer on the AWS Support team. He’s passionate about helping customers build data lakes using ETL workloads. In his free time, he enjoys coffee breaks with his colleagues and making coffee at home.

Chuhan LiuChuhan Liu is a Software Development Engineer on the AWS Glue team. He is passionate about building scalable distributed systems for big data processing, analytics, and management. In his spare time, he enjoys playing tennis.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytic services. In his spare time, he enjoys skiing and gardening.

Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-2-glue-studio-visual-editor-introducing-native-support-for-apache-hudi-delta-lake-and-apache-iceberg-on-aws-glue-for-apache-spark/

In the first post of this series, we described how AWS Glue for Apache Spark works with Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg tables using the native support for those data lake formats. This native support simplifies reading and writing your data for these data lake frameworks so you can more easily build and maintain your data lakes in a transactionally consistent manner. This feature removes the need to install a separate connector and reduces the configuration steps required to use these frameworks in AWS Glue for Apache Spark jobs.

These data lake frameworks help you store data more efficiently and enable applications to access your data faster. Unlike simpler data file formats such as Apache Parquet, CSV, and JSON, which can store big data, data lake frameworks organize distributed big data files into tabular structures that enable basic constructs of databases on data lakes.

Expanding on the functionality we announced at AWS re:Invent 2022, AWS Glue now natively supports Hudi, Delta Lake and Iceberg through the AWS Glue Studio visual editor. If you prefer authoring AWS Glue for Apache Spark jobs using a visual tool, you can now choose any of these three data lake frameworks as a source or target through a graphical user interface (GUI) without any custom code.

Even without prior experience using Hudi, Delta Lake or Iceberg, you can easily achieve typical use cases. In this post, we demonstrate how to ingest data stored in Hudi using the AWS Glue Studio visual editor.

Example scenario

To demonstrate the visual editor experience, this post introduces the Global Historical Climatology Network Daily (GHCN-D) dataset. The data is publicly accessible through an Amazon Simple Storage Service (Amazon S3) bucket. For more information, see the Registry of Open Data on AWS. You can also learn more in Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight.

The Amazon S3 location s3://noaa-ghcn-pds/csv/by_year/ has all the observations from 1763 to the present organized in CSV files, one file for each year. The following block shows an example of what the records look like:

ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME
AE000041196,20220101,TAVG,204,H,,S,
AEM00041194,20220101,TAVG,211,H,,S,
AEM00041217,20220101,TAVG,209,H,,S,
AEM00041218,20220101,TAVG,207,H,,S,
AE000041196,20220102,TAVG,226,H,,S,
...
AE000041196,20221231,TMAX,243,,,S,
AE000041196,20221231,PRCP,0,D,,S,
AE000041196,20221231,TAVG,202,H,,S,

The records have fields including ID, DATE, ELEMENT, and more. Each combination of ID, DATE, and ELEMENT represents a unique record in this dataset. For example, the record with ID as AE000041196, ELEMENT as TAVG, and DATE as 20220101 is unique.

In this tutorial, we assume that the files are updated with new records every day, and want to store only the latest record per the primary key (ID and ELEMENT) to make the latest snapshot data queryable. One typical approach is to do an INSERT for all the historical data, and calculate the latest records in queries; however, this can introduce additional overhead in all the queries. When you want to analyze only the latest records, it’s better to do an UPSERT (update and insert) based on the primary key and DATE field rather than just an INSERT in order to avoid duplicates and maintain a single updated row of data.
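
For comparison, computing the latest snapshot at query time from an INSERT-only history typically requires a window function on every query, along the lines of the following sketch (the table name ghcn_history is hypothetical):

SELECT ID, ELEMENT, "DATE", DATA_VALUE
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ID, ELEMENT ORDER BY "DATE" DESC) AS rn
    FROM ghcn_history
) t
WHERE rn = 1;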

Prerequisites

To continue this tutorial, you need to create the following AWS resources in advance:

  • An S3 bucket (and optional prefix) for the output Hudi table
  • An AWS Identity and Access Management (IAM) role for your AWS Glue job
  • An AWS Glue database named hudi_native

Process a Hudi dataset on the AWS Glue Studio visual editor

Let’s author an AWS Glue job to read daily records in 2022, and write the latest snapshot into the Hudi table in your S3 bucket using UPSERT. Complete the following steps:

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source and Target, choose Amazon S3, then choose Create.

A new visual job configuration appears. The next step is to configure the data source to read an example dataset:

  1. Under Visual, choose Data source – S3 bucket.
  2. Under Node properties, for S3 source type, select S3 location.
  3. For S3 URL, enter s3://noaa-ghcn-pds/csv/by_year/2022.csv.

The data source is configured.


The next step is to configure the data target to ingest data in Apache Hudi on your S3 bucket:

  1. Choose Data target – S3 bucket.
  2. Under Data target properties- S3, for Format, choose Apache Hudi.
  3. For Hudi Table Name, enter ghcn.
  4. For Hudi Storage Type, choose Copy on write.
  5. For Hudi Write Operation, choose Upsert.
  6. For Hudi Record Key Fields, choose ID.
  7. For Hudi Precombine Key Field, choose DATE.
  8. For Compression Type, choose GZIP.
  9. For S3 Target location, enter s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn/. (Provide your S3 bucket name and prefix.)

To make it easy to discover the sample data, and also make it queryable from Athena, configure the job to create a table definition on the AWS Glue Data Catalog:

  1. For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
  2. For Database, choose hudi_native.
  3. For Table name, enter ghcn.
  4. For Partition keys – optional, choose ELEMENT.

Now your data integration job is authored in the visual editor completely. Let’s add one remaining setting about the IAM role, then run the job:

  1. Under Job details, for IAM Role, choose your IAM role.
  2. Choose Save, then choose Run.


  1. Navigate to the Runs tab to track the job progress and wait for it to complete.


Query the table with Athena

Now that the job has successfully created the Hudi table, you can query the table through different engines, including Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, in addition to AWS Glue for Apache Spark.

To query through Athena, complete the following steps:

  1. On the Athena console, open the query editor.
  2. In the query editor, enter the following SQL and choose Run:
SELECT * FROM "hudi_native"."ghcn" limit 10;

The following screenshot shows the query result.

Let’s dive deep into the table to understand how the data is ingested and focus on the records with ID=’AE000041196′.

  1. Run the following query to focus on the very specific example records with ID='AE000041196':
SELECT * FROM "hudi_native"."ghcn" WHERE ID='AE000041196';

The following screenshot shows the query result.

The original source file 2022.csv has historical records for record ID='AE000041196' from 20220101 to 20221231; however, the query result shows only four records, one record per ELEMENT, at the latest snapshot of the day (20221230 or 20221231). Because we used the UPSERT write option when writing data, we configured the ID field as the Hudi record key field, the DATE field as the Hudi precombine field, and the ELEMENT field as the partition key field. When two records have the same key value, Hudi picks the one with the largest value for the precombine field. When the job ingested data, it compared all the values in the DATE field for each pair of ID and ELEMENT, and then picked the record with the largest value in the DATE field.

According to the preceding result, we were able to ingest the latest snapshot from all the 2022 data. Now let’s do an UPSERT of the new 2023 data to overwrite the records on the target Hudi table.

  1. Go back to AWS Glue Studio console, modify the source S3 location to s3://noaa-ghcn-pds/csv/by_year/2023.csv, then save and run the job.


  1. Run the same Athena query from the Athena console.

Now you see that the four records have been updated with the new records in 2023.

If you receive more records in the future, this approach works well to upsert the new records based on the Hudi record key and Hudi precombine key.

Clean up

Now to the final step, cleaning up the resources:

  1. Delete the AWS Glue database hudi_native.
  2. Delete the AWS Glue table ghcn.
  3. Delete the S3 objects under s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn/.

Conclusion

This post demonstrated how to process Hudi datasets using the AWS Glue Studio visual editor. The AWS Glue Studio visual editor enables you to author jobs while taking advantage of data lake formats and without needing expertise in them. If you have comments or feedback, please feel free to leave them in the comments.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Scott Long is a Front End Engineer on the AWS Glue team. He is responsible for implementing new features in AWS Glue Studio. In his spare time, he enjoys socializing with friends and participating in various outdoor activities.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18+ year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Introducing native Delta Lake table support with AWS Glue crawlers

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/

Delta Lake is an open-source project that helps implement modern data lake architectures commonly built on Amazon S3 or other cloud storages. With Delta Lake, you can achieve ACID transactions, time travel queries, CDC, and other common use cases on the cloud. Delta Lake is available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum.

AWS Glue includes the Delta crawler, a capability that makes discovering datasets simpler by scanning Delta Lake transaction logs in Amazon Simple Storage Service (Amazon S3), extracting their schema, creating manifest files in Amazon S3, and automatically populating the AWS Glue Data Catalog, which keeps the metadata current. The newly created AWS Glue Data Catalog table uses the SymlinkTextInputFormat format. The Delta crawler creates a manifest file, which is a text file containing the list of data files that query engines such as Presto, Trino, or Athena can use to query the table rather than finding the files with a directory listing. A previous blog post demonstrated how it works. Manifest files needed to be regenerated on a periodic basis to include newer transactions in the original Delta Lake tables, which resulted in expensive I/O operations, longer processing times, and an increased storage footprint.

With today’s launch, the AWS Glue crawler adds support for creating AWS Glue Data Catalog tables for native Delta Lake tables and does not require generating manifest files. This improves the customer experience because you no longer have to regenerate manifest files whenever a new partition becomes available or a table’s metadata changes. With native Delta Lake tables and automatic schema evolution requiring no additional manual intervention, this reduces the time to insight by making newly ingested data quickly available for analysis with your preferred analytics and machine learning (ML) tools.

Amazon Athena SQL engine version 3 started supporting the Delta Lake native connector. AWS Glue for Apache Spark also supports the Delta Lake native connector in Glue version 3.0 and later, and Amazon EMR supports Delta Lake in release version 6.9.0 and later. This means that you can query the Delta transaction log directly in Amazon Athena, AWS Glue for Apache Spark, and Amazon EMR, which makes the experience of working with native Delta Lake tables seamless across these platforms.
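
As a brief, hedged sketch of what this looks like from an AWS Glue for Apache Spark job (assuming Delta Lake support is enabled for the job, for example via the --datalake-formats job parameter and any required Spark configuration, and that the S3 path points at a Delta Lake table):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Read the native Delta Lake table directly from its transaction log (no manifest files needed)
spark = GlueContext(SparkContext.getOrCreate()).spark_session
df = spark.read.format("delta").load("s3://your_s3_bucket/data/sample_delta_table/")
df.show()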

This post demonstrates how AWS Glue crawlers work with native Delta Lake tables and describes typical use cases to query native Delta Lake tables.

How AWS Glue crawler works with native Delta Lake tables

Now AWS Glue crawler has two different options:

  • Native table: Create a native Delta Lake table definition on AWS Glue Data Catalog.
  • Symlink table: Create a symlink-based manifest table definition on AWS Glue Data Catalog from a Delta Lake table, and generate its symlink files on Amazon S3.

Native table

Native Delta Lake tables are accessible from Amazon Athena (engine version 3), AWS Glue for Apache Spark (Glue version 3.0 and later), Amazon EMR (release version 6.9.0 and later), and other platforms that support Delta Lake tables. With native Delta Lake tables, you get capabilities such as ACID transactions while maintaining just a single source of truth.

Symlink table

Symlink tables are a consistent snapshot of a native Delta Lake table, represented using SymlinkTextInputFormat over Parquet files. The symlink tables are accessible from Amazon Athena and Amazon Redshift Spectrum.

Because the symlink tables are a snapshot of the original native Delta Lake tables, you need to maintain both the original native Delta Lake tables and the symlink tables. When the data or schema in an original Delta Lake table is updated, the symlink tables in the AWS Glue Data Catalog may become out of sync. This means that you can still query the symlink table and get a consistent result, but the result reflects the previous point in time.

Crawl native Delta Lake tables using AWS Glue crawler

In this section, let’s go through how to crawl native Delta Lake tables using AWS Glue crawler.

Prerequisites

Here are the prerequisites for this tutorial:

  1. Install and configure AWS Command Line Interface (AWS CLI).
  2. Create your S3 bucket if you do not have it.
  3. Create your IAM role for AWS Glue crawler if you do not have it.
  4. Run the following command to copy the sample Delta Lake table into your S3 bucket. (Replace your_s3_bucket with your S3 bucket name.)
$ aws s3 sync s3://aws-bigdata-blog/artifacts/delta-lake-crawler/sample_delta_table/ s3://your_s3_bucket/data/sample_delta_table

Create a Delta Lake crawler

A Delta Lake crawler can be created through the AWS Glue console, AWS Glue SDK, or AWS CLI (see the CLI sketch after the following list). Specify a DeltaTarget with the following configurations:

  • DeltaTables – A list of S3 DeltaPaths where the Delta Lake tables are located. (Note that each path must be the parent of a _delta_log folder. If the Delta transaction log is located at s3://bucket/sample_delta_table/_delta_log, then the path s3://bucket/sample_delta_table/ should be provided.)
  • WriteManifest – A Boolean value indicating whether or not the crawler should write the manifest files for each DeltaPath. This parameter is only applicable for Delta Lake tables created via manifest files.
  • CreateNativeDeltaTable – A Boolean value indicating whether the crawler should create a native Delta Lake table. If set to False, the crawler would create a symlink table instead. Note that both WriteManifest and CreateNativeDeltaTable options can’t be set to True.
  • ConnectionName – An optional connection name stored in the Data Catalog that the crawler should use to access Delta Lake tables backed by a VPC.
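
As a hedged AWS CLI sketch of the same configuration (the role ARN and bucket name are placeholders; the console steps below achieve the same result):

$ aws glue create-crawler \
    --name delta-lake-native-crawler \
    --role <GlueCrawlerRoleArn> \
    --database-name delta_lake_native \
    --targets '{"DeltaTargets": [{"DeltaTables": ["s3://your_s3_bucket/data/sample_delta_table/"], "WriteManifest": false, "CreateNativeDeltaTable": true}]}'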

In this instruction, create the crawler through the console. Complete the following steps to create a Delta Lake crawler:

  1. Open the AWS Glue console.
  2. Choose Crawlers.
  3. Choose Create crawler.
  4. For Name, enter delta-lake-native-crawler, and choose Next.
  5. Under Data sources, choose Add a data source.
  6. For Data source, select Delta Lake.
  7. For Include delta lake table path(s), enter s3://your_s3_bucket/data/sample_delta_table/.
  8. For Create tables for querying, choose Create Native tables.
  9. Choose Add a Delta Lake data source.
  10. Choose Next.
  11. For Existing IAM role, choose your IAM role, then choose Next.
  12. For Target database, choose Add database; the Add database dialog appears. For Database name, enter delta_lake_native, then choose Create. Choose Next.
  13. Choose Create crawler.
  14. The Delta Lake crawler can be triggered to run through the console, or through the SDK or AWS CLI using the StartCrawler API (a boto3 sketch follows these steps). It can also be scheduled through the console to run at specific times. In this tutorial, run the crawler through the console.
  15. Select delta-lake-native-crawler, and choose Run.
  16. Wait for the crawler to complete.
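If you prefer to trigger the crawler programmatically, a minimal boto3 sketch of the StartCrawler call looks like the following; the 30-second polling interval is an arbitrary choice.

import time
import boto3

glue = boto3.client("glue")

# Minimal sketch: start the crawler created earlier and wait until it returns
# to the READY state.
glue.start_crawler(Name="delta-lake-native-crawler")
while glue.get_crawler(Name="delta-lake-native-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)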

After the crawler has run, you can see the Delta Lake table definition in the AWS Glue console:

You can also verify an AWS Glue table definition through the following AWS CLI command:

$ aws glue get-table --database delta_lake_native --name sample_delta_table
{
    "Table": {
        "Name": "sample_delta_table",
        "DatabaseName": "delta_lake_native",
        "Owner": "owner",
        "CreateTime": "2022-11-08T12:11:20+09:00",
        "UpdateTime": "2022-11-08T13:19:06+09:00",
        "LastAccessTime": "2022-11-08T13:19:06+09:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "product_id",
                    "Type": "string"
                },
                {
                    "Name": "product_name",
                    "Type": "string"
                },
                {
                    "Name": "price",
                    "Type": "bigint"
                },
                {
                    "Name": "currency",
                    "Type": "string"
                },
                {
                    "Name": "category",
                    "Type": "string"
                },
                {
                    "Name": "updated_at",
                    "Type": "double"
                }
            ],
            "Location": "s3://your_s3_bucket/data/sample_delta_table/",
            "AdditionalLocations": [],
            "InputFormat": "org.apache.hadoop.mapred.SequenceFileInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat",
            "Compressed": false,
            "NumberOfBuckets": -1,
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {
                    "serialization.format": "1",
                    "path": "s3://your_s3_bucket/data/sample_delta_table/"
                }
            },
            "BucketColumns": [],
            "SortColumns": [],
            "Parameters": {
                "EXTERNAL": "true",
                "UPDATED_BY_CRAWLER": "delta-lake-native-connector",
                "spark.sql.sources.schema.part.0": "{\"type\":\"struct\",\"fields\":[{\"name\":\"product_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"product_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"CURRENCY\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"category\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"updated_at\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}",
                "CrawlerSchemaSerializerVersion": "1.0",
                "CrawlerSchemaDeserializerVersion": "1.0",
                "spark.sql.partitionProvider": "catalog",
                "classification": "delta",
                "spark.sql.sources.schema.numParts": "1",
                "spark.sql.sources.provider": "delta",
                "delta.lastCommitTimestamp": "1653462383292",
                "delta.lastUpdateVersion": "6",
                "table_type": "delta"
            },
            "StoredAsSubDirectories": false
        },
        "PartitionKeys": [],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "EXTERNAL": "true",
            "UPDATED_BY_CRAWLER": "delta-lake-native-connector",
            "spark.sql.sources.schema.part.0": "{\"type\":\"struct\",\"fields\":[{\"name\":\"product_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"product_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"CURRENCY\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"category\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"updated_at\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}",
            "CrawlerSchemaSerializerVersion": "1.0",
            "CrawlerSchemaDeserializerVersion": "1.0",
            "spark.sql.partitionProvider": "catalog",
            "classification": "delta",
            "spark.sql.sources.schema.numParts": "1",
            "spark.sql.sources.provider": "delta",
            "delta.lastCommitTimestamp": "1653462383292",
            "delta.lastUpdateVersion": "6",
            "table_type": "delta"
        },
        "CreatedBy": "arn:aws:sts::012345678901:assumed-role/AWSGlueServiceRole/AWS-Crawler",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "012345678901",
        "IsRowFilteringEnabled": false,
        "VersionId": "1",
        "DatabaseId": "0bd458e335a2402c828108f267bc770c"
    }
}

After you create the table definition on AWS Glue Data Catalog, AWS analytics services such as Athena and AWS Glue Spark jobs are able to query the Delta Lake table.

Query Delta Lake tables using Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run on datasets at petabyte scale. You can use Athena to query your S3 data lake for use cases such as data exploration for machine learning (ML) and AI, business intelligence (BI) reporting, and ad hoc querying.

There are now two ways to use Delta Lake tables in Athena:

  • For native table: Use Athena’s newly launched native support for Delta Lake tables. You can learn more in Querying Delta Lake tables. This method no longer requires regenerating manifest files after every transaction. Data updates are available for queries in Athena as soon as they are performed in the original Delta Lake tables, and you get up to 40 percent improvement in query performance over querying manifest files. Since Athena optimizes data scans in native Delta Lake queries using statistics in Delta Lake files, you get the advantage of reduced cost for Athena queries. This post focuses on this approach.
  • For symlink table: Use SymlinkTextInputFormat to query symlink tables through manifest files generated from Delta Lake tables. This was previously the only way to query Delta Lake tables from Athena, and it is no longer recommended when you use only Athena to query your Delta Lake tables.

To use the native Delta Lake connector in Athena, you need to use Athena engine version 3. If you are using an older engine version, change the engine version.

Complete the following steps to run queries in Athena:

  1. Open the Amazon Athena console.
  2. Run the following query.
SELECT * FROM "delta_lake_native"."sample_delta_table" limit 10;

The following screenshot shows our output:

Query Delta Lake tables using AWS Glue for Apache Spark

AWS Glue for Apache Spark natively supports Delta Lake. AWS Glue version 3.0 (Apache Spark 3.1.1) supports Delta Lake 1.0.0, and AWS Glue version 4.0 (Apache Spark 3.3.0) supports Delta Lake 2.1.0. With this native support, all you need to do to configure Delta Lake is provide a single job parameter, --datalake-formats delta. There is no need to configure a separate connector for Delta Lake from AWS Marketplace, which reduces the configuration steps required to use the framework in AWS Glue for Apache Spark.
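For reference, with the --datalake-formats delta job parameter set, you can also read the table directly from its S3 path using the Spark DataFrame API. The following is a minimal sketch; the S3 path is the sample table location used earlier in this post.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Minimal sketch: read a Delta Lake table from Amazon S3 in an AWS Glue job
# or notebook where the --datalake-formats delta job parameter is set.
spark = GlueContext(SparkContext.getOrCreate()).spark_session
df = spark.read.format("delta").load("s3://your_s3_bucket/data/sample_delta_table/")
df.show(10)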

AWS Glue also provides a serverless notebook interface called AWS Glue Studio notebook to query and process data interactively. Complete the following steps to launch AWS Glue Studio notebook and query a Delta Lake table:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Under Create job, select Jupyter Notebook.
  3. Choose Create a new notebook from scratch, and choose Create.
  4. For Job name, enter delta-sql.
  5. For IAM role, choose your IAM role. If you don’t have your own role for the AWS Glue job, create it by following the steps documented in the AWS Glue Developer Guide.
  6. Choose Start notebook job.
  7. Copy and paste the following code to the first cell and run the cell.
    %glue_version 3.0
    %%configure
    {
      "--datalake-formats": "delta"
    }

  8. Run the existing cell containing the following code.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
      
    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)

  9. Copy and paste the following code to the third cell and run the cell.
    %%sql
    SELECT * FROM `delta_lake_native`.`sample_delta_table` limit 10

The following screenshot shows our output:

Clean up

The final step is to clean up the resources:

  • Delete your data under your S3 path: s3://your_s3_bucket/data/sample_delta_table/.
  • Delete the AWS Glue crawler delta-lake-native-crawler.
  • Delete the AWS Glue database delta_lake_native.
  • Delete the AWS Glue notebook job delta-sql.

Conclusion

This post demonstrated how to crawl native Delta Lake tables using an AWS Glue crawler and how to query the crawled tables from Athena and AWS Glue Spark jobs. Start using AWS Glue crawlers for your own native Delta Lake tables.

If you have comments or feedback, please feel free to leave them in the comments.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Kyle Duong is a Software Development Engineer on the AWS Glue and Lake Formation team. He is passionate about building big data technologies and distributed systems. In his free time, he enjoys cycling or playing basketball.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Introducing the Cloud Shuffle Storage Plugin for Apache Spark

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. In AWS Glue, you can use Apache Spark, an open-source, distributed processing system for your data integration tasks and big data workloads.

Apache Spark utilizes in-memory caching and optimized query execution for fast analytic queries against your datasets, which are split into multiple Spark partitions on different nodes so that you can process a large amount of data in parallel. In Apache Spark, shuffling happens when data needs to be redistributed across the cluster. During a shuffle, data is written to local disk and transferred across the network. The shuffle operation is often constrained by the available local disk capacity or by data skew, which can cause straggling executors. Spark often throws a No space left on device or MetadataFetchFailedException error when there isn’t enough disk space left on the executor and there is no way to recover. Such Spark jobs typically can’t succeed without adding additional compute and attached storage, wherein the compute is often idle, which results in additional cost.

In 2021, we launched Amazon S3 shuffle for AWS Glue 2.0 with Spark 2.4. This feature disaggregated Spark compute and shuffle storage by utilizing Amazon Simple Storage Service (Amazon S3) to store Spark shuffle files. Using Amazon S3 for Spark shuffle storage enabled you to run data-intensive workloads more reliably. After the launch, we continued investing in this area, and collected customer feedback.

Today, we’re pleased to release Cloud Shuffle Storage Plugin for Apache Spark. It supports the latest Apache Spark 3.x distribution so you can take advantage of the plugin in AWS Glue or any other Spark environments. It’s now also natively available to use in AWS Glue Spark jobs on AWS Glue 3.0 and the latest AWS Glue version 4.0 without requiring any extra setup or bringing external libraries. Like the Amazon S3 shuffle for AWS Glue 2.0, the Cloud Shuffle Storage Plugin helps you solve constrained disk space errors during shuffle in serverless Spark environments.

We’re also excited to announce the release of software binaries for the Cloud Shuffle Storage Plugin for Apache Spark under the Apache 2.0 license. You can download the binaries and run them on any Spark environment. The new plugin is open-cloud, comes with out-of-the box support for Amazon S3, and can be easily configured to use other forms of cloud storage such as Google Cloud Storage and Microsoft Azure Blob Storage.

Understanding a shuffle operation in Apache Spark

In Apache Spark, there are two types of transformations:

  • Narrow transformation – This includes map, filter, union, and mapPartition, where each input partition contributes to only one output partition.
  • Wide transformation – This includes join, groupByKey, reduceByKey, and repartition, where each input partition contributes to many output partitions. Spark SQL queries that include JOIN, ORDER BY, or GROUP BY require wide transformations.

A wide transformation triggers a shuffle, which occurs whenever data is reorganized into new partitions with each key assigned to one of them. During a shuffle phase, all Spark map tasks write shuffle data to a local disk that is then transferred across the network and fetched by Spark reduce tasks. The volume of data shuffled is visible in the Spark UI. When shuffle writes take up more space than the local available disk capacity, it causes a No space left on device error.
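As a simple illustration, the following PySpark sketch runs a wide transformation (a groupBy aggregation) that triggers a shuffle; the resulting shuffle write volume appears in the Spark UI. The output path is a placeholder.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch: a groupBy aggregation forces a shuffle because rows with the
# same key must be brought together on the same partition.
spark = SparkSession.builder.appName("shuffle-example").getOrCreate()

df = spark.range(0, 100_000_000)
result = df.groupBy((F.col("id") % 1000).alias("key")).count()
result.write.mode("overwrite").parquet("s3://your_s3_bucket/tmp/shuffle-demo/")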

To illustrate one of the typical scenarios, let’s use the query q80.sql from the standard TPC-DS 3 TB dataset as an example. This query attempts to calculate the total sales, returns, and eventual profit realized during a specific time frame. It involves multiple wide transformations (shuffles) caused by left outer join and group by.

Let’s run the following query on AWS Glue 3.0 job with 10 G1.X workers where a total of 640GB of local disk space is available:

with ssr as
 (select  s_store_id as store_id,
          sum(ss_ext_sales_price) as sales,
          sum(coalesce(sr_return_amt, 0)) as returns,
          sum(ss_net_profit - coalesce(sr_net_loss, 0)) as profit
  from store_sales left outer join store_returns on
         (ss_item_sk = sr_item_sk and ss_ticket_number = sr_ticket_number),
     date_dim, store, item, promotion
 where ss_sold_date_sk = d_date_sk
       and d_date between cast('2000-08-23' as date)
                  and (cast('2000-08-23' as date) + interval '30' day)
       and ss_store_sk = s_store_sk
       and ss_item_sk = i_item_sk
       and i_current_price > 50
       and ss_promo_sk = p_promo_sk
       and p_channel_tv = 'N'
 group by s_store_id),
 csr as
 (select  cp_catalog_page_id as catalog_page_id,
          sum(cs_ext_sales_price) as sales,
          sum(coalesce(cr_return_amount, 0)) as returns,
          sum(cs_net_profit - coalesce(cr_net_loss, 0)) as profit
  from catalog_sales left outer join catalog_returns on
         (cs_item_sk = cr_item_sk and cs_order_number = cr_order_number),
     date_dim, catalog_page, item, promotion
 where cs_sold_date_sk = d_date_sk
       and d_date between cast('2000-08-23' as date)
                  and (cast('2000-08-23' as date) + interval '30' day)
        and cs_catalog_page_sk = cp_catalog_page_sk
       and cs_item_sk = i_item_sk
       and i_current_price > 50
       and cs_promo_sk = p_promo_sk
       and p_channel_tv = 'N'
 group by cp_catalog_page_id),
 wsr as
 (select  web_site_id,
          sum(ws_ext_sales_price) as sales,
          sum(coalesce(wr_return_amt, 0)) as returns,
          sum(ws_net_profit - coalesce(wr_net_loss, 0)) as profit
  from web_sales left outer join web_returns on
         (ws_item_sk = wr_item_sk and ws_order_number = wr_order_number),
     date_dim, web_site, item, promotion
 where ws_sold_date_sk = d_date_sk
       and d_date between cast('2000-08-23' as date)
                  and (cast('2000-08-23' as date) + interval '30' day)
        and ws_web_site_sk = web_site_sk
       and ws_item_sk = i_item_sk
       and i_current_price > 50
       and ws_promo_sk = p_promo_sk
       and p_channel_tv = 'N'
 group by web_site_id)
 select channel, id, sum(sales) as sales, sum(returns) as returns, sum(profit) as profit
 from (select
        'store channel' as channel, concat('store', store_id) as id, sales, returns, profit
      from ssr
      union all
      select
        'catalog channel' as channel, concat('catalog_page', catalog_page_id) as id,
        sales, returns, profit
      from csr
      union all
      select
        'web channel' as channel, concat('web_site', web_site_id) as id, sales, returns, profit
      from  wsr) x
 group by rollup (channel, id)
 order by channel, id

The following screenshot shows the Executor tab in the Spark UI.
Spark UI Executor Tab

The following screenshot shows the status of Spark jobs included in the AWS Glue job run.
Spark UI Jobs
In the failed Spark job (job ID=7), we can see the failed Spark stage in the Spark UI.
Spark UI Failed stage
There was 167.8 GiB of shuffle write during the stage, and 14 tasks failed due to the error java.io.IOException: No space left on device because the host 172.34.97.212 ran out of local disk.
Spark UI Tasks

Cloud Shuffle Storage for Apache Spark

Cloud Shuffle Storage for Apache Spark allows you to store Spark shuffle files on Amazon S3 or other cloud storage services. This gives complete elasticity to Spark jobs, thereby allowing you to run your most data-intensive workloads reliably. The following figure illustrates how Spark map tasks write the shuffle files to the Cloud Shuffle Storage. Reducer tasks consider the shuffle blocks as remote blocks and read them from the same shuffle storage.

This architecture enables your serverless Spark jobs to use Amazon S3 without the overhead of running, operating, and maintaining additional storage or compute nodes.
Chopper diagram
The following Glue job parameters enable and tune Spark to use S3 buckets for storing shuffle data. You can also enable at-rest encryption when writing shuffle data to Amazon S3 by using security configuration settings.

  • Key: --write-shuffle-files-to-s3, Value: TRUE – This is the main flag, which tells Spark to use S3 buckets for writing and reading shuffle data.
  • Key: --conf, Value: spark.shuffle.storage.path=s3://<shuffle-bucket> – This is optional, and specifies the S3 bucket where the plugin writes the shuffle files. By default, the plugin uses --TempDir/shuffle-data.
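If you create the job programmatically, the same parameters go into the job’s default arguments. The following boto3 sketch is for illustration only; the job name, role, script location, and shuffle bucket are placeholders.

import boto3

glue = boto3.client("glue")

# Minimal sketch: create an AWS Glue 3.0 Spark job with Cloud Shuffle Storage
# enabled. All names, ARNs, and paths below are placeholders.
glue.create_job(
    Name="tpcds-q80-s3-shuffle",
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your_s3_bucket/scripts/q80.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    WorkerType="G.1X",
    NumberOfWorkers=10,
    DefaultArguments={
        "--write-shuffle-files-to-s3": "true",
        "--conf": "spark.shuffle.storage.path=s3://your_s3_bucket/shuffle-data/",
    },
)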

The shuffle files are written to that location with names such as the following:

s3://<shuffle-storage-path>/<Spark application ID>/[0-9]/<Shuffle ID>/shuffle_<Shuffle ID>_<Mapper ID>_0.data

With the Cloud Shuffle Storage plugin enabled and using the same AWS Glue job setup, the TPC-DS query now succeeded without any job or stage failures.
Spark UI Jobs with Chopper plugin

Software binaries for the Cloud Shuffle Storage Plugin

You can now also download and use the plugin in your own Spark environments and with other cloud storage services. The plugin binaries are available for use under the Apache 2.0 license.

Bundle the plugin with your Spark applications

You can bundle the plugin with your Spark applications by adding it as a dependency in your Maven pom.xml as you develop your Spark applications, as shown in the following code. For more details on the plugin and Spark versions, refer to Plugin versions.

<repositories>
   ...
    <repository>
        <id>aws-glue-etl-artifacts</id>
        <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
    </repository>
</repositories>
...
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>chopper-plugin</artifactId>
    <version>3.1-amzn-LATEST</version>
</dependency>

You can alternatively download the binaries from AWS Glue Maven artifacts directly and include them in your Spark application as follows:

#!/bin/bash
sudo wget -v https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/chopper-plugin/3.1-amzn-LATEST/chopper-plugin-3.1-amzn-LATEST.jar -P /usr/lib/spark/jars

Submit the Spark application by including the JAR files on the classpath and specifying the two Spark configs for the plugin:

spark-submit --deploy-mode cluster \
--conf spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin \
--conf spark.shuffle.storage.path=s3://<s3 bucket>/<shuffle-dir> \
 --class <your class> <your application jar> 

The following Spark parameters enable and configure Spark to use an external storage URI such as Amazon S3 for storing shuffle files; the URI protocol determines which storage system to use.

  • Key: spark.shuffle.storage.path, Value: s3://<shuffle-storage-path> – Specifies a URI where the shuffle files are stored, which must be a valid Hadoop FileSystem and configured as needed.
  • Key: spark.shuffle.sort.io.plugin.class, Value: com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin – The entry class of the plugin.
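If you build the SparkSession yourself instead of passing spark-submit flags, you can set the same two configs in code. A minimal PySpark sketch, assuming the plugin JAR is already on the Spark classpath:

from pyspark.sql import SparkSession

# Minimal sketch: configure the Cloud Shuffle Storage Plugin on a SparkSession.
# Assumes the plugin JAR has already been added to the classpath.
spark = (
    SparkSession.builder
    .appName("cloud-shuffle-storage-example")
    .config("spark.shuffle.sort.io.plugin.class",
            "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
    .config("spark.shuffle.storage.path", "s3://your_s3_bucket/shuffle-dir/")
    .getOrCreate()
)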

Other cloud storage integration

This plugin comes with out-of-the-box support for Amazon S3 and can also be configured to use other forms of cloud storage such as Google Cloud Storage and Microsoft Azure Blob Storage. To enable other Hadoop FileSystem-compatible cloud storage services, add a storage URI with the corresponding service scheme (such as gs:// for Google Cloud Storage instead of s3:// for Amazon S3), add the FileSystem JAR files for the service, and set the appropriate authentication configurations.

For more information about how to integrate the plugin with Google Cloud Storage and Microsoft Azure Blob Storage, refer to Using AWS Glue Cloud Shuffle Plugin for Apache Spark with Other Cloud Storage Services.

Best practices and considerations

Note the following considerations:

  • This feature replaces local shuffle storage with Amazon S3. You can use it to address common failures with price/performance benefits for your serverless analytics jobs and pipelines. We recommend enabling this feature when you want to ensure reliable runs of your data-intensive workloads that create a large amount of shuffle data, or when you’re getting a No space left on device error. You can also use this plugin if your job encounters fetch failures (org.apache.spark.shuffle.MetadataFetchFailedException) or if your data is skewed.
  • We recommend setting S3 bucket lifecycle policies on the shuffle bucket (spark.shuffle.storage.path) in order to clean up old shuffle data automatically (see the boto3 sketch after this list).
  • The shuffle data on Amazon S3 is encrypted by default. You can also encrypt the data with your own AWS Key Management Service (AWS KMS) keys.
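The following boto3 sketch shows one way to set such a lifecycle rule; the bucket name, prefix, and seven-day retention are assumptions you should align with your own spark.shuffle.storage.path setting.

import boto3

s3 = boto3.client("s3")

# Minimal sketch: automatically expire shuffle objects after 7 days.
# Bucket, prefix, and retention period are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="your_s3_bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-spark-shuffle-data",
                "Filter": {"Prefix": "shuffle-data/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)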

Conclusion

This post introduced the new Cloud Shuffle Storage Plugin for Apache Spark and described its benefits to independently scale storage in your Spark jobs without adding additional workers. With this plugin, you can expect jobs processing terabytes of data to run much more reliably.

The plugin is available in AWS Glue 3.0 and 4.0 Spark jobs in all AWS Glue supported Regions. We’re also releasing the plugin’s software binaries under the Apache 2.0 license. You can use the plugin in AWS Glue or other Spark environments. We look forward to hearing your feedback.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts that help customers build data lakes on the cloud.

Rajendra Gujja is a Senior Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and everything and anything about the data.

Chuhan Liu is a Software Development Engineer on the AWS Glue team.

Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems to enable customers with data integration and connectivity to a variety of sources, efficiently manage data lakes on Amazon S3, and optimize Apache Spark for fault tolerance with ETL workloads.

Process Apache Hudi, Delta Lake, Apache Iceberg dataset at scale, part 2: Using AWS Glue Studio Visual Editor

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-2-integrate-apache-hudi-delta-lake-apache-iceberg-dataset-at-scale-using-aws-glue-studio-visual-editor/

Transactional data lake technologies such as Apache Hudi, Delta Lake, Apache Iceberg, and AWS Lake Formation governed tables are evolving rapidly and gaining great popularity. These technologies simplify the data processing pipeline significantly, and they provide further useful capabilities like upserts, rollbacks, and time travel queries.

In the first post of this series, we went through how to process Apache Hudi, Delta Lake, and Apache Iceberg datasets using AWS Glue connectors. AWS Glue simplifies reading and writing your data in those data lake formats and building data lakes on top of those technologies. By running the sample notebooks in an AWS Glue Studio notebook, you can interactively develop and run your code and immediately see the results. The notebooks let you explore how those technologies work when you have coding experience.

This second post focuses on other use cases for customers who prefer visual job authoring without writing custom code. Even without coding experience, you can easily build your transactional data lakes on the AWS Glue Studio visual editor and take advantage of those transactional data lake technologies. In addition, you can use Amazon Athena to query the data stored using Hudi and Iceberg. This tutorial demonstrates how to read and write each format on the AWS Glue Studio visual editor, and then how to query it from Athena.

Process Apache Hudi, Delta Lake, Apache Iceberg dataset at scale

Prerequisites

The following are the instructions to read and write tables using each data lake format on the AWS Glue Studio visual editor. You can use any of the marketplace connectors or custom connectors based on your requirements.

To continue this tutorial, you must create the following AWS resources in advance:

Reads/writes using the connector on AWS Glue Studio Visual Editor

In this tutorial, you read and write data in each of the transactional data lake formats on the AWS Glue Studio visual editor. There are three main configurations that you must set per data lake format: connection, connection options, and job parameters. Note that no code is included in this tutorial. Let’s see how it works.

Apache Hudi writes

Complete the following steps to write into an Apache Hudi table using the connector:

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose Amazon S3.
  5. For Target, choose hudi-0101-byoc-connector.
  6. Choose Create.
  7. Under Visual, choose Data source – S3 bucket.
  8. Under Node properties, for S3 source type, choose S3 location.
  9. For S3 URL, enter s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/.
  10. Choose Data target – Connector.
  11. Under Node properties, for Connection, choose hudi-0101-byoc-connection.
  12. For Connection options, enter the following pairs of Key and Value (choose Add new option to enter a new pair).
    1. Key: path. Value: <Your S3 path for Hudi table location>
    2. Key: hoodie.table.name, Value: test
    3. Key: hoodie.datasource.write.storage.type, Value: COPY_ON_WRITE
    4. Key: hoodie.datasource.write.operation, Value: upsert
    5. Key: hoodie.datasource.write.recordkey.field, Value: location
    6. Key: hoodie.datasource.write.precombine.field, Value: date
    7. Key: hoodie.datasource.write.partitionpath.field, Value: iso_code
    8. Key: hoodie.datasource.write.hive_style_partitioning, Value: true
    9. Key: hoodie.datasource.hive_sync.enable, Value: true
    10. Key: hoodie.datasource.hive_sync.database, Value: hudi
    11. Key: hoodie.datasource.hive_sync.table, Value: test
    12. Key: hoodie.datasource.hive_sync.partition_fields, Value: iso_code
    13. Key: hoodie.datasource.hive_sync.partition_extractor_class, Value: org.apache.hudi.hive.MultiPartKeysValueExtractor
    14. Key: hoodie.datasource.hive_sync.use_jdbc, Value: false
    15. Key: hoodie.datasource.hive_sync.mode, Value: hms
  13. Under Job details, for IAM Role, choose your IAM role.
  14. Under Advanced properties, for Job parameters, choose Add new parameter.
  15. For Key, enter --conf.
  16. For Value, enter spark.serializer=org.apache.spark.serializer.KryoSerializer.
  17. Choose Save.
  18. Choose Run.

Apache Hudi reads

Complete the following steps to read from the Apache Hudi table that you created in the previous section using the connector:

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose hudi-0101-byoc-connector.
  5. For Target, choose Amazon S3.
  6. Choose Create.
  7. Under Visual, choose Data source – Connection.
  8. Under Node properties, for Connection, choose hudi-0101-byoc-connection.
  9. For Connection options, choose Add new option.
  10. For Key, enter path. For Value, enter your S3 path for your Hudi table that you created in the previous section.
  11. Choose Transform – ApplyMapping, and choose Remove.
  12. Choose Data target – S3 bucket.
  13. Under Data target properties, for Format, choose JSON.
  14. For S3 Target type, choose S3 location.
  15. For S3 Target Location, enter your S3 path for the output location.
  16. Under Job details, for IAM Role, choose your IAM role.
  17. Choose Save.
  18. Choose Run.

Delta Lake writes

Complete the following steps to write into the Delta Lake table using the connector:

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose Amazon S3.
  5. For Target, choose delta-100-byoc-connector.
  6. Choose Create.
  7. Under Visual, choose Data source – S3 bucket.
  8. Under Node properties, for S3 source type, choose S3 location.
  9. For S3 URL, enter s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/.
  10. Choose Data target – Connector.
  11. Under Node properties, for Connection, choose your delta-100-byoc-connection.
  12. For Connection options, choose Add new option.
  13. For Key, enter path. For Value, enter your S3 path for Delta table location. Choose Add new option.
  14. For Key, enter partitionKeys. For Value, enter iso_code.
  15. Under Job details, for IAM Role, choose your IAM role.
  16. Under Advanced properties, for Job parameters, choose Add new parameter.
  17. For Key, enter --conf.
  18. For Value, enter spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog.
  19. Choose Save.
  20. Choose Run.

Delta Lake reads

Complete the following steps to read from the Delta Lake table that you created in the previous section using the connector:

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose delta-100-byoc-connector.
  5. For Target, choose Amazon S3.
  6. Choose Create.
  7. Under Visual, choose Data source – Connection.
  8. Under Node properties, for Connection, choose delta-100-byoc-connection.
  9. For Connection options, choose Add new option.
  10. For Key, enter path. For Value, enter your S3 path for Delta table that you created in the previous section. Choose Add new option.
  11. For Key, enter partitionKeys. For Value, enter iso_code.
  12. Choose Transform – ApplyMapping, and choose Remove.
  13. Choose Data target – S3 bucket.
  14. Under Data target properties, for Format, choose JSON.
  15. For S3 Target type, choose S3 location.
  16. For S3 Target Location, enter your S3 path for the output location.
  17. Under Job details, for IAM Role, choose your IAM role.
  18. Under Advanced properties, for Job parameters, choose Add new parameter.
  19. For Key, enter --conf.
  20. For Value, enter spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog.
  21. Choose Save.
  22. Choose Run.

Apache Iceberg writes

Complete the following steps to write into an Apache Iceberg table using the connector:

  1. Open AWS Glue console.
  2. Choose Databases.
  3. Choose Add database.
  4. For database name, enter iceberg, and choose Create.
  5. Open AWS Glue Studio.
  6. Choose Jobs.
  7. Choose Visual with a source and target.
  8. For Source, choose Amazon S3.
  9. For Target, choose iceberg-0131-byoc-connector.
  10. Choose Create.
  11. Under Visual, choose Data source – S3 bucket.
  12. Under Node properties, for S3 source type, choose S3 location.
  13. For S3 URL, enter s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/.
  14. Choose Data target – Connector.
  15. Under Node properties, for Connection, choose iceberg-0131-byoc-connection.
  16. For Connection options, choose Add new option.
  17. For Key, enter path. For Value, enter glue_catalog.iceberg.test.
  18. Choose SQL under Transform to create a new AWS Glue Studio node.
  19. Under Node properties, for Node parents, choose ApplyMapping.
  20. Under Transform, for SQL alias, verify that myDataSource is entered.
  21. For SQL query, enter CREATE TABLE glue_catalog.iceberg.test AS SELECT * FROM myDataSource WHERE 1=2. This is to create a table definition with no records because the Iceberg target requires table definition before data ingestion.
  22. Under Job details, for IAM Role, choose your IAM role.
  23. Under Advanced properties, for Job parameters, choose Add new parameter.
  24. For Key, enter --conf.
  25. For Value, enter the following value (replace the placeholder your_s3_bucket with your S3 bucket name): spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://your_s3_bucket/iceberg/warehouse --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=iceberg_lock --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  26. Choose Save.
  27. Choose Run.

Apache Iceberg reads

Complete the following steps to read from the Apache Iceberg table that you created in the previous section using the connector:

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose Apache Iceberg Connector for AWS Glue 3.0.
  5. For Target, choose Amazon S3.
  6. Choose Create.
  7. Under Visual, choose Data source – Connection.
  8. Under Node properties, for Connection, choose your Iceberg connection name.
  9. For Connection options, choose Add new option.
  10. For Key, enter path. For Value, enter glue_catalog.iceberg.test.
  11. Choose Transform – ApplyMapping, and choose Remove.
  12. Choose Data target – S3 bucket.
  13. Under Data target properties, for Format, choose JSON.
  14. For S3 Target type, choose S3 location.
  15. For S3 Target Location, enter your S3 path for the output location.
  16. Under Job details, for IAM Role, choose your IAM role.
  17. Under Advanced properties, for Job parameters, choose Add new parameter.
  18. For Key, enter --conf.
  19. For Value, enter the following value (replace the placeholder your_s3_bucket with your S3 bucket name): spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://your_s3_bucket/iceberg/warehouse --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=iceberg_lock --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  20. Choose Save.
  21. Choose Run.

Query from Athena

The Hudi and Iceberg tables created with the preceding instructions are also queryable from Athena.

  1. Open the Athena console.
  2. Run the following SQL to query the Hudi table:
    SELECT * FROM "hudi"."test" LIMIT 10

  3. Run the following SQL to query the Iceberg table:
    SELECT * FROM "iceberg"."test" LIMIT 10

If you want to query the Delta table from Athena, follow Presto, Trino, and Athena to Delta Lake integration using manifests.

Conclusion

This post summarized how to utilize Apache Hudi, Delta Lake, and Apache Iceberg on the AWS Glue platform, and demonstrated how each format works with the AWS Glue Studio visual editor. You can start using those data lake formats easily with AWS Glue DynamicFrames, Spark DataFrames, and Spark SQL in AWS Glue jobs, AWS Glue Studio notebooks, and the AWS Glue Studio visual editor.


About the Author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys collaborating with different teams to deliver results like this post. In his spare time, he enjoys playing video games with his family.

Process Apache Hudi, Delta Lake, Apache Iceberg datasets at scale, part 1: AWS Glue Studio Notebook

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-1-integrate-apache-hudi-delta-lake-apache-iceberg-datasets-at-scale-aws-glue-studio-notebook/

Cloud data lakes provide a scalable and low-cost data repository that enables customers to easily store data from a variety of data sources. Data scientists, business analysts, and line-of-business users leverage data lakes to explore, refine, and analyze petabytes of data. AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. Customers use AWS Glue to discover and extract data from a variety of data sources, and to enrich and cleanse the data before storing it in data lakes and data warehouses.

Over the years, many table formats have emerged to support ACID transactions, governance, and catalog use cases. For example, formats such as Apache Hudi, Delta Lake, Apache Iceberg, and AWS Lake Formation governed tables enable customers to run ACID transactions on Amazon Simple Storage Service (Amazon S3). AWS Glue supports these table formats for batch and streaming workloads. This post focuses on Apache Hudi, Delta Lake, and Apache Iceberg, and summarizes how to use them in AWS Glue 3.0 jobs. If you’re interested in AWS Lake Formation governed tables, visit the Effective data lakes using AWS Lake Formation series.

Bring libraries for the data lake formats

Today, there are three available options for bringing libraries for the data lake formats onto the AWS Glue job platform: marketplace connectors, custom connectors (BYOC), and extra library dependencies.

Marketplace connectors

AWS Glue Connector Marketplace is the centralized repository for cataloging the available Glue connectors provided by multiple vendors. As of today, you can subscribe to more than 60 connectors offered in AWS Glue Connector Marketplace. The marketplace connectors are hosted in an Amazon Elastic Container Registry (Amazon ECR) repository and downloaded to the Glue job system at runtime. If you prefer a simple user experience, subscribing to the connectors and using them in your Glue ETL jobs, the marketplace connector is a good option.

Custom connectors as bring-your-own-connector (BYOC)

AWS Glue custom connectors enable you to upload and register your own libraries located in Amazon S3 as Glue connectors. You have more control over the library versions, patches, and dependencies. Because it uses your S3 bucket, you can configure the S3 bucket policy to share the libraries only with specific users, and you can configure private network access to download the libraries using VPC endpoints. When you prefer having more control over those configurations, the custom connector (BYOC) is a good option.

Extra library dependencies

There is another option: download the data lake format libraries, upload them to your S3 bucket, and add them as extra library dependencies. With this option, you can add libraries directly to the job without a connector and use them. In a Glue job, you can configure this in Dependent JARs path. In the API, it’s the --extra-jars parameter. In a Glue Studio notebook, you can use the %extra_jars magic. To download the relevant JAR files, see the library locations in the section Create a Custom connection (BYOC).
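For example, in an AWS Glue Studio notebook you could run %extra_jars s3://your_s3_bucket/jars/delta-core_2.12-1.0.0.jar in a cell before initializing the session. When creating a job through the API, the equivalent is the --extra-jars default argument, as in the following boto3 sketch; the job name, role, script location, and JAR path are placeholders.

import boto3

glue = boto3.client("glue")

# Minimal sketch: attach a data lake format JAR stored in Amazon S3 as an
# extra library dependency of a Glue job. All names and paths are placeholders.
glue.create_job(
    Name="delta-extra-jars-example",
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your_s3_bucket/scripts/delta_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    DefaultArguments={
        "--extra-jars": "s3://your_s3_bucket/jars/delta-core_2.12-1.0.0.jar",
    },
)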

Create a Marketplace connection

To create a new marketplace connection for Apache Hudi, Delta Lake, or Apache Iceberg, complete the following steps.

Apache Hudi 0.10.1

Complete the following steps to create a marketplace connection for Apache Hudi 0.10.1:

  1. Open AWS Glue Studio.
  2. Choose Connectors.
  3. Choose Go to AWS Marketplace.
  4. Search for Apache Hudi Connector for AWS Glue, and choose Apache Hudi Connector for AWS Glue.
  5. Choose Continue to Subscribe.
  6. Review the Terms and conditions, pricing, and other details, and choose the Accept Terms button to continue.
  7. Make sure that the subscription is complete and you see the Effective date populated next to the product, and then choose Continue to Configuration.
  8. For Delivery Method, choose Glue 3.0.
  9. For Software version, choose 0.10.1.
  10. Choose Continue to Launch.
  11. Under Usage instructions, choose Activate the Glue connector in AWS Glue Studio. You’re redirected to AWS Glue Studio.
  12. For Name, enter a name for your connection.
  13. Optionally, choose a VPC, subnet, and security group.
  14. Choose Create connection.

Delta Lake 1.0.0

Complete the following steps to create a marketplace connection for Delta Lake 1.0.0:

  1. Open AWS Glue Studio.
  2. Choose Connectors.
  3. Choose Go to AWS Marketplace.
  4. Search for Delta Lake Connector for AWS Glue, and choose Delta Lake Connector for AWS Glue.
  5. Choose Continue to Subscribe.
  6. Review the Terms and conditions, pricing, and other details, and choose the Accept Terms button to continue.
  7. Make sure that the subscription is complete and you see the Effective date populated next to the product, and then choose Continue to Configuration.
  8. For Delivery Method, choose Glue 3.0.
  9. For Software version, choose 1.0.0-2.
  10. Choose Continue to Launch.
  11. Under Usage instructions, choose Activate the Glue connector in AWS Glue Studio. You’re redirected to AWS Glue Studio.
  12. For Name, enter a name for your connection.
  13. Optionally, choose a VPC, subnet, and security group.
  14. Choose Create connection.

Apache Iceberg 0.12.0

Complete the following steps to create a marketplace connection for Apache Iceberg 0.12.0:

  1. Open AWS Glue Studio.
  2. Choose Connectors.
  3. Choose Go to AWS Marketplace.
  4. Search for Apache Iceberg Connector for AWS Glue, and choose Apache Iceberg Connector for AWS Glue.
  5. Choose Continue to Subscribe.
  6. Review the Terms and conditions, pricing, and other details, and choose the Accept Terms button to continue.
  7. Make sure that the subscription is complete and you see the Effective date populated next to the product, and then choose Continue to Configuration.
  8. For Delivery Method, choose Glue 3.0.
  9. For Software version, choose 0.12.0-2.
  10. Choose Continue to Launch.
  11. Under Usage instructions, choose Activate the Glue connector in AWS Glue Studio. You’re redirected to AWS Glue Studio.
  12. For Name, enter iceberg-0120-mp-connection.
  13. Optionally, choose a VPC, subnet, and security group.
  14. Choose Create connection.

Create a Custom connection (BYOC)

You can create your own custom connectors from JAR files. In this section, you can see the exact JAR files that are used in the marketplace connectors. You can just use the files for your custom connectors for Apache Hudi, Delta Lake, and Apache Iceberg.

To create a new custom connection for Apache Hudi, Delta Lake, or Apache Iceberg, complete the following steps.

Apache Hudi 0.9.0

Complete the following steps to create a custom connection for Apache Hudi 0.9.0:

  1. Download the following JAR files, and upload them to your S3 bucket.
    1. https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar
    2. https://repo1.maven.org/maven2/org/apache/hudi/hudi-utilities-bundle_2.12/0.9.0/hudi-utilities-bundle_2.12-0.9.0.jar
    3. https://repo1.maven.org/maven2/org/apache/parquet/parquet-avro/1.10.1/parquet-avro-1.10.1.jar
    4. https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.1/spark-avro_2.12-3.1.1.jar
    5. https://repo1.maven.org/maven2/org/apache/calcite/calcite-core/1.10.0/calcite-core-1.10.0.jar
    6. https://repo1.maven.org/maven2/org/datanucleus/datanucleus-core/4.1.17/datanucleus-core-4.1.17.jar
    7. https://repo1.maven.org/maven2/org/apache/thrift/libfb303/0.9.3/libfb303-0.9.3.jar
  2. Open AWS Glue Studio.
  3. Choose Connectors.
  4. Choose Create custom connector.
  5. For Connector S3 URL, enter comma separated Amazon S3 paths for the above JAR files.
  6. For Name, enter hudi-090-byoc-connector.
  7. For Connector Type, choose Spark.
  8. For Class name, enter org.apache.hudi.
  9. Choose Create connector.
  10. Choose hudi-090-byoc-connector.
  11. Choose Create connection.
  12. For Name, enter hudi-090-byoc-connection.
  13. Optionally, choose a VPC, subnet, and security group.
  14. Choose Create connection.

Apache Hudi 0.10.1

Complete the following steps to create a custom connection for Apache Hudi 0.10.1:

  1. Download the following JAR files, and upload them to your S3 bucket.
    1. hudi-utilities-bundle_2.12-0.10.1.jar
    2. hudi-spark3.1.1-bundle_2.12-0.10.1.jar
    3. spark-avro_2.12-3.1.1.jar
  2. Open AWS Glue Studio.
  3. Choose Connectors.
  4. Choose Create custom connector.
  5. For Connector S3 URL, enter comma separated Amazon S3 paths for the above JAR files.
  6. For Name, enter hudi-0101-byoc-connector.
  7. For Connector Type, choose Spark.
  8. For Class name, enter org.apache.hudi.
  9. Choose Create connector.
  10. Choose hudi-0101-byoc-connector.
  11. Choose Create connection.
  12. For Name, enter hudi-0101-byoc-connection.
  13. Optionally, choose a VPC, subnet, and security group.
  14. Choose Create connection.

Note that the above Hudi 0.10.1 installation on Glue 3.0 does not fully support Merge On Read (MoR) tables.

Delta Lake 1.0.0

Complete the following steps to create a custom connector for Delta Lake 1.0.0:

  1. Download the following JAR file, and upload it to your S3 bucket.
    1. https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.0/delta-core_2.12-1.0.0.jar
  2. Open AWS Glue Studio.
  3. Choose Connectors.
  4. Choose Create custom connector.
  5. For Connector S3 URL, enter a comma separated Amazon S3 path for the above JAR file.
  6. For Name, enter delta-100-byoc-connector.
  7. For Connector Type, choose Spark.
  8. For Class name, enter org.apache.spark.sql.delta.sources.DeltaDataSource.
  9. Choose Create connector.
  10. Choose delta-100-byoc-connector.
  11. Choose Create connection.
  12. For Name, enter delta-100-byoc-connection.
  13. Optionally, choose a VPC, subnet, and security group.
  14. Choose Create connection.

Apache Iceberg 0.12.0

Complete the following steps to create a custom connection for Apache Iceberg 0.12.0:

  1. Download the following JAR files, and upload them to your S3 bucket.
    1. https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark3-runtime/0.12.0/iceberg-spark3-runtime-0.12.0.jar
    2. https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.15.40/bundle-2.15.40.jar
    3. https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.15.40/url-connection-client-2.15.40.jar
  2. Open AWS Glue Studio.
  3. Choose Connectors.
  4. Choose Create custom connector.
  5. For Connector S3 URL, enter comma separated Amazon S3 paths for the above JAR files.
  6. For Name, enter iceberg-0120-byoc-connector.
  7. For Connector Type, choose Spark.
  8. For Class name, enter iceberg.
  9. Choose Create connector.
  10. Choose iceberg-0120-byoc-connector.
  11. Choose Create connection.
  12. For Name, enter iceberg-0120-byoc-connection.
  13. Optionally, choose a VPC, subnet, and security group.
  14. Choose Create connection.

Apache Iceberg 0.13.1

Complete the following steps to create a custom connection for Apache Iceberg 0.13.1:

  1. Download the following JAR files, and upload them to your S3 bucket.
    1. iceberg-spark-runtime-3.1_2.12-0.13.1.jar
    2. https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.17.161/bundle-2.17.161.jar
    3. https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.17.161/url-connection-client-2.17.161.jar
  2. Open AWS Glue Studio.
  3. Choose Connectors.
  4. Choose Create custom connector.
  5. For Connector S3 URL, enter comma separated Amazon S3 paths for the above JAR files.
  6. For Name, enter iceberg-0131-byoc-connector.
  7. For Connector Type, choose Spark.
  8. For Class name, enter iceberg.
  9. Choose Create connector.
  10. Choose iceberg-0131-byoc-connector.
  11. Choose Create connection.
  12. For Name, enter iceberg-0131-byoc-connection.
  13. Optionally, choose a VPC, subnet, and security group.
  14. Choose Create connection.

Prerequisites

To continue this tutorial, you must create the following AWS resources in advance:

  • AWS Identity and Access Management (IAM) role for your ETL job or notebook as instructed in Set up IAM permissions for AWS Glue Studio. Note that AmazonEC2ContainerRegistryReadOnly or equivalent permissions are needed when you use the marketplace connectors.
  • Amazon S3 bucket for storing data.
  • Glue connection (one of the marketplace connector or the custom connector corresponding to the data lake format).

Reads/writes using the connector on AWS Glue Studio Notebook

The following are the instructions to read and write tables using each data lake format in an AWS Glue Studio notebook. As a prerequisite, make sure that you have created a connector and a connection for it using the preceding information.
The example notebooks are hosted in the AWS Glue Samples GitHub repository. Seven notebooks are available. In the following instructions, we use one notebook per data lake format.

Apache Hudi

To read/write Apache Hudi tables in the AWS Glue Studio notebook, complete the following:

  1. Download hudi_dataframe.ipynb.
  2. Open AWS Glue Studio.
  3. Choose Jobs.
  4. Choose Jupyter notebook and then choose Upload and edit an existing notebook. From Choose file, select your ipynb file and choose Open, then choose Create.
  5. On the Notebook setup page, for Job name, enter your job name.
  6. For IAM role, select your IAM role. Choose Create job. After a short time period, the Jupyter notebook editor appears.
  7. In the first cell, replace the placeholder with your Hudi connection name, and run the cell:
    %connections hudi-0101-byoc-connection (Alternatively you can use your connection name created from the marketplace connector).
  8. In the second cell, replace the S3 bucket name placeholder with your S3 bucket name, and run the cell.
  9. Run the cells in the section Initialize SparkSession.
  10. Run the cells in the section Clean up existing resources.
  11. Run the cells in the section Create Hudi table with sample data using catalog sync to create a new Hudi table with sample data.
  12. Run the cells in the section Read from Hudi table to verify the new Hudi table. There are five records in this table.
  13. Run the cells in the section Upsert records into Hudi table to see how upsert works on Hudi. This code inserts one new record, and updates the one existing record. You can verify that there is a new record product_id=00006, and the existing record product_id=00001’s price has been updated from 250 to 400.
  14. Run the cells in the section Delete a Record. You can verify that the existing record product_id=00001 has been deleted.
  15. Run the cells in the section Point in time query. You can verify that you’re seeing the previous version of the table where the upsert and delete operations haven’t been applied yet.
  16. Run the cells in the section Incremental Query. You can verify that you’re seeing only the recent commit about product_id=00006.

With this notebook, you can complete basic Spark DataFrame operations on Hudi tables.
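For reference, the upsert in step 13 is roughly equivalent to the following Spark DataFrame write. This is a minimal sketch, not the notebook’s exact code: the table location, column names, and sample values are assumptions, and it assumes a SparkSession created from GlueContext as in the notebook.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

spark = GlueContext(SparkContext.getOrCreate()).spark_session

# Minimal sketch of a Hudi upsert: rows whose record key already exists are
# updated, and new keys are inserted.
hudi_options = {
    "hoodie.table.name": "test",
    "hoodie.datasource.write.recordkey.field": "product_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

updates = spark.createDataFrame(
    [("00001", "Heater", 400, "JPY", "E_APPLIANCES", 1679470000.0),
     ("00006", "Kettle", 300, "JPY", "E_APPLIANCES", 1679470000.0)],
    ["product_id", "product_name", "price", "currency", "category", "updated_at"],
)

updates.write.format("hudi").options(**hudi_options).mode("append") \
    .save("s3://your_s3_bucket/data/hudi/test/")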

Delta Lake

To read/write Delta Lake tables in the AWS Glue Studio notebook, complete the following:

  1. Download delta_sql.ipynb.
  2. Open AWS Glue Studio.
  3. Choose Jobs.
  4. Choose Jupyter notebook, and then choose Upload and edit an existing notebook. From Choose file, select your ipynb file and choose Open, then choose Create.
  5. On the Notebook setup page, for Job name, enter your job name.
  6. For IAM role, select your IAM role. Choose Create job. After a short time period, the Jupyter notebook editor appears.
  7. In the first cell, replace the placeholder with your Delta connection name, and run the cell:
    %connections delta-100-byoc-connection
  8. In the second cell, replace the S3 bucket name placeholder with your S3 bucket name, and run the cell.
  9. Run the cells in the section Initialize SparkSession.
  10. Run the cells in the section Clean up existing resources.
  11. Run the cells in the section Create Delta table with sample data to create a new Delta table with sample data.
  12. Run the cells in the section Create a Delta Lake table.
  13. Run the cells in the section Read from Delta Lake table to verify the new Delta table. There are five records in this table.
  14. Run the cells in the section Insert records. The query inserts two new records: record_id=00006, and record_id=00007.
  15. Run the cells in the section Update records. The query updates the price of the existing record record_id=00007 from 500 to 300.
  16. Run the cells in the section Upsert records to see how upsert works on Delta. This code inserts one new record and updates one existing record. You can verify that there is a new record product_id=00008, and that the existing record product_id=00001’s price has been updated from 250 to 400.
  17. Run the cells in the section Alter DeltaLake table. The queries add one new column, and update the values in the column.
  18. Run the cells in the section Delete records. You can verify that the record product_id=00006 has been deleted because its product_name is Pen.
  19. Run the cells in the section View History to describe the history of operations that was triggered against the target Delta table.

With this notebook, you can complete basic Spark SQL operations on Delta tables.

Apache Iceberg

To read/write Apache Iceberg tables in the AWS Glue Studio notebook, complete the following:

  1. Download iceberg_sql.ipynb.
  2. Open AWS Glue Studio.
  3. Choose Jobs.
  4. Choose Jupyter notebook and then choose Upload and edit an existing notebook. From Choose file, select your ipynb file and choose Open, then choose Create.
  5. On the Notebook setup page, for Job name, enter your job name.
  6. For IAM role, select your IAM role. Choose Create job. After a short time period, the Jupyter notebook editor appears.
  7. In the first cell, replace the placeholder with your Iceberg connection name, and run the cell:
    %connections iceberg-0131-byoc-connection (Alternatively, you can use the connection name that you created from the marketplace connector.)
  8. In the second cell, replace the S3 bucket name placeholder with your S3 bucket name, and run the cell.
  9. Run the cells in the section Initialize SparkSession.
  10. Run the cells in the section Clean up existing resources.
  11. Run the cells in the section Create Iceberg table with sample data to create a new Iceberg table with sample data.
  12. Run the cells in the section Read from Iceberg table.
  13. Run the cells in the section Upsert records into Iceberg table.
  14. Run the cells in the section Delete records.
  15. Run the cells in the section View History and Snapshots.

In this notebook, you completed the basic Spark SQL operations on Iceberg tables.
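If you want to explore snapshots outside the notebook, the following is a minimal sketch of inspecting Iceberg snapshot metadata and reading the table as of an earlier snapshot. It assumes a SparkSession named spark configured with an Iceberg catalog named glue_catalog; the database, table, and snapshot ID are illustrative placeholders.

# A minimal Iceberg snapshot inspection and time travel sketch; names and IDs are placeholders
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM glue_catalog.db.product.snapshots
""").show()

# Read the table as of an earlier snapshot using the snapshot-id read option
df_at_snapshot = (spark.read.format("iceberg")
    .option("snapshot-id", 1234567890123456789)   # replace with a snapshot_id from the output above
    .load("glue_catalog.db.product"))
df_at_snapshot.show()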

Conclusion

This post summarized how to use Apache Hudi, Delta Lake, and Apache Iceberg on the AWS Glue platform, and demonstrated how each format works with an AWS Glue Studio notebook. You can start using these data lake formats easily in Spark DataFrames and Spark SQL in AWS Glue jobs or AWS Glue Studio notebooks.

This post focused on interactive coding and querying on notebooks. The upcoming part 2 will focus on the experience using AWS Glue Studio Visual Editor and Glue DynamicFrames for customers who prefer visual authoring without the need to write code.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys learning different use cases from customers and sharing knowledge about big data technologies with the wider community.

Dylan Qu is a Specialist Solutions Architect focused on Big Data & Analytics with AWS. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Monjumi Sarma is a Data Lab Solutions Architect at AWS. She helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives.

Accelerate Amazon DynamoDB data access in AWS Glue jobs using the new AWS Glue DynamoDB Export connector

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/accelerate-amazon-dynamodb-data-access-in-aws-glue-jobs-using-the-new-aws-glue-dynamodb-elt-connector/

Modern data architectures encourage the integration of data lakes, data warehouses, and purpose-built data stores, enabling unified governance and easy data movement. With a modern data architecture on AWS, you can store data in a data lake and use a ring of purpose-built data services around the lake, allowing you to make decisions with speed and agility.

To achieve a modern data architecture, AWS Glue is the key service that integrates data over a data lake, data warehouse, and purpose-built data stores. AWS Glue simplifies data movement like inside-out, outside-in, or around the perimeter. A powerful purpose-built data store is Amazon DynamoDB, which is widely used by hundreds of thousands of companies, including Amazon.com. It’s common to move data from DynamoDB to a data lake built on top of Amazon Simple Storage Service (Amazon S3). Many customers move data from DynamoDB to Amazon S3 using AWS Glue extract, transform, and load (ETL) jobs.

Today, we’re pleased to announce the general availability of a new AWS Glue DynamoDB export connector. It’s built on top of the DynamoDB table export feature. It’s a scalable and cost-efficient way to read large DynamoDB table data in AWS Glue ETL jobs. This post describes the benefit of this new export connector and its use cases.

The following are typical use cases to read from DynamoDB tables using AWS Glue ETL jobs:

  • Move the data from DynamoDB tables to different data stores
  • Integrate the data with other services and applications
  • Retain historical snapshots for auditing
  • Build an S3 data lake from the DynamoDB data and analyze the data from various services, such as Amazon Athena, Amazon Redshift, and Amazon SageMaker

The new AWS Glue DynamoDB export connector

The old version of the AWS Glue DynamoDB connector reads DynamoDB tables through the DynamoDB Scan API. Instead, the new AWS Glue DynamoDB export connector reads DynamoDB data from the snapshot, which is exported from DynamoDB tables. This approach has the following benefits:

  • It doesn’t consume read capacity units of the source DynamoDB tables
  • The read performance is consistent for large DynamoDB tables

For DynamoDB tables larger than 100 GB in particular, the new connector is significantly faster than the traditional connector.

To use this new export connector, you need to enable point-in-time recovery (PITR) for the source DynamoDB table in advance.

How to use the new connector on AWS Glue Studio Visual Editor

AWS Glue Studio Visual Editor is a graphical interface that makes it easy to create, run, and monitor AWS Glue ETL jobs. The new DynamoDB export connector is available on the AWS Glue Studio Visual Editor. You can choose Amazon DynamoDB as the source.

After you choose Create, you see the visual Directed Acyclic Graph (DAG). Here, you can choose your DynamoDB table that exists in this account or Region. This allows you to select DynamoDB tables (with PITR enabled) directly as a source in AWS Glue Studio. This provides a one-click export from any of your DynamoDB tables to Amazon S3. You can also easily add any data sources and targets or transformations to the DAG. For example, it allows you to join two different DynamoDB tables and export the result to Amazon S3, as shown in the following screenshot.

The following two connection options are automatically added. They specify the location used to store temporary data during the DynamoDB export phase. You can set S3 bucket lifecycle policies to expire the temporary data.

  • dynamodb.s3.bucket – The S3 bucket to store temporary data during DynamoDB export
  • dynamodb.s3.prefix – The S3 prefix to store temporary data during DynamoDB export

How to use the new connector on the job script code

You can use the new export connector when you create an AWS Glue DynamicFrame in the job script code by configuring the following connection options:

  • dynamodb.export – (Required) You need to set this to ddb or s3
  • dynamodb.tableArn – (Required) Your source DynamoDB table ARN
  • dynamodb.unnestDDBJson – (Optional) If set to true, performs an unnest transformation of the DynamoDB JSON structure that is present in exports. The default value is false.
  • dynamodb.s3.bucket – (Optional) The S3 bucket to store temporary data during DynamoDB export
  • dynamodb.s3.prefix – (Optional) The S3 prefix to store temporary data during DynamoDB export

The following is the sample Python code to create a DynamicFrame using the new export connector:

# Read the DynamoDB table through the new export connector into a DynamicFrame
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",             # trigger a new export from the live table
        "dynamodb.tableArn": "test_source",   # replace with your source DynamoDB table ARN
        "dynamodb.unnestDDBJson": True,       # unnest the DynamoDB JSON structure
        "dynamodb.s3.bucket": "bucket name",  # S3 bucket for temporary export data
        "dynamodb.s3.prefix": "bucket prefix" # S3 prefix for temporary export data
    }
)

The new export connector doesn’t require configurations related to AWS Glue job parallelism, unlike the old connector. Now you no longer need to change the configuration when you scale out the AWS Glue job. It also doesn’t require any configuration regarding DynamoDB table read/write capacity and its capacity mode (on demand or provisioned).

DynamoDB table schema handling

By default, the new export connector reads data in DynamoDB JSON structure that is present in exports. The following is an example schema of the frame using the Amazon Customer Review Dataset:

root
|-- Item: struct (nullable = true)
| |-- product_id: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- review_id: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- total_votes: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- product_title: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- star_rating: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- customer_id: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- marketplace: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- helpful_votes: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- review_headline: struct (nullable = true)
| | |-- S: string (nullable = true)
| | |-- NULL: boolean (nullable = true)
| |-- review_date: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- vine: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- review_body: struct (nullable = true)
| | |-- S: string (nullable = true)
| | |-- NULL: boolean (nullable = true)
| |-- verified_purchase: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- product_category: struct (nullable = true)
| | |-- S: string (nullable = true)
| |-- year: struct (nullable = true)
| | |-- N: string (nullable = true)
| |-- product_parent: struct (nullable = true)
| | |-- S: string (nullable = true)

To read DynamoDB item columns without handling nested data, you can set dynamodb.unnestDDBJson to True. The following is an example of the schema of the same data where dynamodb.unnestDDBJson is set to True:

root
|-- product_id: string (nullable = true)
|-- review_id: string (nullable = true)
|-- total_votes: string (nullable = true)
|-- product_title: string (nullable = true)
|-- star_rating: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- marketplace: string (nullable = true)
|-- helpful_votes: string (nullable = true)
|-- review_headline: string (nullable = true)
|-- review_date: string (nullable = true)
|-- vine: string (nullable = true)
|-- review_body: string (nullable = true)
|-- verified_purchase: string (nullable = true)
|-- product_category: string (nullable = true)
|-- year: string (nullable = true)
|-- product_parent: string (nullable = true)

Data freshness

Data freshness is the measure of staleness of the data from the live tables in the original source. In the new export connector, the dynamodb.export option impacts data freshness.

When dynamodb.export is set to ddb, the AWS Glue job invokes a new export and then reads the export placed in an S3 bucket into DynamicFrame. It reads exports of the live table, so data can be fresh. On the other hand, when dynamodb.export is set to s3, the AWS Glue job skips invoking a new export and directly reads an export already placed in an S3 bucket. It reads exports of the past table, so data can be stale, but you can reduce overhead to trigger the exports.

The following comparison explains the data freshness and the pros and cons of each option.

  • New export connector with dynamodb.export set to s3
    • Data freshness: Stale (reads an export of the past table)
    • Pros: RCU is not consumed; you can skip triggering exports
    • Cons: Data can be stale
  • New export connector with dynamodb.export set to ddb
    • Data freshness: Fresh (reads an export of the live table)
    • Pros: Data can be fresh; RCU is not consumed
    • Cons: Overhead to trigger exports and wait for completion
  • Old connector (no dynamodb.export config)
    • Data freshness: Most fresh (scans the live tables)
    • Pros: Data can be fresh
    • Cons: Read capacity units (RCUs) are consumed

Performance

The following benchmark shows the performance improvements between the old version of the AWS Glue DynamoDB connector and the new export connector. The comparison uses the DynamoDB tables storing the TPC-DS benchmark dataset with different scales from 10 MB to 2 TB. The sample Spark job reads from the DynamoDB table and calculates the count of the items. All the Spark jobs are run on AWS Glue 3.0, G.2X, 60 workers.

The following chart compares AWS Glue job duration between the old connector and the new export connector. For small DynamoDB tables, the old connector is faster. For tables larger than 80 GB, the new export connector is faster. In other words, the DynamoDB export connector is recommended for jobs that take more than 5–10 minutes to run with the old connector. Also, the chart shows that the duration of the new export connector increases slowly as data size increases, whereas the duration of the old connector increases rapidly. This means that the new export connector is especially suitable for larger tables.

The following chart compares dollar cost between the old connector and the new export connector. It contains the AWS Glue DPU hour cost summed with the cost for reading data from DynamoDB. For the old connector, we include the read request cost. For the new export connector, we include the cost in the DynamoDB data export to Amazon S3. Both are calculated in DynamoDB on-demand capacity mode.

With AWS Glue Auto Scaling

AWS Glue Auto Scaling is a new feature to automatically resize computing resources for better performance at lower cost. You can take advantage of AWS Glue Auto Scaling with the new DynamoDB export connector.

As the following chart shows, with AWS Glue Auto Scaling, the duration of the new export connector is shorter than the old connector when the size of the source DynamoDB table is 100 GB or more. It shows a similar trend without AWS Glue Auto Scaling.

You also get cost benefits because only the Spark driver is active for most of the duration of the DynamoDB export (which is nearly 30% of the total job duration with the old scan-based connector).

Conclusion

AWS Glue is a key service to integrate with multiple data stores. At AWS, we keep improving the performance and cost-efficiency of our services. In this post, we announced the availability of the new AWS Glue DynamoDB export connector. With this new connector, you can easily integrate your large data on DynamoDB tables with different data stores. It helps you read the large tables faster from AWS Glue jobs at lower cost.

The new AWS Glue DynamoDB export connector is now generally available in all supported AWS Glue Regions. Start using the new AWS Glue DynamoDB export connector today! We look forward to your feedback and stories on how you use the connector for your needs.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts that help customers build data lakes on the cloud.

Neil Gupta is a Software Development Engineer on the AWS Glue team. He enjoys tackling big data problems and learning more about distributed systems.

Andrew Kim is a Software Development Engineer on the AWS Glue team. His passion is to build scalable and effective solutions to challenging problems and working with distributed systems.

Savio Dsouza is a Software Development Manager on the AWS Glue team. His team works on distributed systems for efficiently managing data lakes on AWS and optimizing Apache Spark for performance and reliability.

Synchronize your AWS Glue Studio Visual Jobs to different environments 

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/synchronize-your-aws-glue-studio-visual-jobs-to-different-environments/

AWS Glue has become a popular option for integrating data from disparate data sources due to its ability to integrate large volumes of data using distributed data processing frameworks. Many customers use AWS Glue to build data lakes and data warehouses. Data engineers who prefer to develop data processing pipelines visually can use AWS Glue Studio to create data integration jobs. This post introduces the Glue Visual Job API for authoring Glue Studio Visual Jobs programmatically, and the Glue Job Sync utility, which uses the API to easily synchronize Glue jobs to different environments without losing the visual representation.

Glue Job Visual API

AWS Glue Studio has a graphical interface called Visual Editor that makes it easy to author extract, transform, and load (ETL) jobs in AWS Glue. The Glue jobs created in the Visual Editor contain their visual representation, which describes the data transformation. In this post, we call these jobs Glue Studio Visual Jobs.

For example, it’s common to develop and test AWS Glue jobs in a dev account, and then promote the jobs to a prod account. Previously, when you copied the AWS Glue Studio Visual jobs to a different environment, there was no mechanism to copy the visual representation together. This means that the visual representation of the job was lost and you could only copy the code produced with Glue Studio. It can be time consuming and tedious to either copy the code or recreate the job.

AWS Glue Job Visual API lets you programmatically create and update Glue Studio Visual Jobs by providing a JSON object that indicates visual representation, and also retrieve the visual representation from existing Glue Studio Visual Jobs. A Glue Studio Visual Job consists of data source nodes for reading the data, transform nodes for modifying the data, and data target nodes for writing the data.

There are some typical use cases for Glue Visual Job API:

  • Automate creation of Glue Visual Jobs.
  • Migrate your ETL jobs from third-party or on-premises ETL tools to AWS Glue. Many AWS partners, such as Bitwise, Bladebridge, and others, have built converters from third-party ETL tools to AWS Glue.
  • Synchronize AWS Glue Studio Visual jobs from one environment to another without losing visual representation.

In this post, we focus on a utility that uses Glue Job Visual APIs to achieve the mass synchronization of your Glue Studio Visual Jobs without losing the visual representation.

Glue Job Sync Utility

There are common requirements to synchronize Glue Visual Jobs between different environments:

  • Promote Glue Visual Jobs from a dev account to a prod account.
  • Transfer ownership of Glue Visual Jobs between different AWS accounts.
  • Replicate Glue Visual Job configurations from one Region to another for disaster recovery purposes.

The Glue Job Sync Utility is built on top of the Glue Visual Job API. It is a Python application that enables you to synchronize your AWS Glue Studio Visual jobs to different environments without losing the visual representation. This utility requires that you provide source and target AWS environment profiles. Optionally, you can provide a list of jobs that you want to synchronize, and specify how the utility should replace your environment-specific objects using a mapping file. For example, Amazon Simple Storage Service (Amazon S3) locations and IAM roles in your development environment can be different from those in your production environment. The mapping config file is used to replace these environment-specific objects.

How to use Glue Job Sync Utility

In this example, we’re synchronizing two AWS Glue Studio Visual jobs, test1 and test2, from the development environment to the production environment in a different account.

  • Source environment (dev environment)
    • AWS Account ID: 123456789012
    • AWS Region: eu-west-3 (Paris)
    • AWS Glue Studio Visual jobs: test1, test2
    • AWS Identity and Access Management (IAM) Role ARN for Glue job execution role: arn:aws:iam::123456789012:role/GlueServiceRole
    • Amazon S3 bucket for Glue job script and other asset location: s3://aws-glue-assets-123456789012-eu-west-3/
    • Amazon S3 bucket for data location: s3://dev-environment/
  • Destination environment (prod environment)
    • AWS Account ID: 234567890123
    • AWS Region: eu-west-3 (Paris)
    • IAM Role ARN for Glue job execution role: arn:aws:iam::234567890123:role/GlueServiceRole
    • Amazon S3 bucket for Glue job script and other asset location: s3://aws-glue-assets-234567890123-eu-west-3/
    • Amazon S3 bucket for data location: s3://prod-environment/

Set up the utility in your local environment

You will need the following prerequisites for this utility:

  • Python 3.6 or later.
  • Latest version of boto3.
  • Create two AWS named profiles, dev and prod, with the corresponding credentials in your environment. Follow this instruction.

Download the Glue Job Sync Utility

Download the sync utility from the GitHub repository to your local machine.

Create AWS Glue Studio Visual Jobs

  1. Create two AWS Glue Studio Visual jobs, test1, and test2, in the source account.
    • If you don’t have any AWS Glue Studio Visual jobs, then follow this instruction to create the Glue Studio Visual jobs.

  2. Open AWS Glue Studio in the destination account and verify that the test1 and test2 jobs aren’t present.

Run the Job Sync Utility

  1. Create a new file named mapping.json, and enter the following JSON code. With the configuration in line 1, the sync utility will replace all of the Amazon S3 references within the job (in this case s3://aws-glue-assets-123456789012-eu-west-3) with the mapped location (in this case s3://aws-glue-assets-234567890123-eu-west-3). Then, the utility will create the job in the destination environment. Similarly, line 2 and line 3 will trigger the appropriate substitutions in the job. Note that these are example values and you’ll need to substitute the right values that match your environment.

    {
        "s3://aws-glue-assets-123456789012-eu-west-3": "s3://aws-glue-assets-234567890123-eu-west-3",
        "arn:aws:iam::123456789012:role/GlueServiceRole": "arn:aws:iam::234567890123:role/GlueServiceRole",
        "s3://dev-environment": "s3://prod-environment"
    }

  2. Execute the utility by running the following command:
    $ python3 sync.py --src-profile dev --src-region eu-west-3 --dst-profile prod --dst-region eu-west-3 --src-job-names test1,test2 --config-path mapping.json

  3. Verify successful synchronization by opening AWS Glue Studio in the destination account:
  4. Open the Glue Studio Visual jobs, test1, and test2, and verify the visual representation of the DAG.

The preceding screenshot shows that you were able to copy the jobs test1 and test2 to the destination account while keeping their visual DAGs.

Conclusion

AWS Glue Job Visual API and the AWS Glue Sync Utility simplify how you synchronize your jobs to different environments. These are designed to easily integrate into your Continuous Integration pipelines while retaining the visual representation that improves the readability of the ETL pipeline.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for designing AWS features, implementing software artifacts, and helping customer architectures. In his spare time, he enjoys watching anime in Prime Video.

Aaron Meltzer is a Software Engineer on the AWS Glue Studio team. He leads the design and implementation of features to simplify the management of AWS Glue jobs. Outside of work, Aaron likes to read and learn new recipes.

Mohamed Kiswani is the Software Development Manager on the AWS Glue team.

Shiv Narayanan is a Senior Technical Product Manager on the AWS Glue team.

Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-auto-scaling-automatically-resize-serverless-computing-resources-for-lower-cost-with-optimized-apache-spark/

Data created in the cloud is growing rapidly, so scalability is a key factor in distributed data processing. Many customers benefit from the scalability of the AWS Glue serverless Spark runtime. Today, we’re pleased to announce the release of AWS Glue Auto Scaling, which helps you scale your AWS Glue Spark jobs automatically based on the requirements calculated dynamically during the job run, and accelerate job runs at lower cost without detailed capacity planning.

Before AWS Glue Auto Scaling, you had to predict workload patterns in advance. For example, in cases when you don’t have expertise in Apache Spark, when it’s the first time you’re processing the target data, or when the volume or variety of the data is significantly changing, it’s not so easy to predict the workload and plan the capacity for your AWS Glue jobs. Under-provisioning is error-prone and can lead to either missed SLA or unpredictable performance. On the other hand, over-provisioning can cause underutilization of resources and cost overruns. Therefore, it was a common best practice to experiment with your data, monitor the metrics, and adjust the number of AWS Glue workers before you deployed your Spark applications to production.

With AWS Glue Auto Scaling, you no longer need to plan AWS Glue Spark cluster capacity in advance. You can just set the maximum number of workers and run your jobs. AWS Glue monitors the Spark application execution, and allocates more worker nodes to the cluster in near-real time after Spark requests more executors based on your workload requirements. When there are idle executors that don’t have intermediate shuffle data, AWS Glue Auto Scaling removes the executors to save the cost.

AWS Glue Auto Scaling is available with the optimized Spark runtime on AWS Glue version 3.0, and you can start using it today. This post describes possible use cases and how it works.

Use cases and benefits for AWS Glue Auto Scaling

Traditionally, AWS Glue launches a serverless Spark cluster of a fixed size. The computing resources are held for the whole job run until it is completed. With the new AWS Glue Auto Scaling feature, after you enable it for your AWS Glue Spark jobs, AWS Glue dynamically allocates compute resources up to the given maximum number of workers. It also supports dynamic scale-out and scale-in of the AWS Glue Spark cluster size over the course of the job. As more executors are requested by Spark, more AWS Glue workers are added to the cluster. When an executor has been idle for a period of time, with no active computation tasks and no associated shuffle dependencies, the executor and the corresponding worker are removed.

AWS Glue Auto Scaling makes it easy to run your data processing in the following typical use cases:

  • Batch jobs to process unpredictable amounts of data
  • Jobs containing driver-heavy workloads (for example, processing many small files)
  • Jobs containing multiple stages with uneven compute demands or due to data skews (for example, reading from a data store, repartitioning it to have more parallelism, and then processing further analytic workloads)
  • Jobs that write large amounts of data into data warehouses such as Amazon Redshift, or that read from and write to databases

Configure AWS Glue Auto Scaling

AWS Glue Auto Scaling is available with the optimized Spark runtime on Glue version 3.0. To enable Auto Scaling on the AWS Glue Studio console, complete the following steps:

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose your job.
  4. Choose the Job details tab.
  5. For Glue version, choose Glue 3.0 – Supports spark 3.1, Scala 2, Python.
  6. Select Automatically scale the number of workers.
  7. For Maximum number of workers, enter the maximum workers that can be vended to the job run.
  8. Choose Save.

To enable Auto Scaling in the AWS Glue API or AWS Command Line Interface (AWS CLI), set the following job parameters (see the sketch after this list):

  • Key: --enable-auto-scaling
  • Value: true
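The following is a minimal sketch of setting this parameter with the AWS SDK for Python (Boto3). The job name is a hypothetical placeholder, and because UpdateJob replaces the job definition, a real script would typically carry over more of the existing fields than shown here.

import boto3

glue = boto3.client("glue")

# Fetch the current job definition (the job name is a hypothetical placeholder)
job = glue.get_job(JobName="my-glue3-job")["Job"]

# Merge the Auto Scaling flag into the existing default arguments
args = dict(job.get("DefaultArguments", {}))
args["--enable-auto-scaling"] = "true"

# update_job replaces the job definition, so pass back the fields you want to keep
glue.update_job(
    JobName="my-glue3-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "GlueVersion": job.get("GlueVersion", "3.0"),
        "WorkerType": job.get("WorkerType", "G.1X"),
        "NumberOfWorkers": job.get("NumberOfWorkers", 10),  # acts as the maximum number of workers
        "DefaultArguments": args,
    },
)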

Monitor AWS Glue Auto Scaling

In this section, we discuss three ways to monitor AWS Glue Auto Scaling: Amazon CloudWatch metrics, the AWS Glue Studio Monitoring page, and the Spark UI.

CloudWatch metrics

After you enable AWS Glue Auto Scaling, Spark dynamic allocation is enabled and the executor metrics are visible in CloudWatch. You can review the following metrics to understand the demand and optimized usage of executors in your Spark applications enabled with Auto Scaling (see the sketch after this list):

  • glue.driver.ExecutorAllocationManager.executors.numberAllExecutors
  • glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors
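The following is a minimal sketch of reading one of these metrics with the AWS SDK for Python (Boto3). The job name, job run ID, time window, and dimension values are hypothetical placeholders; adjust them to match how your job emits metrics.

from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

# Job name and run ID are hypothetical placeholders; JobRunId can also be "ALL"
response = cloudwatch.get_metric_statistics(
    Namespace="Glue",
    MetricName="glue.driver.ExecutorAllocationManager.executors.numberAllExecutors",
    Dimensions=[
        {"Name": "JobName", "Value": "my-glue3-job"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "gauge"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Maximum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])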

AWS Glue Studio Monitoring page

In the Monitoring page in AWS Glue Studio, you can monitor the DPU hours you spent for a specific job run. The following screenshot shows two job runs that processed the same dataset: one without Auto Scaling, which spent 8.71 DPU hours, and another with Auto Scaling enabled, which spent only 1.48 DPU hours. The DPU hour values per job run are also available with GetJobRun API responses.
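The following is a minimal sketch of retrieving the consumed capacity for a single job run through the GetJobRun API with the AWS SDK for Python (Boto3). The job name and run ID are hypothetical placeholders, and the DPUSeconds field is assumed to be populated for Auto Scaling job runs.

import boto3

glue = boto3.client("glue")

# Job name and run ID are hypothetical placeholders
job_run = glue.get_job_run(JobName="my-glue3-job", RunId="jr_0123456789abcdef")["JobRun"]

# DPUSeconds reports the consumed capacity; convert it to DPU hours
dpu_hours = job_run.get("DPUSeconds", 0.0) / 3600.0
print(f"DPU hours: {dpu_hours:.2f}")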

Spark UI

With the Spark UI, you can monitor that the AWS Glue Spark cluster dynamically scales out and scales in with AWS Glue Auto Scaling. The event timeline shows when each executor is added and removed gradually over the Spark application run.

In the following sections, we demonstrate AWS Glue Auto Scaling with two use cases: jobs with driver-heavy workloads, and jobs with multiple stages.

Example 1: Jobs containing driver-heavy workloads

A typical workload for AWS Glue Spark jobs is to process many small files to prepare the data for further analysis. For such workloads, AWS Glue has built-in optimizations, including file grouping, a Glue S3 Lister, partition pushdown predicates, partition indexes, and more. For more information, see Optimize memory management in AWS Glue. All those optimizations run on the Spark driver and speed up the planning phase, in which the driver computes and distributes the work for parallel processing with Spark executors. However, without AWS Glue Auto Scaling, Spark executors are idle during the planning phase. With Auto Scaling, AWS Glue jobs only allocate executors when the driver work is complete, thereby saving executor cost.

Here’s the example DAG shown in AWS Glue Studio. This AWS Glue job reads from an Amazon Simple Storage Service (Amazon S3) bucket, performs the ApplyMapping transformation, runs a simple SELECT query repartitioning data to have 800 partitions, and writes back to another location in Amazon S3.

Without AWS Glue Auto Scaling

The following screenshot shows the executor timeline in Spark UI when the AWS Glue job ran with 20 workers without Auto Scaling. You can confirm that all 20 workers started at the beginning of the job run.

With AWS Glue Auto Scaling

In contrast, the following screenshot shows the executor timeline of the same job with Auto Scaling enabled and the maximum workers set to 20. The driver and one executor started at the beginning, and other executors started only after the driver finished its computation for listing 367,920 partitions on the S3 bucket. These 19 workers were not charged during the long-running driver task.

Both jobs completed in 44 minutes. With AWS Glue Auto Scaling, the job completed in the same amount of time with lower cost.

Example 2: Jobs containing multiple stages

Another typical workload in AWS Glue is to read from the data store or large compressed files, repartition it to have more parallelism for downstream processing, and process further analytic queries. For example, when you want to read from a JDBC data store, you may not want to have many concurrent connections, so you can avoid impacting source database performance. For such workloads, you can have a small number of connections to read data from the JDBC data store, then repartition the data with higher parallelism for further analysis.

Here’s the example DAG shown in AWS Glue Studio. This AWS Glue job reads from the JDBC data source, runs a simple SELECT query adding one more column (mod_id) calculated from the column ID, performs the ApplyMapping node, then writes to an S3 bucket with partitioning by this new column mod_id. Note that the JDBC data source was already registered in the AWS Glue Data Catalog, and the table has two parameters, hashfield=id and hashpartitions=5, to read from JDBC through five concurrent connections.

Without AWS Glue Auto Scaling

The following screenshot shows the executor timeline in the Spark UI when the AWS Glue job ran with 20 workers without Auto Scaling. You can confirm that all 20 workers started at the beginning of the job run.

With AWS Glue Auto Scaling

The following screenshot shows the same executor timeline in the Spark UI with Auto Scaling enabled with 20 maximum workers. The driver and two executors started at the beginning, and other executors started later. The first two executors read data from the JDBC source with a small number of concurrent connections. Later, the job increased parallelism and more executors were started. You can also observe that there were 16 executors, not 20, which further reduced cost.

Conclusion

This post discussed AWS Glue Auto Scaling, which automatically resizes the computing resources of your AWS Glue Spark jobs and reduces cost. You can start using AWS Glue Auto Scaling for both your existing workloads and future new workloads, and take advantage of it today! For more information about AWS Glue Auto Scaling, see Using Auto Scaling for AWS Glue. Migrate your jobs to Glue version 3.0 and get the benefits of Auto Scaling.

Special thanks to everyone who contributed to the launch: Raghavendhar Thiruvoipadi Vidyasagar, Ping-Yao Chang, Shashank Bhardwaj, Sampath Shreekantha, Vaibhav Porwal, and Akash Gupta.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is passionate about architecting fast-growing data platforms, diving deep into distributed big data software like Apache Spark, building reusable software artifacts for data lakes, and sharing the knowledge in AWS Big Data blog posts. In his spare time, he enjoys taking care of killifish, hermit crabs, and grubs with his children.

Bo Li is a Software Development Engineer on the AWS Glue team. He is devoted to designing and building end-to-end solutions to address customers’ data analytic and processing needs with cloud-based, data-intensive technologies.

Rajendra Gujja is a Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and everything and anything about data.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team works on distributed systems for efficiently managing data lakes on AWS and optimizes Apache Spark for performance and reliability.

Best practices to optimize data access performance from Amazon EMR and AWS Glue to Amazon S3

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/best-practices-to-optimize-data-access-performance-from-amazon-emr-and-aws-glue-to-amazon-s3/

Customers are increasingly building data lakes to store data at massive scale in the cloud. It’s common to use distributed computing engines, cloud-native databases, and data warehouses when you want to process and analyze your data in data lakes. Amazon EMR and AWS Glue are two key services you can use for such use cases. Amazon EMR is a managed big data framework that supports several different applications, including Apache Spark, Apache Hive, Presto, Trino, and Apache HBase. AWS Glue Spark jobs run on top of Apache Spark, and distribute data processing workloads in parallel to perform extract, transform, and load (ETL) jobs to enrich, denormalize, mask, and tokenize data on a massive scale.

For data lake storage, customers typically use Amazon Simple Storage Service (Amazon S3) because it’s secure, scalable, durable, and highly available. Amazon S3 is designed for 11 9’s of durability and stores over 200 trillion objects for millions of applications around the world, making it the ideal storage destination for your data lake. Amazon S3 averages over 100 million operations per second, so your applications can easily achieve high request rates when using Amazon S3 as your data lake.

This post describes best practices to achieve the performance scaling you need when analyzing data in Amazon S3 using Amazon EMR and AWS Glue. We specifically focus on optimizing for Apache Spark on Amazon EMR and AWS Glue Spark jobs.

Optimizing Amazon S3 performance for large Amazon EMR and AWS Glue jobs

Amazon S3 is a very large distributed system, and you can scale to thousands of transactions per second in request performance when your applications read and write data to Amazon S3. Amazon S3 performance isn’t defined per bucket, but per prefix in a bucket. Your applications can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket. Additionally, there are no limits to the number of prefixes in a bucket, so you can horizontally scale your read or write performance using parallelization. For example, if you create 10 prefixes in an S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. You can similarly scale writes by writing data across multiple prefixes.

You can scale performance by utilizing automatic scaling in Amazon S3 and scan millions of objects for queries run over petabytes of data. Amazon S3 automatically scales in response to sustained new request rates, dynamically optimizing performance. While Amazon S3 is internally optimizing for a new request rate, you receive HTTP 503 request responses temporarily until the optimization completes:

AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown)

Such situations require the application to retry momentarily, but after Amazon S3 internally optimizes performance for the new request rate, all requests are generally served without retries. One such situation is when multiple workers in distributed compute engines such as Amazon EMR and AWS Glue momentarily generate a high number of requests to access data under the same prefix.

When using Amazon EMR and AWS Glue to process data in Amazon S3, you can employ certain best practices to manage request traffic and avoid HTTP Slow Down errors. Let’s look at some of these strategies.

Best practices to manage HTTP Slow Down responses

You can use the following approaches to take advantage of the horizontal scaling capability in Amazon S3 and improve the success rate of your requests when accessing Amazon S3 data using Amazon EMR and AWS Glue:

  • Modify the retry strategy for Amazon S3 requests
  • Adjust the number of Amazon S3 objects processed
  • Adjust the number of concurrent Amazon S3 requests

We recommend choosing and applying the options that fit best for your use case to optimize data processing on Amazon S3. In the following sections, we describe best practices of each approach.

Modify the retry strategy for Amazon S3 requests

This is the easiest way to avoid HTTP 503 Slow Down responses and improve the success rate of your requests. To access Amazon S3 data, both Amazon EMR and AWS Glue use the EMR File System (EMRFS), which retries Amazon S3 requests with jitter when it receives 503 Slow Down responses. To improve the success rate of your Amazon S3 requests, you can adjust your retry strategy by configuring certain properties. In Amazon EMR, you can configure parameters in your emrfs-site configuration. In AWS Glue, you can configure the parameters in job parameters. You can adjust your retry strategy in the following ways:

  • Increase the EMRFS default retry limit – By default, EMRFS uses an exponential backoff strategy to retry requests to Amazon S3. The default EMRFS retry limit is 15. However, you can increase this limit when you create a new cluster, on a running cluster, or at application runtime. To increase the retry limit, change the value of the fs.s3.maxRetries parameter. Note that you may experience longer job duration if you set a higher value for this parameter. We recommend experimenting with different values, such as 20 as a starting point, confirming the duration overhead of the jobs for each value, and then adjusting this parameter based on your requirements.
  • For Amazon EMR, use the AIMD retry strategy – With Amazon EMR versions 6.4.0 and later, EMRFS supports an alternative retry strategy based on an additive-increase/multiplicative-decrease (AIMD) model. This strategy can be useful in shaping the request rate from large clusters. Instead of treating each request in isolation, this mode keeps track of the rate of recent successful and throttled requests. Requests are limited to a rate determined from the rate of recent successful requests. This decreases the number of throttled requests, and therefore the number of attempts needed per request. To enable the AIMD retry strategy, set the fs.s3.aimd.enabled property to true. You can further refine the AIMD retry strategy using the advanced AIMD retry settings. A configuration sketch follows this list.
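The following is a minimal configuration sketch for these settings, expressed as the Configurations list you could pass when creating an EMR cluster with the AWS SDK for Python (Boto3). The property values are illustrative starting points, not recommendations for every workload.

import boto3

emr = boto3.client("emr")

# Illustrative emrfs-site settings; adjust the values to your workload
emrfs_configuration = [
    {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.maxRetries": "20",      # raise the default EMRFS retry limit of 15
            "fs.s3.aimd.enabled": "true",  # EMR 6.4.0+ only: AIMD-based retry strategy
        },
    }
]

# Pass the configuration when creating a cluster (other run_job_flow arguments omitted)
# emr.run_job_flow(Name="my-cluster", Configurations=emrfs_configuration, ...)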

Adjust the number of Amazon S3 objects processed

Another approach is to adjust the number of Amazon S3 objects processed so you have fewer requests made concurrently. When you lower the number of objects to be processed in a job, you use fewer Amazon S3 requests, thereby lowering the request rate or transactions per second (TPS) required for each job. Note the following considerations:

  • Preprocess the data by aggregating multiple smaller files into fewer, larger chunks – For example, use s3-dist-cp or an AWS Glue compaction blueprint to merge a large number of small files (generally less than 64 MB) into a smaller number of optimally sized files (such as 128–512 MB). This approach reduces the number of requests required, while simultaneously improving the aggregate throughput to read and process data in Amazon S3. You may need to experiment to arrive at the optimal size for your workload, because creating extremely large files can reduce the parallelism of the job.
  • Use partition pruning to scan data under specific partitions – In Apache Hive and Hive Metastore-compatible applications such as Apache Spark or Presto, one table can have multiple partition folders. Partition pruning is a technique to scan only the required data in a specific partition folder of a table. It’s useful when you want to read a specific portion from the entire table. To take advantage of predicate pushdown, you can use partition columns in the WHERE clause in Spark SQL or the filter expression in a DataFrame. In AWS Glue, you can also use a partition pushdown predicate when creating DynamicFrames.
  • For AWS Glue, enable job bookmarks – You can use AWS Glue job bookmarks to process continuously ingested data repeatedly. It only picks unprocessed data from the previous job run, thereby reducing the number of objects read or retrieved from Amazon S3.
  • For AWS Glue, enable bounded executions – AWS Glue bounded execution is a technique to only pick unprocessed data, with an upper bound on the dataset size or the number of files to be processed. This is another way to reduce the number of requests made to Amazon S3.

Adjust the number of concurrent Amazon S3 requests

To adjust the number of Amazon S3 requests to have fewer concurrent reads per prefix, you can configure Spark parameters. By default, Spark populates 10,000 tasks to list prefixes when creating Spark DataFrames. You may experience Slow Down responses, especially when you read from a table with highly nested prefix structures. In this case, it’s a good idea to limit the maximum listing parallelism by decreasing the Spark parameter spark.sql.sources.parallelPartitionDiscovery.parallelism (the default is 10,000).
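The following is a minimal sketch of lowering the listing parallelism in a PySpark session; the value and the S3 path are illustrative placeholders that you should tune for your own prefix layout.

from pyspark.sql import SparkSession

# Lower the maximum parallelism Spark uses to list partitions under S3 prefixes
spark = (SparkSession.builder
    .appName("s3-read-tuning")
    .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "1000")
    .getOrCreate())

# The bucket and prefix are placeholders for your own highly partitioned table
df = spark.read.parquet("s3://<your-bucket>/<your-table-prefix>/")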

To have fewer concurrent write requests per prefix, you can use the following techniques:

  • Reduce the number of Spark RDD partitions before writes – You can do this by using df.repartition(n) or df.coalesce(n) in DataFrames. For Spark SQL, you can also use query hints like REPARTITION or COALESCE. You can see the number of tasks (=RDD partitions) on the Spark UI.
  • For AWS Glue, group the input data – If the datasets are made up of small files, we recommend grouping the input data because it reduces the number of RDD partitions, and reduces the number of Amazon S3 requests to write the files.
  • Use the EMRFS S3-optimized committer – The EMRFS S3-optimized committer is used by default in Amazon EMR 5.19.0 and later, and AWS Glue 3.0. In AWS Glue 2.0, you can configure it in the job parameter --enable-s3-parquet-optimized-committer. The committer uses Amazon S3 multipart uploads instead of renaming files, and it usually reduces the number of HEAD/LIST requests significantly.

The following are other techniques to adjust the Amazon S3 request rate in Amazon EMR and AWS Glue. These options have the net effect of reducing parallelism of the Spark job, thereby reducing the probability of Amazon S3 Slow Down responses, although it can lead to longer job duration. We recommend testing and adjusting these values for your use case.

  • Reduce the number of concurrent jobs – Start with the most read/write heavy jobs. If you configured cross-account access for Amazon S3, keep in mind that other accounts might also be submitting jobs to the prefix.
  • Reduce the number of concurrent Spark tasks – You have several options:
    • For Amazon EMR, set the number of Spark executors (for example, the spark-submit option --num-executors and the Spark parameter spark.executor.instances).
    • For AWS Glue, set the number of workers in the NumberOfWorkers parameter.
    • For AWS Glue, change the WorkerType parameter to a smaller one (for example, G.2X to G.1X).
    • Configure Spark parameters:
      • Decrease the value of spark.default.parallelism.
      • Decrease the value of spark.sql.shuffle.partitions.
      • Increase the value of spark.task.cpus (the default is 1) to allocate more CPU cores per Spark task.

Conclusion

In this post, we described the best practices to optimize data access from Amazon EMR and AWS Glue to Amazon S3. With these best practices, you can easily run Amazon EMR and AWS Glue jobs by taking advantage of Amazon S3 horizontal scaling, and process data in a highly distributed way at a massive scale.

For further guidance, please reach out to AWS Premium Support.

Appendix A: Configure CloudWatch request metrics

To monitor Amazon S3 requests, you can enable request metrics in Amazon CloudWatch for the bucket. Then, define a filter for the prefix. For a list of useful metrics to monitor, see Monitoring metrics with Amazon CloudWatch. After you enable metrics, use the data in the metrics to determine which of the aforementioned options is best for your use case.
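The following is a minimal sketch of enabling a prefix-filtered request metrics configuration with the AWS SDK for Python (Boto3); the bucket name, configuration ID, and prefix are hypothetical placeholders.

import boto3

s3 = boto3.client("s3")

# Enable S3 request metrics scoped to a specific prefix (names are placeholders)
s3.put_bucket_metrics_configuration(
    Bucket="my-data-lake-bucket",
    Id="prefix-request-metrics",
    MetricsConfiguration={
        "Id": "prefix-request-metrics",
        "Filter": {"Prefix": "warehouse/sales_table/"},
    },
)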

Appendix B: Configure Spark parameters

To configure Spark parameters in Amazon EMR, there are several options:

  • spark-submit command – You can pass Spark parameters via the --conf option.
  • Job script – You can set Spark parameters in the SparkConf object in the job script codes.
  • Amazon EMR configurations – You can configure Spark parameters via API using Amazon EMR configurations. For more information, see Configure Spark.

To configure Spark parameters in AWS Glue, set AWS Glue job parameters using the key --conf with a value like spark.hadoop.fs.s3.maxRetries=50.

To set multiple configurations, use the key --conf with a value like spark.hadoop.fs.s3.maxRetries=50 --conf spark.task.cpus=2.
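The following is a minimal sketch of the job script approach for AWS Glue, setting the same parameters on the SparkConf object before the GlueContext is created; the values are illustrative and you should tune them for your workload.

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Illustrative values; tune them for your workload
conf = (SparkConf()
    .set("spark.hadoop.fs.s3.maxRetries", "50")
    .set("spark.task.cpus", "2"))

sc = SparkContext(conf=conf)
glue_context = GlueContext(sc)
spark = glue_context.spark_session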


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is passionate about releasing AWS Glue connector custom blueprints and other software artifacts to help customers build their data lakes. In his spare time, he enjoys watching hermit crabs with his children.

Aditya Kalyanakrishnan is a Senior Product Manager on the Amazon S3 team at AWS. He enjoys learning from customers about how they use Amazon S3 and helping them scale performance. Adi’s based in Seattle, and in his spare time enjoys hiking and occasionally brewing beer.

Improve Amazon Athena query performance using AWS Glue Data Catalog partition indexes

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/improve-amazon-athena-query-performance-using-aws-glue-data-catalog-partition-indexes/

The AWS Glue Data Catalog provides partition indexes to accelerate queries on highly partitioned tables. In the post Improve query performance using AWS Glue partition indexes, we demonstrated how partition indexes reduce the time it takes to fetch partition information during the planning phase of queries run on Amazon EMR, Amazon Redshift Spectrum, and AWS Glue extract, transform, and load (ETL) jobs.

We’re pleased to announce Amazon Athena support for AWS Glue Data Catalog partition indexes. You can use the same indexes configured for Amazon EMR, Redshift Spectrum, and AWS Glue ETL jobs with Athena to reduce query planning times for highly partitioned tables, which is common in most data lakes on Amazon Simple Storage Service (Amazon S3).

In this post, we describe how to set up partition indexes and perform a few sample queries to demonstrate the performance improvement on Athena queries.

Set up resources with AWS CloudFormation

To help you get started quickly, we provide an AWS CloudFormation template, the same template we used in a previous post. You can review and customize it to suit your needs. Some of the resources this stack deploys incur costs when in use.

The CloudFormation template generates the resources used in this post, including an AWS Glue database and two Data Catalog tables (table_with_index and table_without_index).

If you’re using AWS Lake Formation permissions, you need to make sure that the IAM user or role running AWS CloudFormation has the required permissions to create a database on the AWS Glue Data Catalog.

The tables created by the CloudFormation template use sample data located in an S3 public bucket. The data is partitioned by the columns year, month, day, and hour. There are 367,920 partition folders in total, and each folder has a single file in JSON format that contains an event similar to the following:

{
  "id": "95c4c9a7-4718-4031-9e79-b56b72220fbc",
  "value": 464.22130592811703
}

To create your resources, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For DatabaseName, leave as the default.
  5. Choose Next.
  6. On the next page, choose Next.
  7. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create.

Stack creation can take up to 5 minutes. When the stack is complete, you have two Data Catalog tables: table_with_index and table_without_index. Both tables point to the same S3 bucket, as mentioned previously, which holds data for 42 years (1980–2021) in 367,920 partitions. Each partition folder includes a data.json file containing the event data. In the following sections, we demonstrate how the partition indexes improve query performance with these tables using an example that represents large datasets in a data lake.

Set up partition indexes

You can create up to three partition indexes per table for new and existing tables. If you want to create a new table with partition indexes, you can include a list of PartitionIndex objects with the CreateTable API call. To add a partition index to an existing table, use the CreatePartitionIndex API call. You can also perform these actions from the AWS Glue console.
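The following is a minimal sketch of the API approach with the AWS SDK for Python (Boto3), adding the same index that we create from the console in the next steps; the database, table, and index names mirror the example used in this post.

import boto3

glue = boto3.client("glue")

# Add a partition index on the year/month/day/hour partition keys
glue.create_partition_index(
    DatabaseName="partition_index",
    TableName="table_with_index",
    PartitionIndex={
        "Keys": ["year", "month", "day", "hour"],
        "IndexName": "year-month-day-hour",
    },
)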

Let’s configure a new partition index for the table table_with_index we created with the CloudFormation template.

  1. On the AWS Glue console, choose Tables.
  2. Choose the table table_with_index.
  3. Choose Partitions and indices.
  4. Choose Add new index.
  5. For Index name, enter year-month-day-hour.
  6. For Selected keys from schema, select year, month, day, and hour. Make sure that you choose each column in this order, and confirm that Partition key for each column is correctly configured as follows:
    1. year: Partition (0)
    2. month: Partition (1)
    3. day: Partition (2)
    4. hour: Partition (3)
  7. Choose Add index.

The Status column of the newly created partition index shows as Creating. We need to wait for the partition index to be Active before it can be used by query engines. It should take about 1 hour to process and build the index for 367,920 partitions.

When the partition index is ready for table_with_index, you can use it when querying with Athena. For table_without_index, you should expect to see no change in query latency because no partition indexes were configured.

Enable partition filtering

To enable partition filtering in Athena, you need to update the table properties as follows:

  1. On the AWS Glue console, choose Tables.
  2. Choose the table table_with_index.
  3. Choose Edit table.
  4. Under Table properties, add the following:
    1. Key: partition_filtering.enabled
    2. Value: true
  5. Choose Apply.

Alternatively, you can set this parameter by running an ALTER TABLE SET PROPERTIES query in Athena:

ALTER TABLE partition_index.table_with_index
SET TBLPROPERTIES ('partition_filtering.enabled' = 'true')

Query tables using Athena

Now that your table has filtering enabled for Athena, let’s query both tables to see the performance differences.

First, query the table without using the partition index. In the Athena query editor, enter the following query:

SELECT count(*), sum(value) 
FROM partition_index.table_without_index 
WHERE year='2021' AND month='04' AND day='01'

The following screenshot shows the query took 44.9 seconds.

Next, query the table using the partition index. You need to use the columns that are configured for the indexes in the WHERE clause to gain these performance benefits. Run the following query:

SELECT count(*), sum(value) 
FROM partition_index.table_with_index 
WHERE year='2021' AND month='04' AND day='01'

The following screenshot shows the query took just 1.3 seconds to complete, which is significantly faster than the table without indexes.

Query planning is the phase where the table and partition metadata are fetched from the AWS Glue Data Catalog. With partition indexes enabled, retrieving only the partitions required by the query can be done more efficiently and therefore quicker. Let’s retrieve the execution details of each query by using the AWS Command Line Interface (AWS CLI) to compare planning statistics.

The following is the query execution details for the query that ran against a table without partition indexes:

$ aws athena get-query-execution --query-execution-id 5e972df6-11f8-467a-9eea-77f509a23573 --query QueryExecution.Statistics --output table
--------------------------------------------
|             GetQueryExecution            |
+---------------------------------+--------+
|  DataScannedInBytes             |  1782  |
|  EngineExecutionTimeInMillis    |  44914 |
|  QueryPlanningTimeInMillis      |  44451 |
|  QueryQueueTimeInMillis         |  278   |
|  ServiceProcessingTimeInMillis  |  47    |
|  TotalExecutionTimeInMillis     |  45239 |
+---------------------------------+--------+

The following is the query execution details for a query that ran against a table with partition indexes:

% aws athena get-query-execution --query-execution-id 31d0b4ae-ae8d-4836-b20b-317fa9d9b79a --query QueryExecution.Statistics --output table
-------------------------------------------
|            GetQueryExecution            |
+---------------------------------+-------+
|  DataScannedInBytes             |  1782 |
|  EngineExecutionTimeInMillis    |  1361 |
|  QueryPlanningTimeInMillis      |  384  |
|  QueryQueueTimeInMillis         |  190  |
|  ServiceProcessingTimeInMillis  |  58   |
|  TotalExecutionTimeInMillis     |  1609 |
+---------------------------------+-------+

QueryPlanningTimeInMillis represents the number of milliseconds that Athena took to plan the query processing flow. This includes the time spent retrieving table partitions from the data source. Because the query engine performs the query planning, the query planning time is a subset of engine processing time.

Comparing the stats for both queries, we can see that QueryPlanningTimeInMillis is significantly lower in the query using partition indexes. It went from 44 seconds to 0.3 seconds when using partition indexes. The improvement in query planning resulted in a faster overall query runtime, going from 45 seconds to 1.3 seconds—a 35 times greater performance improvement.

Clean up

Now to the final step, cleaning up the resources:

  1. Delete the CloudFormation stack.
  2. Confirm both tables have been deleted from the AWS Glue Data Catalog.

Conclusion

At AWS, we strive to improve the performance of our services and our customers’ experience. The AWS Glue Data Catalog is a fully managed, Apache Hive compatible metastore that enables a wide range of big data, analytics, and machine learning services, like Athena, Amazon EMR, Redshift Spectrum, and AWS Glue ETL, to access data in the data lake. Athena customers can now further reduce query latency by enabling partition indexes for their tables in Amazon S3. Using partition indexes can improve the efficiency of retrieving metadata for highly partitioned tables with tens of thousands to millions of partitions.

You can learn more about AWS Glue Data Catalog partition indexes in Working with Partition Indexes, and more about Athena best practices in Best Practices When Using Athena with AWS Glue.


About the Author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is passionate about architecting fast-growing data platforms, diving deep into distributed big data software like Apache Spark, building reusable software artifacts for data lakes, and sharing the knowledge in AWS Big Data blog posts. In his spare time, he enjoys having and watching killifish, hermit crabs, and grubs with his children.