All posts by Chiho Sugimoto

Introducing AWS Glue 5.1 for Apache Spark

Post Syndicated from Chiho Sugimoto original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-5-1-for-apache-spark/

AWS Glue is a serverless, scalable data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources. AWS recently announced Glue 5.1, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.1 upgrades the Spark engines to Apache Spark 3.5.6, giving you newer Spark release along with the newer dependent libraries so you can develop, run, and scale your data integration workloads and get insights faster.

In this post, we describe what’s new in AWS Glue 5.1, key highlights on Spark and related libraries, and how to get started on AWS Glue 5.1.

What’s new in AWS Glue 5.1

The following updates are in AWS Glue 5.1:

Runtime and library upgrades

AWS Glue 5.1 upgrades the runtime to Spark 3.5.6, Python 3.11, and Scala 2.12.18 with new improvements from the open source version. AWS Glue 5.1 also updates support for open table format libraries to Apache Hudi 1.0.2, Apache Iceberg 1.10.0, and Delta Lake 3.3.2 so you can solve advanced use cases around performance, cost, governance, and privacy in your data lakes.

Support for new Apache Iceberg features

AWS Glue 5.1 adds support for Apache Iceberg Materialized View, and Apache Iceberg format version 3.0. AWS Glue 5.1 also adds support for data writes into Iceberg and Hive tables with Spark-native fine-grained access control with AWS Lake Formation.

Apache Iceberg Materialized View is especially useful in cases where you need to accelerate frequently run queries on large data sets by pre-computing expensive aggregations. If you would like to learn more about Apache Iceberg materialized views, refer to Introducing Apache Iceberg materialized views in AWS Glue Data Catalog.

Apache Iceberg format version 3.0 is the latest Iceberg format version defined in Iceberg Table Spec. Following features are supported:

Create an Iceberg V3 format table

To create an Iceberg V3 format table, specify the format-version to 3 when creating the table. The following is a sample PySpark script: (replace amzn-s3-demo-bucket with your S3 bucket name):

from pyspark.sql import SparkSession

s3bucket = "amzn-s3-demo-bucket" 
database = "glue51_blog_demo" 
table_name = "iceberg_v3_table_demo"

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.defaultCatalog", "glue_catalog")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.type", "glue")
    .config("spark.sql.catalog.glue_catalog.warehouse", f"s3://{s3bucket}/{database}/{table_name}/")
    .getOrCreate()
)

spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")

# Create Iceberg table with V3 format-version
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {database}.{table_name} (
        id int,
        name string,
        age int,
        created_at timestamp
    ) USING iceberg
    TBLPROPERTIES (
        'format-version'='3',
        'write.delete.mode'='merge-on-read'
    )
""")

To migrate from V2 format to V3, use ALTER TABLE ... SET TBLPROPERTIES to update the format-version. The following is a sample PySpark script:

spark.sql(f"ALTER TABLE {database}.{table_name} SET TBLPROPERTIES ('format-version'='3')")

You cannot rollback from V3 to V2, so you need to be careful to verify that all your Iceberg clients support Iceberg V3 format version. Once upgraded, older versions cannot correctly read newer format versions, as Iceberg table format versions are not forward-compatible.

Create a table with Row Lineage tracking enabled

To create a table with Row Lineage tracking enabled, set the table property row-lineage to true. The following is a sample PySpark script:

# Create Iceberg table with row-lineage-tracking
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {database}.{table_name} (
        id int,
        name string,
        age int,
        created_at timestamp
    ) USING iceberg
    TBLPROPERTIES (
        'format-version'='3',
        'row-lineage'='true',
        'write.delete.mode'='merge-on-read'
    )
""")

In tables with Row Lineage tracking enabled, row IDs are managed at the metadata level for tracking row modifications over time and auditing.

Extended support for AWS Lake Formation permissions

Fine-grained access control with Lake Formation has been supported through native Spark DataFrames and Spark SQL in Glue 5.0 for read operations. Glue 5.1 extends fine-grained access control for write operations.

Full-Table Access (FTA) control in Apache Spark were introduced for Apache Hive and Iceberg tables in Glue 5.0. Glue 5.1 extends FTA support for Apache Hudi tables and Delta Lake tables.

S3A by default

AWS Glue 5.1 uses S3A as the default S3 connector. This change aligns with the recent Amazon EMR adoption of S3A as the default connector and brings enhanced performance and advanced features to Glue workloads. For more details about the S3A connector’s capabilities and optimizations, see Optimize Amazon EMR runtime for Apache Spark with EMR S3A.

Note when migrating from Glue 5.0 to Glue 5.1, If both spark.hadoop.fs.s3a.endpoint and spark.hadoop.fs.s3a.endpoint.region are not set, the default region used by S3A is us-east-2. This may cause issues. To mitigate the issues caused by this change, set the spark.hadoop.fs.s3a.endpoint.region Spark configuration when using the S3A file system in AWS Glue 5.1.

Dependent library upgrades

AWS Glue 5.1 upgrades the runtime to Spark 3.5.6, Python 3.11, and Scala 2.12.18 with upgraded dependent libraries.

The following table lists dependency upgrades:

Dependency Version in AWS Glue 5.0 Version in AWS Glue 5.1
Spark 3.5.4 3.5.6
Hadoop 3.4.1 3.4.1
Scala 2.12.18 2.12.18
Hive 2.3.9 2.3.9
EMRFS 2.69.0 2.73.0
Arrow 12.0.1 12.0.1
Iceberg 1.7.1 1.10.0
Hudi 0.15.0 1.0.2
Delta Lake 3.3.0 3.3.2
Java 17 17
Python 3.11 3.11.14
boto3 1.34.131 1.40.61
AWS SDK for Java 2.29.52 2.35.5
AWS Glue Data Catalog Client 4.5.0 4.9.0
EMR DynamoDB Connector 5.6.0 5.7.0

The following are database connector (JDBC driver) upgrades:

Driver Connector version in AWS Glue 5.0 Connector version in AWS Glue 5.1
MySQL 8.0.33 8.0.33
Microsoft SQL Server 10.2.0 10.2.0
Oracle Databases 23.3.0.23.09 23.3.0.23.09
PostgreSQL 42.7.3 42.7.3
Amazon Redshift redshift-jdbc42-2.1.0.29 redshift-jdbc42-2.1.0.29

The following are Spark connector upgrades:

Driver Connector version in AWS Glue 5.0 Connector version in AWS Glue 5.1
Amazon Redshift 6.4.0 6.4.2
OpenSearch 1.2.0 1.2.0
MongoDB 10.3.0 10.3.0
Snowflake 3.0.0 3.1.1
BigQuery 0.32.2 0.32.2
AzureCosmos 4.33.0 4.33.0
AzureSQL 1.3.0 1.3.0
Vertica 3.3.5 3.3.5

Get started with AWS Glue 5.1

You can start using AWS Glue 5.1 through AWS Glue Studio, the AWS Glue console, the latest AWS SDK, and the AWS Command Line Interface (AWS CLI).

To start using AWS Glue 5.1 jobs in AWS Glue Studio, open the AWS Glue job and on the Job Details tab, choose the version Glue 5.1 – Supports Spark 3.5, Scala 2, Python 3.

To start using AWS Glue 5.1 on an AWS Glue Studio notebook or an interactive session through a Jupyter notebook, set 5.1 in the %glue_version magic:

%%glue_version 5.1

The following output shows that the session is set to use AWS Glue 5.1:

Setting Glue version to: 5.1

Spark Troubleshooting with Glue 5.1

To accelerate Apache Spark troubleshooting and job performance optimization for your Glue 5.1 ETL jobs, you can use the newly introduced Apache Spark troubleshooting agent. Traditional Spark troubleshooting requires extensive manual analysis of logs, performance metrics, and error patterns to identify root causes and optimization opportunities. The agent simplifies this process through natural language prompts, automated workload analysis, and intelligent code recommendations. The agent has three main components: an MCP-compatible AI assistant in your development environment for interaction, the MCP proxy for AWS that handles secure communication between your client and the MCP server, and an Amazon SageMaker Unified Studio managed MCP Server (preview) that provides specialized Spark troubleshooting and upgrade tools for Glue 5.1 jobs.

To set up the agent, follow the instructions to set up the resources and MCP configuration: Setup for Apache Spark Troubleshooting agent. Then, you can launch your preferred MCP client and use conversation to interact with the tools for troubleshooting.

The following is a demonstration on how you can use the Apache Spark troubleshooting agent with Kiro CLI to debug a Glue 5.1 job run.

For more information and video walkthroughs for how to use the Apache Spark troubleshooting agent, please refer to Apache Spark Troubleshooting agent for Amazon EMR.

Conclusion

In this post, we discussed the key features and benefits of AWS Glue 5.1. You can create new AWS Glue jobs on AWS Glue 5.1 or migrate your existing AWS Glue jobs to benefit from the improvements.

We would like to thank the support of numerous engineers and leaders who helped build Glue 5.1 to support customers with a performance optimized Spark runtime and deliver new capabilities.


About the authors

Chiho Sugimoto

Chiho is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. She loves planetary science and enjoys studying the asteroid Ryugu on weekends.

Noritaka Sekiyama

Noritaka is a Principal Big Data Architect at the AWS Analytics product team. He’s responsible for designing new features in AWS products, building software artifacts, and providing architecture guidance to customers. In his spare time, he enjoys cycling on his road bike.

Peter Tsai

Peter is a Software Development Engineer at AWS, where he enjoys solving challenges in the design and performance of the AWS Glue runtime. In his leisure time, he enjoys hiking and cycling.

Bo Li

Bo Li is a Senior Software Development Engineer on the AWS Glue team. He is devoted to designing and building end-to-end solutions to address customers’ data analytic and processing needs with cloud-based, data-intensive and GenAI technologies.

Kartik Panjabi

Kartik is a Software Development Manager on the AWS Glue team. His team builds generative AI features for the Data Integration and distributed system for data integration.

Peter Manastyrny

Peter is a Product Manager focusing on data processing and data integration workloads at AWS. He is working on making AWS Glue the best tool for building and operating complex integrated data pipelines.

Introducing Jobs in Amazon SageMaker

Post Syndicated from Chiho Sugimoto original https://aws.amazon.com/blogs/big-data/introducing-jobs-in-amazon-sagemaker/

Processing large volumes of data efficiently is critical for businesses, and so data engineers, data scientists, and business analysts need reliable and scalable ways to run data processing workloads. The next generation of Amazon SageMaker is the center for all your data, analytics, and AI. Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access all of the data in your organization and act on it using the best tools across any use case.

We’re excited to announce a new data processing job experience for Amazon SageMaker. Jobs are a common concept widely used in existing AWS services such as Amazon EMR and AWS Glue. With this launch, you can now build jobs in SageMaker to process large volumes of data. Jobs can be built using your preferred tool. For example, you can create jobs from extract, transform, and load (ETL) scripts coded in the Unified Studio code editor, code interactively in a Unified Studio Notebooks, or create jobs visually using the Unified Studio Visual ETL editor. After being created, data processing jobs can be set to run on demand, scheduled using the built in scheduler, or orchestrated with SageMaker workflows. You can monitor the status of your data processing jobs and view run history showing status, logs, and performance metrics. When jobs encounter failures, you can use generative AI troubleshooting to automatically analyze errors and receive detailed recommendations to resolve issues quickly. Together, you can use these capabilities to author, manage, operate, and monitor data processing workloads across your organization. The new experience provides an experience that’s consistent with other AWS analytics services such as AWS Glue.

This post demonstrates how the new jobs experience works in SageMaker Unified Studio.

Prerequisites

To get started, you must have the following prerequisites in place:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project with an Data analytics and AI-ML model development project profile

Example use case

A global apparel ecommerce retailer processes thousands of customer reviews daily across multiple marketplaces. They need to transform their raw review data into actionable insights to improve their product offerings and customer experience. Using SageMaker Unified Studio visual ETL editor, we’ll demonstrate how to transform raw review data into structured analytical datasets that enable market-specific performance analysis and product quality monitoring.

Create and run a visual job

In this section, you’ll create a Visual ETL Job that processes the review data from a Parquet file in Amazon Simple Storage Service Amazon S3. The job transforms the data using SQL queries and saves the results back to S3 buckets. Complete the following steps to create a Visual ETL Job:

  1. On the SageMaker Unified Studio console, on the top menu, choose Build.
  2. Under DATA ANALYSIS & INTEGRATION, choose Data processing jobs.
  3. Choose Create Visual ETL Job.

You’ll be directed to the Visual ETL editor, where you can create ETL jobs. You can use this editor to design data transformation pipelines by connecting source nodes, transformation nodes, and target nodes.

  1. On the top left, choose the plus (+) icon in the circle. Under Data sources, select Amazon S3.
  2. Select the Amazon S3 source node and enter the following values:
    1. S3 URI: s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/
    2. Format: Parquet
  3. Select Update node.
  4. Choose the plus (+) icon in the circle to the right of the Amazon S3 source node. Under Transforms, select SQL query.
  5. Enter the following query statement and select Update node.
SELECT
    marketplace,
    star_rating,
    DATE_FORMAT(review_date, 'yyyy-MM-dd') as review_date,
    COUNT(*) as review_count,
    AVG(CAST(helpful_votes as DOUBLE) / NULLIF(total_votes, 0)) as helpfulness_ratio,
    COUNT(CASE WHEN insight = 'Y' THEN 1 END) as insight_count
FROM {myDataSource}
GROUP BY
    marketplace,
    star_rating,
    DATE_FORMAT(review_date, 'yyyy-MM-dd')
  1. Choose the plus (+) icon to the right of the SQL Query node. Under Data target, select Amazon S3.
  2. Select the Amazon S3 target node and enter the following values:
    1. S3 URI: Choose the Amazon S3 location from the project overview page and add the suffix “/output/rating_analysis/”. For example, s3://<bucket-name>/<domainId>/<projectId>/output/rating_analysis/
    2. Format: Parquet
    3. Compression: Snappy
    4. Partition keys: review_date
    5. Mode: Append
  3. Select Update node.

Next, add another SQL query node connected to the same Amazon S3 data source. This node performs a SQL query transformations and outputs the results to a separate S3 location.

  1. On the top left, choose the plus (+) icon in the circle. Under Transforms, select SQL query, and connect the Amazon S3 source node.
  2. Enter the following query statement and select Update node.
SELECT 
    marketplace,
    product_id,
    product_title,
    COUNT(*) as review_count,
    AVG(star_rating) as avg_rating,
    SUM(helpful_votes) as total_helpful_votes,
    COUNT(DISTINCT customer_id) as unique_reviewers,
    COUNT(CASE WHEN insight = 'Y' THEN 1 END) as insight_count
FROM {myDataSource}
GROUP BY 
    marketplace,
    product_id,
    product_title
  1. Choose the plus (+) icon to the right of the SQL Query node. Under Data target, select Amazon S3.
  2. Select the Amazon S3 target node and enter the following values:
    1. S3 URI: Choose the Amazon S3 location from the project overview page and add suffix “/output/product_analysis/”. For example, s3://<bucket-name>/<domainId>/<projectId>/output/product_analysis/
    2. Format: Parquet
    3. Compression: Snappy
    4. Partition keys: marketplace
    5. Mode: Append
  3. Select Update node.

At this point, your end-to-end visual job should look like the following image. The next step is to save this job to the project and run the job.

  1. On the top right, choose Save to project to save the draft job. You can optionally change the name and add a description.
  2. Choose Save.
  3. On the top right, choose Run.

This will start running your Visual ETL job. You can monitor the list of job runs by selecting View runs in the top middle of the screen.

Create and run a code based job

In addition to creating jobs through the Visual ETL Editor, you can create jobs using a code-based approach by specifying Python script or Notebook files. When you specify a Notebook file, it automatically converts to a Python script to create the job. Here, you’ll create a notebook in JupyterLab within SageMaker Unified Studio, save it to the project repository, and then create a code-based job from that notebook. First, create a Notebook.

  1. On the SageMaker Unified Studio console, on the top menu, choose Build.
  2. Under IDE & APPLICATIONS, select JupyterLab.
  3. Select Python 3 under Notebook.

  1. For the first cell, select Local Python, python, enter following code:
%%configure -n project.spark.compatibility
{
    "number_of_workers": 10,
    "session_type": "etl",
    "glue_version": "5.0",
    "worker_type": "G.1X",
    "idle_timeout": 10,
    "timeout": 1200
}
  1. For the second cell, select PySpark, project.spark.compatibility, enter following code. This performs the same processing as the Visual ETL job you created above. Replace the S3 bucket and folder names for output_path.
import sys
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()

# Create Spark session
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()

# Configure paths
input_path = "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/"
output_path = "s3://<bucket-name>/<domainId>/<projectId>/code-job-output/results"


# Read data from S3
df = spark.read.format("parquet").load(input_path)
df.createOrReplaceTempView("reviews")

# Transform 1: Rating Analysis
rating_analysis = spark.sql("""
    SELECT 
        marketplace,
        star_rating,
        DATE_FORMAT(review_date, 'yyyy-MM-dd') as review_date,
        COUNT(*) as review_count,
        AVG(CAST(helpful_votes as DOUBLE) / NULLIF(total_votes, 0)) as helpfulness_ratio,
        COUNT(CASE WHEN insight = 'Y' THEN 1 END) as insight_count
    FROM reviews
    GROUP BY 
        marketplace,
        star_rating,
        DATE_FORMAT(review_date, 'yyyy-MM-dd')
""")

# Transform 2: Product Analysis
product_analysis = spark.sql("""
    SELECT 
        marketplace,
        product_id,
        product_title,
        COUNT(*) as review_count,
        AVG(star_rating) as avg_rating,
        SUM(helpful_votes) as total_helpful_votes,
        COUNT(DISTINCT customer_id) as unique_reviewers,
        COUNT(CASE WHEN insight = 'Y' THEN 1 END) as insight_count
    FROM reviews
    GROUP BY 
        marketplace,
        product_id,
        product_title
    HAVING 
        COUNT(*) >= 5
""")

# Write results to S3
rating_analysis.write.format("parquet") \
    .option("compression", "snappy") \
    .partitionBy("review_date") \
    .mode("append") \
    .save(f"{output_path}/rating_analysis")

product_analysis.write.format("parquet") \
    .option("compression", "snappy") \
    .partitionBy("marketplace") \
    .mode("append") \
    .save(f"{output_path}/product_analysis")
  1. Choose the File icon to save the notebook file. Enter the name of your notebook.

Save the notebook to the project’s repository.

  1. Choose the Git icon in the left navigation. This opens a panel where you can view the commit history and perform Git operations.
  2. Choose the plus (+) icon next to the files you want to commit.
  3. Enter a brief summary of the commit in the Summary text entry field. Optionally, enter a longer description of the commit in the Description text entry field.
  4. Choose Commit.
  5. Choose the Push committed changes icon to do a git push.

Create the Code-based Job from the Notebook file in the project repository.

  1. On the SageMaker Unified Studio console, on the top menu, choose Build.
  2. Under DATA ANALYSIS & INTEGRATION, choose Data processing jobs.
  3. Choose Create job from files.
  4. Choose Choose project files and choose Browse files.
  5. Select the Notebook file you created and choose Select.

Here, the Python script automatically converted from your notebook file will be displayed. Review the content.

  1.  Choose Next.
  2. For Job name, enter the name of your job.
  3. Choose Submit to create your job.
  4. Choose the job you created.
  5. Choose Run job.

Convert existing Visual ETL flows to jobs

You can convert an existing visual ETL flow to a job by saving your existing Visual ETL flow to the project repository. Use the following steps to create a job from your existing visual ETL flow:

  1. On the SageMaker Unified Studio console, on the top menu, choose Build.
  2. Under DATA ANALYSIS & INTEGRATION, select Visual ETL editor.
  3. Select the existing Visual ETL flow.
  4. On the top right, choose Save to project to save the draft flow. You can optionally change the name and add a description.
  5. Choose Save.

View jobs

You can view the list of jobs in your project on the Data processing jobs page. Jobs can be filtered by mode (Visual ETL or Code).

Monitor job runs

On each job’s detail page, you can view a list of job runs in the Job runs tab. You can filter activities by job run ID, status, start time, and end time. The Job runs list shows basic attributes such as duration, resources consumed, and instance type, along with log group names and various job parameters. You can list, compare, and explore job runs history based on various attributes.

On the individual job run details page, you can view job properties and output logs from the run. When a job fails because of an error, you can see the error message at the top of the page and examine detailed error information in the output logs.

Intelligent troubleshooting with generative AI: When jobs fail, you can take advantage of generative AI troubleshooting to resolve issues quickly. SageMaker Unified Studio’s AI-powered troubleshooting automatically analyzes job metadata, Spark event logs, error stack traces, and runtime metrics to identify root causes and provide actionable solutions. It handles both simple scenarios like missing S3 buckets, and complex performance issues such as out-of-memory exceptions. The analysis explains not just what failed, but why it failed and how to fix it, reducing troubleshooting time from hours or days to minutes.

To start the analysis, choosing Troubleshoot with AI at the top right. The troubleshooting analysis provides Root Cause Analysis identifying the specific issue, Analysis Insights explaining the error context and failure patterns, and Recommendations with step-by-step remediation actions. This expert-level analysis makes complex Spark debugging accessible to all team members, regardless of their Spark expertise.

Clean up

To avoid incurring future charges, delete the resources you created during this walkthrough:

  1. Delete Visual ETL flows in Visual ETL editor.
  2. Delete Data processing jobs, including Visual ETL and Code-based jobs.
  3. Delete Output files in the S3 bucket.

Conclusion

In this post, we explored the new job experience in Amazon SageMaker Unified Studio, which brings a familiar and consistent experience for data processing and data integration tasks. This new capability streamlines your workflows by providing enhanced visibility, cost management, and seamless migration paths from AWS Glue.With the ability to create both visual and code-based jobs, monitor job runs, and set up scheduling, the new jobs experience helps you build and manage data processing and data integration tasks efficiently. Whether you’re a data engineer working on ETL processes or a data scientist preparing datasets for machine learning, the job experience in SageMaker Unified Studio provides the tools you need in a unified environment.Start exploring the new job experience today to simplify your data processing workflows and make the most of your data in Amazon SageMaker Unified Studio.


About the authors

Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. She loves planetary science and enjoys studying the asteroid Ryugu on weekends.

Noritaka Sekiyama is a Principal Big Data Architect at the AWS Analytics product team. He’s responsible for designing new features in AWS products, building software artifacts, and providing architecture guidance to customers. In his spare time, he enjoys cycling on his road bike.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.

Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity

Post Syndicated from Chiho Sugimoto original https://aws.amazon.com/blogs/big-data/introducing-a-new-unified-data-connection-experience-with-amazon-sagemaker-lakehouse-data-connectivity/

The need to integrate diverse data sources has grown exponentially, but there are several common challenges when integrating and analyzing data from multiple sources, services, and applications. First, you need to create and maintain independent connections to the same data source for different services. Second, the data connectivity experience is inconsistent across different services. For each service, you need to learn the supported authorization and authentication methods, data access APIs, and framework to onboard and test data sources. Third, some services require you to set up and manage compute resources used for federated connectivity, and capabilities like connection testing and data preview aren’t available in all services. This fragmented, repetitive, and error-prone experience for data connectivity is a significant obstacle to data integration, analysis, and machine learning (ML) initiatives.

To solve for these challenges, we launched Amazon SageMaker Lakehouse unified data connectivity. This feature offers the following capabilities and benefits:

  • With SageMaker Lakehouse unified data connectivity, you can set up a connection to a data source using a connection configuration template that is standardized for multiple services. Amazon SageMaker Unified Studio, AWS Glue, and Amazon Athena can share and reuse the same connection with proper permission configuration.
  • SageMaker Lakehouse unified data connectivity supports standard methods for data source connection authorization and authentications, such as basic authorization and OAuth2. This approach simplifies your data journey and helps you meet your security requirements.
  • The SageMaker Lakehouse data connection testing capability boosts your confidence in established connections. With the ability to browse metadata, you can understand the structure and schema of the data source, identify relevant tables and fields, and discover useful data assets you may not be aware of.
  • SageMaker Lakehouse unified data connectivity’s data preview capability helps you map source fields to target schemas, identify needed data transformation, and plan data standardization and normalization steps.
  • SageMaker Lakehouse unified data connectivity provides a set of APIs for you to use without the need to learn different APIs for various data sources, promoting coding efficiency and productivity.

With SageMaker Lakehouse unified data connectivity, you can confidently connect, explore, and unlock the full value of your data across AWS services and achieve your business objectives with agility.

This post demonstrates how SageMaker Lakehouse unified data connectivity helps your data integration workload by streamlining the establishment and management of connections for various data sources.

Solution overview

In this scenario, an e-commerce company sells products on their online platform. The product data is stored on Amazon Aurora PostgreSQL-Compatible Edition. Their existing business intelligence (BI) tool runs queries on Athena. Furthermore, they have a data pipeline to perform extract, transform, and load (ETL) jobs when moving data from the Aurora PostgreSQL database cluster to other data stores.

Now they have a new requirement to allow ad-hoc queries through SageMaker Unified Studio to enable data engineers, data analysts, sales representatives, and others to take advantage of its unified experience.

In the following sections, we demonstrate how to set up this connection and run queries using different AWS services.

Prerequisites

Before you begin, make sure you have the followings:

  • An AWS account.
  • A SageMaker Unified Studio domain.
  • An Aurora PostgreSQL database cluster.
  • A virtual private cloud (VPC) and private subnets required for SageMaker Unified Studio.
  • An Amazon Simple Storage Service (Amazon S3) bucket to store output from the AWS Glue ETL jobs. In the following steps, replace amzn-s3-demo-destination-bucket with the name of the S3 bucket.
  • An AWS Glue Data Catalog database. In the following steps, replace <your_database> with the name of your database.

Create an IAM role for the AWS Glue job

You can either create a new AWS Identity and Access Management (IAM) role or use an existing role that has permission to access the AWS Glue output bucket and AWS Secrets Manager.

If you want to create a new one, complete the following steps:

  1. On the IAM console, in the navigation pane, choose Roles.
  2. Choose Create role.
  3. For Trusted entity type, choose AWS service.
  4. For Service or use case, choose Glue.
  5. Choose Next.
  6. For Add permissions, choose AWSGlueServiceRole, then choose Next.
  7. For Role name, enter a role name (for this post, GlueJobRole-demo).
  8. Choose Create role.
  9. Choose the created IAM role.
  10. Under Permissions policies, choose Add permission and Create inline policy.
  11. For Policy editor, choose JSON, and enter the following policy:
    {
         "Version": "2012-10-17",
         "Statement": [
             {
                 "Effect": "Allow",
                 "Action": [
                     "s3:List*",
                     "s3:GetObject",
                     "s3:PutObject",
                     "s3:DeleteObject"
                 ],
                 "Resource": [
                     "arn:aws:s3:::amzn-s3-demo-destination-bucket/*",
                     "arn:aws:s3:::amzn-s3-demo-destination-bucket"
                 ]
             },
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue"
                ],
                "Resource": [
                    "arn:aws:secretsmanager:<region>:<account-id>:secret:SageMakerUnifiedStudio-Glue-postgresql_source-*"
                ]
            }
         ]
     }

  12. Choose Next.
  13. For Policy name, enter a name for your policy.
  14. Choose Create policy.

Create a SageMaker Lakehouse data connection

Let’s get started with the unified data connection experience. The first step is to create a SageMaker Lakehouse data connection. Complete the following steps:

  1. Sign in to your SageMaker Unified Studio.
  2. Open your project.
  3. On your project, in the navigation pane, choose Data.
  4. Choose the plus sign.
  5. For Add data source, choose Add connection. Choose Next.
  6. Select PostgreSQL, and choose Next.
  7. For Name, enter postgresql_source.
  8. For Host, enter your host name of your Aurora PostgreSQL database cluster.
  9. For Port, enter your port number of your Aurora PostgreSQL database cluster (by default, it’s 5432).
  10. For Database, enter your database name.
  11. For Authentication, select Username and password.
  12. Enter your username and password.
  13. Choose Add data.

After the completion, it will create a new AWS Secrets Manager secret with a name like SageMakerUnifiedStudio-Glue-postgresql_source to securely store the specified username and password. It also creates a Glue connection with the same name postgresql_source.

Now you have a unified connection for Aurora PostgreSQL-Compatible.

Load data into the PostgreSQL database through the notebook

You will use a JupyterLab notebook on SageMaker Unified Studio to load sample data from an S3 bucket into a PostgreSQL database using Apache Spark.

  1. On the top left menu, choose Build, and under IDE & APPLICATIONS, choose JupyterLab.
  2. Choose Python 3 under Notebook.
  3. For the first cell, choose Local Python, python, enter following code, and run the cell:
    %%configure -f -n project.spark
    {
        "glue_version": "4.0"
    }

  4. For the second cell, choose PySpark, spark, enter following code, and run the cell:
    # Read sample data from S3 bucket
    df = spark.read.parquet("s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/")
    
    # Preview the data
    df.show()

The code snippet reads the sample data Parquet files from the specified S3 bucket location and stores the data in a Spark DataFrame named df. The df.show() command displays the first 20 rows of the DataFrame, allowing you to preview the sample data in a tabular format. Next, you will load this sample data into a PostgreSQL database.

  1. For the third cell, choose PySpark, spark, enter following code, and run the cell (replace <account-id> with your AWS account ID):
    import boto3
    import ast
    
    # replace you account ID before running this cell
    
    # Get secret
    secretsmanager_client = boto3.client('secretsmanager')
    get_secret_value_response = secretsmanager_client.get_secret_value(
        SecretId='SageMakerUnifiedStudio-Glue-postgresql_source' # replace the secret name if needed
    )
    secret = ast.literal_eval(get_secret_value_response["SecretString"])
    
    # Get connection
    glue_client = boto3.client('glue')
    glue_client_response = glue_client.get_connection(
        CatalogId='<account-id>',
        Name='postgresql_source' # replace the connection name if needed
    )
    connection_properties = glue_client_response["Connection"]["ConnectionProperties"]

  2. For the fourth cell, choose PySpark, spark, enter following code, and run the cell:
    # Load data into the DB
    jdbcurl = "jdbc:postgresql://{}:{}/{}".format(connection_properties["HOST"],connection_properties["PORT"],connection_properties["DATABASE"])
    df.write \
        .format("jdbc") \
        .option("url", jdbcurl) \
        .option("dbtable", "public.unified_connection_test") \
        .option("user", secret["username"]) \
        .option("password", secret["password"]) \
        .save()

Let’s see if you could successfully create the new table unified_connection_test. You can navigate to the project’s Data page to visually verify the existence of the newly created table.

  1. On the top left menu, choose your project name, and under CURRENT PROJECT, choose Data.

Within the Lakehouse section, expand the postgresql_source, then the public schema, and you should find the newly created unified_connection_test table listed there. Next, you will query the data in this table using SageMaker Unified Studio’s SQL query book feature.

Run queries on the connection through the query book using Athena

Now you can run queries using the connection you created. In this section, we demonstrate how to use the query book using Athena. Complete the following steps:

  1. In your project on SageMaker Unified Studio, choose the Lakehouse section, expand the postgresql_source, then the public
  2. On the options menu (three vertical dots) of the table unified_connection_test, choose Query with Athena.

This step will open a new SQL query book. The query statement select * from "postgresql_source"."public"."unified_connection_test" limit 10; is automatically filled.

  1. On the Actions menu, choose Save to Project.
  2. For Querybook title, enter the name of your SQL query book.
  3. Choose Save changes.

This will save the current SQL query book, and the status of the notebook will change from Draft to Saved. If you want to revert a draft notebook to its last published state, choose Revert to published version to roll back to the most recently published version. Now, let’s start running queries on your notebook.

  1. Choose Run all.

When a query finishes, results can be viewed in a few formats. The table view displays query results in a tabular format. You can download the results as JSON or CSV files using the download icon at the bottom of the output cell. Additionally, the notebook provides a chart view to visualize query results as graphs.

The sample data includes a column star_rating representing a 5-star rating for products. Let’s try a quick visualization to analyze the rating distribution.

  1. Choose Add SQL to add a new cell.
  2. Enter the following statement:
    SELECT count() as counts, star_rating FROM "postgresql_source"."public"."unified_connection_test"
    GROUP BY star_rating

  3. Choose the run icon of the cell, or you can press Ctrl+Enter or Cmd+Enter to run the query.

This will display the results in the output panel. Now you have learned how the connection works on SageMaker Unified Studio. Next, we show how you can use the connection on AWS Glue consoles.

Run Glue ETL jobs on the connection on the AWS Glue console

Next, we create an AWS Glue ETL job that reads table data from the PostgreSQL connection, converts data types, transforms the data into Parquet files, and outputs them to Amazon S3. It also creates a table in the Glue Data Catalog and add partitions so downstream data engineers can immediately use the table data. Complete the following steps:

  1. On the AWS Glue console, choose Visual ETL in the navigation pane.
  2. Under Create job, choose Visual ETL.
  3. At the top of the job, replace “Untitled job” with a name of your choice.
  4. On the Job Details tab, under Basic properties, specify the IAM role that the job will use (GlueJobRole-demo).
  5. For Glue version, choose Glue version 4.0
  6. Choose Save.
  7. On the Visual tab, choose the plus sign to open the Add nodes
  8. Search for postgresql and add PostgreSQL as Source.
  9. For JDBC source, choose JDBC connection details.
  10. For PostgreSQL connection, choose postgresql_source.
  11. For Table name, enter unified_connection_test
  1. As a child of this source, search in the Add nodes menu for timestamp and choose To Timestamp.
  2. For Column to convert, choose review_date.
  3. For Column type, choose iso.
  4. On the Visual tab, search in the Add nodes menu for s3 and add Amazon S3 as Target.
  5. For Format, choose Parquet.
  6. For Compression Type, choose Snappy.
  7. For S3 Target Location, enter your S3 output location (s3://amzn-s3-demo-destination-bucket).
  8. For Data Catalog update options, choose Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
  9. For Database, enter your Data Catalog database (<your_database>).
  10. For Table name, enter connection_demo_tbl.
  11. Under Partition keys, choose Add a partition key, and choose review_year.
  12. Choose Save, then choose Run to run the job.

When the job is complete, it will output Parquet files to Amazon S3 and create a table named connection_demo_tbl in the Data Catalog. You have now learned that you can use the SageMaker Lakehouse data connection not only in SageMaker Unified Studio, but also directly in AWS Glue console without needing to create separate individual connections.

Clean up

Now to the final step, cleaning up the resources. Complete the following steps:

  1. Delete the connection.
  2. Delete the Glue job.
  3. Delete the AWS Glue output S3 buckets.
  4. Delete the IAM role AWSGlueServiceRole.
  5. Delete the Aurora PostgreSQL cluster.

Conclusion

This post demonstrated how the SageMaker Lakehouse unified data connectivity works end to end, and how you can use the unified connection across different services such as AWS Glue and Athena. This new capability can simplify your data journey.

To learn more, refer to Amazon SageMaker Unified Studio.


About the Authors

Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. She loves planetary science and enjoys studying the asteroid Ryugu on weekends.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Shubham Agrawal is a Software Development Engineer on the AWS Glue team. He has expertise in designing scalable, high-performance systems for handling large-scale, real-time data processing. Driven by a passion for solving complex engineering problems, he focuses on building seamless integration solutions that enable organizations to maximize the value of their data.

Joju Eruppanal is a Software Development Manager on the AWS Glue team. He strives to delight customers by helping his team build software. He loves exploring different cultures and cuisines.

Julie Zhao is a Senior Product Manager at AWS Glue. She joined AWS in 2021 and brings three years of startup experience leading products in IoT data platforms. Prior to startups, she spent over 10 years in networking with Cisco and Juniper across engineering and product. She is passionate about building products to solve customer problems.

Author data integration jobs with an interactive data preparation experience with AWS Glue visual ETL

Post Syndicated from Chiho Sugimoto original https://aws.amazon.com/blogs/big-data/author-data-integration-jobs-with-an-interactive-data-preparation-experience-with-aws-glue-visual-etl/

We are excited to announce a new capability of the AWS Glue Studio visual editor that offers a new visual user experience. Now you can author data preparation transformations and edit them with the AWS Glue Studio visual editor. The AWS Glue Studio visual editor is a graphical interface that enables you to create, run, and monitor data integration jobs in AWS Glue.

The new data preparation interface in AWS Glue Studio provides an intuitive, spreadsheet-style view for interactively working with tabular data. Within this interface, you can visually inspect tabular data samples, validate recipe steps through real-time runs, and author data preparation recipes without writing code. Within the new experience, you can choose from hundreds of prebuilt transformations. This allows data analysts and data scientists to rapidly construct the necessary data preparation steps to meet their business needs. After you complete authoring the recipes, AWS Glue Studio will automatically generate the Python script to run the recipe data transformations as part of AWS Glue extract, transform, and load (ETL) jobs.

In this post, we show how to use this new feature to build a visual ETL job that preprocesses data to meet the business needs for an example use case, entirely within the AWS Glue Studio console, without the overhead of manual script coding.

Example use case

A fictional e-commerce company sells apparel and allows customers to leave text reviews and star ratings for each product, to help other customers to make informed purchase decisions. To simulate this, we will use a sample synthetic review dataset, which includes different products and customer reviews.

In this scenario, you’re a data analyst in this company. Your role involves preprocessing raw customer review data to prepare it for downstream analytics. This requires transforming the data by normalizing columns through actions such as casting columns to appropriate data types, splitting a single column into multiple new columns, and adding computed columns based on other columns. To quickly create an ETL job for these business requirements, you use AWS Glue Studio to inspect the data and author data preparation recipes.

The AWS Glue job will be configured to output the file to Amazon Simple Storage Service (Amazon S3) in a preferred format and automatically create a table in the AWS Glue Data Catalog. This Data Catalog table will be shared with your analyst team, allowing them to query the table using Amazon Athena.

Prerequisites

For this tutorial, you need an S3 bucket to store output from the AWS Glue ETL job and Athena queries, and a Data Catalog database to create new tables. You also need to create AWS Identity and Access Management (IAM) roles for the AWS Glue job and AWS Management Console user.

Create an S3 bucket to store output from the AWS Glue ETL jobs and Athena query results

You can either create a new S3 bucket or use an existing bucket to store output from the AWS Glue ETL job and Athena queries. In the following steps, replace <glue-etl-output-s3-bucket> and <athena-query-output-s3-bucket> with the name of the S3 bucket.

Create a Data Catalog database

You can either create a new Data Catalog database or use an existing database to create tables. In the following steps, replace <your_database> with the name of your database.

Create an IAM role for the AWS Glue job

Complete the following steps to create an IAM role for the AWS Glue job:

  1. On the IAM console, in the navigation pane, choose Role.
  2. Choose Create role.
  3. For Trusted entity type, choose AWS service.
  4. For Service or use case, choose Glue.
  5. Choose Next.
  6. For Add permissions, choose AWSGlueServiceRole, then choose Next.
  7. For Role name, enter a role name (for this post, GlueJobRole-recipe-demo).
  8. Choose Create role.
  9. Choose the created IAM role.
  10. Under Permissions policies, choose Add permission and Create inline policy.
  11. For Policy editor, choose JSON, and enter the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::aws-bigdata-blog/generated_synthetic_reviews/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:List*",
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::<glue-etl-output-s3-bucket>/*",
                    "arn:aws:s3:::<glue-etl-output-s3-bucket>"
                ]
            }
        ]
    }

  12. Choose Next.
  13. For Policy name, enter a name for your policy.
  14. Choose Create policy.

Create an IAM role for the console user

Complete the following steps to create the IAM role to interact with the console:

  1. On the IAM console, in the navigation pane, choose Role.
  2. Choose Create role.
  3. For Trusted entity type, choose the entity of your choice.
  4. For Add permissions, add the following AWS managed policies:
    1. AmazonAthenaFullAccess
    2. AWSGlueConsoleFullAccess
  5. Choose Next.
  6. For Role name, enter a role name of your choice.
  7. Choose Create role.
  8. Choose the created IAM role.
  9. Under Permissions policies, choose Add permission and Create inline policy.
  10. For Policy editor, choose JSON, and enter the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                "Action": [
                    "iam:PassRole"
                ],
                "Resource": [
                    "arn:aws:iam::<account-id>:role/GlueJobRole-recipe-demo"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::aws-bigdata-blog/generated_synthetic_reviews/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:List*",
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::<glue-etl-output-s3-bucket>/*",
                    "arn:aws:s3:::<athena-query-output-s3-bucket>/*"
                ]
            }
        ]
    }

  11. Choose Next.
  12. For Policy name, enter a name for your policy.
  13. Choose Create policy.

The S3 bucket and IAM roles required for this tutorial have been created and configured. Switch to the console user role that you set up and proceed with the steps in the following sections.

Author and run a data integration job using the interactive data preparation experience

Let’s create an AWS Glue ETL job in AWS Glue Studio. In this ETL job, we load S3 Parquet files as the source, process the data using recipe steps, and write the output to Amazon S3 as Parquet. You can configure all these steps in the visual editor in AWS Glue Studio. We use the new data preparation authoring capabilities to create recipes that meet our specific business needs for data transformations. This exercise will demonstrate how you can develop data preparation recipes in AWS Glue Studio that are tailored to your use case and can be readily incorporated into scalable ETL jobs. Complete the following steps:

  1. On the AWS Glue Studio console, choose Visual ETL in the navigation pane.
  2. Under Create job, choose Visual ETL.
  3. At the top of the job, replace “Untitled job” with a name of your choice.
  4. On the Job Details tab, under Basic properties, specify the IAM role that the job will use (GlueJobRole-recipe-demo).
  5. Choose Save.
  6. On the Visual tab, choose the plus sign to open the Add nodes menu. Search for s3 and add an Amazon S3 as a Source.
  1. For S3 source type, choose S3 location.
  2. For S3 URL, specify s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/.
  3. For Data format, select Parquet.
  4. As a child of this source, search in the Add nodes menu for recipe and add the Data Preparation Recipe
  5. In the Data preview window, choose Start session if it has not been started.
    1. If it hasn’t been started, Start a data preview session will be displayed on the Data Preparation Recipe
    2. Choose your IAM role for the AWS Glue job.
    3. Choose Start session.
  1. After your data preview session has been started, on the Data Preparation Recipe transform, choose Author Recipe to open the data preparation recipe editor.

This will initialize a session using a subset of the data. After session initialization, the AWS Glue Studio console provides an interactive interface that enables intuitive construction of recipe steps for AWS Glue ETL jobs.

As described in our example use case, you’re authoring recipes to preprocess customer review data for analysis. Upon reviewing the spreadsheet-style data preview, you notice the product_title column contains values like business formal pants, plain and business formal jeans, patterned, with the product name and sub-attribute separated by a comma. To better structure this data for downstream analysis, you decide to split the product_title column on the comma delimiter to create separate columns for the product name and sub-attribute. This will allow for easier filtering and aggregation by product type or attribute during analysis.

On the spreadsheet-style UI, you can check the statistics of each column like Min, Median, Max, cardinality, and value distribution for a subset of the data. This provides useful insights about the data to inform transformation decisions. When reviewing the statistics for the review_year columns, you notice they contain a wide range of values spanning over 15 years. To enable easier analysis of seasonal and weekly trends, you decide to derive new columns showing the week number and day of the week computed from the review_date column.

Moreover, for convenience of downstream analysis, you decided to change the data type of the customer_id and product_id columns from string to integer. Converting data types is a common task in ETL workflows for analytics. The data preparation recipes in AWS Glue Studio provide a wide variety of common ETL transformations like renaming columns, deleting columns, sorting, and reordering columns. Feel free to browse the data preparation UI to discover other available recipes that can help transform your data.

Let’s see how to implement the recipe step in the Data Preparation Recipe transform to meet these requirements.

  1. Select the customer_id column and choose the Change type recipe step.
    1. For Change type to, choose integer.
    2. Choose Apply to add the recipe step.
  1. Select the product_id column and choose the Change type recipe step.
    1. For Change type to, choose integer.
    2. Choose Apply.
  2. Select the product_title column and choose On a single delimiter under SPLIT.
    1. For Delimiter, select Enter custom value and enter ,.
    2. Choose Apply.
  1. Select the review_date column and choose Week number under EXTRACT.
    1. For Destination column, enter review_date_week_number.
    2. Choose Apply.
  1. Select the review_date column and choose Day of week under EXTRACT.
    1. For Destination column, enter review_date_week_day.
    2. Choose Apply.

After these recipe steps were applied, you can see the customer_id and product_id columns have been converted to integer, the product_title column has been split into product_title1 and product_title2, and review_date_week_number and review_date_week_day have been added. While authoring data preparation recipe steps, you can view tabular data and inspect whether the recipe steps are working as expected. This enables interactive validation of recipe steps through the subset examination results previewed in the UI during the recipe authoring process.

  1. Choose Done authoring recipe to close the interface.

Now, on the Script tab in AWS Glue Studio console, you can see the script generated from the recipe steps. AWS Glue Studio automatically converts the recipe steps configured through the UI into the Python code. This allows you to build ETL jobs utilizing the wide range of transformations available in data preparation recipes, without having to manually code the logic yourself.

  1. Choose Save to save the job.
  2. On the Visual tab, search in the Add nodes menu for s3 and add an Amazon S3 as a Target.
    1. For Format, choose Parquet.
    2. For Compression Type, choose Snappy.
    3. For S3 Target Location, select your output S3 location s3://<glue-etl-output-s3-bucket>/output/.
    4. For Data Catalog update options, choose Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
    5. For Database, choose the database of your choice.
    6. For Table name, enter data_preparation_recipe_demo_tbl.
    7. Under Partition keys, choose Add a partition key, and select review_year.
  3. Choose Save, then choose Run to run the job.

Up to this point, we have created and run the ETL job. When the job has finished running, a table named data_preparation_recipe_demo_tbl has been created in the Data Catalog. The table has the partition column review_year with partitions for the years 2000–2016. Let’s move on to the next step and query the table.

Run queries on the output data with Athena

Now that the AWS Glue ETL job is complete, let’s query the transformed output data. As a sample analysis, let’s find the top three items that were reviewed in 2008 across all marketplaces and calculate the average star rating for those items. Then, for the top one item that was reviewed in 2008, we find the top five sub-attributes for it. This will demonstrate querying the new processed dataset to derive insights.

  1. On the Athena console, run the following query against the table:
    SELECT count(*) AS count, product_title_1, avg(star_rating) AS ave 
    FROM <your_database>.data_preparation_recipe_demo_tbl 
    WHERE review_year = 2008
    GROUP BY product_title_1
    ORDER BY count DESC
    LIMIT 3;

This query counts the number of reviews in 2008 for each product_title_1 and returns the top three most reviewed items. It also calculates the average star_rating for each of the top three items. The query will return results as shown in the following screenshot.

The item made with natural materials heels is the top one most reviewed item. Now let’s query the top five most reviewed attributes for it.

  1. Run the following query against the table:
    SELECT count(*) AS count, product_title_2, avg(star_rating) AS ave 
    FROM <your_database>.data_preparation_recipe_demo_tbl
    WHERE review_year = 2008 
    AND product_title_1 = 'made with natural materials heels'
    GROUP BY product_title_2
    ORDER BY count DESC
    LIMIT 5;

The query will return results as shown in the following screenshot.

The query results show that for the top reviewed item made with natural materials heels, the top five most reviewed sub-attributes in 2008 were draped, asymmetric, muted, polka-dotted, and oversized. Of these top five sub-attributes, draped had the highest average star rating.

Through this walkthrough, we were able to quickly build an ETL job and generate datasets that fulfill analytics needs, without the overhead of manual script coding.

Clean up

If you no longer need this solution, you can delete the following resources created in this tutorial:

  • S3 bucket (s3://<glue-etl-output-s3-bucket>, s3://<athena-query-output-s3-bucket>)
  • IAM roles for the AWS Glue job (GlueJobRole-recipe-demo) and the console user
  • AWS Glue ETL job
  • Data Catalog database (<your_database>) and table (data_preparation_recipe_demo_tbl)

Conclusion

In this post, we introduced the new AWS Glue data preparation authoring experience, which lets you create new low-code no-code data integration recipe transformations directly on the AWS Glue Studio console. We demonstrated how you can use this feature to quickly build ETL jobs and generate datasets that meet your business needs without time-consuming manual coding.

The AWS Glue data preparation authoring experience is now publicly available. Try out this new capability and discover recipes that can facilitate your data transformations.

To learn more about using the interactive data preparation authoring experience in AWS Glue Studio, check out the following video and read the AWS News Blog.


About the Authors

Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. She loves planetary science and enjoys studying the asteroid Ryugu on weekends.

Fabrizio Napolitano is a Principal Specialist Solutions Architect or Data Analytics at AWS. He has worked in the analytics domain for the last 20 years, now focusing on helping Canadian public sector organizations innovate with data. Quite by surprise, he become a Hockey Dad after moving to Canada.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Gal HeyneGal Heyne is a Technical Product Manager for AWS Data Processing services with a strong focus on AI/ML, data engineering, and BI. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data services products.