Making ETL easier with AWS Glue Studio

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/making-etl-easier-with-aws-glue-studio/

AWS Glue Studio is an easy-to-use graphical interface that speeds up the process of authoring, running, and monitoring extract, transform, and load (ETL) jobs in AWS Glue. The visual interface lets users without Apache Spark expertise design jobs with no coding required, and accelerates the process for those who already know Spark.

AWS Glue Studio was designed to help you create ETL jobs easily. After you design a job in the graphical interface, it generates the Apache Spark code for you, so you don't have to deal with the challenges of coding it by hand. When the job is ready, you can run it and monitor its status using the integrated UI.

AWS Glue Studio supports different types of data sources, both structured and semi-structured, and offers both batch and real-time data processing. You can extract data from sources like Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), Amazon Kinesis, and Apache Kafka. It offers Amazon S3 and tables defined in the AWS Glue Data Catalog as destinations.

This post shows you how to create an ETL job to extract, filter, join, and aggregate data easily using AWS Glue Studio.

About this blog post

  • Time to read: 15 minutes
  • Time to complete: 45 minutes
  • Cost to complete (estimated): $0.059 total (Amazon S3: $0.023; AWS Glue: $0.036; AWS Identity and Access Management: $0)
  • Learning level: Intermediate (200)
  • Services used: AWS Glue, Amazon S3, AWS Identity and Access Management

Overview of solution

To demonstrate how to create an ETL job using AWS Glue Studio, we use the Toronto parking tickets dataset, specifically the data about parking tickets issued in the city of Toronto in 2018, and the trials dataset, which contains all the information about the trials related to those parking tickets. The goal is to filter, join, and aggregate the two datasets to get the number of parking tickets handled per court in the city of Toronto during that year.

Prerequisites

For this walkthrough, you should have an AWS account. For this post, you launch the required AWS resources using AWS CloudFormation in the us-east-1 Region. If you haven’t signed up for AWS, complete the following tasks:

  1. Create an account.
  2. Create an AWS Identity and Access Management (IAM) user. For instructions, see Create an IAM User.

Important: If the AWS account you use to follow this guide uses AWS Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user who is both a Data lake administrator and a Database creator, as described in the documentation.

Launching your CloudFormation stack

To create your resources for this use case, complete the following steps:

  1. Launch your stack in us-east-1.
  2. Select the I acknowledge that AWS CloudFormation might create IAM resources with custom names option.
  3. Choose Create stack.

Launching this stack creates several AWS resources. The following CloudFormation outputs are the ones you need in the next steps:

  • AWSGlueStudioRole – IAM role to run AWS Glue jobs
  • AWSGlueStudioS3Bucket – Name of the S3 bucket to store blog-related files
  • AWSGlueStudioTicketsYYZDB – AWS Glue Data Catalog database
  • AWSGlueStudioTableTickets – Data Catalog table to use as a source
  • AWSGlueStudioTableTrials – Data Catalog table to use as a source
  • AWSGlueStudioParkingTicketCount – Data Catalog table to use as the destination
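
If you want to read these outputs in a script instead of the console, the CloudFormation describe_stacks response can be parsed with a small helper. This is a minimal sketch; the stack name and the sample values below are placeholders, not values from this walkthrough:

```python
def stack_output(response, key):
    """Return the value of a named output from a CloudFormation
    describe_stacks response, or None if the key is absent."""
    for output in response["Stacks"][0].get("Outputs", []):
        if output["OutputKey"] == key:
            return output["OutputValue"]
    return None

# With boto3 (not run here) the response would come from:
#   boto3.client("cloudformation").describe_stacks(StackName="glue-studio-blog")
# The canned response below just mirrors that shape for illustration.
sample = {"Stacks": [{"Outputs": [
    {"OutputKey": "AWSGlueStudioRole", "OutputValue": "AWSGlueStudioRole"},
    {"OutputKey": "AWSGlueStudioS3Bucket", "OutputValue": "example-bucket-name"},
]}]}
print(stack_output(sample, "AWSGlueStudioS3Bucket"))  # example-bucket-name
```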

Creating a job

A job is the AWS Glue component that allows the implementation of business logic to transform data as part of the ETL process. For more information, see Adding Jobs in AWS Glue.

To create an AWS Glue job using AWS Glue Studio, complete the following steps:

  1. On the AWS Management Console, choose Services.
  2. Under Analytics, choose AWS Glue.
  3. In the navigation pane, choose AWS Glue Studio.
  4. On the AWS Glue Studio home page, choose Create and manage jobs.

AWS Glue Studio supports different sources, including Amazon S3, Amazon RDS, Amazon Kinesis, and Apache Kafka. For this post, you use two AWS Glue tables as data sources and one S3 bucket as the destination.

  1. In the Create Job section, select Blank graph.
  2. Choose Create.

This takes you to the Visual Canvas to create an AWS Glue job.

  1. Change the Job name from Untitled Job to YYZ-Tickets-Job.

You now have an AWS Glue job ready to filter, join, and aggregate data from two different sources.

Adding sources

For this post, you use two AWS Glue tables as data sources: Tickets and Trials, which the CloudFormation template created. The data is located in an external S3 bucket in Parquet format. To add these tables as sources, complete the following steps:

  1. Choose the (+) icon.
  2. On the Node properties tab, for Name, enter Tickets.
  3. For Node type, choose S3.
  4. On the Data Source properties – S3 tab, for Database, choose yyz-tickets.
  5. For Table, choose tickets.
  6. For Partition predicate (optional), leave blank.

Before adding the second data source to the ETL job, be sure that the node you just created isn’t selected.

  1. Choose the (+) icon.
  2. On the Node properties tab, for Name, enter Trials.
  3. For Node type, choose S3.
  4. On the Data Source properties – S3 tab, for Database, choose yyz-tickets.
  5. For Table, choose trials.
  6. For Partition predicate (optional), leave blank.

You now have two AWS Glue tables as the data sources for the AWS Glue job.

Adding transforms

A transform is the AWS Glue Studio component where the data is modified. You can use the built-in transforms that are part of the service or supply your own custom code. To add transforms, complete the following steps:

  1. Choose the Tickets node.
  2. Choose the (+) icon.

  3. On the Node properties tab, for Name, enter Ticket_Mapping.
  4. For Node type, choose ApplyMapping.
  5. For Node parents, choose Tickets.
  6. On the Transform tab, change the ticket_number data type from decimal to int.
  7. Drop the following columns:
    • Location1
    • Location2
    • Location3
    • Location4
    • Province

Now you add a second ApplyMapping transform to modify the Trials data source.

  1. Choose the Trials data source node.
  2. Choose the (+) icon.
  3. On the Node properties tab, for Name, enter Trial_Mapping.
  4. For Node type, choose ApplyMapping.
  5. For Node parents, leave at default value (Trials).
  6. On the Transform tab, change the parking_ticket_number data type from long to int.

Now that you have set the right data types and removed some of the columns, it’s time to join the data sources using the Join transform.

  1. Choose the Ticket_Mapping transform.
  2. Choose the (+) icon.

  3. On the Node properties tab, for Name, enter Join_Ticket_Trial.
  4. For Node type, choose Join.
  5. For Node parents, choose Ticket_Mapping and Trial_Mapping.
  6. On the Transform tab, for Join type, choose Inner join.
  7. For Join conditions, choose Add condition.
  8. For Ticket_Mapping, choose ticket_number.
  9. For Trial_Mapping, choose parking_ticket_number.

Now the two data sources are joined by the ticket_number and parking_ticket_number columns.
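
Conceptually, the Join node performs an inner join equivalent to the plain-Python sketch below; in the generated script this is a Join.apply call on the two mapped frames, and the sample rows here are hypothetical:

```python
def inner_join(left, right, left_key, right_key):
    """Inner-join two lists of dicts on the given key columns."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[left_key], []):
            joined.append({**row, **match})
    return joined

# Hypothetical sample rows; a trial row with no matching ticket is dropped
tickets = [{"ticket_number": 1, "infraction_code": 3}]
trials = [{"parking_ticket_number": 1, "court_location": "Court A"},
          {"parking_ticket_number": 2, "court_location": "Court B"}]
rows = inner_join(tickets, trials, "ticket_number", "parking_ticket_number")
print(rows)
# [{'ticket_number': 1, 'infraction_code': 3,
#   'parking_ticket_number': 1, 'court_location': 'Court A'}]
```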

Performing data aggregation

In this step, you do some data aggregation to see the number of tickets handled per court in Toronto.

AWS Glue Studio offers the option of adding custom code for use cases that need a more complex transformation. For this post, we use PySpark code to do the data transformation. It contains Spark SQL code and a combination of DynamicFrames and DataFrames.

  1. Choose the Join_Ticket_Trial transform.
  2. Choose the (+) icon.

  3. On the Node properties tab, for Name, enter Aggregate_Tickets.
  4. For Node type, choose Custom transform.
  5. For Node parents, leave Join_Ticket_Trial selected.
  6. On the Transform tab, for Code block, change the function name from MyTransform to Aggregate_Tickets.
  7. Enter the following code:
    # Pick the first (and only) DynamicFrame in the incoming collection
    # and convert it to a Spark DataFrame
    selected = dfc.select(list(dfc.keys())[0]).toDF()
    selected.createOrReplaceTempView("ticketcount")
    # Count tickets per court location and infraction
    totals = spark.sql("select court_location as location, infraction_description as infraction, count(infraction_code) as total FROM ticketcount group by infraction_description, infraction_code, court_location order by court_location asc")
    # Wrap the aggregated DataFrame back into a DynamicFrameCollection
    # so the next node can consume it
    results = DynamicFrame.fromDF(totals, glueContext, "results")
    return DynamicFrameCollection({"results": results}, glueContext)
    

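If you want to sanity-check the aggregation logic without a Spark cluster, the Spark SQL in the custom transform can be replicated in plain Python. A sketch with hypothetical sample rows (the court and infraction values are made up):

```python
from collections import Counter

def aggregate_tickets(rows):
    """Plain-Python equivalent of the transform's Spark SQL: count tickets
    per (court_location, infraction_description, infraction_code) group,
    ordered by court_location."""
    counts = Counter(
        (r["court_location"], r["infraction_description"], r["infraction_code"])
        for r in rows
    )
    return [
        {"location": loc, "infraction": desc, "total": total}
        for (loc, desc, code), total in sorted(counts.items())
    ]

# Hypothetical sample rows standing in for the joined dataset
sample_rows = [
    {"court_location": "Court A", "infraction_description": "PARK ON PRIVATE PROPERTY", "infraction_code": 3},
    {"court_location": "Court A", "infraction_description": "PARK ON PRIVATE PROPERTY", "infraction_code": 3},
    {"court_location": "Court B", "infraction_description": "PARK FAIL TO DISPLAY RECEIPT", "infraction_code": 5},
]
print(aggregate_tickets(sample_rows))
# [{'location': 'Court A', 'infraction': 'PARK ON PRIVATE PROPERTY', 'total': 2},
#  {'location': 'Court B', 'infraction': 'PARK FAIL TO DISPLAY RECEIPT', 'total': 1}]
```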
After adding the custom transformation to the AWS Glue job, you want to store the result of the aggregation in the S3 bucket. To do this, you need a Select from collection transform to read the output from the Aggregate_Tickets node and send it to the destination.

  1. Choose the new node (it is named New node by default).
  2. Leave the Transform tab with the default values.
  3. On the Node Properties tab, change the name of the transform to Select_Aggregated_Data.
  4. Leave everything else with the default values.
  5. Choose the Select_Aggregated_Data node.
  6. Choose the (+) icon.

  7. On the Node properties tab, for Name, enter Ticket_Count_Dest.
  8. For Node type, choose S3 in the Data target section.
  9. For Node parents, choose Select_Aggregated_Data.
  10. On the Data Target Properties-S3 tab, for Format, choose Parquet.
  11. For Compression Type, choose GZIP.
  12. For S3 Target Location, enter s3://glue-studio-blog-{Your Account ID as a 12-digit number}/parking_tickets_count/.

The job should look like the following screenshot.

You now have three transforms to do data mapping, filtering, and aggregation.

Configuring the job

When the logic behind the job is complete, you must set the parameters for the job run. In this section, you configure the job by selecting components such as the IAM role and the AWS Glue version you use to run the job.

  1. On the Job details tab, for Description, enter Glue Studio blog post job.
  2. For IAM Role, choose AWSGlueStudioRole (which the CloudFormation template created).
  3. For Job Bookmark, choose Disable.
  4. For Number of retries, optionally enter 1.
  5. Choose Save.
  6. When the job is saved, choose Run.

Monitoring the job

AWS Glue Studio offers a job monitoring dashboard that provides comprehensive information about your jobs. You can view job statistics and detailed information about each job run and its status.

  1. In the AWS Glue Studio navigation panel, choose Monitoring.
  2. Choose the entry with the job name YYZ-Tickets-Job.
  3. To get more details about the job run, choose View run details.
  4. Wait until Run Status changes to Succeeded.
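
If you prefer to wait for the run from a script, the status check can be automated. A minimal polling sketch: the state-fetching function is injected so the loop can be exercised without AWS; with boto3 you would pass something like lambda: glue.get_job_run(JobName="YYZ-Tickets-Job", RunId=run_id)["JobRun"]["JobRunState"].

```python
import time

def wait_for_job_run(get_state, poll_seconds=30,
                     terminal=("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT")):
    """Poll a job-run state function until it reports a terminal state."""
    while True:
        state = get_state()
        if state in terminal:
            return state
        time.sleep(poll_seconds)

# Stubbed state sequence standing in for a real Glue job run
states = iter(["STARTING", "RUNNING", "RUNNING", "SUCCEEDED"])
print(wait_for_job_run(lambda: next(states), poll_seconds=0))  # SUCCEEDED
```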

You can verify that the job ran successfully on the Amazon Athena console.

  1. On the Athena console, choose the yyz-tickets database.
  2. Choose the icon next to the parking_tickets_count table (which the CloudFormation template created).

For more information about creating AWS Glue tables, see Defining Tables in the AWS Glue Data Catalog.

  1. Choose Preview table.

As you can see in the following screenshot, the information that the job generated is available, and you can query the number of ticket types issued per court in the city of Toronto in 2018.

Cleaning up

To avoid incurring future charges and to clean up unused roles and policies, delete the resources you created: the CloudFormation stack, S3 bucket, and AWS Glue job.

Conclusion

In this post, you learned how to use AWS Glue Studio to create an ETL job. You can use AWS Glue Studio to speed up the ETL job creation process and allow different personas to transform data without any previous coding experience. For more information about AWS Glue Studio, see the AWS Glue Studio documentation and What’s New with AWS.


About the author

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he works with customers across Canada to design and build big data solutions.