Performing data transformations using Snowflake and AWS Glue

In the connected world, data is generated by many different sources in a wide variety of formats. Enterprises are looking for tools to ingest data from these evolving sources and to programmatically customize the ingested data to meet their data warehousing needs. You also need solutions that help you meet your business needs quickly, without provisioning any hardware or software resources, while keeping costs low with a pay-as-you-use model.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration, so you can start analyzing your data in minutes instead of weeks or months.

To support a wider variety of use cases, AWS Glue launched a new capability at AWS re:Invent 2020 for custom third-party connectors, which lets users visually orchestrate data integration workflows in AWS Glue Studio in minutes with just a few clicks. AWS Glue Custom Connectors help users search for and select connectors from AWS Marketplace or bring their own connectors. With this new feature, users can connect to Snowflake in a few clicks using their own Snowflake connector and start orchestrating the data pipeline in minutes.

In this post, we go over how to unify the datasets in your Amazon Simple Storage Service (Amazon S3) data lake with data in Snowflake, and how to read and transform that data using AWS Glue. Though not addressed in this post, you can also read data from Amazon S3, perform transformations on it using AWS Glue, and persist it into Snowflake by customizing the generated AWS Glue script.

Solution overview

The following architecture diagram shows how AWS Glue connects to Snowflake for data preparation.

You upload the Snowflake JDBC connector JAR file into your S3 bucket and define an AWS Identity and Access Management (IAM) role that has permissions to read from this bucket, write to a destination S3 bucket, and run AWS Glue jobs. Then you define your credentials to connect to Snowflake, either in AWS Secrets Manager or directly on the AWS Glue Studio console, and create a job that loads the JAR file from your S3 bucket, connects to Snowflake to get the data, and saves it to the defined S3 bucket location. With the same JDBC connection, you can also read data from your S3 bucket and write to Snowflake.
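
As a concrete sketch of these preparation steps, the following Python example uses boto3 to upload the Snowflake JDBC driver JAR to an S3 bucket and to store the Snowflake credentials in Secrets Manager. The file name, bucket, key, secret name, and secret keys are hypothetical placeholders, not values from the walkthrough; adjust them for your environment.

    import json

    import boto3

    # Upload the Snowflake JDBC driver JAR so AWS Glue can load it from S3.
    # The local file name, bucket, and key below are hypothetical placeholders.
    s3 = boto3.client("s3")
    s3.upload_file(
        Filename="snowflake-jdbc.jar",
        Bucket="my-glue-connector-bucket",
        Key="jars/snowflake-jdbc.jar",
    )

    # Store the Snowflake credentials in AWS Secrets Manager as key-value pairs.
    # The key names here mirror the ${Username} and ${Password} placeholders used
    # in the JDBC URL base; confirm the exact key names your connector expects.
    secrets = boto3.client("secretsmanager")
    secrets.create_secret(
        Name="snowflake/glue-credentials",
        SecretString=json.dumps(
            {"Username": "GLUE_USER", "Password": "REPLACE_WITH_PASSWORD"}
        ),
    )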

Creating a custom connector

To implement this solution, you first create a custom connector.

  1. On the AWS Glue Studio console, under Connectors, choose Create custom connector.
  2. For Connector S3 URL, enter the S3 location where you uploaded the Snowflake JDBC connector JAR file.
  3. For Name, enter a name (for this post, we enter snowflake-jdbc-connector).
  4. For Connector type, choose JDBC.
  5. For Class name, enter the Snowflake JDBC driver class name, net.snowflake.client.jdbc.SnowflakeDriver.
  6. For JDBC URL base, enter the following URL (provide your own account): jdbc:snowflake://<snowflake account info>/?user=${Username}&password=${Password}&warehouse=${warehouse}.
  7. For URL parameter delimiter, enter &.
  8. Choose Create connector.

Creating a connection

To create a JDBC connection to Snowflake, complete the following steps:

  1. On the Connectors page, select the connector.
  2. Choose Create connection.

  3. For Name, enter a name, such as snowflake-glue-jdbc-connection.
  4. For Description, enter a meaningful description to identify the connection.
  5. For JDBC URL format, choose default.

You have an option to enter a user name and password or use Secrets Manager to store your encrypted credentials.

  6. For this post, for Data source credentials, select Use a secret.
  7. For Secret, choose your secret.
  8. For Additional URL parameters, provide the following parameters needed to run a SQL statement in Snowflake:
    1. warehouse – The Snowflake virtual warehouse used to run the query. Replace {warehouse} with a valid value.
    2. db – The Snowflake database name.
    3. schema – The Snowflake database schema.
  9. Verify that the JDBC URL is well formed (the sketch after this list illustrates how these additional parameters are appended to the base URL).

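To make the URL assembly concrete, here is a minimal Python sketch of how the additional URL parameters combine with the JDBC URL base using the & delimiter configured on the connector. The account identifier and parameter values are hypothetical placeholders, and the snippet only mimics the composition AWS Glue Studio performs; it is not Glue's actual code.

    # Minimal sketch: appending the additional URL parameters (warehouse, db,
    # schema) to the JDBC URL base with the configured "&" delimiter.
    # The account identifier and values below are hypothetical placeholders.
    base_url = (
        "jdbc:snowflake://myaccount.us-east-1.snowflakecomputing.com"
        "/?user=${Username}&password=${Password}&warehouse=${warehouse}"
    )

    additional_params = {
        "warehouse": "COMPUTE_WH",  # virtual warehouse that runs the query
        "db": "HR",                 # Snowflake database name
        "schema": "PUBLIC",         # Snowflake schema name
    }

    delimiter = "&"
    full_url = base_url + delimiter + delimiter.join(
        f"{key}={value}" for key, value in additional_params.items()
    )
    print(full_url)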

Creating a job

You’re now ready to define the job using this connection.

  1. On the Connectors page, select your connection.
  2. Choose Create job.

  3. For Name, enter a name (for this post, we enter untitled job).
  4. For Description, enter a meaningful description for the job.
  5. For IAM Role, choose a role that can read the Snowflake JDBC JAR file from its source S3 location, write to the target S3 location, and run the AWS Glue job (use the AWS Glue service role).
  6. Use the default options for Type, Glue version, Language, Worker type, Number of workers, Number of retries, and Job timeout.
  7. For Job bookmark, choose Disable.

  8. Save the job.
  9. On the Visual tab, go to the Data source properties – Connector tab to specify the table or query to read from Snowflake.
  10. Choose Save.

  11. On the Visual tab, choose the + icon to create a new S3 node for the destination.
  12. On the Node properties tab, make sure the node is set as the Target node.

  13. On the Data target properties tab, define the S3 bucket location where AWS Glue writes the results.

  14. Add an ApplyMapping transform to map the Snowflake column names to the destination column names.

  15. Save your settings.
  16. On the Script tab, look at the script generated by AWS Glue for verification (a rough sketch of what this script can look like follows this list).

  17. Run the job and validate that the table data is successfully stored in the specified S3 bucket location.
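
For reference, the script that AWS Glue Studio generates for this job typically follows a pattern similar to the PySpark sketch below: read from Snowflake through the custom JDBC connection, apply the mapping, and write the result to Amazon S3. The connection name, query, column mappings, and output path are placeholders for this post, and the options in your generated script may differ.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read from Snowflake through the custom JDBC connection. The query and
    # connection name are hypothetical placeholders filled in from the job's
    # data source properties.
    snowflake_dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="custom.jdbc",
        connection_options={
            "query": "SELECT * FROM employee",
            "connectionName": "snowflake-glue-jdbc-connection",
        },
        transformation_ctx="snowflake_dyf",
    )

    # Map the Snowflake columns to the destination columns (hypothetical mappings).
    mapped_dyf = ApplyMapping.apply(
        frame=snowflake_dyf,
        mappings=[
            ("ID", "decimal", "id", "int"),
            ("NAME", "string", "name", "string"),
        ],
        transformation_ctx="mapped_dyf",
    )

    # Write the mapped records to the target S3 location (placeholder path).
    glueContext.write_dynamic_frame.from_options(
        frame=mapped_dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-target-bucket/snowflake-output/"},
        format="json",
        transformation_ctx="s3_sink",
    )

    job.commit()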

In the following screenshot, I upload three records from my employee table in Snowflake into my S3 bucket.

The following screenshot shows that my S3 bucket has the data from Snowflake.
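
If you prefer to verify the output programmatically rather than in the console, a quick boto3 listing like the following sketch works as well; the bucket name and prefix are hypothetical placeholders.

    import boto3

    # List the objects that the AWS Glue job wrote to the target location.
    # Bucket name and prefix are hypothetical placeholders.
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(
        Bucket="my-target-bucket", Prefix="snowflake-output/"
    )
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])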

Conclusion

In this post, you saw how AWS Glue Studio integration with Snowflake simplifies connecting to Snowflake and applying transformations without writing a single line of code. You also learned how to define Snowflake connection parameters in AWS Glue, connect to Snowflake from AWS Glue, read from Snowflake using AWS Glue, and apply transformations to meet your business needs.


About the Author

Srinivas Kesanapally is a principal partner solutions architect at AWS with over 25 years of experience working with database and analytics products from traditional to modern database vendors. He has helped many large technology companies design data analytics solutions and has led engineering teams involved in modernizing data analytics platforms.