Post Syndicated from Kapil Shardha original https://aws.amazon.com/blogs/big-data/use-aws-glue-to-run-etl-jobs-against-non-native-jdbc-data-sources/
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easier to prepare and load your data for analytics. You can create and run an ETL job with a few clicks on the AWS Management Console. Just point AWS Glue to your data store. AWS Glue discovers your data and stores the associated metadata (for example, a table definition and schema) in the AWS Glue Data Catalog.
AWS Glue has native connectors to data sources using JDBC drivers, either on AWS or elsewhere, as long as there is IP connectivity. In this post, we demonstrate how to connect to data sources that are not natively supported in AWS Glue today. We walk through connecting to and running ETL jobs against two such data sources, IBM DB2 and SAP Sybase. However, you can use the same process with any other JDBC-accessible database.
AWS Glue data sources
AWS Glue natively supports the following data stores by using the JDBC protocol:
- Publicly accessible databases
- Amazon Aurora
- Microsoft SQL Server
For more information, see Adding a Connection to Your Data Store in the AWS Glue Developer Guide.
One of the fastest growing architectures deployed on AWS is the data lake. The ETL processes that are used to ingest, clean, transform, and structure data are critically important for this architecture. Having the flexibility to interoperate with a broader range of database engines allows for a quicker adoption of the data lake architecture.
For data sources that AWS Glue doesn’t natively support, such as IBM DB2, Pivotal Greenplum, SAP Sybase, or any other relational database management system (RDBMS), you can import custom database connectors from Amazon S3 into AWS Glue jobs. In this case, the connection to the data source must be made from the AWS Glue script to extract the data, rather than using AWS Glue connections. To learn more, see Providing Your Own Custom Scripts in the AWS Glue Developer Guide.
Setting up an ETL job for an IBM DB2 data source
The first example demonstrates how to connect the AWS Glue ETL job to an IBM DB2 instance, transform the data from the source, and store it in Apache Parquet format in Amazon S3. To successfully create the ETL job using an external JDBC driver, you must define the following:
- The S3 location of the job script
- The S3 location of the temporary directory
- The S3 location of the JDBC driver
- The S3 location of the Parquet data (output)
- The IAM role for the job
By default, AWS Glue suggests bucket names for the scripts and the temporary directory using the following format:
For JDBC drivers, you can create a similar location:
Also for the Parquet data (output), you can create a similar location:
Keep in mind that having the AWS Glue job and S3 buckets in the same AWS Region helps save on cross-Region data transfer fees. For this post, we will work in the US East (Ohio) Region (us-east-2).
Creating the IAM role
The next step is to set up the IAM role that the ETL job will use:
- Sign in to the AWS Management Console, and search for IAM:
- On the IAM console, choose Roles in the left navigation pane.
- Choose Create role. The role type of trusted entity must be an AWS service, specifically AWS Glue.
- Choose Next: Permissions.
- Search for the AWSGlueServiceRole policy, and select it.
- Search again, now for the SecretsManagerReadWrite This policy allows the AWS Glue job to access database credentials that are stored in AWS Secrets Manager.
CAUTION: This policy is open and is being used for testing purposes only. You should create a custom policy to narrow the access just to the secrets that you want to use in the ETL job.
- Select this policy, and choose Next: Review.
- Give your role a name, for example, GluePermissions, and confirm that both policies were selected.
- Choose Create role.
Now that you have created the IAM role, it’s time to upload the JDBC driver to the defined location in Amazon S3. For this example, we will use the DB2 driver, which is available on the IBM Support site.
Storing database credentials
It is a best practice to store database credentials in a safe store. In this case, we use AWS Secrets Manager to securely store credentials. Follow these steps to create those credentials:
- Open the console, and search for Secrets Manager.
- In the AWS Secrets Manager console, choose Store a new secret.
- Under Select a secret type, choose Other type of secrets.
- In the Secret key/value, set one row for each of the following parameters:
- db_url (for example, jdbc:db2://10.10.12.12:50000/SAMPLE)
- driver_name (ibm.db2.jcc.DB2Driver)
- output_bucket: (for example, aws-glue-data-output-1234567890-us-east-2/User)
- Choose Next.
- For Secret name, use DB2_Database_Connection_Info.
- Choose Next.
- Keep the Disable automatic rotation check box selected.
- Choose Next.
- Choose Store.
Adding a job in AWS Glue
The next step is to author the AWS Glue job, following these steps:
- In the AWS Management Console, search for AWS Glue.
- In the navigation pane on the left, choose Jobs under the ETL
- Choose Add job.
- Fill in the basic Job properties:
- Give the job a name (for example, db2-job).
- Choose the IAM role that you created previously (GluePermissions).
- For This job runs, choose A new script to be authored by you.
- For ETL language, choose Python.
- In the Script libraries and job parameters section, choose the location of your JDBC driver for Dependent jars path.
- Choose Next.
- On the Connections page, choose Next
- On the summary page, choose Save job and edit script. This creates the job and opens the script editor.
In the editor, replace the existing code with the following script. Important: Line 47 of the script corresponds to the mapping of the fields in the source table to the destination, dropping of the null fields to save space in the Parquet destination, and finally writing to Amazon S3 in Parquet format.
- Choose Save.
- Choose the black X on the right side of the screen to close the editor.
Running the ETL job
Now that you have created the job, the next step is to execute it as follows:
- On the Jobs page, select your new job. On the Action menu, choose Run job, and confirm that you want to run the job. Wait a few moments as it finishes the execution.
- After the job shows as Succeeded, choose Logs to read the output of the job.
- In the output of the job, you will find the result of executing the df.printSchema() and the message with the df.count().
Also, if you go to your output bucket in S3, you will find the Parquet result of the ETL job.
Using AWS Glue, you have created an ETL job that connects to an existing database using an external JDBC driver. It enables you to execute any transformation that you need.
Setting up an ETL job for an SAP Sybase data source
In this section, we describe how to create an AWS Glue ETL job against an SAP Sybase data source. The process mentioned in the previous section works for a Sybase data source with a few changes required in the job:
- While creating the job, choose the correct jar for the JDBC dependency.
- In the script, change the reference to the secret to be used from AWS Secrets Manager:
And change the mapping of the fields in line 47, as follows:
After you successfully execute the new ETL job, the output contains the same type of information that was generated with the DB2 data source.
Note that each of these JDBC drivers has its own nuances and different licensing terms that you should be aware of before using them.
Maximizing JDBC read parallelism
Something to keep in mind while working with big data sources is the memory consumption. In some cases, “Out of Memory” errors are generated when all the data is read into a single executor. One approach to optimize this is to rely on the parallelism on read that you can implement with Apache Spark and AWS Glue. To learn more, see the Apache Spark SQL module.
You can use the following options:
- partitionColumn: The name of an integer column that is used for partitioning.
- lowerBound: The minimum value of partitionColumn that is used to decide partition stride.
- upperBound: The maximum value of partitionColumn that is used to decide partition stride.
- numPartitions: The number of partitions. This, along with lowerBound (inclusive) and upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the partitionColumn When unset, this defaults to SparkContext.defaultParallelism.
- Those options specify the parallelism of the table read. lowerBound and upperBound decide the partition stride, but they don’t filter the rows in the table. Therefore, Spark partitions and returns all rows in the table. For example:
It’s important to be careful with the number of partitions because too many partitions could also result in Spark crashing your external database systems.
Using the process described in this post, you can connect to and run AWS Glue ETL jobs against any data source that can be reached using a JDBC driver. This includes new generations of common analytical databases like Greenplum and others.
You can improve the query efficiency of these datasets by using partitioning and pushdown predicates. For more information, see Managing Partitions for ETL Output in AWS Glue. This technique opens the door to moving data and feeding data lakes in hybrid environments.
If you found this post useful, be sure to check out Work with partitioned data in AWS Glue.
About the Authors
Kapil Shardha is a Technical Account Manager and supports enterprise customers with their AWS adoption. He has background in infrastructure automation and DevOps.
William Torrealba is an AWS Solutions Architect supporting customers with their AWS adoption. He has background in Application Development, High Available Distributed Systems, Automation, and DevOps.