All posts by Bo Li

Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation

Post Syndicated from Bo Li original https://aws.amazon.com/blogs/big-data/amazon-q-data-integration-adds-dataframe-support-and-in-prompt-context-aware-job-creation/

Amazon Q data integration, introduced in January 2024, allows you to use natural language to author extract, transform, load (ETL) jobs and operations in AWS Glue specific data abstraction DynamicFrame. This post introduces exciting new capabilities for Amazon Q data integration that work together to make ETL development more efficient and intuitive. We’ve added support for DataFrame-based code generation that works across any Spark environment. We’ve also introduced in-prompt context-aware development that applies details from your conversations, working seamlessly with a new iterative development experience. This means you can refine your ETL jobs through natural follow-up questions—starting with a basic data pipeline and progressively adding transformations, filters, and business logic through conversation. These improvements are available through the Amazon Q chat experience on the AWS Management Console, and the Amazon SageMaker Unified Studio (preview) visual ETL and notebook interfaces.

The DataFrame code generation now extends beyond AWS Glue DynamicFrame to support a broader range of data processing scenarios. You can now generate data integration jobs for various data sources and destinations, including Amazon Simple Storage Service (Amazon S3) data lakes with popular file formats like CSV, JSON, and Parquet, as well as modern table formats such as Apache Hudi, Delta, and Apache Iceberg. Amazon Q can generate ETL jobs for connecting to over 20 different data sources, including relational databases like PostgreSQL, MySQL and Oracle; data warehouses like Amazon Redshift, Snowflake, and Google BigQuery; NoSQL databases like Amazon DynamoDB, MongoDB, and OpenSearch; tables defined in the AWS Glue Data Catalog; and custom user-supplied JDBC and Spark connectors. Your generated jobs can use a variety of data transformations, including filters, projections, unions, joins, and aggregations, giving you the flexibility to handle complex data processing requirements.

In this post, we discuss how Amazon Q data integration transforms ETL workflow development.

Improved capabilities of Amazon Q data integration

Previously, Amazon Q data integration only generated code with template values that required you to fill in the configurations such as connection properties for data source and data sink and the configurations for transforms manually. With in-prompt context awareness, you can now include this information in your natural language query, and Amazon Q data integration will automatically extract and incorporate it into the workflow. In addition, generative visual ETL in the SageMaker Unified Studio (preview) visual editor allows you to reiterate and refine your ETL workflow with new requirements, enabling incremental development.

Solution overview

This post describes the end-to-end user experiences to demonstrate how Amazon Q data integration and SageMaker Unified Studio (preview) simplify your data integration and data engineering tasks with the new enhancements, by building a low-code no-code (LCNC) ETL workflow that enables seamless data ingestion and transformation across multiple data sources.

We demonstrate how to do the following:

  • Connect to diverse data sources
  • Perform table joins
  • Apply custom filters
  • Export processed data to Amazon S3

The following diagram illustrates the architecture.

Using Amazon Q data integration with Amazon SageMaker Unified Studio (preview)

In the first example, we use Amazon SageMaker Unified Studio (preview) to develop a visual ETL workflow incrementally. This pipeline reads data from different Amazon S3 based Data Catalog tables, performs transformations on the data, and writes the transformed data back into an Amazon S3. We use the allevents_pipe and venue_pipe files from the TICKIT dataset to demonstrate this capability. The TICKIT dataset records sales activities on the fictional TICKIT website, where users can purchase and sell tickets online for different types of events such as sports games, shows, and concerts.

The process involves merging the allevents_pipe and venue_pipe files from the TICKIT dataset. Next, the merged data is filtered to include only a specific geographic region. Then the transformed output data is saved to Amazon S3 for further processing in future.

Data preparation

The two datasets are hosted as two Data Catalog tables, venue and event, in a project in Amazon SageMaker Unified Studio (preview), as shown in the following screenshots.

Data processing

To process the data, complete the following steps:

  1. On the Amazon SageMaker Unified Studio console, on the Build menu, choose Visual ETL flow.

An Amazon Q chat window will help you provide a description for the ETL flow to be built.

  1. For this post, enter the following text:
    Create a Glue ETL flow connect to 2 Glue catalog tables venue and event in my database glue_db_4fthqih3vvk1if, join the results on the venue’s venueid and event’s e_venueid, and write output to a S3 location.
    (The database name is generated with the project ID suffixed to the given database name automatically).
  2. Choose Submit.

An initial data integration flow will be generated as shown in the following screenshot to read from the two Data Catalog tables, join the results, and write to Amazon S3. We can see the join conditions are correctly inferred from our request from the join node configuration displayed.

Let’s add another filter transform based on the venue state as DC.

  1. Choose the plus sign and choose the Amazon Q icon to ask a follow-up question.
  2. Enter the instructions filter on venue state with condition as venuestate==‘DC’ after joining the results to modify the workflow.

The workflow is updated with a new filter transform.

Upon checking the S3 data target, we can see the S3 path is now a placeholder <s3-path> and the output format is Parquet.

  1. We can ask the following question in Amazon Q:
    update the s3 sink node to write to s3://xxx-testing-in-356769412531/output/ in CSV format
    in the same way to update the Amazon S3 data target.
  2. Choose Show script to see the generated code is DataFrame based, with all context in place from all of our conversation.
  3. Finally, we can preview the data to be written to the target S3 path. Note that the data is a joined result with only the venue state DC included.

With Amazon Q data integration with Amazon SageMaker Unified Studio (preview), an LCNC user can create the visual ETL workflow by providing prompts to Amazon Q and the context for data sources and transformations are preserved. Subsequently, Amazon Q also generated the DataFrame-based code for data engineers or more experienced users to use the automatic ETL generated code for scripting purposes.

Amazon Q data integration with Amazon SageMaker Unified Studio (preview) notebook

Amazon Q data integration is also available in the Amazon SageMaker Unified Studio (preview) notebook experience. You can add a new cell and enter your comment to describe what you want to achieve. After you press Tab and Enter, the recommended code is shown.

For example, we provide the same initial question:

Create a Glue ETL flow to connect to 2 Glue catalog tables venue and event in my database glue_db_4fthqih3vvk1if, join the results on the venue’s venueid and event’s e_venueid, and write output to a S3 location.

Similar to the Amazon Q chat experience, the code is recommended. If you press Tab, then the recommended code is chosen.

The following video provides a full demonstration of these two experiences in Amazon SageMaker Unified Studio (preview).

Using Amazon Q data integration with AWS Glue Studio

In this section, we walk through the steps to use Amazon Q data integration with AWS Glue Studio

Data preparation

The two datasets are hosted in two Amazon S3 based Data Catalog tables, event and venue, in the database glue_db, which we can query from Amazon Athena. The following screenshot shows an example of the venue table.

Data processing

To start using the AWS Glue code generation capability, use the Amazon Q icon on the AWS Glue Studio console. You can start authoring a new job, and ask Amazon Q the question to create the same workflow:

Create a Glue ETL flow connect to 2 Glue catalog tables venue and event in my database glue_db, join the results on the venue’s venueid and event’s e_venueid, and then filter on venue state with condition as venuestate=='DC' and write to s3://<s3-bucket>/<folder>/output/ in CSV format.

You can see the same code is generated with all configurations in place. With this response, you can learn and understand how you can author AWS Glue code for your needs. You can copy and paste the generated code to the script editor. After you configure an AWS Identity and Access Management (IAM) role on the job, save and run the job. When the job is complete, you can begin querying the data exported to Amazon S3.

After the job is complete, you can verify the joined data by checking the specified S3 path. The data is filtered by venue state as DC and is now ready for downstream workloads to process.

The following video provides a full demonstration of the experience with AWS Glue Studio.

Conclusion

In this post, we explored how Amazon Q data integration transforms ETL workflow development, making it more intuitive and time-efficient, with the latest enhancement of in-prompt context awareness to accurately generate a data integration flow with reduced hallucinations, and multi-turn chat capabilities to incrementally update the data integration flow, add new transforms and update DAG nodes. Whether you’re working with the console or other Spark environments in SageMaker Unified Studio (preview), these new capabilities can significantly reduce your development time and complexity.

To learn more, refer to Amazon Q data integration in AWS Glue.


About the Authors

Bo Li is a Senior Software Development Engineer on the AWS Glue team. He is devoted to designing and building end-to-end solutions to address customers’ data analytic and processing needs with cloud-based, data-intensive technologies.

Stuti Deshpande is a Big Data Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in big data, ETL, and analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.

Kartik Panjabi is a Software Development Manager on the AWS Glue team. His team builds generative AI features for the Data Integration and distributed system for data integration.

Shubham Mehta is a Senior Product Manager at AWS Analytics. He leads generative AI feature development across services such as AWS Glue, Amazon EMR, and Amazon MWAA, using AI/ML to simplify and enhance the experience of data practitioners building data applications on AWS.

Developing, testing, and deploying custom connectors for your data stores with AWS Glue

Post Syndicated from Bo Li original https://aws.amazon.com/blogs/big-data/developing-testing-and-deploying-custom-connectors-for-your-data-stores-with-aws-glue/

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue already integrates with various popular data stores such as the Amazon Redshift, RDS, MongoDB, and Amazon S3. Organizations continue to evolve and use a variety of data stores that best fit their applications and data requirements. We recently announced general availability of AWS Glue custom connectors, which makes it easy to discover and integrate with variety of additional data sources, such as SaaS applications and your custom data sources. With just a few clicks, you can search and select connectors from the AWS Marketplace and begin your data preparation workflow in minutes. We are also releasing a new framework to develop, validate, and deploy your own custom connectors (bring your own connectors / BYOC).

In this blog post, we go over three key aspects of AWS Glue custom connectors. First, we introduce the two mechanisms using which you can plug in a custom connector by either subscribing from AWS Marketplace or bring your own connector into Glue Spark jobs. Second, we describe the three interfaces based on Apache Spark DataSource, Amazon Athena Federated Query, and JDBC, which you can use to develop a custom connector with the released Glue Spark runtime.  Finally, we get deeper into the development process, and describe how Glue Spark runtime interfaces simplify data integration by offering powerful features that are built-in for Glue custom connectors. These features include job bookmarks for incremental loads of your data, at-source data filtering with SQL queries, partitioned execution for data parallelism, data type mapping, advanced Spark and built-in AWS Glue data transformations, integration with AWS Secrets Manager to securely store authentication credentials, AWS Glue Data Catalog for storing connections and table metadata. Glue custom connectors are also supported with AWS Glue Studio that enables visual authoring of your data integration jobs.

These data sources cover the following categories:

This post introduces two mechanisms to use custom connectors with AWS Glue Spark runtime and AWS Glue Studio console. First, we go over the user experience for seamless discovery and subscription to custom connectors developed by AWS Glue partners that are hosted on AWS Marketplace. Next, we go deeper into the five simple steps to develop and test your own connectors with AWS Glue Spark runtime, and deploy them into your production Apache Spark applications for ETL and analytics workloads that run on AWS Glue.

AWS Glue custom connectors: AWS Marketplace and BYOC

You can use an AWS Glue connector available on AWS Marketplace or bring your own connector (BYOC) and plug it into AWS Glue Spark runtime. This is in addition to the native connectors available with AWS Glue.

Connectors available on AWS Marketplace

As we make AWS Glue custom connectors generally available today, we have an ecosystem of Glue connectors listed on AWS Marketplace available from different AWS Glue partners, big data architects, and third-party developers. The following posts go into more detail on using some of these connectors for different use cases with AWS Glue:

BYOC connector example

Customers and application developers also need a method to develop connectors for custom data stores. The next section describes the end-to-end process to develop and test a custom connector using the AWS Glue Spark runtime library and interfaces locally.

After testing and validating, you can package and deploy the custom connector using the BYOC workflow in AWS Glue Studio. For instructions on deploying and using the Snowflake connector with AWS Glue jobs as a BYOC custom connector, see Performing data transformations using Snowflake and AWS Glue.

AWS Glue Spark runtime connector interfaces

AWS Glue Spark runtime offers three interfaces to plug in custom connectors built for existing frameworks: the Spark DataSource API, Amazon Athena Data Source API, or Java JDBC API. The following code snippets show how you can plug in these connectors into AWS Glue Spark runtime without any changes.

For connectors subscribed from AWS Marketplace, use the following code:

Datasource = glueContext.create_dynamic_frame.from_options(connection_type = "marketplace.spark|athena|jdbc", connection_options = {"dbTable":"Account","connectionName":"my-marketplace-connection"}, transformation_ctx = "DataSource0)

For custom connectors developed and deployed with AWS Glue, use the following code:

Datasource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.spark|athena|jdbc", connection_options = {"dbTable":"Account","connectionName":"my-custom-connection"}, transformation_ctx = "DataSource0")

The following table summarizes the interfaces you need to implement for connectivity with AWS Glue Spark runtime using the Spark DataSource API.

Interfaces Description
DataSourceV2 The base interface for Spark DataSource v2 API.
ReadSupport A mix-in interface for DataSourceV2 for the connector to provide data reading ability and scan the data from the data source.
DataSourceReader A data source reader that is created by ReadSupport to scan the data from this data source. It also supports reading actual schema and generating a list of InputPartition for parallel reads from Spark executors.
InputPartition Each InputPartition is responsible for creating a data reader to read data into one RDD partition. InputPartitions are serialized and sent to executors, then the reader is created on executors to do the actual reading.
InputPartitionReader Responsible for reading data into an RDD partition.

The following table summarizes the interfaces you need to implement for connectivity with AWS Glue Spark runtime using the Athena Data Source API.

Interfaces Description
MetadataHandler
doGetSplits Splits up the reads required to scan the requested batch of partitions.
doListSchemaNames Gets the list of schemas (databases) that this source contains.
doGetTable Gets a definition (such as field names, types, and descriptions) of a table.
doListTables Gets the list of tables that this source contains.
getPartitions Gets the partitions that must be read from the request table.
RecordHandler
doReadRecords Reads the row data associated with the provided split.

The following diagram shows the class structure for the three interfaces and their execution on Spark drivers to read metadata and Spark executors to read data from the underlying datasource. The classes shown in pink are the ones that need to be implemented as part of the connector. Classes shown in green are already implemented as part of the Glue Spark runtime.

Steps to develop a custom connector

In the following sections, we describe how to develop, test, and validate an AWS Glue custom connector. We also show how to deploy the connectors to AWS Glue jobs using the AWS Glue Studio console.

Implementing the solution includes the following 5 high-level steps:

  1. Download and install AWS Glue Spark runtime, and review sample connectors.
  2. Develop using the required connector interface.
  3. Build, test, and validate your connector locally.
  4. Package and deploy the connector on AWS Glue.
  5. Use AWS Glue Studio to author a Spark application with the connector.

Downloading and installing AWS Glue Spark runtime and reviewing sample connectors

The first step to developing a connector is to install the Glue Spark runtime from Maven and refer to AWS Glue sample connectors on AWS Glue GitHub repository.

Developing and testing using the required connector interface

As discussed earlier, you can develop AWS Glue custom connectors with one of the following interfaces:

  • Spark DataSource
  • Athena Federated Query
  • JDBC

In this section, we walk you through each interface.

Spark DataSource interface

We use a simple example to illustrate the development of an AWS Glue custom connector with the Spark DataSource interface. You can also find intermediate and complex examples for developing connectors with more functionality for different data sources.

This solution implements a DataSourceReader that returns predefined data as InputPartitions stored in-memory with a given schema. The following interfaces need to be implemented for DataSourceReader. The DataSourceReader implementation runs on the Spark driver and plans the execution of Spark executors reading the data in InputPartitions:

class Reader implements DataSourceReader {
        public StructType readSchema() { ... }
        
        public List<InputPartition<InternalRow>> planInputPartitions() { ... }
}

The InputPartitions are read in parallel by different Spark executors using the InputPartitionReader implementation, which returns the records in Spark’s InternalRow format. The InputPartitionReader is essentially implemented to return an iterator of the records scanned from the underlying data store. Refer the following code:

class SimpleInputPartitionReader implements InputPartitionReader<InternalRow> {
    public boolean next() { ... }

    public InternalRow get() { ... }

    public void close() throws IOException { ... }
}

The second connector example shows how to use an Amazon S3 client to read the data in CSV format from an S3 bucket and path supplied as reader options. The third connector example shows how to use a JDBC driver to read data from a MySQL source. It also shows how to push down a SQL query to filter records at source and authenticate with the user name and password supplied as reader options.

You can plug the connectors based on the Spark DataSource API into AWS Glue Spark runtime as follows. You need to supply the connection_type for custom.spark and an AWS Glue catalog connection containing the reader options, such as user name and password. AWS Glue Spark runtime automatically converts the data source into a Glue DynamicFrame. The following code is an example to plug in the Elasticsearch Spark connector:

Datasource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.spark", connection_options = {"path": "test", "es.nodes": "https://search-glue-etl-job-xxx.us-east-1.es.amazonaws.com","es.net.http.auth.user": "user","es.net.http.auth.pass": "pwd","es.port": "443","es.nodes.wan.only": "true" ,"connectionName":"my-custom-es-connection"}, transformation_ctx = "DataSource0")

AWS Glue Studio provides a visual ETL console that can also auto-generate the preceding code to construct a DynamicFrame for a deployed Spark connector (as described later in this post).

Athena Federated Query interface

AWS Glue Spark runtime also supports connectors developed with the Athena connector interface for federated queries. Similar to the Spark DataSource API, it requires implementation of two key handler interfaces: MetadataHandler and RecordHandler.

The MetadataHandler implementation runs on the Spark driver and contains the functions required to compute the schema, tables, and table partitions, and plan the actual scan by splitting the reads of individual partitions into different splits. See the following code:

public class MyMetadataHandler extends MetadataHandler{
       ListSchemasResponse doListSchemaNames(BlockAllocator allocator, ListSchemasRequest request) { … }

       ListTablesResponse doListTables(BlockAllocator allocator, ListTablesRequest request) { … }

       GetTableResponse doGetTable(BlockAllocator allocator, GetTableRequest request) { … }

       void getPartitions(BlockWriter blockWriter, GetTableLayoutRequest request, QueryStatusChecker queryStatusChecker) { … }

       GetSplitsResponse doGetSplits(BlockAllocator allocator, GetSplitsRequest request) { … }
}

The RecordHandler implements the reader to scan the data from the underlying data store associated with the split contained in the ReadRecordsRequest structure.

AWS Glue custom connectors uses the Athena RecordHandler interface, but it do not need the BlockSpiller implementation or use AWS Lambda to read the data from the underlying data store. Instead, the implementation directly runs inline within each Spark executor to return the records as Apache Arrow column vector batches. Refer the following code:

public class MyRecordHandlerextends RecordHandler{

void readWithConstraint(ConstraintEvaluator constraints, BlockSpiller spiller, ReadRecordsRequest recordsRequest, QueryStatusChecker queryStatusChecker){…}
}

AWS Glue Spark runtime can convert records returned by plugging in an Athena connector to an AWS Glue DynamicFrame as follows:

Datasource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.athena", connection_options = {"tableName":"table","schemaName":"schema","connectionName":"my-custom-athena-connection"}, transformation_ctx = "DataSource0")

JDBC interface

AWS Glue Spark runtime also allows you to plug in any connector compliant with the JDBC interface. It allows you to pass in any connection option available with the JDBC connector as follows:

Datasource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"dbTable":"Account","connectionName":"my-custom-jdbc-connection"}, transformation_ctx = "DataSource0")

Advanced ETL and analytics with AWS Glue

AWS Glue Spark runtime also provides different features supported out-of-the-box with the custom connectors to enable advanced extract, data transformations, and load.

AWS Glue Studio for visual authoring of ETL jobs

Data type mapping

You can type cast the columns while reading them from the underlying data store itself. For example, a dataTypeMapping of {"INTEGER":"STRING"} casts all integer columns to string while parsing the records and constructing the DynamicFrame. This also helps you cast columns to types of your choice. Refer the following code:

DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"dataTypeMapping":{"INTEGER":"STRING"}", connectionName":"test-connection-snowflake-jdbc"}, transformation_ctx = "DataSource0")

Partitioning for parallel reads

AWS Glue allows you to read data in parallel from the data store by partitioning it on a column by specifying the partitionColumn, lowerBound, upperBound, and numPartitions. This allows you to use data parallelism and multiple Spark executors allocated for the Spark application. Refer the following code, which reads data from Snowflake using 4 Spark executors in parallel. Data is partitioned across executors uniformly along the id column in the range [0, 200]:

DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"upperBound":"200","numPartitions":"4","partitionColumn":"id","lowerBound":"0","connectionName":"my-connection-snowflake"}, transformation_ctx = "DataSource0")

Glue Data Catalog connections

You can encapsulate all your connection properties with Glue Data Catalog connections and supply the connection name as follows. Integration with Glue Data Catalog connections allows you to use the same connection properties across multiple calls in a single Spark application or across different applications. See the following code:

DataSource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"connectionName":"my-connection-snowflake"}, transformation_ctx = "DataSource0")

Secrets Manager for credentials

The Data Catalog connection can also contain a secretId corresponding to a secret stored in AWS Secrets Manager that can be used to securely gather authentication and credentials information at runtime. For more details on using a secretId on the AWS Glue Studio console, see Adding connectors to AWS Glue Studio. secretId can also be specified within the ETL script as follows.

DataSource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"connectionName":"my-connection-snowflake", "secretId"-> "my-secret-id"}, transformation_ctx = "DataSource0")

Secret Id can be used to store credentials for different authentication mechanisms that your connector can support such as username/password, access keys, and OAuth.

SQL queries at source: Filtering with row predicates and column projections

AWS Glue Spark runtime allows you to push down SQL queries to filter data at source with row predicates and column projections. This allows you to load filtered data faster from data stores that support pushdowns. An example SQL query pushed down to a JDBC data source is SELECT id, name, department FROM department WHERE id < 200. Refer the following code:

DataSource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"query":"SELECT id, name, department FROM department WHERE id < 200","connectionName":"my-connection-snowflake "}, transformation_ctx = "DataSource0")

Job bookmarks

AWS Glue job bookmarks allows for incremental loading of data from JDBC sources. It keeps track of the last processed record from the data store and processes new data records in subsequent AWS Glue job runs. Job bookmarks use the primary key as the default column as the bookmark key if it increases or decreases sequentially. Refer the following code that uses a transformation_ctx with job bookmarks enabled for the job:

DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"connectionName":"test-connection-snowflake-jdbc"}, transformation_ctx = "DataSource0")

AWS Glue transformations

AWS Glue offers more than 35 commonly used data transformation operations with DynamicFrames and Spark DataFrames. These transforms allow you to get insights from your data and prepare it for further analytics using hundreds of available Spark SQL functions. These transformations include popular functions for schema manipulation, projecting columns, and performing joins across different data sources; transforming data with map, split, and explode; flattening nested datasets with relationalize and unnest; group and aggregate records; and run arbitrary SQL on datasets.

VPC support for networking

AWS Glue jobs allow you to securely connect to your data stores within a private VPC subnet. You can also enable a NAT (Network Address Translation) gateway within a VPC to access both VPC resources and public internet.

Building, testing, and validating your connector locally

After developing the connector for your favorite data store with the interface of your choice, you can follow the instructions to build the connector using Maven by doing a maven install. This builds the connector and installs the resulting JAR into your local Maven repository. You can now include this JAR in the class path of your IDE or AWS Glue Spark runtime downloaded from Maven.

After you build and import the JAR, you can test it locally by plugging it in AWS Glue Spark runtime and writing a validation test. We provide sample validation tests in the AWS Glue’s GitHub repository. These cover several scenarios for both local testing and validation on the AWS Glue job system. The following table lists the validation tests, the functionality they test, and the associated interfaces.

Test Name Description JDBC Spark Athena
DataSourceTest Tests connector connectivity and reading functionality. x x x
ReadWriteTest Tests reading and writing end-to-end workflow. x x x
CatalogConnectionTest Tests catalog connection integration. x x x
DataSchemaTest Tests data schema from reading with the connector. x x x
SecretsManagerTest Tests Secrets Manager integration. x x
DataSinkTest Tests connector connectivity and writing functionality x x
ColumnPartitioningTest Tests connector column partitioning functionality. x
FilterPredicateTest Tests connector filter predicate functionality. x
JDBCUrlTest Tests connector extra parameters for JDBC Url functionality. x
DbtableQueryTest Tests connector dbTable and query option functionality. x
DataTypeMappingTest Tests connector custom data type mapping functionality. x

Functionality such as AWS Glue job bookmarks that allow incremental loads can be tested on the AWS Glue job system only. We also provide a Python script to run all tests together as a suite on the AWS Glue job system.

Packaging and deploying the connector on AWS Glue

We now discuss how you can package your connector and deploy it on AWS Glue using the BYOC workflow:

  1. Package the custom connector as a JAR and upload the JAR file to an Amaon S3 bucket in your account.
  2. Follow the flow to create a custom connector referencing the JAR in Amazon S3 from AWS Glue Studio.
  3. Instantiate a connection for that connector and create an AWS Glue job using it.

For step-by-step instructions on the BYOC workflow, see Creating custom connectors.

Alternatively, we also provide the scripts and instructions for you to share the connector publicly using AWS Marketplace for a price or free. For instructions on subscribing to the connector, see Subscribing to AWS Marketplace connectors.

Using AWS Glue Studio to author a Spark application

After you create a connection for using a BYOC or AWS Marketplace – AWS Glue connector, you can follow the instructions to visually author a Spark ETL application with AWS Glue Studio. These instructions are available here for Job Authoring with custom connectors. Following are screenshots from AWS Glue Studio:

Connectors on AWS Marketplace

Connectors on AWS Marketplace

Visually author Glue jobs using connectors with AWS Glue Studio

Step 1 – Select a connector

Following are screenshots from AWS Glue Studio:

Step 2 – Visually author the job using the associated connection

Conclusion

You can use two different mechanisms to use custom connectors with AWS Glue Spark runtime and AWS Glue Studio console. In this post, we discussed the user experience for seamless discovery and subscription to custom connectors, and walked you through developing and testing your own connectors with AWS Glue Spark runtime, and deploying them into your production Apache Spark applications for ETL and analytics workloads that run on AWS Glue.

Build a custom connector yourself or try one on AWS Marketplace with AWS Glue Studio.

If you would like to partner or add a new Glue connector to AWS Marketplace, please reach out to us at [email protected]

Resources

For additional resources, see the following:


About the Authors

Bo Li is a software engineer in AWS Glue and devoted to designing and building end-to-end solutions to address customer’s data analytic and processing needs with cloud-based data-intensive technologies.

 

 

 

Yubo Xu is a Sofware Development Engineer on the AWS Glue team. His main focus is to improve the stability and efficiency of Spark runtime for AWS Glue and the easiness to connect to various data sources. Outside of work, he enjoys reading books and hiking the trails in the Bay area with his dog, Luffy, a one-year old Shiba Inu.

 

 

Xiaoxi Liu is a software engineer at AWS Glue team. Her passion is building scalable distributed systems for efficiently managing big data on cloud and her concentrations are distributed system, big data and cloud computing

 

 

Mohit Saxena is a Software Development Manager at AWS Glue. His team works on Glue’s Spark runtime to enable new customer use cases for efficiently managing data lakes on AWS and optimize Apache Spark for performance and reliability.