All posts by Ray Wang

Seamlessly Integrate Data on Google BigQuery and ClickHouse Cloud with AWS Glue

Post Syndicated from Ray Wang original https://aws.amazon.com/blogs/big-data/seamlessly-integrate-data-on-google-bigquery-and-clickhouse-cloud-with-aws-glue/

Migrating from Google Cloud’s BigQuery to ClickHouse Cloud on AWS allows businesses to leverage the speed and efficiency of ClickHouse for real-time analytics while benefiting from AWS’s scalable and secure environment. This article provides a comprehensive guide to executing a direct data migration using AWS Glue ETL, highlighting the advantages and best practices for a seamless transition.

AWS Glue ETL enables organizations to discover, prepare, and integrate data at scale without the burden of managing infrastructure. With its built-in connectivity, Glue can seamlessly read data from Google Cloud’s BigQuery and write it to ClickHouse Cloud on AWS, removing the need for custom connectors or complex integration scripts. Beyond connectivity, Glue also provides advanced capabilities such as a visual ETL authoring interface, automated job scheduling, and serverless scaling, allowing teams to design, monitor, and manage their pipelines more efficiently. Together, these features simplify data integration, reduce latency, and deliver significant cost savings, enabling faster and more reliable migrations.

Prerequisites

Before using AWS Glue to integrate data into ClickHouse Cloud, you must first set up the ClickHouse environment on AWS. This includes creating and configuring your ClickHouse Cloud on AWS, making sure network access and security groups are properly defined, and verifying that the cluster endpoint is accessible. Once the ClickHouse environment is ready, you can leverage the AWS Glue built-in connector to seamlessly write data into ClickHouse Cloud from sources such as Google Cloud BigQuery. You can follow the next section to complete the setup.

  1. Set up ClickHouse Cloud on AWS
    1. Follow the ClickHouse official website to set up environment (remember to allow remote access in the config file if using Clickhouse OSS)
      https://clickhouse.com/docs/get-started/quick-start
  2. Subscribe the ClickHouse Glue marketplace connector
    1. Open Glue Connectors and choose Go to AWS Marketplace
    2. On the list of AWS Glue marketplace connectors, enter ClickHouse in the search bar. Then choose ClickHouse Connector for AWS Glue
    3. Choose View purchase options on the right top of the view
    4. Review Terms and Conditions and choose Accept Terms
    5. Choose Continue to Configuration once it’s enabled
    6. On Follow the vendor’s instructions part in the connector instructions as below, choose the connector enabling link at step 3

Configure AWS Glue ETL Job for ClickHouse Integration

AWS Glue enables direct migration by connecting with ClickHouse Cloud on AWS through built-in connectors, allowing for seamless ETL operations. Within the Glue console, users can configure jobs to read data from S3 and write it directly to ClickHouse Cloud. Using AWS Glue Data Catalog, data in S3 can be indexed for efficient processing, while Glue’s PySpark support allows for complex data transformations, including data type conversions, to support compatibility with ClickHouse’s schema.

  1. Open AWS Glue in the AWS Management Console
    1. Navigate to Data Catalog and Connections
    2. Create a new connection
  2. Configure BigQuery Connection in Glue
    1. Prepare a Google Cloud BigQuery Environment
    2. Create and Store Google Cloud Service Account Key (JSON format) in AWS Secret Manager, you can find the details in BigQuery connections.
    3. The JSON Format content example is as following:
      {
        "type": "service_account",
        "project_id": "h*********g0",
        "private_key_id": "cc***************81",
        "private_key": "-----BEGIN PRIVATE KEY-----\nMI***zEc=\n-----END PRIVATE KEY-----\n",
        "client_email": "clickhouse-sa@h*********g0.iam.gserviceaccount.com",
        "client_id": "1*********8",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/clickhouse-sa%40h*********g0.iam.gserviceaccount.com",
        "universe_domain": "googleapis.com"
      }

      • type: service_account.
      • project_id: The ID of the GCP project.
      • private_key_id: A unique ID for the private key within the file.
      • private_key: The actual private key.
      • client_email: The email address of the service account.
      • client_id: A unique client ID associated with the service account.
      • auth_uri, token_uri, auth_provider_x509_cert_url
      • client_x509_cert_url: URLs for authentication and token exchange with Google’s identity and access management systems.
      • universe_domain: The domain name of GCP, googleapis.com
    4. Create Google BigQuery Connection in AWS Glue
    5. Grant the IAM role associated with your AWS Glue job permission for S3, Secret Manager, Glue services, and AmazonEC2ContainerRegistryReadOnly for accessing connectors purchased from AWS Marketplace (reference doc)
  3. Create ClickHouse connection in AWS Glue
    1. Enter clickhouse-connection as its connection name
    2. Choose Create connection and activate connector
  4. Create a Glue job
    1. On the Connectors view as below, select clickhouse-connection and choose Create job
    2. Enter bq_to_clickhouse as its job name and configure gc_connector_role as its IAM Role
    3. Configure BigQuery connection and clickhouse-connection to the Connection property
    4. Choose the Script tab and Edit script. Then choose Confirm on the Edit script popup view.
    5. Copy and paste the following code onto the script editor which can be referred from clickhouse official doc
    6. The source code is as following:
      import sys
      from pyspark.sql import SparkSession
      from awsglue.context import GlueContext
      from awsglue.job import Job
      from awsglue.utils import getResolvedOptions
      
      args = getResolvedOptions(sys.argv, ['JOB_NAME'])
      spark = SparkSession.builder.getOrCreate()
      glueContext = GlueContext(spark.sparkContext)
      job = Job(glueContext)
      job.init(args['JOB_NAME'], args)
      
      connection_options = {
          "connectionName": "Bigquery connection",
          "parentProject": "YOUR_GCP_PROJECT_ID",
          "query": "SELECT * FROM `YOUR_GCP_PROJECT_ID.bq_test_dataset.bq_test_table`",
          "viewsEnabled": "true",
          "materializationDataset": "bq_test_dataset"
      }
      jdbc_url = " jdbc:clickhouse://YOUR_CLICKHOUSE_CONNECTION.us-east-1.aws.clickhouse.cloud:8443/clickhouse_database?ssl=true "
      username = "default"
      password = "YOUR_PASSWORD"
      query = "select * from clickhouse_database.clickhouse_test_table"
      # Add this before writing to test connection
      try:
          # Read from BigQuery with Glue Connection
          print("Reading data from BigQuery...")
          GoogleBigQuery_node1742453400261 = glueContext.create_dynamic_frame.from_options(
              connection_type="bigquery",
              connection_options=connection_options,
              transformation_ctx="GoogleBigQuery_node1742453400261"
          )
          # Convert to DataFrame
          bq_df = GoogleBigQuery_node1742453400261.toDF()
          print("Show data from BigQuery:")
          bq_df.show()
          
          # Write BigQuery Data to Clickhouse with JDBC
          bq_df.write \
          .format("jdbc") \
          .option("driver", 'com.clickhouse.jdbc.ClickHouseDriver') \
          .option("url", jdbc_url) \
          .option("user", username) \
          .option("password", password) \
          .option("dbtable", "clickhouse_test_table") \
          .mode("append") \
          .save()
          
          print("Write BigQuery Data to ClickHouse successfully")
          
          # Read from Clickhouse with JDBC
          reaf_df = (spark.read.format("jdbc")
          .option("driver", 'com.clickhouse.jdbc.ClickHouseDriver')
          .option("url", jdbc_url)
          .option("user", username)
          .option("password", password)
          .option("query", query)
          .option("ssl", "true")
          .load())
          
          print("Show Data from ClickHouse:")
          reaf_df.show()
          
      except Exception as e:
          print(f"ClickHouse connection test failed: {str(e)}")
          raise e
      finally:
          job.commit()

    7. Choose Save and Run on the right top of the current view

Testing and Validation

Testing is crucial to verify data accuracy and performance in the new environment. After the migration completes, run data integrity checks to confirm record counts and data quality in ClickHouse Cloud. Schema validation is essential, as each data field must align correctly with ClickHouse’s format. Running performance benchmarks, such as sample queries, will help verify that ClickHouse’s setup delivers the desired speed and efficiency gains.

  1. The Schema and Data in source BigQuery and destination Clickhouse

  2. AWS Glue output logs

Clean Up

After completing the migration, it’s important to clean up unused resources—such as BigQuery for sample data import and database resources in ClickHouse Cloud—to avoid unnecessary costs. Regarding IAM permissions, adhering to the principle of least privilege is advisable. This involves granting users and roles only the permissions necessary for their tasks and removing unnecessary permissions when they are no longer required. This approach enhances security by minimizing potential threat surfaces. Additionally, reviewing AWS Glue job costs and configurations can help identify optimization opportunities for future migrations. Monitoring overall costs and analyzing usage can reveal areas where code or configuration improvements may lead to cost savings.

Conclusion

AWS Glue ETL offers a robust and user-friendly solution for migrating data from BigQuery to ClickHouse Cloud on AWS. By utilizing Glue’s serverless architecture, organizations can perform data migrations that are efficient, secure, and cost-effective. The direct integration with ClickHouse streamlines data transfer, supporting high performance and flexibility. This migration approach is particularly well-suited for companies looking to enhance their real-time analytics capabilities on AWS.


About the Authors

Ray Wang

Ray Wang

Ray is a Senior Solutions Architect at AWS. With 12+ years of experience in the IT industry, Ray is dedicated to building modern solutions on the cloud, especially in NoSQL, big data, machine learning, and Generative AI. As a hungry go-getter, he passed all 12 AWS certificates to make his technical field not only deep but wide. He loves to read and watch sci-fi movies in his spare time.

Robert Chung

Robert Chung

Robert is a Solutions Architect at AWS with expertise across Infrastructure, Data, AI, and Modernization technologies. He has supported numerous financial services customers in driving cloud-native transformation, advancing data analytics, and accelerating mainframe modernization. His experience also extends to modern AI-DLC practices, enabling enterprises to innovate faster. With this background, Robert is well-equipped to address complex enterprise challenges and deliver impactful solutions.

Tomohiro Tanaka

Tomohiro Tanaka

Tomohiro is a Senior Cloud Support Engineer at Amazon Web Services (AWS). He’s passionate about helping customers use Apache Iceberg for their data lakes on AWS. In his free time, he enjoys a coffee break with his colleagues and making coffee at home.

Stanley Chukwuemeke

Stanley Chukwuemeke

Stanley is a Senior Partner Solutions Architect at AWS. He works with AWS technology partners to grow their business by creating joint go-to-market solutions using AWS data, analytics and AI services. He’s worked with data most of his career and passionate about database modernization and cloud adoption strategy to help drive enterprise modernization initiatives across industries.

Catalog and analyze Application Load Balancer logs more efficiently with AWS Glue custom classifiers and Amazon Athena

Post Syndicated from Ray Wang original https://aws.amazon.com/blogs/big-data/catalog-and-analyze-application-load-balancer-logs-more-efficiently-with-aws-glue-custom-classifiers-and-amazon-athena/

You can query Application Load Balancer (ALB) access logs for various purposes, such as analyzing traffic distribution and patterns. You can also easily use Amazon Athena to create a table and query against the ALB access logs on Amazon Simple Storage Service (Amazon S3). (For more information, see How do I analyze my Application Load Balancer access logs using Amazon Athena? and Querying Application Load Balancer Logs.) All queries are run against the whole table because it doesn’t define any partitions. If you have several years of ALB logs, you may want to use a partitioned table instead for better query performance and cost control. In fact, partitioning data is one of the Top 10 performance tuning tips for Athena.

However, because ALB log files aren’t stored in a Hive-style prefix (such as /year=2021/), the process of creating thousands of partitions using ALTER TABLE ADD PARTITION in Athena is cumbersome. This post shows a way to create and schedule an AWS Glue crawler with a Grok custom classifier that infers the schema of all ALB log files under the specified Amazon S3 prefix and populates the partition metadata (year, month, and day) automatically to the AWS Glue Data Catalog.

Prerequisites

To follow along with this post, complete the following prerequisites:

  1. Enable access logging of the ALBs, and have the files already ingested in the specified S3 bucket.
  2. Set up the Athena query result location. For more information, see Working with Query Results, Output Files, and Query History.

Solution overview

The following diagram illustrates the solution architecture.

To implement this solution, we complete the following steps:

  1. Prepare the Grok pattern for our ALB logs, and cross-check with a Grok debugger.
  2. Create an AWS Glue crawler with a Grok custom classifier.
  3. Run the crawler to prepare a table with partitions in the Data Catalog.
  4. Analyze the partitioned data using Athena and compare query speed vs. a non-partitioned table.

Prepare the Grok pattern for our ALB logs

As a preliminary step, locate the access log files on the Amazon S3 console, and manually inspect the files to observe the format and syntax. To allow an AWS Glue crawler to recognize the pattern, we need to use a Grok pattern to match against an expression and map specific parts into the corresponding fields. Approximately 100 sample Grok patterns are available in the Logstash Plugins GitHub, and we can write our own custom pattern if it’s not listed.

The following the basic syntax format for a Grok pattern %{PATTERN:FieldName}

The following is an example of an ALB access log:

http 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 10.0.0.1:80 0.000 0.001 0.000 200 200 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337262-36d228ad5d99923122bbe354" "-" "-" 0 2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.0.1:80" "200" "-" "-"
https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" "arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012" 1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-" "10.0.0.1:80" "200" "-" "-"
h2 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 10.0.1.252:48160 10.0.0.66:9000 0.000 0.002 0.000 200 200 5 257 "GET https://10.0.2.105:773/ HTTP/2.0" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337327-72bd00b0343d75b906739c42" "-" "-" 1 2018-07-02T22:22:48.364000Z "redirect" "https://example.com:80/" "-" "10.0.0.66:9000" "200" "-" "-"
ws 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 10.0.0.140:40914 10.0.1.192:8010 0.001 0.003 0.000 101 101 218 587 "GET http://10.0.0.30:80/ HTTP/1.1" "-" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 1 2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.1.192:8010" "101" "-" "-"
wss 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 10.0.0.140:44244 10.0.0.171:8010 0.000 0.001 0.000 101 101 218 786 "GET https://10.0.0.30:443/ HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 1 2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.0.171:8010" "101" "-" "-"
http 2018-11-30T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 - 0.000 0.001 0.000 200 200 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 0 2018-11-30T22:22:48.364000Z "forward" "-" "-" "-" "-" "-" "-"
http 2018-11-30T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 - 0.000 0.001 0.000 502 - 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 0 2018-11-30T22:22:48.364000Z "forward" "-" "LambdaInvalidResponse" "-" "-" "-" "-"

To map the first field, the Grok pattern might look like the following code:

%{DATA:type}\s

The pattern includes the following components:

  • DATA maps to .*?
  • type is the column name
  • \s is the whitespace character

To map the second field, the Grok pattern might look like the following:

%{TIMESTAMP_ISO8601:time}\s

This pattern has the following elements:

  • TIMESTAMP_ISO8601 maps to %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
  • time is the column name
  • \s is the whitespace character

When writing Grok patterns, we should also consider corner cases. For example, the following code is a normal case:

%{BASE10NUM:target_processing_time}\s

But when considering the possibility of null value, we should replace the pattern with the following:

%{DATA:target_processing_time}\s

When our Grok pattern is ready, we can test the Grok pattern with sample input using a third-party Grok debugger. The following pattern is a good start, but always remember to test it with the actual ALB logs.

%{DATA:type}\s+%{TIMESTAMP_ISO8601:time}\s+%{DATA:elb}\s+%{DATA:client}\s+%{DATA:target}\s+%{BASE10NUM:request_processing_time}\s+%{DATA:target_processing_time}\s+%{BASE10NUM:response_processing_time}\s+%{BASE10NUM:elb_status_code}\s+%{DATA:target_status_code}\s+%{BASE10NUM:received_bytes}\s+%{BASE10NUM:sent_bytes}\s+\"%{DATA:request}\"\s+\"%{DATA:user_agent}\"\s+%{DATA:ssl_cipher}\s+%{DATA:ssl_protocol}\s+%{DATA:target_group_arn}\s+\"%{DATA:trace_id}\"\s+\"%{DATA:domain_name}\"\s+\"%{DATA:chosen_cert_arn}\"\s+%{DATA:matched_rule_priority}\s+%{TIMESTAMP_ISO8601:request_creation_time}\s+\"%{DATA:actions_executed}\"\s+\"%{DATA:redirect_url}\"\s+\"%{DATA:error_reason}\"\s+\"%{DATA:target_list}\"\s+\"%{DATA:target_status_code_list}\"\s+\"%{DATA:classification}\"\s+\"%{DATA:classification_reason}\"

Keep in mind that when you copy the Grok pattern from your browser, in some cases there are extra spaces in the end of the lines. Make sure to remove these extra spaces.

Create an AWS Glue crawler with a Grok custom classifier

Before you create your crawler, you first create a custom classifier. Complete the following steps:

  1. On the AWS Glue console, under Crawler, choose Classifiers.
  2. Choose Add classifier.
  3. For Classifier name, enter alb-logs-classifier.
  4. For Classifier type¸ select Grok.
  5. For Classification, enter alb-logs.
  6. For Grok pattern, enter the pattern from the previous section.
  7. Choose Create.

Now you can create your crawler.

  1. Choose Crawlers in the navigation pane.
  2. Choose Add crawler.
  3. For Crawler name, enter alb-access-log-crawler.
  4. For Selected classifiers, enter alb-logs-classifier.

  5. Choose Next.
  6. For Crawler source type, select Data stores.
  7. For Repeat crawls of S3 data stores, select Crawl new folders only.
  8. Choose Next.
  9. For Choose a data store, choose S3.
  10. For Crawl data in, select Specified path in my account.
  11. For Include path, enter the path to your ALB logs (for example, s3://alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/).
  12. Choose Next.
  13. When prompted to add another data store, select No and choose Next.
  14. Select Create an IAM role, and give it a name such as AWSGlueServiceRole-alb-logs-crawler.
  15. For Frequency, choose Daily.
  16. Indicate your start hour and minute.
  17. Choose Next.
  18. For Database, enter elb-access-log-db.
  19. For Prefix added to tables, enter alb_logs_.
  20. Expand Configuration options.
  21. Select Update all new and existing partitions with metadata from the table.
  22. Keep the other options at their default.
  23. Choose Next.
  24. Review your settings and choose Finish.

Run your AWS Glue crawler

Next, we run our crawler to prepare a table with partitions in the Data Catalog.

  1. On the AWS Glue console, choose Crawlers.
  2. Select the crawler we just created.
  3. Choose Run crawler.

When the crawler is complete, you receive a notification indicating that a table has been created.

Next, we review and edit the schema.

  1. Under Databases, choose Tables.
  2. Choose the table alb_logs_<region>.
  3. Cross-check the column name and corresponding data type.

The table has three columns: partiion_0, partition_1, and partition_2.

  1. Choose Edit schema.
  2. Rename the columns year, month, and day.
  3. Choose Save.

Analyze the data using Athena

Next, we analyze our data by querying the access logs. We compare the query speed between the following tables:

  • Non-partitioned table – All data is treated as a single table
  • Partitioned table – Data is partitioned by year, month, and day

Query the non-partitioned table

With the non-partitioned table, if we want to query access logs on a specific date, we have to write the WHERE clause using the LIKE operator because the data column was interpreted as a string. See the following code:

SELECT COUNT(1) FROM "elb-access-log-db"."alb_logs" WHERE type='h2' AND time LIKE '2020-12-29%';

The query takes 5.25 seconds to complete, with 3.15 MB data scanned.

Query the partitioned table

With the year, month, and day columns as partitions, we can use the following statement to query access logs on the same day:

SELECT COUNT(1) FROM "elb-access-log-db"."alb_logs" WHERE type='h2' AND year=2020 AND month=12 AND day=29;

This time the query takes only 1.89 seconds to complete, with 25.72 KB data scanned.

This query is faster and costs less (because less data is scanned) due to partition pruning.

Clean up

To avoid incurring future charges, delete the resources created in the Data Catalog, and delete the AWS Glue crawler.

Summary

In this post, we illustrated how to create an AWS Glue crawler that populates ALB logs metadata in the AWS Glue Data Catalog automatically with partitions by year, month, and day. With partition pruning, we can improve query performance and associated costs in Athena.

If you have questions or suggestions, please leave a comment.


About the Authors

Ray Wang is a Solutions Architect at AWS. With 8 years of experience in the IT industry, Ray is dedicated to building modern solutions on the cloud, especially in big data and machine learning. As a hungry go-getter, he passed all 12 AWS certificates to make his technical field not only deep but wide. He loves to read and watch sci-fi movies in his spare time.

Corvus Lee is a Data Lab Solutions Architect at AWS. He enjoys all kinds of data-related discussions with customers, from high-level like white boarding a data lake architecture, to the details of data modeling, writing Python/Spark code for data processing, and more.