All posts by Navnit Shukla

Process and analyze highly nested and large XML files using AWS Glue and Amazon Athena

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/process-and-analyze-highly-nested-and-large-xml-files-using-aws-glue-and-amazon-athena/

In today’s digital age, data is at the heart of every organization’s success. One of the most commonly used formats for exchanging data is XML. Analyzing XML files is crucial for several reasons. Firstly, XML files are used in many industries, including finance, healthcare, and government. Analyzing XML files can help organizations gain insights into their data, allowing them to make better decisions and improve their operations. Analyzing XML files can also help in data integration, because many applications and systems use XML as a standard data format. By analyzing XML files, organizations can easily integrate data from different sources and ensure consistency across their systems. However, XML files contain semi-structured, highly nested data, making it difficult to access and analyze information, especially if the file is large and has a complex, highly nested schema.

XML files are well-suited for applications, but they may not be optimal for analytics engines. In order to enhance query performance and enable easy access in downstream analytics engines such as Amazon Athena, it’s crucial to preprocess XML files into a columnar format like Parquet. This transformation allows for improved efficiency and usability in analytics workflows. In this post, we show how to process XML data using AWS Glue and Athena.

Solution overview

We explore two distinct techniques that can streamline your XML file processing workflow:

  • Technique 1: Use an AWS Glue crawler and the AWS Glue visual editor – You can use the AWS Glue user interface in conjunction with a crawler to define the table structure for your XML files. This approach provides a user-friendly interface and is particularly suitable for individuals who prefer a graphical approach to managing their data.
  • Technique 2: Use AWS Glue DynamicFrames with inferred and fixed schemas – The crawler has a limitation when it comes to processing XML files whose individual rows are larger than 1 MB. To overcome this restriction, we use an AWS Glue notebook to construct AWS Glue DynamicFrames, utilizing both inferred and fixed schemas. This method ensures efficient handling of XML files with rows exceeding 1 MB in size.

In both approaches, our ultimate goal is to convert XML files into Apache Parquet format, making them readily available for querying using Athena. With these techniques, you can enhance the processing speed and accessibility of your XML data, enabling you to derive valuable insights with ease.

Prerequisites

Before you begin this tutorial, complete the following prerequisites (these apply to both techniques):

  1. Download the XML files technique1.xml and technique2.xml.
  2. Upload the files to an Amazon Simple Storage Service (Amazon S3) bucket. You can upload them to the same S3 bucket in different folders or to different S3 buckets.
  3. Create an AWS Identity and Access Management (IAM) role for your ETL job or notebook as instructed in Set up IAM permissions for AWS Glue Studio.
  4. Add an inline policy to your role with the iam:PassRole action:
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": ["iam:PassRole"],
        "Effect": "Allow",
        "Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
        "Condition": {
          "StringLike": {
            "iam:PassedToService": ["glue.amazonaws.com"]
          }
        }
      }
    ]
  }
  5. Add a permissions policy to the role with access to your S3 bucket.
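If you prefer to script this prerequisite, the following is a minimal boto3 sketch that attaches an S3 access policy to the role. The role name, policy name, and bucket name are placeholders, so adjust them to match your environment.

import json

import boto3

iam = boto3.client("iam")

# Grant the AWS Glue role read/write access to the bucket that holds the XML files
iam.put_role_policy(
    RoleName="AWSGlueServiceRoleBlog",  # placeholder: the role created in step 3
    PolicyName="blog-glue-xml-s3-access",  # placeholder policy name
    PolicyDocument=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                    "Resource": [
                        "arn:aws:s3:::YOUR_BUCKET_NAME",
                        "arn:aws:s3:::YOUR_BUCKET_NAME/*",
                    ],
                }
            ],
        }
    ),
)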

Now that we’re done with the prerequisites, let’s move on to implementing the first technique.

Technique 1: Use an AWS Glue crawler and the visual editor

The following diagram illustrates the simple architecture that you can use to implement the solution.

Processing and Analyzing XML file using AWS Glue and Amazon Athena

To analyze XML files stored in Amazon S3 using AWS Glue and Athena, we complete the following high-level steps:

  1. Create an AWS Glue crawler to extract XML metadata and create a table in the AWS Glue Data Catalog.
  2. Process and transform XML data into a format (like Parquet) suitable for Athena using an AWS Glue extract, transform, and load (ETL) job.
  3. Set up and run an AWS Glue job via the AWS Glue console or the AWS Command Line Interface (AWS CLI).
  4. Use the processed data (in Parquet format) with Athena tables, enabling SQL queries.
  5. Use the user-friendly interface in Athena to analyze the XML data with SQL queries on your data stored in Amazon S3.

This architecture is a scalable, cost-effective solution for analyzing XML data on Amazon S3 using AWS Glue and Athena. You can analyze large datasets without complex infrastructure management.

We use the AWS Glue crawler to extract XML file metadata. You can choose the default AWS Glue classifier for general-purpose XML classification. It automatically detects XML data structure and schema, which is useful for common formats.

We also use a custom XML classifier in this solution. It’s designed for specific XML schemas or formats, allowing precise metadata extraction. This is ideal for non-standard XML formats or when you need detailed control over classification. A custom classifier ensures only necessary metadata is extracted, simplifying downstream processing and analysis tasks. This approach optimizes the use of your XML files.

The following screenshot shows an example of an XML file with tags.

Create a custom classifier

In this step, you create a custom AWS Glue classifier to extract metadata from an XML file. Complete the following steps:

  1. On the AWS Glue console, under Crawlers in the navigation pane, choose Classifiers.
  2. Choose Add classifier.
  3. Select XML as the classifier type.
  4. Enter a name for the classifier, such as blog-glue-xml-contact.
  5. For Row tag, enter the name of the root tag that contains the metadata (for example, metadata).
  6. Choose Create.
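If you want to create the same classifier programmatically instead of through the console, the following is a minimal boto3 sketch; the classifier name and row tag match the values used above.

import boto3

glue = boto3.client("glue")

# Create a custom XML classifier that treats the metadata tag as the row element
glue.create_classifier(
    XMLClassifier={
        "Name": "blog-glue-xml-contact",
        "Classification": "xml",
        "RowTag": "metadata",
    }
)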

Create an AWS Glue crawler to crawl the XML file

In this section, we create an AWS Glue crawler to extract the metadata from the XML file using the custom classifier created in the previous step.

Create a database

  1. On the AWS Glue console, choose Databases in the navigation pane.
  2. Choose Add database.
  3. Provide a name, such as blog_glue_xml.
  4. Choose Create database.

Create a Crawler

Complete the following steps to create your first crawler:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. On the Set crawler properties page, provide a name for the new crawler (such as blog-glue-parquet), then choose Next.
  4. On the Choose data sources and classifiers page, select Not Yet under Data source configuration.
  5. Choose Add a data store.
  6. For S3 path, browse to s3://${BUCKET_NAME}/input/geologicalsurvey/.

Make sure you pick the XML folder rather than the file inside the folder.

  7. Leave the rest of the options as default and choose Add an S3 data source.
  8. Expand Custom classifiers – optional, choose blog-glue-xml-contact, then choose Next and keep the rest of the options as default.
  9. Choose your IAM role or choose Create new IAM role, add the suffix glue-xml-contact (for example, AWSGlueServiceNotebookRoleBlog), and choose Next.
  10. On the Set output and scheduling page, under Output configuration, choose blog_glue_xml for Target database.
  11. Enter console_ as the prefix added to tables (optional) and under Crawler schedule, keep the frequency set to On demand.
  12. Choose Next.
  13. Review all the parameters and choose Create crawler.

Run the Crawler

After you create the crawler, complete the following steps to run it:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Open the crawler you created and choose Run.

The crawler will take 1–2 minutes to complete.

  3. When the crawler is complete, choose Databases in the navigation pane.
  4. Choose the database you created and choose the table name to see the schema extracted by the crawler.
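If you prefer to automate the crawler steps, the following boto3 sketch creates and starts a crawler equivalent to the one configured above; the IAM role and bucket names are placeholders.

import boto3

glue = boto3.client("glue")

# Create a crawler that uses the custom XML classifier and writes tables to blog_glue_xml
glue.create_crawler(
    Name="blog-glue-parquet",
    Role="AWSGlueServiceNotebookRoleBlog",  # placeholder: role created in the prerequisites
    DatabaseName="blog_glue_xml",
    Classifiers=["blog-glue-xml-contact"],
    TablePrefix="console_",
    Targets={"S3Targets": [{"Path": "s3://YOUR_BUCKET_NAME/input/geologicalsurvey/"}]},
)

# Run the crawler on demand
glue.start_crawler(Name="blog-glue-parquet")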

Create an AWS Glue job to convert the XML to Parquet format

In this step, you create an AWS Glue Studio job to convert the XML file into a Parquet file. Complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Under Create job, select Visual with a blank canvas.
  3. Choose Create.
  4. Rename the job to blog_glue_xml_job.

Now you have a blank AWS Glue Studio visual job editor. On the top of the editor are the tabs for different views.

  5. Choose the Script tab to see an empty shell of the AWS Glue ETL script.

As we add new steps in the visual editor, the script will be updated automatically.

  6. Choose the Job details tab to see all the job configurations.
  7. For IAM role, choose AWSGlueServiceNotebookRoleBlog.
  8. For Glue version, choose Glue 4.0 – Supports Spark 3.3, Scala 2, Python 3.
  9. Set Requested number of workers to 2.
  10. Set Number of retries to 0.
  11. Choose the Visual tab to go back to the visual editor.
  12. On the Source drop-down menu, choose AWS Glue Data Catalog.
  13. On the Data source properties – Data Catalog tab, provide the following information:
    1. For Database, choose blog_glue_xml.
    2. For Table, choose the table that starts with the name console_ that the crawler created (for example, console_geologicalsurvey).
  14. On the Node properties tab, provide the following information:
    1. Change Name to geologicalsurvey dataset.
    2. Choose Action and the transformation Change Schema (Apply Mapping).
    3. Choose Node properties and change the name of the transform from Change Schema (Apply Mapping) to ApplyMapping.
    4. On the Target menu, choose S3.
  15. On the Data target properties – S3 tab, provide the following information:
    1. For Format, select Parquet.
    2. For Compression Type, select Uncompressed.
    3. For S3 source type, select S3 location.
    4. For S3 URL, enter s3://${BUCKET_NAME}/output/parquet/.
    5. Choose Node Properties and change the name to Output.
  16. Choose Save to save the job.
  17. Choose Run to run the job.

The following screenshot shows the job in the visual editor.
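For reference, the script that AWS Glue Studio generates for this job is roughly equivalent to the following PySpark sketch. The ApplyMapping step is omitted here because its column list depends on the crawled schema, and the bucket name is a placeholder.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table that the XML crawler created in the Data Catalog
geologicalsurvey = glueContext.create_dynamic_frame.from_catalog(
    database="blog_glue_xml",
    table_name="console_geologicalsurvey",
)

# Write the data to Amazon S3 as uncompressed Parquet
glueContext.write_dynamic_frame.from_options(
    frame=geologicalsurvey,
    connection_type="s3",
    connection_options={"path": "s3://YOUR_BUCKET_NAME/output/parquet/"},
    format="parquet",
)

job.commit()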

Create an AWS Glue crawler to crawl the Parquet file

In this step, you create an AWS Glue crawler to extract metadata from the Parquet file you created using an AWS Glue Studio job. This time, you use the default classifier. Complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. On the Set crawler properties page, provide a name for the new crawler, such as blog-glue-parquet-contact, then choose Next.
  4. On the Choose data sources and classifiers page, select Not Yet for Data source configuration.
  5. Choose Add a data store.
  6. For S3 path, browse to s3://${BUCKET_NAME}/output/parquet/.

Make sure you pick the parquet folder rather than the file inside the folder.

  7. Choose your IAM role created during the prerequisite section or choose Create new IAM role (for example, AWSGlueServiceNotebookRoleBlog), and choose Next.
  8. On the Set output and scheduling page, under Output configuration, choose blog_glue_xml for Database.
  9. Enter parquet_ as the prefix added to tables (optional) and under Crawler schedule, keep the frequency set to On demand.
  10. Choose Next.
  11. Review all the parameters and choose Create crawler.

Now you can run the crawler, which takes 1–2 minutes to complete.

You can preview the newly created schema for the Parquet file in the AWS Glue Data Catalog, which is similar to the schema of the XML file.

We now possess data that is suitable for use with Athena. In the next section, we perform data queries using Athena.

Query the Parquet file using Athena

Athena doesn’t support querying the XML file format, which is why you converted the XML file into Parquet for more efficient data querying. In the Parquet table, you can use dot notation to query complex types and nested structures.

The following example code uses dot notation to query nested data:

SELECT 
    idinfo.citation.citeinfo.origin,
    idinfo.citation.citeinfo.pubdate,
    idinfo.citation.citeinfo.title,
    idinfo.citation.citeinfo.geoform,
    idinfo.citation.citeinfo.pubinfo.pubplace,
    idinfo.citation.citeinfo.pubinfo.publish,
    idinfo.citation.citeinfo.onlink,
    idinfo.descript.abstract,
    idinfo.descript.purpose,
    idinfo.descript.supplinf,
    dataqual.attracc.attraccr, 
    dataqual.logic,
    dataqual.complete,
    dataqual.posacc.horizpa.horizpar,
    dataqual.posacc.vertacc.vertaccr,
    dataqual.lineage.procstep.procdate,
    dataqual.lineage.procstep.procdesc
FROM "blog_glue_xml"."parquet_parquet" limit 10;

Now that we’ve completed technique 1, let’s move on to learn about technique 2.

Technique 2: Use AWS Glue DynamicFrames with inferred and fixed schemas

In the previous section, we covered the process of handling a small XML file using an AWS Glue crawler to generate a table, an AWS Glue job to convert the file into Parquet format, and Athena to access the Parquet data. However, the crawler encounters limitations when processing XML files whose rows exceed 1 MB in size. In this section, we delve into the topic of batch processing larger XML files, necessitating additional parsing to extract individual events and conduct analysis using Athena.

Our approach involves reading the XML files through AWS Glue DynamicFrames, employing both inferred and fixed schemas. Then we extract the individual events in Parquet format using the relationalize transformation, enabling us to query and analyze them seamlessly using Athena.

To implement this solution, you complete the following high-level steps:

  1. Create an AWS Glue notebook to read and analyze the XML file.
  2. Use DynamicFrames with InferSchema to read the XML file.
  3. Use the relationalize function to unnest any arrays.
  4. Convert the data to Parquet format.
  5. Query the Parquet data using Athena.
  6. Repeat the previous steps, but this time pass a schema to DynamicFrames instead of using InferSchema.

The electric vehicle population data XML file has a response tag at its root level. Nested within it is a row tag that contains an array of inner row tags, each of which provides information about a vehicle, including its make, model, and other relevant details. The following screenshot shows an example.

Create an AWS Glue Notebook

To create an AWS Glue notebook, complete the following steps:

  1. On the AWS Glue Studio console, choose Jobs in the navigation pane.
  2. Select Jupyter Notebook and choose Create.

  3. Enter a name for your AWS Glue job, such as blog_glue_xml_job_Jupyter.
  4. Choose the role that you created in the prerequisites (AWSGlueServiceNotebookRoleBlog).

The AWS Glue notebook comes with a preexisting example that demonstrates how to query a database and write the output to Amazon S3.

  5. Adjust the timeout (in minutes) as shown in the following screenshot and run the cell to create the AWS Glue interactive session.

Create basic variables

After you create the interactive session, at the end of the notebook, create a new cell with the following variables (provide your own bucket name):

BUCKET_NAME='YOUR_BUCKET_NAME'
S3_SOURCE_XML_FILE = f's3://{BUCKET_NAME}/xml_dataset/'
S3_TEMP_FOLDER = f's3://{BUCKET_NAME}/temp/'
S3_OUTPUT_INFER_SCHEMA = f's3://{BUCKET_NAME}/infer_schema/'
INFER_SCHEMA_TABLE_NAME = 'infer_schema'
S3_OUTPUT_NO_INFER_SCHEMA = f's3://{BUCKET_NAME}/no_infer_schema/'
NO_INFER_SCHEMA_TABLE_NAME = 'no_infer_schema'
DATABASE_NAME = 'blog_xml'

Read the XML file inferring the schema

If you don’t pass a schema to the DynamicFrame, it will infer the schema of the files. To read the data using a dynamic frame, you can use the following command:

df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [S3_SOURCE_XML_FILE]},
    format="xml",
    format_options={"rowTag": "response"},
)

Print the DynamicFrame Schema

Print the schema with the following code:

df.printSchema()

The schema shows a nested structure with a row array containing multiple elements. To unnest this structure into lines, you can use the AWS Glue relationalize transformation:

df_relationalized = df.relationalize(
    "root", S3_TEMP_FOLDER
)

We are only interested in the information contained within the row array, and we can view the schema by using the following command:

df_relationalized.select("root_row.row").printSchema()

The column names contain row.row, which correspond to the array structure and array column in the dataset. We don’t rename the columns in this post; for instructions to do so, refer to Automate dynamic mapping and renaming of column names in data files using AWS Glue: Part 1. Then you can convert the data to Parquet format and create the AWS Glue table using the following command:


s3output = glueContext.getSink(
  path= S3_OUTPUT_INFER_SCHEMA,
  connection_type="s3",
  updateBehavior="UPDATE_IN_DATABASE",
  partitionKeys=[],
  compression="snappy",
  enableUpdateCatalog=True,
  transformation_ctx="s3output",
)
s3output.setCatalogInfo(
  catalogDatabase="blog_xml", catalogTableName="jupyter_notebook_with_infer_schema"
)
s3output.setFormat("glueparquet")
s3output.writeFrame(df_relationalized.select("root_row.row"))

AWS Glue DynamicFrame provides features that you can use in your ETL script to create and update a schema in the Data Catalog. We use the updateBehavior parameter to create the table directly in the Data Catalog. With this approach, we don’t need to run an AWS Glue crawler after the AWS Glue job is complete.

Read the XML file by setting a schema

An alternative way to read the file is by predefining a schema. To do this, complete the following steps:

  1. Import the AWS Glue data types and the json module (json is used later to pass the schema as a string):
    from awsglue.gluetypes import *
    import json

  2. Create a schema for the XML file:
    schema = StructType([ 
      Field("row", StructType([
        Field("row", ArrayType(StructType([
                Field("_2020_census_tract", LongType()),
                Field("__address", StringType()),
                Field("__id", StringType()),
                Field("__position", IntegerType()),
                Field("__uuid", StringType()),
                Field("base_msrp", IntegerType()),
                Field("cafv_type", StringType()),
                Field("city", StringType()),
                Field("county", StringType()),
                Field("dol_vehicle_id", IntegerType()),
                Field("electric_range", IntegerType()),
                Field("electric_utility", StringType()),
                Field("ev_type", StringType()),
                Field("geocoded_column", StringType()),
                Field("legislative_district", IntegerType()),
                Field("make", StringType()),
                Field("model", StringType()),
                Field("model_year", IntegerType()),
                Field("state", StringType()),
                Field("vin_1_10", StringType()),
                Field("zip_code", IntegerType())
        ])))
      ]))
    ])

  3. Pass the schema when reading the XML file:
    df = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [S3_SOURCE_XML_FILE]},
        format="xml",
        format_options={"rowTag": "response", "withSchema": json.dumps(schema.jsonValue())},
    )

  4. Unnest the dataset like before:
    df_relationalized = df.relationalize(
        "root", S3_TEMP_FOLDER
    )

  5. Convert the dataset to Parquet and create the AWS Glue table:
    s3output = glueContext.getSink(
      path=S3_OUTPUT_NO_INFER_SCHEMA,
      connection_type="s3",
      updateBehavior="UPDATE_IN_DATABASE",
      partitionKeys=[],
      compression="snappy",
      enableUpdateCatalog=True,
      transformation_ctx="s3output",
    )
    s3output.setCatalogInfo(
      catalogDatabase="blog_xml", catalogTableName="jupyter_notebook_no_infer_schema"
    )
    s3output.setFormat("glueparquet")
    s3output.writeFrame(df_relationalized.select("root_row.row"))

Query the tables using Athena

Now that we have created both tables, we can query the tables using Athena. For example, we can use the following query:

SELECT * FROM "blog_xml"."jupyter_notebook_no_infer_schema" limit 10;

The following screenshot shows the results.

Clean up

In this post, we created an IAM role, an AWS Glue Jupyter notebook, and two tables in the AWS Glue Data Catalog. We also uploaded some files to an S3 bucket. To clean up these objects, complete the following steps:

  1. On the IAM console, delete the role you created.
  2. On the AWS Glue Studio console, delete the custom classifier, crawler, ETL jobs, and Jupyter notebook.
  3. Navigate to the AWS Glue Data Catalog and delete the tables you created.
  4. On the Amazon S3 console, navigate to the bucket you created and delete the folders named temp, infer_schema, and no_infer_schema.

Key Takeaways

In AWS Glue, there’s a feature called InferSchema in AWS Glue DynamicFrames. It automatically figures out the structure of a data frame based on the data it contains. In contrast, defining a schema means explicitly stating how the data frame’s structure should be before loading the data.

XML, being a text-based format, doesn’t restrict the data types of its columns. This can cause issues with the InferSchema function. For example, in the first run, a file with column A having a value of 2 results in a Parquet file with column A as an integer. In the second run, a new file has column A with the value C, leading to a Parquet file with column A as a string. Now there are two files on S3, each with a column A of different data types, which can create problems downstream.

The same happens with complex data types like nested structures or arrays. For example, if a file has a single tag entry called transaction, it’s inferred as a struct, but if another file has multiple entries of the same tag, it’s inferred as an array.

Despite these data type issues, InferSchema is useful when you don’t know the schema or defining one manually is impractical. However, it’s not ideal for large or constantly changing datasets. Defining a schema is more precise, especially with complex data types, but has its own issues, like requiring manual effort and being inflexible to data changes.

InferSchema has limitations, like incorrect data type inference and issues with handling null values. Defining a schema also has limitations, like manual effort and potential errors.

Choosing between inferring and defining a schema depends on the project’s needs. InferSchema is great for quick exploration of small datasets, whereas defining a schema is better for larger, complex datasets requiring accuracy and consistency. Consider the trade-offs and constraints of each method to pick what suits your project best.

Conclusion

In this post, we explored two techniques for managing XML data using AWS Glue, each tailored to address specific needs and challenges you may encounter.

Technique 1 offers a user-friendly path for those who prefer a graphical interface. You can use an AWS Glue crawler and the visual editor to effortlessly define the table structure for your XML files. This approach simplifies the data management process and is particularly appealing to those looking for a straightforward way to handle their data.

However, we recognize that the crawler has its limitations, specifically when dealing with XML files having rows larger than 1 MB. This is where technique 2 comes to the rescue. By harnessing AWS Glue DynamicFrames with both inferred and fixed schemas, and employing an AWS Glue notebook, you can efficiently handle XML files of any size. This method provides a robust solution that ensures seamless processing even for XML files with rows exceeding the 1 MB constraint.

As you navigate the world of data management, having these techniques in your toolkit empowers you to make informed decisions based on the specific requirements of your project. Whether you prefer the simplicity of technique 1 or the scalability of technique 2, AWS Glue provides the flexibility you need to handle XML data effectively.


About the Authors

Navnit Shukla serves as an AWS Specialist Solution Architect with a focus on Analytics. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, Navnit Shukla is the accomplished author of the book titled “Data Wrangling on AWS.”

Patrick Muller works as a Senior Data Lab Architect at AWS. His main responsibility is to assist customers in turning their ideas into a production-ready data product. In his free time, Patrick enjoys playing soccer, watching movies, and traveling.

Amogh Gaikwad is a Senior Solutions Developer at Amazon Web Services. He helps global customers build and deploy AI/ML solutions on AWS. His work is mainly focused on computer vision, and natural language processing and helping customers optimize their AI/ML workloads for sustainability. Amogh has received his master’s in Computer Science specializing in Machine Learning.

Sheela Sonone is a Senior Resident Architect at AWS. She helps AWS customers make informed choices and tradeoffs about accelerating their data, analytics, and AI/ML workloads and implementations. In her spare time, she enjoys spending time with her family – usually on tennis courts.

Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/set-up-advanced-rules-to-validate-quality-of-multiple-datasets-with-aws-glue-data-quality/

Data is the lifeblood of modern businesses. In today’s data-driven world, companies rely on data to make informed decisions, gain a competitive edge, and provide exceptional customer experiences. However, not all data is created equal. Poor-quality data can lead to incorrect insights, bad decisions, and lost opportunities.

AWS Glue Data Quality measures and monitors the quality of your dataset. It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines. Data quality at rest focuses on validating the data stored in data lakes, databases, or data warehouses. It ensures that the data meets specific quality standards before it is consumed. Data quality in ETL pipelines, on the other hand, ensures the quality of data as it moves through the ETL process. It helps identify data quality issues during the ETL pipeline, allowing for early detection and correction of problems and prevents the failure of the data pipeline because of data quality issues.
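To make data quality in ETL pipelines concrete, an AWS Glue ETL script can evaluate a DQDL ruleset inline with the EvaluateDataQuality transform. The following is a hedged sketch: the rules are illustrative, and orders_dyf is an assumed DynamicFrame read earlier in the job. The walkthrough in this post builds equivalent checks visually in AWS Glue Studio.

from awsgluedq.transforms import EvaluateDataQuality

# Illustrative DQDL ruleset; the walkthrough below uses more advanced rule types
ruleset = """
Rules = [
    IsComplete "ordernumber",
    ColumnValues "status" in ["Shipped", "In Process", "Cancelled", "Resolved", "On Hold", "Disputed"]
]
"""

# orders_dyf is assumed to be a DynamicFrame created earlier in the job
dq_results = EvaluateDataQuality.apply(
    frame=orders_dyf,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "orders_dq_check",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)

# Each row of the result describes one rule and whether it passed
dq_results.toDF().show(truncate=False)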

This is Part 3 of a five-post series on AWS Glue Data Quality. In this post, we demonstrate the advanced data quality checks that you can typically perform when bringing data from a database to an Amazon Simple Storage Service (Amazon S3) data lake. Check out the other posts in this series:

Use case overview

Let’s consider an example use case where we have a database named classicmodels that contains retail data for a car dealership. This example database includes sample data for various entities, such as Customers, Products, ProductLines, Orders, OrderDetails, Payments, Employees, and Offices. You can find more details about this example database in MySQL Sample Database.

In this scenario, we assume the role of a data engineer who is responsible for building a data pipeline. The primary objective is to extract data from a relational database, specifically an Amazon RDS for MySQL database, and store it in Amazon S3, which serves as a data lake. After the data is loaded into the data lake, the data engineer is also responsible for performing data quality checks to ensure that the data in the data lake maintains its quality. To achieve this, the data engineer uses the newly launched AWS Glue Data Quality evaluation feature.

The following diagram illustrates the entity relationship model that describes the relationships between different tables. In this post, we use the employees, customers, and products tables.

Solution overview

This solution focuses on transferring data from an RDS for MySQL database to Amazon S3 and performing data quality checks using the AWS Glue ETL pipeline and AWS Glue Data Catalog. The workflow involves the following steps:

  1. Data is extracted from the RDS for MySQL database using AWS Glue ETL.
  2. The extracted data is stored in Amazon S3, which serves as the data lake.
  3. The Data Catalog and AWS Glue ETL pipeline are utilized to validate the successful completion of data ingestion by performing data quality checks on the data stored in Amazon S3.

The following diagram illustrates the solution architecture.

To implement the solution, we complete the following steps:

  1. Set up resources with AWS CloudFormation.
  2. Establish a connection to the RDS for MySQL instance from AWS Cloud9.
  3. Run an AWS Glue crawler on the RDS for MySQL database.
  4. Validate the Data Catalog.
  5. Run an AWS Glue ETL job to bring data from Amazon RDS for MySQL to Amazon S3.
  6. Evaluate the advanced data quality rules in the ETL job.
  7. Evaluate the advanced data quality rules in the Data Catalog.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs. The template generates the following resources:

  • An RDS for MySQL database instance (source)
  • An S3 bucket for the data lake (destination)
  • An AWS Glue ETL job to bring data from source to destination
  • An AWS Glue crawler to crawl the RDS for MySQL databases and create a centralized Data Catalog
  • AWS Identity and Access Management (IAM) users and policies
  • An AWS Cloud9 environment to connect to the RDS DB instance and create a sample dataset
  • An Amazon VPC, public subnet, two private subnets, internet gateway, NAT gateway, and route tables

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack:
    BDB-2063-launch-cloudformation-stack
  3. Choose Next.
  4. For DatabaseUserPassword, enter your preferred password.
  5. Choose Next.
  6. Scroll to the end and choose Next.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Submit.

This stack can take around 10 minutes to complete, after which you can view the deployed stack on the AWS CloudFormation console.

Establish a connection to the RDS for MySQL instance from AWS Cloud9

To connect to the RDS for MySQL instance, complete the following steps:

  1. On the AWS Cloud9 console, choose Open under Cloud9 IDE for your environment.
  2. Run the following command in the AWS Cloud9 terminal. Provide your values for the MySQL endpoint (located on the CloudFormation stack’s Outputs tab), database user name, and database user password:
    $ mysql --host=<MySQLEndpoint> --user=<DatabaseUserName> --password=<password>

  3. Download the SQL file.
  4. On the File menu, choose Upload from Local Files and upload the file to AWS Cloud9.
  5. Run the following SQL commands within the downloaded file:
    MySQL [(none)]> source mysqlsampledatabase.sql

  6. Retrieve a list of tables using the following SQL statement and make sure that eight tables are loaded successfully:
    use classicmodels;
    show tables;

Run an AWS Glue crawler on the RDS for MySQL database

To run your crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers under Data Catalog in the navigation pane.
  2. Locate and run the crawler dq-rds-crawler.

The crawler will take a few minutes to crawl all the tables from the classicmodels database.

Validate the AWS Glue Data Catalog

To validate the Data Catalog when the crawler is complete, complete the following steps:

  1. On the AWS Glue console, choose Databases under Data Catalog in the navigation pane.
  2. Choose the mysql_private_classicmodels database.

You will be able to see all the RDS tables available under mysql_private_classicmodels.

Run an AWS Glue ETL job to bring data from Amazon RDS for MySQL to Amazon S3

To run your ETL job, complete the following steps:

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. Select dq-rds-to-s3 from the job list and choose Run job.

When the job is complete, you will be able to see three new tables under mysql_s3_db. It may take a few minutes to complete.

Now let’s dive into evaluating the data quality rules.

Evaluate the advanced data quality rules in the ETL job

In this section, we evaluate the results of different data quality rules.

ReferentialIntegrity

Let’s start with referential integrity. The ReferentialIntegrity data quality ruleset is currently supported in ETL jobs. This feature ensures that the relationships between tables in a database are maintained. It checks if the foreign key relationships between tables are valid and consistent, helping to identify any referential integrity violations.

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. In AWS Glue Studio, select Visual with a blank canvas.
  3. Provide a name for your job; for example, RDS ReferentialIntegrity.
  4. Choose the plus sign in the AWS Glue Studio canvas and on the Data tab, choose AWS Glue Data Catalog.
  5. For Name, enter a name for your data source; for example, employees.
  6. For Database, choose mysql_private_classicmodels.
  7. For Table, choose mysql_classicmodels_employees.
  8. Choose the plus sign in the AWS Glue Studio canvas and on the Data tab, choose AWS Glue Data Catalog.
  9. For Name, enter a name for your data source; for example, customers.
  10. For Database, choose mysql_private_classicmodels.
  11. For Table, choose mysql_classicmodels_customers.
  12. Choose the plus sign in the AWS Glue Studio canvas and on the Transform tab, choose Evaluate Data Quality.
  13. For Node parents, choose employees and customers.
  14. For Aliases for referenced data source, select Primary source for employees and for customers, enter the alias customers.

All other datasets are used as references to ensure that the primary dataset has good-quality data.

  15. Search for ReferentialIntegrity under Rule types and choose the plus sign to add an example ReferentialIntegrity rule.
  16. Replace the rule with the following code and keep the remaining options as default:
    Rules = [
        ReferentialIntegrity "employeenumber" "customers.salesRepEmployeeNumber" between 0.6 and 0.7
    ]

  17. Under Data quality action, select Publish results to Amazon CloudWatch and select Fail job without loading target data.
  18. On the Job details tab, choose GlueServiceRole-for-gluedq-blog for IAM role and keep the remaining options as default.
  19. Choose Run and wait for the job to complete.

It will take a few minutes to complete.

  20. When the job is complete, navigate to the Data quality tab and locate the Data quality results section.

You can confirm if the job completed successfully and which data quality rules it passed. In this example, it indicates that 60–70% of EmployeeNumber from the employees table are present in the customers table.

You can identify which records failed the referential integrity using AWS Glue Studio. To learn more, refer to Getting started with AWS Glue Data Quality for ETL Pipelines.

Similarly, if you are checking if all the EmployeeNumber from the employees table are present in the customers table, you can pass the following rule:

Rules = [
    ReferentialIntegrity "employeenumber" "customers.salesRepEmployeeNumber" = 1
]

DatasetMatch

DatasetMatch compares two datasets to identify differences and similarities. You can use it to detect changes between datasets or to find duplicates, missing values, or inconsistencies across datasets.

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. In AWS Glue Studio, select Visual with a blank canvas.
  3. Provide a name for your job; for example, RDS DatasetMatch.
  4. Choose the plus sign in the AWS Glue Studio canvas and on the Data tab, choose AWS Glue Data Catalog.
  5. For Name, enter a name for your data source; for example, rds_employees_primary.
  6. For Database, choose mysql_private_classicmodels.
  7. For Table, choose mysql_classicmodels_employees.
  8. Choose the plus sign in the AWS Glue Studio canvas and on the Data tab, choose AWS Glue Data Catalog.
  9. For Name, enter a name for your data source; for example, s3_employees_reference.
  10. For Database, choose mysql_s3_db.
  11. For Table, choose s3_employees.
  12. Choose the plus sign in the AWS Glue Studio canvas and on the Transform tab, choose Evaluate Data Quality.
  13. For Node parents, choose rds_employees_primary and s3_employees_reference.
  14. For Aliases for referenced data source, select Primary source for rds_employees_primary and for s3_employees_reference, enter the alias reference.
  15. Replace the default example rules with the following code and keep the remaining options as default:
    Rules = [
        DatasetMatch "reference" "employeenumber,employeenumber" = 1
    ]

  16. On the Job details tab, choose GlueServiceRole-for-gluedq-blog for IAM role and keep the remaining options as default.
  17. Choose Run and wait for the job to complete.
  18. When the job is complete, navigate to the Data quality tab and locate the Data quality results section.

In this example, it indicates both datasets are identical.

AggregateMatch

AggregateMatch verifies the accuracy of aggregated data. It compares the aggregated values in a dataset against the expected results to identify any discrepancies, such as incorrect sums, averages, counts, or other aggregate calculations. This is a performant option to evaluate if two datasets match at an aggregate level. For this rule, we clone the previous job we created for DatasetMatch.

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. Select RDS DatasetMatch and on the Actions menu, choose Clone job.
  3. Change the job name to DQ AggregateMatch.
  4. Change the dataset rds_employees_primary to rds_products_primary and the table to mysql_classicmodels_products.
  5. Change the dataset s3_employees_reference to s3_products_reference and the table to s3_products.
  6. Choose Evaluate Data Quality, and under Node parents, choose rds_products_primary and s3_products_reference.
  7. Replace the rules with the following code:
    AggregateMatch "avg(MSRP)" "avg(reference.MSRP)" = 1

  8. Choose Run and wait for the job to complete.
  9. When the job is complete, navigate to the Data quality tab and locate the Data quality results section.

The results indicate that the avg(msrp) on both datasets is the same.

RowCountMatch

RowCountMatch checks the number of rows in a dataset and compares it to an expected count. It helps identify missing or extra rows in a dataset, ensuring data completeness. For this rule, we edit the job we created earlier for AggregateMatch.

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. Select DQ AggregateMatch and on the Actions menu, choose Edit job.
  3. Choose Evaluate Data Quality and choose the plus sign next to RowCountMatch.
  4. Keep the default data quality rules and choose Save:
    RowCountMatch "reference" = 1.0

  5. Choose Run and wait for the job to complete.
  6. When the job is complete, navigate to the Data quality tab and locate the Data quality results section.

It shows that the DQ RowCountMatch rule failed, indicating a mismatch between the row count of the source RDS table and the target S3 table. Further investigation reveals that the ETL job ran four times for the Products table, and the row counts didn’t match.

SchemaMatch

SchemaMatch validates the schema of two datasets matches. It checks if the actual data types match the expected data types and flags any inconsistencies, such as a numeric column containing non-numeric values. For this rule, we edit the job we used for AggregateMatch.

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. Select DQ AggregateMatch and on the Actions menu, choose Edit job.
  3. Choose Evaluate Data Quality and choose the plus sign next to SchemaMatch.
  4. Update the default rules with the following code and save the job:
    SchemaMatch "reference" = 1.0

  5. Choose Run and wait for the job to complete.
  6. When the job is complete, navigate to the Data quality tab and locate the Data quality results section.

It should show a successful completion with a Rule passed status, indicating that the schemas of both datasets are identical.

Evaluate the advanced data quality rules in the Data Catalog

The AWS Glue Data Catalog also supports advanced data quality rules. For this post, we show one example of an aggregate match between Amazon S3 and Amazon RDS.

  1. On the AWS Glue console, choose Databases in the navigation pane.
  2. Choose the mysql_private_classicmodels database to view the three tables created under it.
  3. Choose the mysql_classicmodels_products table.
  4. On the Data quality tab, choose Create data quality rules.
  5. Search for AggregateMatch and choose the plus sign to view the default example rule.
  6. Add the following rules:
    Rules = [
        AggregateMatch "avg(msrp)" "avg(mysql_s3_db.s3_products.msrp)" >= 0.9,
        ReferentialIntegrity "productname,productcode" "mysql_s3_db.s3_products.{productname,productcode}" = 1
        ]

reference is the alias of the secondary dataset defined in the AWS Glue ETL job. For the Data Catalog, you can use <database_name>.<table_name>.<column_name> to reference secondary datasets.

  7. Choose Save ruleset and provide the name production_catalog_dq_check.
  8. Choose GlueServiceRole-for-gluedq-blog for IAM role and keep the remaining options as default.
  9. Choose Run and wait for the data quality check to complete.

When the job is complete, you can confirm that both data quality checks passed.
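If you want to run the same Data Catalog ruleset outside the console, the following is a hedged boto3 sketch using the AWS Glue data quality API. The database, table, role, and ruleset names come from the steps above; rules that reference a second table, like the ones in this ruleset, may require additional run configuration.

import time

import boto3

glue = boto3.client("glue")

# Start an evaluation run of the ruleset against the products table
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={
        "GlueTable": {
            "DatabaseName": "mysql_private_classicmodels",
            "TableName": "mysql_classicmodels_products",
        }
    },
    Role="GlueServiceRole-for-gluedq-blog",
    RulesetNames=["production_catalog_dq_check"],
)

# Poll until the run finishes, then print the final status
while True:
    status = glue.get_data_quality_ruleset_evaluation_run(RunId=run["RunId"])["Status"]
    if status in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(15)

print(status)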

With these advanced data quality features of AWS Glue Data Quality, you can enhance the reliability, accuracy, and consistency of your data, leading to better insights and decision-making.

Clean up

To clean up your resources, complete the following steps:

  1. Delete the AWS Glue job.
  2. Delete the CloudFormation stack.

Conclusion

Data quality refers to the accuracy, completeness, consistency, timeliness, and validity of the information being collected, processed, and analyzed. High-quality data is essential for businesses to make informed decisions, gain valuable insights, and maintain their competitive advantage. As data complexity increases, advanced rules are critical to handle complex data quality challenges. The rules we demonstrated in this post can help you manage the quality of data that lives in disparate data sources, providing you the capabilities to reconcile them. Try them out and provide your feedback on what other use cases you need to solve!


About the authors

Navnit Shukla is AWS Specialist Solutions Architect in Analytics. He is passionate about helping customers uncover insights from their data. He builds solutions to help organizations make data-driven decisions.

Rahul Sharma is a Software Development Engineer at AWS Glue. He focuses on building distributed systems to support features in AWS Glue. He has a passion for helping customers build data management solutions on the AWS Cloud.

Edward Cho is a Software Development Engineer at AWS Glue. He has contributed to the AWS Glue Data Quality feature as well as the underlying open-source project Deequ.

Shriya Vanvari is a Software Developer Engineer in AWS Glue. She is passionate about learning how to build efficient and scalable systems to provide better experience for customers. Outside of work, she enjoys reading and chasing sunsets.

Extract ServiceNow data using AWS Glue Studio in an Amazon S3 data lake and analyze using Amazon Athena

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/extract-servicenow-data-using-aws-glue-studio-in-an-amazon-s3-data-lake-and-analyze-using-amazon-athena/

Many different cloud-based software as a service (SaaS) offerings are available in AWS. ServiceNow is one of the common cloud-based workflow automation platforms widely used by AWS customers. In the past few years, we saw a lot of customers who wanted to extract and integrate data from IT service management (ITSM) tools like ServiceNow for various use cases:

  • Generate insight from data – You can combine ServiceNow data with data from other services like CRM (such as Salesforce) or martech data (such as Amazon Pinpoint) to generate better insights, such as building a complete customer 360 view.
  • Archive data for future business or regulatory requirements – You can archive the data in raw form in your data lake to work on future use cases or just keep it to satisfy regulatory requirements such as auditing.
  • Improve performance by decoupling reporting or machine learning use cases from ITSM – When you move your ITSM reporting from ServiceNow to an Amazon Simple Storage Service (Amazon S3) data lake, there is no performance impact on your ServiceNow instance.
  • Data democratization – You can extract the data and put it into a data lake so it can be available to other business users and units to explore and use.

Many customers have been building modern data architectures on AWS, which includes building data lakes on Amazon S3 and using broad and deep AWS analytics and AI/ML services to extract meaningful information from data by combining data from different data sources.

In this post, we provide a step-by-step guide to bring data from ServiceNow to an S3 data lake using AWS Glue Studio and analyze the data with Amazon Athena.

Solution overview

In this solution, ServiceNow data is extracted through AWS Glue using a Marketplace connector. AWS Glue provides built-in support for the most commonly used data stores (such as Amazon Redshift, Amazon Aurora, Microsoft SQL Server, MySQL, MongoDB, and PostgreSQL) using JDBC connections. AWS Glue also allows you to use custom JDBC drivers in your extract, transform, and load (ETL) jobs. For data stores that are not natively supported, such as SaaS applications, you can use connectors. The extracted data is stored in Amazon S3 and cataloged in the AWS Glue Data Catalog, and we use Athena to query the data.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration so you can start analyzing your data and put it to use in minutes instead of months.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

ServiceNow is a cloud-based software platform for ITSM that helps to automate IT business management. It’s designed based on ITIL guidelines to provide service orientation for tasks, activities, and processes.

The following diagram illustrates our solution architecture.
aws glue blog

To implement the solution, we complete the following high-level steps:

  1. Subscribe to the AWS Glue Marketplace Connector for ServiceNow from AWS Marketplace.
  2. Create a connection in AWS Glue Studio.
  3. Create an AWS Identity and Access Management (IAM) role for AWS Glue.
  4. Configure and run an AWS Glue job that uses the connection.
  5. Run the query against the data lake (Amazon S3) using Athena.

Prerequisites

For this walkthrough, you should have the following:

  • An AWS account.
  • A ServiceNow account. To follow along with this post, you can sign up for a developer account, which is pre-populated with sample records in many of the ServiceNow objects.
  • ServiceNow connection properties credentials stored in AWS Secrets Manager. On the Secrets Manager console, create a new secret (select Other type of secrets) with a key-value pair for each property, for example:
    • Username – ServiceNow Instance account user name (for example, admin)
    • Password – ServiceNow Instance account password
    • Instance – ServiceNow instance name without https and .service-now.com

Copy the secret name to use when configuring the connection in AWS Glue Studio.
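If you'd rather create the secret programmatically, here's a minimal boto3 sketch; the secret name and all values are placeholders that you replace with your ServiceNow instance details.

import json

import boto3

secretsmanager = boto3.client("secretsmanager")

# Store the ServiceNow connection properties as a key-value secret
secret = secretsmanager.create_secret(
    Name="blog-servicenow-connection",  # placeholder: reuse this name in AWS Glue Studio
    SecretString=json.dumps(
        {
            "Username": "admin",
            "Password": "YOUR_PASSWORD",
            "Instance": "dev12345",  # instance name without https and .service-now.com
        }
    ),
)
print(secret["ARN"])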

Subscribe to the AWS Glue Marketplace Connector for ServiceNow

To connect, we use the AWS Glue Marketplace Connector for ServiceNow. You need to subscribe to the connector from AWS Marketplace.

The AWS Glue Marketplace Connector for ServiceNow is provided by a third-party independent software vendor (ISV) listed on AWS Marketplace. Associated subscription fees and AWS usage fees apply once you subscribe.

To use the connector in AWS Glue, you need to activate the subscribed connector in AWS Glue Studio. The activation process creates a connector object and connection in your AWS account.

  1. On the AWS Glue console, choose AWS Glue Studio.
  2. Choose Connectors.
  3. Choose Marketplace.
  4. Search for the CData AWS Glue Connector for ServiceNow.


After you subscribe to the connector, a new config tab appears on the AWS Marketplace connector page.

  5. Review the pricing and other relevant information.
  6. Choose Continue to Subscribe.
  7. Choose Accept Terms.

After you subscribe to the connector, the next steps are to configure it.

  8. Retain the default selections for Delivery Method and Software Version to use the latest connector software version.
  9. Choose Continue to Launch.

  10. Choose Usage Instructions.


A pop-up appears with a hyperlink to activate the connector with AWS Glue Studio.

  11. Choose this link to start configuring the connection to your ServiceNow account in AWS Glue Studio.

Create a connection in AWS Glue Studio

Create a connection in AWS Glue Studio with the following steps:

  1. For Name, enter a unique name for your ServiceNow connection.
  2. For Connection credential type, choose username_password.
  3. For AWS Secret, choose the Secrets Manager secret you created as a prerequisite.

Don’t provide any additional details in the optional Credentials section because it retrieves the value from Secrets Manager.

  4. Choose Create connection and activate connector to finish creating the connection.

You should now be able to view the ServiceNow connector you subscribed to and its associated connection.

Create an IAM role for AWS Glue

The next step is to create an IAM role with the necessary permissions for the AWS Glue job. The name of the role must start with the string AWSGlueServiceRole for AWS Glue Studio to use it correctly. You need to grant your IAM role permissions that AWS Glue can assume when calling other services on your behalf. For more information, see Create an IAM Role for AWS Glue.

Add the following policy to the role to allow it to retrieve the ServiceNow secret from Secrets Manager:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": [
                "{secret name arn}"
            ]
        }
    ]
}

For more information about permissions, see Review IAM permissions needed for the AWS Glue Studio user.

Configure and run the AWS Glue job

After you configure your connection, you can create and run an AWS Glue job.

Create a job that uses the connection

To create a job, complete the following steps:

  1. In AWS Glue Studio, choose Connectors.
  2. Select the connection you created.
  3. Choose Create job.


The visual job editor appears. A new source node, derived from the connection, is displayed on the job graph. In the node details panel on the right, the Data source properties tab is selected for user input.

Configure the source node properties

You can configure the access options for your connection to the data source on the Data source properties tab. For this post, we provide a simple walkthrough. Refer to the AWS Glue Studio User Guide for more information.

  1. On the Source menu, choose CData AWS Glue Connector for ServiceNow.

  2. On the Data source properties – Connector tab, make sure the source node for your connector is selected.

The Connection field is populated automatically with the name of the connection associated with the marketplace connector.

  3. Enter either a source table name or a query to use to retrieve data from the data source. For this post, we enter the table name incident.

  4. On the Transform menu, choose Apply Mapping.
  5. On the Node properties tab, for Node parents, select CData AWS Glue Connector for ServiceNow.
  6. Because you're connecting to an external data source, the Transform and Output schema tabs don't initially show a schema extracted from the source.
  7. To retrieve the schema, go to the Data preview tab, choose Start data preview session, and select the IAM role you created for this job.
  8. When the data preview is complete, go to the Data source section and choose Use datapreview schema.
  9. Go to the Transform tab and check all the columns where the data type shows as NULL.

  10. On the Target menu, choose Amazon S3.
  11. On the Data target properties – S3 tab, for Format, choose Parquet.
  12. For Compression Type, choose GZIP.
  13. For S3 Target Location, enter the Amazon S3 location to store the data.
  14. For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, keep existing schema and add new partitions.
  15. For Database, enter sampledb.
  16. For Table name, enter incident.
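For reference, the S3 target with Data Catalog updates enabled corresponds roughly to the following snippet in the generated script. This is a sketch rather than the exact generated code: servicenow_dyf stands in for the DynamicFrame produced by the connector source and ApplyMapping nodes configured above, and the target path is a placeholder.

# servicenow_dyf is assumed to be the DynamicFrame produced by the ServiceNow
# connector source and ApplyMapping nodes configured above
s3_sink = glueContext.getSink(
    path="s3://YOUR_BUCKET_NAME/servicenow/incident/",  # placeholder target location
    connection_type="s3",
    updateBehavior="LOG",  # keep existing schema and add new partitions on later runs
    partitionKeys=[],
    compression="gzip",
    enableUpdateCatalog=True,
    transformation_ctx="s3_sink",
)
s3_sink.setCatalogInfo(catalogDatabase="sampledb", catalogTableName="incident")
s3_sink.setFormat("glueparquet")
s3_sink.writeFrame(servicenow_dyf)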

Edit, save, and run the job

Edit the job by adding and editing the nodes in the job graph. See Editing ETL jobs in AWS Glue Studio for more information.

After you edit the job, enter the job properties.

  1. Choose the Job details tab above the visual graph editor.
  2. For Name, enter a job name.
  3. For IAM Role, choose an IAM role with the necessary permissions, as described previously.
  4. For Type, choose Spark.
  5. For Glue version, choose Glue 3.0 – Supports Spark 3.1, Scala 2, Python 3.
  6. For Language, choose Python 3.
  7. For Worker type, choose G.1X.
  8. For Requested number of workers, enter 2.
  9. For Number of retries, enter 1.
  10. For Job timeout (minutes), enter 3.
  11. Use the default values for the other parameters.

For more information about job parameters, see Defining Job Properties for Spark Jobs.

  12. After you save the job, choose Run to run the job.

Note – Running the AWS Glue job incurs cost. You can learn more about AWS Glue pricing here.

To view the generated script for the job, choose the Script tab at the top of the visual editor. The Job runs tab shows the job run history for the job. For more information about job run details, see View information for recent job runs.

Query against the data lake using Athena

After the job is complete, you can query the data in Athena.

  1. On the Athena console, choose the sampledb database.

You can view the newly created table called incident.

  2. Choose the options icon (three vertical dots) and choose Preview table to view the data.

Now let’s perform some analyses.

  3. Find all the incident tickets that are escalated by running the following query:
    SELECT task_effective_number FROM "sampledb"."incident" 
    where escalation = 2;

  4. Find ticket count with priority:
    SELECT priority, count(distinct task_effective_number)  FROM "sampledb"."incident"
    group by priority
    order by priority asc
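If you want to run these queries outside the Athena console, the following is a minimal boto3 sketch; the query results output location is a placeholder.

import time

import boto3

athena = boto3.client("athena")

# Run the escalated-incidents query programmatically
run = athena.start_query_execution(
    QueryString='SELECT task_effective_number FROM "sampledb"."incident" WHERE escalation = 2',
    QueryExecutionContext={"Database": "sampledb"},
    ResultConfiguration={"OutputLocation": "s3://YOUR_BUCKET_NAME/athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"Returned {len(rows) - 1} rows")  # the first row is the column header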

Conclusion

In this post, we demonstrated how you can use an AWS Glue Studio connector to connect from ServiceNow and bring data into your data lake for further use cases.

AWS Glue provides built-in support for the most commonly used data stores (such as Amazon Redshift, Amazon Aurora, Microsoft SQL Server, MySQL, MongoDB, and PostgreSQL) using JDBC connections. AWS Glue also allows you to use custom JDBC drivers in your extract, transform, and load (ETL) jobs. For data stores that are not natively supported, such as SaaS applications, you can use connectors.

To learn more, refer to the AWS Glue Studio User Guide and the Athena User Guide.


About the Authors

Navnit Shukla is an AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He builds solutions to help organizations make data-driven decisions.

Srikanth Sopirala is a Principal Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytic solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road biking.

Naresh Gautam is a Principal Solutions Architect at AWS. His role is helping customers architect highly available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys meditation and cooking.

Enforce customized data quality rules in AWS Glue DataBrew

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/enforce-customized-data-quality-rules-in-aws-glue-databrew/

GIGO (garbage in, garbage out) is a concept common to computer science and mathematics: the quality of the output is determined by the quality of the input. In modern data architecture, you bring data from different data sources, which creates challenges around volume, velocity, and veracity. You might write unit tests for applications, but it’s equally important to ensure the data veracity of these applications, because incoming data quality can make or break your application. Incorrect, missing, or malformed data can have a large impact on production systems. Examples of data quality issues include but are not limited to the following:

  • Missing or incorrect values can cause failures in production systems that require non-null values
  • Changes in the distribution of data can lead to unexpected outputs of machine learning (ML) models
  • Aggregations of incorrect data can lead to wrong business decisions
  • Incorrect data types have a big impact on financial or scientific institutions

In this post, we introduce data quality rules in AWS Glue DataBrew. DataBrew is a visual data preparation tool that makes it easy to profile and prepare data for analytics and ML. We demonstrate how to use DataBrew to define a list of rules in a new entity called a ruleset. A ruleset is a set of rules that compare different data metrics against expected values.

The post describes the implementation process and provides a step-by-step guide to build data quality checks in DataBrew.

Solution overview

To illustrate our data quality use case, we use a human resources dataset. This dataset contains the following attributes:

Emp ID, Name Prefix, First Name, Middle Initial,Last Name,Gender,E Mail,Father's Name,Mother's Name,Mother's Maiden Name,Date of Birth,Time of Birth,Age in Yrs.,Weight in Kgs.,Date of Joining,Quarter of Joining,Half of Joining,Year of Joining,Month of Joining,Month Name of Joining,Short Month,Day of Joining,DOW of Joining,Short DOW,Age in Company (Years),Salary,Last % Hike,SSN,Phone No. ,Place Name,County,City,State,Zip,Region,User Name,Password

For this post, we downloaded data with 5 million records, but feel free to use a smaller dataset to follow along with this post.

The following diagram illustrates the architecture for our solution.

The steps in this solution are as follows:

  1. Create a sample dataset.
  2. Create a ruleset.
  3. Create data quality rules.
  4. Create a profile job.
  5. Inspect the data quality rules validation results.
  6. Clean the dataset.
  7. Create a DataBrew job.
  8. Validate the data quality check with the updated dataset.

Prerequisites

Before you get started, complete the following prerequisites:

  1. Have an AWS account.
  2. Download the sample dataset.
  3. Extract the CSV file.
  4. Create an Amazon Simple Storage Service (Amazon S3) bucket with three folders: input, output, and profile.
  5. Upload the sample data to the input folder in your S3 bucket (for example, s3://<s3 bucket name>/input/).

Create a sample dataset

To create your dataset, complete the following steps:

  1. On the DataBrew console, in the navigation pane, choose Datasets.
  2. Choose Connect new dataset.
  3. For Dataset name, enter a name (for example, human-resource-dataset).
  4. Under Data lake/data store, choose Amazon S3 as your source.
  5. For Enter your source from Amazon S3, enter the S3 bucket location where you uploaded your sample files (for example, s3://<s3 bucket name>/input/).
  6. Under Additional configurations, keep the selected file type CSV and CSV delimiter comma (,).
  7. Scroll to the bottom of the page and choose Create dataset.

The dataset is now available on the Datasets page.

Create a ruleset

We now define data quality rulesets against the dataset created in the previous step.

  1. On the DataBrew console, in the navigation pane, choose DQ Rules.
  2. Choose Create data quality ruleset.
  3. For Ruleset name, enter a name (for example, human-resource-dataquality-ruleset).
  4. Under Associated dataset, choose the dataset you created earlier.

Create data quality rules

You can add multiple rules to a ruleset, and within each rule, you can define multiple data quality checks.

For this post, we create the following data quality rules and data quality checks within the rules:

  • Row count is correct
  • No duplicate rows
  • Employee ID, email address, and SSN are unique
  • Employee ID and phone number are not null
  • Employee ID and employee age in years have no negative values
  • SSN data format is correct (123-45-6789)
  • Phone number string length is correct
  • Region column only has the specified region
  • Employee ID is an integer

Row count is correct

To check the total row count, complete the following steps:

  1. Add a new rule.
  2. For Rule name, enter a name (for example, Check total record count).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. For Data quality check, choose Number of rows.
  6. For Condition, choose Is equals.
  7. For Value, enter 5000000.

No duplicate rows

To check the dataset for duplicate rows, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check dataset for duplicate rows).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Duplicate rows.
  6. For Condition, choose Is equals.
  7. For Value, enter 0 and choose rows on the drop-down menu.

Employee ID, email address, and SSN are unique

To check that the employee ID, email, and SSN are unique, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check dataset for Unique Values).
  3. For Data quality check scope, choose Common checks for selected columns.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. For Selected columns, select Selected columns.
  6. Choose the columns Emp ID, E Mail, and SSN.
  7. Under Check 1, for Data quality check, choose Unique values.
  8. For Condition, choose Is equals.
  9. For Value, enter 100 and choose %(percent) rows on the drop-down menu.

Employee ID and phone number are not null

To check that employee IDs and phone numbers aren’t null, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check Dataset for NOT NULL).
  3. For Data quality check scope, choose Common checks for selected columns.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. For Selected columns, select Selected columns.
  6. Choose the columns Emp ID and Phone No.
  7. Under Check 1, for Data quality check, choose Value is not missing.
  8. For Condition, choose Greater than equals.
  9. For Threshold, enter 100 and choose %(percent) rows on the drop-down menu.

Employee ID and age in years have no negative values

To check the employee ID and age for positive values, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check emp ID and age for positive values).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Numeric values.
  6. Choose Emp ID on the drop-down menu.
  7. For Condition, choose Greater than equals.
  8. For Value, select Custom value and enter 0.
  9. Choose Add another quality check and repeat the same steps for the Age in Yrs. column.

SSN data format is correct

To check the SSN data format, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check dataset format).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose String values.
  6. Choose SSN on the drop-down menu.
  7. For Condition, choose Matches (RegEx pattern).
  8. For RegEx value, enter ^[0-9]{3}-[0-9]{2}-[0-9]{4}$.
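
You can sanity-check the regular expression locally before saving the rule; the following short Python snippet applies the same pattern to a few made-up values.

    import re

    SSN_PATTERN = re.compile(r"^[0-9]{3}-[0-9]{2}-[0-9]{4}$")

    # Made-up sample values: one valid format, two invalid
    for value in ["123-45-6789", "123456789", "12-345-6789"]:
        print(value, "matches" if SSN_PATTERN.match(value) else "does not match")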

Phone number string length is correct

To check the length of the phone number, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check Dataset Phone no. for string length).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Value string length.
  6. Choose Phone No on the drop-down menu.
  7. For Condition, choose Greater than equals.
  8. For Value, select Custom value and enter 9.
  9. Under Check 2, for Data quality check, choose Value string length.
  10. Choose Phone No on the drop-down menu.
  11. For Condition, choose Less than equals.
  12. For Value¸ select Custom value and enter 12.

Region column only has the specified region

To check the Region column, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check Region column only for specific region).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Value is exactly.
  6. Choose Region on the drop-down menu.
  7. For Value, select Custom value.
  8. Choose the values Midwest, South, West, and Northeast.

Employee ID is an integer

To check that the employee ID is an integer, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Validate Emp ID is an Integer).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose String values.
  6. Choose Emp ID on the drop-down menu.
  7. For Condition, choose Matches (RegEx pattern).
  8. For RegEx value, enter ^[0-9]+$.
  9. After you create all the rules, choose Create ruleset.

Your ruleset is now listed on the Data quality rulesets page.
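
You can also define rulesets programmatically with the boto3 DataBrew client. The following is a minimal sketch rather than a full reproduction of the nine rules above: the account ID and dataset ARN are placeholders, and the check expressions are illustrative, so verify them against the DataBrew expression syntax before using them.

    import boto3

    databrew = boto3.client("databrew")

    # Placeholder ARN for the dataset created earlier
    dataset_arn = ("arn:aws:databrew:us-east-1:111122223333:"
                   "dataset/human-resource-dataset")

    databrew.create_ruleset(
        Name="human-resource-dataquality-ruleset",
        TargetArn=dataset_arn,
        Rules=[
            {
                # Emp ID values must be non-negative
                "Name": "Check emp ID for positive values",
                "CheckExpression": ":col1 >= :val1",
                "SubstitutionMap": {":col1": "`Emp ID`", ":val1": "0"},
            },
            {
                # SSN must match the 123-45-6789 format (illustrative expression)
                "Name": "Check SSN format",
                "CheckExpression": ":col1 matches :val1",
                "SubstitutionMap": {":col1": "`SSN`",
                                    ":val1": "\"^[0-9]{3}-[0-9]{2}-[0-9]{4}$\""},
            },
        ],
    )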

Create a profile job

To create a profile job with your new ruleset, complete the following steps:

  1. On the Data quality rulesets page, select the ruleset you just created.
  2. Choose Create profile job with ruleset.
  3. For Job name, keep the prepopulated name or enter a new one.
  4. For Data sample, select Full dataset.

The sample size setting is important for data quality rule validation, because it determines whether you validate all the rows or only a limited sample.

  1. Under Job output settings, for S3 location, enter the path to the profile bucket.

If you enter a new bucket name, the folder is created automatically.

  1. Keep the default settings for the remaining optional sections: Data profile configurations, Data quality rules, Advanced job settings, Associated schedules, and Tags.

The next step is to choose or create the AWS Identity and Access Management (IAM) role that grants DataBrew access to read from the input S3 bucket and write to the job output bucket.

  1. For Role name, choose an existing role or choose Create a new IAM role and enter an IAM role suffix.
  2. Choose Create and run job.

For more information about configuring and running DataBrew jobs, see Creating, running, and scheduling AWS Glue DataBrew jobs.
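
For automation, the same profile job can be created and started with boto3. The following is a sketch with placeholder account, bucket, and role values; it assumes the ruleset created in the previous step.

    import boto3

    databrew = boto3.client("databrew")

    databrew.create_profile_job(
        Name="human-resource-dataset-profile-job",
        DatasetName="human-resource-dataset",
        RoleArn="arn:aws:iam::111122223333:role/DataBrewDemoRole",  # placeholder role
        OutputLocation={"Bucket": "<s3 bucket name>", "Key": "profile/"},
        JobSample={"Mode": "FULL_DATASET"},  # profile the full dataset, not a sample
        # Attach the data quality ruleset so the profile run validates the rules
        ValidationConfigurations=[
            {
                "RulesetArn": ("arn:aws:databrew:us-east-1:111122223333:"
                               "ruleset/human-resource-dataquality-ruleset"),
                "ValidationMode": "CHECK_ALL",
            }
        ],
    )
    databrew.start_job_run(Name="human-resource-dataset-profile-job")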

Inspect data quality rules validation results

To inspect the data quality rules, we need to let the profile job complete.

  1. On the Jobs page of the DataBrew console, choose the Profile jobs tab.
  2. Wait until the profile job status changes to Succeeded.
  3. When the job is complete, choose View data profile.

You’re redirected to the Data profile overview tab on the Datasets page.

  1. Choose the Data quality rules tab.

Here you can review the status of your data quality rules. As shown in the following screenshot, eight of the nine data quality rules we defined were successful, and one rule failed.

Our failed data quality rule indicates that we found duplicate values for employee ID, SSN, and email.

  1. To confirm that the data has duplicate values, on the Column statistics tab, choose the Emp ID column.
  2. Scroll down to the section Top distinct values.

Similarly, you can check the E Mail and SSN columns to find that those columns also have duplicate values.

Now we have confirmed that our data has duplicate values. The next step is to clean up the dataset and rerun the quality rules validation.

Clean the dataset

To clean the dataset, we first need to create a project.

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project name, enter a name (for this post, human-resource-project-demo).
  4. For Select a dataset, select My datasets.
  5. Select the human-resource-dataset dataset.
  6. Keep the sampling size at its default.
  7. Under Permissions, for Role name, choose the IAM role that we created previously for our DataBrew profile job.
  8. Choose Create project.

The project takes a few minutes to open. When it’s complete, you can see your data.

Next, we delete the duplicate value from the Emp ID column.

  1. Choose the Emp ID column.
  2. Choose the more options icon (three dots) to view all the transforms available for this column.
  3. Choose Remove duplicate values.
  4. Repeat these steps for the SSN and E Mail columns.

You can now see the three applied steps in the Recipe pane.

Create a DataBrew job

The next step is to create a DataBrew job to run these transforms against the full dataset.

  1. On the project details page, choose Create job.
  2. For Job name, enter a name (for example, human-resource-after-dq-check).
  3. Under Job output settings, for File type, choose CSV as the final storage format.
  4. For S3 location, enter your output S3 bucket location (for example, s3://<s3 bucket name>/output/).
  5. For Compression, choose None.
  6. Under Permissions, for Role name, choose the same IAM role we used previously.
  7. Choose Create and run job.
  8. Wait for the job to complete; you can monitor the job on the Jobs page.
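
If you prefer to script this step, the following boto3 sketch creates and starts an equivalent recipe job from the project; the role ARN and bucket name are placeholders.

    import boto3

    databrew = boto3.client("databrew")

    databrew.create_recipe_job(
        Name="human-resource-after-dq-check",
        ProjectName="human-resource-project-demo",  # runs the recipe attached to the project
        RoleArn="arn:aws:iam::111122223333:role/DataBrewDemoRole",  # placeholder role
        Outputs=[
            {
                "Location": {"Bucket": "<s3 bucket name>", "Key": "output/"},
                "Format": "CSV",  # no compression, matching the console settings above
            }
        ],
    )
    databrew.start_job_run(Name="human-resource-after-dq-check")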

Validate the data quality check with the corrected dataset

To perform the data quality checks with the corrected dataset, complete the following steps:

  1. Follow the steps outlined earlier to create a new dataset, using the corrected data from the previous section.
  2. Choose the Amazon S3 location of the job output.
  3. Choose Create dataset.
  4. Choose DQ Rules and select the ruleset you created earlier.
  5. On the Actions menu, choose Duplicate.
  6. For Ruleset name, enter a name (for example, human-resource-dataquality-ruleset-on-corrected-dataset).
  7. Select the newly created dataset.
  8. Choose Create data quality ruleset.
  9. After the ruleset is created, select it and choose Create profile job with ruleset.
  10. Create a new profile job.
  11. Choose Create and run job.
  12. When the job is complete, repeat the steps from earlier to inspect the data quality rules validation results.

This time, under Data quality rules, all the rules pass except Check total record count, because removing the duplicate rows reduced the total row count below 5 million.

On the Column statistics page, under Top distinct values for the Emp ID column, you can see the distinct values.

You can find similar results for the SSN and E Mail columns.

Clean up

To avoid incurring future charges, we recommend you delete the resources you created during this walkthrough.

Conclusion

As demonstrated in this post, you can use DataBrew to create data quality rules, which can help you identify discrepancies in your data. You can also use DataBrew to clean the data and validate it going forward. To learn more, see the AWS Glue DataBrew documentation and AWS Glue DataBrew pricing.


About the Authors

Navnit Shukla is an AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Harsh Vardhan Singh Gaur is an AWS Solutions Architect, specializing in Analytics. He has over 5 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Query SAP HANA using Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-sap-hana-using-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use SAP HANA as your transactional data store, you may need to join the data in your data lake with SAP HANA in the cloud, SAP HANA running on Amazon Elastic Compute Cloud (Amazon EC2), or with an on-premises SAP HANA, for example to build a dashboard or create consolidated reporting.

In such use cases, Amazon Athena Federated Query allows you to seamlessly access the data from SAP HANA database without building ETL pipelines to copy or unload the data to the S3 data lake or SAP HANA. This removes the overhead of creating additional extract, transform, and load (ETL) processes and shortens the development cycle.

In this post, we walk you through a step-by-step configuration to set up Amazon Athena Federated Query using AWS Lambda to access data in a SAP HANA database running on AWS.

For this post, we use the SAP HANA Athena Federated Query connector developed by Trianz, which you can deploy from the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn't fit into the Lambda function's memory, the connector spills it to Amazon S3, where Athena accesses it later.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the SAP HANA instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the SAP HANA instance.
  4. Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have a SAP HANA database up and running on AWS.

Create a secret for the SAP HANA instance

Our first step is to create a secret for the SAP HANA instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Set the credentials as key-value pairs (username, password) for your SAP HANA instance.
  5. For Secret name, enter a name for your secret. Use the prefix SAPHANAAFQ so it’s easy to find.
  6. Leave the remaining fields at their defaults and choose Next.
  7. Complete your secret creation.
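
If you script this step, the equivalent Secrets Manager call with boto3 looks like the following; the secret name and credentials are placeholders.

    import json
    import boto3

    secretsmanager = boto3.client("secretsmanager")

    secretsmanager.create_secret(
        Name="SAPHANAAFQ-demo",  # placeholder name using the recommended prefix
        SecretString=json.dumps({
            "username": "<saphana_user>",
            "password": "<saphana_password>",
        }),
    )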

Setting up your S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use.

For this post, we use the S3 bucket and folder athena-accelerator/saphana.

Configure Athena federation with the SAP HANA instance

To configure Athena federation with your SAP HANA instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. In the search field, enter TrianzSAPHANAAthenaJDBC.

In the Application settings section, provide the following details:

  1. For Application name, enter TrianzSAPHANAAthenaJDBC.
  2. For SecretNamePrefix, enter trianz-saphana-athena-jdbc.
  3. For SpillBucket, enter athena-accelerator/saphana.

For JDBCConnectorConfig, use the format saphana://jdbc:sap://{saphana_instance_url}/?${secretname}.

  1. For DisableSpillEncryption, choose False.
  2. For LambdaFunctionName, enter trsaphana.
  3. For SecurityGroupID, enter the security group ID that allows Lambda to connect to the SAP HANA instance.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, enter the subnets (comma separated) through which Lambda can connect to the SAP HANA instance.

Make sure the subnet is in a VPC and has a NAT gateway and internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, Amazon CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your federated queries using lambda:trsaphana against tables in the SAP HANA database. trsaphana is the name of the Lambda function that we specified for LambdaFunctionName in the previous section.

lambda:trsaphana references the data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following screenshot demonstrates joining the dataset between SAP HANA and the data lake.
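
As a concrete illustration of that kind of join, the following sketch submits a federated query through the Athena API. The schema, table, and column names are hypothetical, and the results location is a placeholder; the SAP HANA tables are addressed through the lambda:trsaphana catalog.

    import boto3

    athena = boto3.client("athena")

    # Hypothetical join: orders in SAP HANA (via the trsaphana connector) with
    # customer data in the S3 data lake registered in the AWS Glue Data Catalog.
    sql = """
    SELECT c.customer_name, o.order_id, o.order_amount
    FROM "lambda:trsaphana".salesdb.orders o
    JOIN "AwsDataCatalog"."datalake"."customer" c
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= DATE '2021-01-01'
    """

    athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://<s3 bucket name>/athena-results/"},
    )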

Key performance best practice considerations

If you’re considering Athena federation with a SAP HANA database, we recommend the following best practices:

  • Athena federation works great for queries with predicate filtering because the predicates are pushed down to the SAP HANA database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from the SAP HANA database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from SAP HANA to your S3 data lake.
  • Star schema is a commonly used data model in SAP HANA databases. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in your SAP HANA database. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the SAP HANA database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your SAP HANA database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated query with SAP HANA using Lambda. Now you don’t need to wait for all the data in your SAP HANA data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practice considerations outlined in the post to help minimize the data transferred from SAP HANA for better performance. When queries are well-written for Federated query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is an AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Query a Teradata database using Amazon Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-a-teradata-database-using-amazon-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Teradata as your transactional data store, you may need to join the data in your data lake with Teradata in the cloud, Teradata running on Amazon Elastic Compute Cloud (Amazon EC2), or with an on-premises Teradata database, for example to build a dashboard or create consolidated reporting.

In these use cases, the Amazon Athena Federated Query feature allows you to seamlessly access the data from Teradata database without having to move the data to your S3 data lake. This removes the overhead in managing such jobs.

In this post, we will walk you through a step-by-step configuration to set up Athena Federated Query using AWS Lambda to access data in a Teradata database running on premises.

For this post, we will be using the Teradata Athena Federated Query connector developed by Trianz. The runtime includes a Teradata instance on premises. Your Teradata instance can be in the cloud, on Amazon EC2, or on premises. You can deploy the Trianz Teradata Athena Federated Query connector from the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena Federated Query works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn't fit into the Lambda function's memory, the connector spills it to Amazon S3, where Athena accesses it later.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Teradata instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Teradata instance.
  4. Run federated queries with Athena.

Prerequisite

Before you start this walkthrough, make sure your Teradata database is up and running.

Create a secret for the Teradata instance

Our first step is to create a secret for the Teradata instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Set the credentials as key-value pairs (username, password) for your Teradata instance.

  1. For Secret name, enter a name for your secret. Use the prefix TeradataAFQ so it’s easy to find.
  2. Leave the remaining fields at their defaults and choose Next.
  3. Complete your secret creation.

Set up your S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we create athena-accelerator/teradata.

Configure Athena federation with the Teradata instance

To configure Athena federation with your Teradata instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. Select Show apps that create custom IAM roles or resource policies.
  3. In the search field, enter TrianzTeradataAthenaJDBC.
  4. Choose the application.

  1. For SecretNamePrefix, enter TeradataAFQ.
  2. For SpillBucket, enter athena-accelerator/teradata.
  3. For JDBCConnectorConfig, use the format teradata://jdbc:teradata://hostname/user=testUser&password=testPassword.
  4. For DisableSpillEncryption, enter false.
  5. For LambdaFunctionName, enter teradataconnector.
  6. For SecurityGroupID, enter the security group ID where the Teradata instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, use the subnets where the Teradata instance is running with comma separation.

Make sure the subnet is in a VPC and has NAT gateway and internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, Amazon CloudTrail, Secrets Manager, Lambda, and Athena. For more information about Athena IAM access, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your queries using lambda:teradataconnector against tables in the Teradata database. teradataconnector is the name of the Lambda function that we specified for LambdaFunctionName in the previous section.

lambda:teradataconnector references a data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following screenshot shows the query that joins the dataset between Teradata and the S3 data lake.
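
For reference, a query of that shape might look like the following sketch; the database, table, and column names are hypothetical, and the Teradata tables are addressed through the lambda:teradataconnector catalog.

    import boto3

    athena = boto3.client("athena")

    # Hypothetical join between a Teradata table (via the teradataconnector function)
    # and a table in the S3 data lake; the WHERE predicate is pushed down to Teradata.
    sql = """
    SELECT c.customer_name, o.order_id, o.order_amount
    FROM "lambda:teradataconnector".sales.orders o
    JOIN "AwsDataCatalog"."datalake"."customer" c
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= DATE '2021-01-01'
    """

    athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://<s3 bucket name>/athena-results/"},
    )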

Key performance best practices

If you’re considering Athena Federated Query with Teradata, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Teradata database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from the Teradata database to Athena (which could lead to query timeouts or slow performance), you may consider moving data from Teradata to your S3 data lake.
  • The star schema is a commonly used data model in Teradata. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Teradata. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Teradata database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Teradata database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated Query with Teradata. Now you don’t need to wait for all the data in your Teradata data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practices outlined in the post to help minimize the data transferred from Teradata for better performance. When queries are well written for Athena Federated Query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is an AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Query Snowflake using Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-snowflake-using-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Snowflake as your data warehouse solution, you may need to join your data in your data lake with Snowflake. For example, you may want to build a dashboard by joining historical data in your Amazon S3 data lake and the latest data in your Snowflake data warehouse or create consolidated reporting.

In such use cases, Amazon Athena Federated Query allows you to seamlessly access the data from Snowflake without building ETL pipelines to copy or unload the data to the S3 data lake or Snowflake. This removes the overhead of creating additional extract, transform, and load (ETL) processes and shortens the development cycle.

In this post, we will walk you through a step-by-step configuration to set up Athena Federated Query using AWS Lambda to access data in a Snowflake data warehouse.

For this post, we are using the Snowflake connector for Amazon Athena developed by Trianz.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data Federation refers to the capability to query data in another data store using a single interface (Amazon Athena). The following diagram depicts how a single Amazon Athena federated query uses Lambda to query the underlying data source and parallelizes execution across many workers.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn't fit into the Lambda function's memory, the connector spills it to Amazon S3, where Athena accesses it later.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Snowflake instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Snowflake instance.
  4. Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have a Snowflake data warehouse up and running.

Create a secret for the Snowflake instance

Our first step is to create a secret for the Snowflake instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Enter the credentials as key-value pairs (username, password) for your Snowflake instance.
  5. For Secret name, enter a name for your secret. Use the prefix snowflake so it’s easy to find.

  1. Leave the remaining fields at their defaults and choose Next.
  2. Complete your secret creation.

Create an S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we use athena-accelerator/snowflake.

Configure Athena federation with the Snowflake instance

To configure the Athena data source connector for Snowflake with your Snowflake instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. In the search field, enter TrianzSnowflakeAthenaJDBC.

  1. For Application name, enter TrianzSnowflakeAthenaJDBC.
  2. For SecretNamePrefix, enter trianz-snowflake-athena.
  3. For SpillBucket, enter athena-accelerator/snowflake.
  4. For JDBCConnectorConfig, use the format snowflake://jdbc:snowflake://{snowflake_instance_url}/?warehouse={warehousename}&db={databasename}&schema={schemaname}&${secretname}.

For example, we enter snowflake://jdbc:snowflake://trianz.snowflakecomputing.com/?warehouse=ATHENA_WH&db=ATHENA_DEV&schema=ATHENA&${trianz-snowflake-athena}.

  5. For DisableSpillEncryption, enter False.

  1. For LambdaFunctionName, enter trsnowflake.
  2. For SecurityGroupID, enter the security group ID where the Snowflake instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, use the subnets where the Snowflake instance is running with comma separation.

Make sure the subnet is in a VPC and has NAT gateway and internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Before running your federated query, be sure that you have selected Athena engine version 2. The current Athena engine version for any workgroup can be found in the Athena console page.

Run your federated queries using lambda:trsnowflake against tables in the Snowflake database. trsnowflake is the name of the Lambda function that we specified for LambdaFunctionName in the previous section.

lambda:trsnowflake is a reference data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following screenshot is a UNION ALL query example of data in Amazon S3 with a table in the AWS Glue Data Catalog and a table in Snowflake.
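
A sketch of such a UNION ALL query, submitted through the Athena API, is shown below; the database, schema, table, and column names are hypothetical, and the Snowflake tables are addressed through the lambda:trsnowflake catalog.

    import boto3

    athena = boto3.client("athena")

    # Hypothetical union of historical orders in the S3 data lake (Glue Data Catalog)
    # with the latest orders in Snowflake, accessed through the trsnowflake connector.
    sql = """
    SELECT order_id, order_amount, 'datalake' AS source
    FROM "AwsDataCatalog"."datalake"."orders_history"
    UNION ALL
    SELECT order_id, order_amount, 'snowflake' AS source
    FROM "lambda:trsnowflake".athena.orders_latest
    """

    athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://<s3 bucket name>/athena-results/"},
    )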

Key performance best practices

If you’re considering Athena Federated Query with Snowflake, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Snowflake database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from Snowflake to Athena (which could lead to query timeouts or slow performance), you may consider copying data from Snowflake to your S3 data lake.
  • The Snowflake schema, which is an extension of the star schema, is used as a data model in Snowflake. In the Snowflake schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Snowflake. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Snowflake database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Snowflake database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated Query with Snowflake using Lambda. With Athena Federated Query, you can use all of your data to produce analytics and derive business value without building ETL pipelines to bring data from data stores such as Snowflake into your data lake.

You can use the best practice considerations outlined in the post to help minimize the data transferred from Snowflake for better performance. When queries are well written for federation, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is an AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Query your Oracle database using Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-your-oracle-database-using-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Oracle as your transactional data store, you may need to join the data in your data lake with Oracle on Amazon Relational Database Service (Amazon RDS), Oracle running on Amazon Elastic Compute Cloud (Amazon EC2), or an on-premises Oracle database, for example to build a dashboard or create consolidated reporting.

In these use cases, Amazon Athena Federated Query allows you to seamlessly access the data from your Oracle database without having to move the data to the S3 data lake. This removes the overhead in managing such jobs.

In this post, we walk you through a step-by-step configuration to set up Athena Federated query using AWS Lambda to access data in Oracle on Amazon RDS.

For this post, we will be using the Oracle Athena Federated query connector developed by Trianz. The runtime includes Oracle XE running on Amazon EC2 and Amazon RDS. Your Oracle instance can be on Amazon RDS, Amazon EC2, or on premises. You can deploy the Trianz Oracle AFQ connector available in the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Amazon Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn't fit into the Lambda function's memory, the connector spills it to Amazon S3, where Athena accesses it later.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Oracle instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Oracle XE instance.
  4. Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have an Oracle database up and running.

Create a secret for the Oracle instance

Our first step is to create a secret for the Oracle instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Enter the credentials as key-value pairs (username, password) for your Oracle XE instance.

  1. For Secret name, enter a name for your secret. Use the prefix OracleAFQ so it’s easy to find.
  2. Leave the remaining fields at their defaults and choose Next.
  3. Complete your secret creation.

Create an S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, I use athena-accelerator/oracle.

Configure Athena federation with the Oracle XE instance

To configure Athena federation with your Oracle instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. In the search field, enter TrianzOracleAthenaJDBC.

  1. For Application name, enter TrianzOracleAthenaJDBC.
  2. For SecretNamePrefix, enter OracleAFQ_XE.
  3. For SpillBucket, enter athena-accelerator/oracle.
  4. For JDBCConnectorConfig, use the format oracle://jdbc:oracle:thin:${secretname}@//hostname:port/servicename.

For example, we enter oracle://jdbc:oracle:thin:${OracleAFQ_XE}@//12.345.67.89:1521/xe.

  1. For DisableSpillEncryption, enter false.
  2. For LambdaFunctionName, enter oracleconnector.
  3. For SecurityGroupID, enter the security group ID where the Oracle instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, use the subnets where the Oracle instance is running with comma separation.

Make sure the subnet is in a VPC and has NAT gateway and internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your federated queries using lambda:oracleconnector against tables in the Oracle database. oracleconnector is the name of the Lambda function that we specified for LambdaFunctionName in the previous section.

lambda:oracleconnector references the data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following query joins the dataset between Oracle and the S3 data lake.
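
For illustration, a join of this shape could be submitted through the Athena API as in the following sketch; the schema, table, and column names are hypothetical, and the Oracle tables are addressed through the connector's Lambda catalog.

    import boto3

    athena = boto3.client("athena")

    # Hypothetical join between an Oracle table (via the oracleconnector function)
    # and a customer table in the S3 data lake registered in the AWS Glue Data Catalog.
    sql = """
    SELECT c.customer_name, o.order_id, o.order_amount
    FROM "lambda:oracleconnector".sales.orders o
    JOIN "AwsDataCatalog"."datalake"."customer" c
      ON o.customer_id = c.customer_id
    """

    athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://<s3 bucket name>/athena-results/"},
    )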

Key performance best practices

If you’re considering Athena Federated query with Oracle, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Oracle database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from the Oracle database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from Oracle to your S3 data lake.
  • The star schema is a commonly used data model in Oracle. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Oracle. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Oracle database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Oracle database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated query with Oracle. Now you don’t need to wait for all the data in your Oracle data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practice considerations outlined in the post to help minimize the data transferred from Oracle for better performance. When queries are well written for Federated query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.