Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview

2024-07-08 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/amazon-datazone-introduces-openlineage-compatible-data-lineage-visualization-in-preview/

We are excited to announce the preview of API-driven, OpenLineage-compatible data lineage in Amazon DataZone to help you capture, store, and visualize lineage of data movement and transformations of data assets on Amazon DataZone.

With the Amazon DataZone OpenLineage-compatible API, domain administrators and data producers can capture and store lineage events beyond what is available in Amazon DataZone, including transformations in Amazon Simple Storage Service (Amazon S3), AWS Glue, and other AWS services. This provides a comprehensive view for data consumers browsing in Amazon DataZone, who can gain confidence in an asset’s origin, and for data producers, who can assess the impact of changes to an asset by understanding its usage.

In this post, we discuss the latest features of data lineage in Amazon DataZone, its compatibility with OpenLineage, and how to get started capturing lineage from other services such as AWS Glue, Amazon Redshift, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA) into Amazon DataZone through the API.

Why it matters to have data lineage

Data lineage gives you an overarching view into data assets, allowing you to see the origin of objects and their chain of connections. Data lineage enables tracking the movement of data over time, providing a clear understanding of where the data originated, how it has changed, and its ultimate destination within the data pipeline. With transparency around data origination, data consumers gain trust that the data is correct for their use case. Data lineage information is captured at levels such as tables, columns, and jobs, allowing you to conduct impact analysis and respond to data issues because, for example, you can see how one field impacts downstream sources. This equips you to make well-informed decisions before committing changes and avoid unwanted changes downstream.

Data lineage in Amazon DataZone is an API-driven, OpenLineage-compatible feature that helps you capture and visualize lineage events from OpenLineage-enabled systems or through an API, to trace data origins, track transformations, and view cross-organizational data consumption. The lineage visualized includes activities inside the Amazon DataZone business data catalog: the cataloged assets and the subscribers to those assets. It also includes activities that happen outside the business data catalog, which are captured programmatically using the API.

Additionally, Amazon DataZone versions lineage with each event, enabling you to visualize lineage at any point in time or compare transformations across an asset’s or job’s history. This historical lineage provides a deeper understanding of how data has evolved, which is essential for troubleshooting, auditing, and enforcing the integrity of data assets.

The following screenshot shows an example lineage graph visualized with the Amazon DataZone data catalog.

Introduction to OpenLineage compatible data lineage

The need to capture data lineage consistently across various analytical services and combine them into a unified object model is key in uncovering insights from the lineage artifact. OpenLineage is an open source project that offers a framework to collect and analyze lineage. It also offers a reference implementation of an object model to persist metadata, along with integrations with major data and analytics tools.

The following are key concepts in OpenLineage:

Lineage events – OpenLineage captures lineage information through a series of events. An event is anything that represents a specific operation performed on the data that occurs in a data pipeline, such as data ingestion, transformation, or data consumption.
Lineage entities – Entities in OpenLineage represent the various data objects involved in the lineage process, such as datasets and tables.
Lineage runs – A lineage run represents a specific run of a data pipeline or a job, encompassing multiple lineage events and entities.
Lineage form types – Form types, or facets, provide additional metadata or context about lineage entities or events, enabling richer and more descriptive lineage information. OpenLineage offers facets for runs, jobs, and datasets, with the option to build custom facets.
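To make these concepts concrete, the following is a minimal sketch of what a single OpenLineage run event can look like when built as a plain Python dictionary. The namespaces, job name, and producer URL are placeholders we made up for illustration, and the schemaURL version is an assumption; check the OpenLineage specification for the version you target.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_namespace, job_name, inputs, outputs, event_type="COMPLETE"):
    """Return a minimal OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # for example START, COMPLETE, or FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},  # one ID per run of the job
        "job": {"namespace": job_namespace, "name": job_name},
        # Datasets (entities) are identified by a namespace and a name
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Placeholder producer and assumed spec version for this sketch
        "producer": "https://example.com/my-lineage-producer",
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event(
    job_namespace="example-namespace",
    job_name="inventory_insights_job",
    inputs=[("s3://example-bucket", "inventory")],
    outputs=[("s3://example-bucket", "inventory_insights")],
)
print(json.dumps(event, indent=2))
```

Facets, when present, are added as a facets object on the run, the job, or an individual dataset.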

The Amazon DataZone data lineage API is OpenLineage compatible and extends OpenLineage’s functionality by providing a materialization endpoint to persist the lineage outputs in an extensible object model. OpenLineage offers integrations for certain sources, and integration of these sources with Amazon DataZone is straightforward because the Amazon DataZone data lineage API understands the format and translates to the lineage data model.

The following diagram illustrates an example of the Amazon DataZone lineage data model.

In Amazon DataZone, every lineage node represents an underlying resource; there is a 1:1 mapping of the lineage node with a logical or physical resource such as a table, view, or asset. A node can represent a specific job with a specific run, a table or asset, or a subscription target.

Each version of a node captures what happened to the underlying resource at that specific timestamp. In Amazon DataZone, lineage not only shares the story of data movement outside it, but it also represents the lineage of activities inside Amazon DataZone, such as asset creation, curation, publishing, and subscription.

To hydrate the lineage model in Amazon DataZone, two types of lineage are captured:

Lineage activities inside Amazon DataZone – This includes assets added to the catalog and published; details about the subscriptions are then captured automatically. When you’re in the producer project context (for example, if the project you’ve selected is the owning project of the asset you are browsing and you’re a member of that project), you will see two states of the dataset node:
- The inventory asset type node defines the asset in the catalog that is in an unpublished stage. Other users can’t subscribe to the inventory asset. To learn more, refer to Creating inventory and published data in Amazon DataZone.
- The published asset type represents the actual asset that is discoverable by data users across the organization. This is the asset type that can be subscribed by other project members. If you are a consumer and not part of the producing project of that asset, you will only see the published asset node.
Lineage activities outside of Amazon DataZone – These can be captured programmatically using the PostLineageEvent API. With these events captured either upstream or downstream of cataloged assets, data producers and consumers get a comprehensive view of data movement to check the origin of data or its consumption. We discuss how to use the API to capture lineage events later in this post.
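As a rough illustration of the programmatic path, the following sketch sends one OpenLineage event to Amazon DataZone through the boto3 post_lineage_event call. The helper name, the domain ID, and the sample event are hypothetical; the client is passed in so that the helper can be tried without AWS credentials.

```python
import json


def post_lineage_event(client, domain_id, event):
    """Send one OpenLineage event (a dict) to Amazon DataZone.

    `client` is expected to behave like boto3.client("datazone").
    """
    return client.post_lineage_event(
        domainIdentifier=domain_id,
        event=json.dumps(event).encode("utf-8"),  # the API takes the event as bytes
    )


def main():
    # Not called automatically: requires boto3 and AWS credentials
    # with permission to post lineage events to the domain.
    import boto3

    datazone = boto3.client("datazone", region_name="us-east-1")
    sample_event = {
        "eventType": "COMPLETE",
        "eventTime": "2024-07-08T00:00:00Z",
        "run": {"runId": "0f1e2d3c-4b5a-4978-8695-a4b3c2d1e0f9"},
        "job": {"namespace": "example-namespace", "name": "example-job"},
        "inputs": [],
        "outputs": [],
        "producer": "https://example.com/my-lineage-producer",  # placeholder
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }
    # "dzd_your_domain_id" is a placeholder for your own domain ID
    print(post_lineage_event(datazone, "dzd_your_domain_id", sample_event))
```

Calling main() from an environment with valid credentials posts the sample event; in practice you would post the events emitted by your own pipelines.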

There are two different types of lineage nodes available in Amazon DataZone:

Dataset node – In Amazon DataZone, lineage visualizes nodes that represent tables and views. Depending on the context of the project, the producers will be able to view both the inventory and published asset, whereas consumers can only view the published asset. When you first open the lineage tab on the asset details page, the cataloged dataset node will be the starting point for lineage graph traversal upstream or downstream. Dataset nodes include lineage nodes automated from Amazon DataZone and custom lineage nodes:
- Automated dataset nodes – These nodes include information about AWS Glue or Amazon Redshift assets published in the Amazon DataZone catalog. They’re automatically generated and include a corresponding AWS Glue or Amazon Redshift icon within the node.
- Custom dataset nodes – These nodes include information about assets that are not published in the Amazon DataZone catalog. They’re created manually by domain administrators (producers) and are represented by a default custom asset icon within the node. These are essentially custom lineage nodes created using the OpenLineage event format.
Job (run) node – This node captures the details of the job, which represents the latest run of a particular job and its run details. This node also captures multiple runs of the job and can be viewed on the History tab of the node details. Node details are made visible when you choose the icon.

Visualizing lineage in Amazon DataZone

Amazon DataZone offers a comprehensive experience for data producers and consumers. The asset details page provides a graphical representation of lineage, making it straightforward to visualize data relationships upstream or downstream. The asset details page provides the following capabilities to navigate the graph:

Column-level lineage – You can expand column-level lineage when available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.
Column search – If the dataset has more than 10 columns, the node presents pagination to navigate to columns not initially presented. To quickly view a particular column, you can search on the dataset node that lists just the searched column.
View dataset nodes only – If you want to filter out the job nodes, you can choose the Open view control icon in the graph viewer and toggle the Display dataset nodes only option. This removes all the job nodes from the graph and lets you navigate just the dataset nodes.
Details pane – Each lineage node captures and displays the following details:
- Every dataset node has three tabs: Lineage info, Schema, and History. The History tab lists the different versions of lineage event captured for that node.
- The job node has a details pane to display job details with the tabs Job info and History. The details pane also captures queries or expressions run as part of the job.
Version tabs – All lineage nodes in Amazon DataZone data lineage will have versioning, captured as history, based on lineage events captured. You can view lineage at a selected timestamp that opens a new tab on the lineage page to help compare or contrast between the different timestamps.

The following screenshot shows an example of data lineage visualization.

You can experience the visualization with sample data by choosing Preview on the Lineage tab and choosing the Try sample lineage link. This opens a new browser tab with sample data to test and learn about the feature with or without a guided tour, as shown in the following screenshot.

Solution overview

Now that we understand the capabilities of the new data lineage feature in Amazon DataZone, let’s explore how you can get started in capturing lineage from AWS Glue tables and ETL (extract, transform, and load) jobs, Amazon Redshift, and Amazon MWAA.

The getting started scripts are also available in Amazon DataZone’s new GitHub repository.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
An AWS Identity and Access Management (IAM) user with access to the following services:
- AWS Cloud9
- AWS CloudFormation
- AWS CloudShell
- Amazon DataZone
- AWS Glue
- Amazon S3
An Amazon DataZone domain
An Amazon DataZone project, with a data lake environment created

If the AWS account you use to follow this post uses AWS Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.

Launch the CloudFormation stack

To create your resources for this use case using AWS CloudFormation, complete the following steps:

Launch the CloudFormation stack in us-east-1:
For Stack name, enter a name for your stack.
Choose Next.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

Wait for the stack to finish provisioning the resources. When you see the CREATE_COMPLETE status, you can proceed to the next steps.

Capture lineage from AWS Glue tables

For this example, we use CloudShell, which is a browser-based shell, to run the commands necessary to harvest lineage metadata from AWS Glue tables. Complete the following steps:

On the AWS Glue console, choose Crawlers in the navigation pane.
Select the AWSomeRetailCrawler crawler created by the CloudFormation template.
Choose Run.

When the crawler is complete, you’ll see a Succeeded status.

Now let’s harvest the lineage metadata using CloudShell.

Download the extract_glue_crawler_lineage.py file.
On the Amazon DataZone console, open CloudShell.

On the Actions menu, choose Upload file.
Upload the extract_glue_crawler_lineage.py file.

Run the following commands:

sudo yum -y install python3
python3 -m venv env
. env/bin/activate
pip install boto3

You should get the following results.

After all the libraries and dependencies are configured, run the following command to harvest the lineage metadata from the inventory table:
```
python extract_glue_crawler_lineage.py -d awsome_retail_db -t inventory -r us-east-1 -i dzd_Your_domain
```
The script asks for verification of the settings provided; enter Yes.

You should receive a notification indicating that the script ran successfully.

After you capture the lineage information from the Inventory table, complete the following steps to run the data source.

On the Amazon DataZone data portal, open the Sales project.
On the Data tab, choose Data sources in the navigation pane.

Select your data source job and choose Run.

For this example, we had already created a data source job called SalesDLDataSourceV2 that points to the awsome_retail_db database. To learn more about how to create data source jobs, refer to Create and run an Amazon DataZone data source for the AWS Glue Data Catalog.

After the job runs successfully, you should see a confirmation message.

Now let’s view the lineage diagram generated by Amazon DataZone.

On the Data inventory tab, choose the Inventory table.
On the Inventory asset page, choose the new Lineage tab.

On the Lineage tab, you can see that Amazon DataZone created three nodes:

Job / Job run – This is based on the AWS Glue crawler used to harvest the asset technical metadata
Dataset – This is based on the S3 object that contains the data related to this asset
Table – This is the AWS Glue table created by the crawler

If you choose the Dataset node, Amazon DataZone offers information about the S3 object used to create the asset.
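The three nodes map naturally onto one OpenLineage event: the crawler is the job, the S3 location is the input dataset, and the AWS Glue table is the output dataset. The following sketch shows one way such an event could be assembled from the table metadata returned by the AWS Glue GetTable API. It is not the code of extract_glue_crawler_lineage.py; the function name, producer URL, dataset naming scheme, and facet schema URL are assumptions for illustration.

```python
import uuid
from datetime import datetime, timezone
from urllib.parse import urlparse

PRODUCER = "https://example.com/glue-crawler-lineage-sketch"  # placeholder


def crawler_event_from_table(table, crawler_name, region, account_id):
    """Build an OpenLineage-style event from a Glue `Table` structure."""
    storage = table["StorageDescriptor"]
    location = urlparse(storage["Location"])  # for example s3://bucket/prefix/
    source = {
        "namespace": f"s3://{location.netloc}",
        "name": location.path.strip("/"),
    }
    target = {
        # Assumed naming scheme for Glue tables
        "namespace": f"arn:aws:glue:{region}:{account_id}",
        "name": f"table/{table['DatabaseName']}/{table['Name']}",
        "facets": {
            "schema": {
                "_producer": PRODUCER,
                "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet",
                "fields": [
                    {"name": col["Name"], "type": col["Type"]}
                    for col in storage.get("Columns", [])
                ],
            }
        },
    }
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "glue-crawlers", "name": crawler_name},  # hypothetical namespace
        "inputs": [source],
        "outputs": [target],
        "producer": PRODUCER,
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


# Sample metadata shaped like the "Table" element of a GetTable response
sample_table = {
    "Name": "inventory",
    "DatabaseName": "awsome_retail_db",
    "StorageDescriptor": {
        "Location": "s3://example-bucket/inventory/",
        "Columns": [{"Name": "product_id", "Type": "string"}],
    },
}
crawler_event = crawler_event_from_table(
    sample_table, "AWSomeRetailCrawler", "us-east-1", "111122223333"
)
print(crawler_event["inputs"], crawler_event["outputs"][0]["name"])
```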

Capture data lineage for AWS Glue ETL jobs

In the previous section, we covered how to generate a data lineage diagram on top of a data asset. Now let’s see how we can create one for an AWS Glue job.

The CloudFormation template that we launched earlier created an AWS Glue job called Inventory_Insights. This job gets data from the Inventory table and creates a new table called Inventory_Insights with the aggregated data of the total products available in all the stores.

The CloudFormation template also copied the openlineage-spark_2.12-1.9.1.jar file to the S3 bucket created for this post. This file is necessary to generate lineage metadata from the AWS Glue job. We use version 1.9.1, which is compatible with AWS Glue 3.0, the version used to create the AWS Glue job for this post. If you’re using a different version of AWS Glue, you need to download the corresponding OpenLineage Spark plugin file that matches your AWS Glue version.

The OpenLineage Spark plugin is not able to extract data lineage from AWS Glue Spark jobs that use AWS Glue DynamicFrames. Use Spark SQL DataFrames instead.

Download the extract_glue_spark_lineage.py file.
On the Amazon DataZone console, open CloudShell.
On the Actions menu, choose Upload file.
Upload the extract_glue_spark_lineage.py file.
On the CloudShell console, run the following command (if your CloudShell session expired, you can open a new session):
```
python extract_glue_spark_lineage.py --region "us-east-1" --domain-identifier 'dzd_Your Domain'
```
Confirm the information shown by the script by entering yes.

You will see the following message; this means that the script is ready to get the AWS Glue job lineage metadata after you run it.

Now let’s run the AWS Glue job created by the CloudFormation template.

On the AWS Glue console, choose ETL jobs in the navigation pane.
Select the Inventory_Insights job and choose Run job.

On the Job details tab, you will notice that the job has the following configuration:

Key --conf with value spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=console --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]
Key --user-jars-first with value true
Dependent JARs path set as the S3 path s3://{your bucket}/lib/openlineage-spark_2.12-1.9.1.jar
The AWS Glue version set as 3.0
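If you configure your own job rather than using the one from the template, the same settings can be expressed as AWS Glue job arguments. The following sketch builds them as a Python dictionary, for example to pass as DefaultArguments when creating or updating a job with boto3. The function name and bucket name are hypothetical; the Spark listener property is written with its full spark. prefix, and --extra-jars is the job parameter behind the Dependent JARs path setting.

```python
def openlineage_glue_arguments(bucket, jar="openlineage-spark_2.12-1.9.1.jar"):
    """Return Glue job arguments that enable the OpenLineage Spark listener."""
    env_vars = [
        "AWS_DEFAULT_REGION",
        "GLUE_VERSION",
        "GLUE_COMMAND_CRITERIA",
        "GLUE_PYTHON_VERSION",
    ]
    settings = [
        "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener",
        # Emit events to the job output instead of an HTTP endpoint
        "spark.openlineage.transport.type=console",
        "spark.openlineage.facets.custom_environment_variables=["
        + "".join(f"{name};" for name in env_vars)
        + "]",
    ]
    return {
        # Glue accepts a single --conf key, so further settings are chained in its value
        "--conf": " --conf ".join(settings),
        "--user-jars-first": "true",
        "--extra-jars": f"s3://{bucket}/lib/{jar}",
    }


glue_args = openlineage_glue_arguments("example-bucket")
for key, value in glue_args.items():
    print(key, value)
```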

During the run of the job, you will see the following output on the CloudShell console.

This means that the script has successfully harvested the lineage metadata from the AWS Glue job.

Now let’s create an AWS Glue table based on the data created by the AWS Glue job. For this example, we use an AWS Glue crawler.

On the AWS Glue console, choose Crawlers in the navigation pane.
Select the AWSomeRetailCrawler crawler created by the CloudFormation template and choose Run.

When the crawler is complete, you will see the following message.

Now let’s open the Amazon DataZone portal to see how the diagram is represented in Amazon DataZone.

On the Amazon DataZone portal, choose the Sales project.
On the Data tab, choose Inventory data in the navigation pane.
Choose the inventory insights asset.

On the Lineage tab, you can see the diagram created by Amazon DataZone. It shows three nodes:

- The AWS Glue crawler used to create the AWS Glue table
- The AWS Glue table created by the crawler
- The Amazon DataZone cataloged asset

To see the lineage information about the AWS Glue job that you ran to create the inventory_insights table, choose the arrows icon on the left side of the diagram.

Now you can see the full lineage diagram for the Inventory_insights table.

Choose the blue arrow icon in the inventory node to the left of the diagram.

You can see the evolution of the columns and the transformations that they had.

When you choose any of the nodes that are part of the diagram, you can see more details. For example, the inventory_insights node shows the following information.

Capture lineage from Amazon Redshift

Let’s explore how to generate a lineage diagram from Amazon Redshift. In this example, we use AWS Cloud9 because it allows us to configure the connection to the virtual private cloud (VPC) where our Redshift cluster resides. For more information about AWS Cloud9, refer to the AWS Cloud9 User Guide.

The CloudFormation template included as part of this post doesn’t cover the creation of a Redshift cluster or the creation of the tables used in this section. To learn more about how to create a Redshift cluster, see Step 1: Create a sample Amazon Redshift cluster. We use the following query to create the tables needed for this section of the post:

Create SCHEMA market;

create table market.retail_sales (
  id BIGINT primary key,
  name character varying not null
);

create table market.online_sales (
  id BIGINT primary key,
  name character varying not null
);

/* Important to insert some data in the table */
INSERT INTO market.retail_sales
VALUES (123, 'item1');

INSERT INTO market.online_sales
VALUES (234, 'item2');

create table market.sales AS
Select id, name from market.retail_sales
Union ALL
Select id, name from market.online_sales;

Remember to add the IP address of your AWS Cloud9 environment to the security group with access to the Redshift cluster.

Download the requirements.txt and extract_redshift_lineage.py files.
On the File menu, choose Upload Local Files.
Upload the requirements.txt and extract_redshift_lineage.py files.

Run the following commands:

# Install Python 
sudo yum -y install python3

# dependency set up 
python3 -m venv env 
. env/bin/activate

pip install -r requirements.txt

You should be able to see the following messages.

To set the AWS credentials, run the following command:

export AWS_ACCESS_KEY_ID=<<Your Access Key>>
export AWS_SECRET_ACCESS_KEY=<<Your Secret Access Key>>
export AWS_SESSION_TOKEN=<<Your Session Token>>

Run the extract_redshift_lineage.py script to harvest the metadata necessary to generate the lineage diagram:

python extract_redshift_lineage.py \
 -r region \
 -i dzd_your_dz_domain_id \
 -n your-redshift-cluster-endpoint \
 -t your-rs-port \
 -d your-database \
 -s the-starting-date

Next, you will be prompted to enter the user name and password for the connection to your Amazon Redshift database.
When you receive a confirmation message, enter yes.

If the configuration was done correctly, you will see the following confirmation message.

Now let’s see how the diagram was created in Amazon DataZone.

On the Amazon DataZone data portal, open the Sales project.
On the Data tab, choose Data sources.
Run the data source job.

For this post, we already created a data source job called Sales_DW_Enviroment-default-datasource to add the Redshift data source to our Amazon DataZone project. To learn how to create a data source job, refer to Create and run an Amazon DataZone data source for Amazon Redshift.

After you run the job, you’ll see the following confirmation message.

On the Data tab, choose Inventory data in the navigation pane.
Choose the total_sales asset.

Choose the Lineage tab.

Amazon DataZone creates a three-node lineage diagram for the total sales table; you can choose any node to view its details.

Choose the arrows icon next to the Job / Job run node to view a more complete lineage diagram.

Choose the Job / Job run node.

The Job Info section shows the query that was used to create the total sales table.
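In OpenLineage terms, a query shown on a job node is typically carried in a sql job facet. The following sketch shows how an event for a CREATE TABLE AS statement like the one used in this section could carry the query text along with its source and target tables. It is not the code of extract_redshift_lineage.py; the function name, cluster endpoint, database name, dataset naming scheme, and facet schema URL are assumptions for illustration.

```python
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/redshift-lineage-sketch"  # placeholder


def redshift_query_event(namespace, database, job_name, query, sources, target):
    """Build an OpenLineage-style event for one SQL statement."""

    def dataset(table):
        # Assumed naming: <database>.<schema>.<table> within the cluster namespace
        return {"namespace": namespace, "name": f"{database}.{table}"}

    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {
            "namespace": namespace,
            "name": job_name,
            "facets": {
                "sql": {
                    "_producer": PRODUCER,
                    "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SqlJobFacet.json#/$defs/SqlJobFacet",
                    "query": query,
                }
            },
        },
        "inputs": [dataset(table) for table in sources],
        "outputs": [dataset(target)],
        "producer": PRODUCER,
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


ctas = (
    "create table market.sales AS "
    "Select id, name from market.retail_sales "
    "Union ALL "
    "Select id, name from market.online_sales;"
)
sales_event = redshift_query_event(
    namespace="redshift://example-cluster.us-east-1:5439",  # hypothetical endpoint
    database="dev",  # hypothetical database name
    job_name="create_market_sales",
    query=ctas,
    sources=["market.retail_sales", "market.online_sales"],
    target="market.sales",
)
print([d["name"] for d in sales_event["inputs"]], "->", sales_event["outputs"][0]["name"])
```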

Capture lineage from Amazon MWAA

Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. Amazon MWAA is a managed service for Airflow that lets you use your current Airflow platform to orchestrate your workflows. OpenLineage supports integration with Airflow 2.6.3 using the openlineage-airflow package, and the same can be enabled on Amazon MWAA as a plugin. Once enabled, the plugin converts Airflow metadata to OpenLineage events, which can be consumed by the Amazon DataZone PostLineageEvent API.

The following diagram shows the setup required in Amazon MWAA to capture data lineage using OpenLineage and publish it to Amazon DataZone.

The workflow uses an Amazon MWAA DAG to invoke a data pipeline. The process is as follows:

The openlineage-airflow plugin is configured on Amazon MWAA as a lineage backend. Metadata about the DAG run is passed to the plugin, which converts it into OpenLineage format.
The lineage information collected is written to the Amazon CloudWatch log group for the Amazon MWAA environment.
A helper function captures the lineage information from the log file and publishes it to Amazon DataZone using the PostLineageEvent API.

The example used in the post uses Amazon MWAA version 2.6.3 and OpenLineage plugin version 1.4.1. For other Airflow versions supported by OpenLineage, refer to Supported Airflow versions.

Configure the OpenLineage plugin on Amazon MWAA to capture lineage

When harvesting lineage using OpenLineage, you need to set up a Transport configuration, which tells OpenLineage where to emit the events, for example the console or an HTTP endpoint. You can use ConsoleTransport, which logs the OpenLineage events in the Amazon MWAA task CloudWatch log group. A helper function can then publish these events to Amazon DataZone.

Specify the following in the requirements.txt file added to the S3 bucket configured for Amazon MWAA:

openlineage-airflow==1.4.1

In the Airflow logging configuration section under the MWAA configuration for the Airflow environment, enable Airflow task logs with log level INFO. The following screenshot shows a sample configuration.

A successful configuration will add a plugin to Airflow, which can be verified from the Airflow UI by choosing Plugins on the Admin menu.

In this post, we use a sample DAG to hydrate data to Redshift tables. The following screenshot shows the DAG in graph view.

Run the DAG and upon successful completion of a run, open the Amazon MWAA task CloudWatch log group for your Airflow environment (airflow-env_name-task) and filter based on the expression console.py to select events emitted by OpenLineage. The following screenshot shows the results.

Publish lineage to Amazon DataZone

Now that you have the lineage events emitted to CloudWatch, the next step is to publish them to Amazon DataZone to associate them to a data asset and visualize them on the business data catalog.

Download the files requirements.txt and extract_airflow_lineage.py, and gather environment details such as the AWS Region, Amazon MWAA environment name, and Amazon DataZone domain ID.
The Amazon MWAA environment name can be obtained from the Amazon MWAA console.
The Amazon DataZone domain ID can be obtained from Amazon DataZone service console or from the Amazon DataZone portal.
Navigate to CloudShell and choose Upload files on the Actions menu to upload the files requirements.txt and extract_airflow_lineage.py.

After the files are uploaded, run the following script to filter lineage events from the Airflow task logs and publish them to Amazon DataZone:

# Set up virtual env, activate it, then install dependencies into it
python3 -m venv env
. env/bin/activate
pip install -r requirements.txt

# run the script
python extract_airflow_lineage.py \
  --region us-east-1 \
  --domain-identifier your_domain_identifier \
  --airflow-environment-name your_airflow_environment_name

The script extract_airflow_lineage.py filters the lineage events from the Amazon MWAA task log group and publishes the lineage to the specified domain within Amazon DataZone.
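To give a sense of the filtering step, the following sketch pulls OpenLineage events out of raw task log messages, such as those returned by the CloudWatch Logs filter_log_events API. It is not the code of extract_airflow_lineage.py; the function name and the sample log line format are assumptions based on a typical Airflow task log.

```python
import json


def extract_lineage_events(log_messages):
    """Return the OpenLineage events found in Airflow task log messages."""
    decoder = json.JSONDecoder()
    events = []
    for message in log_messages:
        if "console.py" not in message:
            continue  # only ConsoleTransport lines carry lineage events
        # The log prefix itself contains braces (for example "{console.py:29}"),
        # so try each "{" until one starts a valid JSON document.
        pos = message.find("{")
        while pos != -1:
            try:
                payload, _ = decoder.raw_decode(message, pos)
            except json.JSONDecodeError:
                pos = message.find("{", pos + 1)
                continue
            if isinstance(payload, dict) and "eventType" in payload:
                events.append(payload)
            break
    return events


sample_logs = [
    '[2024-07-08, 10:00:00 UTC] {console.py:29} INFO - '
    '{"eventType": "COMPLETE", "run": {"runId": "abc"}, '
    '"job": {"namespace": "airflow", "name": "sample_dag.load_task"}}',
    "[2024-07-08, 10:00:01 UTC] {taskinstance.py:1345} INFO - Marking task as SUCCESS",
    "[2024-07-08, 10:00:02 UTC] {console.py:29} INFO - not a lineage payload",
]
found = extract_lineage_events(sample_logs)
print(len(found), found[0]["job"]["name"])
```

Each extracted event can then be serialized and sent to Amazon DataZone with the PostLineageEvent API.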

Visualize lineage on Amazon DataZone

After the lineage is published to Amazon DataZone, open your Amazon DataZone project, navigate to the Data tab, and choose a data asset that was accessed by the Amazon MWAA DAG. In this case, it is a subscribed asset.

Navigate to the Lineage tab to visualize the lineage published to Amazon DataZone.

Choose a node to look at additional lineage metadata. In the following screenshot, we can observe the producer of the lineage has been marked as airflow.

Conclusion

In this post, we introduced the preview of data lineage in Amazon DataZone, explained how it works, and showed how you can capture lineage events from AWS Glue, Amazon Redshift, and Amazon MWAA so they can be visualized as part of the asset browsing experience.

To learn more about Amazon DataZone and how to get started, refer to the Getting started guide. Check out the YouTube playlist for some of the latest demos of Amazon DataZone and short descriptions of the capabilities available.

About the Authors

Leonardo Gomez is a Principal Analytics Specialist at AWS, with over a decade of experience in data management. Specializing in data governance, he assists customers worldwide in maximizing their data’s potential while promoting data democratization. Connect with him on LinkedIn.

Priya Tiruthani is a Senior Technical Product Manager with Amazon DataZone at AWS. She focuses on improving data discovery and curation required for data analytics. She is passionate about building innovative products to simplify customers’ end-to-end data journey, especially around data governance and analytics. Outside of work, she enjoys being outdoors to hike, capturing nature’s beauty, and, recently, playing pickleball.

Ron Kyker is a Principal Engineer with Amazon DataZone at AWS, where he helps drive innovation, solve complex problems, and set the bar for engineering excellence for his team. Outside of work, he enjoys board gaming with friends and family, movies, and wine tasting.

Srinivasan Kuppusamy is a Senior Cloud Architect – Data at AWS ProServe, where he helps customers solve their business problems using the power of AWS Cloud technology. His areas of interests are data and analytics, data governance, and AI/ML.

Use AWS Glue Data Catalog views to analyze data

2024-05-09 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/use-aws-glue-data-catalog-views-to-analyze-data/

In this post, we show you how to use the new views feature of the AWS Glue Data Catalog. SQL views are a powerful object used across relational databases. You can use views to decrease the time to insights of data by tailoring the data that is queried. Additionally, you can use the power of SQL in a view to express complex boundaries in data across multiple tables that can’t be expressed with simpler permissions. Data lakes provide customers the flexibility required to derive useful insights from data across many sources and many use cases. Data consumers can consume data where they need to across lines of business, increasing the velocity of insights generation.

Customers use many different processing engines in their data lakes, each of which has its own version of views with different capabilities. The AWS Glue Data Catalog and AWS Lake Formation provide a central location to manage your data across data lake engines.

AWS Glue has released a new feature, SQL views, which allows you to manage a single view object in the Data Catalog that can be queried from SQL engines. You can create a single view object with a different SQL version for each engine you want to query, such as Amazon Athena, Amazon Redshift, and Spark SQL on Amazon EMR. You can then manage access to these resources using the same Lake Formation permissions that are used to control tables in the data lake.

Solution overview

For this post, we use the Women’s E-Commerce Clothing Review dataset. The objective is to create views in the Data Catalog so you can create a single common view schema and metadata object to use across engines (in this case, Athena). Doing so lets you use the same views across your data lakes to fit your use case. We create a view to mask the customer_id column in this dataset, then we share this view with another user so that they can query the masked view.

Prerequisites

Before you can create a view in the AWS Glue Data Catalog, make sure that you have an AWS Identity and Access Management (IAM) role with the following configuration:

The following trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
           "glue.amazonaws.com",
           "lakeformation.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The following pass role policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1",
      "Action": [
        "iam:PassRole"
      ],
      "Effect": "Allow",
      "Resource": "*",
      "Condition": {
         "StringEquals": {
           "iam:PassedToService": [
             "glue.amazonaws.com",
             "lakeformation.amazonaws.com"
           ]
         }
       }
     }
   ]
}

Finally, you will also need the following permissions:
- "Glue:GetDatabase",
- "Glue:GetDatabases",
- "Glue:CreateTable",
- "Glue:GetTable",
- "Glue:UpdateTable",
- "Glue:DeleteTable",
- "Glue:GetTables",
- "Glue:SearchTables",
- "Glue:BatchGetPartition",
- "Glue:GetPartitions",
- "Glue:GetPartition",
- "Glue:GetTableVersion",
- "Glue:GetTableVersions"

Run the AWS CloudFormation template

You can deploy the AWS CloudFormation template glueviewsblog.yaml to create the Lake Formation database and table. The dataset will be loaded into an Amazon Simple Storage Service (Amazon S3) bucket.

For step-by-step instructions, refer to Creating a stack on the AWS CloudFormation console.

When the stack is complete, you can see a table called clothing_parquet on the Lake Formation console, as shown in the following screenshot.

Create a view on the Athena console

Now that you have your Lake Formation managed table, you can open the Athena console and create a Data Catalog view. Complete the following steps:

In the Athena query editor, run the following query on the Parquet dataset:

SELECT * FROM "clothing_reviews"."clothing_parquet" limit 10;

In the query results, the customer_id column is currently visible.

Next, you create a view called hidden_customerID and mask the customer_id column.

Create a view called hidden_customerID:

CREATE PROTECTED MULTI DIALECT VIEW clothing_reviews.hidden_customerid SECURITY DEFINER AS 
SELECT * FROM clothing_reviews.clothing_parquet

In the following screenshot, you can see a view called hidden_customerID was successfully created.

Run the following query to mask the first four characters of the customer_id column for the newly generated view:

ALTER VIEW clothing_reviews.hidden_customerid UPDATE DIALECT AS
SELECT '****' || substring(customer_id, 5) as customer_id,clothing_id,age,title,review_text,rating,recommend_ind,positive_feedback,division_name,department_name,class_name 
FROM clothing_reviews.clothing_parquet

You can see in the following screenshot that the view hidden_customerID has the customer_id column’s first four characters masked.

The original table clothing_parquet remains unchanged and unmasked.
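
To reason about what a masking expression of this shape returns, it can help to mimic it outside the query engine. The following sketch assumes that string positions in Athena SQL start at 1, so substring(value, n) keeps everything from the n-th character onward. The function names and the sample value are hypothetical.

```python
def sql_substring(value: str, start: int) -> str:
    """Mimic SQL substring(value, start) with a 1-based start position."""
    return value[start - 1:]


def mask_prefix(value: str, masked_chars: int = 4) -> str:
    """Replace the first `masked_chars` characters with asterisks.

    Equivalent in spirit to: '****' || substring(value, masked_chars + 1)
    """
    return "*" * masked_chars + sql_substring(value, masked_chars + 1)


# Hypothetical customer ID, not taken from the dataset.
print(mask_prefix("AB12345678"))  # ****345678
```

Note that a start position equal to the number of masked characters would keep one character more than intended, because the start position itself is included in the result.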

Grant another user access to query the view

Data Catalog views allow you to use Lake Formation to control access. In this step, you grant this view to another user called amazon_business_analyst and then query from that user.

Sign in to the Lake Formation console as admin.
In the navigation pane, choose Views.

As shown in the following screenshot, you can see the hidden_customerid view.

At this point, the amazon_business_analyst user has no visibility into the view.

Grant permission to the amazon_business_analyst user from the data lake admin.

On the Athena console, query the hidden_customerid view.

You have successfully shared a view to the user and queried it from the Athena console.

Clean up

To avoid incurring future charges, delete the CloudFormation stack. For instructions, refer to Deleting a stack on the AWS CloudFormation console.

Conclusion

In this post, we demonstrated how to use the AWS Glue Data Catalog to create views. We then showed how to alter the views and mask the data. You can share the view with different users to query using Athena. For more information about this new feature, refer to Using AWS Glue Data Catalog views.

About the Authors

Leonardo Gomez is a Principal Analytics Specialist Solutions Architect at AWS. He has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn.

Michael Chess is a Product Manager on the AWS Lake Formation team based out of Palo Alto, CA. He specializes in permissions and data catalog features in the data lake.

Derek Liu is a Senior Solutions Architect based out of Vancouver, BC. He enjoys helping customers solve big data challenges through AWS analytic services.

Monitor and optimize cost on AWS Glue for Apache Spark

2023-04-28 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/monitor-optimize-cost-glue-spark/

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores.

One of the most common questions we get from customers is how to effectively monitor and optimize costs on AWS Glue for Spark. The diversity of features and pricing options for AWS Glue offers the flexibility to effectively manage the cost of your data workloads and still keep the performance and capacity your business needs. The fundamental process of cost optimization for AWS Glue workloads remains the same: you monitor job runs, analyze the costs and usage to find savings, and take action to implement improvements to the code or configurations.

In this post, we demonstrate a tactical approach to help you manage and reduce cost through monitoring and optimization techniques on top of your AWS Glue workloads.

Monitor overall costs on AWS Glue for Apache Spark

AWS Glue for Apache Spark charges an hourly rate in 1-second increments with a minimum of 1 minute based on the number of data processing units (DPUs). Learn more in AWS Glue Pricing. This section describes a way to monitor overall costs on AWS Glue for Apache Spark.
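
The billing rule described above (per-second increments, a 1-minute minimum, charged per DPU) can be sketched as a small calculation. The price per DPU hour used here is a hypothetical placeholder, not a quoted rate; check AWS Glue Pricing for the rate in your Region. The function name is also an assumption for illustration.

```python
def estimate_job_cost(dpus, runtime_seconds, price_per_dpu_hour, minimum_seconds=60):
    """Estimate the cost of one job run.

    Runs shorter than `minimum_seconds` are billed as `minimum_seconds`;
    longer runs are billed for their actual duration in seconds.
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * billed_seconds / 3600 * price_per_dpu_hour


# Hypothetical rate of 0.44 per DPU hour, for illustration only.
HYPOTHETICAL_RATE = 0.44

# A 157-second run on 10 DPUs.
print(round(estimate_job_cost(10, 157, HYPOTHETICAL_RATE), 2))  # 0.19

# A 20-second run is billed as a full minute.
print(estimate_job_cost(10, 20, HYPOTHETICAL_RATE) == estimate_job_cost(10, 60, HYPOTHETICAL_RATE))  # True
```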

AWS Cost Explorer

In AWS Cost Explorer, you can see overall trends of DPU hours. Complete the following steps:

On the Cost Explorer console, create a new cost and usage report.
For Service, choose Glue.
For Usage type, choose the following options:
1. Choose <Region>-ETL-DPU-Hour (DPU-Hour) for standard jobs.
2. Choose <Region>-ETL-Flex-DPU-Hour (DPU-Hour) for Flex jobs.
3. Choose <Region>-GlueInteractiveSession-DPU-Hour (DPU-Hour) for interactive sessions.
Choose Apply.

Learn more in Analyzing your costs with AWS Cost Explorer.

Monitor individual job run costs

This section describes a way to monitor individual job run costs on AWS Glue for Apache Spark. There are two options to achieve this.

AWS Glue Studio Monitoring page

On the Monitoring page in AWS Glue Studio, you can monitor the DPU hours you spent on a specific job run. The following screenshot shows three job runs that processed the same dataset; the first job run spent 0.66 DPU hours, and the second spent 0.44 DPU hours. The third one with Flex spent only 0.33 DPU hours.

GetJobRun and GetJobRuns APIs

The DPU hour values per job run can be retrieved through AWS APIs.

For auto scaling jobs and Flex jobs, the field DPUSeconds is available in GetJobRun and GetJobRuns API responses:

$ aws glue get-job-run --job-name ghcn --run-id jr_ccf6c31cc32184cea60b63b15c72035e31e62296846bad11cd1894d785f671f4
{
    "JobRun": {
        "Id": "jr_ccf6c31cc32184cea60b63b15c72035e31e62296846bad11cd1894d785f671f4",
        "Attempt": 0,
        "JobName": "ghcn",
        "StartedOn": "2023-02-08T19:14:53.821000+09:00",
        "LastModifiedOn": "2023-02-08T19:19:35.995000+09:00",
        "CompletedOn": "2023-02-08T19:19:35.995000+09:00",
        "JobRunState": "SUCCEEDED",
        "PredecessorRuns": [],
        "AllocatedCapacity": 10,
        "ExecutionTime": 274,
        "Timeout": 2880,
        "MaxCapacity": 10.0,
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,
        "LogGroupName": "/aws-glue/jobs",
        "GlueVersion": "3.0",
        "ExecutionClass": "FLEX",
        "DPUSeconds": 1137.0
    }
}

The field DPUSeconds returns 1137.0. This corresponds to 0.32 DPU hours, calculated as 1137.0/(60*60)=0.32.

For the other standard jobs without auto scaling, the field DPUSeconds is not available:

$ aws glue get-job-run --job-name ghcn --run-id jr_10dfa93fcbfdd997dd9492187584b07d305275531ff87b10b47f92c0c3bd6264
{
    "JobRun": {
        "Id": "jr_10dfa93fcbfdd997dd9492187584b07d305275531ff87b10b47f92c0c3bd6264",
        "Attempt": 0,
        "JobName": "ghcn",
        "StartedOn": "2023-02-07T16:38:05.155000+09:00",
        "LastModifiedOn": "2023-02-07T16:40:48.575000+09:00",
        "CompletedOn": "2023-02-07T16:40:48.575000+09:00",
        "JobRunState": "SUCCEEDED",
        "PredecessorRuns": [],
        "AllocatedCapacity": 10,
        "ExecutionTime": 157,
        "Timeout": 2880,
        "MaxCapacity": 10.0,
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,
        "LogGroupName": "/aws-glue/jobs",
        "GlueVersion": "3.0",
        "ExecutionClass": "STANDARD"
    }
}

For these jobs, you can calculate DPU hours as ExecutionTime*MaxCapacity/(60*60). In this example, you get 0.44 DPU hours from 157*10/(60*60)=0.44. Note that AWS Glue versions 2.0 and later have a 1-minute minimum billing.
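
The two cases above can be folded into one helper that takes the JobRun object of a GetJobRun response. This is a minimal sketch: the function name is hypothetical, and the input dictionaries below are abbreviated copies of the two responses shown earlier.

```python
def dpu_hours(job_run: dict) -> float:
    """Return the DPU hours consumed by a single job run.

    Uses DPUSeconds when the field is present (auto scaling and Flex jobs),
    and falls back to ExecutionTime * MaxCapacity otherwise.
    """
    if "DPUSeconds" in job_run:
        return job_run["DPUSeconds"] / 3600
    return job_run["ExecutionTime"] * job_run["MaxCapacity"] / 3600


flex_run = {"ExecutionTime": 274, "MaxCapacity": 10.0, "DPUSeconds": 1137.0}
standard_run = {"ExecutionTime": 157, "MaxCapacity": 10.0}

print(round(dpu_hours(flex_run), 2))      # 0.32
print(round(dpu_hours(standard_run), 2))  # 0.44
```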

AWS CloudFormation template

Because DPU hours can be retrieved through the GetJobRun and GetJobRuns APIs, you can integrate this with other services like Amazon CloudWatch to monitor trends of consumed DPU hours over time. For example, you can configure an Amazon EventBridge rule to invoke an AWS Lambda function to publish CloudWatch metrics every time AWS Glue jobs finish.

To help you configure that quickly, we provide an AWS CloudFormation template. You can review and customize it to suit your needs. Some of the resources this stack deploys incur costs when in use.

The CloudFormation template generates the following resources:

AWS Identity and Access Management (IAM) role
Lambda function
EventBridge rule

To create your resources, complete the following steps:

Sign in to the AWS CloudFormation console.
Choose Launch Stack:
Choose Next.
Choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.

Stack creation can take up to 3 minutes.

After you complete the stack creation, when AWS Glue jobs finish, the following DPUHours metrics are published under the Glue namespace in CloudWatch:

Aggregated metrics – Dimension=[JobType, GlueVersion, ExecutionClass]
Per-job metrics – Dimension=[JobName, JobRunId=ALL]
Per-job run metrics – Dimension=[JobName, JobRunId]

Aggregated metrics and per-job metrics are shown as in the following screenshot.

Each datapoint represents DPUHours per individual job run, so the valid statistic for these CloudWatch metrics is SUM. With the CloudWatch metrics, you can have a granular view of DPU hours.
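
To make the three dimension sets concrete, the following sketch builds the list of datapoints that a function could publish for one finished job run. It is not the code deployed by the CloudFormation template; the function name, the argument names, and the sample values are assumptions. The resulting list has the shape expected by the MetricData parameter of the CloudWatch PutMetricData API.

```python
def build_dpu_hours_metric_data(job_name, job_run_id, job_type,
                                glue_version, execution_class, dpu_hours):
    """Build one DPUHours datapoint per dimension set described above."""
    dimension_sets = [
        # Aggregated metrics
        {"JobType": job_type, "GlueVersion": glue_version,
         "ExecutionClass": execution_class},
        # Per-job metrics
        {"JobName": job_name, "JobRunId": "ALL"},
        # Per-job run metrics
        {"JobName": job_name, "JobRunId": job_run_id},
    ]
    return [
        {
            "MetricName": "DPUHours",
            "Dimensions": [{"Name": n, "Value": v} for n, v in dims.items()],
            "Value": dpu_hours,
            "Unit": "None",
        }
        for dims in dimension_sets
    ]


metric_data = build_dpu_hours_metric_data(
    job_name="ghcn",
    job_run_id="jr_example",  # hypothetical run ID
    job_type="glueetl",       # assumed job type value
    glue_version="3.0",
    execution_class="FLEX",
    dpu_hours=1137.0 / 3600,
)
print(len(metric_data))  # 3

# Publishing would then look roughly like this (requires boto3 and credentials):
#   boto3.client("cloudwatch").put_metric_data(Namespace="Glue", MetricData=metric_data)
```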

Options to optimize cost

This section describes key options to optimize costs on AWS Glue for Apache Spark:

Upgrade to the latest version
Auto scaling
Flex
Set the job’s timeout period appropriately
Interactive sessions
Smaller worker type for streaming jobs

We dive deep into each option.

Upgrade to the latest version

Having AWS Glue jobs running on the latest version enables you to take advantage of the latest functionalities and improvements offered by AWS Glue and the upgraded version of the supported engines such as Apache Spark. For example, AWS Glue 4.0 includes the new optimized Apache Spark 3.3.0 runtime and adds support for built-in pandas APIs as well as native support for Apache Hudi, Apache Iceberg, and Delta Lake formats, giving you more options for analyzing and storing your data. It also includes a new highly performant Amazon Redshift connector that is 10 times faster on TPC-DS benchmarking.

Auto scaling

One of the most common challenges in reducing cost is identifying the right amount of resources to run jobs. Users tend to overprovision workers to avoid resource-related problems, but some of those DPUs go unused, which increases costs unnecessarily. Starting with AWS Glue version 3.0, AWS Glue auto scaling helps you dynamically scale resources up and down based on the workload, for both batch and streaming jobs. Auto scaling reduces the need to optimize the number of workers to avoid over-provisioning resources for jobs, or paying for idle workers.

To enable auto scaling on AWS Glue Studio, go to the Job Details tab of your AWS Glue job and select Automatically scale number of workers.

You can learn more in Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark.

Flex

For non-urgent data integration workloads that don’t require fast job start times or can afford to rerun the jobs in case of a failure, Flex could be a good option. The start times and runtimes of jobs using Flex vary because spare compute resources aren’t always available instantly and may be reclaimed during the run of a job. Flex-based jobs offer the same capabilities, including access to custom connectors, a visual job authoring experience, and a job scheduling system. With the Flex option, you can optimize the costs of your data integration workloads by up to 34%.

To enable Flex on AWS Glue Studio, go to the Job Details tab of your job and select Flex execution.

You can learn more in Introducing AWS Glue Flex jobs: Cost savings on ETL workloads.
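
Auto scaling and Flex can also be requested when starting a job run through the API instead of AWS Glue Studio. The following sketch only builds the keyword arguments for the StartJobRun call; it assumes that auto scaling is switched on with the --enable-auto-scaling job parameter and that Flex is selected with ExecutionClass set to FLEX. The helper name is hypothetical, and you should confirm both settings against the AWS Glue documentation for your Glue version.

```python
def build_start_job_run_kwargs(job_name, number_of_workers, worker_type="G.1X",
                               auto_scaling=False, flex=False):
    """Build keyword arguments for a Glue StartJobRun request."""
    kwargs = {
        "JobName": job_name,
        "WorkerType": worker_type,
        # With auto scaling, this acts as the maximum number of workers.
        "NumberOfWorkers": number_of_workers,
    }
    if auto_scaling:
        kwargs["Arguments"] = {"--enable-auto-scaling": "true"}
    if flex:
        kwargs["ExecutionClass"] = "FLEX"
    return kwargs


print(build_start_job_run_kwargs("ghcn", 10, auto_scaling=True))
print(build_start_job_run_kwargs("ghcn", 10, flex=True))

# Starting the run would then look roughly like this (requires boto3 and credentials):
#   boto3.client("glue").start_job_run(**build_start_job_run_kwargs("ghcn", 10, flex=True))
```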

Interactive sessions

One common practice among developers who create AWS Glue jobs is to run the same job several times, once for every modification made to the code. However, this may not be cost-effective depending on the number of workers assigned to the job and the number of times that it’s run. Also, this approach may slow down development because you have to wait until every job run is complete. To address this issue, in 2022 we released AWS Glue interactive sessions. This feature lets developers process data interactively using a Jupyter-based notebook or IDE of their choice. Sessions start in seconds and have built-in cost management. As with AWS Glue jobs, you pay for only the resources you use. Interactive sessions allow developers to test their code line by line without needing to run the entire job to test any changes made to the code.

Set the job’s timeout period appropriately

Due to configuration issues, script coding errors, or data anomalies, sometimes AWS Glue jobs can take an exceptionally long time or struggle to process the data, and it can cause unexpected charges. AWS Glue gives you the ability to set a timeout value on any jobs. By default, an AWS Glue job is configured with 48 hours as the timeout value, but you can specify any timeout. We recommend identifying the average runtime of your job, and based on that, set an appropriate timeout period. This way, you can control cost per job run, prevent unexpected charges, and detect any problems related to the job earlier.

To change the timeout value on AWS Glue Studio, go to the Job Details tab of your job and enter a value for Job timeout.

Interactive sessions similarly let you set an idle timeout value. The default idle timeout value for Spark ETL sessions is 2880 minutes (48 hours). To change the timeout value, you can use the %idle_timeout magic.
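
One way to turn the recommendation above into a number is to derive the timeout from recent runtimes. The following sketch takes past ExecutionTime values in seconds and returns a timeout in minutes, which is the unit used by the job timeout setting. The headroom factor of 2 is an arbitrary assumption, not an AWS recommendation, and the function name is hypothetical.

```python
import math
from statistics import mean


def suggest_timeout_minutes(runtimes_seconds, headroom=2.0):
    """Suggest a job timeout in minutes from past runtimes in seconds.

    The average runtime is multiplied by `headroom` and rounded up,
    with a floor of 1 minute.
    """
    average_minutes = mean(runtimes_seconds) / 60
    return max(1, math.ceil(average_minutes * headroom))


# Hypothetical history of three runs, in seconds.
print(suggest_timeout_minutes([274, 157, 300]))  # 9
```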

Smaller worker type for streaming jobs

Processing data in real time is a common use case for customers, but sometimes these streams have sporadic and low data volumes. G.1X and G.2X worker types could be too big for these workloads, especially if we consider streaming jobs may need to run 24/7. To help you reduce costs, in 2022 we released G.025X, a new quarter DPU worker type for streaming ETL jobs. With this new worker type, you can process low data volume streams at one-fourth of the cost.

To select the G.025X worker type on AWS Glue Studio, go to the Job Details tab of your job. For Type, choose Spark Streaming, then choose G 0.25X for Worker type.

You can learn more in Best practices to optimize cost and performance for AWS Glue streaming ETL jobs.

Performance tuning to optimize cost

Performance tuning plays an important role in reducing cost. The first action for performance tuning is to identify the bottlenecks. Without measuring the performance and identifying bottlenecks, it’s not realistic to optimize cost-effectively. CloudWatch metrics provide a simple view for quick analysis, and the Spark UI provides a deeper view for performance tuning. We highly recommend enabling the Spark UI for your jobs and then viewing it to identify the bottleneck.

The following are high-level strategies to optimize costs:

Scale cluster capacity
Reduce the amount of data scanned
Parallelize tasks
Optimize shuffles
Overcome data skew
Accelerate query planning

For this post, we discuss the techniques for reducing the amount of data scanned and parallelizing tasks.

Reduce the amount of data scanned: Enable job bookmarks

AWS Glue job bookmarks are a capability to process data incrementally when running a job multiple times on a scheduled interval. If your use case is an incremental data load, you can enable job bookmarks to avoid a full scan for all job runs and process only the delta from the last job run. This reduces the amount of data scanned and accelerates individual job runs.

Reduce the amount of data scanned: Partition pruning

If your input data is partitioned in advance, you can reduce the amount of data scanned by pruning partitions.

For AWS Glue DynamicFrame, set push_down_predicate (and catalogPartitionPredicate), as shown in the following code. Learn more in Managing partitions for ETL output in AWS Glue.

# DynamicFrame
# glue_context is assumed to be an existing GlueContext instance.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database=src_database_name,
    table_name=src_table_name,
    push_down_predicate="year='2023' and month='03'",
)

For Spark DataFrame (or Spark SQL), restrict the input path to the partitions you need, or set a where or filter clause to prune partitions:

# DataFrame
df = spark.read.format("json").load("s3://<YourBucket>/year=2023/month=03/*/*.gz")
 
# SparkSQL 
df = spark.sql("SELECT * FROM <Table> WHERE year= '2023' and month = '03'")
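
When the partition values come from job parameters rather than literals, the predicate string can be assembled programmatically. The following is a minimal sketch with a hypothetical helper name. It assumes the values are trusted and contain no single quotes, because it performs no escaping.

```python
def build_partition_predicate(partitions: dict) -> str:
    """Join partition column/value pairs into a predicate string.

    Values are quoted as strings and combined with "and".
    No escaping is performed, so pass trusted values only.
    """
    return " and ".join(f"{column}='{value}'" for column, value in partitions.items())


predicate = build_partition_predicate({"year": "2023", "month": "03"})
print(predicate)  # year='2023' and month='03'
```

The resulting string can be passed as push_down_predicate for a DynamicFrame or used in the WHERE clause of a Spark SQL query.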

Parallelize tasks: Parallelize JDBC reads

The number of concurrent reads from the JDBC source is determined by configuration. Note that by default, a single JDBC connection will read all the data from the source through a SELECT query.

Both AWS Glue DynamicFrame and Spark DataFrame support parallelizing data scans across multiple tasks by splitting the dataset.

For AWS Glue DynamicFrame, set hashfield or hashexpression and hashpartitions. Learn more in Reading from JDBC tables in parallel.

For Spark DataFrame, set numPartitions, partitionColumn, lowerBound, and upperBound. Learn more in JDBC To Other Databases.
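
For the Spark DataFrame case, the four options are passed to the JDBC reader together with the connection URL and table name. The following sketch only assembles the options dictionary; the helper name, JDBC URL, table, and column are hypothetical. In Spark, lowerBound and upperBound decide how the partition column range is split across tasks; they do not filter rows.

```python
def build_jdbc_read_options(url, table, partition_column,
                            lower_bound, upper_bound, num_partitions):
    """Build Spark JDBC reader options for a parallel read."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


options = build_jdbc_read_options(
    url="jdbc:postgresql://example-host:5432/exampledb",  # hypothetical endpoint
    table="public.orders",                                # hypothetical table
    partition_column="order_id",                          # hypothetical numeric column
    lower_bound=1,
    upper_bound=1_000_000,
    num_partitions=10,
)
print(options["numPartitions"])  # 10

# With an active SparkSession, the read would look roughly like this:
#   df = spark.read.format("jdbc").options(**options).load()
```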

Conclusion

In this post, we discussed methodologies for monitoring and optimizing cost on AWS Glue for Apache Spark. With these techniques, you can track DPU hours across your jobs and reduce the cost of your AWS Glue workloads.

If you have comments or feedback, please leave them in the comments.

About the Authors

Leonardo Gómez is a Principal Analytics Specialist Solutions Architect at AWS. He has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Announcing AWS Glue crawler support for Snowflake

2022-11-19 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/announcing-aws-glue-crawler-support-for-snowflake/

For data lake customers who need to discover petabytes of data, AWS Glue crawlers are a popular way to scan data in the background, so you can focus on using the data to make more intelligent decisions. You may also have data in data warehouses such as Snowflake and want the ability to discover the data in the warehouse and combine it with data from data lakes to derive insights. AWS Glue crawlers now support Snowflake, making it easier for you to understand updates to Snowflake schemas and extract meaningful insights.

To crawl a Snowflake database, you can create and schedule an AWS Glue crawler with a JDBC URL and credential information from AWS Secrets Manager. A configuration option allows you to specify whether you want the crawler to crawl the entire database or limit the tables by including the schema or table path and exclude patterns to reduce crawl time. With each run, the crawler inspects and catalogs information, such as updates or deletes to Snowflake tables, external tables, views, and materialized views, in the AWS Glue Data Catalog. For Snowflake columns with non-Hive compatible types, such as geography or geometry, the crawler extracts that information as a raw data type and makes it available in the Data Catalog.

In this post, we set up an AWS Glue crawler to crawl the OpenStreetMap geospatial dataset, which is freely available through Snowflake Marketplace. This dataset includes all of the OpenStreetMap location data for New York. OpenStreetMap maintains data about businesses, roads, trails, cafes, railway stations, and much more, from all over the world.

Overview of solution

Snowflake is a cloud data platform that provides data solutions from data warehousing to data science. Snowflake Computing is an AWS Advanced Technology Partner with AWS Competencies in Data & Analytics, Machine Learning, and Retail, as well as an AWS service validation for AWS PrivateLink.

In this solution, we use a sample use case involving points of interest in New York City, based on the following Snowflake quick start. Follow sections 1 and 2 to get access to sample geospatial data from Snowflake Marketplace. We show how to interpret the geography data type and understand the different formats. We use the AWS Glue crawler to crawl this OpenStreetMap geospatial dataset and make it available in the Data Catalog with the geography data type maintained where appropriate.

Prerequisites

To follow along, you need the following:

An AWS account.
An AWS Identity and Access Management (IAM) user with access to the following services:
- Amazon Simple Storage Service (Amazon S3)
- AWS Glue
An IAM role with access to run AWS Glue crawlers.
If the AWS account you use to follow this post uses AWS Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.
A Snowflake Enterprise Edition account with permission to create storage integrations, ideally in the AWS us-east-1 Region or closest available trial Region, like us-east-2. If necessary, you can subscribe to a Snowflake trial account on AWS Marketplace.
- On the Marketplace listing page, choose Continue to Subscribe, and then choose Accept Terms. You’re redirected to the Snowflake website to begin using the software. To complete your registration, choose Set Up Your Account.
- If you’re new to Snowflake, consider completing the Snowflake in 20 Minutes tutorial. By the end of the tutorial, you should know how to create required Snowflake objects, including warehouses, databases, and tables for storing and querying data.
A Snowflake worksheet (query editor) and associated access to a Snowflake virtual warehouse (compute) and database (storage).
Access to an existing Snowflake account with the ACCOUNTADMIN role or the IMPORT SHARE privilege.

Create an AWS Glue connection to Snowflake

For this post, an AWS Glue connection to your Snowflake cluster is necessary. For more details about how to create it, follow the steps in Performing data transformations using Snowflake and AWS Glue. The following screenshot shows the configuration used to create a connection to the Snowflake cluster for this post.

Create an AWS Glue crawler

To create your crawler, complete the following steps:

On the AWS Glue console, choose Crawlers in the navigation pane.
Choose Create crawler.
For Name, enter a name (for example, glue-blog-snowflake-crawler).
Choose Next.
For Is your data already mapped to Glue tables, select Not yet.
In the Data sources section, choose Add a data source.

For this post, you use a JDBC dataset as a source.

For Data source, choose JDBC.
For Connection, select the connection that you created earlier (for this post, SA-snowflake-connection).
For Include path, enter the path to the Snowflake database you created as a prerequisite (OSM_NEWYORK/NEW_YORK/%).
For Additional metadata, choose COMMENTS and RAWTYPE.

This allows the crawler to harvest metadata related to comments and raw types like geospatial columns.

Choose Add a JDBC data source.

Choose Next.
For Existing IAM role, choose the role you created as a prerequisite (for this post, we use AWSGlueServiceRole-DefualtRole).
Choose Next.

Now let’s create an AWS Glue database.

Under Target database, choose Add database.
For Name, enter gluesnowdb.
Choose Create database.
On the Set output and scheduling page, for Target database, choose the database you just created (gluesnowdb).
For Table name prefix, enter blog_.
For Frequency, choose On demand.
Choose Next.
Review the configuration and choose Create crawler.
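
The console steps above have a rough equivalent in the AWS Glue CreateCrawler API. The following sketch builds the request parameters with the values used in this post. The helper name is hypothetical, and the RAWTYPES spelling for EnableAdditionalMetadata reflects our understanding of the JDBC target definition in the API, so verify it against the API reference before use. Omitting a schedule corresponds to the On demand frequency.

```python
def build_snowflake_crawler_kwargs(name, role, connection_name, include_path,
                                   database_name, table_prefix=""):
    """Build keyword arguments for a Glue CreateCrawler request with a JDBC target."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database_name,
        "TablePrefix": table_prefix,
        "Targets": {
            "JdbcTargets": [
                {
                    "ConnectionName": connection_name,
                    "Path": include_path,
                    # Harvest comments and raw (for example, geospatial) types.
                    "EnableAdditionalMetadata": ["COMMENTS", "RAWTYPES"],
                }
            ]
        },
    }


crawler_kwargs = build_snowflake_crawler_kwargs(
    name="glue-blog-snowflake-crawler",
    role="AWSGlueServiceRole-DefualtRole",
    connection_name="SA-snowflake-connection",
    include_path="OSM_NEWYORK/NEW_YORK/%",
    database_name="gluesnowdb",
    table_prefix="blog_",
)
print(crawler_kwargs["Targets"]["JdbcTargets"][0]["Path"])  # OSM_NEWYORK/NEW_YORK/%

# Creating the crawler would then look roughly like this (requires boto3 and credentials):
#   boto3.client("glue").create_crawler(**crawler_kwargs)
```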

Run the AWS Glue crawler

To run the crawler, complete the following steps:

On the AWS Glue console, choose Crawlers in the navigation pane.
Choose the crawler you created.
Choose Run crawler.

On the Crawler runs tab, you can see the current run of the crawler.

Wait until the crawler run is complete.

As shown in the following screenshot, 27 tables were added.

Now let’s see how these tables look in the AWS Glue Data Catalog.

Explore the AWS Glue tables

Let’s explore the tables created by the crawler.

On the AWS Glue console, choose Databases in the navigation pane.
Search for and choose the gluesnowdb database.

Now you can see the list of the tables created by the crawler.

Choose the blog_osm_newyork_new_york_v_osm_ny_amenity table.

In the Schema section, you can see that the raw type was also harvested from the source Snowflake database.

Choose the Advanced properties tab.
In the Table properties section, you can see that the classification is snowflake and the typeOfData is view.

Clean up

To avoid incurring future charges, and to clean up unused roles and policies, delete the resources you created: the CloudFormation stack, S3 bucket, AWS Glue crawler, AWS Glue database, and AWS Glue table.

Conclusion

AWS Glue crawlers now support Snowflake tables, views, and materialized views, offering more options to integrate Snowflake databases with your AWS Glue Data Catalog. You can use AWS Glue crawlers to discover Snowflake datasets, extract schema information, and populate the Data Catalog.

In this post, we provided a procedure to set up AWS Glue crawlers to discover Snowflake tables, which reduces the time and cost needed to incrementally process Snowflake table data updates in the Data Catalog. To learn more about this feature, refer to the docs.

Special thanks to everyone who contributed to this crawler feature launch: Theo Xu, Hunny Vankawala, and Jessica Cheng.

Happy crawling!

Attribution

OpenStreetMap data by OpenStreetMap Foundation is licensed under Open Data Commons Open Database License (ODbL)

About the authors

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs.

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Build incremental crawls of data lakes with existing Glue catalog tables

2022-10-14 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/build-incremental-crawls-of-data-lakes-with-existing-glue-catalog-tables/

AWS Glue includes crawlers, a capability that makes discovering datasets simpler by scanning data in Amazon Simple Storage Service (Amazon S3) and relational databases, extracting their schema, and automatically populating the AWS Glue Data Catalog, which keeps the metadata current. This reduces the time to insight by making newly ingested data quickly available for analysis with your preferred analytics and machine learning (ML) tools.

Previously, you could reduce crawler cost by using Amazon S3 Event Notifications to incrementally crawl changes on Data Catalog tables created by a crawler. Today, we’re extending this support to crawling and updating Data Catalog tables that are created by non-crawler methods, such as using data pipelines. This crawler feature can be useful for several use cases, such as the following:

You currently have a data pipeline to create AWS Glue Data Catalog tables and want to offload detection of partition information from the data pipeline to a scheduled crawler
You have an S3 bucket with event notifications enabled and want to continuously catalog new changes and prevent creation of new tables in case of ill-formatted files that break the partition detection
You have manually created Data Catalog tables and want to run incremental crawls on new file additions instead of running full crawls due to long crawl times

To accomplish incremental crawling, you can configure Amazon S3 Event Notifications to be sent to an Amazon Simple Queue Service (Amazon SQS) queue. You can then use the SQS queue as a source to identify changes and can schedule or run an AWS Glue crawler with Data Catalog tables as a target. With each run of the crawler, the SQS queue is inspected for new events. If no new events are found, the crawler stops. If events are found in the queue, the crawler inspects their respective folders, processes through built-in classifiers (for CSV, JSON, AVRO, XML, and so on), and determines the changes. The crawler then updates the Data Catalog with new information, such as newly added or deleted partitions or columns. This feature reduces the cost and time to crawl large and frequently changing Amazon S3 data.
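
To illustrate how queued events narrow a crawl down to specific folders, the following sketch extracts the distinct S3 folders from a batch of SQS message bodies. It is a conceptual example, not the crawler’s actual implementation. It assumes standard S3 event notification JSON, optionally wrapped in an SNS envelope (where the S3 event is a JSON string under the Message key), and the function name and sample bucket are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def changed_folders(message_bodies):
    """Return the sorted, distinct S3 folders referenced by the given messages."""
    folders = set()
    for body in message_bodies:
        payload = json.loads(body)
        # Unwrap the SNS envelope if the event reached SQS through an SNS topic.
        if "Message" in payload:
            payload = json.loads(payload["Message"])
        for record in payload.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys are URL-encoded in S3 event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            folder = key.rsplit("/", 1)[0] if "/" in key else ""
            folders.add(f"s3://{bucket}/{folder}")
    return sorted(folders)


s3_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "tickets/year%3D2020/part-0001.csv"}}}
    ]
}
direct_message = json.dumps(s3_event)
sns_wrapped_message = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})

print(changed_folders([direct_message, sns_wrapped_message]))
# ['s3://example-bucket/tickets/year=2020']
```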

This post shows how to create an AWS Glue crawler that supports Amazon S3 event notification on existing Data Catalog tables using the new crawler UI and an AWS CloudFormation template.

Overview of solution

To demonstrate how the new AWS Glue crawler performs incremental updates, we use the Toronto parking tickets dataset—specifically data about parking tickets issued in the city of Toronto between 2019–2020. The goal is to create a manual dataset as well as its associated metadata tables in AWS Glue, followed by an event-based crawler that detects and implements changes to the manually created datasets and catalogs.

As mentioned before, instead of crawling all the subfolders on Amazon S3, we use an Amazon S3 event-based approach. This helps improve the crawl time by using Amazon S3 events to identify the changes between two crawls by listing all the files from the subfolder that triggered the event instead of listing the full Amazon S3 target. To accomplish this, we create an S3 bucket, an event-based crawler, an Amazon Simple Notification Service (Amazon SNS) topic, and an SQS queue. The following diagram illustrates our solution architecture.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
An AWS Identity and Access Management (IAM) user with access to the following services:
- AWS CloudFormation
- AWS Glue
- Amazon SNS
- Amazon SQS
- Amazon S3

If the AWS account you use to follow this post uses Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.

Launch your CloudFormation stack

To create your resources for this use case, complete the following steps:

Launch your CloudFormation stack in us-east-1:
For Stack name, enter a name for your stack.
For paramBucketName, enter a name for your S3 bucket (with your account number).
Choose Next.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

Wait for the stack to finish provisioning the requisite resources. When you see the CREATE_COMPLETE status, you can proceed to the next steps.

Additionally, note down the ARN of the SQS queue to use at a later point.

Query your Data Catalog

Next, we use Amazon Athena to confirm that the manual tables have been created in the Data Catalog, as part of the CloudFormation template.

On the Athena console, choose Launch query editor.
For Data source, choose AwsDataCatalog.
For Database, choose torontoparking.

The tickets table should appear in the Tables section.

Now you can query the table to see its contents.
You can write your own query, or choose Preview Table on the options menu.

This writes a simple SQL query to show us the first 10 rows.
Choose Run to run the query.

As we can see in the query results, the database and table for 2019 parking ticket data have been created and partitioned.

Create the Amazon S3 event crawler

The next step is to create the crawler that detects and crawls only the incrementally updated tables.

On the AWS Glue console, choose Crawlers in the navigation pane.
Choose Create crawler.
For Name, enter a name.
Choose Next.

Now we need to select the data source for the crawler.
Select Yes to indicate that our data is already mapped to our AWS Glue Data Catalog.
Choose Add tables.
For Database, choose torontoparking and for Tables, choose tickets.
Select Crawl based on events.
For Include SQS ARN, enter the ARN you saved from the CloudFormation stack outputs.
Choose Confirm.

You should now see the table populated under Glue tables, with the parameter set as Recrawl by event.
Choose Next.
For Existing IAM role, choose the IAM role created by the CloudFormation template (GlueCrawlerTableRole).
Choose Next.
For Frequency, choose On demand.

You also have the option of choosing a schedule on which the crawler will run regularly.
Choose Next.
Review the configurations and choose Create crawler.

Now that the crawler has been created, we add the 2020 ticketing data to our S3 bucket so that we can test our new crawler. For this step, we use the AWS Command Line Interface (AWS CLI).

To add this data, use the following command:

aws s3 cp s3://aws-bigdata-blog/artifacts/gluenewcrawlerui2/source/year=2020/Parking_Tags_Data_2020.000.csv s3://glue-table-crawler-blog-<YOURACCOUNTNUMBER>/year=2020/Parking_Tags_Data_2020.000.csv

After successful completion of this command, your S3 bucket should contain the 2020 ticketing data and your crawler is ready to run. The terminal should return the following:

copy: s3://aws-bigdata-blog/artifacts/gluenewcrawlerui2/source/year=2020/Parking_Tags_Data_2020.000.csv to s3://glue-table-crawler-blog-<YOURACCOUNTNUMBER>/year=2020/Parking_Tags_Data_2020.000.csv

Run the crawler and verify the updates

Now that the new folder has been created, we run the crawler to detect the changes in the table and partitions.
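The new folder is named year=2020, a Hive-style partition folder, which is why the table gains a 2020 partition. As a rough illustration only (this is not the crawler's actual implementation, and the helper name `partition_values` is hypothetical), the following Python sketch shows how partition values can be read from such an object key:

```python
def partition_values(object_key):
    """Return Hive-style partition values (name=value folders) found in an S3 key.

    Illustrative helper only; it is not part of AWS Glue.
    """
    values = {}
    # Ignore the last path segment, which is the file name.
    for segment in object_key.split("/")[:-1]:
        if "=" in segment:
            name, value = segment.split("=", 1)
            values[name] = value
    return values


# Key of the file copied in the previous step, relative to the bucket root.
example = partition_values("year=2020/Parking_Tags_Data_2020.000.csv")
```

For the file copied earlier, the helper returns a single partition value, year mapped to 2020.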

Navigate to your crawler on the AWS Glue console and choose Run crawler.

After running the crawler, you should see that it added the 2020 data to the tickets table.
On the Athena console, we can ensure that the Data Catalog has been updated by adding a where year = 2020 filter to the query.

AWS CLI option

You can also create the crawler using the AWS CLI. For more information, refer to create-crawler.
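If you prefer the AWS SDK for Python (boto3), the equivalent request could look roughly like the following sketch. It assumes the database, table, and role names used in this post. The crawler name and queue ARN in the usage example are placeholders, and the schema change policy is an assumption you should adapt to your needs. The request is built separately from the call so you can inspect it before sending it.

```python
def build_event_crawler_request(name, role, database, table, queue_arn):
    """Build keyword arguments for the Glue create_crawler call.

    The crawler targets an existing Data Catalog table and recrawls
    based on the S3 events delivered to the given SQS queue.
    """
    return {
        "Name": name,
        "Role": role,
        "Targets": {
            "CatalogTargets": [
                {
                    "DatabaseName": database,
                    "Tables": [table],
                    "EventQueueArn": queue_arn,
                }
            ]
        },
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
        # Assumption: update the table in place and only log deleted objects.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


def create_event_crawler(request):
    """Send the request to AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the builder works without boto3 installed

    return boto3.client("glue").create_crawler(**request)


request = build_event_crawler_request(
    name="tickets-event-crawler",  # hypothetical crawler name
    role="GlueCrawlerTableRole",
    database="torontoparking",
    table="tickets",
    queue_arn="arn:aws:sqs:us-east-1:111122223333:example-queue",  # placeholder
)
```

Calling create_event_crawler(request) would then create the crawler in your account.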

Clean up

Conclusion

You can use AWS Glue crawlers to discover datasets, extract schema information, and populate the AWS Glue Data Catalog. In this post, we provided a CloudFormation template to set up AWS Glue crawlers to use Amazon S3 event notifications on existing Data Catalog tables, which reduces the time and cost needed to incrementally process table data updates in the Data Catalog.

With this feature, incremental crawling can now be offloaded from data pipelines to the scheduled AWS Glue crawler, reducing cost. This alleviates the need for full crawls, thereby reducing crawl times and Data Processing Units (DPUs) required to run the crawler. This is especially useful for customers that have S3 buckets with event notifications enabled and want to continuously catalog new changes.

To learn more about this feature, refer to Accelerating crawls using Amazon S3 event notifications.

Special thanks to everyone who contributed to this crawler feature launch: Theo Xu, Jessica Cheng, Arvin Mohanty, and Joseph Barlan.

About the authors

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs.

Aayzed Tanweer is a Solutions Architect working with startup customers in the FinTech space, with a special focus on analytics services. Originally hailing from Toronto, he recently moved to New York City, where he enjoys eating his way through the city and exploring its many peculiar nooks and crannies.

Code versioning using AWS Glue Studio and GitHub

2022-10-14 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/code-versioning-using-aws-glue-studio-and-github/

AWS Glue now offers integration with Git, an open-source version control system widely used across the developer community. Thanks to this integration, you can incorporate your existing DevOps practices on AWS Glue jobs. AWS Glue is a serverless data integration service that helps you create jobs based on Apache Spark or Python to perform extract, transform, and load (ETL) tasks on datasets of almost any size.

Git integration in AWS Glue works for all AWS Glue job types, both visual and code-based. It offers built-in integration with both GitHub and AWS CodeCommit, and makes it easier to use automation tools like Jenkins and AWS CodeDeploy to deploy AWS Glue jobs. AWS Glue Studio’s visual editor now also supports parameterizing data sources and targets for transparent deployments between environments.

Overview of solution

To demonstrate how to integrate AWS Glue Studio with a code hosting platform for version control and collaboration, we use the Toronto parking tickets dataset, specifically the data about parking tickets issued in the city of Toronto in 2019. The goal is to create a job to filter parking tickets based on a specific category and push the code to a GitHub repo for version control. After the job is uploaded on the repository, we make some changes to the code and pull the changes back to the AWS Glue job.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
A GitHub account
An AWS Identity and Access Management (IAM) user with access to the following services:
- AWS CloudFormation
- AWS Glue
- Amazon Simple Storage Service (Amazon S3)

Launch your CloudFormation stack

To create your resources for this use case, complete the following steps:

Launch your CloudFormation stack in us-east-1:
Under Parameters, for paramBucketName, enter a name for your S3 bucket (include your account number).
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.
Wait until the creation of the stack is complete, as shown on the AWS CloudFormation console.

Launching this stack creates AWS resources. You need the following resources from the Outputs tab for the next steps:

CFNGlueRole – The IAM role to run AWS Glue jobs
S3Bucket – The name of the S3 bucket to store solution-related files
CFNDatabaseBlog – The AWS Glue database to store the table related to this post
CFNTableTickets – The AWS Glue table to use as part of the sample job

Configure the GitHub repository

We use GitHub as the source control system for this post. In order to use it, you need a GitHub account. After the account is created, you need to create the following components:

GitHub repository – Create a repository and name it glue-ver-blog. For instructions, refer to Create a repo.
Branch – Create a branch and name it develop. For instructions, refer to Managing branches.
Personal access token – For instructions, refer to Creating a personal access token. Make sure to keep the personal access token handy because you use it in later steps.

Create an AWS Glue Studio job

Now that the infrastructure is set up, let’s author an AWS Glue job in our account. Complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Select Visual job with blank canvas and choose Create.
Enter a name for the job using the title editor. For example, aws-glue-git-demo-job.
On the Visual tab, choose Source and then choose AWS Glue Data Catalog.
For Database, choose torontoparking and for Table, choose tickets.
Choose Transform and then Filter.
Add a filter by infraction_description and set the value to PARK ON PRIVATE PROPERTY.
Choose Target and then choose Amazon S3.
For Format, choose Parquet.
For S3 Target Location, enter s3://glue-version-blog-<YOUR ACCOUNT NUMBER>/output/.
For Data Catalog update options, select Do not update the Data Catalog.
Go to the Script tab to verify that a script has been generated.
Go to the Job Details tab to make sure that the role GlueBlogRole is selected and leave everything else with the default values.
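To make the Filter step concrete, the following is a minimal pure-Python sketch of the same row-level logic. It is not the Apache Spark script that AWS Glue Studio generates, and the sample rows are made up for illustration. Only the column name and filter value come from the steps above.

```python
def filter_tickets(rows, description="PARK ON PRIVATE PROPERTY"):
    """Keep only the rows whose infraction_description equals the given value."""
    return [row for row in rows if row.get("infraction_description") == description]


# Made-up sample rows; the real dataset has more columns.
sample_rows = [
    {"tag_number_masked": "***1", "infraction_description": "PARK ON PRIVATE PROPERTY"},
    {"tag_number_masked": "***2", "infraction_description": "PARK-SIGNED HWY-PROHIBIT DY/TM"},
    {"tag_number_masked": "***3", "infraction_description": "PARK ON PRIVATE PROPERTY"},
]

private_property_tickets = filter_tickets(sample_rows)
```

In the job itself, the rows that pass the filter are the ones written to the Amazon S3 target in Parquet format.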

Because the catalog table names in the production and development environment may be different, AWS Glue Studio now allows you to parameterize visual jobs. To do so, perform the following steps:
On the Job details tab, scroll to the Job parameters section under Advanced properties.
Create the --source.database.name parameter and set the value to torontoparking.
Create the --source.table.name parameter and set the value to tickets.
Go to the Visual tab and choose the AWS Glue Data Catalog node. Notice that under each of the database and table selection options is a new expandable section called Use runtime parameters.
The runtime parameters are automatically populated with the parameters previously created. Choose Apply to apply the default values for these parameters.
Go to the Script tab to review the script. AWS Glue Studio code generation automatically picks up the parameters to resolve and then makes the appropriate references in the script so that the parameters can be used.
Now the job is ready to be pushed into the develop branch of our version control system.
On the Version Control tab, for Version control system, choose GitHub.
For Personal access token, enter your GitHub token.
For Repository owner, enter the owner of your GitHub account.
In the Repository configuration section, for Repository, choose glue-ver-blog.
For Branch, choose develop.
For Folder, leave it blank.
Choose Save to save the job.
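The job parameters defined earlier reach the script as command-line arguments in the form --name value. In a real job, the generated script resolves them with getResolvedOptions from awsglue.utils. The following sketch is only a standard-library stand-in (the function name `resolve_job_parameters` is hypothetical) to show the idea of looking up each parameter by name:

```python
def resolve_job_parameters(argv, names):
    """Return {name: value} for each '--name value' pair found in argv.

    Illustrative stand-in for parameter resolution; not the AWS Glue API.
    """
    resolved = {}
    for name in names:
        flag = f"--{name}"
        if flag not in argv:
            raise KeyError(f"missing job parameter: {flag}")
        position = argv.index(flag)
        if position + 1 >= len(argv):
            raise ValueError(f"job parameter {flag} has no value")
        resolved[name] = argv[position + 1]
    return resolved


# Arguments as they might be passed to the job script (illustrative).
example_argv = [
    "script.py",
    "--JOB_NAME", "aws-glue-git-demo-job",
    "--source.database.name", "torontoparking",
    "--source.table.name", "tickets",
]
params = resolve_job_parameters(
    example_argv, ["source.database.name", "source.table.name"]
)
```

Because the database and table are resolved at run time, the same job definition can point to different catalog tables in development and production.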

Push to the repository

Now the job can be pushed to the remote repository.

On the Actions menu, choose Push to repository.
Choose Confirm to confirm the operation.

After the operation succeeds, the page reloads to reflect the latest information from the version control system. A notification shows the latest available commit and links you to the commit on GitHub.
Choose the commit link to go to the repository on GitHub.

You have successfully created your first commit to GitHub from AWS Glue Studio!

Pull from the repository

Now that we have committed the AWS Glue job to GitHub, it’s time to see how we can pull changes using AWS Glue Studio. For this demo, we make a small modification in our example job using the GitHub UI and then pull the changes using AWS Glue Studio.

On GitHub, choose the develop branch.
Choose the aws-glue-git-demo-job folder.
Choose the aws-glue-git-demo-job.json file.
Choose the edit icon.
Set the MaxRetries parameter to 1.
Choose Commit changes.
Return to the AWS Glue console and on the Actions menu, choose Pull from repository.
Choose Confirm.

Notice that the commit ID has changed.

On the Job details tab, you can see that the value for Number of retries is 1.

Clean up

To avoid incurring future charges, and to clean up unused roles and policies, delete the resources you created: the datasets, CloudFormation stack, S3 bucket, AWS Glue job, AWS Glue database, and AWS Glue table.

Conclusion

This post showed how to integrate AWS Glue with GitHub, but this is only the beginning—now you can use the most popular functionalities offered by Git.

To learn more and get started using the AWS Glue Studio Git integration, refer to Configuring Git integration in AWS Glue.

About the authors

Daiyan Alamgir is a Principal Frontend Engineer on AWS Glue based in New York. He leads the AWS Glue UI team and is focused on building interactive web-based applications for data analysts and engineers to address their data integration use cases.

Detect and process sensitive data using AWS Glue Studio

2022-09-20 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/detect-and-process-sensitive-data-using-aws-glue-studio/

Data lakes offer the possibility of sharing diverse types of data with different teams and roles to cover numerous use cases. This is very important in order to implement a data democratization strategy and incentivize the collaboration between lines of business. When a data lake is being designed, one of the most important aspects to consider is data privacy. Without it, sensitive information could be accessed by the wrong team, which may affect the reliability of a data platform. However, identifying sensitive data inside a data lake could represent a challenge due to the diversity of the data and also its volume.

Earlier this year, AWS Glue announced the new sensitive data detection and processing feature to help you identify and protect sensitive information in a straightforward way using AWS Glue Studio. This feature uses pattern matching and machine learning to automatically recognize personally identifiable information (PII) and other sensitive data at the column or cell level as part of AWS Glue jobs.

Sensitive data detection in AWS Glue identifies a variety of sensitive data like phone and credit card numbers, and also offers the option to create custom identification patterns or entities to cover your specific use cases. Additionally, it helps you take action, such as creating a new column that contains any sensitive data detected as part of a row or redacting the sensitive information before writing records into a data lake.

This post shows how to create an AWS Glue job that identifies sensitive data at the row level. We also show how to create a custom identification pattern to identify case-specific entities.

Overview of solution

To demonstrate how to create an AWS Glue job to identify sensitive data, we use a test dataset with customer comments that contain private data like Social Security number (SSN), phone number, and bank account number. The goal is to create a job that automatically identifies the sensitive data and triggers an action to redact it.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
An AWS Identity and Access Management (IAM) user with access to the following services:
- Amazon Athena
- AWS CloudFormation
- AWS Glue
- Amazon Simple Storage Service (Amazon S3)

Launch your CloudFormation stack

To create your resources for this use case, complete the following steps:

Launch your CloudFormation stack in us-east-1:
Under Parameters, enter a name for your S3 bucket (include your account number).
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.
Wait until the creation of the stack is complete, as shown on the AWS CloudFormation console.

Launching this stack creates AWS resources. You need the following resources from the Outputs tab for the next steps:

GlueSenRole – The IAM role to run AWS Glue jobs
BucketName – The name of the S3 bucket to store solution-related files
GlueDatabase – The AWS Glue database to store the table related to this post

Create and run an AWS Glue job

Let’s first create the dataset that is going to be used as the source of the AWS Glue job:

Open AWS CloudShell.
Run the following command:
```
aws s3 cp s3://aws-bigdata-blog/artifacts/gluesendata/sourcedata/customer_comments.csv s3://glue-sendata-blog-<YOUR ACCOUNT NUMBER>/customer_comments/customer_comments.csv
```
This action copies the dataset that is going to be used as the input for the AWS Glue job covered in this post.

Now, let’s create the AWS Glue job.

On the AWS Glue Studio console, choose Jobs in the navigation pane.
Select Visual with blank canvas.
Choose the Job Details tab to configure the job.
For Name, enter GlueSenJob.
For IAM Role, choose the role GlueSenDataBlogRole.
For Glue version, choose Glue 3.0.
For Job bookmark, choose Disable.
Choose Save.
After the job is saved, choose the Visual tab and on the Source menu, choose Amazon S3.
On the Data source properties - S3 tab, for S3 source type, select S3 location.
Add the S3 location of the file that you copied previously using CloudShell.
Choose Infer schema.

This last action infers the schema and file type of the source for this post, as you can see in the following screenshot.

Now, let’s see what the data looks like.

On the Data preview tab, choose Start data preview session.
For IAM role, choose the role GlueSenDataBlogRole.
Choose Confirm.

This last step may take a couple of minutes to run.

When you review the data, you can see that the customer comments contain sensitive data like phone numbers, email addresses, and SSNs.

Now let’s identify the sensitive data in the comments dataset and mask it.

On the Transform menu, choose Detect PII.

The AWS Glue sensitive data identification feature allows you to find sensitive data at the row and column level, which covers a wide range of use cases. For this post, because we scan comments made by customers, we use the row-level scan.

On the Transform tab, select Find sensitive data in each row.
For Types of sensitive information to detect, select Select specific patterns.

Now we need to select the entities or patterns that are going to be identified by the job.

For Selected patterns, choose Browse.
Select the following patterns:
1. Credit Card
2. Email Address
3. IP Address
4. Mac Address
5. Person’s Name
6. Social Security Number (SSN)
7. US Passport
8. US Phone
9. US/Canada bank account
Choose Confirm.

After the sensitive data is identified, AWS Glue offers two options:

Enrich data with detection results – Adds a new column to the dataset with the list of the entities or patterns that were identified in that specific row.
Redact detected text – Replaces the sensitive data with a custom string. For this post, we use the redaction option.
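The difference between the two actions can be sketched in plain Python. This is only an illustration: the two regular expressions are simplified stand-ins, not the patterns AWS Glue uses, and the output column name DetectedEntities is a hypothetical choice for this example.

```python
import re

# Simplified, illustrative patterns; AWS Glue's built-in detection is more thorough.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def enrich(row, column):
    """Add a column listing which patterns were found in the given text column."""
    found = [name for name, pattern in PATTERNS.items() if pattern.search(row[column])]
    return {**row, "DetectedEntities": found}


def redact(row, column, replacement="####"):
    """Replace every match of every pattern with the replacement text."""
    text = row[column]
    for pattern in PATTERNS.values():
        text = pattern.sub(replacement, text)
    return {**row, column: text}


row = {"comment": "SSN 123-45-6789, mail me at ana@example.com"}
enriched = enrich(row, "comment")
redacted = redact(row, "comment")
```

Enrichment leaves the original text untouched and records what was found, whereas redaction changes the text itself, which is why redaction is the safer choice when the output is shared with data consumers.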

For Actions, select Redact detected text.
For Replacement text, enter ####.

Let’s see how the dataset looks now.

Check the result data on the Data preview tab.

As you can see, the majority of the sensitive data was redacted, but there is a number on row 11 that isn’t masked. This is because it’s a Canadian permanent resident number, and this pattern isn’t part of the ones that the sensitive data identification feature offers. However, we can add a custom pattern to identify this number.

On the Transform tab, for Selected patterns, choose Create new.

This action opens the Create detection pattern window, where we create the custom pattern to identify the Canadian permanent resident number.

For Pattern name, enter Can_PR_Number.
For Expression, enter the regular expression [P]+[D]+[0]\d\d\d\d\d\d
Choose Validate.
Wait until you get the validation message, then choose Create pattern.
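Before creating the pattern, you can try the expression locally. The following sketch uses Python's re module, which may differ in small ways from the regular expression engine AWS Glue uses, and the sample comment is made up.

```python
import re

# The expression entered for the Can_PR_Number custom pattern.
CAN_PR_NUMBER = re.compile(r"[P]+[D]+[0]\d\d\d\d\d\d")


def redact_pr_number(text, replacement="####"):
    """Replace every match of the custom pattern with the replacement text."""
    return CAN_PR_NUMBER.sub(replacement, text)


# Made-up comment containing a value in the expected format.
sample = "My permanent resident number is PD0123456, please update my file"
masked = redact_pr_number(sample)
```

The expression matches one or more P characters, one or more D characters, a 0, and then six digits, so a value such as PD1123456 is left untouched.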

Now you can see the new pattern listed under Custom patterns.

On the AWS Glue Studio console, for Selected patterns, choose Browse.

Now you can see Can_PR_Number as part of the pattern list.

Select Can_PR_Number and choose Confirm.

On the Data preview tab, you can see that the Canadian permanent resident number has been redacted.

Let’s add a destination for the dataset with redacted information.

On the Target menu, choose Amazon S3.
On the Data target properties - S3 tab, for Format, choose Parquet.
For S3 Target Location, enter s3://glue-sendata-blog-<YOUR ACCOUNT NUMBER>/output/redacted_comments/.
For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
For Database, choose gluesenblog.
For Table name, enter custcomredacted.
Choose Save, then choose Run.

You can view the job run details on the Runs tab.

Wait until the job is complete.

Query the dataset

Now let’s see what the final dataset looks like. To do so, we query the data with Athena. As part of this post, we assume that a query result location for Athena is already configured; if not, refer to Working with query results, recent queries, and output files.

On the Athena console, open the query editor.
For Database, choose the gluesenblog database.

Run the following query:

SELECT * FROM "gluesenblog"."custcomredacted" limit 15;

Verify the results; you can observe that all the sensitive data is redacted.

Clean up

To avoid incurring future charges, and to clean up unused roles and policies, delete the resources you created: the datasets, CloudFormation stack, S3 bucket, AWS Glue job, AWS Glue database, and AWS Glue table.

Conclusion

AWS Glue sensitive data detection offers an easy way to identify and process private data, without coding. This feature allows you to detect and redact sensitive data when it’s ingested into a data lake, enforcing data privacy before the data is available to data consumers. AWS Glue sensitive data detection is generally available in all Regions that support AWS Glue.

To learn more and get started using AWS Glue sensitive data detection, refer to Detect and process sensitive data.

About the author

Set up and monitor AWS Glue crawlers using the enhanced AWS Glue UI and crawler history

2022-08-26 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/set-up-and-monitor-aws-glue-crawlers-using-the-enhanced-aws-glue-ui-and-crawler-history/

A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. Setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks. AWS Glue and AWS Lake Formation make it easy to build, secure, and manage data lakes. As data from existing data stores is moved in the data lake, there is a need to catalog the data to prepare it for analytics from services such as Amazon Athena.

AWS Glue crawlers are a popular way to populate the AWS Glue Data Catalog. They are a key component that allows you to connect to data sources or targets, use different classifiers to determine the logical schema for the data, and create metadata in the Data Catalog. You can run crawlers on a schedule, on demand, or triggered based on an Amazon Simple Storage Service (Amazon S3) event to ensure that the Data Catalog is up to date. Using S3 event notifications can reduce the cost and time a crawler needs to update large and frequently changing tables.

The AWS Glue crawlers UI has been redesigned to offer a better user experience, and new functionalities have been added. This new UI provides easier setup of crawlers across multiple sources, including Amazon S3, Amazon DynamoDB, Amazon Redshift, Amazon Aurora, Amazon DocumentDB (with MongoDB compatibility), Delta Lake, MariaDB, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, and MongoDB. A new AWS Glue crawler history feature has also been launched, which provides a convenient way to view crawler runs, their schedules, data sources, and tags. For each crawl, the crawler history offers a summary of data modifications such as changes in the database schema or Amazon S3 partition changes. Crawler history also reports DPU hours, which can reduce the time needed to analyze and debug crawler operations and costs.

This post shows how to create an AWS Glue crawler that supports S3 event notification using the new UI. We also show how to navigate through the new crawler history section and get valuable insights.

Overview of solution

To demonstrate how to create an AWS Glue crawler using the new UI, we use the Toronto parking tickets dataset, specifically the data about parking tickets issued in the city of Toronto between 2017 and 2018. The goal is to create a crawler based on S3 events, run it, and explore the information shown in the UI about the run of this crawler.

As mentioned before, instead of crawling all the subfolders on Amazon S3, we use an S3 event-based approach. This helps improve the crawl time by using S3 events to identify the changes between two crawls by listing all the files from the subfolder that triggered the event instead of listing the full Amazon S3 target. For this post, we create an S3 event, an Amazon Simple Notification Service (Amazon SNS) topic, and an Amazon Simple Queue Service (Amazon SQS) queue.

The following diagram illustrates our solution architecture.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
An AWS Identity and Access Management (IAM) user with access to the following services:
- Amazon S3
- AWS Glue
- Amazon SNS
- Amazon SQS
- AWS CloudFormation

Launch your CloudFormation stack

To create your resources for this use case, complete the following steps:

Launch your CloudFormation stack in us-east-1:
Under Parameters, enter a name for your S3 bucket (include your account number).
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.
Wait until the creation of the stack is complete, as shown on the AWS CloudFormation console.
On the stack’s Outputs tab, take note of the SQS queue ARN—we use it during the crawler creation process.

Launching this stack creates AWS resources. You need the following resources from the Outputs tab for the next steps:

GlueCrawlerRole – The IAM role to run AWS Glue jobs
BucketName – The name of the S3 bucket to store solution-related files
GlueSNSTopic – The SNS topic, which we use as the target for the S3 event
SQSArn – The SQS queue ARN; this queue is going to be consumed by the AWS Glue crawler

Create an AWS Glue crawler

Let’s first create the dataset that is going to be used as the source of the AWS Glue crawler:

Open AWS CloudShell.
Run the following command:
```
aws s3 cp s3://aws-bigdata-blog/artifacts/gluenewcrawlerui/sourcedata/year=2017/Parking_Tags_Data_2017_2.csv s3://glue-crawler-blog-<YOUR ACCOUNT NUMBER>/torontotickets/year=2017/Parking_Tags_Data_2017_2.csv
```
This action triggers an S3 event that sends a message to the SNS topic that you created using the CloudFormation template. This message is consumed by an SQS queue that will be input for the AWS Glue crawler.
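To see what flows through the queue, the following Python sketch unwraps such a message and lists the folders that changed. It assumes the SNS subscription does not use raw message delivery, so the SQS body is an SNS envelope whose Message field contains the S3 event as a JSON string. The bucket name in the sample is a placeholder, and the function name `changed_folders` is hypothetical. The crawler does this work for you; the sketch is only meant to illustrate why an event-based crawl can skip the rest of the bucket.

```python
import json
from urllib.parse import unquote_plus


def changed_folders(sqs_body):
    """Return the sorted S3 folders referenced by an S3 event wrapped in an SNS envelope."""
    envelope = json.loads(sqs_body)
    s3_event = json.loads(envelope["Message"])
    folders = set()
    for record in s3_event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications (for example, = becomes %3D).
        key = unquote_plus(record["s3"]["object"]["key"])
        prefix = key.rsplit("/", 1)[0] + "/" if "/" in key else ""
        folders.add(f"s3://{bucket}/{prefix}")
    return sorted(folders)


# Minimal sample message with only the fields this sketch reads.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "glue-crawler-blog-111122223333"},  # placeholder
                "object": {"key": "torontotickets/year%3D2017/Parking_Tags_Data_2017_2.csv"},
            },
        }
    ]
}
sample_body = json.dumps({"Type": "Notification", "Message": json.dumps(sample_event)})
folders = changed_folders(sample_body)
```

For the sample message, the only folder returned is the year=2017 folder under torontotickets, which is the only location an event-based crawl needs to list.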

Now, let’s create the AWS Glue crawler.
On the AWS Glue console, choose Crawlers in the navigation pane.
Choose Create crawler.
For Name, enter a name (for example, BlogPostCrawler).
Choose Next.
For Is your data already mapped to Glue tables, select Not yet.
In the Data sources section, choose Add data source.

For this post, you use an S3 dataset as a source.
For Data source, choose S3.
For Location of S3 data, select In this account.
For S3 path, enter the path to the S3 bucket you created with the CloudFormation template (s3://glue-crawler-blog-<YOUR ACCOUNT NUMBER>/torontotickets/).
For Subsequent crawler runs, select Crawl based on events.
Enter the SQS queue ARN you created earlier.
Choose Add an S3 data source.
Choose Next.
For Existing IAM role, choose the role you created (GlueCrawlerBlogRole).
Choose Next.

Now let’s create an AWS Glue database.
Under Target database, choose Add database.
For Name, enter blogdb.
For Location, choose the S3 bucket created by the CloudFormation template.
Choose Create database.
On the Set output and scheduling page, for Target database, choose the database you just created (blogdb).
For Table name prefix, enter blog.
For Maximum table threshold, you can optionally set a limit for the number of tables that this crawler can scan. For this post, we leave this option blank.
For Frequency, choose On demand.
Choose Next.
Review the configuration and choose Create crawler.

Run the AWS Glue crawler

To run the crawler, navigate to the crawler on the AWS Glue console.

Choose Run crawler.

On the Crawler runs tab, you can see the current run of the crawler.

Explore the crawler run history data

When the crawler is complete, you can see the following details:

Duration – The exact duration time of the crawler run
DPU hours – The number of DPU hours spent during the crawler run; this is very useful to calculate costs
Table changes – The changes applied to the table, like new columns or partitions
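As a small worked example of the cost calculation, the following sketch multiplies the DPU hours reported for a run by a price per DPU-hour. The rate and the DPU hours in the usage lines are illustrative assumptions, not values from this walkthrough. Check the AWS Glue pricing page for the rate and billing rules that apply to your Region.

```python
def crawl_cost(dpu_hours, price_per_dpu_hour):
    """Estimate the cost of a crawler run from its reported DPU hours."""
    if dpu_hours < 0 or price_per_dpu_hour < 0:
        raise ValueError("dpu_hours and price_per_dpu_hour must be non-negative")
    return round(dpu_hours * price_per_dpu_hour, 4)


# Illustrative values only: two runs, the second using half the DPU hours.
ASSUMED_PRICE_PER_DPU_HOUR = 0.44
first_run_cost = crawl_cost(0.5, ASSUMED_PRICE_PER_DPU_HOUR)
second_run_cost = crawl_cost(0.25, ASSUMED_PRICE_PER_DPU_HOUR)
```

Because the estimate is linear in DPU hours, a run that uses half the DPU hours costs half as much under this simple model.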

Choose Table changes to see the crawler run summary.

You can see the table blogtorontotickets was created, and also a 2017 partition.

Let’s add more data to the S3 bucket to see how the crawler processes this change.

Open CloudShell.

Run the following command:

aws s3 cp s3://aws-bigdata-blog/artifacts/gluenewcrawlerui/sourcedata/year=2018/Parking_Tags_Data_2018_1.csv s3://glue-crawler-blog-<YOUR ACCOUNT NUMBER>/torontotickets/year=2018/Parking_Tags_Data_2018_1.csv

Choose Run crawler to run the crawler one more time.

You can see the second run of the crawler listed.

Note that the DPU hours were reduced by more than half; this is because only one partition was scanned and added. Having an event-based crawler helps reduce runtime and cost.

You can choose the Table changes information of the second run to see more details.

Note that under Partitions added, the 2018 partition is listed as created.

Additional notes

Keep in mind the following considerations:

Crawler history is supported for crawls that have occurred since the launch date of the crawler history feature, and only retains up to 12 months of crawls. Older crawls will not be returned.
To set up a crawler using AWS CloudFormation, you can use the following template.
You can get all the crawls of a specified crawler by using the list-crawls API.
You can update existing crawlers with a single Amazon S3 target to use this new feature. You can do this either via the AWS Glue console or by calling the update_crawler API.
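For the last point, a request to switch an existing crawler with a single Amazon S3 target to event mode could be sketched as follows with the AWS SDK for Python (boto3). The crawler name, S3 path, and queue ARN in the usage example are placeholders, and the helper names are hypothetical. The request is built separately from the call so you can review it first.

```python
def build_event_mode_update(crawler_name, s3_path, queue_arn):
    """Build keyword arguments for the Glue update_crawler call."""
    return {
        "Name": crawler_name,
        "Targets": {
            "S3Targets": [{"Path": s3_path, "EventQueueArn": queue_arn}]
        },
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    }


def apply_update(request):
    """Send the request to AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the builder works without boto3 installed

    return boto3.client("glue").update_crawler(**request)


update_request = build_event_mode_update(
    crawler_name="BlogPostCrawler",
    s3_path="s3://glue-crawler-blog-111122223333/torontotickets/",  # placeholder bucket
    queue_arn="arn:aws:sqs:us-east-1:111122223333:example-queue",  # placeholder
)
```

Calling apply_update(update_request) would then update the crawler so that subsequent runs are driven by the events in the queue.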

Clean up

Conclusion

You can use AWS Glue crawlers to discover datasets, extract schema information, and populate the AWS Glue Data Catalog. AWS Glue crawlers now provide an easier-to-use UI workflow to set up crawlers and also provide metrics associated with past crawler runs to simplify monitoring and auditing. In this post, we provided a CloudFormation template to set up AWS Glue crawlers to use S3 event notifications, which reduces the time and cost needed to incrementally process table data updates in the AWS Glue Data Catalog. We also showed you how to monitor and understand the cost of crawlers.

Special thanks to everyone who contributed to the crawler history launch: Theo Xu, Jessica Cheng and Joseph Barlan.

Happy crawling!

About the authors

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn.

Making ETL easier with AWS Glue Studio

2020-09-25 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/making-etl-easier-with-aws-glue-studio/

AWS Glue Studio is an easy-to-use graphical interface that speeds up the process of authoring, running, and monitoring extract, transform, and load (ETL) jobs in AWS Glue. The visual interface allows those who don’t know Apache Spark to design jobs without coding experience and accelerates the process for those who do.

AWS Glue Studio was designed to help you create ETL jobs easily. After you design a job in the graphical interface, it generates Apache Spark code for you, abstracting users from the challenges of coding. When the job is ready, you can run it and monitor the job status using the integrated UI.

AWS Glue Studio supports different types of data sources, both structured and semi-structured, and offers data processing in real time and batch. You can extract data from sources like Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), Amazon Kinesis, and Apache Kafka. It also offers Amazon S3 and tables defined in the AWS Glue Data Catalog as destinations.

This post shows you how to create an ETL job to extract, filter, join, and aggregate data easily using AWS Glue Studio.

About this blog post
Time to read	15 minutes
Time to complete	45 minutes
Cost to complete (estimated)	Amazon S3: $0.023; AWS Glue: $0.036; AWS Identity and Access Management: $0; Total cost: $0.059
Learning level	Intermediate (200)
Services used	AWS Glue, Amazon S3, AWS Identity and Access Management

Overview of solution

To demonstrate how to create an ETL job using AWS Glue Studio, we use the Toronto parking tickets dataset, specifically the data about parking tickets issued in the city of Toronto in 2018, and the trials dataset, which contains all the information about the trials related to those parking tickets. The goal is to filter, join, and aggregate the two datasets to get the number of parking tickets handled per court in the city of Toronto during that year.

Prerequisites

For this walkthrough, you should have an AWS account. For this post, you launch the required AWS resources using AWS CloudFormation in the us-east-1 Region. If you haven’t signed up for AWS, complete the following tasks:

Create an account.
Create an AWS Identity and Access Management (IAM) user. For instructions, see Create an IAM User.

Important: If the AWS account you use to follow this guide uses AWS Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user who is both a data lake administrator and a database creator, as described in the documentation.

Launching your CloudFormation stack

To create your resources for this use case, complete the following steps:

Launch your stack in us-east-1:
Select the I acknowledge that AWS CloudFormation might create IAM resources with custom names option.
Choose Create stack.

Launching this stack creates AWS resources. The following resources shown in the AWS CloudFormation output are the ones you need in the next steps:

Key – Description
AWSGlueStudioRole – IAM role to run AWS Glue jobs
AWSGlueStudioS3Bucket – Name of the S3 bucket to store blog-related files
AWSGlueStudioTicketsYYZDB – AWS Glue Data Catalog database
AWSGlueStudioTableTickets – Data Catalog table to use as a source
AWSGlueStudioTableTrials – Data Catalog table to use as a source
AWSGlueStudioParkingTicketCount – Data Catalog table to use as the destination
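
If you prefer to read these outputs programmatically rather than from the console, the following is a minimal sketch using boto3. The stack name is whatever you chose when launching the stack (it is not specified in this post), so it is a parameter here.

```python
def outputs_to_dict(stack_description):
    """Flatten one describe_stacks 'Stacks' entry into {OutputKey: OutputValue}."""
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stack_description.get("Outputs", [])
    }


def fetch_stack_outputs(stack_name, region="us-east-1"):
    # Imported lazily so outputs_to_dict can be used without boto3 installed.
    import boto3

    cfn = boto3.client("cloudformation", region_name=region)
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    return outputs_to_dict(stack)
```

For example, fetch_stack_outputs("my-stack")["AWSGlueStudioRole"] would return the role listed above, where my-stack is a placeholder name.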

Creating a job

A job is the AWS Glue component that allows the implementation of business logic to transform data as part of the ETL process. For more information, see Adding Jobs in AWS Glue.

To create an AWS Glue job using AWS Glue Studio, complete the following steps:

On the AWS Management Console, choose Services.
Under Analytics, choose AWS Glue.
In the navigation pane, choose AWS Glue Studio.
On the AWS Glue Studio home page, choose Create and manage jobs.

AWS Glue Studio supports different sources, including Amazon S3, Amazon RDS, Amazon Kinesis, and Apache Kafka. For this post, you use two AWS Glue tables as data sources and one S3 bucket as the destination.

In the Create Job section, select Blank graph.
Choose Create.

This takes you to the Visual Canvas to create an AWS Glue job.

Change the Job name from Untitled Job to YYZ-Tickets-Job.

You now have an AWS Glue job ready to filter, join, and aggregate data from two different sources.

Adding sources

For this post, you use two AWS Glue tables as data sources: Tickets and Trials, which the CloudFormation template created. The data is located in an external S3 bucket in Parquet format. To add these tables as sources, complete the following steps:

Choose the (+) icon.
On the Node properties tab, for Name, enter Tickets.
For Node type, choose S3.
On the Data Source properties - S3 tab, for Database, choose yyz-tickets.
For Table, choose tickets.
For Partition predicate (optional), leave blank.

Before adding the second data source to the ETL job, be sure that the node you just created isn’t selected.

Choose the (+) icon.
On the Node properties tab, for Name, enter Trials.
For Node type, choose S3.
On the Data Source properties - S3 tab, for Database, choose yyz-tickets.
For Table, choose trials.
For Partition predicate (optional), leave blank.

You now have two AWS Glue tables as the data sources for the AWS Glue job.

Adding transforms

A transform is the AWS Glue Studio component where the data is modified. You can use the built-in transforms that are part of this service or your own custom code. To add transforms, complete the following steps:

Choose the Tickets node.
Choose the (+) icon.
On the Node properties tab, for Name, enter Ticket_Mapping.
For Node type, choose ApplyMapping.
For Node parents, choose Tickets.
On the Transform tab, change the ticket_number data type from decimal to int.
Drop the following columns:
- Location1
- Location2
- Location3
- Location4
- Province

Now you add a second ApplyMapping transform to modify the Trials data source.

Choose the Trials data source node.
Choose the (+) icon.
On the Node properties tab, for Name, enter Trial_Mapping.
For Node type, choose ApplyMapping.
For Node parents, leave at default value (Trials).

On the Transform tab, change the parking_ticket_number data type from long to int.
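
In the generated script, an ApplyMapping node is expressed as a list of (source column, source type, target column, target type) tuples, and any column left out of the list is dropped. The following sketch builds such a list in plain Python. The schema shown is a hypothetical, shortened stand-in for the tickets table, and build_mappings is an illustrative helper, not part of AWS Glue.

```python
def build_mappings(schema, casts=None, drop=()):
    """Return ApplyMapping-style tuples: (source, source_type, target, target_type).

    schema: list of (column_name, type) pairs.
    casts:  {column_name: new_type} for columns whose type should change.
    drop:   column names to leave out; unmapped columns are dropped.
    """
    casts = casts or {}
    return [
        (name, dtype, name, casts.get(name, dtype))
        for name, dtype in schema
        if name not in drop
    ]


# Hypothetical, shortened schema for the tickets table.
tickets_schema = [
    ("ticket_number", "decimal"),
    ("infraction_code", "int"),
    ("location1", "string"),
    ("province", "string"),
]

ticket_mappings = build_mappings(
    tickets_schema,
    casts={"ticket_number": "int"},
    drop={"location1", "province"},
)
```

Inside an AWS Glue job, a list like ticket_mappings is what you would pass as the mappings argument of ApplyMapping.apply.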

Now that you have set the right data types and removed some of the columns, it’s time to join the data sources using the Join transform.

Choose the Ticket_Mapping transform.
Choose the (+) icon.
On the Node properties tab, for Name, enter Join_Ticket_Trial.
For Node type, choose Join.
For Node parents, choose Ticket_Mapping and Trial_Mapping.
On the Transform tab, for Join type, choose Inner join.
For Join conditions, choose Add condition.
For Ticket_Mapping, choose ticket_number.
For Trial_Mapping, choose parking_ticket_number.

Now the two data sources are joined by the ticket_number and parking_ticket_number columns.

Performing data aggregation

In this step, you do some data aggregation to see the number of tickets handled per court in Toronto.

AWS Glue Studio offers the option of adding custom code for those use cases that need a more complex transformation. For this post, we use PySpark code to do the data transformation. It contains Spark SQL code and combines dynamic frames with data frames.

Choose the Join_Ticket_Trial transform.
Choose the (+) icon.
On the Node properties tab, for Name, enter Aggregate_Tickets.
For Node type, choose Custom transform.
For Node parents, leave Join_Ticket_Trial selected.
On the Transform tab, for Code block, change the function name from MyTransform to Aggregate_Tickets.

Enter the following code:

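def Aggregate_Tickets(glueContext, dfc) -> DynamicFrameCollection:
    # Convert the first (and only) DynamicFrame in the collection to a Spark DataFrame.
    selected = dfc.select(list(dfc.keys())[0]).toDF()
    # Register it as a temporary view so that it can be queried with Spark SQL.
    selected.createOrReplaceTempView("ticketcount")
    # spark is the session defined at the top level of the generated job script.
    totals = spark.sql("select court_location as location, infraction_description as infraction, count(infraction_code) as total  FROM ticketcount group by infraction_description, infraction_code, court_location order by court_location asc")
    # Wrap the result in a DynamicFrameCollection, which the custom transform must return.
    results = DynamicFrame.fromDF(totals, glueContext, "results")
    return DynamicFrameCollection({"results": results}, glueContext)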
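
To see what the query in this transform returns without running a job, you can try it on a few rows locally. The following sketch uses Python's built-in sqlite3 module instead of Spark SQL (an assumption made only so that it runs anywhere; the query is simple enough to behave the same way), and the rows are made up for illustration.

```python
import sqlite3

QUERY = """
select court_location as location,
       infraction_description as infraction,
       count(infraction_code) as total
FROM ticketcount
group by infraction_description, infraction_code, court_location
order by court_location asc
"""

# Hypothetical rows standing in for the joined tickets and trials data.
rows = [
    ("Court A", "PARK ON PRIVATE PROPERTY", 3),
    ("Court A", "PARK ON PRIVATE PROPERTY", 3),
    ("Court A", "PARK - SIGNED HIGHWAY", 29),
    ("Court B", "PARK ON PRIVATE PROPERTY", 3),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table ticketcount "
    "(court_location text, infraction_description text, infraction_code integer)"
)
conn.executemany("insert into ticketcount values (?, ?, ?)", rows)
# One row per (court, infraction) pair with the number of matching tickets.
totals = conn.execute(QUERY).fetchall()
conn.close()
```

Each tuple in totals holds a court, an infraction description, and the number of tickets in that group. Rows are ordered by court only, so the order of infractions within a court is not guaranteed.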

After adding the custom transformation to the AWS Glue job, you want to store the result of the aggregation in the S3 bucket. To do this, you need a Select from collection transform to read the output from the Aggregate_Tickets node and send it to the destination.

Choose the New node node.
Leave the Transform tab with the default values.
On the Node Properties tab, change the name of the transform to Select_Aggregated_Data.
Leave everything else with the default values.
Choose the Select_Aggregated_Data node.
Choose the (+) icon.
On the Node properties tab, for Name, enter Ticket_Count_Dest.
For Node type, choose S3 in the Data target section.
For Node parents, choose Select_Aggregated_Data.
On the Data Target Properties - S3 tab, for Format, choose Parquet.
For Compression Type, choose GZIP.
For S3 Target Location, enter s3://glue-studio-blog-{Your Account ID as a 12-digit number}/parking_tickets_count/.

The job should look like the following screenshot.

You now have three transforms to do data mapping, filtering, and aggregation.

Configuring the job

When the logic behind the job is complete, you must set the parameters for the job run. In this section, you configure the job by selecting components such as the IAM role and the AWS Glue version you use to run the job.

On the Job details tab, for Description, enter Glue Studio blog post job.
For IAM Role, choose AWSGlueStudioRole (which the CloudFormation template created).
For Job Bookmark, choose Disable.
For Number of retries, optionally enter 1.
Choose Save.
When the job is saved, choose Run.
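
After the job is saved, you can also start a run outside the console. The following is a minimal sketch that assumes a boto3 AWS Glue client is passed in; start_job is an illustrative helper name, and the default job name is the one used in this post.

```python
def start_job(glue_client, job_name="YYZ-Tickets-Job"):
    """Start a run of an existing AWS Glue job and return its run ID.

    glue_client is expected to behave like boto3.client("glue").
    """
    response = glue_client.start_job_run(JobName=job_name)
    return response["JobRunId"]
```

For example, start_job(boto3.client("glue", region_name="us-east-1")) returns the ID of the new run, which you can use to look up its status.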

Monitoring the job

AWS Glue Studio offers a job monitoring dashboard that provides comprehensive information about your jobs. You can get job statistics and see detailed info about the job and the job status when running.

In the AWS Glue Studio navigation panel, choose Monitoring.
Choose the entry with the job name YYZ-Tickets-Job.
To get more details about the job run, choose View run details.
Wait until Run Status changes to Succeeded.
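
The same wait can be scripted. The following sketch polls the run status through a boto3 AWS Glue client that you pass in. The function name, polling interval, and the set of states treated as final are choices made for this example.

```python
import time

# Run states treated as final in this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_job_run(glue_client, job_name, run_id, poll_seconds=30, max_polls=120):
    """Poll get_job_run until the run reaches a final state, then return that state."""
    for _ in range(max_polls):
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} run {run_id} did not finish in time")
```

A return value of SUCCEEDED corresponds to the Succeeded status shown on the Monitoring page.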

You can verify that the job ran successfully on the Amazon Athena console.

On the Athena console, choose the yyz-tickets database.
Choose the … icon next to the parking_tickets_count table (which the CloudFormation template created).

For more information about creating AWS Glue tables, see Defining Tables in the AWS Glue Data Catalog.

Choose Preview table.

As you can see in the following screenshot, the information that the job generated is available, and you can query the number of ticket types per court issued in the city of Toronto in 2018.

Cleaning up

To avoid incurring future charges and to clean up unused roles and policies, delete the resources you created: the CloudFormation stack, S3 bucket, and AWS Glue job.
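
If you want to script the cleanup, the following is a rough sketch. It assumes boto3 clients and an S3 resource are passed in, that the bucket is not versioned, and that the stack, bucket, and job names are the ones you used (the stack name is not specified in this post). The bucket is emptied first because a stack can't delete a bucket that still contains objects.

```python
def clean_up(cfn_client, glue_client, s3_resource, stack_name, bucket_name, job_name):
    """Delete the AWS Glue job, empty the S3 bucket, then delete the stack."""
    # The job was created in AWS Glue Studio, so the stack does not know about it.
    glue_client.delete_job(JobName=job_name)

    # Remove every object so the bucket can be deleted afterward.
    s3_resource.Bucket(bucket_name).objects.all().delete()

    # Deletes the resources that the CloudFormation template created.
    cfn_client.delete_stack(StackName=stack_name)
```

If the bucket is not managed by the stack, delete it separately after it has been emptied.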

Conclusion

In this post, you learned how to use AWS Glue Studio to create an ETL job. You can use AWS Glue Studio to speed up the ETL job creation process and allow different personas to transform data without any previous coding experience. For more information about AWS Glue Studio, see the AWS Glue Studio documentation and What’s New with AWS.

About the author

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he works with customers across Canada to design and build big data solutions.