
Enforce boundaries on AWS Glue interactive sessions

Post Syndicated from Nicolas Jacob Baer original https://aws.amazon.com/blogs/big-data/enforce-boundaries-on-aws-glue-interactive-sessions/

AWS Glue interactive sessions allow engineers to build, test, and run data preparation and analytics workloads in an interactive notebook. Interactive sessions provide isolated development environments, take care of the underlying compute cluster, and can be configured to stop idle resources.

AWS Glue interactive sessions provide recommended default configurations and also allow users to customize a session to meet their needs. For example, you can provision more workers to experiment on a larger dataset or set the idle timeout for long-running workloads. Because these options can change with the workload, you may need to ensure that they stay within specific boundaries and apply a control mechanism.

In this post, we walk through deploying a reusable solution that enforces AWS Glue interactive session limits on three options: connection, number of workers, and maximum idle time. The first option addresses the need for custom inspection and controls on traffic, for example by enforcing that an interactive session runs only inside a VPC. The other two limit costs and usage of AWS Glue resources by setting an upper boundary on the number of workers and the idle time per session. You can further extend the solution to other properties or services within AWS Glue.

Overview of solution

The proposed architecture is built on serverless components and runs whenever a new AWS Glue interactive session is created.

Architecture Diagram of the Solution

The workflow steps are as follows:

  1. A data engineer creates a new AWS Glue interactive session either through the AWS Management Console or in a Jupyter notebook locally.
  2. The session creation is recorded in AWS CloudTrail as a CreateSession event that contains all the information needed to identify and inspect the session as soon as it is initiated.
  3. An Amazon EventBridge rule filters the CloudTrail events and invokes an AWS Lambda function to inspect the CreateSession event.
  4. The Lambda function inspects the CreateSession event and checks all defined boundary conditions. Currently, the boundaries configurable with this solution are limited to the maximum number of workers, the idle timeout in minutes, and enforcement of a connection (a simplified sketch of this check follows the list).
  5. If any of the defined boundary conditions are not met (for example, too many workers are provisioned for the session), then, depending on the provided configuration, the function ends the interactive session immediately and sends an email via Amazon Simple Notification Service (Amazon SNS). If the session hasn’t started yet, the function waits for it to start before taking any action.
  6. If the session was stopped, an email is sent to an SNS topic. Because the interactive session notebook gives no indication of why the session ended, this notification provides the additional context to the data engineers.
  7. If the function fails, the event is recorded in a dead-letter queue in Amazon Simple Queue Service (Amazon SQS). The queue is monitored, and any message in it triggers an Amazon CloudWatch alarm.
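
As an illustration of step 4, the following is a minimal Python sketch of such a boundary check. It is not the implementation from the GitHub repository: the CloudTrail field names, the environment variables, and the omission of the wait-for-ready logic described in step 5 are simplifying assumptions.

import os

import boto3

glue = boto3.client("glue")
sns = boto3.client("sns")

MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "10"))
MAX_IDLE_TIMEOUT_MINUTES = int(os.environ.get("MAX_IDLE_TIMEOUT_MINUTES", "60"))
ENFORCE_VPC_CONNECTION = os.environ.get("ENFORCE_VPC_CONNECTION", "true").lower() == "true"
TOPIC_ARN = os.environ.get("NOTIFICATION_TOPIC_ARN", "")

def handler(event, context):
    # EventBridge forwards the CloudTrail record; the field names below are assumptions.
    params = event["detail"]["requestParameters"]
    session_id = params["id"]

    violations = []
    if params.get("numberOfWorkers", 0) > MAX_WORKERS:
        violations.append("too many workers requested")
    if params.get("idleTimeout", 0) > MAX_IDLE_TIMEOUT_MINUTES:
        violations.append("idle timeout exceeds the allowed maximum")
    if ENFORCE_VPC_CONNECTION and not params.get("connections", {}).get("connections"):
        violations.append("no connection attached, so the session is not bound to a VPC")

    if violations:
        # Stop the out-of-boundary session and notify the data engineers via SNS.
        glue.stop_session(Id=session_id)
        if TOPIC_ARN:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject="AWS Glue interactive session stopped",
                Message=f"Session {session_id} stopped: {', '.join(violations)}",
            )
    return {"sessionId": session_id, "violations": violations}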

The following steps walk you through how to build and deploy the solution. The code is available in the GitHub repo.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Overview of the deployed resources

All the necessary resources are defined in an AWS CloudFormation file located under cfn/template.yaml. To deploy those resources, we use AWS Serverless Application Model (AWS SAM), which enables us to conveniently build and package all the dependencies and also manages the AWS CloudFormation steps for us.

The CloudFormation stack deploys the following resources:

  • A Lambda function with its library, both defined under the directory src/functions. This function implements the control: it validates that the session was started within the defined limits.
  • An EventBridge rule. The rule listens for CloudTrail events and, when a new interactive session is created, triggers the control Lambda function.
  • An SQS dead-letter queue (DLQ) attached to the Lambda function. This keeps a record of events that triggered a Lambda function failure.
  • Two CloudWatch alarms monitoring the Lambda function failures and the messages in the DLQ.

If notification via email is enabled, two more resources are deployed:

Additionally, AWS CloudFormation deploys all the necessary AWS Identity and Access Management (IAM) roles and policies, and an AWS Key Management Service (AWS KMS) key to ensure that the exchanged data is encrypted.

Deploy the solution

To facilitate the deployment lifecycle, including the setup of the user local environment, we provide a Makefile that describes all the necessary steps. Make sure you have your AWS credentials renewed and have access to your account. For more information, refer to Configuration and credential file settings.

  • Explore the Makefile and adjust the Region and stack name as needed by modifying the values of the variables AWS_REGION and STACK_NAME.
  • Set KILL_SESSION = "True" if you want to immediately stop interactive sessions that are found out of boundaries. Allowed values are True or False; the default is True.
  • Set NOTIFICATION_EMAIL_ADDRESS = <[email protected]> in the Makefile if you want to get notified when a session has been found out of boundaries.
  • Set values for your controls:
    • ENFORCE_VPC_CONNECTION to stop sessions not running inside a VPC (true or false).
    • MAX_WORKERS to set the maximum number of workers for a session (numeric).
    • MAX_IDLE_TIMEOUT_MINUTES to define the maximum idle time for sessions in minutes (numeric).
  • Install all the prerequisite libraries:
    make install-pre-requisites

    These will be installed under a newly created Python virtual environment inside this repository in the directory .venv.

  • Deploy the new stack:
    make deploy

    This command will complete the following tasks:

    • Check if the prerequisites are met.
    • Run pytest unit tests on the Python files.
    • Validate the CloudFormation template.
    • Build the artifacts (Lambda function and Lambda layers).
    • Deploy the resources via AWS SAM.

Test the solution

Refer to Introducing AWS Glue interactive sessions for Jupyter for information about running an interactive session. If you follow the instructions in the post (see the section Run your first code cell and author your AWS Glue notebook), the initialization of the interactive session should fail with an error similar to the following.

Example of code in the cell:

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
sc3 = SparkContext.getOrCreate()
glueContext1 = GlueContext(sc3)
spark = glueContext1.spark_session
job = Job(glueContext1)

Received output:

Authenticating with profile=XXXXXXXX
glue_role_arn defined by user: arn:aws:iam::XXXXXXXXXX:role/XXXXXXXX
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: XXXXXXXXXXXXX
Applying the following default arguments:
--glue_kernel_version 0.35
--enable-glue-datacatalog true
Waiting for session xxxxxxxxx to get into ready status...
Session xxxxxxxxx has been created
Exception encountered while running statement: An error occurred (EntityNotFoundException) when calling the GetStatement operation: Session ID xxxxxxxxx not found

If you enabled the email feature, you should also get an email notification.

You can also check on the AWS Glue console that your session ID isn’t listed.
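
You can also confirm programmatically that the session is no longer running. The following is a small, hypothetical check with boto3; the session ID is a placeholder:

import boto3

glue = boto3.client("glue")
# Expect a STOPPING/STOPPED status rather than READY for an out-of-boundary session
response = glue.get_session(Id="your-session-id")
print(response["Session"]["Status"])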

Clean up

Clean up the deployed resources by running the following command:

make clean-up

Note that the resources deployed from following the recommended post, Introducing AWS Glue interactive sessions for Jupyter, will not be removed with the previous command.

Limitations

The delivery guarantee for CloudTrail events to EventBridge is best effort. This means CloudTrail attempts to deliver all events to EventBridge, but in some rare cases an event might not be delivered. For more information, refer to Events from AWS services.

Conclusion

This post described how to build, deploy, and test a solution that enforces boundary conditions on AWS Glue interactive sessions, constraining the number of workers, the idle timeout, and the AWS Glue connection.

You can adapt this solution based on your needs and further extend it to allow controls on other options.

To learn more about how to use AWS Glue interactive sessions, refer to Introducing AWS Glue interactive sessions for Jupyter and Author AWS Glue jobs with PyCharm using AWS Glue interactive sessions.


About the Authors

Nicolas Jacob Baer is a Senior Cloud Application Architect with a strong focus on data engineering and machine learning, based in Switzerland. He works closely with enterprise customers to design data platforms and build advanced analytics/ML use cases.

Luca Mazzaferro is a Senior DevOps Architect at Amazon Web Services. He likes to have infrastructure automated, reproducible and secured. In his free time he likes to cook, especially pizza.

Kemeng Zhang is a Cloud Application Architect with a strong focus on machine learning and UX, based in Switzerland. She works closely with customers to design user experiences and build advanced analytics/ML use cases.

Mark Walser, a Senior Global Data Architect at Amazon Web Services, collaborates with customers to develop innovative Big Data solutions that solve business problems and speed up the adoption of AWS services. Outside of work, he finds pleasure in running, swimming, and all things related to technology.

Gal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering, and BI, based in California. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products.

Get started managing partitions for Amazon S3 tables backed by the AWS Glue Data Catalog

Post Syndicated from Anderson dos Santos original https://aws.amazon.com/blogs/big-data/get-started-managing-partitions-for-amazon-s3-tables-backed-by-the-aws-glue-data-catalog/

Large organizations processing huge volumes of data usually store it in Amazon Simple Storage Service (Amazon S3) and query the data to make data-driven business decisions using distributed analytics engines such as Amazon Athena. If you simply run queries without considering the optimal data layout on Amazon S3, it results in a high volume of data scanned, long-running queries, and increased cost.

Partitioning is a common technique to lay out your data optimally for distributed analytics engines. By partitioning your data, you can restrict the amount of data scanned by downstream analytics engines, thereby improving performance and reducing the cost for queries.

In this post, we cover the following topics related to Amazon S3 data partitioning:

  • Understanding table metadata in the AWS Glue Data Catalog and S3 partitions for better performance
  • How to create a table and load partitions in the Data Catalog using Athena
  • How partitions are stored in the table
  • Different ways to add partitions in a table on the Data Catalog
  • Partitioning data stored in Amazon S3 during ingestion and cataloging

Understanding table metadata in the Data Catalog and S3 partitions for better performance

A table in the AWS Glue Data Catalog is the metadata definition that organizes the data location, data type, and column schema, which represents the data in a data store. Partitions are data organized hierarchically, defining the location where the data for a particular partition resides. Partitioning your data allows you to limit the amount of data scanned by S3 SELECT, thereby improving performance and reducing cost.

There are a few factors to consider when deciding which columns to partition on. For example, if you’re using columns as filters, avoid a column that partitions the data too finely, and avoid a column where the data is heavily skewed toward one partition value. You can partition your data by any column. Partition columns are usually designed around a common query pattern in your use case. For example, a common practice is to partition the data by year/month/day because many queries tend to run time series analyses in typical use cases. This often leads to a multi-level partitioning scheme. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns.

Let’s look at an example of how partitioning works.

Files corresponding to a single day’s worth of data are placed under a prefix such as s3://my_bucket/logs/year=2023/month=06/day=01/.

If your data is partitioned per day, every day you have a single file, such as the following:

  • s3://my_bucket/logs/year=2023/month=06/day=01/file1_example.json
  • s3://my_bucket/logs/year=2023/month=06/day=02/file2_example.json
  • s3://my_bucket/logs/year=2023/month=06/day=03/file3_example.json

We can use a WHERE clause to query the data as follows:

SELECT * FROM table WHERE year=2023 AND month=06 AND day=01

The preceding query reads only the data inside the partition folder year=2023/month=06/day=01 instead of scanning through the files under all partitions. Therefore, it only scans the file file1_example.json.

Systems such as Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by value, eliminating unnecessary (partition) requests to Amazon S3. This capability can improve the performance of applications that specifically need to read a limited number of partitions. For more information about partitioning with Athena and Redshift Spectrum, refer to Partitioning data in Athena and Creating external tables for Redshift Spectrum, respectively.

How to create a table and load partitions in the Data Catalog using Athena

Let’s begin by understanding how to create a table and load partitions using DDL (Data Definition Language) queries in Athena. Note that to demonstrate the various methods of loading partitions into the table, we need to delete and recreate the table multiple times throughout the following steps.

First, we create a database for this demo.

  1. On the Athena console, choose Query editor.

If this is your first time using the Athena query editor, you need to configure and specify an S3 bucket to store the query results.

  2. Create a database with the following command:
CREATE DATABASE partitions_blog;

  3. In the Data pane, for Database, choose the database partitions_blog.
  4. Create the table impressions following the example in Hive JSON SerDe. Replace <myregion> in s3://<myregion>.elasticmapreduce/samples/hive-ads/tables/impressions with the Region identifier where you run Athena (for example, s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions).
  5. Run the following query to create the table:
CREATE EXTERNAL TABLE impressions (
    requestbegintime string,
    adid string,
    impressionid string,
    referrer string,
    useragent string,
    usercookie string,
    ip string,
    number string,
    processid string,
    browsercookie string,
    requestendtime string,
    timers struct
                <
                 modellookup:string, 
                 requesttime:string
                >,
    threadid string, 
    hostname string,
    sessionid string
)   
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions';

The following screenshot shows the query in the query editor.

  6. Run the following query to review the data:
SELECT * FROM impressions;

You can’t see any results because the partitions aren’t loaded yet.

If the partition isn’t loaded into a partitioned table, when the application downloads the partition metadata, the application will not be aware of the S3 path that needs to be queried. For more information, refer to Why do I get zero records when I query my Amazon Athena table.

  7. Load the partitions using the command MSCK REPAIR TABLE.

The MSCK REPAIR TABLE command was designed to manually add partitions that are added to or removed from the file system, such as HDFS or Amazon S3, but are not present in the metastore.

  8. Query the table again to see the results.

After the MSCK REPAIR TABLE command scans Amazon S3 and adds the Hive-compatible partitions it finds to AWS Glue, the records under the registered partitions are returned.
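
If you prefer to load the partitions programmatically rather than in the console, a minimal boto3 sketch that submits the same DDL through the Athena API could look like the following; the query results bucket is a placeholder:

import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE impressions;",
    QueryExecutionContext={"Database": "partitions_blog"},
    ResultConfiguration={"OutputLocation": "s3://YOUR-QUERY-RESULTS-BUCKET/"},
)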

How partitions are stored in the table metadata

We can list the table partitions in Athena by running the SHOW PARTITIONS command, as shown in the following screenshot.

We also can see the partition metadata on the AWS Glue console. Complete the following steps:

  1. On the AWS Glue console, choose Tables in the navigation pane under Data Catalog.
  2. Choose the impressions table in the partitions_blog database.
  3. On the Partitions tab, choose View Properties next to a partition to view its details.

The following screenshot shows an example of the partition properties.

We can also get the partitions using the AWS Command Line Interface (AWS CLI) command get-partitions, as shown in the following screenshot.

In the get-partitions output, the element “Values” holds the partition value and “Location” defines the S3 path to be queried by the application:

"Values": [
    "2009-04-12-19-05"
]

When querying the data from the partition dt="2009-04-12-19-05", the application lists and reads only the files in the S3 path s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-04-12-19-05.
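
The same partition metadata can be retrieved with the AWS SDK. The following boto3 sketch prints the values and locations of all registered partitions:

import boto3

glue = boto3.client("glue")
paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName="partitions_blog", TableName="impressions"):
    for partition in page["Partitions"]:
        print(partition["Values"], partition["StorageDescriptor"]["Location"])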

Different ways to add partitions in a table on the Data Catalog

There are multiple ways to load partitions into the table. You can create tables and partitions directly using the AWS Glue API, the SDKs, the AWS CLI, DDL queries in Athena, AWS Glue crawlers, or AWS Glue ETL jobs.

For the next examples, we need to drop and recreate the table. Run the following command in the Athena query editor:

DROP table impressions;

After that, recreate the table:

CREATE EXTERNAL TABLE impressions (
    requestbegintime string,
    adid string,
    impressionid string,
    referrer string,
    useragent string,
    usercookie string,
    ip string,
    number string,
    processid string,
    browsercookie string,
    requestendtime string,
    timers struct
                <
                 modellookup:string, 
                 requesttime:string
                >,
    threadid string, 
    hostname string,
    sessionid string
)   
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions';

Creating partitions individually

If the data arrives in an S3 bucket at a scheduled time, for example every hour or once a day, you can individually add partitions. One way of doing so is by running an ALTER TABLE ADD PARTITION DDL query on Athena.

We use Athena for this query as an example. You can do the same from Hive on Amazon EMR, Spark on Amazon EMR, AWS Glue for Apache Spark jobs, and more.

To load partitions using Athena, we need to use the ALTER TABLE ADD PARTITION command, which can create one or more partitions in the table. ALTER TABLE ADD PARTITION supports partitions created on Amazon S3 with camel case (s3://bucket/table/dayOfTheYear=20), Hive format (s3://bucket/table/dayoftheyear=20), and non-Hive style partitioning schemes used by AWS CloudTrail logs, which use separate path components for date parts, such as s3://bucket/data/2021/01/26/us/6fc7845e.json.

To load partitions into a table, you can run the following query in the Athena query editor:

ALTER TABLE impressions 
  ADD PARTITION (dt = '2009-04-12-19-05');


Refer to ALTER TABLE ADD PARTITION for more information.

Another option is using the AWS Glue APIs. AWS Glue provides two APIs to load partitions into a table: create_partition() and batch_create_partition(). For the API parameters, refer to CreatePartition.

The following example uses the AWS CLI:

aws glue create-partition \
    --database-name partitions_blog \
    --table-name impressions \
    --partition-input '{
                            "Values":["2009-04-14-13-00"],
                            "StorageDescriptor":{
                                "Location":"s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-04-14-13-00",
                                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                                "SerdeInfo": {
                                    "SerializationLibrary": "org.apache.hive.hcatalog.data.JsonSerDe"
                                }
                            }
                        }'

Both commands (ALTER TABLE in Athena and the AWS Glue create-partition API) register partitions that inherit their schema from the table definition.
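
To register several partitions in one request, you can use batch_create_partition. The following boto3 sketch is illustrative only; the partition values are examples:

import boto3

glue = boto3.client("glue")
base_path = "s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions"
partition_values = ["2009-04-14-13-00", "2009-04-14-13-05"]

glue.batch_create_partition(
    DatabaseName="partitions_blog",
    TableName="impressions",
    PartitionInputList=[
        {
            "Values": [value],
            "StorageDescriptor": {
                "Location": f"{base_path}/dt={value}",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hive.hcatalog.data.JsonSerDe"
                },
            },
        }
        for value in partition_values
    ],
)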

Load multiple partitions using MSCK REPAIR TABLE

You can load multiple partitions at once in Athena. MSCK REPAIR TABLE is a DDL statement that scans the entire S3 path defined in the table’s Location property. Athena lists the S3 path searching for Hive-compatible partitions, then loads the existing partitions into the AWS Glue table’s metadata. Before it can run, the table must exist in the Data Catalog and its data source must be Amazon S3. You can create the table with the AWS Glue APIs or by running a CREATE TABLE statement in Athena. After the table creation, run MSCK REPAIR TABLE to load the partitions.

The parameter DDL query timeout in the service quotas defines how long a DDL statement can run. The runtime increases according to the number of folders or partitions in the S3 path.

The MSCK REPAIR TABLE command is best used when creating a table for the first time or when there is uncertainty about parity between data and partition metadata. It supports folders created in lowercase and using Hive-style partitions format (for example, year=2023/month=6/day=01). Because MSCK REPAIR TABLE scans both the folder and its subfolders to find a matching partition scheme, you should keep data for separate tables in separate folder hierarchies.

Every MSCK REPAIR TABLE command lists the entire folder specified in the table location. If you add new partitions frequently (for example, every 5 minutes or every hour), consider scheduling an ALTER TABLE ADD PARTITION statement to load only the partitions defined in the statement instead of scanning the entire S3 path.

The partitions created in the Data Catalog by MSCK REPAIR TABLE inherit their schema from the table definition. Note that Athena doesn’t charge for DDL statements, making MSCK REPAIR TABLE a straightforward and affordable way to load partitions.

Add multiple partitions using an AWS Glue crawler

An AWS Glue crawler offers more features when loading partitions into the table. A crawler automatically identifies partitions in Amazon S3, extracts metadata, and creates table definitions in the Data Catalog. Crawlers can crawl a variety of file-based and table-based data stores.

Crawlers can help automate table creation and loading partitions into tables. They are charged at an hourly rate, billed per second. You can optimize a crawler’s performance by altering parameters such as the sample size or by configuring it to crawl new folders only.

If the schema of the data changes, the crawler updates the table and partition schemas accordingly. The crawler configuration options include settings such as “Update the table definition in the Data Catalog”, “Add new columns only”, and “Ignore the change and don’t update the table in the Data Catalog”, which tell the crawler how to update the table when needed and how to evolve the table schema.

Crawlers can create and update multiple tables from the same data source. When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure and which directories are partitions for the table.

To create an AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane under Data Catalog.
  2. Choose Create crawler.
  3. Provide a name and optional description, then choose Next.
  4. Under Data source configuration, select Not yet and choose Add a data source.
  5. For Data source, choose S3.
  6. For S3 path, enter the path of the impression data (s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions).
  7. Select a preference for subsequent crawler runs.
  8. Choose Add an S3 data source.
  9. Select your data source and choose Next.
  10. Under IAM role, either choose an existing AWS Identity and Access Management (IAM) role or choose Create new IAM role.
  11. Choose Next.
  12. For Target database, choose partitions_blog.
  13. For Table name prefix, enter crawler_.

We use the table prefix to add a custom prefix in front of the table name. For example, if you leave the prefix field empty and start the crawler on s3://my-bucket/some-table-backup, it creates a table with the name some-table-backup. If you add crawler_ as a prefix, it creates a table called crawler_some-table-backup.

  14. Choose your crawler schedule, then choose Next.
  15. Review your settings and create the crawler.
  16. Select your crawler and choose Run.

Wait for the crawler to finish running.

You can go to Athena and check that the table was created:

SHOW PARTITIONS crawler_impressions;
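
If you prefer to create and start the crawler programmatically instead of using the console, a minimal boto3 sketch might look like the following; the role ARN is a placeholder and the crawler name is arbitrary:

import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="impressions_crawler",
    Role="arn:aws:iam::123456789012:role/YOUR-GLUE-CRAWLER-ROLE",
    DatabaseName="partitions_blog",
    TablePrefix="crawler_",
    Targets={
        "S3Targets": [
            {"Path": "s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions"}
        ]
    },
)
glue.start_crawler(Name="impressions_crawler")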

Partitioning data stored in Amazon S3 during ingestion and cataloging

The previous examples work with data that already exists in Amazon S3. If you’re using AWS Glue jobs to write data on Amazon S3, you have the option to create partitions with DynamicFrames by enabling the “enableUpdateCatalog=True” parameter. Refer to Creating tables, updating the schema, and adding new partitions in the Data Catalog from AWS Glue ETL jobs for more information.

DynamicFrame supports native partitioning using a sequence of keys, using the partitionKeys option when you create a sink. For example, the following Python code writes out a dataset to Amazon S3 in Parquet format into directories partitioned by the ‘year’ field. After ingesting the data and registering partitions from the AWS Glue job, you can utilize these partitions from queries running on other analytics engines such as Athena.

## Create a partitioned table in the Glue Data Catalog using DynamicFrame
## (the setup below is the standard AWS Glue job boilerplate, added for completeness)

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source dataset from the Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="default",
    table_name="flight_delays_pq",
    transformation_ctx="datasource0")

# Create an S3 sink that writes Parquet partitioned by 'year' and updates the Data Catalog
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://BUCKET/glueetl/",
    enableUpdateCatalog=True,
    partitionKeys=["year"])

sink.setFormat("parquet", useGlueParquetWriter=True)
sink.setCatalogInfo(catalogDatabase="default", catalogTableName="test_table")

# Write the data, creating the table and adding partitions in the Data Catalog
sink.writeFrame(datasource0)
job.commit()

Conclusion

This post showed multiple methods for partitioning your Amazon S3 data, which helps reduce costs by avoiding unnecessary data scanning and also improves the overall performance of your processes. We further described how AWS Glue makes effective metadata management for partitions possible, allowing you to optimize your storage and query operations in AWS Glue and Athena. These partitioning methods can help optimize scanning high volumes of data or long-running queries, as well as reduce the cost of scanning.

We hope you try out these options!


About the authors

Anderson Santos is a Senior Solutions Architect at Amazon Web Services. He works with AWS Enterprise customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS.

Arun Pradeep Selvaraj is a Senior Solutions Architect and is part of Analytics TFC at AWS. Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build and reinvent. He is creative, fast-paced, deeply customer-obsessed and leverages the working backwards process to build modern architectures to help customers solve their unique challenges.

Patrick Muller is a Senior Solutions Architect and a valued member of the Datalab. With over 20 years of expertise in analytics, data warehousing, and distributed systems, he brings extensive knowledge to the table. Patrick’s passion lies in evaluating new technologies and assisting customers with innovative solutions. During his free time, he enjoys watching soccer.

Build an Amazon Redshift data warehouse using an Amazon DynamoDB single-table design

Post Syndicated from Altaf Hussain original https://aws.amazon.com/blogs/big-data/build-an-amazon-redshift-data-warehouse-using-an-amazon-dynamodb-single-table-design/

Amazon DynamoDB is a fully managed NoSQL service that delivers single-digit millisecond performance at any scale. It’s used by thousands of customers for mission-critical workloads. Typical use cases for DynamoDB are an ecommerce application handling a high volume of transactions, or a gaming application that needs to maintain scorecards for players and games. In traditional databases, we would model such applications using a normalized data model (entity-relation diagram). This approach comes with a heavy computational cost in terms of processing and distributing the data across multiple tables while ensuring the system is ACID-compliant at all times, which can negatively impact performance and scalability. If these entities are frequently queried together, it makes sense to store them in a single table in DynamoDB. This is the concept of single-table design. Storing different types of data in a single table allows you to retrieve multiple, heterogeneous item types using a single request. Such requests are relatively straightforward, and usually take the following form:

SELECT * FROM TABLE WHERE Some_Attribute = 'some_value'

In this format, some_attribute is a partition key or part of an index.

Nonetheless, many of the same customers using DynamoDB would also like to be able to perform aggregations and ad hoc queries against their data to measure important KPIs that are pertinent to their business. Suppose we have a successful ecommerce application handling a high volume of sales transactions in DynamoDB. A typical ask for this data may be to identify sales trends as well as sales growth on a yearly, monthly, or even daily basis. These types of queries require complex aggregations over a large number of records. A key pillar of AWS’s modern data strategy is the use of purpose-built data stores for specific use cases to achieve performance, cost, and scale. Deriving business insights by identifying year-on-year sales growth is an example of an online analytical processing (OLAP) query. These types of queries are suited for a data warehouse.

The goal of a data warehouse is to enable businesses to analyze their data fast; this is important because it means they are able to gain valuable insights in a timely manner. Amazon Redshift is a fully managed, scalable cloud data warehouse. Building a performant data warehouse is non-trivial because the data needs to be highly curated to serve as a reliable and accurate version of the truth.

In this post, we walk through the process of exporting data from a DynamoDB table to Amazon Redshift. We discuss data model design for both NoSQL databases and SQL data warehouses. We begin with a single-table design as an initial state and build a scalable batch extract, load, and transform (ELT) pipeline to restructure the data into a dimensional model for OLAP workloads.

DynamoDB table example

We use an example of a successful ecommerce store allowing registered users to order products from their website. A simple ERD (entity-relationship diagram) for this application will have four distinct entities: customers, addresses, orders, and products. For customers, we have information such as their unique user name and email address; for the address entity, we have one or more customer addresses. Orders contain information regarding the order placed, and the products entity provides information about the products placed in an order. As we can see from the following diagram, a customer can place one or more orders, and an order must contain one or more products.

We could store each entity in a separate table in DynamoDB. However, there is no way to retrieve customer details alongside all the orders placed by the customer without making multiple requests to the customer and order tables. This is inefficient from both a cost and performance perspective. A key goal for any efficient application is to retrieve all the required information in a single query request. This ensures fast, consistent performance. So how can we remodel our data to avoid making multiple requests? One option is to use single-table design. Taking advantage of the schema-less nature of DynamoDB, we can store different types of records in a single table in order to handle different access patterns in a single request. We can go further still and store different types of values in the same attribute and use it as a global secondary index (GSI). This is called index overloading.

A typical access pattern we may want to handle in our single table design is to get customer details and all orders placed by the customer.

To accommodate this access pattern, our single-table design looks like the following example.

By restricting the number of addresses associated with a customer, we can store address details as a complex attribute (rather than a separate item) without exceeding the 400 KB item size limit of DynamoDB.

We can add a global secondary index (GSIpk and GSIsk) to capture another access pattern: get order details and all product items placed in an order. We use the following table.

We have used generic attribute names, PK and SK, for our partition key and sort key columns. This is because they hold data from different entities. Furthermore, the values in these columns are prefixed by generic terms such as CUST# and ORD# to help us identify the type of data we have and ensure that the value in PK is unique across all records in the table.
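
To make the access pattern concrete, the following boto3 sketch queries a customer’s item collection in a single request. The table name, key values, and prefix convention are assumptions based on the design described above:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ecommerce-single-table")

# One request returns the customer item plus the ORD# items stored under the same partition key
response = table.query(KeyConditionExpression=Key("PK").eq("CUST#john_doe"))
for item in response["Items"]:
    print(item["PK"], item["SK"])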

A well-designed single table will not only reduce the number of requests for an access pattern, but will service many different access patterns. The challenge comes when we need to ask more complex questions of our data, for example, what was the year-on-year quarterly sales growth by product broken down by country?

The case for a data warehouse

A data warehouse is ideally suited to answer OLAP queries. Built on highly curated structured data, it provides the flexibility and speed to run aggregations across an entire dataset to derive insights.

To house our data, we need to define a data model. An optimal design choice is to use a dimensional model. A dimensional model consists of fact tables and dimension tables. Fact tables store the numeric information about business measures and foreign keys to the dimension tables. Dimension tables store descriptive information about the business facts to help understand and analyze the data better. From a business perspective, a dimensional model with its use of facts and dimensions can present complex business processes in a simple-to-understand manner.

Building a dimensional model

A dimensional model optimizes read performance through efficient joins and filters. Amazon Redshift automatically chooses the best distribution style and sort key based on workload patterns. We build a dimensional model from the single DynamoDB table based on the following star schema.

We have separated each item type into individual tables. We have a single fact table (Orders) containing the business measures price and numberofitems, and foreign keys to the dimension tables. By storing the price of each product in the fact table, we can track price fluctuations in the fact table without continually updating the product dimension. (In a similar vein, the DynamoDB attribute amount is a simple derived measure in our star schema: amount is the summation of product prices per orderid).

By splitting the descriptive content of our single DynamoDB table into multiple Amazon Redshift dimension tables, we can remove redundancy by only holding in each dimension the information pertinent to it. This allows us the flexibility to query the data under different contexts; for example, we may want to know the frequency of customer orders by city or product sales by date. The ability to freely join dimensions and facts when analyzing the data is one of the key benefits of dimensional modeling. It’s also good practice to have a Date dimension to allow us to perform time-based analysis by aggregating the fact by year, month, quarter, and so forth.

This dimensional model will be built in Amazon Redshift. When setting out to build a data warehouse, it’s a common pattern to have a data lake as the source of the data warehouse. The data lake in this context serves a number of important functions:

  • It acts as a central source for multiple applications, not just exclusively for data warehousing purposes. For example, the same dataset could be used to build machine learning (ML) models to identify trends and predict sales.
  • It can store data as is, be it unstructured, semi-structured, or structured. This allows you to explore and analyze the data without committing upfront to what the structure of the data should be.
  • It can be used to offload historical or less-frequently-accessed data, allowing you to manage your compute and storage costs more effectively. In our analytic use case, if we are analyzing quarterly growth rates, we may only need a couple of years’ worth of data; the rest can be unloaded into the data lake.

When querying a data lake, we need to consider user access patterns in order to reduce costs and optimize query performance. This is achieved by partitioning the data. The choice of partition keys will depend on how you query the data. For example, if you query the data by customer or country, then they are good candidates for partition keys; if you query by date, then a date hierarchy can be used to partition the data.

After the data is partitioned, we want to ensure it’s held in the right format for optimal query performance. The recommended choice is to use a columnar format such as Parquet or ORC. Such formats are compressed and store data column-wise, allowing for fast retrieval times, and are parallelizable, allowing for fast load times when moving the data into Amazon Redshift. In our use case, it makes sense to store the data in a data lake with minimal transformation and formatting to enable easy querying and exploration of the dataset. We partition the data by item type (Customer, Order, Product, and so on), and because we want to easily query each entity in order to move the data into our data warehouse, we transform the data into the Parquet format.

Solution overview

The following diagram illustrates the data flow to export data from a DynamoDB table to a data warehouse.

We present a batch ELT solution using AWS Glue for exporting data stored in DynamoDB to an Amazon Simple Storage Service (Amazon S3) data lake and then a data warehouse built in Amazon Redshift. AWS Glue is a fully managed extract, transform, and load (ETL) service that allows you to organize, cleanse, validate, and format data for storage in a data warehouse or data lake.

The solution workflow has the following steps:

  1. Move any existing files from the raw and data lake buckets into corresponding archive buckets to ensure any fresh export from DynamoDB to Amazon S3 isn’t duplicating data.
  2. Begin a new DynamoDB export to the S3 raw layer.
  3. From the raw files, create a data lake partitioned by item type.
  4. Load the data from the data lake to landing tables in Amazon Redshift.
  5. After the data is loaded, we take advantage of the distributed compute capability of Amazon Redshift to transform the data into our dimensional model and populate the data warehouse.

We orchestrate the pipeline using an AWS Step Functions workflow and schedule a daily batch run using Amazon EventBridge.

For simpler DynamoDB table structures, you may consider skipping some of these steps by either loading data directly from DynamoDB to Amazon Redshift or by using the Amazon Redshift auto-copy feature or COPY command to load data from Amazon S3.

Prerequisites

You must have an AWS account with a user who has programmatic access. For setup instructions, refer to AWS security credentials.

Use the AWS CloudFormation template cf_template_ddb-dwh-blog.yaml to launch the following resources:

  • A DynamoDB table with a GSI and point-in-time recovery enabled.
  • An Amazon Redshift cluster (we use two nodes of RA3.4xlarge).
  • Three AWS Glue Data Catalog databases: raw, datalake, and redshift.
  • Five S3 buckets: two for the raw and data lake files, two for their respective archives, and one for the Amazon Athena query results.
  • Two AWS Identity and Access Management (IAM) roles: An AWS Glue role and a Step Functions role with the requisite permissions and access to resources.
  • A JDBC connection to Amazon Redshift.
  • An AWS Lambda function to retrieve the s3-prefix-list-id for your Region. This is required to allow traffic from a VPC to access an AWS service through a gateway VPC endpoint.
  • Download the following files to perform the ELT:
    • The Python script to load sample data into our DynamoDB table: load_dynamodb.py.
    • The AWS Glue Python Spark script to archive the raw and data lake files: archive_job.py.
    • The AWS Glue Spark scripts to extract and load the data from DynamoDB to Amazon Redshift: GlueSparkJobs.zip.
    • The DDL and DML SQL scripts to create the tables and load the data into the data warehouse in Amazon Redshift: SQL Scripts.zip.

Launch the CloudFormation template

AWS CloudFormation allows you to model, provision, and scale your AWS resources by treating infrastructure as code. We use the downloaded CloudFormation template to create a stack (with new resources).

  1. On the AWS CloudFormation console, create a new stack and select Template is ready.
  2. Upload the template and choose Next.

  3. Enter a name for your stack.
  4. For MasterUserPassword, enter a password.
  5. Optionally, replace the default names for the Amazon Redshift database, DynamoDB table, and MasterUsername (in case these names are already in use).
  6. Review the details and acknowledge that AWS CloudFormation may create IAM resources on your behalf.
  7. Choose Create stack.

Load sample data into a DynamoDB table

To load your sample data into DynamoDB, complete the following steps:

  1. Create an AWS Cloud9 environment with default settings.
  2. Upload the DynamoDB load Python script (load_dynamodb.py). From the AWS Cloud9 terminal, use the pip install command to install the following packages:
    1. boto3
    2. faker
    3. faker_commerce
    4. numpy
  3. In the Python script, replace all placeholders (capital letters) with the appropriate values and run the following command in the terminal:
python load_dynamodb.py

This command loads the sample data into our single DynamoDB table.

Extract data from DynamoDB

To extract the data from DynamoDB to our S3 data lake, we use the new AWS Glue DynamoDB export connector. Unlike the old connector, the new version uses a snapshot of the DynamoDB table and doesn’t consume read capacity units of your source DynamoDB table. For large DynamoDB tables exceeding 100 GB, the read performance of the new AWS Glue DynamoDB export connector is not only consistent but also significantly faster than the previous version.

To use this new export connector, you need to enable point-in-time recovery (PITR) for the source DynamoDB table in advance. This will take continuous backups of the source table (so be mindful of cost) and ensures that each time the connector invokes an export, the data is fresh. The time it takes to complete an export depends on the size of your table and how uniformly the data is distributed therein. This can range from a few minutes for small tables (up to 10 GiB) to a few hours for larger tables (up to a few terabytes). This is not a concern for our use case because data lakes and data warehouses are typically used to aggregate data at scale and generate daily, weekly, or monthly reports. It’s also worth noting that each export is a full refresh of the data, so in order to build a scalable automated data pipeline, we need to archive the existing files before beginning a fresh export from DynamoDB.
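
For reference, reading through the export connector inside a Glue job might look like the following sketch; the table ARN and staging bucket are placeholders, and dynamodb.unnestDDBJson flattens the exported DynamoDB JSON:

import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Snapshot-based export that doesn't consume read capacity from the source table
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": "arn:aws:dynamodb:us-east-1:123456789012:table/YOUR-TABLE",
        "dynamodb.s3.bucket": "YOUR-EXPORT-STAGING-BUCKET",
        "dynamodb.s3.prefix": "ddb-export/",
        "dynamodb.unnestDDBJson": True,
    },
)
print(dyf.count())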

Complete the following steps:

  1. Create an AWS Glue job using the Spark script editor.
  2. Upload the archive_job.py file from GlueSparkJobs.zip.

This job archives the data files into timestamped folders. We run the job concurrently to archive the raw files and the data lake files.

  3. In the Job details section, give the job a name and choose the AWS Glue IAM role created by our CloudFormation template.
  4. Keep all other defaults the same and ensure maximum concurrency is set to 2 (under Advanced properties).

Archiving the files provides a backup option in the event of disaster recovery. As such, we can assume that the files will not be accessed frequently and can be kept in the S3 Standard-IA storage class to save up to 40% on costs while providing rapid access to the files when needed.

This job typically runs before each export of data from DynamoDB. After the datasets have been archived, we’re ready to (re)-export the data from our DynamoDB table.

We can use AWS Glue Studio to visually create the jobs needed to extract the data from DynamoDB and load into our Amazon Redshift data warehouse. We demonstrate how to do this by creating an AWS Glue job (called ddb_export_raw_job) using AWS Glue Studio.

  1. In AWS Glue Studio, create a job and select Visual with a blank canvas.
  2. Choose Amazon DynamoDB as the data source.

  3. Choose our DynamoDB table to export from.
  4. Leave all other options as is and finish setting up the source connection.

We then choose Amazon S3 as our target. In the target properties, we can transform the output to a suitable format, apply compression, and specify the S3 location to store our raw data.

  1. Set the following options:
    1. For Format, choose Parquet.
    2. For Compression type, choose Snappy.
    3. For S3 Target Location, enter the path for RawBucket (located on the Outputs tab of the CloudFormation stack).
    4. For Database, choose the value for GlueRawDatabase (from the CloudFormation stack output).
    5. For Table name, enter an appropriate name.

  2. Because our target data warehouse requires data to be in a flat structure, verify that the configuration option dynamodb.unnestDDBJson is set to True on the Script tab.

  3. On the Job details tab, choose the AWS Glue IAM role generated by the CloudFormation template.
  4. Save and run the job.

Depending on the data volumes being exported, this job may take a few minutes to complete.

Because we’ll be adding the table to our AWS Glue Data Catalog, we can explore the output using Athena after the job is complete. Athena is a serverless interactive query service that makes it simple to analyze data directly in Amazon S3 using standard SQL.

  1. In the Athena query editor, choose the raw database.

We can see that the attributes of the Address structure have been unnested and added as additional columns to the table.

  2. After we export the data into the raw bucket, create another job (called raw_to_datalake_job) using AWS Glue Studio (select Visual with a blank canvas) to load the data lake partitioned by item type (customer, order, and product).
  3. Set the source as the AWS Glue Data Catalog raw database and table.

  4. In the ApplyMapping transformation, drop the Address struct because we have already unnested these attributes into our flattened raw table.

  5. Set the target as our S3 data lake.

  6. Choose the AWS Glue IAM role in the job details, then save and run the job.

Now that we have our data lake, we’re ready to build our data warehouse.

Build the dimensional model in Amazon Redshift

The CloudFormation template launches a two-node RA3.4xlarge Amazon Redshift cluster. To build the dimensional model, complete the following steps:

  1. In Amazon Redshift Query Editor V2, connect to your database (default: salesdwh) within the cluster using the database user name and password authentication (MasterUserName and MasterUserPassword from the CloudFormation template).
  2. You may be asked to configure your account if this is your first time using Query Editor V2.
  3. Download the SQL scripts SQL Scripts.zip to create the following schemas and tables (run the scripts in numbered sequence).

In the landing schema:

  • address
  • customer
  • order
  • product

In the staging schema:

  • staging.address
  • staging.address_maxkey
  • staging.addresskey
  • staging.customer
  • staging.customer_maxkey
  • staging.customerkey
  • staging.date
  • staging.date_maxkey
  • staging.datekey
  • staging.order
  • staging.order_maxkey
  • staging.orderkey
  • staging.product
  • staging.product_maxkey
  • staging.productkey

In the dwh schema:

  • dwh.address
  • dwh.customer
  • dwh.order
  • dwh.product

We load the data from our data lake to the landing schema as is.

  1. Use the JDBC connector to Amazon Redshift to build an AWS Glue crawler to add the landing schema to our Data Catalog under the ddb_redshift database.

  2. Create an AWS Glue crawler with the JDBC data source.

  3. Select the JDBC connection you created and choose Next.

  4. Choose the IAM role created by the CloudFormation template and choose Next.

  5. Review your settings before creating the crawler.

The crawler adds the four landing tables in our AWS Glue database ddb_redshift.

  1. In AWS Glue Studio, create four AWS Glue jobs to load the landing tables (these scripts are available to download, and you can use the Spark script editor to upload these scripts individually to create the jobs):
    1. land_order_job
    2. land_product_job
    3. land_customer_job
    4. land_address_job

Each job has the structure as shown in the following screenshot.

  2. Filter the S3 source on the partition column type:
    1. For product, filter on type=‘product’.
    2. For order, filter on type=‘order’.
    3. For customer and address, filter on type=‘customer’.

  3. Set the target for the data flow as the corresponding table in the landing schema in Amazon Redshift.
  4. Use the built-in ApplyMapping transformation in our data pipeline to drop columns and, where necessary, convert the data types to match the target columns (a sketch of this transform follows the mappings list).

For more information about built-in transforms available in AWS Glue, refer to AWS Glue PySpark transforms reference.

The mappings for our four jobs are as follows:

  • land_order_job:
    mappings=[
    ("pk", "string", "pk", "string"),
    ("orderid", "string", "orderid", "string"),
    ("numberofitems", "string", "numberofitems", "int"),
    ("orderdate", "string", "orderdate", "timestamp"),
    ]

  • land_product_job:
    mappings=[
    ("orderid", "string", "orderid", "string"),
    ("category", "string", "category", "string"),
    ("price", "string", "price", "decimal"),
    ("productname", "string", "productname", "string"),
    ("productid", "string", "productid", "string"),
    ("color", "string", "color", "string"),
    ]

  • land_address_job:
    mappings=[
    ("username", "string", "username", "string"),
    ("email", "string", "email", "string"),
    ("fullname", "string", "fullname", "string"),
    ]

  • land_customer_job:
    mappings=[
    ("username", "string", "username", "string"),
    ("email", "string", "email", "string"),
    ("fullname", "string", "fullname", "string"),
    ]
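
As an example of how one of these mappings is applied inside a generated job script, the following sketch shows the land_order_job transform; the data lake database and table names are assumptions:

import sys

from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read only the order items from the partitioned data lake (names are placeholders)
datalake_orders = glueContext.create_dynamic_frame.from_catalog(
    database="datalake",
    table_name="YOUR_DATALAKE_TABLE",
    push_down_predicate="type='order'",
)

mapped_orders = ApplyMapping.apply(
    frame=datalake_orders,
    mappings=[
        ("pk", "string", "pk", "string"),
        ("orderid", "string", "orderid", "string"),
        ("numberofitems", "string", "numberofitems", "int"),
        ("orderdate", "string", "orderdate", "timestamp"),
    ],
    transformation_ctx="mapped_orders",
)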

  5. Choose the AWS Glue IAM role, and under Advanced properties, verify the JDBC connector to Amazon Redshift as a connection.
  6. Save and run each job to load the landing tables in Amazon Redshift.

Populate the data warehouse

From the landing schema, we move the data to the staging layer and apply the necessary transformations. Our dimensional model has a single fact table, the orders table, which is the largest table and as such needs a distribution key. The choice of key depends on how the data is queried and the size of the dimension tables being joined to. If you’re unsure of your query patterns, you can leave the distribution keys and sort keys for your tables unspecified. Amazon Redshift automatically assigns the correct distribution and sort keys based on your queries. This has the advantage that if and when query patterns change over time, Amazon Redshift can automatically update the keys to reflect the change in usage.

In the staging schema, we keep track of existing records based on their business key (the unique identifier for the record). We create key tables to generate a numeric identity column for each table based on the business key. These key tables allow us to implement an incremental transformation of the data into our dimensional model.

CREATE TABLE IF NOT EXISTS staging.productkey ( 
    productkey integer identity(1,1), 
    productid character varying(16383), 
    CONSTRAINT products_pkey PRIMARY KEY(productkey));   

When loading the data, we need to keep track of the latest surrogate key value to ensure that new records are assigned the correct increment. We do this using maxkey tables (pre-populated with zero):

CREATE TABLE IF NOT EXISTS staging.product_maxkey ( 
    productmaxkey integer);

INSERT INTO staging.product_maxkey
select 0;    

We use staging tables to store our incremental load, the structure of which will mirror our final target model in the dwh schema:

---staging tables to load data from data lake 
   
CREATE TABLE IF NOT EXISTS staging.product ( 
    productkey integer,
    productname character varying(200), 
    color character varying(50), 
    category character varying(100),
    PRIMARY KEY (productkey));
---dwh tables to load data from staging schema
     
CREATE TABLE IF NOT EXISTS dwh.product ( 
    productkey integer,
    productname character varying(200), 
    color character varying(50), 
    category character varying(100),
    PRIMARY KEY (productkey)); 

Incremental processing in the data warehouse

We load the target data warehouse using stored procedures to perform upserts (deletes and inserts performed in a single transaction):

CREATE OR REPLACE PROCEDURE staging.load_order() LANGUAGE plpgsql AS $$
DECLARE
BEGIN

TRUNCATE TABLE staging.order;

--insert new records to get new ids
insert into staging.orderkey
(
orderid
)
select
c.orderid
from landing.order c
LEFT JOIN staging.orderkey i
ON c.orderid=i.orderid
where i.orderid IS NULL;

--update the max key
update staging.order_maxkey
set ordermaxkey = (select max(orderkey) from staging.orderkey);


insert into staging.order
(
orderkey,
customerkey,
productkey,
addresskey,
datekey,
numberofitems,
price
)
select
xid.orderkey,
cid.customerkey,
pid.productkey,
aid.addresskey,
d.datekey,
o.numberofitems,
p.price
from
landing.order o
join staging.orderkey xid on o.orderid=xid.orderid
join landing.customer c on substring(o.pk,6,length(o.pk))=c.username   ---order table needs username
join staging.customerkey cid on cid.username=c.username
join landing.address a on a.username=c.username
join staging.addresskey aid on aid.pk=a.buildingnumber::varchar+'||'+a.postcode  ---maybe change pk to addressid
join staging.datekey d on d.orderdate=o.orderdate
join landing.product p on p.orderid=o.orderid
join staging.productkey pid on pid.productid=p.productid;

COMMIT;

END;
$$ 
CREATE OR REPLACE PROCEDURE dwh.load_order() LANGUAGE plpgsql AS $$
DECLARE
BEGIN

---delete old records 
delete from dwh.order
using staging.order as stage
where dwh.order.orderkey=stage.orderkey;

--insert new and modified
insert into dwh.order
(
orderkey,
customerkey,  
productkey,
addresskey,
price,
datekey  
)
select
orderkey,
customerkey,  
productkey,
addresskey,
price,
datekey
from staging.order;

COMMIT;
END;
$$

Use Step Functions to orchestrate the data pipeline

So far, we have stepped through each component in our workflow. We now need to stitch them together to build an automated, idempotent data pipeline. A good orchestration tool must manage failures, retries, parallelization, service integrations, and observability, so developers can focus solely on the business logic. Ideally, the workflow we build is also serverless so there is no operational overhead. Step Functions is an ideal choice to automate our data pipeline. It allows us to integrate the ELT components we have built on AWS Glue and Amazon Redshift and conduct some steps in parallel to optimize performance.

  1. On the Step Functions console, create a new state machine.
  2. Select Write your workflow in code.

  3. Enter the stepfunction_workflow.json code into the definition, replacing all placeholders with the appropriate values:
    1. [REDSHIFT-CLUSTER-IDENTIFIER] – Use the value for ClusterName (from the Outputs tab in the CloudFormation stack).
    2. [REDSHIFT-DATABASE] – Use the value for salesdwh (unless changed, this is the default database in the CloudFormation template).

We use the Step Functions IAM role from the CloudFormation template.

This JSON code generates the following pipeline.

Starting from the top, the workflow contains the following steps:

  1. We archive any existing raw and data lake files.
  2. We add two AWS Glue StartJobRun tasks that run sequentially: first to export the data from DynamoDB to our raw bucket, then from the raw bucket to our data lake.
  3. After that, we parallelize the landing of data from Amazon S3 to Amazon Redshift.
  4. We transform and load the data into our data warehouse using the Amazon Redshift Data API. Because this is asynchronous, we need to check the status of the runs before moving down the pipeline (a minimal submit-and-poll sketch follows the workflow figure).
  5. After we move the data from landing to staging, we truncate the landing tables.
  6. We load the dimensions of our target data warehouse (dwh) first, and finally we load our single fact table with its foreign key dependency on the preceding dimension tables.

The following figure illustrates a successful run.
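
As a reference for step 4, the following is a minimal boto3 sketch of the submit-and-poll pattern against the Redshift Data API. The cluster identifier, database, and secret ARN are placeholder assumptions; in the actual pipeline, Step Functions makes the equivalent calls through its native Redshift Data API integration.

import time

import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Submit the stored procedure call; the Data API returns immediately with a statement ID.
response = redshift_data.execute_statement(
    ClusterIdentifier="redshift-cluster-identifier",  # assumption: ClusterName from the stack outputs
    Database="salesdwh",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",  # assumption
    Sql="CALL staging.load_order();",
)
statement_id = response["Id"]

# Poll until the statement finishes before moving down the pipeline.
while True:
    status = redshift_data.describe_statement(Id=statement_id)["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(5)

if status != "FINISHED":
    raise RuntimeError(f"Statement {statement_id} ended with status {status}")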

After we set up the workflow, we can use EventBridge to schedule a daily midnight run, where the target is a Step Functions StartExecution API calling our state machine. Under the workflow permissions, choose Create a new role for this schedule and optionally rename it.
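
If you prefer to script the schedule instead of using the console, the following boto3 sketch creates an equivalent EventBridge rule; the state machine ARN and IAM role ARN are placeholder assumptions.

import boto3

events = boto3.client("events", region_name="us-east-1")

# Run every day at midnight UTC.
events.put_rule(
    Name="daily-dwh-elt",
    ScheduleExpression="cron(0 0 * * ? *)",
    State="ENABLED",
)

# Target the Step Functions state machine; the role must allow states:StartExecution.
events.put_targets(
    Rule="daily-dwh-elt",
    Targets=[
        {
            "Id": "dwh-state-machine",
            "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:dwh-elt",  # assumption
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-stepfunctions",  # assumption
        }
    ],
)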

Query the data warehouse

We can verify the data has been successfully loaded into Amazon Redshift with a query.

After we have the data loaded into Amazon Redshift, we’re ready to answer the query asked at the start of this post: what is the year-on-year quarterly sales growth by product and country? The query looks like the following code (depending on your dataset, you may need to select alternative years and quarters):

with sales2021q2
as
(
  select d.year, d.quarter,a.country,p.category,sum(o.price) as revenue2021q2
  from dwh.order o
  join dwh.date d on o.datekey=d.datekey
  join dwh.product p on o.productkey=p.productkey
  join dwh.address a on a.addresskey=o.addresskey
  where d.year=2021 and d.quarter=2
  group by d.year, d.quarter,a.country,p.category
  ),
sales2022q2
as
(
  select d.year, d.quarter,a.country,p.category,sum(o.price) as revenue2022q2
  from dwh.order o
  join dwh.date d on o.datekey=d.datekey
  join dwh.product p on o.productkey=p.productkey
  join dwh.address a on a.addresskey=o.addresskey
  where d.year=2022 and d.quarter=2
  group by d.year, d.quarter,a.country,p.category
  )

select a.country,a.category, ((revenue2022q2 - revenue2021q2)/revenue2021q2)*100 as quarteronquartergrowth
from sales2022q2 a
join sales2021q2 b on a.country=b.country and a.category=b.category
order by a.country,a.category

We can visualize the results in Amazon Redshift Query Editor V2 by toggling the chart option and setting Type as Pie, Values as quarteronquartergrowth, and Labels as category.

Cost considerations

We give a brief outline of the indicative costs associated with the key services covered in our solution based on us-east-1 Region pricing using the AWS Pricing Calculator:

  • DynamoDB – With on-demand settings for 1.5 million items (average size of 355 bytes) and associated write and read capacity plus PITR storage, the cost of DynamoDB is approximately $2 per month.
  • AWS Glue DynamoDB export connector – This connector utilizes the DynamoDB export to Amazon S3 feature. This has no hourly cost—you only pay for the gigabytes of data exported to Amazon S3 ($0.11 per GiB).
  • Amazon S3 – You pay for storing objects in your S3 buckets. The rate you’re charged depends on your objects’ size, how long you stored the objects during the month, and the storage class. In our solution, we used S3 Standard for our data lake and S3 Standard – Infrequent Access for archive. Standard-IA storage is $0.0125 per GB/month; Standard storage is $0.023 per GB/month.
  • AWS Glue Jobs – With AWS Glue, you only pay for the time your ETL job takes to run. There are no resources to manage, no upfront costs, and you are not charged for startup or shutdown time. AWS charges you an hourly rate based on the number of Data Processing Units (DPUs) used to run your ETL job. A single DPU provides 4 vCPU and 16 GB of memory. Every one of our nine Spark jobs uses 10 DPUs and has an average runtime of 3 minutes. This gives an approximate cost of $0.29 per job.
  • Amazon Redshift – We provisioned two RA3.4xlarge nodes for our Amazon Redshift cluster. If run on-demand, each node costs $3.26 per hour. If utilized 24/7, our monthly cost would be approximately $4,759.60. You should evaluate your workload to determine what cost savings can be achieved by using Amazon Redshift Serverless or using Amazon Redshift provisioned reserved instances.
  • Step Functions – You are charged based on the number of state transitions required to run your application. Step Functions counts a state transition as each time a step of your workflow is run. You’re charged for the total number of state transitions across all your state machines, including retries. The Step Functions free tier includes 4,000 free state transitions per month. Thereafter, it’s $0.025 per 1,000 state transitions.

Clean up

Remember to delete any resources created through the CloudFormation stack. You first need to manually empty and delete the S3 buckets. Then you can delete the CloudFormation stack using the AWS CloudFormation console or AWS Command Line Interface (AWS CLI). For instructions, refer to Clean up your “hello, world!” application and related resources.

Summary

In this post, we demonstrated how you can export data from DynamoDB to Amazon S3 and Amazon Redshift to perform advanced analytics. We built an automated data pipeline that you can use to perform a batch ELT process that can be scheduled to run daily, weekly, or monthly and can scale to handle very large workloads.

Please leave your feedback or comments in the comments section.


About the Author

Altaf Hussain is an Analytics Specialist Solutions Architect at AWS. He helps customers around the globe design and optimize their big data and data warehousing solutions.


Appendix

To extract the data from DynamoDB and load it into our Amazon Redshift database, we can use the Spark script editor and upload the files from GlueSparkJobs.zip to create each individual job necessary to perform the extract and load. If you choose to do this, remember to update, where appropriate, the account ID and Region placeholders in the scripts. Also, on the Job details tab under Advanced properties, add the Amazon Redshift connection.

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/efficiently-crawl-your-data-lake-and-improve-data-access-with-aws-glue-crawler-using-partition-indexes/

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems. AWS Glue crawlers provide a straightforward way to catalog data in the AWS Glue Data Catalog that removes the heavy lifting when it comes to schema management and data classification. AWS Glue crawlers extract the data schema and partitions from Amazon S3 to automatically populate the Data Catalog, keeping the metadata current.

But with data growing exponentially over time, the number of partitions in a given table can grow significantly. When analytics services like Amazon Athena query a table containing millions of partitions, the time needed to retrieve the partition metadata increases, which can increase query runtime.

Today, AWS Glue crawler support has been expanded to automatically add partition indexes for newly discovered tables to optimize query processing on the partitioned dataset. Now, when the crawler creates a new Data Catalog table during a crawler run, it also creates a partition index by default, with the largest permutation of all numeric and string type partition columns as keys. The Data Catalog then creates a searchable index based on these keys, reducing the time required to retrieve and filter partition metadata on tables with millions of partitions. The creation of partition indexes benefits the analytics workloads running on Athena, Amazon EMR, Amazon Redshift Spectrum, and AWS Glue.

In this post, we describe how to create partition indexes with an AWS Glue crawler and compare the query performance improvement when accessing the crawled data with and without a partition index from Athena.

Solution overview

We use an AWS CloudFormation template to create our solution resources. In the following steps, we demonstrate how to configure the AWS Glue crawler to create a partition index using either the AWS Glue console or the AWS Command Line Interface (AWS CLI). Then we compare the query performance improvements using Athena.

Prerequisites

To follow along with this post, you must have access to an AWS Identity and Access Management (IAM) administrator role to create resources using AWS CloudFormation.

Set up your solution resources

The CloudFormation template generates the following resources:

  • IAM roles and policies
  • An AWS Glue database to hold the schema
  • An AWS Glue crawler pointing to a highly partitioned dataset
  • An Athena workgroup and bucket to store query results

Complete the following steps to set up the solution resources:

  1. Log in to the AWS Management Console as an IAM administrator.
  2. Choose Launch Stack to deploy the CloudFormation template:
  3. For DatabaseName, keep the default blog_partition_index_crawlerdb.
  4. Choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create stack.
  7. When the stack is complete, on the AWS CloudFormation console, navigate to the Outputs tab of the stack.
  8. Note down values of DatabaseName and GlueCrawlerName.

Some of the resources that this stack deploys incur costs when in use.

Edit and run the AWS Glue crawler

To configure and run the AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Locate the crawler blog-partition-index-crawler and choose Edit.
  3. In the Set output and scheduling section, under Advanced options, select Create partition indexes automatically.
  4. Review and update the crawler settings.

Alternatively, you can configure your crawler using the AWS CLI (provide your IAM role and Region):

aws glue create-crawler --name blog-partition-index-crawler --targets '{ "S3Targets": [{ "Path": "s3://awsglue-datasets/examples/highly-partitioned-table/"}] }' --database-name "blog_partition_index_crawlerdb" --role <Crawler_IAM_role> --configuration "{\"Version\":1.0,\"CreatePartitionIndex\":true}" --region <region_name>
  5. Now run the crawler and verify that the crawler run is complete.

This is a highly partitioned dataset, and the crawler run will take approximately 90 minutes to complete.

Verify the partitioned table

In the AWS Glue database blog_partition_index_crawlerdb, verify that the table highly_partitioned_table is created.

By default, the crawler determines an index based on the largest permutation of partition columns of valid column types in the same order of partition columns, which are either numeric or string. For the table created by the crawler (highly_partitioned_table), we have partition columns year (string), month (string), day (string), and hour (string).

Based on this definition, the crawler created an index on the permutation of year, month, day, and hour. The crawler prefixes the name of any partition index it creates by default with crawler_.

Verify the same by navigating to the table highly_partitioned_table on the AWS Glue console and choosing the Indexes tab.
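
You can also confirm the index from a script. The following boto3 sketch lists the partition indexes on the crawled table, using the default database and table names from this post:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# List the partition indexes the crawler created on the table.
response = glue.get_partition_indexes(
    DatabaseName="blog_partition_index_crawlerdb",
    TableName="highly_partitioned_table",
)

for index in response["PartitionIndexDescriptorList"]:
    keys = [key["Name"] for key in index["Keys"]]
    print(index["IndexName"], index["IndexStatus"], keys)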

The crawler was able to crawl the S3 data source and successfully populate the partition indexes for the table.

Compare the query performance improvements using Athena

First, we query the table in Athena without using the partition index. To verify the tables using Athena, complete the following steps:

  1. On the Athena console, choose crawler-primary-workgroup as the Athena workgroup and choose Acknowledge.
  2. Run the following query:
    select count(*), sum(value) from blog_partition_index_crawlerdb.highly_partitioned_table where year='1980' and month='01' and day ='01'

The following screenshot shows that the query took approximately 32 seconds without partition filtering enabled.

  3. Now we enable partition filtering for the table in Athena:
    ALTER TABLE blog_partition_index_crawlerdb.highly_partitioned_table
    SET TBLPROPERTIES ('partition_filtering.enabled' = 'true')

  4. Run the following query again and note the runtime:
    select count(*), sum(value) from blog_partition_index_crawlerdb.highly_partitioned_table where year='1980' and month='01' and day ='01'

The following screenshot shows that the query took only 700 milliseconds, which is much faster with partition filtering enabled through the partition index.

Clean up

To avoid unwanted charges to your AWS account, you can delete the AWS resources:

  1. Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
  2. Delete the CloudFormation stack you created.

Conclusion

In this post, we explained how to configure an AWS Glue crawler to create partition indexes and compared the query performance in Athena when accessing the crawled data with and without a partition index.

If no partition indexes are present on the table, AWS Glue loads all the partitions of the table and then filters the loaded partitions, which results in inefficient retrieval of metadata. Analytics services like Redshift Spectrum, Amazon EMR, and AWS Glue ETL Spark DataFrames can now utilize indexes for fetching partitions, resulting in significant query performance improvements.

For more information on partition indexes and query performance across various analytical engines, refer to Improve Amazon Athena query performance using AWS Glue Data Catalog partition indexes and Improve query performance using AWS Glue partition indexes.

Special thanks to everyone who contributed to this crawler feature launch: Yuhang Chen, Kyle Duong, and Mita Gavade.


About the authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

AWS Week in Review – Automate DLQ Redrive for SQS, Lambda Supports Ruby 3.2, and More – June 12, 2023

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/aws-week-in-review-automate-dlq-redrive-for-sqs-lambda-supports-ruby-3-2-and-more-june-12-2023/

Today I’m boarding a plane for Madrid. I will attend the AWS Summit Madrid this Thursday, and I will take Serverlesspresso with me. Serverlesspresso is a demo that we take to events, where you can learn how to build event-driven architectures with serverless. If you are visiting an AWS Summit, you will most likely find one of our booths.

Serverlesspresso at Madrid

Last Week’s Launches
Here are some launches that got my attention during the previous week.

Amazon SQS – Customers were very excited when we announced DLQ redrive for Amazon SQS, because that feature helped them easily redirect failed messages. This week we added AWS SDK and CLI support for this feature, allowing you to redrive the messages on the DLQ programmatically and making the feature even easier to use. You can read Seb’s blog post about this new feature to learn how to get started.

AWS Lambda – AWS Lambda now supports Ruby 3.2. Ruby 3.2 has many new improvements, for example, passing anonymous arguments to functions or having endless methods. Check out this blog post that goes in depth into each of the new features.

Amazon Fraud Detector – Amazon Fraud Detector supports event orchestration with Amazon EventBridge. This is a very important feature because now you can act on the different events that Fraud Detector emits, for example, send notifications to different stakeholders.

AWS Glue – This week, AWS Glue made two important announcements. First, it announced the general availability of AWS Glue for Ray, a new data integration engine option for AWS Glue. Ray is a popular new open-source compute framework that helps developers to scale their Python workloads. In addition, AWS Glue announced AWS Glue Data Quality, a new capability that automatically measures and monitors data lake and data pipeline quality.

Amazon Elastic Container Registry (Amazon ECR) – AWS Signer and Amazon ECR announced a new feature that allows you to sign and verify container images. You can use Signer to validate that only container images you have approved are deployed in your Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

Amazon QuickSight – Amazon QuickSight now supports APIs to automate asset deployment, so you can easily replicate the same QuickSight assets in multiple Regions and accounts. You can read more on how to use those APIs in this blog post.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

  • AWS Silicon Innovation Day (June 21) – A one-day virtual event that focuses on AWS Silicon and how you can take advantage of AWS’s unique offerings. Learn more and register here.
  • AWS Global Summits – There are many summits going on right now around the world: Toronto (June 14), Madrid (June 15), and Milano (June 22).
  • AWS Community Day – Join a community-led conference run by AWS user group leaders in your region: Chicago (June 15), Manila (June 29–30), Chile (July 1), and Munich (September 14).
  • CDK Day – CDK Day is happening again this year on September 29. The call for papers for this event is open, and this year we are also accepting talks in Spanish. Submit your talk here.

That’s all for this week. Check back next Monday for another Week in Review!

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

— Marcia

AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry

Post Syndicated from Vivekanand Tiwari original https://aws.amazon.com/blogs/big-data/aws-glue-streaming-application-to-process-amazon-msk-data-using-aws-glue-schema-registry/

Organizations across the world are increasingly relying on streaming data, and there is a growing need for real-time data analytics, considering the growing velocity and volume of data being collected. This data can come from a diverse range of sources, including Internet of Things (IoT) devices, user applications, and logging and telemetry information from applications, to name a few. By harnessing the power of streaming data, organizations are able to stay ahead of real-time events and make quick, informed decisions. With the ability to monitor and respond to real-time events, organizations are better equipped to capitalize on opportunities and mitigate risks as they arise.

One notable trend in the streaming solutions market is the widespread use of Apache Kafka for data ingestion and Apache Spark for streaming processing across industries. Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed Apache Kafka service that offers a seamless way to ingest and process streaming data.

However, as data volume and velocity grow, organizations may need to enrich their data with additional information from multiple sources, leading to a constantly evolving schema. The AWS Glue Schema Registry addresses this complexity by providing a centralized platform for discovering, managing, and evolving schemas from diverse streaming data sources. Acting as a bridge between producer and consumer apps, it enforces the schema, reduces the data footprint in transit, and safeguards against malformed data.

To process data effectively, we turn to AWS Glue, a serverless data integration service that provides an Apache Spark-based engine and offers seamless integration with numerous data sources. AWS Glue is an ideal solution for running stream consumer applications, discovering, extracting, transforming, loading, and integrating data from multiple sources.

This post explores how to use a combination of Amazon MSK, the AWS Glue Schema Registry, AWS Glue streaming ETL jobs, and Amazon Simple Storage Service (Amazon S3) to create a robust and reliable real-time data processing platform.

Overview of solution

In this streaming architecture, the initial phase involves registering a schema with the AWS Glue Schema Registry. This schema defines the data being streamed while providing essential details like columns and data types, and a table is created in the AWS Glue Data Catalog based on this schema. The schema serves as a single source of truth for producer and consumer applications, and you can use the schema evolution feature of the AWS Glue Schema Registry to keep it consistent as the data changes over time. Refer to the appendix section for more information on this feature. The producer application retrieves the schema from the Schema Registry and uses it to serialize the records into the Avro format and ingest the data into an MSK cluster. This serialization ensures that the records are properly structured and ready for processing.

Next, an AWS Glue streaming ETL (extract, transform, and load) job is set up to process the incoming data. This job extracts data from the Kafka topics, deserializes it using the schema information from the Data Catalog table, and loads it into Amazon S3. It’s important to note that the schema in the Data Catalog table serves as the source of truth for the AWS Glue streaming job. Therefore, it’s crucial to keep the schema definition in the Schema Registry and the Data Catalog table in sync. Failure to do so may result in the AWS Glue job being unable to properly deserialize records, leading to null values. To avoid this, it’s recommended to use a data quality check mechanism to identify such anomalies and take appropriate action in case of unexpected behavior. The ETL job continuously consumes data from the Kafka topics, so it’s always up to date with the latest streaming data. Additionally, the job employs checkpointing, which keeps track of the processed records and allows it to resume processing from where it left off in the event of a restart. For more information about checkpointing, see the appendix at the end of this post.
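
The complete streaming script used in this post is available in the GitHub repository referenced later. As a condensed illustration only, the core of such a job might look like the following sketch; the database, table, bucket, and checkpoint names are placeholder assumptions.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the Kafka-backed Data Catalog table; the table's schema (kept in sync with the
# Schema Registry) is used to deserialize the Avro records.
streaming_df = glue_context.create_data_frame.from_catalog(
    database="test_glue_database",     # assumption: Data Catalog database from the stack
    table_name="kafka_source_table",   # assumption: catalog table created from the registered schema
    additional_options={"startingOffsets": "earliest"},
)

def process_batch(data_frame, batch_id):
    # Persist each non-empty micro-batch to Amazon S3.
    if data_frame.count() > 0:
        data_frame.write.mode("append").parquet("s3://example-output-bucket/processed/")  # assumption

# Checkpointing lets the job resume from the last committed offsets after a restart.
glue_context.forEachBatch(
    frame=streaming_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "s3://example-output-bucket/checkpoint/",  # assumption
    },
)
job.commit()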

After the processed data is stored in Amazon S3, we create an AWS Glue crawler to create a Data Catalog table that acts as a metadata layer for the data. The table can be queried using Amazon Athena, a serverless, interactive query service that enables running SQL-like queries on data stored in Amazon S3.

The following diagram illustrates our solution architecture.

architecture diagram

For this post, we create the solution resources in the us-east-1 Region using AWS CloudFormation templates. In the following sections, we show you how to configure your resources and implement the solution.

Prerequisites

Create and download a valid key to SSH into an Amazon Elastic Compute Cloud (Amazon EC2) instance from your local machine. For instructions, see Create a key pair using Amazon EC2.

Configure resources with AWS CloudFormation

To create your solution resources, complete the following steps:

  1. Launch the stack vpc-subnet-and-mskclient using the CloudFormation template vpc-subnet-and-mskclient.template. This stack creates an Amazon VPC, private and public subnets, security groups, interface endpoints, an S3 bucket, an AWS Secrets Manager secret, and an EC2 instance.
    launch stack 1
  2. Provide parameter values as listed in the following table.

    Parameters Description
    EnvironmentName Environment name that is prefixed to resource names.
    VpcCIDR IP range (CIDR notation) for this VPC.
    PublicSubnet1CIDR IP range (CIDR notation) for the public subnet in the first Availability Zone.
    PublicSubnet2CIDR IP range (CIDR notation) for the public subnet in the second Availability Zone.
    PrivateSubnet1CIDR IP range (CIDR notation) for the private subnet in the first Availability Zone.
    PrivateSubnet2CIDR IP range (CIDR notation) for the private subnet in the second Availability Zone.
    KeyName Key pair name used to log in to the EC2 instance.
    SshAllowedCidr CIDR block for allowing SSH connection to the instance. Check your public IP using http://checkip.amazonaws.com/ and add /32 at the end of the IP address.
    InstanceType Instance type for the EC2 instance.
  3. When stack creation is complete, retrieve the EC2 instance PublicDNS and S3 bucket name (for key BucketNameForScript) from the stack’s Outputs tab.

    Cloudformation stack 1 - output
  4. Log in to the EC2 instance using the key pair you created as a prerequisite.
  5. Clone the GitHub repository, and upload the ETL script from the glue_job_script folder to the S3 bucket created by the CloudFormation template:
    $ git clone https://github.com/aws-samples/aws-glue-msk-with-schema-registry.git 
    $ cd aws-glue-msk-with-schema-registry 
    $ aws s3 cp glue_job_script/mskprocessing.py s3://{BucketNameForScript}/

  6. Launch another stack amazon-msk-and-glue using template amazon-msk-and-glue.template. This stack creates an MSK cluster, schema registry, schema definition, database, table, AWS Glue crawler, and AWS Glue streaming job.
    launch stack 1
  7. Provide parameter values as listed in the following table.

    Parameters Description Sample value
    EnvironmentName Environment name that is prefixed to resource names. amazon-msk-and-glue
    VpcId ID of the VPC for security group. Use the VPC ID created with the first stack. Refer to the first stack’s output.
    PrivateSubnet1 Subnet used for creating the MSK cluster and AWS Glue connection. Refer to the first stack’s output.
    PrivateSubnet2 Second subnet for the MSK cluster. Refer to the first stack’s output.
    SecretArn Secrets Manager secret ARN for Amazon MSK SASL/SCRAM authentication. Refer to the first stack’s output.
    SecurityGroupForGlueConnection Security group used by the AWS Glue connection. Refer to the first stack’s output.
    AvailabilityZoneOfPrivateSubnet1 Availability Zone for the first private subnet used for the AWS Glue connection.
    SchemaRegistryName Name of the AWS Glue schema registry. test-schema-registry
    MSKSchemaName Name of the schema. test_payload_schema
    GlueDataBaseName Name of the AWS Glue Data Catalog database. test_glue_database
    GlueTableName Name of the AWS Glue Data Catalog table. output
    ScriptPath AWS Glue ETL script absolute S3 path. For example, s3://bucket-name/mskprocessing.py. Use the target S3 path from the previous steps.
    GlueWorkerType Worker type for AWS Glue job. For example, Standard, G.1X, G.2X, G.025X. G.1X
    NumberOfWorkers Number of workers in the AWS Glue job. 5
    S3BucketForOutput Bucket name for writing data from the AWS Glue job. aws-glue-msk-output-{accId}-{region}
    TopicName MSK topic name that needs to be processed. test

    The stack creation process can take around 15–20 minutes to complete. You can check the Outputs tab for the stack after the stack is created.

    Cloudformation stack 2 output

    The following table summarizes the resources that are created as a part of this post.

    Logical ID Type
    VpcEndoint AWS::EC2::VPCEndpoint
    VpcEndoint AWS::EC2::VPCEndpoint
    DefaultPublicRoute AWS::EC2::Route
    EC2InstanceProfile AWS::IAM::InstanceProfile
    EC2Role AWS::IAM::Role
    InternetGateway AWS::EC2::InternetGateway
    InternetGatewayAttachment AWS::EC2::VPCGatewayAttachment
    KafkaClientEC2Instance AWS::EC2::Instance
    KeyAlias AWS::KMS::Alias
    KMSKey AWS::KMS::Key
    KmsVpcEndoint AWS::EC2::VPCEndpoint
    MSKClientMachineSG AWS::EC2::SecurityGroup
    MySecretA AWS::SecretsManager::Secret
    PrivateRouteTable1 AWS::EC2::RouteTable
    PrivateSubnet1 AWS::EC2::Subnet
    PrivateSubnet1RouteTableAssociation AWS::EC2::SubnetRouteTableAssociation
    PrivateSubnet2 AWS::EC2::Subnet
    PrivateSubnet2RouteTableAssociation AWS::EC2::SubnetRouteTableAssociation
    PublicRouteTable AWS::EC2::RouteTable
    PublicSubnet1 AWS::EC2::Subnet
    PublicSubnet1RouteTableAssociation AWS::EC2::SubnetRouteTableAssociation
    PublicSubnet2 AWS::EC2::Subnet
    PublicSubnet2RouteTableAssociation AWS::EC2::SubnetRouteTableAssociation
    S3Bucket AWS::S3::Bucket
    S3VpcEndoint AWS::EC2::VPCEndpoint
    SecretManagerVpcEndoint AWS::EC2::VPCEndpoint
    SecurityGroup AWS::EC2::SecurityGroup
    SecurityGroupIngress AWS::EC2::SecurityGroupIngress
    VPC AWS::EC2::VPC
    BootstrapBrokersFunctionLogs AWS::Logs::LogGroup
    GlueCrawler AWS::Glue::Crawler
    GlueDataBase AWS::Glue::Database
    GlueIamRole AWS::IAM::Role
    GlueSchemaRegistry AWS::Glue::Registry
    MSKCluster AWS::MSK::Cluster
    MSKConfiguration AWS::MSK::Configuration
    MSKPayloadSchema AWS::Glue::Schema
    MSKSecurityGroup AWS::EC2::SecurityGroup
    S3BucketForOutput AWS::S3::Bucket
    CleanupResourcesOnDeletion AWS::Lambda::Function
    BootstrapBrokersFunction AWS::Lambda::Function

Build and run the producer application

After successfully creating the CloudFormation stack, you can now proceed with building and running the producer application to publish records on MSK topics from the EC2 instance, as shown in the following code. Detailed instructions including supported arguments and their usage are outlined in the README.md page in the GitHub repository.

$ cd amazon_msk_producer 
$ mvn clean package 
$ BROKERS={OUTPUT_VAL_OF_MSKBootstrapServers – Ref. Step 6}
$ REGISTRY_NAME={VAL_OF_GlueSchemaRegistryName - Ref. Step 6}
$ SCHEMA_NAME={VAL_OF_SchemaName– Ref. Step 6}
$ TOPIC_NAME="test"
$ SECRET_ARN={OUTPUT_VAL_OF_SecretArn – Ref. Step 3}
$ java -jar target/amazon_msk_producer-1.0-SNAPSHOT-jar-with-dependencies.jar -brokers $BROKERS -secretArn $SECRET_ARN -region us-east-1 -registryName $REGISTRY_NAME -schema $SCHEMA_NAME -topic $TOPIC_NAME -numRecords 10

If the records are successfully ingested into the Kafka topics, you may see a log similar to the following screenshot.

kafka log

Grant permissions

Confirm if your AWS Glue Data Catalog is being managed by AWS Lake Formation and grant necessary permissions. To check if Lake Formation is managing the permissions for the newly created tables, we can navigate to the Settings page on the Lake Formation console, or we can use the Lake Formation CLI command get-data-lake-settings.

If the check boxes on the Lake Formation Data Catalog settings page are unselected (see the following screenshot), that means that the Data Catalog permissions are being managed by Lake Formation.

Lakeformation status

If using the Lake Formation CLI, check if the values of CreateDatabaseDefaultPermissions and CreateTableDefaultPermissions are NULL in the output. If so, this confirms that the Data Catalog permissions are being managed by AWS Lake Formation.

If we can confirm that the Data Catalog permissions are being managed by AWS Lake Formation, we have to grant DESCRIBE and CREATE TABLE permissions for the database, and SELECT, ALTER, DESCRIBE, and INSERT permissions for the table, to the AWS Identity and Access Management (IAM) role used by the AWS Glue streaming ETL job before starting the job. Similarly, we have to grant DESCRIBE permissions for the database and DESCRIBE and SELECT permissions for the table to the IAM principals using Amazon Athena to query the data. We can get the AWS Glue service IAM role, database, table, streaming job name, and crawler names from the Outputs tab of the CloudFormation stack amazon-msk-and-glue. For instructions on granting permissions via AWS Lake Formation, refer to Granting Data Catalog permissions using the named resource method.
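
As an alternative to the console, the following boto3 sketch shows what the grants for the AWS Glue streaming job role could look like; the role ARN, database, and table names are placeholder assumptions, so substitute the values from the stack outputs.

import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

glue_job_role_arn = "arn:aws:iam::123456789012:role/GlueStreamingJobRole"  # assumption

# Database-level permissions for the streaming job.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": glue_job_role_arn},
    Resource={"Database": {"Name": "test_glue_database"}},
    Permissions=["DESCRIBE", "CREATE_TABLE"],
)

# Table-level permissions for the streaming job.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": glue_job_role_arn},
    Resource={"Table": {"DatabaseName": "test_glue_database", "Name": "output"}},
    Permissions=["SELECT", "ALTER", "DESCRIBE", "INSERT"],
)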

Run the AWS Glue streaming job

To process the data from the MSK topic, complete the following steps:

  1. Retrieve the name of the AWS Glue streaming job from the amazon-msk-and-glue stack output.
  2. On the AWS Glue console, choose Jobs in the navigation pane.
  3. Choose the job name to open its details page.
  4. Choose Run job to start the job.

Because this is a streaming job, it will continue to run indefinitely until manually stopped.

Run the AWS Glue crawler

Once the AWS Glue streaming job starts processing the data, you can use the following steps to check the processed data and create a table with the AWS Glue crawler in order to query it:

  1. Retrieve the name of the output bucket S3BucketForOutput from the stack output and validate if output folder has been created and contains data.
  2. Retrieve the name of the Crawler from the stack output.
  3. Navigate to the AWS Glue Console.
  4. In the left pane, select Crawlers.
  5. Run the crawler.

In this post, we run the crawler one time to create the target table for demo purposes. In a typical scenario, you would run the crawler periodically or create or manage the target table another way. For example, you could use the saveAsTable() method in Spark to create the table as part of the ETL job itself, or you could use enableUpdateCatalog=True in the AWS Glue ETL job to enable Data Catalog updates. For more information about this AWS Glue ETL feature, refer to Creating tables, updating the schema, and adding new partitions in the Data Catalog from AWS Glue ETL jobs.
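
As a brief illustration of that last option, the following sketch writes a DynamicFrame through a Glue sink with enableUpdateCatalog turned on; the bucket, database, and table names are placeholder assumptions, and the stand-in DynamicFrame exists only to keep the example self-contained.

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Stand-in DynamicFrame; in the real job this would be the processed streaming data.
df = spark.createDataFrame([("1", 42.0)], ["id", "value"])
dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

# Write to S3 and create or update the Data Catalog table in the same step.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://example-output-bucket/processed/",  # assumption
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
)
sink.setCatalogInfo(
    catalogDatabase="test_glue_database",  # assumption
    catalogTableName="output",             # assumption
)
sink.setFormat("glueparquet")
sink.writeFrame(dyf)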

Validate the data in Athena

After the AWS Glue crawler has successfully created the table for the processed data in the Data Catalog, follow these steps to validate the data using Athena:

  1. On the Athena console, navigate to the query editor.
  2. Choose the Data Catalog as the data source.
  3. Choose the database and table that the crawler created.
  4. Enter a SQL query to validate the data.
  5. Run the query.

The following screenshot shows the output of our example query.

Athena output

Clean up

To clean up your resources, complete the following steps:

  1. Delete the CloudFormation stack amazon-msk-and-glue.
  2. Delete the CloudFormation stack vpc-subnet-and-mskclient.

Conclusion

This post provided a solution for building a robust streaming data processing platform using a combination of Amazon MSK, the AWS Glue Schema Registry, an AWS Glue streaming job, and Amazon S3. By following the steps outlined in this post, you can create and control your schema in the Schema Registry, integrate it with a data producer to ingest data into an MSK cluster, set up an AWS Glue streaming job to extract and process data from the cluster using the Schema Registry, store processed data in Amazon S3, and query it using Athena.

Let’s start using the AWS Glue Schema Registry to manage schema evolution for streaming data ETL with AWS Glue. If you have any feedback related to this post, please feel free to leave it in the comments section below.

Appendix

This appendix provides more information about the Apache Spark Structured Streaming checkpointing feature and a brief summary of how schema evolution can be handled using the AWS Glue Schema Registry.

Checkpointing

Checkpointing is a mechanism in Spark streaming applications to persist enough information in a durable storage to make the application resilient and fault-tolerant. The items stored in checkpoint locations are mainly the metadata for application configurations and the state of processed offsets. Spark uses synchronous checkpointing, meaning it ensures that the checkpoint state is updated after every micro-batch run. It stores the end offset value of each partition under the offsets folder for the corresponding micro-batch run before processing, and logs the record of processed batches under the commits folder. In the event of a restart, the application can recover from the last successful checkpoint, provided the offset hasn’t expired in the source Kafka topic. If the offset has expired, we have to set the property failOnDataLoss to false so that the streaming query doesn’t fail as a result of this.
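
To make the checkpointing options concrete, the following is a small plain Spark Structured Streaming sketch; the broker, topic, and S3 paths are placeholder assumptions, and the AWS Glue job in this post passes the equivalent settings through forEachBatch and the Kafka connection.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpointing-example").getOrCreate()

# failOnDataLoss=false keeps the query alive if previously committed offsets have expired.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1.example.com:9092")  # assumption
    .option("subscribe", "test")                                     # assumption: topic name
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")
    .load()
)

# The checkpoint location stores offsets and commit logs so a restart resumes where it left off.
query = (
    stream.writeStream.format("parquet")
    .option("path", "s3://example-output-bucket/data/")                      # assumption
    .option("checkpointLocation", "s3://example-output-bucket/checkpoint/")  # assumption
    .start()
)
query.awaitTermination()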

Schema evolution

As the schema of data evolves over time, it needs to be incorporated into producer and consumer applications to avert application failure due to data encoding issues. The AWS Glue Schema Registry offers a rich set of options for schema compatibility such as backward, forward, and full to update the schema in the Schema Registry. Refer to Schema versioning and compatibility for the full list.

The default option is backward compatibility, which satisfies the majority of use cases. This option allows you to delete any existing fields and add optional fields. Steps to implement schema evolution using the default compatibility are as follows:

  1. Register the new schema version to update the schema definition in the Schema Registry (a minimal registration sketch follows these steps).
  2. Upon success, update the AWS Glue Data Catalog table using the updated schema.
  3. Restart the AWS Glue streaming job to incorporate the changes in the schema for data processing.
  4. Update the producer application code base to build and publish the records using the new schema, and restart it.
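
For step 1, a minimal boto3 sketch of registering a new, backward-compatible schema version might look like the following; the registry and schema names match the sample CloudFormation parameters, and the Avro definition is a hypothetical example.

import json

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Hypothetical Avro schema with a new optional field (a backward-compatible change).
new_schema = {
    "type": "record",
    "name": "payload",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "value", "type": "double"},
        {"name": "unit", "type": ["null", "string"], "default": None},  # new optional field
    ],
}

response = glue.register_schema_version(
    SchemaId={
        "RegistryName": "test-schema-registry",
        "SchemaName": "test_payload_schema",
    },
    SchemaDefinition=json.dumps(new_schema),
)
print(response["VersionNumber"], response["Status"])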

About the Authors

Author Headshot - Vivekanand TiwariVivekanand Tiwari is a Cloud Architect at AWS. He finds joy in assisting customers on their cloud journey, especially in designing and building scalable, secure, and optimized data and analytics workloads on AWS. During his leisure time, he prioritizes spending time with his family.

Author Headshot - Subramanya VajirayaSubramanya Vajiraya is a Sr. Cloud Engineer (ETL) at AWS Sydney specialized in AWS Glue. He is passionate about helping customers solve issues related to their ETL workload and implement scalable data processing and analytics pipelines on AWS. Outside of work, he enjoys going on bike rides and taking long walks with his dog Ollie, a 2-year-old Corgi.

Author Headshot - Akash DeepAkash Deep is a Cloud Engineer (ETL) at AWS with a specialization in AWS Glue. He is dedicated to assisting customers in resolving issues related to their ETL workloads and creating scalable data processing and analytics pipelines on AWS. In his free time, he prioritizes spending quality time with his family.

How Cargotec uses metadata replication to enable cross-account data sharing

Post Syndicated from Sumesh M R original https://aws.amazon.com/blogs/big-data/how-cargotec-uses-metadata-replication-to-enable-cross-account-data-sharing/

This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland.

Cargotec (Nasdaq Helsinki: CGCBV) is a Finnish company that specializes in cargo handling solutions and services. The company is headquartered in Helsinki, Finland, and operates globally in over 100 countries. With its leading cargo handling solutions and services, Cargotec is a pioneer in its field. Through its unique position in ports, at sea, and on roads, the company optimizes global cargo flows and creates sustainable customer value.

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. For this, Cargotec built an Amazon Simple Storage Service (Amazon S3) data lake and cataloged the data assets in AWS Glue Data Catalog. They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed.

In this blog, we discuss the technical challenges faced by Cargotec in replicating their AWS Glue metadata across AWS accounts, and how they navigated these challenges successfully to enable cross-account data sharing.  By sharing their story, we hope to inspire readers facing similar challenges and provide insights into how our services can be customized to meet your specific needs.

Challenges

Like many customers, Cargotec’s data lake is distributed across multiple AWS accounts that are owned by different teams. Cargotec wanted to find a solution to share datasets across accounts and use Amazon Athena to query them. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views. Cargotec’s use cases also required them to create views that span tables and views across catalogs. Cargotec’s implementation covers three discrete AWS accounts, 25 databases, 150 tables, and 10 views.

Solution overview

Cargotec required a single catalog per account that contained metadata from their other AWS accounts. The solution that best fit their needs was to replicate metadata using an in-house version of a publicly available utility called Metastore Migration utility. Cargotec extended the utility by changing the overall orchestration layer by adding an Amazon SQS notification and an AWS Lambda. The approach was to programmatically copy and make available each catalog entity (databases, tables, and views) to all consumer accounts. This makes the tables or views local to the account where the query is being run, while the data still remains in its source S3 bucket.

Cargotec’s solution architecture

The following diagram summarizes the architecture and overall flow of events in Cargotec’s design.

Solution Architecture

Catalog entries from a source account are programmatically replicated to multiple target accounts using the following series of steps.

  1. An AWS Glue job (metadata exporter) runs daily on the source account. It reads the table and partition information from the source AWS Glue Data Catalog. Since the target account is used for analytical purposes and does not require real-time schema changes, the metadata exporter runs only once a day. Cargotec uses partition projection, which ensures that the new partitions are available in real-time.
  2. The job then writes the metadata to an S3 bucket in the same account. Please note that the solution doesn’t involve movement of the data across accounts. The target accounts read data from the source account S3 buckets. For guidance on setting up the right permissions, please see the Amazon Athena User Guide.
  3. After the metadata export has been completed, the AWS Glue job pushes a notification to an Amazon Simple Notification Service (Amazon SNS) topic. This message contains the S3 path to the latest metadata export. The SNS notification is Cargotec’s customization to the existing open-source utility.
  4. Every target account runs an AWS Lambda function that is notified when the source account SNS topic receives a push. In short, there are multiple subscriber Lambda functions (one per target account) for the source account SNS topics that get triggered when an export job is completed.
  5. Once triggered, the Lambda function then initiates an AWS Glue job (metadata importer) on the respective target account. The job receives as input the source account’s S3 path to the metadata that has been recently exported.
  6. Based on the path provided, the metadata importer reads the exported metadata from the source S3 bucket.
  7. The metadata importer now uses this information to create or update the corresponding catalog information in the target account.

All along the way, any errors are published to a separate SNS topic for logging and monitoring purposes. With this approach, Cargotec was able to create and consume views that span tables and views from multiple catalogs spread across different AWS accounts.

Implementation

The core of the catalog replication utility is two AWS Glue scripts:

  • Metadata exporter – An AWS Glue job that reads the source data catalog and creates an export of the databases, tables, and partitions in an S3 bucket in the source account.
  • Metadata importer – An AWS Glue job that reads the export that was created by the metadata exporter and applies the metadata to target databases. This code is triggered by a Lambda function once files are written to S3. The job runs in the target account.

Metadata exporter

This section provides details on the AWS Glue job that exports the AWS Glue Data Catalog into an S3 location. The source code for the application is hosted in the AWS Glue GitHub repository. Though this may need to be customized to suit your needs, we will go over the core components of the code in this blog.

Metadata exporter inputs

The application takes a few job input parameters as described below:

  • --mode key accepts either to-s3 or to-jdbc. The latter is used when the code is moving the metadata directly into a JDBC Hive Metastore. In the case of Cargotec, since we are moving the metadata to files on S3, the value for --mode will remain to-s3.
  • --output-path accepts an S3 location to which the exported metadata should be written. The code creates subdirectories corresponding to databases, tables, and partitions.
  • --database-names accepts a semicolon-separated list of databases on the source catalog that need to be replicated to the target.

Reading the catalog

The metadata about the database, tables, and partitions are read from the AWS Glue catalog.

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type='com.amazonaws.services.glue.connections.DataCatalogConnection',
    connection_options={
        'catalog.name': 'datacatalog',
        'catalog.database': database,
        'catalog.region': region
    })

The preceding code snippet reads the metadata into an AWS Glue DynamicFrame. The frame is then converted to a Spark DataFrame and filtered into individual DataFrames for databases, tables, and partitions. A schema is attached to each DataFrame using one of the following definitions:

DATACATALOG_DATABASE_SCHEMA = StructType([
    StructField('items', ArrayType(
        DATACATALOG_DATABASE_ITEM_SCHEMA, False),
        True),
    StructField('type', StringType(), False)
])
DATACATALOG_TABLE_SCHEMA = StructType([
    StructField('database', StringType(), False),
    StructField('type', StringType(), False),
    StructField('items', ArrayType(DATACATALOG_TABLE_ITEM_SCHEMA, False), True)
])
DATACATALOG_PARTITION_SCHEMA = StructType([
    StructField('database', StringType(), False),
    StructField('table', StringType(), False),
    StructField('items', ArrayType(DATACATALOG_PARTITION_ITEM_SCHEMA, False), True),
    StructField('type', StringType(), False)
])

For details on the individual item schema, refer to the schema definition on GitHub.

Persisting the metadata

After the metadata is converted to DataFrames with schemas attached, it is persisted to the S3 location specified by the --output-path parameter:

databases.write.format('json').mode('overwrite').save(output_path + 'databases')
tables.write.format('json').mode('overwrite').save(output_path + 'tables')
partitions.write.format('json').mode('overwrite').save(output_path + 'partitions')

Exploring the output

Navigate to the S3 bucket that contains the output location, and you should be able to see the output metadata in JSON format. An example export for a table would look like the following code snippet.

{
    "database": "default",
    "type": "table",
    "item": {
        "createTime": "1651241372000",
        "lastAccessTime": "0",
        "owner": "spark",
        "retention": 0,
        "name": "an_example_table",
        "tableType": "EXTERNAL_TABLE",
        "parameters": {
            "totalSize": "2734148",
            "EXTERNAL": "TRUE",
            "last_commit_time_sync": "20220429140907",
            "spark.sql.sources.schema.part.0": "{redacted_schema}",
            "numFiles": "1",
            "transient_lastDdlTime": "1651241371",
            "spark.sql.sources.schema.numParts": "1",
            "spark.sql.sources.provider": "hudi"
        },
        "partitionKeys": [],
        "storageDescriptor": {
            "inputFormat": "org.apache.hudi.hadoop.HoodieParquetInputFormat",
            "compressed": false,
            "storedAsSubDirectories": false,
            "location": "s3://redacted_bucket_name/table/an_example_table",
            "numberOfBuckets": -1,
            "outputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "bucketColumns": [],
            "columns": [{
                    "name": "_hoodie_commit_time",
                    "type": "string"
                },
                {
                    "name": "_hoodie_commit_seqno",
                    "type": "string"
                }
            ],
            "parameters": {},
            "serdeInfo": {
                "serializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                "parameters": {
                    "hoodie.query.as.ro.table": "false",
                    "path": "s3://redacted_bucket_name/table/an_example_table",
                    "serialization.format": "1"
                }
            },
            "skewedInfo": {
                "skewedColumnNames": [],
                "skewedColumnValueLocationMaps": {},
                "skewedColumnValues": []
            },
            "sortColumns": []
        }
    }
}

Once the export job is complete, the output S3 path will be pushed to an SNS topic. A Lambda function at the target account processes this message and invokes the import AWS Glue job by passing the S3 import location.
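
A minimal sketch of such a subscriber Lambda function is shown below. The SNS message format, key name, and importer job name are assumptions for illustration; Cargotec's actual function may differ.

import json

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # The SNS message carries the S3 location of the latest metadata export.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    export_path = message["export_path"]  # assumption: key name published by the exporter

    # Start the metadata importer job in this (target) account.
    response = glue.start_job_run(
        JobName="metadata-importer",  # assumption: importer job name
        Arguments={
            "--mode": "from-s3",
            "--database-input-path": export_path + "databases/",
            "--table-input-path": export_path + "tables/",
            "--partition-input-path": export_path + "partitions/",
        },
    )
    return {"JobRunId": response["JobRunId"]}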

Metadata importer

The import job runs on the target account. The code for the job is available on GitHub. As with the exporter, you may need to customize it to suit your specific requirements, but the code as-is should work for most scenarios.

Metadata importer inputs

The inputs to the application are provided as job parameters. Below is a list of parameters that are used for the import process:

  • --mode key accepts either from-s3 or from-jdbc. The latter is used when migration is from a JDBC source to the AWS Glue Data Catalog. At Cargotec, the metadata is already written to Amazon S3, and hence the value for this key is always set to from-s3.
  • --region key accepts a valid AWS Region for the AWS Glue Catalog. The target Region is specified using this key.
  • --database-input-path key accepts the path to the file containing the database metadata. This is the output of the metadata exporter job.
  • --table-input-path key accepts the path to the file containing the table metadata. This is the output of the metadata exporter job.
  • --partition-input-path key accepts the path to the file containing the partition metadata. This is the output of the metadata exporter job.

Reading the metadata

The metadata, as previously discussed, are files on Amazon S3. They are read into individual Spark DataFrames with their respective schema information:

databases = sql_context.read.json(path=db_input_dir, schema=METASTORE_DATABASE_SCHEMA)
tables = sql_context.read.json(path=tbl_input_dir, schema=METASTORE_TABLE_SCHEMA)
partitions = sql_context.read.json(path=parts_input_dir, schema=METASTORE_PARTITION_SCHEMA)

Loading the catalog

Once the Spark DataFrames are read, they are converted to AWS Glue DynamicFrames and then loaded into the catalog, as shown in the following snippet.

glue_context.write_dynamic_frame.from_options(
        frame=dyf_databases, 
        connection_type='catalog',
        connection_options={
               'catalog.name': datacatalog_name, 
               'catalog.region': region
         }
)
glue_context.write_dynamic_frame.from_options(
        frame=dyf_tables, 
        connection_type='catalog',
        connection_options={
                'catalog.name': datacatalog_name, 
                'catalog.region': region
        }
)
glue_context.write_dynamic_frame.from_options(
        frame=dyf_partitions, 
        connection_type='catalog',
        connection_options={
                 'catalog.name': datacatalog_name, 
                 'catalog.region': region
         }
)

Once the job concludes, you can query the target AWS Glue catalog to ensure the tables from the source have been synced with the destination. To keep things simple and easy to manage, instead of implementing a mechanism to identify tables that change over time, Cargotec updates the catalog information of all databases or tables that are configured in the export job.

Considerations

Though the setup works effectively for Cargotec’s current business requirements, there are a few drawbacks to this approach, which are highlighted below:

  1. The solution involves code. Customizations were made to the existing open-source utility to be able to publish an SNS notification once an export is complete and a Lambda function to trigger the import process.
  2. The export process on the source account is a scheduled job. Hence there is no real-time sync between the source and target accounts. This was not a requirement for Cargotec’s business process.
  3. For tables that don’t use Athena partition projection, query results may be outdated until the new partitions are added to the metastore through MSCK REPAIR TABLE, ALTER TABLE ADD PARTITION, AWS Glue crawler, and so on.
  4. The current approach requires syncing all the tables across the source and target. If the requirement is to capture only the ones that changed instead of a scheduled daily export, the design needs to change and could benefit from the Amazon EventBridge integration with AWS Glue. An example implementation of using AWS Glue APIs to identify changes is shown in Identify source schema changes using AWS Glue.

Conclusion

In this blog post, we have explored a solution for cross-account sharing of data and tables that makes it possible for Cargotec to create views that combine data from multiple AWS accounts. We’re excited to share Cargotec’s success and believe the post has provided you with valuable insights and inspiration for your own projects.

We encourage you to explore our range of services and see how they can help you achieve your goals. Lastly, for more data and analytics blogs, feel free to bookmark the AWS Blogs.


About the Authors

Sumesh M R is a Full Stack Machine Learning Architect at Cargotec. He has several years of software engineering and ML experience. Sumesh is an expert in Amazon SageMaker and other AWS ML and analytics services. He is passionate about data science and loves to explore the latest ML libraries and techniques. Before joining Cargotec, he worked as a Solution Architect at TCS. In his spare time, he loves to play cricket and badminton.

 Tero Karttunen is a Senior Cloud Architect at Knowit Finland. He advises clients on architecting and adopting Data Architectures that best serve their Data Analytics and Machine Learning needs. He has helped Cargotec in their data journey for more than two years. Outside of work, he enjoys running, winter sports, and role-playing games.

Arun A K is a Big Data Specialist Solutions Architect at AWS.  He works with customers to provide architectural guidance for running analytics solutions on AWS Glue, AWS Lake Formation, Amazon Athena, and Amazon EMR. In his free time, he likes to spend time with his friends and family.

AWS Glue Data Quality is Generally Available

Post Syndicated from Shiv Narayanan original https://aws.amazon.com/blogs/big-data/aws-glue-data-quality-is-generally-available/

We are excited to announce the General Availability of AWS Glue Data Quality.

Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. To make confident business decisions, the underlying data needs to be accurate and recent. Otherwise, data consumers lose trust in the data and make suboptimal or incorrect decisions. For example, medical researchers found that across 79,000 emergency department encounters of pediatric patients in a hospital, incorrect or missing patient weight measurements led to medication dosing errors in 34% of cases. A data quality check to identify missing patient weight measurements or a check to ensure patients’ weights are trending within certain thresholds would have alerted respective teams to identify these discrepancies.

For our customers, setting up these data quality checks is manual, time consuming, and error prone. It takes days for data engineers to identify and implement data quality rules. They have to gather detailed data statistics, such as minimums, maximums, averages, and correlations. They have to then review the data statistics to identify data quality rules, and write code to implement these checks in their data pipelines. Data engineers must then write code to monitor data pipelines, visualize quality scores, and alert them when anomalies occur. They have to repeat these processes across thousands of datasets and the hundreds of data pipelines populating them. Some customers adopt commercial data quality solutions; however, these solutions require time-consuming infrastructure management and are expensive. Our customers needed a simple, cost-effective, and automatic way to manage data quality.

In this post, we discuss the capabilities and features of AWS Glue Data Quality.

Capabilities of AWS Glue Data Quality

AWS Glue Data Quality accelerates your data quality journey with the following key capabilities:

  • Serverless – AWS Glue Data Quality is a feature of AWS Glue, which eliminates the need for infrastructure management, patching, and maintenance.
  • Reduced manual efforts with recommending data quality rules and out-of-the-box rules – AWS Glue Data Quality computes data statistics such as minimums, maximums, histograms, and correlations for datasets. It then uses these statistics to automatically recommend data quality rules that check for data freshness, accuracy, and integrity. This reduces manual data analysis and rule identification efforts from days to hours. You can then augment recommendations with out-of-the-box data quality rules. The following table lists the rules that are supported by AWS Glue Data Quality as of writing. For an up-to-date list, refer to Data Quality Definition Language (DQDL).
Rule Type – Description
AggregateMatch – Checks if two datasets match by comparing summary metrics like total sales amount. Useful for customers to compare if all data is ingested from source systems.
ColumnCorrelation – Checks how well two columns are correlated.
ColumnCount – Checks if any columns are dropped.
ColumnDataType – Checks if a column is compliant with a data type.
ColumnExists – Checks if columns exist in a dataset. This allows customers building self-service data platforms to ensure certain columns are made available.
ColumnLength – Checks if the length of data is consistent.
ColumnNamesMatchPattern – Checks if column names match defined patterns. Useful for governance teams to enforce column name consistency.
ColumnValues – Checks if data is consistent per defined values. This rule supports regular expressions.
Completeness – Checks for any blank or NULL values in data.
CustomSql – Lets customers implement almost any type of data quality check in SQL.
DataFreshness – Checks if data is fresh.
DatasetMatch – Compares two datasets and identifies if they are in sync.
DistinctValuesCount – Checks for duplicate values.
Entropy – Checks the entropy of the data.
IsComplete – Checks if 100% of the data is complete.
IsPrimaryKey – Checks if a column is a primary key (not NULL and unique).
IsUnique – Checks if 100% of the data is unique.
Mean – Checks if the mean matches the set threshold.
ReferentialIntegrity – Checks if two datasets have referential integrity.
RowCount – Checks if record counts match a threshold.
RowCountMatch – Checks if record counts between two datasets match.
StandardDeviation – Checks if the standard deviation matches the threshold.
SchemaMatch – Checks if the schemas of two datasets match.
Sum – Checks if the sum matches a set threshold.
Uniqueness – Checks if the uniqueness of a dataset matches a threshold.
UniqueValueRatio – Checks if the unique value ratio matches a threshold.
  • Embedded in customer workflow – AWS Glue Data Quality has to blend into customer workflows for it to be useful. Disjointed experiences create friction in getting started. You can access AWS Glue Data Quality from the AWS Glue Data Catalog, allowing data stewards to set up rules while they are using the Data Catalog. You can also access AWS Glue Data Quality from Glue Studio (AWS Glue’s visual authoring tool), Glue Studio notebooks (a notebook-based interface for coders to create data integration pipelines), and interactive sessions, an API where data engineers can submit jobs from their choice of code editor.
  • Pay-as-you-go and cost-effective – AWS Glue Data Quality is charged based on the compute used. This simple pricing model doesn’t lock you into annual licenses. AWS Glue ETL-based data quality checks can use Flex execution, which is 34% cheaper for non-SLA sensitive data quality checks. Additionally, AWS Glue Data Quality rules on data pipelines can help you save costs because you don’t have to waste compute resources on bad quality data when detected early. Also, when data quality checks are configured as part of data pipelines, you only incur an incremental cost because the data is already read and mostly in memory.
  • Built on open-source – AWS Glue Data Quality is built on Deequ, an open-source library used internally by Amazon to manage the quality of data lakes exceeding 60 PB. Deequ is optimized to run data quality rules in minimal passes over the data, which makes it efficient. The rules that are authored in AWS Glue Data Quality can be run in any environment that can run Deequ, allowing you to stay with an open-source solution.
  • Simplified rule authoring language – As part of AWS Glue Data Quality, we announced Data Quality Definition Language (DQDL). DQDL attempts to standardize data quality rules so that you can use the same rules across different databases and engines. DQDL is simple to author and read, and brings the benefits of code that developers value, such as version control and deployment. To demonstrate the simplicity of this language, the following example shows three rules that check that the record count is greater than 10, that VendorID doesn't have any empty values, and that VendorID falls within a defined set of values:
    Rules = [
        RowCount > 10,
        IsComplete "VendorID",
        ColumnValues "VendorID" in ["1", "2", "3", "4"]
    ]

General Availability features

AWS Glue Data Quality has several key enhancements from the preview version:

  • Error record identification – You need to know which records failed data quality checks. We have launched this capability in AWS Glue ETL, where the data quality transform now enriches the input dataset with new columns that identify which records failed data quality checks. This can help you quarantine bad data so that only good records flow into your data repositories. A sketch of this pattern appears after this list.
  • New rule types that validate data across multiple datasets – With new rule types like ReferentialIntegrity, DatasetMatch, RowCountMatch, and AggregateMatch, you can compare two datasets to ensure that data integrity is maintained. The SchemaMatch rule type ensures that the dataset accurately matches a set schema, preventing downstream errors that may be caused by schema changes.
  • Amazon EventBridge integration – Integration with Amazon EventBridge enables you to simplify how you set up alerts when quality rules fail. A one-time setup is now sufficient to alert data consumers about data quality failures.
  • AWS CloudFormation support – With support for AWS CloudFormation, AWS Glue Data Quality now enables you to easily deploy data quality rules in many environments.
  • Join support in CustomSQL rule type – You can now join datasets in CustomSQL rule types to write complex business rules.
  • New data source support – You can check data quality on open transactional formats such as Apache HUDI, Apache Iceberg, and Delta Lake. Additionally, you can set up data quality rules on Amazon Redshift and Amazon Relational Database Service (Amazon RDS) data sources cataloged in the AWS Glue Data Catalog.
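
To illustrate the error record identification capability, the following is a minimal PySpark sketch of an AWS Glue ETL job that evaluates a ruleset and quarantines failing rows. The database, table, bucket, and context names are placeholders, the transform call mirrors the code that AWS Glue Studio generates, and the output column names (such as DataQualityEvaluationResult) should be verified against your job's output.

from awsglue.context import GlueContext
from awsglue.transforms import SelectFromCollection
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the input dataset (placeholder Data Catalog database and table)
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_database", table_name="example_table"
)

ruleset = """
Rules = [
    RowCount > 10,
    IsComplete "VendorID"
]
"""

# Evaluate the ruleset; the transform returns rule-level and row-level outcomes
dq_results = EvaluateDataQuality().process_rows(
    frame=source,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "EvaluateDataQuality_node",
        "enableDataQualityResultsPublishing": True,
    },
)

# Row-level outcomes carry new columns that identify which records failed
row_outcomes = SelectFromCollection.apply(dfc=dq_results, key="rowLevelOutcomes").toDF()
failed_rows = row_outcomes.filter(row_outcomes["DataQualityEvaluationResult"] == "Failed")

# Quarantine the failing records (placeholder bucket) so only good data flows downstream
failed_rows.write.mode("overwrite").parquet("s3://example-bucket/quarantine/")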

Summary

AWS Glue Data Quality is now generally available. To help you get started, we have created a five-part blog series:

Get started today with AWS Glue Data Quality and tell us what you think.


About the authors

Shiv Narayanan is a Technical Product Manager for AWS Glue’s data management capabilities like data quality, sensitive data detection and streaming capabilities. Shiv has over 20 years of data management experience in consulting, business development and product management.

Tome Tanasovski is a Technical Manager at AWS, for a team that manages capabilities into Amazon’s big data platforms via AWS Glue. Prior to working at AWS, Tome was an executive for a market-leading global financial services firm in New York City where he helped run the Firm’s Artificial Intelligence & Machine Learning Center of Excellence. Prior to this role he spent nine years in the Firm focusing on automation, cloud, and distributed computing. Tome has a quarter-of-a-century worth of experience in technology in the Tri-state area across a wide variety of industries including big tech, finance, insurance, and media.

Brian Ross is a Senior Software Development Manager at AWS.  He has spent 24 years building software at scale and currently focuses on serverless data integration with AWS Glue. In his spare time, he studies ancient texts, cooks modern dishes and tries to get his kids to do both.

Alona Nadler is AWS Glue Head of Product and is responsible for AWS Glue Service. She has a long history of working in the enterprise software and data services spaces. When not working, Alona enjoys traveling and playing tennis.

Visualize data quality scores and metrics generated by AWS Glue Data Quality

Post Syndicated from Zack Zhou original https://aws.amazon.com/blogs/big-data/visualize-data-quality-scores-and-metrics-generated-by-aws-glue-data-quality/

AWS Glue Data Quality allows you to measure and monitor the quality of data in your data repositories. It’s important for business users to be able to see quality scores and metrics to make confident business decisions and debug data quality issues. AWS Glue Data Quality generates a substantial amount of operational runtime information during the evaluation of rulesets.

An operational scorecard is a mechanism used to evaluate and measure the quality of data processed and validated by AWS Glue Data Quality rulesets. It provides insights and metrics related to the performance and effectiveness of data quality processes.

In this post, we highlight the seamless integration of Amazon Athena and Amazon QuickSight, which enables the visualization of operational metrics for AWS Glue Data Quality rule evaluation in an efficient and effective manner.

This post is Part 5 of a five-post series to explain how to build dashboards to measure and monitor your data quality:

Solution overview

The solution allows you to build your AWS Glue Data Quality score and metrics dashboard using QuickSight in an easy and straightforward manner. The following architecture diagram shows an overview of the complete pipeline.

There are six main steps in the data pipeline:

  1. Amazon EventBridge triggers an AWS Lambda function when the event pattern for AWS Glue Data Quality matches the defined rule. (Refer to Set up alerts and orchestrate data quality rules with AWS Glue Data Quality)
  2. The Lambda function writes the AWS Glue Data Quality result to an Amazon Simple Storage Service (Amazon S3) bucket.
  3. An AWS Glue crawler crawls the results.
  4. The crawler builds a Data Catalog, so the data can be queried using Athena.
  5. We can analyze the data quality score and metrics using Athena SQL queries.
  6. We can query and submit the Athena data to QuickSight to create visuals for the dashboard.

In the following sections, we discuss these steps in more detail.

Prerequisites

To follow along with this post, complete the following prerequisites:

  1. Have an AWS Identity and Access Management (IAM) role with permissions to extract data from an S3 bucket and write to the AWS Glue Data Catalog.
  2. Similarly, have a Lambda function execution role with access to AWS Glue and  S3 buckets.
  3. Set up the Athena query result location. For more information, refer to Working with Query Results, Output Files, and Query History.
  4. Set up QuickSight permissions and enable Athena table and S3 bucket access.

Set up and deploy the Lambda pipeline

To test the solution, we can use the following AWS CloudFormation template. The CloudFormation template creates the EventBridge rule, Lambda function, and S3 bucket to store the data quality results.

If you deployed the CloudFormation template in the previous post, you don’t need to deploy it again in this step.

The following screenshot shows a line of code in which the Lambda function writes the results from AWS Glue Data Quality to an S3 bucket. As depicted, the data will be stored in JSON format and organized according to the time horizon, facilitating convenient access and analysis of the data over time.
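
As a rough, hypothetical sketch of such a handler, the following code writes the data quality result to Amazon S3 under a year/month/day key layout; the bucket name, key prefix, and event field handling are assumptions based on the description above, not the exact code deployed by the CloudFormation template.

import json
import os
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # The EventBridge event detail carries the AWS Glue Data Quality result payload
    detail = event.get("detail", {})

    # Organize the JSON output by year/month/day so it can be analyzed over time
    now = datetime.now(timezone.utc)
    bucket = os.environ.get("RESULTS_BUCKET", "example-data-quality-results")  # placeholder
    key = (
        f"gluedataqualitylogs/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{context.aws_request_id}.json"
    )

    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(detail).encode("utf-8"))
    return {"statusCode": 200, "s3Key": key}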

Set up the AWS Glue Data Catalog using a crawler

Complete the following steps to create an AWS Glue crawler and set up the Data Catalog:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. For Name, enter data-quality-result-crawler, then choose Next.
  4. Under Data sources, choose Add a data source.
  5. For Data source, choose S3.
  6. For S3 path, enter the S3 path to your data source. (s3://<AWS CloudFormation outputs key:DataQualityS3BucketNameOutputs>/gluedataqualitylogs/). Refer to Set up alerts and orchestrate data quality rules with AWS Glue Data Quality for details.
  7. Choose Add an S3 data source and choose Next.
  8. For Existing IAM role, choose your IAM role (GlueDataQualityLaunchBlogDemoRole-xxxx). Refer to Set up alerts and orchestrate data quality rules with AWS Glue Data Quality for details. Then choose Next.
  9. For Target database, choose Add database.
  10. For Database name, enter data-quality-result-database, then choose Create.
  11. For Table name prefix, enter dq_, then choose Next.
  12. Choose Create crawler.
  13. On the Crawlers page, select data-quality-result-crawler and choose Run.

When the crawler is complete, you can see the AWS Glue Data Catalog table definition.

After you create the table definition on the AWS Glue Data Catalog, you can use Athena to query the Data Catalog table.

Query the Data Catalog table using Athena

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 and the AWS Glue Data Catalog using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run on datasets at petabyte scale.

The purpose of this step is to understand our data quality statistics at the table level as well as at the ruleset level. Athena provides simple queries to assist you with this task. Use the queries in this section to analyze your data quality metrics and create an Athena view to use to build a QuickSight dashboard in the next step.

Query 1

The following is a simple SELECT query on the Data Catalog table:

SELECT * FROM "data-quality-result-database"."dq_gluedataqualitylogs" limit 10;

The following screenshot shows the output.

Before we run the second query, let’s check the schema for the table dq_gluedataqualitylogs.

The following screenshot shows the output.

The table shows that one of the columns, resultrun, is the array data type. In order to work with this column in QuickSight, we need to perform an additional step to flatten it into multiple rows. This is necessary because QuickSight doesn't support the array data type.

Query 2

Use the following query to review the data in the resultrun column:

SELECT resultrun FROM "data-quality-result-database"."dq_gluedataqualitylogs" limit 10;

The following screenshot shows the output.

Query 3

The following query flattens an array into multiple rows using CROSS JOIN in conjunction with the unnest operator and creates a view on the selected columns:

CREATE OR REPLACE VIEW data_quality_result_view AS
SELECT "databasename", "tablename",
    "ruleset_name", "runid", "resultid",
    "state", "numrulessucceeded",
    "numrulesfailed", "numrulesskipped",
    "score", "year", "month", "day",
    runs.name, runs.result,
    runs.evaluationmessage, runs.description
FROM "dq_gluedataqualitylogs"
CROSS JOIN UNNEST(resultrun) AS t(runs)

The following screenshot shows the output.

Verify the columns that were created using the unnest operator.

The following screenshot shows the output.

Query 4

Verify the Athena view created in the previous query:

SELECT * FROM data_quality_result_view LIMIT 10

The following screenshot shows the output.

Visualize the data with QuickSight

Now that you can query your data in Athena, you can use QuickSight to visualize the results. Complete the following steps:

  1. Sign in to the QuickSight console.
  2. In the upper right corner of the console, choose Admin/username, then choose Manage QuickSight.
  3. Choose Security and permissions.
  4. Under QuickSight access to AWS services, choose Add or remove.
  5. Choose Amazon Athena, then choose Next.
  6. Give QuickSight access to the S3 bucket where your data quality result is stored.

Create your datasets

Before you can analyze and visualize the data in QuickSight, you must create datasets for your Athena view (data_quality_result_view). Complete the following steps:

  1. On the Datasets page, choose New dataset, then choose Athena.
  2. Choose the AWS Glue database that you created earlier.
  3. Select Import to SPICE (alternatively, you can select Directly query your data).
  4. Choose Visualize.

Build your dashboard

Create your analysis with one donut chart, one pivot table, one vertical stacked bar, and one funnel chart that use the different fields in the dataset. QuickSight offers a wide range of charts and visuals to help you create your dashboard. For more information, refer to Visual types in Amazon QuickSight.

Clean up

To avoid incurring future charges, delete the resources created in this post.

Conclusion

In this post, we provide insights into running Athena queries and building customized dashboards in QuickSight to understand data quality metrics. This gives you a great starting point for using this solution with your datasets and applying business rules to build a complete data quality framework to monitor issues within your datasets.

To dive into the AWS Glue Data Quality APIs, refer to Data Quality API. To learn more about AWS Glue Data Quality, see the AWS Glue Data Quality Developer Guide. To learn more about QuickSight dashboards, refer to the Amazon QuickSight Developer Guide.


About the authors

Zack Zhou is a Software Development Engineer on the AWS Glue team.

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architecture on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Avik Bhattacharjee is a Senior Partner Solutions Architect at AWS. He works with customers to build IT strategy, making digital transformation through the cloud more accessible, focusing on big data and analytics and AI/ML.

Amit Kumar Panda is a Data Architect at AWS Professional Services who is passionate about helping customers build scalable data analytics solutions to enable making critical business decisions.

Set up alerts and orchestrate data quality rules with AWS Glue Data Quality

Post Syndicated from Avik Bhattacharjee original https://aws.amazon.com/blogs/big-data/set-up-alerts-and-orchestrate-data-quality-rules-with-aws-glue-data-quality/

Alerts and notifications play a crucial role in maintaining data quality because they facilitate prompt and efficient responses to any data quality issues that may arise within a dataset. By establishing and configuring alerts and notifications, you can actively monitor data quality and receive timely alerts when data quality issues are identified. This proactive approach helps mitigate the risk of making decisions based on inaccurate information. Furthermore, it allows for necessary actions to be taken, such as rectifying errors in the data source, refining data transformation processes, and updating data quality rules.

We are excited to announce that AWS Glue Data Quality is now generally available, offering built-in integration with Amazon EventBridge and AWS Step Functions to streamline event-driven data quality management. You can access this feature today in the available Regions. It simplifies your experience of monitoring and evaluating the quality of your data.

This post is Part 4 of a five-post series to explain how to set up alerts and orchestrate data quality rules with AWS Glue Data Quality:

Solution overview

In this post, we provide a comprehensive guide on enabling alerts and notifications using Amazon Simple Notification Service (Amazon SNS). We walk you through the step-by-step process of using EventBridge to establish rules that activate an AWS Lambda function when the data quality outcome aligns with the designated pattern. The Lambda function is responsible for converting the data quality metrics and dispatching them to the designated email addresses via Amazon SNS.

To expedite the implementation of the solution, we have prepared an AWS CloudFormation template for your convenience. AWS CloudFormation serves as a powerful management tool, enabling you to define and provision all necessary infrastructure resources within AWS using a unified and standardized language.

The solution aims to automate data quality evaluation for AWS Glue Data Catalog tables (data quality at rest) and allows you to configure email notifications when the AWS Glue Data Quality results become available.

The following architecture diagram provides an overview of the complete pipeline.

The data pipeline consists of the following key steps:

  1. The first step involves AWS Glue Data Quality evaluations that are automated using Step Functions. The workflow is designed to start the evaluations based on the rulesets defined on the dataset (or table). The workflow accepts input parameters provided by the user.
  2. An EventBridge rule receives an event notification from the AWS Glue Data Quality evaluations including the results. The rule evaluates the event payload based on the predefined rule and then triggers a Lambda function for notification.
  3. The Lambda function sends an SNS notification containing data quality statistics to the designated email address. Additionally, the function writes the customized result to the specified Amazon Simple Storage Service (Amazon S3) bucket, ensuring its persistence and accessibility for further analysis or processing.

The following sections discuss the setup for these steps in more detail.

Deploy resources with AWS CloudFormation

We create several resources with AWS CloudFormation, including a Lambda function, EventBridge rule, Step Functions state machine, and AWS Identity and Access Management (IAM) role. Complete the following steps:

  1. To launch the CloudFormation stack, choose Launch Stack:
  2. Provide your email address for EmailAddressAlertNotification, which will be registered as the target recipient for data quality notifications.
  3. Leave the other parameters at their default values and create the stack.

The stack takes about 4 minutes to complete.

  1. Record the outputs listed on the Outputs tab on the AWS CloudFormation console.
  2. Navigate to the S3 bucket created by the stack (DataQualityS3BucketNameStaging) and upload the yellow_tripdata_2022-01.parquet file.
  3. Check your email for a message with the subject “AWS Notification – Subscription Confirmation” and confirm your subscription.

Now that the CloudFormation stack is complete, let’s update the Lambda function code before running the AWS Glue Data Quality pipeline using Step Functions.

Update the Lambda function

This section explains the steps to update the Lambda function. We modify the ARN of Amazon SNS and the S3 output bucket name based on the resources created by AWS CloudFormation.

Complete the following steps:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose the function GlueDataQualityBlogAlertLambda-xxxx (created by the CloudFormation template in the previous step).
  3. Modify the values for sns_topic_arn and s3bucket with the corresponding values from the CloudFormation stack outputs for SNSTopicNameAlertNotification and DataQualityS3BucketNameOutputs, respectively.
  4. On the File menu, choose Save.
  5. Choose Deploy.
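
For reference, the following is a minimal, hypothetical sketch of how a function like this could use those two values; it is not the code deployed by the CloudFormation template, and the event field handling is an assumption.

import json

import boto3

# Values you update per the steps above (from the CloudFormation stack outputs)
sns_topic_arn = "arn:aws:sns:us-east-1:111122223333:GlueDataQualityAlerts"  # placeholder
s3bucket = "example-dataquality-results-bucket"  # placeholder

sns = boto3.client("sns")
s3 = boto3.client("s3")

def lambda_handler(event, context):
    # AWS Glue Data Quality result payload from EventBridge (field names assumed)
    detail = event.get("detail", {})
    message = json.dumps(detail, indent=2)

    # Send the data quality statistics to the subscribed email address
    sns.publish(
        TopicArn=sns_topic_arn,
        Subject="AWS Glue Data Quality evaluation results",
        Message=message,
    )

    # Persist the customized result for later analysis
    s3.put_object(
        Bucket=s3bucket,
        Key=f"defaultlogs/{context.aws_request_id}.json",
        Body=message.encode("utf-8"),
    )
    return {"statusCode": 200}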

Now that we’ve updated the Lambda function, let’s check the EventBridge rule created by the CloudFormation template.

Review and analyze the EventBridge rule

This section explains the significance of the EventBridge rule and how rules use event patterns to select events and send them to specific targets. In this section, we create a rule with an event pattern set as Data Quality Evaluations Results Available and configure the target as a Lambda function.

  1. On the EventBridge console, choose Rules in the navigation pane.
  2. Choose the rule GlueDataQualityBlogEventBridge-xxxx.

On the Event pattern tab, we can review the source event pattern. Event patterns are based on the structure and content of the events generated by various AWS services or custom applications.

  1. We set the source as aws-glue-dataquality with the event pattern detail type Data Quality Evaluations Results Available.

On the Targets tab, you can review the specific actions or services that will be triggered when an event matches a specified pattern.

  1. Here, we configure EventBridge to invoke a specific Lambda function when an event matches the defined pattern.

This allows you to run serverless functions in response to events.
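
A rule with this shape can also be defined programmatically. The following boto3 sketch is an approximation: the source and detail-type strings reflect our understanding of the events the service emits and should be verified, and the Lambda function ARN is a placeholder.

import json

import boto3

events = boto3.client("events")

# Event pattern matching AWS Glue Data Quality evaluation results
# (source and detail-type strings are assumptions; confirm against your events)
event_pattern = {
    "source": ["aws.glue-dataquality"],
    "detail-type": ["Data Quality Evaluations Results Available"],
}

events.put_rule(
    Name="glue-data-quality-results-available",
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

# Invoke the notification Lambda function when the pattern matches
# (you also need to grant EventBridge permission to invoke the function, omitted here)
events.put_targets(
    Rule="glue-data-quality-results-available",
    Targets=[
        {
            "Id": "dq-alert-lambda",
            "Arn": "arn:aws:lambda:us-east-1:111122223333:function:GlueDataQualityBlogAlertLambda",  # placeholder
        }
    ],
)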

Now that you understand the EventBridge rule, let’s review the AWS Glue Data Quality pipeline created by Step Functions.

Set up and deploy the Step Functions state machine

AWS CloudFormation created the StateMachineGlueDataQualityCustomBlog-xxxx state machine to orchestrate the evaluation of existing AWS Glue Data Quality rules, creation of custom rules if needed, and subsequent evaluation of the ruleset. Complete the following steps to configure and run the state machine:

  1. On the Step Functions console, choose State machines in the navigation pane.
  2. Open the state machine StateMachineGlueDataQualityCustomBlog-xxxx.
  3. Choose Edit.
  4. Modify row 80 with the IAM role ARN starting with GlueDataQualityBlogStepsFunctionRole-xxxx and choose Save.

Step Functions needs certain permissions (least privilege) to run the state machine and evaluate the AWS Glue Data Quality ruleset.

  1. Choose Start execution.
  2. Provide the following input:
    {
        "ruleset_name": "<AWS CloudFormation outputs key:GlueDataQualityCustomRulesetName>",
        "database_name": "<AWS CloudFormation outputs key:DataQualityDatabase>",
        "table_name": "<AWS CloudFormation outputs key:DataQualityTable>",
        "dq_output_location": "s3://<AWS CloudFormation outputs key:DataQualityS3BucketNameOutputs>/defaultlogs"
    }

This step assumes the existence of the ruleset and runs the workflow as depicted in the following screenshot. It runs the data quality ruleset evaluation and writes results to the S3 bucket.

If it doesn't find the ruleset name among the existing data quality rulesets, the state machine creates a custom ruleset for you and then performs the data quality ruleset evaluation.
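
As a rough, hypothetical illustration of the logic the state machine implements, the following Python sketch uses the AWS Glue data quality APIs directly; the example rules are placeholders, and the parameters mirror the input shown earlier.

import boto3

glue = boto3.client("glue")

def evaluate_ruleset(ruleset_name, database_name, table_name, role_arn, dq_output_location):
    # Check whether the ruleset already exists; create a custom one if it doesn't
    try:
        glue.get_data_quality_ruleset(Name=ruleset_name)
    except glue.exceptions.EntityNotFoundException:
        glue.create_data_quality_ruleset(
            Name=ruleset_name,
            Ruleset='Rules = [ RowCount > 0, IsComplete "VendorID" ]',  # example rules
            TargetTable={"DatabaseName": database_name, "TableName": table_name},
        )

    # Start the data quality ruleset evaluation and write results to Amazon S3
    run = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={"GlueTable": {"DatabaseName": database_name, "TableName": table_name}},
        Role=role_arn,
        RulesetNames=[ruleset_name],
        AdditionalRunOptions={
            "CloudWatchMetricsEnabled": True,
            "ResultsS3Prefix": dq_output_location,
        },
    )
    return run["RunId"]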


State machine results and run options

The Step Functions state machine has run the AWS Glue Data Quality evaluation. Now EventBridge matches the pattern Data Quality Evaluations Results Available and triggers the Lambda function. The Lambda function writes customized AWS Glue Data Quality metrics results to the S3 bucket and sends an email notification via Amazon SNS.

The following sample email provides operational metrics for the AWS Glue Data Quality ruleset evaluation. It provides details about the ruleset name, the number of rules passed or failed, and the score. This helps you visualize the results of each rule along with the evaluation message if a rule fails.

You have the flexibility to choose between two run modes for the Step Functions workflow:

  • The first option is on-demand mode, where you manually trigger the Step Functions workflow whenever you want to initiate the AWS Glue Data Quality evaluation.
  • Alternatively, you can schedule the entire Step Functions workflow using EventBridge. With EventBridge, you can define a schedule or specific triggers to automatically initiate the workflow at predetermined intervals or in response to specific events. This automated approach reduces the need for manual intervention and streamlines the data quality evaluation process. For more details, refer to Schedule a Serverless Workflow.
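
As an illustration of the scheduled option, the following boto3 sketch creates an EventBridge rule that starts the state machine once a day; the ARNs are placeholders, and the input payload mirrors the parameters shown earlier.

import json

import boto3

events = boto3.client("events")

STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:StateMachineGlueDataQualityCustomBlog"  # placeholder
EVENTS_ROLE_ARN = "arn:aws:iam::111122223333:role/EventBridgeStartStepFunctionsRole"  # placeholder: must allow states:StartExecution

# Run the data quality workflow once a day
events.put_rule(
    Name="daily-glue-data-quality-evaluation",
    ScheduleExpression="rate(1 day)",
    State="ENABLED",
)

events.put_targets(
    Rule="daily-glue-data-quality-evaluation",
    Targets=[
        {
            "Id": "glue-dq-state-machine",
            "Arn": STATE_MACHINE_ARN,
            "RoleArn": EVENTS_ROLE_ARN,
            "Input": json.dumps(
                {
                    "ruleset_name": "my_ruleset",
                    "database_name": "my_database",
                    "table_name": "my_table",
                    "dq_output_location": "s3://example-bucket/defaultlogs",
                }
            ),
        }
    ],
)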

Clean up

To avoid incurring future charges and to clean up unused roles and policies, delete the resources you created:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select your stack and delete it.

If you’re continuing to Part 5 in this series, you can skip this step.

Conclusion

In this post, we discussed three key steps that organizations can take to optimize data quality and reliability on AWS:

  • Create a CloudFormation template to ensure consistency and reproducibility in deploying AWS resources.
  • Integrate AWS Glue Data Quality ruleset evaluation and Lambda to automatically evaluate data quality and receive event-driven alerts and email notifications via Amazon SNS. This significantly enhances the accuracy and reliability of your data.
  • Use Step Functions to orchestrate AWS Glue Data Quality ruleset actions. You can create and evaluate custom and recommended rulesets, optimizing data quality and accuracy.

These steps form a comprehensive approach to data quality and reliability on AWS, helping organizations maintain high standards and achieve their goals.

To dive into the AWS Glue Data Quality APIs, refer to Data Quality APIs. To learn more about AWS Glue Data Quality, check out the AWS Glue Data Quality Developer Guide.

If you require any assistance in constructing this pipeline within the AWS Lake Formation environment or if you have any inquiries regarding this post, please inform us in the comments section or initiate a new thread on the Lake Formation forum.


About the authors

Avik Bhattacharjee is a Senior Partner Solution Architect at AWS. He works with customers to build IT strategy, making digital transformation through the cloud more accessible, focusing on big data and analytics and AI/ML.

Amit Kumar Panda is a Data Architect at AWS Professional Services who is passionate about helping customers build scalable data analytics solutions to enable making critical business decisions.

Neel Patel is a software engineer working within GlueML. He has contributed to the AWS Glue Data Quality feature and hopes it will expand the repertoire for all AWS CloudFormation users along with displaying the power and usability of AWS Glue as a whole.

Edward Cho is a Software Development Engineer at AWS Glue. He has contributed to the AWS Glue Data Quality feature as well as the underlying open-source project Deequ.

Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/set-up-advanced-rules-to-validate-quality-of-multiple-datasets-with-aws-glue-data-quality/

Data is the lifeblood of modern businesses. In today’s data-driven world, companies rely on data to make informed decisions, gain a competitive edge, and provide exceptional customer experiences. However, not all data is created equal. Poor-quality data can lead to incorrect insights, bad decisions, and lost opportunities.

AWS Glue Data Quality measures and monitors the quality of your dataset. It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines. Data quality at rest focuses on validating the data stored in data lakes, databases, or data warehouses. It ensures that the data meets specific quality standards before it is consumed. Data quality in ETL pipelines, on the other hand, ensures the quality of data as it moves through the ETL process. It helps identify data quality issues during the ETL pipeline, allowing for early detection and correction of problems and prevents the failure of the data pipeline because of data quality issues.

This is Part 3 of a five-post series on AWS Glue Data Quality. In this post, we demonstrate the advanced data quality checks that you can typically perform when bringing data from a database to an Amazon Simple Storage Service (Amazon S3) data lake. Check out the other posts in this series:

Use case overview

Let’s consider an example use case where we have a database named classicmodels that contains retail data for a car dealership. This example database includes sample data for various entities, such as Customers, Products, ProductLines, Orders, OrderDetails, Payments, Employees, and Offices. You can find more details about this example database in MySQL Sample Database.

In this scenario, we assume the role of a data engineer who is responsible for building a data pipeline. The primary objective is to extract data from a relational database, specifically an Amazon RDS for MySQL database, and store it in Amazon S3, which serves as a data lake. After the data is loaded into the data lake, the data engineer is also responsible for performing data quality checks to ensure that the data in the data lake maintains its quality. To achieve this, the data engineer uses the newly launched AWS Glue Data Quality evaluation feature.

The following diagram illustrates the entity relationship model that describes the relationships between different tables. In this post, we use the employees, customers, and products table.

Solution overview

This solution focuses on transferring data from an RDS for MySQL database to Amazon S3 and performing data quality checks using the AWS Glue ETL pipeline and AWS Glue Data Catalog. The workflow involves the following steps:

  1. Data is extracted from the RDS for MySQL database using AWS Glue ETL.
  2. The extracted data is stored in Amazon S3, which serves as the data lake.
  3. The Data Catalog and AWS Glue ETL pipeline are utilized to validate the successful completion of data ingestion by performing data quality checks on the data stored in Amazon S3.

The following diagram illustrates the solution architecture.

To implement the solution, we complete the following steps:

  1. Set up resources with AWS CloudFormation.
  2. Establish a connection to the RDS for MySQL instance from AWS Cloud9.
  3. Run an AWS Glue crawler on the RDS for MySQL database.
  4. Validate the Data Catalog.
  5. Run an AWS Glue ETL job to bring data from Amazon RDS for MySQL to Amazon S3.
  6. Evaluate the advanced data quality rules in the ETL job.
  7. Evaluate the advanced data quality rules in the Data Catalog.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs. The template generates the following resources:

  • An RDS for MySQL database instance (source)
  • An S3 bucket for the data lake (destination)
  • An AWS Glue ETL job to bring data from source to destination
  • An AWS Glue crawler to crawl the RDS for MySQL databases and create a centralized Data Catalog
  • AWS Identity and Access Management (IAM) users and policies
  • An AWS Cloud9 environment to connect to the RDS DB instance and create a sample dataset
  • An Amazon VPC, public subnet, two private subnets, internet gateway, NAT gateway, and route tables

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack:
    BDB-2063-launch-cloudformation-stack
  3. Choose Next.
  4. For DatabaseUserPassword, enter your preferred password.
  5. Choose Next.
  6. Scroll to the end and choose Next.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Submit.

This stack can take around 10 minutes to complete, after which you can view the deployed stack on the AWS CloudFormation console.

Establish a connection to the RDS for MySQL instance from AWS Cloud9

To connect to the RDS for MySQL instance, complete the following steps:

  1. On the AWS Cloud9 console, choose Open under Cloud9 IDE for your environment.
  2. Run the following command in the AWS Cloud9 terminal. Provide your values for the MySQL endpoint (located on the CloudFormation stack’s Outputs tab), database user name, and database user password:
    $ mysql --host=<MySQLEndpoint> --user=<DatabaseUserName> --password=<password>

  3. Download the SQL file.
  4. On the File menu, choose Upload from Local Files and upload the file to AWS Cloud9.
  5. Run the following SQL commands within the downloaded file:
    MySQL [(none)]> source mysqlsampledatabase.sql

  6. Retrieve a list of tables using the following SQL statement and make sure that eight tables are loaded successfully:
    use classicmodels;
    show tables;

Run an AWS Glue crawler on the RDS for MySQL database

To run your crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers under Data Catalog in the navigation pane.
  2. Locate and run the crawler dq-rds-crawler.

The crawler will take a few minutes to crawl all the tables from the classicmodels database.

Validate the AWS Glue Data Catalog

To validate the Data Catalog when the crawler is complete, complete the following steps:

  1. On the AWS Glue console, choose Databases under Data Catalog in the navigation pane.
  2. Choose the mysql_private_classicmodels database.

You will be able to see all the RDS tables available under mysql_private_classicmodels.

Run an AWS Glue ETL job to bring data from Amazon RDS for MySQL to Amazon S3

To run your ETL job, complete the following steps:

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. Select dq-rds-to-s3 from the job list and choose Run job.

When the job is complete, you will be able to see three new tables under mysql_s3_db. It may take a few minutes to complete.

Now let’s dive into evaluating the data quality rules.

Evaluate the advanced data quality rules in the ETL job

In this section, we evaluate the results of different data quality rules.

ReferentialIntegrity

Let’s start with referential integrity. The ReferentialIntegrity data quality ruleset is currently supported in ETL jobs. This feature ensures that the relationships between tables in a database are maintained. It checks if the foreign key relationships between tables are valid and consistent, helping to identify any referential integrity violations.

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. In AWS Glue Studio, select Visual with a blank canvas.
  3. Provide a name for your job; for example, RDS ReferentialIntegrity.
  4. Choose the plus sign in the AWS Glue Studio canvas and on the Data tab, choose AWS Glue Data Catalog.
  5. For Name, enter a name for your data source; for example, employees.
  6. For Database, choose mysql_private_classicmodels.
  7. For Table, choose mysql_classicmodels_employees.
  8. Choose the plus sign in the AWS Glue Studio canvas and on the Data tab, choose AWS Glue Data Catalog.
  9. For Name, enter a name for your data source; for example, customers.
  10. For Database, choose mysql_private_classicmodels.
  11. For Table, choose mysql_classicmodels_customers.
  12. Choose the plus sign in the AWS Glue Studio canvas and on the Transform tab, choose Evaluate Data Quality.
  13. For Node parents, choose employees and customers.
  14. For Aliases for referenced data source, select Primary source for employees; for customers, enter the alias customers.

All other datasets are used as references to ensure that the primary dataset has good-quality data.

  1. Search for ReferentialIntegrity under Rule types and choose the plus sign to add an example ReferentialIntegrity rule.
  2. Replace the rule with the following code and keep the remaining options as default:
    Rules = [
        ReferentialIntegrity "employeenumber" "customers.salesRepEmployeeNumber" between 0.6 and 0.7
    ]

  3. Under Data quality action, select Publish results to Amazon CloudWatch and select Fail job without loading target data.
  4. On the Job details tab, choose GlueServiceRole-for-gluedq-blog for IAM role and keep the remaining options as default.
  5. Choose Run and wait for the job to complete.

It will take a few minutes to complete.

  1. When the job is complete, navigate to the Data quality tab and locate the Data quality results section.

You can confirm if the job completed successfully and which data quality rules it passed. In this example, it indicates that 60–70% of EmployeeNumber from the employees table are present in the customers table.

You can identify which records failed the referential integrity using AWS Glue Studio. To learn more, refer to Getting started with AWS Glue Data Quality for ETL Pipelines.

Similarly, if you are checking if all the EmployeeNumber from the employees table are present in the customers table, you can pass the following rule:

Rules = [
    ReferentialIntegrity "employeenumber" "customers.salesRepEmployeeNumber" = 1
]

DatasetMatch

DatasetMatch compares two datasets to identify differences and similarities. You can use it to detect changes between datasets or to find duplicates, missing values, or inconsistencies across datasets.

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. In AWS Glue Studio, select Visual with a blank canvas.
  3. Provide a name for your job; for example, RDS DatasetMatch.
  4. Choose the plus sign in the AWS Glue Studio canvas and on the Data tab, choose AWS Glue Data Catalog.
  5. For Name, enter a name for your data source; for example, rds_employees_primary.
  6. For Database, choose mysql_private_classicmodels.
  7. For Table, choose mysql_classicmodels_employees.
  8. Choose the plus sign in the AWS Glue Studio canvas and on the Data tab, choose AWS Glue Data Catalog.
  9. For Name, enter a name for your data source; for example, s3_employees_reference.
  10. For Database, choose mysql_s3_db.
  11. For Table, choose s3_employees.
  12. Choose the plus sign in the AWS Glue Studio canvas and on the Transform tab, choose Evaluate Data Quality.
  13. For Node parents, choose rds_employees_primary and s3_employees_reference.
  14. For Aliases for referenced data source, select Primary source for rds_employees_primary and for s3_employees_reference, enter the alias reference.
  15. Replace the default example rules with the following code and keep the remaining options as default:
    Rules = [
        DatasetMatch "reference" "employeenumber,employeenumber" = 1
    ]

  16. On the Job details tab, choose GlueServiceRole-for-gluedq-blog for IAM role and keep the remaining options as default.
  17. Choose Run and wait for the job to complete.
  18. When the job is complete, navigate to the Data quality tab and locate the Data quality results section.

In this example, it indicates both datasets are identical.

AggregateMatch

AggregateMatch verifies the accuracy of aggregated data. It compares the aggregated values in a dataset against the expected results to identify any discrepancies, such as incorrect sums, averages, counts, or other aggregate calculations. This is a performant option to evaluate if two datasets match at an aggregate level. For this rule, we clone the previous job we created for DatasetMatch.

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. Select RDS DatasetMatch and on the Actions menu, choose Clone job.
  3. Change the job name to RDS AggregateMatch.
  4. Change the dataset rds_employees_primary to rds_products_primary and the table to mysql_classicmodels_products.
  5. Change the dataset s3_employees_reference to s3_products_reference and the table to s3_products.
  6. Choose Evaluate Data Quality, and under Node parents, choose rds_products_primary and s3_products_reference.
  7. Replace the rules with the following code:
    AggregateMatch "avg(MSRP)" "avg(reference.MSRP)" = 1

  8. Choose Run and wait for the job to complete.
  9. When the job is complete, navigate to the Data quality tab and locate the Data quality results section.

The results indicate that the avg(msrp) on both datasets is the same.

RowCountMatch

RowCountMatch checks the number of rows in a dataset and compares it to an expected count. It helps identify missing or extra rows in a dataset, ensuring data completeness. For this rule, we edit the job we created earlier for AggregateMatch.

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. Select RDS AggregateMatch and on the Actions menu, choose Edit job.
  3. Choose Evaluate Data Quality and choose the plus sign next to RowCountMatch.
  4. Keep the default data quality rules and choose Save:
    RowCountMatch "reference" = 1.0

  5. Choose Run and wait for the job to complete.
  6. When the job is complete, navigate to the Data quality tab and locate the Data quality results section.

It shows that the DQ RowCountMatch rule failed, indicating a mismatch between the row count of the source RDS table and the target S3 table. Further investigation reveals that the ETL job ran four times for the Products table, and the row counts didn’t match.

SchemaMatch

SchemaMatch validates the schema of two datasets matches. It checks if the actual data types match the expected data types and flags any inconsistencies, such as a numeric column containing non-numeric values. For this rule, we edit the job we used for AggregateMatch.

  1. On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
  2. Select RDS AggregateMatch and on the Actions menu, choose Edit job.
  3. Choose Evaluate Data Quality and choose the plus sign next to SchemaMatch.
  4. Update the default rules with the following code and save the job:
    SchemaMatch "reference" = 1.0

  5. Choose Run and wait for the job to complete.
  6. When the job is complete, navigate to the Data quality tab and locate the Data quality results section.

It should show a successful completion with a Rule passed status, indicating that the schemas of both datasets are identical.

Evaluate the advanced data quality rules in the Data Catalog

The AWS Glue Data Catalog also supports advanced data quality rules. For this post, we show one example of an aggregate match between Amazon S3 and Amazon RDS.

  1. On the AWS Glue console, choose Databases in the navigation pane.
  2. Choose the mysql_private_classicmodels database to view the three tables created under it.
  3. Choose the mysql_classicmodels_products table.
  4. On the Data quality tab, choose Create data quality rules.
  5. Search for AggregateMatch and choose the plus sign to view the default example rule.
  6. Add the following rules:
    Rules = [
        AggregateMatch "avg(msrp)" "avg(mysql_s3_db.s3_products.msrp)" >= 0.9,
        ReferentialIntegrity "productname,productcode" "mysql_s3_db.s3_products.{productname,productcode}" = 1
    ]

reference is the alias of the secondary dataset defined in the AWS Glue ETL job. For the Data Catalog, you can use <database_name>.<table_name>.<column_name> to reference secondary datasets.

  1. Choose Save ruleset and provide the name production_catalog_dq_check.
  2. Choose GlueServiceRole-for-gluedq-blog for IAM role and keep the remaining options as default.
  3. Choose Run and wait for the data quality check to complete.

When the job is complete, you can confirm that both data quality checks passed.

With these advanced data quality features of AWS Glue Data Quality, you can enhance the reliability, accuracy, and consistency of your data, leading to better insights and decision-making.

Clean up

To clean up your resources, complete the following steps:

  1. Delete the AWS Glue job.
  2. Delete the CloudFormation stack.

Conclusion

Data quality refers to the accuracy, completeness, consistency, timeliness, and validity of the information being collected, processed, and analyzed. High-quality data is essential for businesses to make informed decisions, gain valuable insights, and maintain their competitive advantage. As data complexity increases, advanced rules are critical to handle complex data quality challenges. The rules we demonstrated in this post can help you manage the quality of data that lives in disparate data sources, providing you the capabilities to reconcile them. Try them out and provide your feedback on what other use cases you need to solve!


About the authors

Navnit Shukla is AWS Specialist Solutions Architect in Analytics. He is passionate about helping customers uncover insights from their data. He builds solutions to help organizations make data-driven decisions.

Rahul Sharma is a Software Development Engineer at AWS Glue. He focuses on building distributed systems to support features in AWS Glue. He has a passion for helping customers build data management solutions on the AWS Cloud.

Edward Cho is a Software Development Engineer at AWS Glue. He has contributed to the AWS Glue Data Quality feature as well as the underlying open-source project Deequ.

Shriya Vanvari is a Software Developer Engineer in AWS Glue. She is passionate about learning how to build efficient and scalable systems to provide better experience for customers. Outside of work, she enjoys reading and chasing sunsets.

Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog

Post Syndicated from Stuti Deshpande original https://aws.amazon.com/blogs/big-data/getting-started-with-aws-glue-data-quality-from-the-aws-glue-data-catalog/

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores.

Hundreds of thousands of customers use data lakes for analytics and ML to make data-driven business decisions. Data consumers lose trust in data if it isn’t accurate and recent, making data quality essential for undertaking optimal and correct decisions.

Evaluation of the accuracy and freshness of data is a common task for engineers. Currently, various tools are available to evaluate data quality. However, these tools often require manual processes of data discovery and expertise in data engineering and coding.

AWS Glue Data Quality is a new feature of AWS Glue that measures and monitors the data quality of Amazon Simple Storage Service (Amazon S3)-based data lakes, data warehouses, and other data repositories. AWS Glue Data Quality can be accessed in the AWS Glue Data Catalog and in AWS Glue ETL jobs.

This is Part 1 of a five-part series of posts to explain how AWS Glue Data Quality works. Check out the next posts in the series:

In this post, we explore using the AWS Glue Data Quality feature by generating data quality recommendations and running data quality evaluations on your table in the Data Catalog. Then we demonstrate how to analyze your AWS Glue Data Quality run results through Amazon Athena.

Solution overview

We guide you through the following steps:

  1. Provision resources with AWS CloudFormation.
  2. Explore the generated recommendation rulesets and define rulesets to evaluate your table in the Data Catalog.
  3. Review the AWS Glue Data Quality recommendations.
  4. Analyze your AWS Glue Data Quality evaluation results with Athena.
  5. Operationalize the solution by setting up alerts and notifications using integration with Amazon EventBridge and Amazon Simple Notification Service (Amazon SNS).

For this post, we use the NYC Taxi dataset yellow_tripdata_2022-01.parquet.

Set up resources with AWS CloudFormation

The provided CloudFormation template creates the following resources for you:

  • The AWS Identity and Access Management (IAM) role required to run AWS Glue Data Quality evaluations
  • An S3 bucket to store the NYC Taxi dataset
  • An S3 bucket to store and analyze the results of AWS Glue Data Quality evaluations
  • An AWS Glue database and table created from the NYC Taxi dataset

Launch your CloudFormation stack

To create your resources for this use case, complete the following steps:

  1. Launch your CloudFormation stack in us-east-1:
    BDB-2063-launch-cloudformation-stack
  2. Under Parameters:
    • For Stack name, proceed with the default value myDQStack.
    • For DataQualityDatabase, proceed with the default value data_quality_catalog.
    • For DataQualityS3BucketName, provide a bucket name of your choice.
    • For DataQualityTable, proceed with the default value data_quality_tripdata_table.

  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.

After the stack is successfully created, you can see all the resources created on the Resources tab.

  1. Navigate to the S3 bucket created by the stack and upload the yellow_tripdata_2022-01.parquet file.

Explore recommendation rulesets and define rulesets to evaluate your table

In this section, we generate data quality rule recommendations from AWS Glue Data Quality. We use these recommendations to run a data quality task against our dataset to obtain an analysis of our data.

Complete the following steps:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Tables.
  2. Choose the data_quality_tripdata_table table created via the CloudFormation stack.
  3. Choose the Data quality tab.

In this section, you will find a video to get you started with AWS Glue Data Quality. It also lists features, pricing, and documentation.

  1. Choose Create data quality rules.

This is the ruleset console. You will find a Request data quality rule recommendations banner at the top. AWS Glue will scan your data and automatically generate rule recommendations.

  1. Choose Recommend rules.
  2. For IAM role, choose the IAM role created as part of the CloudFormation template (GlueDataQualityBlogRole).
  3. Optionally, you can filter your data on column values before it is read. This feature is available for Amazon S3-based data sources.
  4. For Requested number of workers, allocate the number of workers to run the recommendation task. For this post, we use the default value of 5.
  5. For Task timeout, set the runtime for this task. For this post, we use the default of 120 minutes.
  6. Choose Recommend rules.

The recommendation task will start instantly, and you will observe the status at the top change to Starting.

Next, we add some of these recommended rules into our ruleset.

  1. When you see the recommendation run as Completed, choose Insert rule recommendations to select the rules that are recommended for you.

Make sure to place the cursor inside the brackets Rules = [ ].

  1. Select the following rules:
    • ColumnValues “VendorID” <=6
    • Completeness “passenger_count”>=0.97
  2. Choose Add selected rules.

You can see that these rules were automatically added to the ruleset.

Understanding AWS Glue Data Quality recommendations

AWS Glue Data Quality recommendations are suggestions generated by the AWS Glue Data Quality service and are based on the shape of your data. These recommendations automatically take into account aspects like row counts, mean, standard deviation, and so on, and generate a set of rules for you to use as a starting point.

The dataset used here was the NYC Taxi dataset. Based on this dataset, its columns, and the values in those columns, AWS Glue Data Quality recommends a set of rules. In total, the recommendation service automatically took into consideration all the columns of the dataset, and recommended 55 rules.

Some of these rules are:

  • ColumnValues “VendorID” <=6 – The ColumnValues rule type runs an expression against the values in a column. This rule resolves to true if the column values satisfy the expression, in this case being less than or equal to 6.
  • Completeness “passenger_count”>=0.97 – The Completeness rule type checks the percentage of complete (non-null) values in a column against a given expression. In this case, the rule checks if more than 97% of the values in a column are complete.

In addition to adding auto-generated recommendation rules, we manually add some rules to the ruleset. AWS Glue Data Quality provides some out-of-the-box rule types to choose from. For this post, we manually add the IsComplete rule for VendorID.

  1. In the left pane, on the Rule types tab, search for the IsComplete rule type and choose the plus sign next to IsComplete to add this rule.
  2. For the value within the quotation marks, enter VendorID.

Alternatively, you could navigate to the Schema tab and add IsComplete to VendorID.

  1. Choose Save ruleset.

Next, we add a CustomSQL rule by selecting the rule type CustomSql. This rule validates that no fares are charged for a trip if there are no passengers, to identify potentially fraudulent transactions where fare_amount > 0 and passenger_count = 0. The rule is:

CustomSql "select count(*) from primary where passenger_count=0 and fare_amount > 0" = 0

There are two ways to provide the table name:

  • Either you can use the keyword “primary” for the table under consideration
  • You can use the full path such as database_name.table_name
  1. On the Rule types tab, choose the plus sign next to CustomSQL and enter the SQL statement.

The final ruleset looks like the following screenshot.
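At this point, the combined ruleset contains the two recommended rules, the IsComplete rule, and the CustomSql rule, and should look similar to the following (the exact order of the rules may differ in your console):

Rules = [
    ColumnValues "VendorID" <= 6,
    Completeness "passenger_count" >= 0.97,
    IsComplete "VendorID",
    CustomSql "select count(*) from primary where passenger_count=0 and fare_amount > 0" = 0
]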

  1. Choose Save ruleset.
  2. For Ruleset name, enter a name.
  3. Choose Save ruleset.
  4. On the ruleset page, choose Run.
  5. For IAM role, choose GlueDataQualityBlogRole.
  6. Select Publish run metrics to Amazon CloudWatch.
  7. For Data quality result location, enter the S3 bucket location for the data quality results, which was already created for you as part of the CloudFormation template (for this post, data-quality-tripdata-results).
  8. For Run frequency, choose On demand.
  9. Expand Additional configurations.
  10. For Requested number of workers, enter 5.
  11. Leave the remaining fields as is and choose Run.

The status changes to Starting.

  1. When it’s complete, choose the ruleset and choose Run history.
  2. Choose the run ID to find more about the run details.

Under Data quality result, you will also observe whether the result is shown as DQ passed or DQ failed.

In the Evaluation run details section, you will find all the details about the data quality task run and rules that passed or failed. You can either view these results by navigating to the S3 bucket or downloading the results. Observe that the data quality task failed because one of the rules failed.

In the first section, AWS Glue Data Quality recommended 55 rules based on the column values and the data within our NYC Taxi dataset. We selected a few of those rules into a ruleset and ran an AWS Glue Data Quality evaluation task using our ruleset against our dataset. In our results, we see the status of each rule within the run details of the data quality task.

You can also utilize the AWS Glue Data Quality APIs to carry out these steps.
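As a rough sketch, the console steps above map to API calls along the following lines; the IAM role ARN, database name, and ruleset name are placeholders for the values created earlier, so adjust them to your environment:

import boto3

glue = boto3.client("glue")

data_source = {
    "GlueTable": {
        "DatabaseName": "<your_database_name>",  # placeholder: the Data Catalog database created by the stack
        "TableName": "data_quality_tripdata_table",
    }
}

# Generate rule recommendations (the console's Recommend rules action)
recommendation_run = glue.start_data_quality_rule_recommendation_run(
    DataSource=data_source,
    Role="<GlueDataQualityBlogRole ARN>",        # placeholder
    NumberOfWorkers=5,
)

# Evaluate a saved ruleset against the table (the console's Run action)
evaluation_run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource=data_source,
    Role="<GlueDataQualityBlogRole ARN>",        # placeholder
    RulesetNames=["<your_ruleset_name>"],        # placeholder
    AdditionalRunOptions={
        "CloudWatchMetricsEnabled": True,
        "ResultsS3Prefix": "s3://data-quality-tripdata-results/",
    },
)

print(recommendation_run["RunId"], evaluation_run["RunId"])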

Analyze your AWS Glue Data Quality evaluation results with Athena

If you have multiple AWS Glue Data Quality evaluation results against a dataset, you might want to track the trends of the dataset’s quality over a period of time. To achieve this, we can export our AWS Glue Data Quality evaluation results to Amazon S3, and use Athena to run analytical queries against the exported results. You could further use the results in Amazon QuickSight to build dashboards to have a graphical representation of your data quality trends.

In Part 3 of this series, we show the steps needed to start tracking your dataset’s quality.

For our data quality runs that we set up in the previous sections, we set the Data quality results location parameter to the bucket location specified by the CloudFormation stack. After each successful run, you should see a single JSONL file exported to your selected S3 location, corresponding to that particular run.

Complete the following steps to analyze the data:

  1. On the Athena console, navigate to the query editor.
  2. Run the following CREATE TABLE statement (replace <my_table_name> with a relevant value of your choice and <GlueDataQualityResultsS3Bucket_from_cfn> with the S3 bucket name that stores the data quality results; the bucket name has the trailing keyword results, for example <given-name-results>. For this post, it is data-quality-tripdata-results):
    CREATE EXTERNAL TABLE `<my_table_name>`(
    `catalogid` string,
    `databasename` string,
    `tablename` string,
    `dqrunid` string,
    `evaluationstartedon` timestamp,
    `evaluationcompletedon` timestamp,
    `rule` string,
    `outcome` string,
    `failurereason` string,
    `evaluatedmetrics` string)
    PARTITIONED BY (
    `year` string,
    `month` string,
    `day` string)
    ROW FORMAT SERDE
    'org.openx.data.jsonserde.JsonSerDe'
    WITH SERDEPROPERTIES (
    'paths'='catalogId,databaseName,dqRunId,evaluatedMetrics,evaluationCompletedOn,evaluationStartedOn,failureReason,outcome,rule,tableName')
    STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
    's3://<GlueDataQualityResultsS3Bucket_from_cfn>/'
    TBLPROPERTIES (
    'classification'='json',
    'compressionType'='none',
    'typeOfData'='file');
    MSCK REPAIR TABLE `<my_table_name>`;

After you create the table, you should be able to run queries to analyze your data quality results.

For example, consider the following query, which shows the passed AWS Glue Data Quality evaluations against the table data_quality_tripdata_table within a certain time window. To bound the window, pick datetime values from the evaluationcompletedon column of the results table <my_table_name> that you created above and pass them to parse_datetime():

SELECT * from "<my_table_name>"
WHERE "outcome" = 'Passed'
AND "tablename" = 'data_quality_tripdata_table'
AND "evaluationcompletedon" between
parse_datetime('2023-05-12 00:00:00:000', 'yyyy-MM-dd HH:mm:ss:SSS') AND parse_datetime('2023-05-12 21:16:21:804', 'yyyy-MM-dd HH:mm:ss:SSS');

The output of the preceding query shows us details about all the runs with “outcome” = ‘Passed’ that ran against the NYC Taxi dataset table (“tablename” = ‘data_quality_tripdata_table’). The output also provides details about the rules passed and evaluated metrics.

As you can see, we are able to get detailed information about our AWS Glue Data Quality evaluations via the results uploaded to Amazon S3 and perform more detailed analysis.
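For example, a query along the following lines (a sketch; adjust the table name to your environment) uses the partition columns from the table definition to summarize how many rules passed and failed per day:

SELECT year, month, day,
    COUNT(*) AS evaluated_rules,
    SUM(CASE WHEN "outcome" = 'Passed' THEN 1 ELSE 0 END) AS passed_rules,
    SUM(CASE WHEN "outcome" = 'Failed' THEN 1 ELSE 0 END) AS failed_rules
FROM "<my_table_name>"
WHERE "tablename" = 'data_quality_tripdata_table'
GROUP BY year, month, day
ORDER BY year, month, day;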

Set up alerts and notifications using EventBridge and Amazon SNS

Alerts and notifications are important for data quality to enable timely and effective responses to data quality issues that arise in the dataset. By setting up alerts and notifications, you can proactively monitor the data quality and be alerted as soon as any data quality issues are detected. This reduces the risk of making decisions based on incorrect information.

AWS Glue Data Quality also offers integration with EventBridge for alerting and notification by triggering an AWS Lambda function that sends a customized SNS notification when the AWS Glue Data Quality ruleset evaluation is complete. Now you can receive event-driven alerts and email notifications via Amazon SNS. This integration significantly enhances the accuracy and reliability of data.
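As a minimal sketch, an EventBridge rule for these notifications could be created as follows; the event source and detail-type values shown are assumptions to verify against the AWS Glue Data Quality documentation, and the Lambda function ARN is a placeholder:

import json
import boto3

events = boto3.client("events")

# Rule that matches AWS Glue Data Quality evaluation results (values assumed; verify in the documentation)
events.put_rule(
    Name="glue-dq-results-rule",
    EventPattern=json.dumps({
        "source": ["aws.glue-dataquality"],
        "detail-type": ["Data Quality Evaluation Results Available"],
    }),
)

# Route matching events to a Lambda function that formats and publishes the SNS notification
events.put_targets(
    Rule="glue-dq-results-rule",
    Targets=[{"Id": "notify-lambda", "Arn": "<your Lambda function ARN>"}],  # placeholder ARN
)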

Clean up

To clean up your resources, complete the following steps:

  1. On the Athena console, delete the table created for data quality analysis.
  2. On the CloudWatch console, delete the alarms created.
  3. If you deployed the sample CloudFormation stack, delete the stack via the AWS CloudFormation console. You will need to empty the S3 bucket before you delete the bucket.
  4. If you enabled your AWS Glue Data Quality runs to output to Amazon S3, empty those buckets as well.

Conclusion

In this post, we talked about the ease and speed of incorporating data quality rules using AWS Glue Data Quality into your Data Catalog tables. We also talked about how to run recommendations and evaluate data quality against your tables. We then discussed analyzing the data quality results via Athena, and discussed integrations with EventBridge and Amazon SNS for alerts and notifications to get notified for data quality issues.

To dive into the AWS Glue Data Quality APIs, refer to Data Quality API documentation. To learn more about AWS Glue Data Quality, check out AWS Glue Data Quality.


About the authors

Stuti Deshpande is an Analytics Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in Big Data, ETL, and Analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.

Aniket Jiddigoudar is a Big Data Architect on the AWS Glue team. He works with customers to help improve their big data workloads. In his spare time, he enjoys trying out new food, playing video games, and kickboxing.

Joseph Barlan is a Frontend Engineer at AWS Glue. He has over 5 years of experience helping teams build reusable UI components and is passionate about frontend design systems. In his spare time, he enjoys pencil drawing and binge watching tv shows.

Jesus Max Hernandez is a Software Development Engineer at AWS Glue. He joined the team in August after graduating from The University of Texas at El Paso. Outside of work, you can find him practicing guitar or playing softball in Central Park.

Divya Gaitonde is a UX designer at AWS Glue. She has over 8 years of experience driving impact through data-driven products and seamless experiences. Outside of work, you can find her catching up on reading or people watching at a museum.

How SumUp made digital analytics more accessible using AWS Glue

Post Syndicated from Mira Daniels original https://aws.amazon.com/blogs/big-data/how-sumup-made-digital-analytics-more-accessible-using-aws-glue/

This is a guest blog post by Mira Daniels and Sean Whitfield from SumUp.

SumUp is a leading global financial technology company driven by the purpose of leveling the playing field for small businesses. Founded in 2012, SumUp is the financial partner for more than 4 million small merchants in over 35 markets worldwide, helping them start, run and grow their business. Through its Super App, SumUp provides merchants with a free business account and card, an online store, and an invoicing solution – as well as in-person and remote payments seamlessly integrated with SumUp’s card terminals and point-of-sale registers. For more information, please visit sumup.co.uk.

As most organizations that have turned to Google Analytics (GA) as a digital analytics solution mature, they discover a more pressing need to integrate this data silo with the rest of their organization’s data to enable better analytics and the resulting product development and fraud detection. Unless, of course, the rest of their data also resides in the Google Cloud. In this post we showcase how we used AWS Glue to move siloed digital analytics data, with inconsistent arrival times, to Amazon S3 (our Data Lake) and our central data warehouse (DWH), Snowflake. AWS Glue gave us a cost-efficient option to migrate the data, and we further optimized storage cost by pruning cold data. What was essential to the solution development was a good understanding of the nature of the data, its source (the export from GA), and the form and scope of the data useful to the data consumers.

Business context

At SumUp we use GA and Firebase as our digital analytics solutions and AWS as our main Cloud Provider. In order to mature our data marts, it became clear that we needed to provide Analysts and other data consumers with all tracked digital analytics data in our DWH as they depend on it for analyses, reporting, campaign evaluation, product development and A/B testing. We further use the Digital Analytics data for our reverse ETL pipelines that ingest merchant behavior data back into the Ad tools. As the SumUp merchants user journey only starts onsite (with a Sign up or product purchase), but extends to offline card reader transactions or usage of our products from within our app and web dashboard, it is important to combine Digital Analytics data with other (backend) data sources in our Data Lake or DWH. The Data Science teams also use this data for churn prediction and CLTV modeling.

Given that the only source to access all raw data is by exporting it to BigQuery (first), data accessibility becomes challenging if BigQuery isn’t your DWH solution. What we needed was a data pipeline from BigQuery to our Data Lake and the DWH. The pipeline further needed to run based on the trigger of new data arriving in BigQuery and could not run on a simple schedule as data would not arrive at the same time consistently.

We experimented with some no-code tools that allowed for the direct export of Google Analytics data (not available for Firebase) from BigQuery directly to Snowflake, but due to our growing data volume, this wasn’t financially scalable. Other no-code tools, even if they can move data to S3, did not meet our technical requirements. The solutions we experimented with did not give us the flexibility to monitor and scale resources per pipeline run and optimize the pipeline ourselves.

We had a solid business case to build our own internal digital analytics pipeline to not only reduce spending on our existing outsourced Google Analytics pipeline (that moved from BigQuery to Snowflake directly), but also to make GA and Firebase data available in both the Data Lake and DWH.

Technical challenges

Data source specifics: The data in BigQuery is the export of GA 360 data and Firebase Analytics data. It consists of full-day and intraday tables. BigQuery uses a columnar storage format that can efficiently query semi-structured data, in the case of GA and Firebase data as arrays of structs.

Full-day: Daily tables that do not change retroactively for GA data, but do for Firebase data.

Intraday: Until the daily (full-day) tables are created, the intraday tables are populated with new and corrected records.

Update interval: intraday is refreshed and overwritten at least 3x a day. When the full-day table is created, the intraday table from that day is deleted.

Since we started exporting GA tracking data to BigQuery in 2015, the amount of data tracked and stored has grown 70x (logical bytes) and is >3TB in total. Our solution needs to not only ingest new data but also backfill historical data from the last 7 years. Firebase data has been exported to BigQuery since 2019 and grew 10x in size (logical bytes) over time.

Our major challenge with ingesting Firebase data to the DWH was the combination of its size (historically >35TB) with the arrival of late data upstream. Since BigQuery processes hits with timestamps up to 72 hours earlier for Firebase data, historical data must be updated for each daily run of the pipeline for the last 3 days. Consequently, this greatly increases compute and storage resources required for the pipeline.

The intraday source data usually arrives every 2-4 hours so real-time downstream pipelines are currently not needed with our GA and Firebase data export (into BigQuery) setup.

Querying the dataset (containing nested and repeated fields) in Snowflake presented an accessibility problem for our users, as it required unfriendly verbose query syntax and greatly strained our compute resources. We used AWS Glue to run a UDF to transform this nested data into a Snowflake object (key-value) data structure that is both more user friendly and requires less compute resources to access.
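As an illustration only, a transformation of this kind can be sketched in PySpark as follows; the column names (event_params and its nested value fields) follow the Firebase export schema and are assumptions here, df is the DataFrame read from BigQuery, and our production UDF differs in detail:

from pyspark.sql import functions as F

# Collapse the array<struct<key, value>> column into a simple key-value map
# that lands in Snowflake as an OBJECT and is easier to query
df = df.withColumn(
    "event_params_map",
    F.expr(
        "map_from_entries(transform(event_params, p -> struct("
        "p.key as key, "
        "coalesce(p.value.string_value, cast(p.value.int_value as string)) as value)))"
    ),
)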

Regarding version control of the Glue script, our team wanted to contain all of the pipeline logic in the same repository for simplicity and ease of local development. Since we used MWAA for orchestrating our platform pipelines, we wanted our DAG code to be close to the Glue Script. This required us to add an initial first step in our pipeline to push the script to the respective bucket that AWS Glue is synced with.

Our ELT design pattern required that we overwrite/update data stored in S3 before loading it into Snowflake, which required us to use a Spark DataFrame.

In order to save on Snowflake storage costs, we decided to only keep hot data in a materialised table and have access to colder historical data through external tables.

Solution overview

We chose AWS Glue for the first step in the pipeline (moving data to S3) as it nicely integrates into our serverless AWS infrastructure and Pyspark makes it very flexible to script transformation steps and add partitions to the data storage in S3. It already provided a connector to BigQuery that was easy to configure and the complete Glue job was nicely abstracting the underlying system infrastructure.

We used the existing Google BigQuery Connector for Glue following parts of this blog post.

Ingestion

The code for the ingestion pipeline was set up in a way that it is easily extendable and reusable by splitting jobs/tasks in different methods and classes building a framework for all (future) ingestions from BigQuery to Snowflake.

Initialise Glue Job: First, we push the Glue script to S3, as the boto3 client expects it to be there for job execution. We then check if the Glue job already exists. If it doesn’t exist, it’s created; otherwise, it’s updated with the most recent job parameters. We made sure to add appropriate tags for cost monitoring.
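A simplified sketch of this create-or-update step (the job name, role, script location, and connection name are placeholders, not our production values):

import boto3

glue = boto3.client("glue")

job_name = "bq-to-s3-ingestion"                                  # placeholder
job_settings = {
    "Role": "<glue-job-role>",                                   # placeholder
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://<scripts-bucket>/bq_to_s3.py",   # the script pushed in the previous step
        "PythonVersion": "3",
    },
    "Connections": {"Connections": ["<bigquery-connection>"]},   # placeholder
    "GlueVersion": "3.0",
    "WorkerType": "G.1X",
    "NumberOfWorkers": 10,
}

try:
    glue.get_job(JobName=job_name)
    glue.update_job(JobName=job_name, JobUpdate=job_settings)    # job exists: refresh its parameters
except glue.exceptions.EntityNotFoundException:
    glue.create_job(Name=job_name, Tags={"team": "data-platform"}, **job_settings)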

Pull From Google Cloud Platform: Based on the data source, GA or Firebase, we pick dynamic (or hard-coded, for backfilling) days to be pulled from the source. We then check if those selected dates exist as sharded tables in BigQuery; if the data is late, we cannot pull it and will retry a few minutes later (using a generous retry schedule). If the dates we chose to ingest can be found in BigQuery, we run the Glue job.
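The availability check can be sketched as follows, assuming the google-cloud-bigquery client and the GA export’s sharded table naming (ga_sessions_YYYYMMDD); the project ID is a placeholder:

from google.cloud import bigquery
from google.api_core.exceptions import NotFound

client = bigquery.Client(project="<gcp-project>")  # placeholder project

def daily_table_exists(dataset: str, date_suffix: str) -> bool:
    """Return True if the sharded daily export table is already present in BigQuery."""
    try:
        client.get_table(f"<gcp-project>.{dataset}.ga_sessions_{date_suffix}")
        return True
    except NotFound:
        return False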

Dump to S3: Some transformation steps are carried out within the Glue job before the data is moved into S3. It is saved in day-partitioned folders and repartitioned into 300-500MB files for better table query performance in Snowflake.
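A sketch of that write step (the bucket, prefix, and partition column name are placeholders, and df is the DataFrame produced by the Glue job):

# Repartition to roughly 300-500 MB output files and write day-partitioned Parquet to S3
estimated_size_mb = 8000                       # illustrative: size of the day's data in MB
num_output_files = max(1, estimated_size_mb // 400)

(
    df.repartition(num_output_files)
      .write.mode("overwrite")
      .partitionBy("event_date")               # placeholder partition column
      .parquet("s3://<data-lake-bucket>/firebase/")
)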

Data Catalog: We also wanted to automate a Glue Crawler to have metadata in a Data Catalog and be able to explore our files in S3 with Athena. With the help of the boto3 library, our pipeline contains a step which runs the crawler at the end of the pipeline. In order to simplify schema evolution, we moved the Data Catalog creation out of the Glue script.
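That step boils down to a call like the following (the crawler name is a placeholder):

import boto3

glue = boto3.client("glue")
# Refresh the Data Catalog once the new partitions have landed in S3
glue.start_crawler(Name="<digital-analytics-crawler>")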

DWH: The solution loading the data to Snowflake consists of tasks that are scheduled to access storage integrations with S3 and materialise the data in Snowflake tables. The jobs are batch jobs; we haven’t tested any streaming solutions, as the data arrives batched in BigQuery. After loading new data into Snowflake, we use a table pruner function that deletes the respective days that fall outside of the number of days of retention we defined for the table. This way we make sure to provide the data timeframe used by our stakeholders and avoid unnecessary storage costs.

In addition to the materialised tables in Snowflake we use External tables to make more historical data available (at slower query speed), these External tables are refreshed at the end of the data pipeline in Airflow using the Snowflake operator.

Development and Testing: Before the Glue job script was integrated into the task orchestration process and code, we tested the functionality of the job within Glue Studio using the Spark script editor and Jupyter notebook. It was great for quick iteration over a new feature. We used boto3 to access AWS infrastructure services and create, update, and run the Glue job.

Data Quality is ensured on the one hand by deduplicating Firebase events and adding a new hash key within the Glue script, and on the other hand by comparing total row counts per day between Snowflake and BigQuery (the comparison is currently done in Tableau). We also get alerted if our Airflow pipeline or the Snowflake tasks fail.

In the future we will add Data Quality checks with DBT.

Conclusion

As SumUp expands its product suite through innovation and acquisition, so does its infrastructure. AWS Glue has proven to be a scalable solution to ingest and transform siloed data in different clouds using marketplace connectors and the flexibility provided by PySpark.

What surprised us was how easily Glue can be customized (from no code to highly customised script and execution setup) and fit our volatile source data (size and shape). In the future we can think of further customization of the Glue script in terms of transformation of the data (more useful unnesting) as well as Glue job resource optimization.

The consumers of Digital Analytics data appreciate being able to make use of the full source data schema in both Data Lake and DWH. The transparency around the process being managed by the central Data Platform team facilitates trust and reliance on the data. Through pipeline abstraction, we’re now able to provide data products in high demand to all our data consumers.

The pipeline framework we built can easily be extended if needed as its components were built separately and are only forming the final pipeline during the last step, the orchestration.

What we plan to test and optimize in the future is loading historical data in automated batch jobs which is dependent on both API limits on the source side and compute resource orchestration on Glue’s side (we did not test automated batching in the past and were manually chunking data into jobs for backfilling). Additionally, we will incorporate these pipeline features into our main ingestion framework, which would allow our stakeholders to define digital analytics pipelines themselves in a self-service manner.

If your organization faces similar challenges with digital analytics data silos, we recommend developing a proof of concept to migrate your data to S3 using the Google BigQuery Connector for Glue and this blog post. Experiment with the Glue job settings and PySpark script options to find the best match for the size and latency requirements of your data. For migrating to a DWH like Snowflake, leverage COPY INTO statements, as they are a cheaper alternative to Snowpipes for this volume of data. Once you have prototyped, develop a proper, well-tested solution with Amazon Managed Workflows for Apache Airflow (MWAA) that includes version control for the infrastructure and the pipeline logic.


About the Authors

Mira Daniels (Data Engineer in Data Platform team), recently moved from Data Analytics to Data Engineering to make quality data more easily accessible for data consumers. She has been focusing on Digital Analytics and marketing data in the past.

Sean Whitfield (Senior Data Engineer in Data Platform team), a data enthusiast with a life science background who pursued his passion for data analysis into the realm of IT. His expertise lies in building robust data engineering and self-service tools. He also has a fervor for sharing his knowledge with others and mentoring aspiring data professionals.

Advanced patterns with AWS SDK for pandas on AWS Glue for Ray

Post Syndicated from Abdel Jaidi original https://aws.amazon.com/blogs/big-data/advanced-patterns-with-aws-sdk-for-pandas-on-aws-glue-for-ray/

AWS SDK for pandas is a popular Python library among data scientists, data engineers, and developers. It simplifies interaction between AWS data and analytics services and pandas DataFrames. It allows easy integration and data movement between 22 types of data stores, including Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and Amazon OpenSearch Service.

In the previous post, we discussed how you can use AWS SDK for pandas to scale your workloads on AWS Glue for Ray. We explained how using both Ray and Modin within the library enabled us to distribute workloads across a compute cluster. To illustrate these capabilities, we explored examples of writing Parquet files to Amazon S3 at scale and querying data in parallel with Athena.

In this post, we show some more advanced ways to use this library on AWS Glue for Ray. We cover features and APIs from AWS services such as S3 Select, Amazon DynamoDB, and Amazon Timestream.

Solution overview

The Ray and Modin frameworks allow scaling of pandas workloads easily. You can write code on your laptop that uses the SDK for pandas to get data from an AWS data or analytics service to a pandas DataFrame, transform it using pandas, and then write it back to the AWS service. By using the distributed version of the SDK for pandas and replacing pandas with Modin, exactly the same code will scale on a Ray runtime—all logic about task coordination and distribution is hidden. Taking advantage of these abstractions, the AWS SDK for pandas team has made considerable use of Ray primitives to distribute some of the existing APIs (for the full list, see Supported APIs).
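For instance, the following minimal sketch (the bucket paths and column name are placeholders) uses ordinary pandas-style code, yet when it runs on AWS Glue for Ray the DataFrames returned by the library are Modin DataFrames and the work is spread across the cluster:

import awswrangler as wr
import modin.pandas as pd  # drop-in replacement for pandas on a Ray runtime

# Read a dataset from S3; on Glue for Ray this returns a distributed Modin DataFrame
df = wr.s3.read_parquet(path="s3://<your-bucket>/<input-prefix>/", dataset=True)

# Ordinary pandas syntax; the operations are distributed across the Ray workers
df = df[df["<some_column>"].notna()]
df["load_ts"] = pd.Timestamp.now()

# Write the result back to S3 as a Parquet dataset
wr.s3.to_parquet(df=df, path="s3://<your-bucket>/<output-prefix>/", dataset=True)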

In this post, we show how to use some of these APIs in an AWS Glue for Ray job, namely querying with S3 Select, writing to and reading from a DynamoDB table, and writing to a Timestream table. Because AWS Glue for Ray is a fully managed environment, it’s by far the easiest way to run jobs because you don’t need to worry about cluster management. If you want to create your own cluster on Amazon Elastic Compute Cloud (Amazon EC2), refer to Distributing Calls on Ray Remote Cluster.

Configure solution resources

We use an AWS CloudFormation stack to provision the solution resources. Complete the following steps:

  1. Choose Launch stack to provision the stack in your AWS account:

Launch CloudFormation Stack

This takes about 2 minutes to complete. On successful deployment, the CloudFormation stack shows the status as CREATE_COMPLETE.

CloudFormation CREATE_COMPLETE

  1. Navigate to AWS Glue Studio to find an AWS Glue job named AdvancedGlueRayJob.

Glue for Ray Job Script

  1. On the Job details tab, scroll down and choose Advanced Properties.

Under Job Parameters, AWS SDK for pandas is specified as an additional Python module to install, along with Modin as an extra dependency.

Glue for Ray Job Details

  1. To run the job, choose Run and navigate to the Runs tab to monitor the job’s progress.

Glue for Ray Job Runs

Import the library

To import the library, use the following code:

import awswrangler as wr

AWS SDK for pandas detects if the runtime supports Ray, and automatically initializes a cluster with the default parameters. Advanced users can override this process by starting the Ray runtime before the import command.
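For example, a sketch of that override (the parameters shown are illustrative, not required):

import ray

# Attach to (or configure) the Ray runtime explicitly before importing the library
ray.init(address="auto", ignore_reinit_error=True)

import awswrangler as wr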

Scale S3 Select workflows

S3 Select allows you to use SQL statements to query and filter S3 objects, including compressed files. This can be particularly useful if you have large files of several TBs and want to extract some information. Because the workload is delegated to Amazon S3, you don’t have to download and filter objects on the client side, leading to lower latency, lower cost, and higher performance.

With AWS SDK for pandas, these calls to S3 Select can be distributed across Ray workers in the cluster. In the following example, we query Amazon reviews data in Parquet format, filtering for reviews with 5-star ratings in the Mobile_Electronics partition. star_rating is a column in the Parquet data itself, while the partition is a directory.

# Filter for 5-star reviews with S3 Select within a partition
df_select = wr.s3.select_query(
    sql="SELECT * FROM s3object s where s.\"star_rating\" >= 5",
    path="s3://amazon-reviews-pds/parquet/product_category=Mobile_Electronics/",
    input_serialization="Parquet",
    input_serialization_params={},
    scan_range_chunk_size=1024*1024*16,
)

scan_range_chunk_size is an important parameter to calibrate when using S3 Select. It specifies the range of bytes to query the S3 object, thereby determining the amount of work delegated to each worker. For this example, it’s set to 16 MB, meaning the work of scanning the object is parallelized into separate S3 Select requests each 16 MB in size. A higher value equates to larger chunks per worker but fewer workers, and vice versa.

The results are returned in a Modin DataFrame, which is a drop-in replacement for pandas. It exposes the same APIs but enables you to use all the workers in the cluster. The data in the Modin DataFrame is distributed along with all the operations among the workers.

Scale DynamoDB workflows

DynamoDB is a scalable NoSQL database service that provides high-performance, low-latency, and managed storage.

AWS SDK for pandas uses Ray to scale DynamoDB workflows, allowing parallel data retrieval and insertion operations. The wr.dynamodb.read_items function retrieves data from DynamoDB in parallel across multiple workers, and the results are returned as a Modin DataFrame. Similarly, data insertion into DynamoDB can be parallelized using the wr.dynamodb.put_df function.

For example, the following code inserts the Amazon Reviews DataFrame obtained from S3 Select into a DynamoDB table and then reads it back:

# Write Modin DataFrame to DynamoDB
wr.dynamodb.put_df(
    df=df_select,
    table_name=dynamodb_table_name,
    use_threads=4,
)
# Read data back from DynamoDB to Modin
df_dynamodb = wr.dynamodb.read_items(
    table_name=dynamodb_table_name,
    allow_full_scan=True,
)

DynamoDB calls are subject to AWS service quotas. The concurrency can be limited using the use_threads parameter.

Scale Timestream workflows

Timestream is a fast, scalable, fully managed, purpose-built time series database that makes it easy to store and analyze trillions of time series data points per day. With AWS SDK for pandas, you can distribute Timestream write operations across multiple workers in your cluster.

Data can be written to Timestream using the wr.timestream.write function, which parallelizes the data insertion process for improved performance.

In this example, we use sample data from Amazon S3 loaded into a Modin DataFrame. Familiar pandas commands such as selecting columns or resetting the index are applied at scale with Modin:

from datetime import datetime

# Select columns
df_timestream = df_timestream.loc[:, ["region", "az", "hostname", "measure_kind", "measure", "time"]]
# Overwrite the time column
df_timestream["time"] = datetime.now()
# Reset the index
df_timestream.reset_index(inplace=True, drop=False)
# Filter a measure
df_timestream = df_timestream[df_timestream.measure_kind == "cpu_utilization"]

The Timestream write operation is parallelized across blocks in your dataset. If the blocks are too big, you can use Ray to repartition the dataset and increase the throughput, because each block will be handled by a separate thread:

import ray  # needed for the Ray Dataset repartition API below

# Repartition the data into 100 blocks
df_timestream = ray.data.from_modin(df_timestream).repartition(100).to_modin()

We are now ready to insert the data into Timestream, and a final query confirms the number of rows in the table:

# Write data to Timestream
rejected_records = wr.timestream.write(
    df=df_timestream,
    database=timestream_database_name,
    table=timestream_table_name,
    time_col="time",
    measure_col="measure",
    dimensions_cols=["index", "region", "az", "hostname"],
)

# Query
df = wr.timestream.query(f'SELECT COUNT(*) AS counter FROM "{timestream_database_name}"."{timestream_table_name}"')

Clean up

To prevent unwanted charges to your AWS account, we recommend deleting the AWS resources that you used in this post:

  1. On the Amazon S3 console, empty the data from the S3 bucket with prefix glue-ray-blog-script.

S3 Bucket

  1. On the AWS CloudFormation console, delete the AdvancedSDKPandasOnGlueRay stack.

All resources will be automatically deleted with it.

Conclusion

In this post, we showcased some more advanced patterns to run your workloads using AWS SDK for pandas. In particular, these examples demonstrated how Ray is used within the library to distribute operations for several other AWS services, not just Amazon S3. When used in combination with AWS Glue for Ray, this gives you access to a fully managed environment to run at scale. We hope this solution can help with migrating your existing pandas jobs to achieve higher performance and speedups across multiple data stores on AWS.


About the Authors

Abdel JaidiAbdel Jaidi is a Senior Cloud Engineer for AWS Professional Services. He works on open-source projects focused on AWS Data & Analytics services. In his spare time, he enjoys playing tennis and hiking.

Anton KukushkinAnton Kukushkin is a Data Engineer for AWS Professional Services based in London, UK. In his spare time, he enjoys playing musical instruments.

Leon LuttenbergerLeon Luttenberger is a Data Engineer for AWS Professional Services based in Austin, Texas. He works on AWS open-source solutions that help our customers analyze their data at scale. In his spare time, he enjoys reading and traveling.

Migrate from Google BigQuery to Amazon Redshift using AWS Glue and Custom Auto Loader Framework

Post Syndicated from Tahir Aziz original https://aws.amazon.com/blogs/big-data/migrate-from-google-bigquery-to-amazon-redshift-using-aws-glue-and-custom-auto-loader-framework/

Amazon Redshift is a widely used, fully managed, petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytic workloads. Customers are looking for tools that make it easier to migrate from other data warehouses, such as Google BigQuery, to Amazon Redshift to take advantage of the service price-performance, ease of use, security, and reliability.

In this post, we show you how to use AWS native services to accelerate your migration from Google BigQuery to Amazon Redshift. We use AWS Glue, a fully managed, serverless, ETL (extract, transform, and load) service, and the Google BigQuery Connector for AWS Glue (for more information, refer to Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom connectors). We also add automation and flexibility to simplify migration of multiple tables to Amazon Redshift using the Custom Auto Loader Framework.

Solution overview

The solution provides a scalable and managed data migration workflow to migrate data from Google BigQuery to Amazon Simple Storage Service (Amazon S3), and then from Amazon S3 to Amazon Redshift. This pre-built solution scales to load data in parallel using input parameters.

The following architecture diagram shows how the solution works. It starts with setting up the migration configuration to connect to Google BigQuery, then converts the database schemas, and finally migrates the data to Amazon Redshift.

Architecture diagram of the solution

The workflow contains the following steps:

  1. A configuration file is uploaded to an S3 bucket you have chosen for this solution. This JSON file contains the migration metadata, namely the following:
    • A list of Google BigQuery projects and datasets.
    • A list of all tables to be migrated for each project and dataset pair.
  2. An Amazon EventBridge rule triggers an AWS Step Functions state machine to start migrating the tables.
  3. The Step Functions state machine iterates on the tables to be migrated and runs an AWS Glue Python shell job to extract the metadata from Google BigQuery and store it in an Amazon DynamoDB table used for tracking the tables’ migration status.
  4. The state machine iterates on the metadata from this DynamoDB table to run the table migration in parallel, based on the maximum number of migration jobs without incurring limits or quotas on Google BigQuery. It performs the following steps:
    • Runs the AWS Glue migration job for each table in parallel.
    • Tracks the run status in the DynamoDB table.
    • After the tables have been migrated, checks for errors and exits.
  5. The data exported from Google BigQuery is saved to Amazon S3. We use Amazon S3 (even though AWS Glue jobs can write directly to Amazon Redshift tables) for a few specific reasons:
    • We can decouple the data migration and the data load steps.
    • It offers more control on the load steps, with the ability to reload the data or pause the process.
    • It provides fine-grained monitoring of the Amazon Redshift load status.
  6. The Custom Auto Loader Framework automatically creates schemas and tables in the target database and continuously loads data from Amazon S3 to Amazon Redshift.

A few additional points to note:

  • If you have already created the target schema and tables in the Amazon Redshift database, you can configure the Custom Auto Loader Framework to not automatically detect and convert the schema.
  • If you want more control over converting the Google BigQuery schema, you can use the AWS Schema Conversion Tool (AWS SCT). For more information, refer to Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion tool (SCT).
  • As of this writing, neither the AWS SCT nor the Custom Auto Loader Framework support the conversion of nested data types (record, array, and struct). Amazon Redshift supports semistructured data using the SUPER data type, so if your table uses such complex data types, then you need to create the target tables manually.

To deploy the solution, there are two main steps:

  1. Deploy the solution stack using AWS CloudFormation.
  2. Deploy and configure Custom Auto Loader Framework to load files from Amazon S3 to Amazon Redshift.

Prerequisites

Before getting started, make sure you have the following:

  1. Create the migration configuration JSON file described in the solution overview. In this example, we named the file bq-mig-config.json.
  2. Configure your Google account.
  3. Create an IAM role for AWS Glue (and note down the name of the IAM role).
  4. Subscribe to and activate the Google BigQuery Connector for AWS Glue.

Deploy the solution using AWS CloudFormation

To deploy the solution stack using AWS CloudFormation, complete the following steps:

  1. Choose Launch Stack:

This template provisions the AWS resources in the us-east-1 Region. If you want to deploy to a different Region, download the template bigquery-cft.yaml and launch it manually: on the AWS CloudFormation console, choose Create stack with new resources and upload the template file you downloaded.

The list of provisioned resources is as follows:

    • An EventBridge rule to start the Step Functions state machine on the upload of the configuration file.
    • A Step Functions state machine that runs the migration logic. The following diagram illustrates the state machine.
      Diagram representing the state machine deployed by the solution stack.
    • An AWS Glue Python shell job used to extract the metadata from Google BigQuery. The metadata will be stored in an DynamoDB table, with a calculated attribute to prioritize the migration job. By default, the connector creates one partition per 400 MB in the table being read (before filtering). As of this writing, the Google BigQuery Storage API has a maximum quota for parallel read streams, so we set the limit for worker nodes for tables larger than 400 GB. We also calculate the max number of jobs that can run in parallel based on those values.
    • An AWS Glue ETL job used to extract the data from each Google BigQuery table and save it in Amazon S3 in Parquet format.
    • A DynamoDB table (bq_to_s3_tracking) used to store the metadata for each table to be migrated (size of the table, S3 path used to store the migrated data, and the number of workers needed to migrate the table).
    • A DynamoDB table (bq_to_s3_maxstreams) used to store the maximum number of streams per state machine run. This helps us minimize job failures due to limits or quotas. Use the CloudFormation template to customize the name of the DynamoDB table. The prefix for the DynamoDB table is bq_to_s3.
    • The IAM roles needed by the state machine and AWS Glue jobs.
  1. Choose Next.

Screen capture showing the AWS CloudFormation Create stack page.

  1. For Stack name, enter a name.
  2. For Parameters, enter the following parameters, then choose Create.
    • InputBucketName (S3 bucket name) – The S3 bucket where the AWS Glue job stores the migrated data. The data is actually stored in a folder named s3-redshift-loader-source, which is used by the Custom Auto Loader Framework.
    • InputConnectionName (AWS Glue connection name; the default is glue-bq-connector-24) – The name of the AWS Glue connection that is created using the Google BigQuery connector.
    • InputDynamoDBTablePrefix (DynamoDB table name prefix; the default is bq_to_s3) – The prefix that will be used when naming the two DynamoDB tables created by the solution.
    • InputGlueETLJob (AWS Glue ETL job name; the default is bq-migration-ETL) – The name you want to give to the AWS Glue ETL job. The actual script is saved in the S3 path specified in the parameter InputGlueS3Path.
    • InputGlueMetaJob (AWS Glue Python shell job name; the default is bq-get-metadata) – The name you want to give to the AWS Glue Python shell job. The actual script is saved in the S3 path specified in the parameter InputGlueS3Path.
    • InputGlueS3Path (S3 path; the default is s3://aws-glue-scripts-${AWS::Account}-${AWS::Region}/admin/) – This is the S3 path in which the stack will copy the scripts for the AWS Glue jobs. Remember to replace ${AWS::Account} with the actual AWS account ID and ${AWS::Region} with the Region you plan to use, or provide your own bucket and prefix in a complete path.
    • InputMaxParallelism (number of parallel migration jobs to run; the default is 30) – The maximum number of tables you want to migrate concurrently.
    • InputBQSecret (AWS Secrets Manager secret name) – The name of the AWS Secrets Manager secret in which you stored the Google BigQuery credential.
    • InputBQProjectName (Google BigQuery project name) – The name of your project in Google BigQuery in which you want to store temporary tables; you will need write permissions on the project.
    • StateMachineName (Step Functions state machine name; the default is bq-to-s3-migration-state-machine) – The name of the Step Functions state machine.
    • SourceS3BucketName (S3 bucket name; the default is aws-blogs-artifacts-public) – The S3 bucket where the artifacts for this post are stored. Do not change the default.

Deploy and configure the Custom Auto Loader Framework to load files from Amazon S3 to Amazon Redshift

The Custom Auto Loader Framework utility makes data ingestion to Amazon Redshift simpler and automatically loads data files from Amazon S3 to Amazon Redshift. The files are mapped to the respective tables by simply dropping files into preconfigured locations on Amazon S3. For more details about the architecture and internal workflow, refer to Custom Auto Loader Framework.

To set up the Custom Auto Loader Framework, complete the following steps:

  1. Choose Launch Stack to deploy the CloudFormation stack in the us-east-1 Region:

  1. On the AWS CloudFormation console, choose Next.
  2. Provide the following parameters to help ensure the successful creation of resources. Make sure you have collected these values beforehand.
    • CopyCommandSchedule (for example, cron(0/5 * ? * * *)) – The EventBridge rules KickoffFileProcessingSchedule and QueueRSProcessingSchedule are triggered based on this schedule. The default is 5 minutes.
    • DatabaseName (for example, dev) – The Amazon Redshift database name.
    • DatabaseSchemaName (for example, public) – The Amazon Redshift schema name.
    • DatabaseUserName (for example, demo) – The Amazon Redshift user name who has access to run COPY commands on the Amazon Redshift database and schema.
    • RedshiftClusterIdentifier (for example, democluster) – The Amazon Redshift cluster name.
    • RedshiftIAMRoleARN (for example, arn:aws:iam::7000000000:role/RedshiftDemoRole) – The Amazon Redshift cluster attached role, which has access to the S3 bucket. This role is used in COPY commands.
    • SourceS3Bucket (your bucket name) – The S3 bucket where the data is located. Use the same bucket you used to store the migrated data, as indicated in the previous stack.
    • CopyCommandOptions (for example, delimiter '|' gzip) – The additional COPY command data format parameters, for example: delimiter '|' dateformat 'auto' TIMEFORMAT 'auto'
    • InitiateSchemaDetection (Yes) – The setting to dynamically detect the schema prior to file upload.

The following screenshot shows an example of our parameters.

Screen capture showing the stack details page with the input parameters filled with example values.

  1. Choose Create.
  2. Monitor the progress of the Stack creation and wait until it is complete.
  3. To verify the Custom Auto Loader Framework configuration, log in to the Amazon S3 console and navigate to the S3 bucket you provided as a value to the SourceS3Bucket parameter.

You should see that a new directory called s3-redshift-loader-source has been created.

Screen capture of the Amazon S3 console showing the folder you should be able to see in your S3 bucket.

Test the solution

To test the solution, complete the following steps:

  1. Create the configuration file based on the prerequisites. You can also download the demo file.
  2. To set up the S3 bucket, on the Amazon S3 console, navigate to the folder bq-mig-config in the bucket you provided in the stack.
  3. Upload the config file into it.
  4. To enable EventBridge notifications to the bucket, open the bucket on the console and on the Properties tab, locate Event notifications.
  5. In the Amazon EventBridge section, choose Edit.
  6. Select On, then choose Save changes.

  1. On the AWS Step Functions console, monitor the run of the state machine.
  2. Monitor the status of the loads in Amazon Redshift. For instructions, refer to Viewing Current Loads.
  3. Open the Amazon Redshift Query Editor V2 and query your data.

Pricing considerations

You might have egress charges for migrating data out of Google BigQuery into Amazon S3. Review and calculate the cost for moving your data on your Google cloud billing console. As of this writing, AWS Glue 3.0 or later charges $0.44 per DPU-hour, billed per second, with a 1-minute minimum for Spark ETL jobs. For more information, see AWS Glue Pricing. With auto scaling enabled, AWS Glue automatically adds and removes workers from the cluster depending on the parallelism at each stage or microbatch of the job run.
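For example, under these assumptions, a migration job that runs on 10 DPUs for 15 minutes would cost approximately 10 DPUs × 0.25 hours × $0.44 per DPU-hour, or about $1.10, excluding any Google egress and Amazon S3 storage charges.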

Clean up

To avoid incurring future charges, clean up your resources:

  1. Delete the CloudFormation solution stack.
  2. Delete the CloudFormation Custom Auto Loader Framework stack.

Conclusion

In this post, we demonstrated how to build a scalable and automated data pipeline to migrate your data from Google BigQuery to Amazon Redshift. We also highlighted how the Custom Auto Loader Framework can automate the schema detection, create tables for your S3 files, and continuously load the files into your Amazon Redshift warehouse. With this approach, you can automate the migration of entire projects (even multiple projects at a time) in Google BigQuery to Amazon Redshift. This helps improve data migration times into Amazon Redshift significantly through the automatic table migration parallelization.

The auto-copy feature in Amazon Redshift simplifies automatic data loading from Amazon S3 with a simple SQL command. Users can easily automate data ingestion from Amazon S3 to Amazon Redshift using the Amazon Redshift auto-copy preview feature.

For more information about the performance of the Google BigQuery Connector for AWS Glue, refer to Migrate terabytes of data quickly from Google Cloud to Amazon S3 with AWS Glue Connector for Google BigQuery and learn how to migrate a large amount of data (1.9 TB) into Amazon S3 quickly (about 8 minutes).

To learn more about AWS Glue ETL jobs, see Simplify data pipelines with AWS Glue automatic code generation and workflows and Making ETL easier with AWS Glue Studio.


About the Authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Fabrizio Napolitano is a Principal Specialist Solutions Architect for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.

Manjula Nagineni is a Senior Solutions Architect with AWS based in New York. She works with major financial service institutions, architecting and modernizing their large-scale applications while adopting AWS Cloud services. She is passionate about designing big data workloads cloud-natively. She has over 20 years of IT experience in software development, analytics, and architecture across multiple domains such as finance, retail, and telecom.

Sohaib Katariwala is an Analytics Specialist Solutions Architect at AWS. He has over 12 years of experience helping organizations derive insights from their data.

Managing data confidentiality for Scope 3 emissions using AWS Clean Rooms

Post Syndicated from Sundeep Ramachandran original https://aws.amazon.com/blogs/architecture/managing-data-confidentiality-for-scope-3-emissions-using-aws-clean-rooms/

Scope 3 emissions are indirect greenhouse gas emissions that are a result of a company’s activities, but occur outside the company’s direct control or ownership. Measuring these emissions requires collecting data from a wide range of external sources, like raw material suppliers, transportation providers, and other third parties. One of the main challenges with Scope 3 data collection is ensuring data confidentiality when sharing proprietary information between third-party suppliers. Organizations are hesitant to share information that could potentially be used by competitors. This can make it difficult for companies to accurately measure and report on their Scope 3 emissions, which limits their ability to manage climate-related impacts and risks.

In this blog, we show how to use AWS Clean Rooms to share Scope 3 emissions data between a reporting company and two of their value chain partners (a raw material purchased goods supplier and a transportation provider). Data confidentiality requirements are specified by each organization before participating in the AWS Clean Rooms data collaboration (see Figure 1).

Data confidentiality requirements of reporting company and value chain partners

Figure 1. Data confidentiality requirements of reporting company and value chain partners

Each account has confidential data described as follows:

  • Column 1 lists the raw material Region of origin. This is business confidential information for the supplier.
  • Column 2 lists the emission factors at the raw material level. This is sensitive information for the supplier.
  • Column 3 lists the mode of transportation. This is business confidential information for the transportation provider.
  • Column 4 lists the emissions in transporting individual items. This is sensitive information for the transportation provider.
  • Rows in column 5 list the product recipe at the ingredient level. This is trade secret information for the reporting company.

Overview of solution

In this architecture, AWS Clean Rooms is used to analyze and collaborate on emission datasets without sharing, moving, or revealing underlying data to collaborators (shown in Figure 2).

Architecture for AWS Clean Rooms Scope 3 collaboration

Figure 2. Architecture for AWS Clean Rooms Scope 3 collaboration

Three AWS accounts are used to demonstrate this approach. The Reporting Account creates a collaboration in AWS Clean Rooms and invites the Purchased Goods Account and Transportation Account to join as members. All accounts can protect their underlying data with privacy-enhancing controls to contribute data directly from Amazon Simple Storage Service (S3) using AWS Glue tables.

The Purchased Goods Account includes users who can update the purchased goods bucket. Similarly, the Transportation Account has users who can update the transportation bucket. The Reporting Account can run SQL queries on the configured tables. AWS Clean Rooms only returns results complying with the analysis rules set by all participating accounts.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Although Amazon S3 and AWS Clean Rooms are free-tier eligible, a low fee applies to AWS Glue. Clean-up actions are provided later in this blog post to minimize costs.

Configuration

We configured the S3 buckets for each AWS account as follows:

  • Reporting Account: reportingcompany.csv
  • Purchased Goods Account: purchasedgoods.csv
  • Transportation Account: transportation.csv

Create an AWS Glue Data Catalog for each S3 data source following the method in the Glue Data Catalog Developer Guide. The AWS Glue tables should match the schema detailed previously in Figure 1, for each respective account (see Figure 3).

Configured AWS Glue table for ‘Purchased Goods’

Figure 3. Configured AWS Glue table for ‘Purchased Goods’

Data consumers can be configured to ingest, analyze, and visualize queries (refer back to Figure 2). We will tag the Reporting Account Glue Database as “reporting-db” and the Glue Table as “reporting.” Likewise, the Purchased Goods Account will have “purchase-db” and “purchase” tags.

Security

Additional actions are recommended to secure each account in a production environment, such as configuring encryption, AWS Identity and Access Management (IAM) roles, and Amazon CloudWatch monitoring. Review the Further Reading section at the end of this post.

Walkthrough

This walkthrough consists of four steps:

  1. The Reporting Account creates the AWS Clean Rooms collaboration and invites the Purchased Goods Account and Transportation Account to share data.
  2. The Purchased Goods Account and Transportation Account accepts this invitation.
  3. Rules are applied for each collaboration account restricting how data is shared between AWS Clean Rooms collaboration accounts.
  4. The SQL query is created and run in the Reporting Account.

1. Create the AWS Clean Rooms collaboration in the Reporting Account

(The steps covered in this section require you to be logged into the Reporting Account.)

  • Navigate to the AWS Clean Rooms console and click Create collaboration.
  • In the Details section, type “Scope 3 Clean Room Collaboration” in the Name field.
  • Scroll to the Member 1 section. Enter “Reporting Account” in the Member display name field.
  • In Member 2 section, enter “Purchased Goods Account” for your first collaboration member name, with their account number in the Member AWS account ID box.
  • Click Add another member and add “Transportation Account” as the third collaborator with their AWS account number.
  • Choose the “Reporting Account” as the Member who can query and receive results in the Member abilities section. Click Next.
  • Select Yes, join by creating membership now. Click Next.
  • Verify the collaboration settings on the Review and Create page, then select Create and join collaboration and create membership.

Both accounts will then receive an invitation to accept the collaboration (see Figure 4). The console reveals each member status as “Invited” until accepted. Next, we will show how the invited members apply query restrictions on their data.

New collaboration created in AWS Clean Rooms

Figure 4. New collaboration created in AWS Clean Rooms

2. Accept invitations and configure table collaboration rules

Steps in this section are applied to the Purchased Goods Account and Transportation Account following collaboration environment setup. For brevity, we will demonstrate steps using the Purchased Goods Account. Differences for the Transportation Account are noted.

  • Log in to the AWS account owning the Purchased Goods Account and accept the collaboration invitation.
  • Open the AWS Clean Rooms console and select Collaborations on the left-hand navigation pane, then click Available to join.
  • You will see an invitation from the Scope 3 Clean Room Collaboration. Click on Scope 3 Clean Room Collaboration and then Create membership.
  • Select Tables, then Associate table. Click Configure new table.

The next action is to associate the Glue table created from the purchasedgoods.csv file. This sequence restricts access to the origin_region column (transportation_mode for the Transportation Account table) in the collaboration.

  • In the Scope 3 Clean Room Collaboration, select Configured tables in the left-hand pane, then Configure new table. Select the AWS Glue table associated with purchasedgoods.csv (shown in Figure 5).
  • Select the AWS Glue Database (purchase-db) and AWS Glue Table (purchase).
  • Verify the correct table section by toggling View schema from the AWS Glue slider bar.
  • In the Columns allowed in collaboration section, select all fields except for origin_region. This action prevents the origin_region column from being accessed and viewed in the collaboration.
  • Complete this step by selecting Configure new table.
Purchased Goods account table configuration

Figure 5. Purchased Goods account table configuration

  • Select Configure analysis rule (see Figure 6).
  • Select Aggregation type then Next.
  • Select SUM as the Aggregate function and s3_upstream_purchased_good for the column.
  • Under Join controls, select Specify Join column. Select “item” from the list of options. This permits SQL join queries to execute on the “item” column. Click Next.
Table rules for the Purchased Goods account

Figure 6. Table rules for the Purchased Goods account

  • The next page specifies the minimum number of unique rows to aggregate for the “join” command. Select “item” for Column name and “2” for the Minimum number of distinct values. Click Next.
  • To confirm the table configuration query rules, click Configure analysis rule.
  • The final step is to click Associate to collaboration and select Scope 3 Clean Room Collaboration in the pulldown menu. Select Associate table after page refresh.

The procedure in this section is repeated for the Transportation Account, with the following exceptions:

  1. The columns shared in this collaboration are item, s3_upstream_transportation, and unit.
  2. The Aggregation function is a SUM applied on the s3_upstream_transportation column.
  3. The item column has an Aggregation constraint minimum of two distinct values.

3. Configure table collaboration rules inside the Reporting Account

At this stage, member account tables are created and shared in the collaboration. The next step is to configure the Reporting Account tables in the Reporting Account’s AWS account.

  • Navigate to AWS Clean Rooms. Select Configured tables, then Configure new table.
  • Select the Glue database and table associated with the file reportingcompany.csv.
  • Under Columns allowed in collaboration, select All columns, then Configure new table.
  • Configure collaboration rules by clicking Configure analysis rule using the Guided workflow.
  • Select Aggregation type, then Next.
  • Select SUM as the Aggregate function and ingredient for the column (see Figure 7).
  • Under Specify join columns, select ingredient. This permits SQL join queries to run only on the ingredient column.
  • In the Dimension controls, select product. This option permits grouping by product name in the SQL query. Select Next.
  • Select None in the Scalar functions section. Click Next. Read more about scalar functions in the AWS Clean Rooms User Guide.
Table rules for the Reporting account

Figure 7. Table rules for the Reporting account

  • On the next page, select ingredient for Column name and 2 for the Minimum number of distinct values. Click Next. To confirm query control submission, select Configure analysis rule on the next page.
  • Validate the setting in the Review and Configure window, then select Next.
  • Inside the Configured tables tab, select Associate to collaboration. Assign the table to the Scope 3 Clean Rooms Collaboration.
  • Select the Scope 3 Clean Room Collaboration in the dropdown menu. Select Choose collaboration.
    On the Scope 3 Clean Room Collaboration page, select reporting, then Associate table.

4. Create and run the SQL query

Queries can now be run inside the Reporting Account (shown in Figure 8).

Query results in the Clean Rooms Reporting Account

Figure 8. Query results in the Clean Rooms Reporting Account

  • Select an S3 destination to output the query results. Select Action, then Set results settings.
  • Enter the S3 bucket name, then click Save changes.
  • Paste this SQL snippet inside the query text editor (see Figure 8):

SELECT
  r.product AS "Product",
  SUM(p.s3_upstream_purchased_good) AS "Scope_3_Purchased_Goods_Emissions",
  SUM(t.s3_upstream_transportation) AS "Scope_3_Transportation_Emissions"
FROM
  reporting r
  INNER JOIN purchase p ON r.ingredient = p.item
  INNER JOIN transportation t ON p.item = t.item
GROUP BY
  r.product

  • Click Run query. The initial query can take a few minutes to return results; subsequent queries run faster.

Conclusion

This example shows how AWS Clean Rooms can aggregate data across collaborators to produce total Scope 3 emissions for each product from purchased goods and transportation. This query was performed between three organizations without revealing underlying emission factors or proprietary product recipes to one another. This alleviates data confidentiality concerns and improves sustainability reporting transparency.

Clean Up

The following steps are taken to clean up all resources created in this walkthrough:

  • Member and Collaboration Accounts:
    1. AWS Clean Rooms: Disassociate and delete collaboration tables
    2. AWS Clean Rooms: Remove member account in the collaboration
    3. AWS Glue: Delete the crawler, database, and tables
    4. AWS IAM: Delete the AWS Clean Rooms service policy
    5. Amazon S3: Delete the CSV file storage buckets
  • Collaboration Account only:
    1. Amazon S3: Delete the SQL query bucket
    2. AWS Clean Rooms: Delete the Scope 3 Clean Room Collaboration

Further Reading:

Security Practices

Get custom data into Amazon Security Lake through ingesting Azure activity logs

Post Syndicated from Adam Plotzker original https://aws.amazon.com/blogs/security/get-custom-data-into-amazon-security-lake-through-ingesting-azure-activity-logs/

Amazon Security Lake automatically centralizes security data from both cloud and on-premises sources into a purpose-built data lake stored in the AWS delegated administrator account for Amazon Security Lake.

In this blog post, I will show you how to configure your Amazon Security Lake solution with cloud activity data from Microsoft Azure Monitor activity log, which you can query alongside your existing AWS CloudTrail data. I will walk you through the required steps — from configuring the required AWS Identity and Access Management (IAM) permissions, AWS Glue jobs, and Amazon Kinesis Data Streams required on the AWS side to forwarding that data from within Azure.

When you turn on Amazon Security Lake, it begins to collect actionable security data from various AWS sources. However, many enterprises today have complex environments that include a mix of different cloud resources in addition to on-premises data centers.

Although the AWS data sources in Amazon Security Lake encompass a large amount of the necessary security data needed for analysis, you may miss the full picture if your infrastructure operates across multiple cloud vendors (for example, AWS, Azure, and Google Cloud Platform) and on premises at the same time. By querying data from across your entire infrastructure, you can increase the number of indicators of compromise (IOC) that you identify, and thus increase the likelihood that those indicators will lead to actionable outputs.

Solution architecture

Figure 1 shows how to configure data to travel from an Azure event hub to Amazon Security Lake.

Figure 1: Solution architecture

As shown in Figure 1, the solution involves the following steps:

  1. An AWS user instantiates the required AWS services and features that enable the process to function, including AWS Identity and Access Management (IAM) permissions, Kinesis data streams, AWS Glue jobs, and Amazon Simple Storage Service (Amazon S3) buckets, either manually or through an AWS CloudFormation template, such as the one we will use in this post.
  2. In response to the custom source created from the CloudFormation template, a Security Lake table is generated in AWS Glue.
  3. From this point on, Azure activity logs in their native format are stored within an Azure cloud event hub within an Azure account. An Azure function is deployed to respond to new events within the Azure event hub and forward these logs over the internet to the Kinesis data stream that was created in the preceding step.
  4. The Kinesis data stream forwards the data to an AWS Glue streaming job that is fronted by the stream.
  5. The AWS Glue job then performs the extract, transform, and load (ETL) mapping to the appropriate Open Cybersecurity Schema Framework (OCSF) schema (specified for API Activity events at OCSF API Activity Mappings); a simplified sketch of this mapping follows the list.
  6. The Azure events are partitioned according to the partitioning requirements of Amazon Security Lake tables and stored in Amazon S3.
  7. The user can query these tables by using Amazon Athena alongside the rest of their data inside Amazon Security Lake.
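
To make step 5 more concrete, the following is a minimal sketch of the kind of record-level mapping the streaming job performs. It is not the PySpark code from the provided assets; the Azure input keys (operationName, caller, callerIpAddress, eventTimestamp, level) and the constant values are assumptions for illustration, and the output fields mirror the OCSF API Activity columns queried later in this post.

def map_azure_activity_to_ocsf(record: dict) -> dict:
    """Map one Azure activity log record to a flattened OCSF API Activity shape.

    Both the input keys and the constant values below are illustrative
    placeholders; consult the OCSF API Activity mappings for the real schema.
    """
    return {
        "api": {"operation": record.get("operationName")},
        "actor": {
            "user": {
                "uid": record.get("caller"),
                "name": record.get("caller"),
            }
        },
        "src_endpoint": {"ip": record.get("callerIpAddress")},
        "time": record.get("eventTimestamp"),
        "severity": record.get("level"),
        "metadata": {
            "version": "1.0.0",  # placeholder OCSF schema version
            "product": {"name": "Azure Activity Log", "vendor_name": "Microsoft"},
        },
        "category_name": "Application Activity",  # placeholder category
        "activity_name": "API Activity",          # placeholder activity
        "type_uid": 0,                             # placeholder type UID
    }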

Prerequisites

Before you implement the solution, complete the following prerequisites:

  • Verify that you have enabled Amazon Security Lake in the AWS Regions that correspond to the Azure Activity logs that you will forward. For more information, see What is Amazon Security Lake?
  • Preconfigure the custom source logging for the source AZURE_ACTIVITY in your Region. To configure this custom source in Amazon Security Lake, open the Amazon Security Lake console, navigate to Create custom data source, and do the following, as shown in Figure 2:
    • For Data source name, enter AZURE_ACTIVITY.
    • For Event class, select API_ACTIVITY.
    • For Account Id, enter the ID of the account that is authorized to write data to your data lake.
    • For External Id, enter AZURE_ACTIVITY-<YYYYMMDD>.
    Figure 2: Configure custom data source

For more information on how to configure custom sources for Amazon Security Lake, see Collecting data from custom sources.

Step 1: Configure AWS services for Azure activity logging

The first step is to configure the AWS services for Azure activity logging.

  1. To configure Azure activity logging in Amazon Security Lake, first prepare the assets required in the target AWS account. You can automate this process by using the provided CloudFormation template — Security Lake CloudFormation — which will do the heavy lifting for this portion of the setup.

    Note: I have predefined these scripts to create the AWS assets required to ingest Azure activity logs, but you can generalize this process for other external log sources, as well.

    The CloudFormation template has the following components:

    • securitylakeGlueStreamingRole — includes the following managed policies:
      • AWSLambdaKinesisExecutionRole
      • AWSGlueServiceRole
    • securitylakeGlueStreamingPolicy — includes the following attributes:
      • “s3:GetObject”
      • “s3:PutObject”
    • securitylakeAzureActivityStream — This Kinesis data stream is the endpoint that acts as the connection point between Azure and AWS and the frontend of the AWS Glue stream that feeds Azure activity logs to Amazon Security Lake.
    • securitylakeAzureActivityJob — This is an AWS Glue streaming job that is used to take in feeds from the Kinesis data stream and map the Azure activity logs within that stream to OCSF.
    • securitylake-glue-assets S3 bucket — This is the S3 bucket that is used to store the ETL scripts used in the AWS Glue job to map Azure activity logs.

    Running the CloudFormation template will instantiate the aforementioned assets in your AWS delegated administrator account for Amazon Security Lake.

  2. The CloudFormation template creates a new S3 bucket with the following syntax: securityLake-glue-assets-<ACCOUNT-ID><REGION>. After the CloudFormation run is complete, navigate to this bucket within the S3 console.
  3. Within the S3 bucket, create a scripts folder and a temporary folder, as shown in Figure 4.
    Figure 4: Glue assets bucket

  4. Update the Azure AWS Glue PySpark script by replacing the following values in the file. You will attach this script to your AWS Glue job and use it to generate the AWS assets required for the implementation.
    • Replace <AWS_REGION_NAME> with the Region that you are operating in — for example, us-east-2.
    • Replace <AWS_ACCOUNT_ID> with the account ID of your delegated administrator account for Amazon Security Lake — for example, 111122223333.
    • Replace <SECURITYLAKE-AZURE-STREAM-ARN> with the Kinesis stream name created through the CloudFormation template. To find the stream name, open the Kinesis console, navigate to the Kinesis stream with the name securityLakeAzureActivityStream<STREAM-UID>, and copy the Amazon Resource Name (ARN), as shown in the following figure.

      Figure 5: Kinesis stream ARN

    • Replace <SECURITYLAKE-BUCKET-NAME> with the root name of your data lake S3 bucket — for example, s3://aws-security-data-lake-DOC-EXAMPLE-BUCKET.

    After you replace these values, navigate within the scripts folder and upload the AWS Glue PySpark Python script named azure-activity-pyspark.py, as shown in Figure 6.

    Figure 6: AWS Glue script

  5. Within your AWS Glue job, choose Job details and configure the job as follows:
    • For Type, select Spark Streaming.
    • For Language, select Python 3.
    • For Script path, select the S3 path that you created in the preceding step.
    • For Temporary path, select the S3 path that you created in the preceding step.
  6. Save the changes, and run the AWS Glue job by selecting Save and then Run.
  7. Choose the Runs tab, and make sure that the Run status of the job is Running.
    Figure 7: AWS Glue job status

At this point, you have finished the configurations from AWS.

Step 2: Configure Azure services for Azure activity log forwarding

You will complete the next steps in the Azure Cloud console. You need to configure Azure to export activity logs to an Azure cloud event hub within your desired Azure account or organization. Additionally, you need to create an Azure function to respond to new events within the Azure event hub and forward those logs over the internet to the Kinesis data stream that the CloudFormation template created in the initial steps of this post.

For information about how to set up and configure Azure Functions to respond to event hubs, see Azure Event Hubs Trigger for Azure Functions in the Azure documentation.

Configure the following Python script — Azure Event Hub Function — in an Azure function app. This function is designed to respond to event hub events, create a connection to AWS, and forward those events to Kinesis as deserialized JSON blobs.

In the script, replace the following variables with your own information:

  • For <SECURITYLAKE-AZURE-STREAM-ARN>, enter the Kinesis data stream ARN.
  • For <SECURITYLAKE-AZURE-STREAM-NAME>, enter the Kinesis data stream name.
  • For <SECURITYLAKE-AZURE-STREAM-KEYID>, enter the AWS Key Management Service (AWS KMS) key ID created through the CloudFormation template.

The <SECURITYLAKE-AZURE-STREAM-ARN> and securityLakeAzureActivityStream<STREAM-UID> are the same variables that you obtained earlier in this post (see Figure 5).
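
The full function body is in the linked Azure Event Hub Function script. As a rough illustration of the forwarding pattern only, the following sketch shows an Azure Functions handler in Python that pushes each event hub record to Kinesis with boto3. The environment variable names, the correlationId partition key, and the batch (cardinality "many") trigger are assumptions, and credential handling for boto3 is intentionally omitted; see the key management guidance that follows.

import json
import os
from typing import List

import azure.functions as func
import boto3

# Assumed environment variable names; the linked script may use different ones.
STREAM_NAME = os.environ["SECURITYLAKE_AZURE_STREAM_NAME"]
AWS_REGION = os.environ["SECURITYLAKE_AWS_REGION"]

kinesis = boto3.client("kinesis", region_name=AWS_REGION)


def main(events: List[func.EventHubEvent]) -> None:
    # Requires cardinality set to "many" in function.json so events arrive in batches.
    for event in events:
        body = json.loads(event.get_body().decode("utf-8"))
        # Forward each Azure activity log record as a deserialized JSON blob.
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(body).encode("utf-8"),
            # correlationId is a typical Azure activity log field; an assumption here.
            PartitionKey=str(body.get("correlationId", "azure-activity")),
        )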

You can find the AWS KMS key ID within the AWS KMS managed key policy associated with securityLakeAzureActivityStream. For example, in the key policy shown in Figure 8, the <SECURITYLAKE-AZURE-STREAM-KEYID> is shown in line 3.

Figure 8: Kinesis data stream inputs

Important: When you are working with KMS keys retrieved from the AWS console or AWS API keys within Azure, you should be extremely mindful of how you approach key management. Improper or poor handling of keys could result in the interception of data from the Kinesis stream or Azure function.

It’s a best security practice to use a trusted key management architecture that uses sufficient encryption and security protocols when working with keys that safeguard sensitive security information. Within Azure, consider using services such as the Azure AD integration with AWS for seamless and ephemeral credential usage inside the Azure function. See Azure AD Integration for more information on how the integration works to safeguard and manage stored security keys and to help make sure that no keys are accessible to unauthorized parties or stored as unencrypted text outside the AWS console.

Step 3: Validate the workflow and query Athena

After you complete the preceding steps, your logs should be flowing. To make sure that the process is working correctly, complete the following steps.

  1. In the Kinesis Data Streams console, verify that the logs are flowing to your data stream. Open the Kinesis stream that you created previously, choose the Data viewer tab, and then choose Get records, as shown in Figure 9.
    Figure 9: Kinesis data stream inputs

  2. Verify that the logs are partitioned and stored within the correct Security Lake bucket associated with the configured Region. The log partitions within the Security Lake bucket should follow the syntax region=<region>/account_id=<account_id>/eventDay=<YYYYMMDD>/, and they should be stored with the expected Parquet compression.
    Figure 10: S3 bucket with object

  3. Assuming that CloudTrail logs exist within your Amazon Security Lake instance as well, you can now create a query in Athena that pulls data from the newly created Azure activity table and examine it alongside your existing CloudTrail logs by running queries such as the following:
    SELECT 
        api.operation,
        actor.user.uid,
        actor.user.name,
        src_endpoint.ip,
        time,
        severity,
        metadata.version,
        metadata.product.name,
        metadata.product.vendor_name,
        category_name,
        activity_name,
        type_uid
    FROM {SECURITY-LAKE-DB}.{SECURITY-LAKE-AZURE-TABLE}
    UNION ALL
    SELECT 
        api.operation,
        actor.user.uid,
        actor.user.name,
        src_endpoint.ip,
        time,
        severity,
        metadata.version,
        metadata.product.name,
        metadata.product.vendor_name,
        category_name,
        activity_name,
        type_uid
    FROM {SECURITY-LAKE-DB}.{SECURITY-LAKE-CLOUDTRAIL-TABLE}

    Figure 11: Query Azure activity and CloudTrail together in Athena

For additional guidance on how to configure access and query Amazon Security Lake in Athena, see the following resources:

Conclusion

In this blog post, you learned how to create and deploy the AWS and Microsoft Azure assets needed to bring your own data to Amazon Security Lake. By creating an AWS Glue streaming job that can transform Azure activity data streams and by fronting that AWS Glue job with a Kinesis stream, you can open Amazon Security Lake to intake from external Azure activity data streams.

You also learned how to configure Azure assets so that your Azure activity logs can stream to your Kinesis endpoint. The combination of these two creates a working, custom source solution for Azure activity logging.

To get started with Amazon Security Lake, see the Getting Started page, or if you already use Amazon Security Lake and want to read additional blog posts and articles about this service, see Blog posts and articles.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on Amazon Security Lake re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Adam Plotzker

Adam is currently a Security Engineer at AWS, working primarily on the Amazon Security Lake solution. One of the things he enjoys most about his work at AWS is his ability to be creative when exploring customer needs and coming up with unique solutions that meet those needs.

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

Post Syndicated from Manish Kola original https://aws.amazon.com/blogs/big-data/join-streaming-source-cdc-glue/

Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained a lot of traction to become the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.

Data lakes are not transactional by default; however, there are multiple open-source frameworks that enhance data lakes with ACID properties, providing a best-of-both-worlds solution between transactional and non-transactional storage mechanisms.

Traditional batch ingestion and processing pipelines that involve operations such as data cleaning and joining with reference data are straightforward to create and cost-efficient to maintain. However, it is challenging to ingest datasets, such as Internet of Things (IoT) and clickstreams, at a fast rate with near-real-time delivery SLAs. You will also want to apply incremental updates with change data capture (CDC) from the source system to the destination. To make data-driven decisions in a timely manner, you need to account for missed records and backpressure, and maintain event ordering and integrity, especially if the reference data also changes rapidly.

In this post, we aim to address these challenges. We provide a step-by-step guide to join streaming data to a reference table changing in real time using AWS Glue, Amazon DynamoDB, and AWS Database Migration Service (AWS DMS). We also demonstrate how to ingest streaming data to a transactional data lake using Apache Hudi to achieve incremental updates with ACID transactions.

Solution overview

For our example use case, streaming data is coming through Amazon Kinesis Data Streams, and reference data is managed in MySQL. The reference data is continuously replicated from MySQL to DynamoDB through AWS DMS. The requirement here is to enrich the real-time stream data by joining with the reference data in near-real time, and to make it queryable from a query engine such as Amazon Athena while keeping consistency. In this use case, reference data in MySQL can be updated when requirements change, and queries then need to return results that reflect the updates to the reference data.

This solution addresses the issue of users wanting to join streams with changing reference datasets when the size of the reference dataset is small. The reference data is maintained in DynamoDB tables, and the streaming job loads the full table into memory for each micro-batch, joining a high-throughput stream to a small reference dataset.

The following diagram illustrates the solution architecture.

Architecture

Prerequisites

For this walkthrough, you should have the following prerequisites:

Create IAM roles and S3 bucket

In this section, you create an Amazon Simple Storage Service (Amazon S3) bucket and two AWS Identity and Access Management (IAM) roles: one for the AWS Glue job, and one for AWS DMS. We do this using an AWS CloudFormation template. Complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack.
  3. Choose Next.
  4. For Stack name, enter a name for your stack.
  5. For DynamoDBTableName, enter tgt_country_lookup_table. This is the name of your new DynamoDB table.
  6. For S3BucketNamePrefix, enter the prefix of your new S3 bucket.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  8. Choose Create stack.

Stack creation can take about 1 minute.

Create a Kinesis data stream

In this section, you create a Kinesis data stream:

  1. On the Kinesis console, choose Data streams in the navigation pane.
  2. Choose Create data stream.
  3. For Data stream name, enter your stream name.
  4. Leave the remaining settings as default and choose Create data stream.

A Kinesis data stream is created with on-demand mode.

Create and configure an Aurora MySQL cluster

In this section, you create and configure an Aurora MySQL cluster as the source database. First, configure your source Aurora MySQL database cluster to enable CDC through AWS DMS to DynamoDB.

Create a parameter group

Complete the following steps to create a new parameter group:

  1. On the Amazon RDS console, choose Parameter groups in the navigation pane.
  2. Choose Create parameter group.
  3. For Parameter group family, select aurora-mysql5.7.
  4. For Type, choose DB Cluster Parameter Group.
  5. For Group name, enter my-mysql-dynamodb-cdc.
  6. For Description, enter Parameter group for demo Aurora MySQL database.
  7. Choose Create.
  8. Select my-mysql-dynamodb-cdc, and choose Edit under Parameter group actions.
  9. Edit the parameter group as follows:
Name               Value
binlog_row_image   full
binlog_format      ROW
binlog_checksum    NONE
log_slave_updates  1
  10. Choose Save changes.

RDS parameter group

Create the Aurora MySQL cluster

Complete following steps to create the Aurora MySQL cluster:

  1. On the Amazon RDS console, choose Databases in the navigation pane.
  2. Choose Create database.
  3. For Choose a database creation method, choose Standard create.
  4. Under Engine options, for Engine type, choose Aurora (MySQL Compatible).
  5. For Engine version, choose Aurora (MySQL 5.7) 2.11.2.
  6. For Templates, choose Production.
  7. Under Settings, for DB cluster identifier, enter a name for your database.
  8. For Master username, enter your primary user name.
  9. For Master password and Confirm master password, enter your primary password.
  10. Under Instance configuration, for DB instance class, choose Burstable classes (includes t classes) and choose db.t3.small.
  11. Under Availability & durability, for Multi-AZ deployment, choose Don’t create an Aurora Replica.
  12. Under Connectivity, for Compute resource, choose Don’t connect to an EC2 compute resource.
  13. For Network type, choose IPv4.
  14. For Virtual private cloud (VPC), choose your VPC.
  15. For DB subnet group, choose your public subnet.
  16. For Public access, choose Yes.
  17. For VPC security group (firewall), choose the security group for your public subnet.
  18. Under Database authentication, for Database authentication options, choose Password authentication.
  19. Under Additional configuration, for DB cluster parameter group, choose the cluster parameter group you created earlier.
  20. Choose Create database.

Grant permissions to the source database

The next step is to grant the required permissions on the source Aurora MySQL database. Now you can connect to the DB cluster using the MySQL utility. You can run queries to complete the following tasks:

  • Create a demo database and table and run queries on the data
  • Grant permission for a user used by the AWS DMS endpoint

Complete the following steps:

  1. Log in to the EC2 instance that you’re using to connect to your DB cluster.
  2. Enter the following command at the command prompt to connect to the primary DB instance of your DB cluster:
$ mysql -h mycluster.cluster-123456789012.us-east-1.rds.amazonaws.com -P 3306 -u admin -p
  3. Run the following SQL command to create a database:
> CREATE DATABASE mydev;
  4. Run the following SQL command to create a table:
> use mydev; 
> CREATE TABLE country_lookup_table
(
code varchar(5),
countryname varchar(40) not null,
combinedname varchar(40) not null
);
  5. Run the following SQL command to populate the table with data:
> INSERT INTO country_lookup_table(code, countryname, combinedname) VALUES ('IN', 'India', 'IN-India'), ('US', 'USA', 'US-USA'), ('CA', 'Canada', 'CA-Canada'), ('CN', 'China', 'CN-China');
  6. Run the following SQL command to create a user for the AWS DMS endpoint and grant permissions for CDC tasks (replace the placeholder with your preferred password):
> CREATE USER repl IDENTIFIED BY '<your-password>';
> GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'repl'@'%';
> GRANT SELECT ON mydev.country_lookup_table TO 'repl'@'%';

Create and configure AWS DMS resources to load data into the DynamoDB reference table

In this section, you create and configure AWS DMS to replicate data into the DynamoDB reference table.

Create an AWS DMS replication instance

First, create an AWS DMS replication instance by completing the following steps:

  1. On the AWS DMS console, choose Replication instances in the navigation pane.
  2. Choose Create replication instance.
  3. Under Settings, for Name, enter a name for your instance.
  4. Under Instance configuration, for High Availability, choose Dev or test workload (Single-AZ).
  5. Under Connectivity and security, for VPC security groups, choose default.
  6. Choose Create replication instance.

Create Amazon VPC endpoints

Optionally, you can create Amazon VPC endpoints for DynamoDB when you need to connect to your DynamoDB table from the AWS DMS instance in a private network. Also make sure that you enable Publicly accessible when you need to connect to a database outside of your VPC.

Create an AWS DMS source endpoint

Create an AWS DMS source endpoint by completing the following steps:

  1. On the AWS DMS console, choose Endpoints in the navigation pane.
  2. Choose Create endpoint.
  3. For Endpoint type, choose Source endpoint.
  4. Under Endpoint configuration, for Endpoint identifier, enter a name for your endpoint.
  5. For Source engine, choose Amazon Aurora MySQL.
  6. For Access to endpoint database, choose Provide access information manually.
  7. For Server Name, enter the endpoint name of your Aurora writer instance (for example, mycluster.cluster-123456789012.us-east-1.rds.amazonaws.com).
  8. For Port, enter 3306.
  9. For User name, enter a user name for your AWS DMS task.
  10. For Password, enter a password.
  11. Choose Create endpoint.

Create an AWS DMS target endpoint

Create an AWS DMS target endpoint by completing the following steps:

  1. On the AWS DMS console, choose Endpoints in the navigation pane.
  2. Choose Create endpoint.
  3. For Endpoint type, choose Target endpoint.
  4. Under Endpoint configuration, for Endpoint identifier, enter a name for your endpoint.
  5. For Target engine, choose Amazon DynamoDB.
  6. For Service access role ARN, enter the IAM role for your AWS DMS task.
  7. Choose Create endpoint.

Create AWS DMS migration tasks

Create AWS DMS database migration tasks by completing the following steps:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Choose Create task.
  3. Under Task configuration, for Task identifier, enter a name for your task.
  4. For Replication instance, choose your replication instance.
  5. For Source database endpoint, choose your source endpoint.
  6. For Target database endpoint, choose your target endpoint.
  7. For Migration type, choose Migrate existing data and replicate ongoing changes.
  8. Under Task settings, for Target table preparation mode, choose Do nothing.
  9. For Stop task after full load completes, choose Don’t stop.
  10. For LOB column settings, choose Limited LOB mode.
  11. For Task logs, enable Turn on CloudWatch logs and Turn on batch-optimized apply.
  12. Under Table mappings, choose JSON Editor and enter the following rules.

The attribute-mappings section controls how source column values map to DynamoDB attributes. With the following rules, the AWS DMS CDC task will first create a new DynamoDB table with the specified name in target-table-name. Then it will replicate all the records, mapping the columns in the DB table to the attributes in the DynamoDB table.

{
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "mydev",
                "table-name": "country_lookup_table"
            },
            "rule-action": "include"
        },
        {
            "rule-type": "object-mapping",
            "rule-id": "2",
            "rule-name": "2",
            "rule-action": "map-record-to-record",
            "object-locator": {
                "schema-name": "mydev",
                "table-name": "country_lookup_table"
            },
            "target-table-name": "tgt_country_lookup_table",
            "mapping-parameters": {
                "partition-key-name": "code",
                "sort-key-name": "countryname",
                "exclude-columns": [
                    "code",
                    "countryname"
                ],
                "attribute-mappings": [
                    {
                        "target-attribute-name": "code",
                        "attribute-type": "scalar",
                        "attribute-sub-type": "string",
                        "value": "${code}"
                    },
                    {
                        "target-attribute-name": "countryname",
                        "attribute-type": "scalar",
                        "attribute-sub-type": "string",
                        "value": "${countryname}"
                    }
                ],
                "apply-during-cdc": true
            }
        }
    ]
}

DMS table mapping

  13. Choose Create task.

Now the AWS DMS replication task has been started.

  14. Wait for the Status to show as Load complete.

DMS task

  15. On the DynamoDB console, choose Tables in the navigation pane.
  16. Select the DynamoDB reference table, and choose Explore table items to review the replicated records.

DynamoDB reference table initial

Create an AWS Glue Data Catalog table and an AWS Glue streaming ETL job

In this section, you create an AWS Glue Data Catalog table and an AWS Glue streaming extract, transform, and load (ETL) job.

Create a Data Catalog table

Create an AWS Glue Data Catalog table for the source Kinesis data stream with the following steps:

  1. On the AWS Glue console, choose Databases under Data Catalog in the navigation pane.
  2. Choose Add database.
  3. For Name, enter my_kinesis_db.
  4. Choose Create database.
  5. Choose Tables under Databases, then choose Add table.
  6. For Name, enter my_stream_src_table.
  7. For Database, choose my_kinesis_db.
  8. For Select the type of source, choose Kinesis.
  9. For Kinesis data stream is located in, choose my account.
  10. For Kinesis stream name, enter a name for your data stream.
  11. For Classification, select JSON.
  12. Choose Next.
  13. Choose Edit schema as JSON, enter the following JSON, then choose Save.
[
  {
    "Name": "uuid",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "country",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "itemtype",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "saleschannel",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "orderpriority",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "orderdate",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "region",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "shipdate",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "unitssold",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "unitprice",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "unitcost",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "totalrevenue",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "totalcost",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "totalprofit",
    "Type": "string",
    "Comment": ""
  },
  {
    "Name": "impressiontime",
    "Type": "string",
    "Comment": ""
  }
]

Glue Catalog table schema

  14. Choose Next, then choose Create.

Create an AWS Glue streaming ETL job

Next, you create an AWS Glue streaming job. AWS Glue 3.0 and later supports Apache Hudi natively, so we use this native integration to ingest into a Hudi table. Complete the following steps to create the AWS Glue streaming job:

  1. On the AWS Glue Studio console, choose Spark script editor and choose Create.
  2. Under Job details tab, for Name, enter a name for your job.
  3. For IAM Role, choose the IAM role for your AWS Glue job.
  4. For Type, select Spark Streaming.
  5. For Glue version, choose Glue 4.0 – Supports spark 3.3, Scala 2, Python 3.
  6. For Requested number of workers, enter 3.
  7. Under Advanced properties, for Job parameters, choose Add new parameter.
  8. For Key, enter --conf.
  9. For Value, enter spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false.
  10. Choose Add new parameter.
  11. For Key, enter --datalake-formats.
  12. For Value, enter hudi.
  13. For Script path, enter s3://<S3BucketName>/scripts/.
  14. For Temporary path, enter s3://<S3BucketName>/temporary/.
  15. Optionally, for Spark UI logs path, enter s3://<S3BucketName>/sparkHistoryLogs/.

Glue job parameter

  16. On the Script tab, enter the following script into the AWS Glue Studio editor and choose Create.

The near-real-time streaming job enriches data by joining a Kinesis data stream with a DynamoDB table that contains frequently updated reference data. The enriched dataset is loaded into the target Hudi table in the data lake. Replace <S3BucketName> with your bucket that you created via AWS CloudFormation:

import sys, json
import boto3
from pyspark.sql import DataFrame, Row
from pyspark.context import SparkContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv,["JOB_NAME"])

# Initialize spark session and Glue context
sc = SparkContext() 
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# job parameters
dydb_lookup_table = "tgt_country_lookup_table"
kin_src_database_name = "my_kinesis_db" 
kin_src_table_name = "my_stream_src_table" 
hudi_write_operation = "upsert" 
hudi_record_key = "uuid" 
hudi_precomb_key = "orderdate" 
checkpoint_path = "s3://<S3BucketName>/streamlab/checkpoint/" 
s3_output_folder = "s3://<S3BucketName>/output/"
hudi_table = "hudi_table"
hudi_database = "my_kinesis_db"

# hudi options 
additional_options={
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.write.recordkey.field": hudi_record_key,
    "hoodie.datasource.hive_sync.database": hudi_database,
    "hoodie.table.name": hudi_table,
    "hoodie.consistency.check.enabled": "true",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.NonPartitionedExtractor",
    "hoodie.datasource.write.hive_style_partitioning": "false",
    "hoodie.datasource.write.precombine.field": hudi_precomb_key,
    "hoodie.bulkinsert.shuffle.parallelism": "4",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.write.operation": hudi_write_operation,
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
}

# Scan the reference data table in DynamoDB with the boto3 API and load it into a Spark DataFrame.
def readDynamoDb():
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table(dydb_lookup_table)
    response = table.scan()
    items = response["Items"]
    jsondata = sc.parallelize(items)
    lookupDf = glueContext.read.json(jsondata)
    return lookupDf


# Load the Amazon Kinesis data stream from the AWS Glue Data Catalog.
source_df = glueContext.create_data_frame.from_catalog(
    database=kin_src_database_name,
    table_name=kin_src_table_name,
    transformation_ctx="source_df",
    additional_options={"startingPosition": "TRIM_HORIZON"},
)

# As part of batch processing, implement the transformation logic for joining streaming data frames with reference data frames.
def processBatch(data_frame, batchId):
    if data_frame.count() > 0:

        # Refresh the DynamoDB table to pull the latest snapshot for each micro-batch
        country_lookup_df = readDynamoDb() 
                
        final_frame = data_frame.join(
            country_lookup_df, 
            data_frame["country"] == country_lookup_df["countryname"], 
            'left'
        ).drop(
            "countryname",
            "country",
            "unitprice", 
            "unitcost",
            "totalrevenue",
            "totalcost",
            "totalprofit"
        )

        # Script generated for node my-lab-hudi-connector
        final_frame.write.format("hudi") \
            .options(**additional_options) \
            .mode("append") \
            .save(s3_output_folder)
        
try:
    glueContext.forEachBatch(
        frame=source_df,
        batch_function=processBatch,
        options={"windowSize": "60 seconds", "checkpointLocation": checkpoint_path},
    )
except Exception as e:
    print(f"Error is @@@ ....{e}")
  17. Choose Run to start the streaming job.

The following screenshot shows examples of the DataFrames data_frame, country_lookup_df, and final_frame.

Glue job log output initial

The AWS Glue job successfully joined records coming from the Kinesis data stream and the reference table in DynamoDB, and then ingested the joined records into Amazon S3 in Hudi format.

Create and run a Python script to generate sample data and load it into the Kinesis data stream

In this section, you create and run a Python script to generate sample data and load it into the source Kinesis data stream. Complete the following steps:

  1. Log in to AWS Cloud9, your EC2 instance, or any other computing host that puts records in your data stream.
  2. Create a Python file called generate-data-for-kds.py.
  3. Open the Python file and enter the following script:
import json
import random
import boto3
import time

STREAM_NAME = "<mystreamname>"

def get_data():
    return {
        "uuid": random.randrange(0, 1000001, 1),
        "country": random.choice( [ "United Arab Emirates", "China", "India", "United Kingdom", "United States of America", ] ),
        "itemtype": random.choice( [ "Snacks", "Cereals", "Cosmetics", "Fruits", "Clothes", "Babycare", "Household", ] ),
        "saleschannel": random.choice( [ "Snacks", "Cereals", "Cosmetics", "Fruits", "Clothes", "Babycare", "Household", ] ),
        "orderpriority": random.choice(["H", "L", "M", "C"]),
        "orderdate": random.choice( [ "1/4/10", "2/28/10", "2/15/11", "11/8/11", "2/1/12", "2/18/12", "3/1/12", "9/24/12",
                                      "10/13/12", "12/2/12", "12/29/12", "3/30/13", "7/29/13", "3/23/14", "6/14/14", 
                                      "7/15/14", "10/19/14", "5/7/15", "10/11/15", "11/22/15", "8/23/16", "1/15/17", 
                                      "1/27/17", "2/25/17", "3/10/17", "4/1/17", ] ),
        "region": random.choice( ["Asia" "Europe", "Americas", "Middle Eastern", "Africa"] ),
        "shipdate": random.choice( [ "1/4/10", "2/28/10", "2/15/11", "11/8/11", "2/1/12", "2/18/12", "3/1/12", "9/24/12", 
                                    "10/13/12", "12/2/12", "12/29/12", "3/30/13", "7/29/13", "3/23/14", "6/14/14", "7/15/14",
                                      "10/19/14", "5/7/15", "10/11/15", "11/22/15", "8/23/16", "1/15/17", "1/27/17", 
                                      "2/25/17", "3/10/17", "4/1/17", ] ),
        "unitssold": random.choice( [ "8217", "3465", "8877", "2882", "70", "7044", "6307", "2384", "1327", "2572", "8794", 
                                     "4131", "5793", "9091", "4314", "9085", "5270", "5459", "1982", "8245", "4860", "4656", 
                                     "8072", "65", "7864", "9778", ] ),
        "unitprice": random.choice( [ "97.44", "117.11", "364.69", "502.54", "263.33", "117.11", "35.84", "6.92", "35.84", 
                                     "6.92", "35.84", "56.67", "159.42", "502.54", "117.11", "56.67", "524.96", "502.54", 
                                     "56.67", "56.67", "159.42", "56.67", "35.84", "159.42", "502.54", "31.79", ] ),
        "unitcost": random.choice( [ "97.44", "117.11", "364.69", "502.54", "263.33", "117.11", "35.84", "6.92", "35.84", 
                                    "6.92", "35.84", "56.67", "159.42", "502.54", "117.11", "56.67", "524.96", "502.54", 
                                    "56.67", "56.67", "159.42", "56.67", "35.84", "159.42", "502.54", "31.79", ] ),
        "totalrevenue": random.choice( [ "1253749.86", "712750.5", "3745117.53", "1925954.14", "30604", "1448950.8", 
                                        "689228.96", "22242.72", "145014.56", "23996.76", "961008.32", "337626.63", 
                                        "1478837.04", "6075242.57", "887389.8", "742517.05", "3431876.7", "3648085.93", 
                                        "161988.86", "673863.85", "1240660.8", "380534.88", "882108.16", "16593.2", 
                                        "5255275.28", "463966.1", ] ),
        "totalcost": random.choice( [ "800664.48", "405786.15", "3237353.13", "1448320.28", "18433.1", "824922.84", 
                                     "226042.88", "16497.28", "47559.68", "17798.24", "315176.96", "234103.77", "923520.06",
                                       "4568591.14", "505212.54", "514846.95", "2766539.2", "2743365.86", 
                                       "112319.94", "467244.15", "774781.2", "263855.52", "289300.48", "10362.3", 
                                       "3951974.56", "310842.62", ] ),
        "totalprofit": random.choice( [ "453085.38", "306964.35", "507764.4", "477633.86", "12170.9", "624027.96", 
                                       "463186.08", "5745.44", "97454.88", "6198.52", "645831.36", "103522.86", "555316.98", 
                                       "1506651.43", "382177.26", "227670.1", "665337.5", "904720.07", "49668.92", "206619.7",
                                         "465879.6", "116679.36", "592807.68", "6230.9", "1303300.72", "153123.48", ] ),
        "impressiontime": random.choice( [ "2022-10-24T02:27:41Z", "2022-10-24T02:27:41Z", "2022-11-24T02:27:41Z", 
                                          "2022-12-24T02:27:41Z", "2022-13-24T02:27:41Z", "2022-14-24T02:27:41Z", 
                                          "2022-15-24T02:27:41Z", ] ),
    }

def generate(stream_name, kinesis_client):
    while True:
        data = get_data()
        print(data)
        kinesis_client.put_record(
            StreamName=stream_name, Data=json.dumps(data), PartitionKey="partitionkey"
        )
        time.sleep(2)

if __name__ == "__main__":
    generate(STREAM_NAME, boto3.client("kinesis"))

  4. Run the script:

$ python3 generate-data-for-kds.py

This script puts a record into the Kinesis data stream every 2 seconds.
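
If you want to confirm from the same host that records are arriving before moving on, the following is a minimal sketch using boto3. The stream name placeholder matches the generator script, and reading a few records from the first shard with TRIM_HORIZON is an assumption that works for the on-demand stream created earlier.

import boto3

STREAM_NAME = "<mystreamname>"  # same placeholder as in generate-data-for-kds.py

kinesis = boto3.client("kinesis")

# Read a handful of records from the beginning of the first shard.
shard_id = kinesis.list_shards(StreamName=STREAM_NAME)["Shards"][0]["ShardId"]
shard_iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=shard_iterator, Limit=5)
for record in response["Records"]:
    print(record["Data"].decode("utf-8"))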

Simulate updating the reference table in the Aurora MySQL cluster

Now all the resources and configurations are ready. For this example, we want to add a 3-digit country code to the reference table. Let’s update records in the Aurora MySQL table to simulate changes. Complete the following steps:

  1. Make sure that the AWS Glue streaming job is already running.
  2. Connect to the primary DB instance again, as described earlier.
  3. Enter your SQL commands to update records:
> UPDATE country_lookup_table SET combinedname='US-USA-US' WHERE code='US';
> UPDATE country_lookup_table SET combinedname='CA-CAN-Canada' WHERE code='CA';
> UPDATE country_lookup_table SET combinedname='CN-CHN-China' WHERE code='CN';
> UPDATE country_lookup_table SET combinedname='IN-IND-India' WHERE code='IN';

Now the reference table in the Aurora MySQL source database has been updated. Then the changes are automatically replicated to the reference table in DynamoDB.

DynamoDB reference table updated

The following tables show records in data_frame, country_lookup_df, and final_frame. In country_lookup_df and final_frame, the combinedname column has values formatted as <2-digit-country-code>-<3-digit-country-code>-<country-name>, which shows that the changed records in the reference table are reflected in final_frame without restarting the AWS Glue streaming job. It means that the AWS Glue job successfully joins the incoming records from the Kinesis data stream with the reference table even while the reference table is changing.
Glue job log output updated

Query the Hudi table using Athena

Let’s query the Hudi table using Athena to see the records in the destination table. Complete the following steps:

  1. Make sure that the script and the AWS Glue streaming job are still working:
    1. The Python script (generate-data-for-kds.py) is still running.
    2. The generated data is being sent to the data stream.
    3. The AWS Glue streaming job is still running.
  2. On the Athena console, run the following SQL in the query editor:
select shipdate, unitssold, impressiontime, code, combinedname from <database>.<table>
where combinedname is not null
limit 10;

The following query result shows the records that are processed before the referenced table was changed. Records in the combinedname column are similar to <2-digit-country-code>-<country-name>.

Athena query result initial

The following query result shows the records that are processed after the referenced table was changed. Records in the combinedname column are similar to <2-digit-country-code>-<3-digit-country-code>-<country-name>.

Athena query result updated

Now you can see that the changed reference data is successfully reflected in the target Hudi table, which joins records from the Kinesis data stream with the reference data in DynamoDB.

Clean up

As the final step, clean up the resources:

  1. Delete the Kinesis data stream.
  2. Delete the AWS DMS migration task, endpoint, and replication instance.
  3. Stop and delete the AWS Glue streaming job.
  4. Delete the AWS Cloud9 environment.
  5. Delete the CloudFormation template.

Conclusion

Building and maintaining a transactional data lake that involves real-time data ingestion and processing has multiple variable components and decisions to be made, such as what ingestion service to use, how to store your reference data, and what transactional data lake framework to use. In this post, we provided the implementation details of such a pipeline, using AWS native components as the building blocks and Apache Hudi as the open-source framework for a transactional data lake.

We believe that this solution can be a starting point for organizations looking to implement a new data lake with such requirements. Additionally, the different components are fully pluggable and can be mixed and matched to existing data lakes to target new requirements or migrate existing ones, addressing their pain points.


About the authors

Manish Kola is a Data Lab Solutions Architect at AWS, where he works closely with customers across various industries to architect cloud-native solutions for their data analytics and AI needs. He partners with customers on their AWS journey to solve their business problems and build scalable prototypes. Before joining AWS, Manish’s experience includes helping customers implement data warehouse, BI, data integration, and data lake projects.

Santosh Kotagiri is a Solutions Architect at AWS with experience in data analytics and cloud solutions leading to tangible business results. His expertise lies in designing and implementing scalable data analytics solutions for clients across industries, with a focus on cloud-native and open-source services. He is passionate about leveraging technology to drive business growth and solve complex problems.

Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. She loves planetary science and enjoys studying the asteroid Ryugu on weekends.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Automate alerting and reporting for AWS Glue job resource usage

Post Syndicated from Michael Hamilton original https://aws.amazon.com/blogs/big-data/automate-alerting-and-reporting-for-aws-glue-job-resource-usage/

Data transformation plays a pivotal role in providing the necessary data insights for businesses in any organization, small and large. To gain these insights, customers often perform ETL (extract, transform, and load) jobs from their source systems and output an enriched dataset. Many organizations today are using AWS Glue to build ETL pipelines that bring data from disparate sources and store the data in repositories like a data lake, database, or data warehouse for further consumption. These organizations are looking for ways they can reduce cost across their IT environments and still be operationally performant and efficient.

Picture a scenario where you, the VP of Data and Analytics, are in charge of your data and analytics environments and workloads running on AWS where you manage a team of data engineers and analysts. This team is allowed to create AWS Glue for Spark jobs in development, test, and production environments. During testing, one of the jobs wasn’t configured to automatically scale its compute resources, resulting in jobs timing out, costing the organization more than anticipated. The next steps usually include completing an analysis of the jobs, looking at cost reports to see which account generated the spike in usage, going through logs to see what happened with the job, and so on. After the ETL job has been corrected, you may want to implement monitoring and set standard alert thresholds for your AWS Glue environment.

This post will help organizations proactively monitor and cost optimize their AWS Glue environments by providing an easier path for teams to measure the efficiency of their ETL jobs and align configuration details according to organizational requirements. Included is a solution you will be able to deploy that will notify your team via email about any Glue job that has been configured incorrectly. Additionally, a weekly report is generated and sent via email that aggregates resource usage and provides cost estimates per job.

AWS Glue cost considerations

AWS Glue for Apache Spark jobs are provisioned with a number of workers and a worker type. These jobs can use the G.1X, G.2X, G.4X, G.8X, or Z.2X (Ray) worker types, which map to data processing units (DPUs). DPUs include a certain amount of CPU, memory, and disk space. The following table contains more details.

Worker Type  DPUs  vCPUs  Memory (GB)  Disk (GB)
G.1X         1     4      16           64
G.2X         2     8      32           128
G.4X         4     16     64           256
G.8X         8     32     128          512
Z.2X         2     8      32           128

For example, if a job is provisioned with 10 workers as G.1X worker type, the job will have access to 40 vCPUs and 160 GB of RAM to process data, and double that with G.2X. Over-provisioning workers can lead to increased cost because not all workers are utilized efficiently.
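
As a quick illustration of that arithmetic, the following snippet multiplies the per-worker figures from the preceding table by the worker count; the table values are the only inputs, and the helper name is ours.

# Per-worker capacity taken from the table above: (vCPUs, memory in GB).
WORKER_SPECS = {
    "G.1X": (4, 16),
    "G.2X": (8, 32),
    "G.4X": (16, 64),
    "G.8X": (32, 128),
    "Z.2X": (8, 32),
}

def job_capacity(worker_type: str, number_of_workers: int) -> tuple:
    """Return the total (vCPUs, memory in GB) available to a job run."""
    vcpus, memory_gb = WORKER_SPECS[worker_type]
    return vcpus * number_of_workers, memory_gb * number_of_workers

print(job_capacity("G.1X", 10))  # (40, 160) -> 40 vCPUs, 160 GB of RAM
print(job_capacity("G.2X", 10))  # (80, 320) -> double the G.1X figures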

In April 2022, Auto Scaling for AWS Glue was released for AWS Glue version 3.0 and later, which includes AWS Glue for Apache Spark and streaming jobs. Enabling auto scaling on your Glue for Apache Spark jobs will allow you to only allocate workers as needed, up to the worker maximum you specify. We recommend enabling auto scaling for your AWS Glue 3.0 & 4.0 jobs because this feature will help reduce cost and optimize your ETL jobs.
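
If you prefer to enable auto scaling programmatically rather than through the console, the following is a minimal sketch with boto3. It relies on the documented --enable-auto-scaling job parameter; the job name is hypothetical, and only a subset of updatable fields is copied from get_job into update_job, so adapt it to your job definition.

import boto3

glue = boto3.client("glue")
job_name = "my-etl-job"  # hypothetical job name

# update_job replaces the job definition, so start from the current one.
job = glue.get_job(JobName=job_name)["Job"]

default_args = dict(job.get("DefaultArguments", {}))
default_args["--enable-auto-scaling"] = "true"

glue.update_job(
    JobName=job_name,
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": default_args,
        "GlueVersion": job.get("GlueVersion", "4.0"),
        "WorkerType": job.get("WorkerType", "G.1X"),
        # With auto scaling enabled, this acts as the maximum number of workers.
        "NumberOfWorkers": job.get("NumberOfWorkers", 10),
    },
)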

Amazon CloudWatch metrics are also a great way to monitor your AWS Glue environment by creating alarms for certain metrics like average CPU or memory usage. To learn more about how to use CloudWatch metrics with AWS Glue, refer to Monitoring AWS Glue using Amazon CloudWatch metrics.
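
For example, you could alarm on driver heap usage with a sketch like the following. The metric name, dimensions, threshold, and SNS topic ARN here are assumptions; verify them against the CloudWatch metrics documentation linked above before relying on the alarm.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumed Glue profiling metric and dimensions for a job named "my-etl-job".
cloudwatch.put_metric_alarm(
    AlarmName="glue-my-etl-job-high-driver-memory",
    Namespace="Glue",
    MetricName="glue.driver.jvm.heap.usage",
    Dimensions=[
        {"Name": "JobName", "Value": "my-etl-job"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "gauge"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.9,  # 90 percent of driver heap usage
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:glue-alerts"],  # hypothetical topic
)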

The following solution provides a simple way to set AWS Glue worker and job duration thresholds, configure monitoring, and receive emails for notifications on how your AWS Glue environment is performing. If a job run completes, fails, or times out and the worker or job duration thresholds were exceeded, the solution notifies you after the run ends.

Solution overview

The following diagram illustrates the solution architecture.

Solution Architecture

When you deploy this application via AWS Serverless Application Model (AWS SAM), it will ask what AWS Glue worker and job duration thresholds you would like to set to monitor the AWS Glue for Apache Spark and AWS Glue for Ray jobs running in that account. The solution will use these values as the decision criteria when invoked. The following is a breakdown of each step in the architecture:

  1. Any AWS Glue for Apache Spark job that succeeds, fails, stops, or times out is sent to Amazon EventBridge.
  2. EventBridge picks up the event from AWS Glue and triggers an AWS Lambda function.
  3. The Lambda function processes the event and determines if the data and analytics team should be notified about the particular job run (a simplified sketch of this check follows the list). The function performs the following tasks:
    1. The function sends an email using Amazon Simple Notification Service (Amazon SNS) if needed.
      • If the AWS Glue job succeeded or was stopped without going over the worker or job duration thresholds, or is tagged to not be monitored, no alerts or notifications are sent.
      • If the job succeeded but ran with a worker count or job duration higher than the allowed threshold, or the job either failed or timed out, Amazon SNS sends a notification to the designated email with information about the AWS Glue job, run ID, and reason for alerting, along with a link to the specific run ID on the AWS Glue console.
    2. The function logs the job run information to Amazon DynamoDB for a weekly aggregated report delivered by email. The DynamoDB table has Time to Live (TTL) enabled with a 7-day expiry, which keeps storage to a minimum.
  4. Once a week, the data within DynamoDB is aggregated by a separate Lambda function with meaningful information like longest-running jobs, number of retries, failures, timeouts, cost analysis, and more.
  5. Amazon Simple Email Service (Amazon SES) is used to deliver the report because the email can be formatted more richly than with Amazon SNS. The email is formatted as HTML output that provides tables for the aggregated job run data.
  6. The data and analytics team is notified about the ongoing job runs through Amazon SNS, and they receive the weekly aggregation report through Amazon SES.
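The following is a simplified, hypothetical sketch of the kind of check the Lambda function in step 3 performs. It is not the solution's actual code: it assumes a Glue Job State Change event delivered by EventBridge, looks up the run details with get_job_run, and publishes to an SNS topic whose ARN and thresholds are read from assumed environment variables.

import os
import boto3

glue = boto3.client("glue")
sns = boto3.client("sns")

# Hypothetical configuration; the deployed solution supplies these values differently.
WORKER_THRESHOLD = int(os.environ.get("GLUE_JOB_WORKER_THRESHOLD", "10"))
DURATION_THRESHOLD_MIN = int(os.environ.get("GLUE_JOB_DURATION_THRESHOLD", "480"))
SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]

def handler(event, context):
    # EventBridge delivers the Glue Job State Change event in event["detail"].
    detail = event["detail"]
    job_name, run_id, state = detail["jobName"], detail["jobRunId"], detail["state"]

    # Look up the worker count and execution time for this run.
    run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
    workers = run.get("NumberOfWorkers", 0)
    duration_min = run.get("ExecutionTime", 0) / 60

    reasons = []
    if state in ("FAILED", "TIMEOUT"):
        reasons.append(f"job run ended with state {state}")
    if workers > WORKER_THRESHOLD:
        reasons.append(f"{workers} workers exceeds the threshold of {WORKER_THRESHOLD}")
    if duration_min > DURATION_THRESHOLD_MIN:
        reasons.append(f"{duration_min:.0f} minutes exceeds the threshold of {DURATION_THRESHOLD_MIN} minutes")

    if reasons:
        sns.publish(
            TopicArn=SNS_TOPIC_ARN,
            Subject=f"AWS Glue job alert: {job_name}",
            Message=f"Run {run_id}: " + "; ".join(reasons),
        )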

Note that AWS Glue Python shell and streaming ETL jobs are out of scope for this solution and therefore are not monitored.

Prerequisites

You must have the following prerequisites:

  • An AWS account to deploy the solution to
  • Proper AWS Identity and Access Management (IAM) privileges to create the resources
  • The AWS SAM CLI installed, to build and deploy the solution in your AWS environment

Deploy the solution

This AWS SAM application provisions the following resources:

  • Two EventBridge rules
  • Two Lambda functions
  • An SNS topic and subscription
  • A DynamoDB table
  • An SES subscription
  • The required IAM roles and policies

To deploy the AWS SAM application, complete the following steps:

Clone the aws-samples GitHub repository:

git clone https://github.com/aws-samples/aws-glue-job-tracker.git

Deploy the AWS SAM application:

cd aws-glue-job-tracker
sam deploy --guided

sam deploy configuration

Provide the following parameters:

  • GlueJobWorkerThreshold – Enter the maximum number of workers you want an AWS Glue job to be able to run with before sending a threshold alert. The default is 10. An alert will be sent if a Glue job runs with more workers than specified.
  • GlueJobDurationThreshold – Enter the maximum duration in minutes you want an AWS Glue job to run before sending a threshold alert. The default is 480 minutes (8 hours). An alert will be sent if a Glue job runs longer than the specified duration.
  • GlueJobNotifications – Enter an email or distribution list of those who need to be notified through Amazon SNS and Amazon SES. You can go to the SNS topic after the deployment is complete and add emails as needed.

To receive emails from Amazon SNS and Amazon SES, you must confirm your subscriptions. After the stack is deployed, check the email address that you specified in the template and confirm each subscription by choosing the link in the message. When the application is successfully provisioned, it will begin monitoring your AWS Glue for Apache Spark job environment. The next time a job fails, times out, or exceeds a specified threshold, you will receive an email via Amazon SNS. For example, the following screenshot shows an SNS message about a job that succeeded but violated the job duration threshold.

You might have jobs that need to run at a higher worker or job duration threshold, and you don't want the solution to evaluate them. You can simply tag that job with the key remediate and the value false. The step function will still be invoked, but will use the PASS state when it recognizes the tag. For more information on job tagging, refer to AWS tags in AWS Glue.
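For example, you could add this tag with Boto3 as shown in the following sketch. The Region, account ID, and job name in the ARN are placeholders.

import boto3

glue = boto3.client("glue")

# Exclude a job from threshold evaluation by tagging it with remediate=false.
# The ARN below uses a placeholder Region, account ID, and job name.
glue.tag_resource(
    ResourceArn="arn:aws:glue:us-east-1:123456789012:job/my-large-backfill-job",
    TagsToAdd={"remediate": "false"},
)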

Adding tags to glue job configuration

Configure weekly reporting

As mentioned previously, when an AWS Glue for Apache Spark job succeeds, fails, times out, or is stopped, EventBridge forwards this event to Lambda, where it logs specific information about each job run. Once a week, a separate Lambda function queries DynamoDB and aggregates your job runs to provide meaningful insights and recommendations about your AWS Glue for Apache Spark environment. This report is sent via email in a tabular format, as shown in the following screenshot. It's meant for top-level visibility so you're able to see your longest job runs over time, jobs that have had many retries, failures, and more. It also provides an overall cost calculation as an estimate of what each AWS Glue job will cost for that week. It should not be treated as a guaranteed cost. If you would like to see the exact cost per job, the AWS Cost and Usage Report is the best resource to use. The following screenshot shows one table (of five total) from the AWS Glue report function.
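Conceptually, the weekly report function scans the DynamoDB table and aggregates the logged runs per job. The following is a hypothetical sketch of that idea, not the solution's actual code; the table name and attribute names are assumptions for illustration only.

import boto3
import pandas as pd

# Hypothetical weekly aggregation: scan the job-run table and summarize per job.
# Table and attribute names are illustrative, not the solution's actual schema.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("glue-job-runs")

items = []
response = table.scan()
items.extend(response["Items"])
while "LastEvaluatedKey" in response:  # paginate through the full table
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

df = pd.DataFrame(items)
report = (
    df.groupby("job_name")
      .agg(runs=("run_id", "count"),
           failures=("state", lambda s: (s == "FAILED").sum()),
           total_minutes=("execution_minutes", "sum"),
           estimated_cost_usd=("estimated_cost", "sum"))
      .sort_values("estimated_cost_usd", ascending=False)
)
print(report.to_html())  # the solution renders similar tables into the SES email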

weekly report

Clean up

If you don’t want to run the solution anymore, delete the AWS SAM application for each account that it was provisioned in. To delete your AWS SAM stack, run the following command from your project directory:

sam delete

Conclusion

In this post, we discussed how you can monitor and cost-optimize your AWS Glue job configurations to comply with organizational standards and policy. This method can provide cost controls over AWS Glue jobs across your organization. Some other ways to help control the costs of your AWS Glue for Apache Spark jobs include the newly released AWS Glue Flex jobs and Auto Scaling. We also provided an AWS SAM application as a solution to deploy into your accounts. We encourage you to review the resources provided in this post to continue learning about AWS Glue. To learn more about monitoring and optimizing for cost using AWS Glue, please visit this recent blog. It goes in depth on all of the cost optimization options and includes a template that builds a CloudWatch dashboard for you with metrics about all of your Glue job runs.


About the authors

Michael Hamilton is a Sr Analytics Solutions Architect focusing on helping enterprise customers in the south east modernize and simplify their analytics workloads on AWS. He enjoys mountain biking and spending time with his wife and three children when not working.

Angus Ferguson is a Solutions Architect at AWS who is passionate about meeting customers across the world, helping them solve their technical challenges. Angus specializes in Data & Analytics with a focus on customers in the financial services industry.

Simplify AWS Glue job orchestration and monitoring with Amazon MWAA

Post Syndicated from Rushabh Lokhande original https://aws.amazon.com/blogs/big-data/simplify-aws-glue-job-orchestration-and-monitoring-with-amazon-mwaa/

Organizations across all industries have complex data processing requirements for their analytical use cases across different analytics systems, such as data lakes on AWS, data warehouses (Amazon Redshift), search (Amazon OpenSearch Service), NoSQL (Amazon DynamoDB), machine learning (Amazon SageMaker), and more. Analytics professionals are tasked with deriving value from data stored in these distributed systems to create better, secure, and cost-optimized experiences for their customers. For example, digital media companies seek to combine and process datasets in internal and external databases to build unified views of their customer profiles, spur ideas for innovative features, and increase platform engagement.

In these scenarios, customers looking for a serverless data integration offering use AWS Glue as a core component for processing and cataloging data. AWS Glue is well integrated with AWS services and partner products, and provides low-code/no-code extract, transform, and load (ETL) options to enable analytics, machine learning (ML), or application development workflows. AWS Glue ETL jobs may be one component in a more complex pipeline. Orchestrating the run of and managing dependencies between these components is a key capability in a data strategy. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) orchestrates data pipelines using distributed technologies including on-premises resources, AWS services, and third-party components.

In this post, we show how to simplify monitoring an AWS Glue job orchestrated by Airflow using the latest features of Amazon MWAA.

Overview of solution

This post discusses the following:

  • How to upgrade an Amazon MWAA environment to version 2.4.3.
  • How to orchestrate an AWS Glue job from an Airflow Directed Acyclic Graph (DAG).
  • The Airflow Amazon provider package’s observability enhancements in Amazon MWAA. You can now consolidate run logs of AWS Glue jobs on the Airflow console to simplify troubleshooting data pipelines. The Amazon MWAA console becomes a single reference to monitor and analyze AWS Glue job runs. Previously, support teams needed to access the AWS Management Console and take manual steps for this visibility. This feature is available by default from Amazon MWAA version 2.4.3.

The following diagram illustrates our solution architecture.

Prerequisites

You need the following prerequisites:

Set up the Amazon MWAA environment

For instructions on creating your environment, refer to Create an Amazon MWAA environment. For existing users, we recommend upgrading to version 2.4.3 to take advantage of the observability enhancements featured in this post.

The steps to upgrade Amazon MWAA to version 2.4.3 differ depending on whether the current version is 1.10.12 or 2.x (2.0.2 or 2.2.2). We discuss both options in this post.

Prerequisites for setting up an Amazon MWAA environment

You must meet the following prerequisites:

Upgrade from version 1.10.12 to 2.4.3

If you’re using Amazon MWAA version 1.10.12, refer to Migrating to a new Amazon MWAA environment to upgrade to 2.4.3.

Upgrade from version 2.0.2 or 2.2.2 to 2.4.3

If you’re using Amazon MWAA environment version 2.2.2 or lower, complete the following steps:

  1. Create a requirements.txt for any custom dependencies with specific versions required for your DAGs.
  2. Upload the file to Amazon S3 in the appropriate location where the Amazon MWAA environment points to the requirements.txt for installing dependencies.
  3. Follow the steps in Migrating to a new Amazon MWAA environment and select version 2.4.3.

Update your DAGs

Customers who upgraded from an older Amazon MWAA environment may need to make updates to existing DAGs. In Airflow version 2.4.3, the Airflow environment will use the Amazon provider package version 6.0.0 by default. This package may include some potentially breaking changes, such as changes to operator names. For example, the AWSGlueJobOperator has been deprecated and replaced with the GlueJobOperator. To maintain compatibility, update your Airflow DAGs by replacing any deprecated or unsupported operators from previous versions with the new ones. Complete the following steps:

  1. Navigate to Amazon AWS Operators.
  2. Select the appropriate version installed in your Amazon MWAA instance (6.0.0 by default) to find a list of supported Airflow operators.
  3. Make the necessary changes in the existing DAG code and upload the modified files to the DAG location in Amazon S3.
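For example, a DAG that previously used the deprecated operator typically needs only its import and class name updated; the task arguments can usually stay the same. The following is a minimal illustration with a placeholder job name, not a complete pipeline.

from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
# Older DAGs imported the deprecated Glue operator from the same module path;
# remove that import and switch to GlueJobOperator as shown here.

with DAG(
    dag_id="example_glue_operator_update",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    run_job = GlueJobOperator(
        task_id="run_glue_job",
        job_name="my-job",  # placeholder AWS Glue job name
    )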

Orchestrate the AWS Glue job from Airflow

This section covers the details of orchestrating an AWS Glue job within Airflow DAGs. Airflow eases the development of data pipelines with dependencies between heterogeneous systems such as on-premises processes, external dependencies, other AWS services, and more.

Orchestrate CloudTrail log aggregation with AWS Glue and Amazon MWAA

In this example, we go through a use case of using Amazon MWAA to orchestrate an AWS Glue Python Shell job that persists aggregated metrics based on CloudTrail logs.

CloudTrail enables visibility into AWS API calls that are being made in your AWS account. A common use case with this data would be to gather usage metrics on principals acting on your account’s resources for auditing and regulatory needs.

As CloudTrail events are being logged, they are delivered as JSON files in Amazon S3, which aren’t ideal for analytical queries. We want to aggregate this data and persist it as Parquet files to allow for optimal query performance. As an initial step, we can use Athena to do the initial querying of the data before doing additional aggregations in our AWS Glue job. For more information about creating an AWS Glue Data Catalog table, refer to Creating the table for CloudTrail logs in Athena using partition projection data. After we’ve explored the data via Athena and decided what metrics we want to retain in aggregate tables, we can create an AWS Glue job.

Create a CloudTrail table in Athena

First, we need to create a table in our Data Catalog that allows CloudTrail data to be queried via Athena. The following sample query creates a table with two partitions on the Region and date (called snapshot_date). Be sure to replace the placeholders for your CloudTrail bucket, AWS account ID, and CloudTrail table name:

create external table if not exists `<<<CLOUDTRAIL_TABLE_NAME>>>`(
  `eventversion` string comment 'from deserializer', 
  `useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> comment 'from deserializer', 
  `eventtime` string comment 'from deserializer', 
  `eventsource` string comment 'from deserializer', 
  `eventname` string comment 'from deserializer', 
  `awsregion` string comment 'from deserializer', 
  `sourceipaddress` string comment 'from deserializer', 
  `useragent` string comment 'from deserializer', 
  `errorcode` string comment 'from deserializer', 
  `errormessage` string comment 'from deserializer', 
  `requestparameters` string comment 'from deserializer', 
  `responseelements` string comment 'from deserializer', 
  `additionaleventdata` string comment 'from deserializer', 
  `requestid` string comment 'from deserializer', 
  `eventid` string comment 'from deserializer', 
  `resources` array<struct<arn:string,accountid:string,type:string>> comment 'from deserializer', 
  `eventtype` string comment 'from deserializer', 
  `apiversion` string comment 'from deserializer', 
  `readonly` string comment 'from deserializer', 
  `recipientaccountid` string comment 'from deserializer', 
  `serviceeventdetails` string comment 'from deserializer', 
  `sharedeventid` string comment 'from deserializer', 
  `vpcendpointid` string comment 'from deserializer')
PARTITIONED BY ( 
  `region` string,
  `snapshot_date` string)
ROW FORMAT SERDE 
  'com.amazon.emr.hive.serde.CloudTrailSerde' 
STORED AS INPUTFORMAT 
  'com.amazon.emr.cloudtrail.CloudTrailInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<<<CLOUDTRAIL_BUCKET>>>/AWSLogs/<<<ACCOUNT_ID>>>/CloudTrail/'
TBLPROPERTIES (
  'projection.enabled'='true', 
  'projection.region.type'='enum',
  'projection.region.values'='us-east-2,us-east-1,us-west-1,us-west-2,af-south-1,ap-east-1,ap-south-1,ap-northeast-3,ap-northeast-2,ap-southeast-1,ap-southeast-2,ap-northeast-1,ca-central-1,eu-central-1,eu-west-1,eu-west-2,eu-south-1,eu-west-3,eu-north-1,me-south-1,sa-east-1',
  'projection.snapshot_date.format'='yyyy/MM/dd', 
  'projection.snapshot_date.interval'='1', 
  'projection.snapshot_date.interval.unit'='days', 
  'projection.snapshot_date.range'='2020/10/01,now', 
  'projection.snapshot_date.type'='date',
  'storage.location.template'='s3://<<<CLOUDTRAIL_BUCKET>>>/AWSLogs/<<<ACCOUNT_ID>>>/CloudTrail/${region}/${snapshot_date}')

Run the preceding query on the Athena console, and note the table name and AWS Glue Data Catalog database where it was created. We use these values later in the Airflow DAG code.
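Optionally, you can sanity-check the new table before wiring it into the pipeline. The following sketch uses the AWS SDK for pandas (the same library the AWS Glue job uses later); the database, table, and snapshot_date values are placeholders.

import awswrangler as wr

# Quick check that partition projection resolves and the CloudTrail table is queryable.
# Database, table, and snapshot_date values are placeholders.
df = wr.athena.read_sql_query(
    sql="""
        select region, count(*) as events
        from "my_cloudtrail_db"."cloudtrail_logs"
        where snapshot_date = '2023/05/01' and region = 'us-east-1'
        group by region
    """,
    database="my_cloudtrail_db",
    ctas_approach=False,
)
print(df)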

Sample AWS Glue job code

The following code is a sample AWS Glue Python Shell job that does the following:

  • Takes arguments (which we pass from our Amazon MWAA DAG) on what day’s data to process
  • Uses the AWS SDK for Pandas to run an Athena query to do the initial filtering of the CloudTrail JSON data outside AWS Glue
  • Uses Pandas to do simple aggregations on the filtered data
  • Outputs the aggregated data to the AWS Glue Data Catalog in a table
  • Uses logging during processing, which will be visible in Amazon MWAA
import awswrangler as wr
import pandas as pd
import sys
import logging
from awsglue.utils import getResolvedOptions
from datetime import datetime, timedelta

# Logging setup, redirects all logs to stdout
LOGGER = logging.getLogger()
formatter = logging.Formatter('%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s')
streamHandler = logging.StreamHandler(sys.stdout)
streamHandler.setFormatter(formatter)
LOGGER.addHandler(streamHandler)
LOGGER.setLevel(logging.INFO)

LOGGER.info(f"Passed Args :: {sys.argv}")

sql_query_template = """
select
region,
useridentity.arn,
eventsource,
eventname,
useragent

from "{cloudtrail_glue_db}"."{cloudtrail_table}"
where snapshot_date='{process_date}'
and region in ('us-east-1','us-east-2')
"""

required_args = ['CLOUDTRAIL_GLUE_DB',
                'CLOUDTRAIL_TABLE',
                'TARGET_BUCKET',
                'TARGET_DB',
                'TARGET_TABLE',
                'ACCOUNT_ID']
arg_keys = [*required_args, 'PROCESS_DATE'] if '--PROCESS_DATE' in sys.argv else required_args
JOB_ARGS = getResolvedOptions(sys.argv, arg_keys)

LOGGER.info(f"Parsed Args :: {JOB_ARGS}")

# if process date was not passed as an argument, process yesterday's data
process_date = (
    JOB_ARGS['PROCESS_DATE']
    if JOB_ARGS.get('PROCESS_DATE','NONE') != "NONE" 
    else (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d") 
)

LOGGER.info(f"Taking snapshot for :: {process_date}")

RAW_CLOUDTRAIL_DB = JOB_ARGS['CLOUDTRAIL_GLUE_DB']
RAW_CLOUDTRAIL_TABLE = JOB_ARGS['CLOUDTRAIL_TABLE']
TARGET_BUCKET = JOB_ARGS['TARGET_BUCKET']
TARGET_DB = JOB_ARGS['TARGET_DB']
TARGET_TABLE = JOB_ARGS['TARGET_TABLE']
ACCOUNT_ID = JOB_ARGS['ACCOUNT_ID']

final_query = sql_query_template.format(
    process_date=process_date.replace("-","/"),
    cloudtrail_glue_db=RAW_CLOUDTRAIL_DB,
    cloudtrail_table=RAW_CLOUDTRAIL_TABLE
)

LOGGER.info(f"Running Query :: {final_query}")

raw_cloudtrail_df = wr.athena.read_sql_query(
    sql=final_query,
    database=RAW_CLOUDTRAIL_DB,
    ctas_approach=False,
    s3_output=f"s3://{TARGET_BUCKET}/athena-results",
)

raw_cloudtrail_df['ct']=1

agg_df = raw_cloudtrail_df.groupby(['arn','region','eventsource','eventname','useragent'],as_index=False).agg({'ct':'sum'})
agg_df['snapshot_date']=process_date

LOGGER.info(agg_df.info(verbose=True))

upload_path = f"s3://{TARGET_BUCKET}/{TARGET_DB}/{TARGET_TABLE}"

if not agg_df.empty:
    LOGGER.info(f"Upload to {upload_path}")
    try:
        response = wr.s3.to_parquet(
            df=agg_df,
            path=upload_path,
            dataset=True,
            database=TARGET_DB,
            table=TARGET_TABLE,
            mode="overwrite_partitions",
            schema_evolution=True,
            partition_cols=["snapshot_date"],
            compression="snappy",
            index=False
        )
        LOGGER.info(response)
    except Exception as exc:
        LOGGER.error("Uploading to S3 failed")
        LOGGER.exception(exc)
        raise exc
else:
    LOGGER.info(f"Dataframe was empty, nothing to upload to {upload_path}")

The following are some key advantages of this AWS Glue job:

  • We use an Athena query to ensure initial filtering is done outside of our AWS Glue job. As such, a Python Shell job with minimal compute is still sufficient for aggregating a large CloudTrail dataset.
  • We turn on the common analytics libraries option (the library-set job parameter set to analytics) when creating the AWS Glue job so that it can use the AWS SDK for pandas library.

Create an AWS Glue job

Complete the following steps to create your AWS Glue job:

  1. Copy the script in the preceding section and save it in a local file. For this post, the file is called script.py.
  2. On the AWS Glue console, choose ETL jobs in the navigation pane.
  3. Create a new job and select Python Shell script editor.
  4. Select Upload and edit an existing script and upload the file you saved locally.
  5. Choose Create.

  6. On the Job details tab, enter a name for your AWS Glue job.
  7. For IAM role, choose an existing role or create a new role that has the required permissions for Amazon S3, AWS Glue, and Athena. The role needs to query the CloudTrail table you created earlier and write to an output location.

You can use the following sample policy code. Replace the placeholders with your CloudTrail logs bucket, output table name, output AWS Glue database, output S3 bucket, CloudTrail table name, AWS Glue database containing the CloudTrail table, and your AWS account ID.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:List*",
                "s3:Get*"
            ],
            "Resource": [
                "arn:aws:s3:::<<<CLOUDTRAIL_LOGS_BUCKET>>>/*",
                "arn:aws:s3:::<<<CLOUDTRAIL_LOGS_BUCKET>>>*"
            ],
            "Effect": "Allow",
            "Sid": "GetS3CloudtrailData"
        },
        {
            "Action": [
                "glue:Get*",
                "glue:BatchGet*"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:catalog",
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:database/<<<GLUE_DB_WITH_CLOUDTRAIL_TABLE>>>",
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:table/<<<GLUE_DB_WITH_CLOUDTRAIL_TABLE>>>/<<<CLOUDTRAIL_TABLE>>>*"
            ],
            "Effect": "Allow",
            "Sid": "GetGlueCatalogCloudtrailData"
        },
        {
            "Action": [
                "s3:PutObject*",
                "s3:Abort*",
                "s3:DeleteObject*",
                "s3:GetObject*",
                "s3:GetBucket*",
                "s3:List*",
                "s3:Head*"
            ],
            "Resource": [
                "arn:aws:s3:::<<<OUTPUT_S3_BUCKET>>>",
                "arn:aws:s3:::<<<OUTPUT_S3_BUCKET>>>/<<<OUTPUT_GLUE_DB>>>/<<<OUTPUT_TABLE_NAME>>>/*"
            ],
            "Effect": "Allow",
            "Sid": "WriteOutputToS3"
        },
        {
            "Action": [
                "glue:CreateTable",
                "glue:CreatePartition",
                "glue:UpdatePartition",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:DeletePartition",
                "glue:BatchCreatePartition",
                "glue:BatchDeletePartition",
                "glue:Get*",
                "glue:BatchGet*"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:catalog",
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:database/<<<OUTPUT_GLUE_DB>>>",
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:table/<<<OUTPUT_GLUE_DB>>>/<<<OUTPUT_TABLE_NAME>>>*"
            ],
            "Effect": "Allow",
            "Sid": "AllowOutputToGlue"
        },
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:/aws-glue/*",
            "Effect": "Allow",
            "Sid": "LogsAccess"
        },
        {
            "Action": [
                "s3:GetObject*",
                "s3:GetBucket*",
                "s3:List*",
                "s3:DeleteObject*",
                "s3:PutObject",
                "s3:PutObjectLegalHold",
                "s3:PutObjectRetention",
                "s3:PutObjectTagging",
                "s3:PutObjectVersionTagging",
                "s3:Abort*"
            ],
            "Resource": [
                "arn:aws:s3:::<<<ATHENA_RESULTS_BUCKET>>>",
                "arn:aws:s3:::<<<ATHENA_RESULTS_BUCKET>>>/*"
            ],
            "Effect": "Allow",
            "Sid": "AccessToAthenaResults"
        },
        {
            "Action": [
                "athena:StartQueryExecution",
                "athena:StopQueryExecution",
                "athena:GetDataCatalog",
                "athena:GetQueryResults",
                "athena:GetQueryExecution"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:catalog",
                "arn:aws:athena:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:datacatalog/AwsDataCatalog",
                "arn:aws:athena:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:workgroup/primary"
            ],
            "Effect": "Allow",
            "Sid": "AllowAthenaQuerying"
        }
    ]
}

  8. For Python version, choose Python 3.9.

  9. Select Load common analytics libraries.
  10. For Data processing units, choose 1 DPU.
  11. Leave the other options as default or adjust as needed.

  12. Choose Save to save your job configuration.
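If you prefer to create the job programmatically, the following Boto3 sketch is roughly equivalent to the console steps above. The job name, IAM role, and script location are placeholders; adjust them for your account.

import boto3

glue = boto3.client("glue")

# Sketch of the console configuration above expressed as a Boto3 call.
# Name, Role, and ScriptLocation are placeholders.
glue.create_job(
    Name="cloudtrail-aggregation-job",
    Role="arn:aws:iam::123456789012:role/MyGlueCloudTrailRole",
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://my-bucket/scripts/script.py",
        "PythonVersion": "3.9",
    },
    MaxCapacity=1.0,  # 1 DPU, matching the console choice
    DefaultArguments={"--library-set": "analytics"},  # loads the common analytics libraries
)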

Configure an Amazon MWAA DAG to orchestrate the AWS Glue job

The following code is for a DAG that can orchestrate the AWS Glue job that we created. The DAG uses the GlueJobOperator from the Amazon provider package, passes a PROCESS_DATE argument resolved from dag_run.conf so you can run backfills for specific dates, and sets verbose=True so that the AWS Glue job logs are relayed to the Airflow task logs:

"""Sample DAG"""
import airflow.utils
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow import DAG
from datetime import timedelta
import airflow.utils

# allow backfills via DAG run parameters
process_date = '{{ dag_run.conf.get("process_date") if dag_run.conf.get("process_date") else "NONE" }}'

dag = DAG(
    dag_id = "CLOUDTRAIL_LOGS_PROCESSING",
    default_args = {
        'depends_on_past':False, 
        'start_date':airflow.utils.dates.days_ago(0),
        'retries':1,
        'retry_delay':timedelta(minutes=5),
        'catchup': False
    },
    schedule_interval = None, # None for unscheduled or a cron expression - E.G. "00 12 * * 2" - at 12noon Tuesday
    dagrun_timeout = timedelta(minutes=30),
    max_active_runs = 1,
    max_active_tasks = 1 # since there is only one task in our DAG
)

## Log ingest. Assumes Glue Job is already created
glue_ingestion_job = GlueJobOperator(
    task_id="<<<some-task-id>>>",
    job_name="<<<GLUE_JOB_NAME>>>",
    script_args={
        "--ACCOUNT_ID":"<<<YOUR_AWS_ACCT_ID>>>",
        "--CLOUDTRAIL_GLUE_DB":"<<<GLUE_DB_WITH_CLOUDTRAIL_TABLE>>>",
        "--CLOUDTRAIL_TABLE":"<<<CLOUDTRAIL_TABLE>>>",
        "--TARGET_BUCKET": "<<<OUTPUT_S3_BUCKET>>>",
        "--TARGET_DB": "<<<OUTPUT_GLUE_DB>>>", # should already exist
        "--TARGET_TABLE": "<<<OUTPUT_TABLE_NAME>>>",
        "--PROCESS_DATE": process_date
    },
    region_name="us-east-1",
    dag=dag,
    verbose=True
)

glue_ingestion_job

Increase observability of AWS Glue jobs in Amazon MWAA

The AWS Glue jobs write logs to Amazon CloudWatch. With the recent observability enhancements to Airflow’s Amazon provider package, these logs are now integrated with Airflow task logs. This consolidation provides Airflow users with end-to-end visibility directly in the Airflow UI, eliminating the need to search in CloudWatch or the AWS Glue console.

To use this feature, ensure the IAM role attached to the Amazon MWAA environment has the following permissions to retrieve and write the necessary logs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:GetLogEvents",
        "logs:GetLogRecord",
        "logs:DescribeLogStreams",
        "logs:FilterLogEvents",
        "logs:GetLogGroupFields",
        "logs:GetQueryResults",
        
      ],
      "Resource": [
        "arn:aws:logs:*:*:log-group:airflow-243-<<<Your environment name>>>-*"--Your Amazon MWAA Log Stream Name
      ]
    }
  ]
}

If verbose=True, the AWS Glue job run logs appear in the Airflow task logs. The default is False. For more information, refer to Parameters.

When enabled, the DAG reads from the AWS Glue job's CloudWatch log streams and relays the log events to the Airflow task logs for the AWS Glue job step. This provides detailed insight into an AWS Glue job's run in real time via the DAG logs. Note that AWS Glue jobs generate an output and an error CloudWatch log group based on the job's STDOUT and STDERR, respectively. All logs in the output log group and exception or error logs from the error log group are relayed into Amazon MWAA.

AWS admins can now limit a support team's access to only Airflow, making Amazon MWAA the single pane of glass for job orchestration and job health management. Previously, users needed to check the AWS Glue job run status in the Airflow DAG steps and retrieve the job run identifier. They then needed to access the AWS Glue console to find the job run history, search for the job of interest using the identifier, and finally navigate to the job's CloudWatch logs to troubleshoot.

Create the DAG

To create the DAG, complete the following steps:

  1. Save the preceding DAG code to a local .py file, replacing the indicated placeholders.

The values for your AWS account ID, AWS Glue job name, AWS Glue database with CloudTrail table, and CloudTrail table name should already be known. You can adjust the output S3 bucket, output AWS Glue database, and output table name as needed, but make sure the AWS Glue job’s IAM role that you used earlier is configured accordingly.

  2. On the Amazon MWAA console, navigate to your environment to see where the DAG code is stored.

The DAGs folder is the prefix within the S3 bucket where your DAG file should be placed.

  3. Upload your edited file there.

  4. Open the Amazon MWAA console to confirm that the DAG appears in the table.

Run the DAG

To run the DAG, complete the following steps:

  1. Choose from the following options:
    • Trigger DAG – This causes yesterday’s data to be used as the data to process
    • Trigger DAG w/ config – With this option, you can pass in a different date, potentially for backfills, which is retrieved using dag_run.conf in the DAG code and then passed into the AWS Glue job as a parameter

The following screenshot shows the additional configuration options if you choose Trigger DAG w/ config.
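For reference, the configuration passed with Trigger DAG w/ config is a small JSON document whose process_date key flows through dag_run.conf into the AWS Glue job's --PROCESS_DATE argument. The date below is a placeholder; the script expects the YYYY-MM-DD format and converts it to the table's YYYY/MM/DD snapshot_date partition format.

# Example DAG run configuration for a backfill, supplied as JSON in the
# "Trigger DAG w/ config" dialog (the date value is a placeholder).
dag_run_config = {"process_date": "2023-05-01"}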

  2. Monitor the DAG as it runs.
  3. When the DAG is complete, open the run's details.

On the right pane, you can view the logs, or choose Task Instance Details for a full view.

  4. View the AWS Glue job output logs in Amazon MWAA without using the AWS Glue console, thanks to the GlueJobOperator verbose flag.

The AWS Glue job will have written results to the output table you specified.

  5. Query this table via Athena to confirm it was successful.

Summary

Amazon MWAA now provides a single place to track AWS Glue job status and enables you to use the Airflow console as the single pane of glass for job orchestration and health management. In this post, we walked through the steps to orchestrate AWS Glue jobs via Airflow using GlueJobOperator. With the new observability enhancements, you can seamlessly troubleshoot AWS Glue jobs in a unified experience. We also demonstrated how to upgrade your Amazon MWAA environment to a compatible version, update dependencies, and change the IAM role policy accordingly.

For more information about common troubleshooting steps, refer to Troubleshooting: Creating and updating an Amazon MWAA environment. For in-depth details of migrating to an Amazon MWAA environment, refer to Upgrading from 1.10 to 2. To learn about the open-source code changes that increase observability of AWS Glue jobs in the Airflow Amazon provider package, refer to Relay logs from AWS Glue jobs.

Finally, we recommend visiting the AWS Big Data Blog for other material on analytics, ML, and data governance on AWS.


About the Authors

Rushabh Lokhande is a Data & ML Engineer with the AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and golf.

Ryan Gomes is a Data & ML Engineer with the AWS Professional Services Analytics Practice. He is passionate about helping customers achieve better outcomes through analytics and machine learning solutions in the cloud. Outside of work, he enjoys fitness, cooking, and spending quality time with friends and family.

Vishwa Gupta is a Senior Data Architect with the AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.