
Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation

2024-12-04 Sakti Mishra

Post Syndicated from Sakti Mishra original https://aws.amazon.com/blogs/big-data/enforce-fine-grained-access-control-on-data-lake-tables-using-aws-glue-5-0-integrated-with-aws-lake-formation/

AWS Glue 5.0 supports fine-grained access control (FGAC) based on your policies defined in AWS Lake Formation. FGAC enables you to granularly control access to your data lake resources at the table, column, and row levels. This level of control is essential for organizations that need to comply with data governance and security regulations, or those that deal with sensitive data.

Lake Formation makes it straightforward to build, secure, and manage data lakes. It allows you to define fine-grained access controls through grant and revoke statements, similar to those used with relational database management systems (RDBMS), and automatically enforce those policies using compatible engines like Amazon Athena, Apache Spark on Amazon EMR, and Amazon Redshift Spectrum. With AWS Glue 5.0, the same Lake Formation rules that you set up for use with other services like Athena now apply to your AWS Glue Spark jobs and Interactive Sessions through built-in Spark SQL and Spark DataFrames. This simplifies security and governance of your data lakes.

This post demonstrates how to enforce FGAC on AWS Glue 5.0 through Lake Formation permissions.

How FGAC works on AWS Glue 5.0

Using AWS Glue 5.0 with Lake Formation lets you enforce a layer of permissions on each Spark job to apply Lake Formation permissions control when AWS Glue runs jobs. AWS Glue uses Spark resource profiles to create two profiles to effectively run jobs. The user profile runs user-supplied code, and the system profile enforces Lake Formation policies. For more information, see the AWS Lake Formation Developer Guide.

The following diagram demonstrates a high-level overview of how AWS Glue 5.0 gets access to data protected by Lake Formation permissions.

The workflow consists of the following steps:

A user calls the StartJobRun API on a Lake Formation enabled AWS Glue job.
AWS Glue sends the job to a user driver and runs the job in the user profile. The user driver runs a lean version of Spark that has no ability to launch tasks, request executors, or access Amazon Simple Storage Service (Amazon S3) or the AWS Glue Data Catalog. It builds a job plan.
AWS Glue sets up a second driver called the system driver and runs it in the system profile (with a privileged identity). AWS Glue sets up an encrypted TLS channel between the two drivers for communication. The user driver uses the channel to send the job plans to the system driver. The system driver doesn’t run user-submitted code. It runs full Spark and communicates with Amazon S3 and the Data Catalog for data access. It requests executors and compiles the Job Plan into a sequence of execution stages.
AWS Glue then runs the stages on executors with the user driver or system driver. The user code in any stage is run exclusively on user profile executors.
Stages that read data from Data Catalog tables protected by Lake Formation or those that apply security filters are delegated to system executors.
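The routing rule in the last two steps can be pictured with a toy model. The sketch below is purely illustrative: the Stage class and the pick_profile function are hypothetical names, not part of any AWS Glue or Spark API, and the real scheduling logic is internal to AWS Glue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical description of one execution stage in a job plan."""
    name: str
    reads_protected_table: bool = False    # reads a Lake Formation protected table
    applies_security_filter: bool = False  # applies a row or column filter


def pick_profile(stage: Stage) -> str:
    """Return the profile whose executors run the stage, following the steps above."""
    if stage.reads_protected_table or stage.applies_security_filter:
        return "system"
    return "user"


plan = [
    Stage("scan raw_csv_input", reads_protected_table=True),
    Stage("apply category filter", applies_security_filter=True),
    Stage("run user transformation"),
]
for stage in plan:
    print(f"{stage.name} -> {pick_profile(stage)} executors")
```

The point of the model is only that user code never shares an executor with stages that touch protected data.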

Enable FGAC on AWS Glue 5.0

To enable Lake Formation FGAC for your AWS Glue 5.0 jobs on the AWS Glue console, complete the following steps:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Choose your job.
Choose the Job details tab.
For Glue version, choose Glue 5.0 – Supports spark 3.5, Scala 2, Python 3.
For Job parameters, add the following parameter:
1. Key: --enable-lakeformation-fine-grained-access
2. Value: true
Choose Save.
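If you manage jobs from code rather than the console, the same settings could be expressed as arguments to the boto3 Glue create_job call. The following is a minimal sketch, not a complete job definition: the role ARN and script location are placeholders, and the build_job_definition helper is a hypothetical name introduced here for illustration.

```python
FGAC_PARAMETER = "--enable-lakeformation-fine-grained-access"


def build_job_definition(name: str, role_arn: str, script_location: str) -> dict:
    """Build keyword arguments for the boto3 Glue create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "5.0",
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        # Same key and value as the Job parameters step above.
        "DefaultArguments": {FGAC_PARAMETER: "true"},
    }


job = build_job_definition(
    name="glue5-lf-demo",
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role
    script_location="s3://<bucket-name>/scripts/job.py",    # placeholder path
)
print(job["DefaultArguments"])

# To create the job for real (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_job(**job)
```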

To enable Lake Formation FGAC for your AWS Glue notebooks on the AWS Glue console, use %%configure magic:

%glue_version 5.0
%%configure
{
    "--enable-lakeformation-fine-grained-access": "true"
}

Example use case

The following diagram represents the high-level architecture of the use case we demonstrate in this post. The objective of the use case is to showcase how you can enforce Lake Formation FGAC on both CSV and Iceberg tables and configure an AWS Glue PySpark job to read from them.

The implementation consists of the following steps:

Create an S3 bucket and upload the input CSV dataset.
Create a standard Data Catalog table and an Iceberg table by reading data from the input CSV table, using an Athena CTAS query.
Use Lake Formation to enable FGAC on both CSV and Iceberg tables using row- and column-based filters.
Run two sample AWS Glue jobs to showcase how you can run a sample PySpark script in AWS Glue that respects the Lake Formation FGAC permissions, and then write the output to Amazon S3.

To demonstrate the implementation steps, we use sample product inventory data that has the following attributes:

op – The operation on the source record. This shows values I to represent insert operations, U to represent updates, and D to represent deletes.
product_id – The primary key column in the source database’s products table.
category – The product’s category, such as Electronics or Cosmetics.
product_name – The name of the product.
quantity_available – The quantity available in the inventory for a product.
last_update_time – The time when the product record was updated at the source database.

To implement this workflow, we create AWS resources such as an S3 bucket, define FGAC with Lake Formation, and build AWS Glue jobs to query those tables.

Prerequisites

Before you get started, make sure you have the following prerequisites:

An AWS account with AWS Identity and Access Management (IAM) roles as needed.
The required permissions to perform the following actions:
- Read or write to an S3 bucket.
- Create and run AWS Glue crawlers and jobs.
- Manage Data Catalog databases and tables.
- Manage Athena workgroups and run queries.
Lake Formation already set up in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. To learn more about setting up permissions for a data lake administrator role, see Create a data lake administrator.

For this post, we use the eu-west-1 AWS Region, but you can integrate it in your preferred Region if the AWS services included in the architecture are available in that Region.

Next, let’s dive into the implementation steps.

Create an S3 bucket

To create an S3 bucket for the raw input datasets and Iceberg table, complete the following steps:

On the Amazon S3 console, choose Buckets in the navigation pane.
Choose Create bucket.
Enter the bucket name (for example, glue5-lf-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}), and leave the remaining fields as default.
Choose Create bucket.
On the bucket details page, choose Create folder.
Create two subfolders: raw-csv-input and iceberg-datalake.
Upload the LOAD00000001.csv file into the raw-csv-input folder of the bucket.
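The console steps above could also be scripted. The sketch below assumes a boto3 S3 client is passed in as s3 and that the CSV file is available locally; the helper names bucket_name and prepare_bucket are hypothetical and only mirror the steps in this section.

```python
FOLDERS = ["raw-csv-input/", "iceberg-datalake/"]
CSV_KEY = "raw-csv-input/LOAD00000001.csv"


def bucket_name(account_id: str, region: str) -> str:
    """Follow the naming pattern suggested in this post."""
    return f"glue5-lf-demo-{account_id}-{region}"


def prepare_bucket(s3, bucket: str, region: str, csv_path: str) -> None:
    """Create the bucket and both folders, then upload the input CSV file.

    `s3` is expected to be a boto3 S3 client, for example boto3.client("s3").
    """
    # Regions other than us-east-1 need an explicit location constraint;
    # for us-east-1 the CreateBucketConfiguration argument must be left out.
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
    for folder in FOLDERS:
        # A zero-byte object whose key ends in "/" appears as a folder in the console.
        s3.put_object(Bucket=bucket, Key=folder)
    s3.upload_file(csv_path, bucket, CSV_KEY)


print(bucket_name("123456789012", "eu-west-1"))
```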

Create tables

To create input and output tables in the Data Catalog, complete the following steps:

On the Athena console, navigate to the query editor.

Run the following queries in sequence (provide your S3 bucket name):

-- Create database for the demo
CREATE DATABASE glue5_lf_demo;

-- Create external table in input CSV files. Replace the S3 path with your bucket name
CREATE EXTERNAL TABLE glue5_lf_demo.raw_csv_input(
 op string, 
 product_id bigint, 
 category string, 
 product_name string, 
 quantity_available bigint, 
 last_update_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<bucket-name>/raw-csv-input/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',', 
  'typeOfData'='file');
 
-- Create output Iceberg table with partitioning. Replace the S3 bucket name with your bucket name
CREATE TABLE glue5_lf_demo.iceberg_datalake WITH (
  table_type='ICEBERG',
  format='parquet',
  write_compression = 'SNAPPY',
  is_external = false,
  partitioning=ARRAY['category', 'bucket(product_id, 16)'],
  location='s3://<bucket-name>/iceberg-datalake/'
) AS SELECT * FROM glue5_lf_demo.raw_csv_input;

Run the following query to validate the raw CSV input data:
```
SELECT * FROM glue5_lf_demo.raw_csv_input;
```

The following screenshot shows the query result.

Run the following query to validate the Iceberg table data:
```
SELECT * FROM glue5_lf_demo.iceberg_datalake;
```

The following screenshot shows the query result.

This step used DDL to create table definitions. Alternatively, you can use a Data Catalog API, the AWS Glue console, the Lake Formation console, or an AWS Glue crawler.
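If you prefer to run the statements from a script instead of the Athena query editor, one option is the boto3 Athena client. This is a rough sketch under stated assumptions: athena is a boto3 Athena client, output_location is an S3 path you choose for query results, and run_query and run_in_sequence are hypothetical helper names.

```python
import time


def run_query(athena, sql: str, output_location: str, poll_seconds: float = 1.0) -> str:
    """Start one Athena query, wait for it to finish, and return its final state."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        response = athena.get_query_execution(QueryExecutionId=query_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)


def run_in_sequence(athena, statements: list, output_location: str) -> None:
    """Run statements one after another and stop at the first failure."""
    for sql in statements:
        state = run_query(athena, sql, output_location)
        if state != "SUCCEEDED":
            raise RuntimeError(f"Query ended in state {state}: {sql[:60]}")


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   run_in_sequence(
#       boto3.client("athena"),
#       ["CREATE DATABASE glue5_lf_demo", "SELECT * FROM glue5_lf_demo.raw_csv_input"],
#       "s3://<bucket-name>/athena-results/",
#   )
```

Running the statements strictly in order matters here because the CTAS query depends on the database and the CSV table already existing.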

Next, let’s configure Lake Formation permissions on the raw_csv_input table and iceberg_datalake table.

Configure Lake Formation permissions

To validate the capability, let’s define FGAC permissions for the two Data Catalog tables we created.

For the raw_csv_input table, we enable permission for specific rows, for example allow read access only for the Furniture category. Similarly, for the iceberg_datalake table, we enable a data filter for the Electronics product category and limit read access to a few columns only.
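To make the intended effect of the two filters concrete, the following plain-Python sketch mimics them on a handful of rows. The rows are hypothetical and are not taken from LOAD00000001.csv, the apply_data_filter function is an illustration rather than how Lake Formation evaluates filters, and the column list mirrors the one chosen for the Iceberg table later in this post.

```python
# Hypothetical sample rows using the attributes described earlier.
rows = [
    {"op": "I", "product_id": 1, "category": "Furniture", "product_name": "Chair",
     "quantity_available": 10, "last_update_time": "2024-01-01 10:00:00"},
    {"op": "I", "product_id": 2, "category": "Electronics", "product_name": "Radio",
     "quantity_available": 5, "last_update_time": "2024-01-01 10:05:00"},
    {"op": "U", "product_id": 3, "category": "Cosmetics", "product_name": "Soap",
     "quantity_available": 40, "last_update_time": "2024-01-01 10:10:00"},
]


def apply_data_filter(rows, category, included_columns=None):
    """Keep rows in the given category, then keep only the allowed columns."""
    result = []
    for row in rows:
        if row["category"] != category:
            continue
        if included_columns is None:
            result.append(dict(row))  # access to all columns
        else:
            result.append({column: row[column] for column in included_columns})
    return result


furniture = apply_data_filter(rows, "Furniture")
electronics = apply_data_filter(
    rows,
    "Electronics",
    ["category", "last_update_time", "op", "product_name", "quantity_available"],
)
print(furniture)
print(electronics)
```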

To configure Lake Formation permissions for the two tables, complete the following steps:

On the Lake Formation console, choose Data lake locations under Administration in the navigation pane.
Choose Register location.
For Amazon S3 path, enter the path of your S3 bucket to register the location.
For IAM role, choose your Lake Formation data access IAM role, which is not a service linked role.
For Permission mode, select Lake Formation.
Choose Register location.
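For reference, the registration above roughly corresponds to the Lake Formation RegisterResource API. The sketch below only builds the request; the role ARN is a placeholder, build_register_request is a hypothetical helper, and mapping the Lake Formation permission mode to HybridAccessEnabled=False is an assumption based on the API parameter names.

```python
def build_register_request(bucket: str, role_arn: str) -> dict:
    """Build keyword arguments for the boto3 Lake Formation register_resource call."""
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket}",
        "RoleArn": role_arn,
        "UseServiceLinkedRole": False,  # the post uses a role that is not service linked
        "HybridAccessEnabled": False,   # assumption: Lake Formation mode, not hybrid access
    }


request = build_register_request(
    "glue5-lf-demo-123456789012-eu-west-1",
    "arn:aws:iam::123456789012:role/LakeFormationDataAccessRole",  # placeholder role
)
print(request["ResourceArn"])

# To register the location for real (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("lakeformation").register_resource(**request)
```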

Grant table permissions on the standard table

The next step is to grant table permissions on the raw_csv_input table to the AWS Glue job role.

On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
Choose Grant.
For Principals, choose IAM users and roles.
For IAM users and roles, choose your IAM role that is going to be used on an AWS Glue job.
For LF-Tags or catalog resources, choose Named Data Catalog resources.
For Databases, choose glue5_lf_demo.
For Tables, choose raw_csv_input.
For Data filters, choose Create new.
In the Create data filter dialog, provide the following information:
1. For Data filter name, enter product_furniture.
2. For Column-level access, select Access to all columns.
3. Select Filter rows.
4. For Row filter expression, enter category='Furniture'.
5. Choose Create filter.

For Data filters, select the filter product_furniture you created.
For Data filter permissions, choose Select and Describe.
Choose Grant.
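The same filter and grant could be created through the Lake Formation API. The sketch below builds the two request payloads for the boto3 create_data_cells_filter and grant_permissions calls. The account ID and role ARN are placeholders, the helper names are hypothetical, and using SELECT together with DESCRIBE is an assumption that mirrors the console selection above.

```python
def build_row_filter(account_id, database, table, name, expression) -> dict:
    """Payload for create_data_cells_filter: row filter with access to all columns."""
    return {
        "TableData": {
            "TableCatalogId": account_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": name,
            "RowFilter": {"FilterExpression": expression},
            "ColumnWildcard": {},  # empty wildcard means no column is excluded
        }
    }


def build_filter_grant(account_id, database, table, filter_name, role_arn) -> dict:
    """Payload for grant_permissions on a data cells filter."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": account_id,
                "DatabaseName": database,
                "TableName": table,
                "Name": filter_name,
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],  # assumption: mirrors the console choice
    }


filter_request = build_row_filter(
    "123456789012", "glue5_lf_demo", "raw_csv_input",
    "product_furniture", "category='Furniture'",
)
grant_request = build_filter_grant(
    "123456789012", "glue5_lf_demo", "raw_csv_input", "product_furniture",
    "arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role
)

# To apply them for real (requires boto3 and AWS credentials):
#   import boto3
#   lf = boto3.client("lakeformation")
#   lf.create_data_cells_filter(**filter_request)
#   lf.grant_permissions(**grant_request)
```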

Grant permissions on the Iceberg table

The next step is to grant table permissions on the iceberg_datalake table to the AWS Glue job role.

On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
Choose Grant.
For Principals, choose IAM users and roles.
For IAM users and roles, choose your IAM role that is going to be used on an AWS Glue job.
For LF-Tags or catalog resources, choose Named Data Catalog resources.
For Databases, choose glue5_lf_demo.
For Tables, choose iceberg_datalake.
For Data filters, choose Create new.
In the Create data filter dialog, provide the following information:
1. For Data filter name, enter product_electronics.
2. For Column-level access, select Include columns.
3. For Included columns, choose category, last_update_time, op, product_name, and quantity_available.
4. Choose Filter rows.
5. For Row filter expression, enter category='Electronics'.
6. Choose Create filter.
For Data filters, select the filter product_electronics you created.
For Data filter permissions, choose Select and Describe.
Choose Grant.
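Expressed as a Lake Formation API payload, the Iceberg table's filter differs from an all-columns filter in one respect: it lists the included columns explicitly. A minimal sketch of the create_data_cells_filter request follows; the account ID is a placeholder and build_column_filter is a hypothetical helper.

```python
def build_column_filter(account_id, database, table, name, expression, columns) -> dict:
    """Payload for create_data_cells_filter: row filter plus an explicit column list."""
    return {
        "TableData": {
            "TableCatalogId": account_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": name,
            "RowFilter": {"FilterExpression": expression},
            "ColumnNames": list(columns),  # only these columns are readable
        }
    }


electronics_filter = build_column_filter(
    "123456789012",  # placeholder account ID
    "glue5_lf_demo",
    "iceberg_datalake",
    "product_electronics",
    "category='Electronics'",
    ["category", "last_update_time", "op", "product_name", "quantity_available"],
)
print(electronics_filter["TableData"]["ColumnNames"])
```

Because product_id is not in the column list, jobs that run under this grant cannot read it.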

Next, let’s create the AWS Glue PySpark job to process the input data.

Query the standard table through an AWS Glue 5.0 job

Complete the following steps to create an AWS Glue job to load data from the raw_csv_input table:

On the AWS Glue console, choose ETL jobs in the navigation pane.
For Create job, choose Script Editor.
For Engine, choose Spark.
For Options, choose Start fresh.
Choose Create script.

For Script, use the following code, providing your S3 output path. This example script writes the output in Parquet format; you can change this according to your use case.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read from raw CSV table
df = spark.sql("SELECT * FROM glue5_lf_demo.raw_csv_input")
df.show()

# Write to your preferred location.
df.write.mode("overwrite").parquet("s3://<s3_output_path>")

On the Job details tab, for Name, enter glue5-lf-demo.
For IAM Role, assign an IAM role that has the required permissions to run an AWS Glue job and read and write to the S3 bucket.
For Glue version, choose Glue 5.0 – Supports spark 3.5, Scala 2, Python 3.
For Job parameters, add the following parameter:
1. Key: --enable-lakeformation-fine-grained-access
2. Value: true

Choose Save and then Run.
When the job is complete, on the Run details tab at the bottom of job runs, choose Output logs.

You’re redirected to the Amazon CloudWatch console to validate the output.

The printed table is shown in the following screenshot. Only two records were returned because they are Furniture category products.
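Starting the job and waiting for it to finish can also be done from a script. The sketch below assumes glue is a boto3 Glue client; run_job is a hypothetical helper, and the polling interval is an arbitrary choice.

```python
import time

# Job run states after which the run will not change anymore.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job(glue, job_name: str, poll_seconds: float = 30.0) -> str:
    """Start a Glue job run and poll until it reaches a terminal state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        response = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   print(run_job(boto3.client("glue"), "glue5-lf-demo"))
```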

Query the Iceberg table through an AWS Glue 5.0 job

Next, complete the following steps to create an AWS Glue job to load data from the iceberg_datalake table:

On the AWS Glue console, choose ETL jobs in the navigation pane.
For Create job, choose Script Editor.
For Engine, choose Spark.
For Options, choose Start fresh.
Choose Create script.
For Script, replace the following parameters:
1. Replace aws_region with your Region.
2. Replace aws_account_id with your AWS account ID.
3. Replace warehouse_path with your S3 warehouse path for the Iceberg table.
4. Replace <s3_output_path> with your S3 output path.

This example script writes the output in Parquet format; you can change it according to your use case.

from pyspark.sql import SparkSession

catalog_name = "spark_catalog"
aws_region = "eu-west-1"
aws_account_id = "123456789012"
warehouse_path = "s3://<bucket-name>/warehouse"

# Create Spark Session with Iceberg Configurations
spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config(f"spark.sql.catalog.{catalog_name}.client.region", f"{aws_region}") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.account-id", f"{aws_account_id}") \
    .getOrCreate()

# Read from Iceberg table
df = spark.sql(f"SELECT * FROM {catalog_name}.glue5_lf_demo.iceberg_datalake")
df.show()

# Write to your preferred location.
df.write.mode("overwrite").parquet("s3://<s3_output_path>")

On the Job details tab, for Name, enter glue5-lf-demo-iceberg.
For IAM Role, assign an IAM role that has the required permissions to run an AWS Glue job and read and write to the S3 bucket.
For Glue version, choose Glue 5.0 – Supports spark 3.5, Scala 2, Python 3.
For Job parameters, add the following parameters:
1. Key: --enable-lakeformation-fine-grained-access
2. Value: true
3. Key: --datalake-formats
4. Value: iceberg
Choose Save and then Run.
When the job is complete, on the Run details tab, choose Output logs.

You’re redirected to the CloudWatch console to validate the output.

The printed table is shown in the following screenshot. Only two records were returned because they are Electronics category products, and the product_id column is excluded.

You are now able to verify that records of the table raw_csv_input and the table iceberg_datalake are successfully retrieved with configured Lake Formation data cell filters.

Clean up

Complete the following steps to clean up your resources:

Delete the AWS Glue jobs glue5-lf-demo and glue5-lf-demo-iceberg.
Delete the Lake Formation permissions.
Delete the output files written to the S3 bucket.
Delete the bucket you created for the input datasets, which might have a name similar to glue5-lf-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}.
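Part of this cleanup could be scripted. The sketch below plans the job deletions and the revocation of the two data filter grants, then runs them against boto3 clients that you supply. The account ID and role ARN are placeholders, cleanup_plan and execute are hypothetical helpers, and the permission list assumes the grants were made with SELECT and DESCRIBE. Emptying and deleting the S3 bucket is left out.

```python
def cleanup_plan(account_id: str, role_arn: str) -> list:
    """List the (service, operation, kwargs) calls that undo parts of this walkthrough."""
    plan = [
        ("glue", "delete_job", {"JobName": name})
        for name in ("glue5-lf-demo", "glue5-lf-demo-iceberg")
    ]
    filters = (
        ("raw_csv_input", "product_furniture"),
        ("iceberg_datalake", "product_electronics"),
    )
    for table, filter_name in filters:
        plan.append(("lakeformation", "revoke_permissions", {
            "Principal": {"DataLakePrincipalIdentifier": role_arn},
            "Resource": {"DataCellsFilter": {
                "TableCatalogId": account_id,
                "DatabaseName": "glue5_lf_demo",
                "TableName": table,
                "Name": filter_name,
            }},
            "Permissions": ["SELECT", "DESCRIBE"],  # assumption: same as granted
        }))
    return plan


def execute(plan: list, clients: dict) -> None:
    """Run each planned call; `clients` maps service names to boto3 clients."""
    for service, operation, kwargs in plan:
        getattr(clients[service], operation)(**kwargs)


plan = cleanup_plan("123456789012", "arn:aws:iam::123456789012:role/GlueJobRole")
for service, operation, _ in plan:
    print(service, operation)
```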

Conclusion

This post explained how you can enable Lake Formation FGAC in AWS Glue jobs and notebooks so that they enforce the access control defined using Lake Formation grant commands. Previously, you needed to integrate AWS Glue DynamicFrames to enforce FGAC in AWS Glue jobs, but with this release, you can enforce FGAC through Spark DataFrames or Spark SQL. This capability works not only with standard file formats like CSV, JSON, and Parquet but also with Apache Iceberg.

This feature can save you effort and encourage portability while migrating Spark scripts to different serverless environments such as AWS Glue and Amazon EMR.

About the Authors

Sakti Mishra is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to-end data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.

Layth Yassin is a Software Development Engineer on the AWS Glue team. He’s passionate about tackling challenging problems at a large scale, and building products that push the limits of the field. Outside of work, he enjoys playing/watching basketball, and spending time with friends and family.

Introducing AWS Glue 5.0 for Apache Spark

2024-12-04 Noritaka Sekiyama

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-5-0-for-apache-spark/

AWS Glue is a serverless, scalable data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources. Today, we are launching AWS Glue 5.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.0 upgrades the Spark engines to Apache Spark 3.5.2 and Python 3.11, giving you newer Spark and Python releases so you can develop, run, and scale your data integration workloads and get insights faster.

This post describes what’s new in AWS Glue 5.0, performance improvements, key highlights on Spark and related libraries, and how to get started on AWS Glue 5.0.

What’s new in AWS Glue 5.0

AWS Glue 5.0 upgrades the runtimes to Spark 3.5.2, Python 3.11, and Java 17 with new performance and security improvements from open source. AWS Glue 5.0 also updates support for open table format libraries to Apache Hudi 0.15.0, Apache Iceberg 1.6.1, and Delta Lake 3.2.1 so you can solve advanced use cases around performance, cost, governance, and privacy in your data lakes. AWS Glue 5.0 adds support for Spark-native fine-grained access control with AWS Lake Formation so you can apply table- and column-level permissions on an Amazon Simple Storage Service (Amazon S3) data lake for write operations (such as INSERT INTO and INSERT OVERWRITE) with Spark jobs.

Key features include:

Amazon SageMaker Unified Studio support
Amazon SageMaker Lakehouse support
Frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17
Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1
Spark-native fine-grained access control using Lake Formation
Amazon S3 Access Grants support
requirements.txt support to install additional Python libraries
Data lineage support in Amazon DataZone

Amazon SageMaker Unified Studio support

Amazon SageMaker Unified Studio supports AWS Glue 5.0 as the compute runtime for unified notebooks and the visual ETL flow editor.

Amazon SageMaker Lakehouse support

Glue 5.0 supports native integration with Amazon SageMaker Lakehouse to enable unified access across Amazon Redshift data warehouses and S3 data lakes.

Frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17

AWS Glue 5.0 upgrades the runtimes to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17. Glue 5.0 uses the AWS performance-optimized Spark runtime, which is 3.9 times faster than open source Spark. Glue 5.0 is 32% faster than AWS Glue 4.0 and reduces costs by 22%.

For more details about updated library dependencies, see the Dependent library upgrades section.

Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1

AWS Glue 5.0 upgrades the open table format libraries to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1.

Spark-native fine-grained access control using Lake Formation

AWS Glue supports AWS Lake Formation Fine Grained Access Control (FGAC) through native Spark DataFrames and Spark SQL.

S3 Access Grants support

S3 Access Grants provides a simplified model for defining access permissions to data in Amazon S3 by prefix, bucket, or object. AWS Glue 5.0 supports S3 Access Grants through EMR File System (EMRFS) using additional Spark configurations:

Key: --conf
Value: spark.hadoop.fs.s3.s3AccessGrants.enabled=true --conf spark.hadoop.fs.s3.s3AccessGrants.fallbackToIAM=false

To learn more, refer to the documentation.
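The value above packs two Spark properties into a single --conf job parameter by chaining them with " --conf ". A small helper can build that string from a dictionary. This is a sketch only: chain_spark_confs is a hypothetical name, and the property names assume the spark.hadoop. prefix on both settings.

```python
def chain_spark_confs(confs: dict) -> dict:
    """Fold several Spark properties into the single --conf job parameter."""
    if not confs:
        raise ValueError("at least one Spark property is required")
    pairs = [f"{key}={value}" for key, value in confs.items()]
    # The first pair is the parameter value; later pairs are chained with --conf.
    return {"--conf": " --conf ".join(pairs)}


params = chain_spark_confs({
    "spark.hadoop.fs.s3.s3AccessGrants.enabled": "true",
    "spark.hadoop.fs.s3.s3AccessGrants.fallbackToIAM": "false",
})
print(params["--conf"])
```

The resulting dictionary can be merged into a job's default arguments alongside other job parameters.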

requirements.txt support to install additional Python libraries

In AWS Glue 5.0, you can provide the standard requirements.txt file to manage Python library dependencies. To do that, provide the following job parameters:

Parameter 1:
- Key: --python-modules-installer-option
- Value: -r
Parameter 2:
- Key: --additional-python-modules
- Value: s3://path_to_requirements.txt

AWS Glue 5.0 nodes initially load Python libraries specified in requirements.txt. The following code illustrates the sample requirements.txt:

awswrangler==3.9.1 
elasticsearch==8.15.1
PyAthena==3.9.0
PyMySQL==1.1.1
PyYAML==6.0.2
pyodbc==5.2.0
pyorc==0.9.0 
redshift-connector==2.1.3
scipy==1.14.1
scikit-learn==1.5.2
SQLAlchemy==2.0.36
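As a convenience, the two job parameters can be generated from the S3 location of the file, and the file itself can be checked for unpinned entries before upload. Both helpers below are hypothetical and are not part of AWS Glue; the pinning check is a simple string test, not a full requirements parser.

```python
def requirements_parameters(s3_uri: str) -> dict:
    """Build the two job parameters that point AWS Glue at a requirements.txt file."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// location")
    return {
        "--python-modules-installer-option": "-r",
        "--additional-python-modules": s3_uri,
    }


def unpinned(requirements_text: str) -> list:
    """Return requirement lines that do not pin an exact version with ==."""
    lines = [line.strip() for line in requirements_text.splitlines()]
    return [
        line for line in lines
        if line and not line.startswith("#") and "==" not in line
    ]


print(requirements_parameters("s3://path_to_requirements.txt"))
print(unpinned("scipy==1.14.1\nPyYAML\n"))
```

Pinning exact versions, as the sample file does, keeps job runs reproducible when new library releases appear.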

Data lineage support in Amazon DataZone (preview)

AWS Glue 5.0 supports data lineage in Amazon DataZone in preview. You can configure AWS Glue to automatically collect lineage information during Spark job runs and send the lineage events to be visualized in Amazon DataZone.

To configure this on the AWS Glue console, enable Generate lineage events, and enter your Amazon DataZone domain ID on the Job details tab.

Alternatively, you can provide the following job parameter (provide your DataZone domain ID):

Key: --conf
Value: spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId=<Your-Domain-ID>

Learn more in Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview.

Improved performance

AWS Glue 5.0 improves the price-performance of your AWS Glue jobs. AWS Glue 5.0 is 32% faster than AWS Glue 4.0 and reduces costs by 22%. The following chart shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between AWS Glue 4.0 and AWS Glue 5.0. The TPC-DS dataset is located in an S3 bucket in Parquet format, and we used 30 G.2X workers in AWS Glue. We observed that our AWS Glue 5.0 TPC-DS tests on Amazon S3 were 58% faster than those on AWS Glue 4.0 while reducing cost by 36%.

Metric	AWS Glue 4.0	AWS Glue 5.0
Total Query Time (seconds)	1896.1904	1197.78755
Geometric Mean (seconds)	10.09472	6.82208
Estimated Cost ($)	45.85533	29.20133
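The 58% and 36% figures for the TPC-DS test can be reproduced from the table with a few lines of arithmetic. The function names below are illustrative; "faster" is computed as the ratio of old to new runtime minus one, and the cost reduction as one minus the ratio of new to old cost.

```python
# Values copied from the table above: (AWS Glue 4.0, AWS Glue 5.0).
results = {
    "total_query_time_s": (1896.1904, 1197.78755),
    "geometric_mean_s": (10.09472, 6.82208),
    "estimated_cost_usd": (45.85533, 29.20133),
}


def speedup_percent(old: float, new: float) -> float:
    """How much faster the new runtime is, as a percentage."""
    return (old / new - 1) * 100


def reduction_percent(old: float, new: float) -> float:
    """How much lower the new value is, as a percentage of the old value."""
    return (1 - new / old) * 100


total_speedup = speedup_percent(*results["total_query_time_s"])
cost_reduction = reduction_percent(*results["estimated_cost_usd"])
print(f"Total query time speedup: {total_speedup:.1f}%")
print(f"Estimated cost reduction: {cost_reduction:.1f}%")
```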

The following graphs illustrate the comparisons of performance and cost.

Dependent library upgrades

The following table lists dependency upgrades.

Dependency	Version in AWS Glue 4.0	Version in AWS Glue 5.0
Spark	3.3.0	3.5.2
Hadoop	3.3.3	3.4.0
Scala	2.12	2.12.18
Hive	2.3.9	2.3.9
EMRFS	2.54.0	2.66.0
Arrow	7.0.0	12.0.1
Iceberg	1.0.0	1.6.1
Hudi	0.12.1	0.15.0
Delta Lake	2.1.0	3.2.1
Java	8	17
Python	3.10	3.11
boto3	1.26	1.34.131
AWS SDK for Java	1.12	2.28.8
AWS Glue Data Catalog Client	3.7.0	4.2.0
EMR DynamoDB Connector	4.16.0	5.6.0

The following table lists database connector (JDBC driver) upgrades.

Driver	Connector Version in AWS Glue 4.0	Connector Version in AWS Glue 5.0
MySQL	8.0.23	8.0.33
Microsoft SQL Server	9.4.0	10.2.0
Oracle Databases	21.7	23.3.0.23.09
PostgreSQL	42.3.6	42.7.3
Amazon Redshift	redshift-jdbc42-2.1.0.16	redshift-jdbc42-2.1.0.29

The following are Spark connector upgrades:

Driver	Connector Version in AWS Glue 4.0	Connector Version in AWS Glue 5.0
Amazon Redshift	6.1.3	6.3.0
OpenSearch	1.0.1	1.2.0
MongoDB	10.0.4	10.3.0
Snowflake	2.12.0	3.0.0
BigQuery	0.32.2	0.32.2

Apache Spark highlights

Spark 3.5.2 in AWS Glue 5.0 brings a number of valuable features, which we highlight in this section. To learn more about the highlights and enhancements of Spark 3.4 and 3.5, refer to Spark Release 3.4.0 and Spark Release 3.5.0.

Apache Arrow-optimized Python UDF

Python user-defined functions (UDFs) enable users to build custom code for data processing needs, providing flexibility and accessibility. However, performance suffers because UDFs require serialization between Python and JVM processes. Spark 3.5’s Apache Arrow-optimized UDFs solve this by keeping data in shared memory using Arrow’s high-performance columnar format, eliminating serialization overhead and making UDFs efficient for large-scale processing.

To use Arrow-optimized Python UDFs, set spark.sql.execution.pythonUDF.arrow.enabled to True.
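As a brief illustration, the following sketch turns on the setting for a session and applies a simple UDF. It assumes an active SparkSession is passed in as spark; the shout function, the demo function, and the sample data are hypothetical. Spark 3.5 also accepts useArrow=True on an individual udf call if you prefer not to change the session setting.

```python
def shout(text):
    """Plain Python logic that runs inside the UDF."""
    return None if text is None else text.upper()


def demo(spark):
    """Apply the UDF with Arrow optimization; `spark` is an active SparkSession."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql.functions import udf

    # Session-wide switch named in the text above.
    spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")
    shout_udf = udf(shout, returnType="string")
    df = spark.createDataFrame([("chair",), ("lamp",)], ["product_name"])
    return df.select(shout_udf("product_name").alias("shouted"))


# Example usage inside an AWS Glue 5.0 job or notebook:
#   demo(spark).show()
```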

Python user-defined table functions

A user-defined table function (UDTF) is a function that returns an entire output table instead of a single value. PySpark users can now write custom UDTFs with Python logic and use them in PySpark and SQL queries. Called in the FROM clause, UDTFs can accept zero or more arguments, either as scalar expressions or table arguments. The UDTF’s return type, defined as either a StructType (for example, StructType().add("c1", StringType())) or DDL string (for example, c1: string), determines the output table’s schema.
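A minimal sketch of a Python UDTF follows, modeled on the pattern in the PySpark documentation. It assumes an active SparkSession is passed in as spark; the SquareNumbers class, the square_numbers SQL name, and the demo function are illustrative choices.

```python
class SquareNumbers:
    """UDTF handler: emits one output row per integer in [start, end]."""

    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)


def demo(spark):
    """Register the UDTF and call it in a FROM clause; `spark` is an active SparkSession."""
    # Imported here so the handler class can be used without PySpark installed.
    from pyspark.sql.functions import udtf

    # The DDL string defines the schema of the output table.
    square_numbers = udtf(SquareNumbers, returnType="num: int, squared: int")
    spark.udtf.register("square_numbers", square_numbers)
    return spark.sql("SELECT * FROM square_numbers(1, 3)")


# Example usage inside an AWS Glue 5.0 job or notebook:
#   demo(spark).show()
```

Keeping the row-generating logic in a plain class makes it easy to unit test without a Spark session.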

RocksDB state store enhancement

In Spark 3.2, the RocksDB state store provider was added as a built-in state store implementation.

Changelog checkpointing

A new checkpoint mechanism for the RocksDB state store provider called changelog checkpointing persists the changelog (updates) of the state. This reduces the commit latency, thereby reducing end-to-end latency significantly.

You can enable this by setting spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled to True.

You can also enable this feature with existing checkpoints.

Memory management enhancements

Although the RocksDB state store provider is well known for addressing memory issues with state, it previously offered no fine-grained memory management. Spark 3.5 introduces more fine-grained memory management, which lets users cap the total memory usage across RocksDB instances in the same executor process and so configure memory usage per executor process.
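The RocksDB settings discussed in this section can be collected in one place and applied when the session is built. The sketch below uses property names from the Spark 3.5 Structured Streaming documentation; the 500 MB cap is an arbitrary example value, and apply_confs is a hypothetical helper that works with any builder exposing a config(key, value) method, such as SparkSession.builder.

```python
ROCKSDB_CONFS = {
    # Use RocksDB as the state store implementation.
    "spark.sql.streaming.stateStore.providerClass":
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    # Changelog checkpointing, as described above.
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled": "true",
    # Cap total RocksDB memory per executor process (example value in MB).
    "spark.sql.streaming.stateStore.rocksdb.boundedMemoryUsage": "true",
    "spark.sql.streaming.stateStore.rocksdb.maxMemoryUsageMB": "500",
}


def apply_confs(builder, confs: dict):
    """Apply each property to a session builder and return the builder."""
    for key, value in confs.items():
        builder = builder.config(key, value)
    return builder


# Example usage:
#   from pyspark.sql import SparkSession
#   spark = apply_confs(SparkSession.builder, ROCKSDB_CONFS).getOrCreate()
```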

Enhanced Structured Streaming

Spark 3.4 and 3.5 have many enhancements related to Spark Structured Streaming.

The new dropDuplicatesWithinWatermark API deduplicates rows based on certain events. Watermark-based processing allows for more precise control over late data handling:

Deduplicate the same rows: dropDuplicatesWithinWatermark()
Deduplicate values on ‘value’ columns: dropDuplicatesWithinWatermark(['value'])
Deduplicate using the guid column with a watermark based on the eventTime column: withWatermark("eventTime", "10 hours").dropDuplicatesWithinWatermark(["guid"])
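The last pattern can be wrapped in a small function. This is a sketch that assumes events is a streaming DataFrame with eventTime and guid columns; deduplicate_by_guid is a hypothetical helper name, and the commented usage builds a test stream from Spark's built-in rate source.

```python
def deduplicate_by_guid(events, delay: str = "10 hours"):
    """Drop events whose guid was already seen within the watermark delay.

    `events` is expected to be a streaming DataFrame with eventTime and guid columns.
    """
    return (
        events
        .withWatermark("eventTime", delay)
        .dropDuplicatesWithinWatermark(["guid"])
    )


# Example usage inside an AWS Glue 5.0 streaming job:
#   events = (
#       spark.readStream.format("rate").load()
#       .selectExpr("timestamp AS eventTime", "CAST(value % 10 AS STRING) AS guid")
#   )
#   query = deduplicate_by_guid(events).writeStream.format("console").start()
```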

Get started with AWS Glue 5.0

You can start using AWS Glue 5.0 through AWS Glue Studio, the AWS Glue console, the latest AWS SDK, and the AWS Command Line Interface (AWS CLI).

To start using AWS Glue 5.0 jobs in AWS Glue Studio, open the AWS Glue job and on the Job Details tab, choose the version Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.

To start using AWS Glue 5.0 on an AWS Glue Studio notebook or an interactive session through a Jupyter notebook, set 5.0 in the %glue_version magic:

%glue_version 5.0

The following output shows that the session is set to use AWS Glue 5.0:

Setting Glue version to: 5.0

Conclusion

In this post, we discussed the key features and benefits of AWS Glue 5.0. You can create new AWS Glue jobs on AWS Glue 5.0 to benefit from the improvements, or migrate your existing AWS Glue jobs.

We would like to thank the numerous engineers and leaders who helped build Glue 5.0, which provides customers with a performance-optimized Spark runtime and several new capabilities.

About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Stuti Deshpande is a Big Data Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in big data, ETL, and analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.

Martin Ma is a Software Development Engineer on the AWS Glue team. He is passionate about improving the customer experience by applying problem-solving skills to invent new software solutions, as well as constantly searching for ways to simplify existing ones. In his spare time, he enjoys singing and playing the guitar.

Anshul Sharma is a Software Development Engineer on the AWS Glue team.

Rajendra Gujja is a Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and everything and anything about data.

Maheedhar Reddy Chappidi is a Sr. Software Development Engineer on the AWS Glue team. He is passionate about building fault tolerant and reliable distributed systems at scale. Outside of his work, Maheedhar is passionate about listening to podcasts and playing with his two-year-old kid.

Savio Dsouza is a Software Development Manager on the AWS Glue team. His team works on generative AI applications for the Data Integration domain and distributed systems for efficiently managing data lakes on AWS and optimizing Apache Spark for performance and reliability.

Kartik Panjabi is a Software Development Manager on the AWS Glue team. His team builds generative AI features and distributed systems for data integration.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue and Amazon EMR team. His team focuses on building distributed systems to enable customers with simple-to-use interfaces and AI-driven capabilities to efficiently transform petabytes of data across data lakes on Amazon S3, and databases and data warehouses on the cloud.

Read and write S3 Iceberg table using AWS Glue Iceberg Rest Catalog from Open Source Apache Spark

2024-12-04 Raj Ramasubbu

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/read-and-write-s3-iceberg-table-using-aws-glue-iceberg-rest-catalog-from-open-source-apache-spark/

In today’s data-driven world, organizations are constantly seeking efficient ways to process and analyze vast amounts of information across data lakes and warehouses.

Enter Amazon SageMaker Lakehouse, which you can use to unify all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines. This opens up exciting possibilities for open source Apache Spark users who want to use SageMaker Lakehouse capabilities. Further, you can secure your data in SageMaker Lakehouse by defining fine-grained permissions, which are enforced across all analytics and ML tools and engines.

In this post, we will explore how to use open source Apache Spark as a third-party engine and configure it to work with the AWS Glue Iceberg REST Catalog. The post covers how to perform read and write operations against Iceberg tables on Amazon S3, with AWS Lake Formation managing metadata and underlying data access using temporary credential vending.

Solution overview

In this post, the customer uses the Data Catalog to centrally manage technical metadata for structured and semi-structured datasets in their organization and wants to enable their data team to use Apache Spark for data processing. The customer will create an AWS Glue database and configure Apache Spark to interact with the AWS Glue Data Catalog using the Iceberg REST API for writing and reading Iceberg data on Amazon S3 under Lake Formation permission control.

We will start by running an extract, transform, and load (ETL) script using Apache Spark to create an Iceberg table on Amazon S3 and access the table using the Glue Iceberg REST Catalog. The ETL script will add data to the Iceberg table and then read it back using Spark SQL. This post will also show how this data can be queried by other data teams using Amazon Athena.

Prerequisites

Access to an AWS Identity and Access Management (IAM) role that is a Lake Formation data lake administrator in the account that has the Data Catalog. For instructions, see Create a data lake administrator.

Verify that Python version 3.7 or later is installed, and that pip3 version 22.2.2 or later is installed.
Install or update the latest AWS Command Line Interface (AWS CLI). For instructions, see Installing or updating the latest version of the AWS CLI. Run aws configure using AWS CLI to point to your AWS account.
Create an S3 bucket to store the customer Iceberg table. For this post, we will be using the us-east-2 AWS Region and will name the bucket: ossblog-customer-datalake.
Create an IAM role that will be used in OSS Spark for data access using an AWS Glue Iceberg REST catalog endpoint. Make sure that the role has AWS Glue and Lake Formation policies as defined in Data engineer permissions. For this post, we will use an IAM role named spark_role.
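The version requirements in the prerequisites can be checked from Python before you continue. The following is a minimal sketch rather than part of the original walkthrough: the helper names `parse_version` and `meets_minimum` are illustrative, and the pip check assumes pip is importable from the same interpreter.

```python
import sys
from itertools import takewhile


def parse_version(text):
    """Turn a dotted version string such as '22.2.2' into a tuple of ints."""
    parts = []
    for piece in text.strip().split("."):
        # Keep only the leading digits so suffixes such as 'rc1' are ignored.
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Return True when the found version is at least the minimum version."""
    return parse_version(found) >= parse_version(minimum)


python_version = ".".join(str(n) for n in sys.version_info[:3])
print("Python 3.7 or later:", meets_minimum(python_version, "3.7"))

try:
    import pip  # assumption: pip is installed for this interpreter

    print("pip 22.2.2 or later:", meets_minimum(pip.__version__, "22.2.2"))
except ImportError:
    print("pip is not importable from this interpreter")
```

Versions are compared as integer tuples so that, for example, 3.12 is correctly treated as newer than 3.7.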

Enable Lake Formation permissions for third-party access

In this section, you will register the S3 bucket with Lake Formation. This step allows Lake Formation to act as a centralized permissions management system for metadata and data stored in Amazon S3, enabling more efficient and secure data governance in data lake environments.

Create a user-defined IAM role following the instructions in Requirements for roles used to register locations. For this post, we will use the IAM role LFRegisterRole. Then register the bucket with the following AWS CLI command:

aws lakeformation register-resource \
--resource-arn '< S3 bucket ARN for ossblog-customer-datalake >' \
--role-arn '< IAM Role ARN for LFRegisterRole >' \
--region <aws_region>

Alternatively you can use the AWS Management Console for Lake Formation.

Navigate to the Lake Formation console, choose Administration in the navigation pane, and then Data lake locations and provide the following values:
1. For Amazon S3 path, select s3://ossblog-customer-datalake.
2. For IAM role, select LFRegisterRole
3. For Permission mode, choose Lake Formation.
4. Choose Register location.
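The registration step can also be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) is installed and credentials are configured; the helper names are illustrative and the account ID is a placeholder. The request builder is kept separate so you can inspect the arguments without calling AWS.

```python
def build_register_request(bucket_name, role_arn):
    """Build the arguments for Lake Formation's RegisterResource call."""
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket_name}",
        "RoleArn": role_arn,
    }


def register_location(bucket_name, role_arn, region):
    """Register the bucket with Lake Formation (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("lakeformation", region_name=region)
    return client.register_resource(**build_register_request(bucket_name, role_arn))


request = build_register_request(
    "ossblog-customer-datalake",
    "arn:aws:iam::111122223333:role/LFRegisterRole",  # placeholder account ID
)
print(request)
```

To perform the registration, call `register_location` with your bucket name, the role ARN, and your Region, for example `us-east-2`.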

In Lake Formation, enable full table access for external engines to access data.
1. Sign in as an admin user, choose Administration in the navigation pane.
2. Choose Application integration settings and select Allow external engines to access data in Amazon S3 locations with full table access.
3. Choose Save.

Set up resource access for the OSS Spark role:

Create an AWS Glue database called ossblogdb in the default catalog by going to the Lake Formation console and choosing Databases in the navigation pane.
Select the database, choose Edit and clear the checkbox for Use only IAM access control for new tables in this database.

Grant resource permission to OSS Spark role:

To enable OSS Spark to create and populate the dataset in the ossblogdb database, you will use the IAM role (spark_role) that you created in the prerequisites section for the Apache Spark instance. Apache Spark will assume this role to create an Iceberg table, add records to it, and read from it. To enable this functionality, grant full table access to spark_role and provide data location permission to the S3 bucket where the table data can be stored.

Grant create table permission to the spark_role:

aws lakeformation grant-permissions \
--principal '{"DataLakePrincipalIdentifier":"arn:aws:iam::<aws_account_id>:role/<iam_role_name>"}' \
--permissions '["CREATE_TABLE","DESCRIBE"]' \
--resource '{"Database":{"CatalogId":"<aws_account_id>","Name":"ossblogdb"}}' \
--region <aws_region>

Alternatively on the console:

In the Lake Formation console navigation pane, choose Data lake permissions, and then choose Grant.
In the Principals section, for IAM users and roles, select spark_role.
In the LF-Tags or catalog resources section, select Named Data Catalog resources:
1. Select <accountid> for Catalogs.
2. Select ossblogdb for Databases.
Select DESCRIBE and CREATE TABLE for Database permissions.
Choose Grant.

Grant data location permission to the spark_role:

aws lakeformation grant-permissions \
--principal '{"DataLakePrincipalIdentifier":"<Principal>"}' \
--permissions DATA_LOCATION_ACCESS \
--resource '{"DataLocation":{"CatalogId":"<Catalog ID>","ResourceArn":"<S3 bucket ARN>"}}' \
--region <aws_region>

Alternatively on the console:

In the Lake Formation console navigation pane, choose Data Locations, and then choose Grant.
For IAM users and roles, select spark_role.
For Storage locations, select your S3 bucket.
Choose Grant.
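Both grants for spark_role can be expressed as request payloads for the Lake Formation GrantPermissions API. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the helper names are illustrative and the account ID is a placeholder.

```python
def role_principal(account_id, role_name):
    """Identify an IAM role as a Lake Formation principal."""
    return {"DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/{role_name}"}


def database_grant(account_id, role_name, database):
    """CREATE_TABLE and DESCRIBE on a database in the account's default catalog."""
    return {
        "Principal": role_principal(account_id, role_name),
        "Permissions": ["CREATE_TABLE", "DESCRIBE"],
        "Resource": {"Database": {"CatalogId": account_id, "Name": database}},
    }


def data_location_grant(account_id, role_name, bucket_name):
    """DATA_LOCATION_ACCESS on the registered S3 bucket."""
    return {
        "Principal": role_principal(account_id, role_name),
        "Permissions": ["DATA_LOCATION_ACCESS"],
        "Resource": {
            "DataLocation": {
                "CatalogId": account_id,
                "ResourceArn": f"arn:aws:s3:::{bucket_name}",
            }
        },
    }


def apply_grants(grants, region):
    """Send each grant to Lake Formation (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builders work without boto3

    client = boto3.client("lakeformation", region_name=region)
    for grant in grants:
        client.grant_permissions(**grant)


ACCOUNT_ID = "111122223333"  # placeholder account ID
grants = [
    database_grant(ACCOUNT_ID, "spark_role", "ossblogdb"),
    data_location_grant(ACCOUNT_ID, "spark_role", "ossblog-customer-datalake"),
]
for grant in grants:
    print(grant["Permissions"], "->", list(grant["Resource"]))
```

Calling `apply_grants(grants, "us-east-2")` would issue the same two grants as the CLI commands shown earlier.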

Set up a Spark script to use an AWS Glue Iceberg REST catalog endpoint:

Create a file named oss_spark_customer_etl.py in your environment with the following content:

import time
from pyspark.sql import SparkSession

# Replace <aws_region> with the AWS Region name.
# Replace <aws_account_id> with the AWS account ID.
# Replace <s3_bucket_name> with the name of the S3 bucket that stores the table.

# Configure an Iceberg REST catalog that points to the AWS Glue endpoint and signs requests with SigV4.
spark = SparkSession.builder.appName('osspark') \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.defaultCatalog', 'spark_catalog') \
    .config('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.spark_catalog.type', 'rest') \
    .config('spark.sql.catalog.spark_catalog.uri', 'https://glue.<aws_region>.amazonaws.com/iceberg') \
    .config('spark.sql.catalog.spark_catalog.warehouse', '<aws_account_id>') \
    .config('spark.sql.catalog.spark_catalog.rest.sigv4-enabled', 'true') \
    .config('spark.sql.catalog.spark_catalog.rest.signing-name', 'glue') \
    .config('spark.sql.catalog.spark_catalog.rest.signing-region', '<aws_region>') \
    .config('spark.sql.catalog.spark_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO') \
    .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider') \
    .config('spark.sql.catalog.spark_catalog.rest-metrics-reporting-enabled', 'false') \
    .getOrCreate()

# Switch to the database, create the Iceberg table, and then write and read one row.
spark.sql("use ossblogdb").show()
spark.sql("""CREATE TABLE ossblogdb.customer (name string) USING iceberg location 's3://<s3_bucket_name>/customer'""")
# Pause for two minutes before writing to the new table.
time.sleep(120)
spark.sql("insert into ossblogdb.customer values('Alice') ").show()
spark.sql("select * from ossblogdb.customer").show()
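The catalog settings in the script follow one pattern: every key is prefixed with the catalog name, and the Region appears in both the endpoint URI and the signing Region. The following is a minimal sketch of a helper that builds those settings as a dictionary; the function name `glue_rest_catalog_conf` is illustrative, the account ID is a placeholder, and the helper itself does not need PySpark.

```python
def glue_rest_catalog_conf(region, account_id, catalog="spark_catalog"):
    """Build Spark settings for an Iceberg REST catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.defaultCatalog": catalog,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        f"{prefix}.warehouse": account_id,
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "glue",
        f"{prefix}.rest.signing-region": region,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


conf = glue_rest_catalog_conf("us-east-2", "111122223333")  # placeholder account ID
for key, value in conf.items():
    print(f"{key}={value}")

# With PySpark installed, the settings could be applied to a session builder:
# builder = SparkSession.builder.appName("osspark")
# for key, value in conf.items():
#     builder = builder.config(key, value)
```

Building the settings in one place keeps the Region and account ID consistent across the endpoint URI, the warehouse, and the SigV4 signing Region.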

Launch PySpark locally and validate read/write to the Iceberg table on Amazon S3

Run pip install pyspark. Save the script locally and set the environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN) with temporary credentials for the spark_role IAM role.

Run python /path/to/oss_spark_customer_etl.py
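Because the script depends on the three credential environment variables, a quick check before launching it can save a failed run. The following is a minimal sketch; the function name `missing_credentials` is illustrative.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def missing_credentials(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Set these variables before running the script:", ", ".join(missing))
else:
    print("Temporary credentials found in the environment.")
```

The check only confirms that the variables are present; it does not verify that the credentials are valid or belong to spark_role.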

You can also use Athena to view the data in the Iceberg table:

To enable the other data team to view the content, provide read access to the data team IAM role using the Lake Formation console:

In the Lake Formation console navigation pane, choose Data lake permissions, and then choose Grant.
In the Principals section, for IAM users and roles choose <iam_role>.
In the LF-Tags or catalog resources section, select Named Data Catalog resources:
1. Select <accountid> for Catalogs.
2. Select ossblogdb for Databases.
3. Select customer for Tables.
Select DESCRIBE and SELECT for Table permissions.
Choose Grant.

SELECT * FROM "ossblogdb"."customer" limit 10;
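The same query can be submitted programmatically through the Athena StartQueryExecution API. The following is a minimal sketch, assuming boto3 is installed and credentials are configured for the data team role; the helper names and the results location `s3://amzn-s3-demo-bucket/athena-results/` are assumptions, not values from this post.

```python
def build_athena_query_request(database, table, output_location, limit=10):
    """Build the arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": f'SELECT * FROM "{database}"."{table}" limit {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, region):
    """Start the query and return its execution ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("athena", region_name=region)
    return client.start_query_execution(**request)["QueryExecutionId"]


request = build_athena_query_request(
    "ossblogdb",
    "customer",
    "s3://amzn-s3-demo-bucket/athena-results/",  # hypothetical results location
)
print(request["QueryString"])
```

Athena runs queries asynchronously, so the returned execution ID would be used to poll for completion and fetch the results.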

Clean up

To clean up your resources, complete the following steps:

Delete the database and table that you created in the Data Catalog.
Empty and then delete the S3 bucket.

Conclusion

In this post, we walked through the integration between Apache Spark and the AWS Glue Iceberg REST Catalog for accessing Iceberg tables in Amazon S3, and demonstrated how to perform read and write operations using the Iceberg REST API. The solution is flexible: whether you’re running Spark on bare metal servers in your data center, in a Kubernetes cluster, or in any other environment, this architecture can be adapted to suit your needs.

About the Authors

Raj Ramasubbu is a Sr. Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 20 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.

Pratik Das is a Senior Product Manager with AWS Lake Formation. He is passionate about all things data and works with customers to understand their requirements and build delightful experiences. He has a background in building data-driven solutions and machine learning systems in production.

Author visual ETL flows on Amazon SageMaker Unified Studio (preview)

2024-12-04 Praveen Kumar

Post Syndicated from Praveen Kumar original https://aws.amazon.com/blogs/big-data/author-visual-etl-flows-on-amazon-sagemaker-unified-studio/

Amazon SageMaker Unified Studio (preview) provides an integrated data and AI development environment within Amazon SageMaker. From the Unified Studio, you can collaborate and build faster using familiar AWS tools for model development, generative AI, data processing, and SQL analytics. This experience includes visual ETL, a new visual interface that makes it simple for data engineers to author, run, and monitor extract, transform, load (ETL) data integration flows. You can use a simple visual interface to compose flows that move and transform data and run them on serverless compute. Additionally, you can choose to author your visual flows in English using generative AI prompts powered by Amazon Q. Visual ETL also automatically converts your visual flow directed acyclic graph (DAG) into Spark native scripts so you can continue authoring in a notebook, enabling a quick-start experience for developers who prefer to author using code.

This post shows how you can build a low-code and no-code (LCNC) visual ETL flow that enables seamless data ingestion and transformation across multiple data sources. We demonstrate how to:

Connect to diverse data sources
Perform table joins
Apply custom filters
Export aggregated data to Amazon Simple Storage Service (Amazon S3)

Additionally, we explore how generative AI can enhance your LCNC visual ETL development process, creating an intuitive and powerful workflow that streamlines the entire development experience.

Use case walkthrough

In this example, we use Amazon SageMaker Unified Studio to develop a visual ETL flow. This pipeline reads data from an Amazon S3 based file location, performs transformations on the data, and subsequently writes the transformed data back into an Amazon S3 based AWS Glue Data Catalog table. We use allevents_pipe and venue_pipe files from the TICKIT dataset to demonstrate this capability.

The TICKIT dataset records sales activities on the fictional TICKIT website, where users can purchase and sell tickets online for different types of events such as sports games, shows, and concerts. Analysts can use this dataset to track how ticket sales change over time, evaluate the performance of sellers, and determine the most successful events, venues, and seasons in terms of ticket sales.

The process involves merging the allevents_pipe and venue_pipe files from the TICKIT dataset. Next, the merged data is filtered to include only a specific geographic region. The data is then aggregated to calculate the number of events by venue name. In the end, the transformed output data is saved to Amazon S3, and a new AWS Glue Data Catalog table is created.

The following diagram illustrates the architecture:

Prerequisites

To follow the instructions in this post, you need the following prerequisites:

An AWS account
A SageMaker Unified Studio domain
A SageMaker Unified Studio project with the Data analytics and machine learning project profile

Build a visual ETL flow

Complete the following steps to build a new visual ETL flow with the sample dataset:

On the SageMaker Unified Studio console, on the top menu, choose Build.
Under DATA ANALYSIS & INTEGRATION, choose Visual ETL flows, as shown in the following screenshot.

Select your project and choose Continue.

Choose Create visual ETL flow.

This time, manually define the ETL flow.

On the top left, choose the + icon in the circle. Under Data sources, choose Amazon S3, as shown in the following screenshot. Place the node on the canvas.

Choose the Amazon S3 source node and enter the following values:

- S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv
- Format: CSV
- Delimiter: ,
- Multiline: Enabled
- Header: Disabled

Leave the rest as default.

Wait for the data preview to be available at the bottom of the screen.

Choose the + icon in the circle to the right of the Amazon S3 node. Under Transforms, choose Rename Columns.

Choose the Rename Columns node and choose Add new rename pair. For Current name and New name, enter the following pairs:
- _c0: venueid
- _c1: venuename
- _c2: venuecity
- _c3: venuestate
- _c4: venueseats

Choose the + icon to the right of Rename Columns node. Under Transforms, choose Filter.
Choose Add new filter condition.
For Key, choose venuestate. For Operation, choose ==. For Value, enter DC, as shown in the following screenshot.

Repeat steps 5 and 6 to add the Amazon S3 source node for the events table.

- S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/events.csv
- Format: CSV
- Sep: ,
- Multiline: Enabled
- Header: Disabled

Leave the rest as default.

Repeat steps 7 and 8 for the Amazon S3 source node. On the Rename Columns node, choose Add new rename pair. For Current name and New name, enter the following pairs:
- _c0: eventid
- _c1: e_venueid
- _c2: catid
- _c3: dateid
- _c4: eventname
- _c5: starttime

Choose the + icon to the right of the Rename Columns node. Under Transforms, choose Join.
Drag the + icon at the right of the Filter node and drop it at the left of the Join node.
For Join type, choose Inner. For Left data source, choose e_venueid. For Right data source, choose venueid.

Choose the + icon to the right of the Join node. Under Transforms, choose SQL Query.
Enter the following query statement:

select 
  venuename,
  count(distinct eventid) as eventid_count 
from {myDataSource} 
group by venuename

Choose the + icon to the right of the SQL Query node. Under Data target, choose Amazon S3.
Choose the Amazon S3 target node and enter the following values:
- S3 URI: <choose s3 location from project overview page and add suffix “/output/venue_event/”> (for example, s3://<bucket-name>/dzd_bd693kieeb65yf/52d3z1nutb42w7/dev/output/venue_event/)
- Format: Parquet
- Compression: Snappy
- Mode: Overwrite
- Update catalog: True
- Database: Choose your database
- Table: venue_event_agg

At this point, you should see the end-to-end visual flow. Now you can publish it.

On the top right, choose Save to project to save the draft flow. You can optionally change the name and add a description. Choose Save to project, as shown in the following screenshot.

The visual ETL flow has been successfully saved.
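The logic of the flow built above (rename columns, filter venues by state, inner join on the venue ID, and count distinct events per venue) can be sketched in plain Python. This is an illustration only, not the Spark script that visual ETL generates. The sample rows are made up for the example, and the function names are illustrative; only the column names and the DC filter come from the steps above.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the headerless CSV layout used by the two source nodes.
VENUE_CSV = """1,Verizon Center,Washington,DC,0
2,Nationals Park,Washington,DC,41888
3,Fenway Park,Boston,MA,39928
"""
EVENT_CSV = """10,1,9,1827,Concert A,2008-01-01 19:30:00
11,1,9,1828,Concert B,2008-01-02 19:30:00
12,2,6,1829,Game C,2008-01-03 13:00:00
13,3,6,1830,Game D,2008-01-04 13:00:00
"""

VENUE_COLUMNS = ["venueid", "venuename", "venuecity", "venuestate", "venueseats"]
EVENT_COLUMNS = ["eventid", "e_venueid", "catid", "dateid", "eventname", "starttime"]


def read_rows(text, columns):
    """Read headerless CSV text and name the columns (the Rename Columns step)."""
    return [dict(zip(columns, row)) for row in csv.reader(io.StringIO(text))]


def events_per_venue(venues, events, state):
    """Filter venues by state, inner join on venue ID, and count distinct events."""
    kept = {v["venueid"]: v["venuename"] for v in venues if v["venuestate"] == state}
    event_ids = defaultdict(set)
    for event in events:
        venue_name = kept.get(event["e_venueid"])
        if venue_name is not None:  # inner join: drop events without a kept venue
            event_ids[venue_name].add(event["eventid"])
    return {name: len(ids) for name, ids in event_ids.items()}


result = events_per_venue(
    read_rows(VENUE_CSV, VENUE_COLUMNS),
    read_rows(EVENT_CSV, EVENT_COLUMNS),
    "DC",
)
print(result)
```

With the sample rows, the Boston venue is removed by the filter, so its event is dropped by the inner join and only the two DC venues appear in the result.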

Run flow

This section shows you how to run the visual ETL flow you authored.

On the top right, choose Run.

At the bottom of the screen, the run status is shown. The run status transitions from Starting to Running, and then to Finished.

Wait for the run to be Finished.

Query using Amazon Athena

The output data has been written to the target S3 bucket. This section shows you how to query the output table.

On the top left menu, under DATA ANALYSIS & INTEGRATION, choose Query Editor.

On the data explorer, under Lakehouse, choose AwsDataCatalog. Navigate to the table venue_event_agg.
From the three dots icon, choose Query with Athena.

Four records will be returned, as shown in the following screenshot. This indicates you succeeded in querying the output table written by the visual ETL flow.

Use generative AI to generate a visual ETL flow

The preceding steps were performed one at a time on the visual console. SageMaker Unified Studio can also automate the flow authoring steps by using generative AI powered by Amazon Q.

On the top left menu, choose Visual ETL flows.
Choose Create visual ETL flow.
Enter the following text and choose Submit.

Create a flow to connect 2 Glue catalog tables venue and event in database glue_db, join on event id , filter on venue state with condition as venuestate=='DC' and write output to a S3 location

This creates the following boilerplate flow that you can edit to quickly author the visual ETL flow.

The generated flow keeps the context of the prompt at the node level.

Clean Up

To avoid incurring future charges, clean up the resources you created during this walkthrough:

In the SQL querybook, run the following SQL to drop the table:

drop table venue_event_agg

To delete the flow, under Actions, choose Delete flow.

Conclusion

This post demonstrated how you can use Amazon SageMaker Unified Studio to build a low-code no-code (LCNC) visual ETL flow. This allows for seamless data ingestion and transformation across multiple data sources.

To learn more, refer to our documentation and the AWS News Blog.

About the Authors

Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.

Noritaka Sekiyama is a Principal Big Data Architect with AWS Analytics services. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Alexandra Tello is a Senior Front End Engineer with the AWS Analytics services in New York City. She is a passionate advocate for usability and accessibility. In her free time, she’s an espresso enthusiast and enjoys building mechanical keyboards.

Ranu Shah is a Software Development Manager with AWS Analytics services. She loves building data analytics features for customers. Outside work, she enjoys reading books or listening to music.

Gal Heyne is a Technical Product Manager for AWS Analytics services with a strong focus on AI/ML and data engineering. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design simple-to-use data products.

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

2024-12-04 Shovan Kanjilal

Post Syndicated from Shovan Kanjilal original https://aws.amazon.com/blogs/big-data/simplify-data-integration-with-aws-glue-and-zero-etl-to-amazon-sagemaker-lakehouse/

With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. While traditional extract, transform, and load (ETL) processes have long been a staple of data integration due to their flexibility, for common use cases such as replication and ingestion they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.

In addition, as organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge. Valuable information is often scattered across multiple repositories, including databases, applications, and other platforms. To harness the full potential of their data, businesses must enable seamless access and consolidation from these varied sources. However, this task is complicated by the unique characteristics of modern systems, such as differing API protocols, implementations, and rate limits. To address these challenges and accelerate innovation, AWS Glue has recently expanded its third-party application support by introducing native connectors for 19 applications.

To utilize these new application connectors for well-defined use cases such as replication and ingestion, AWS Glue is also launching zero-ETL integration support from external applications. With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in Amazon SageMaker Lakehouse and Amazon Redshift.

Amazon SageMaker Lakehouse unifies all your data across Amazon S3 data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines. By directly integrating with Lakehouse, all the data is automatically cataloged and can be secured through fine-grained permissions in Lake Formation.

What is zero-ETL?

Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines. It makes data available in Amazon SageMaker Lakehouse and Amazon Redshift from multiple operational, transactional, and enterprise sources. Extract, transform, and load (ETL) is the process of combining, cleaning, and normalizing data from different sources to prepare it for analytics, artificial intelligence (AI), and machine learning (ML) workloads. You don’t need to maintain complex ETL pipelines. We take care of the ETL for you by automating the creation and management of data replication.

What’s the difference between zero-ETL and Glue ETL?

AWS Glue now offers multiple ways for you to build data integration pipelines, depending on your integration needs.

Zero-ETL provides service-managed replication. It’s designed for scenarios where customers need a fully managed, efficient way to replicate data from one source to AWS with minimal configuration. Zero-ETL handles the entire replication process, including schema discovery and evolution, without requiring customers to write or manage any custom logic. This approach is ideal for creating up-to-date replicas of source data in near-real-time, with AWS managing the underlying infrastructure and replication process.
Glue ETL offers customer-managed data ingestion. It’s the preferred choice when customers need more control and customization over the data integration process or require complex transformations. With Glue ETL, customers can write custom transformation logic, combine data from multiple sources, apply data quality rules, add calculated fields, and perform advanced data cleansing or aggregation. This flexibility makes Glue ETL suitable for scenarios where data must be transformed or enriched before analysis.

It’s worth mentioning that the source connections are reusable between Glue ETL and Glue zero-ETL, so you can easily support both patterns. After you create a connection once, you can use the same connection across AWS Glue components including Glue ETL, Glue Visual ETL, and zero-ETL. For example, you might start by creating a connection and a zero-ETL integration, but decide later to use the same connection to create a custom Glue ETL pipeline.

This blog post will explore how zero-ETL capabilities combined with its new application connectors are transforming the way businesses integrate and analyze their data from popular platforms such as ServiceNow, Salesforce, Zendesk, SAP and others.

Use case

Consider a large company that relies heavily on data-driven insights to optimize its customer support processes. The company stores vast amounts of transactional data in ServiceNow. To gain a comprehensive understanding of their business and make informed decisions, the company needs to integrate and analyze data from ServiceNow seamlessly, identifying and addressing problems and root causes, managing service level agreements and compliance, and proactively planning for incident prevention.

The company is looking for an efficient, scalable, and cost-effective solution for collecting and ingesting data from ServiceNow that provides continuous near real-time replication, automated availability of new data attributes, robust monitoring capabilities to track data load statistics, and a reliable data lake foundation that supports data versioning. This allows data analysts, data engineers, and data scientists to quickly explore ingested data and develop data products that meet the needs of business teams.

Solution overview

The following architecture diagram illustrates an efficient and scalable solution for collecting and ingesting replicated data from ServiceNow with zero-ETL integration. In this example we use ServiceNow as a source, but this can be done with any supported source such as Salesforce, Zendesk, SAP, or others. The AWS Glue managed connectors act as a bridge between ServiceNow and the target Amazon SageMaker Lakehouse, enabling seamless, near real-time data flow without the need for custom ETL and scheduling.

The following are the key components and steps in the integration process:

Zero-ETL extracts and loads the data into Amazon S3, a highly scalable object storage service. The data is also registered in the Glue Data Catalog, a metadata repository. Additionally, it keeps the information synchronized by capturing changes that occur in ServiceNow and maintains data consistency by automatically performing schema evolution.
Amazon CloudWatch, a monitoring and observability service, collects logs and metrics from the data integration process.
Amazon EventBridge, a serverless event bus service, triggers a downstream process that allows you to build event-driven architecture as soon as your new data arrives in your target. Through EventBridge, customers can build on top of zero-ETL for a diverse set of use cases such as:
- Trigger Glue ETL to perform transformations and aggregations on the data to create specific analysis.
- Trigger a Directed Acyclic Graph (DAG) in Amazon Managed Workflows for Apache Airflow (Amazon MWAA).
- Trigger a state machine in AWS Step Functions.
- Notify the status of replications and their details to downstream applications.

Prerequisites

Complete the following prerequisites before setting up the solution:

Create a bucket in Amazon S3 called zero-etl-demo-<your AWS Account Number>-<AWS Region> (for example, zero-etl-demo-012345678901-us-east-1). The bucket will be used to store the data ingested by zero-ETL in Apache Iceberg, an open table format (OTF) that supports ACID (atomicity, consistency, isolation, and durability) transactions, seamless schema evolution, and data versioning using time travel.
Create an AWS Glue database <your database name>, such as zero_etl_demo_db and associate the S3 bucket zero-etl-demo-<your AWS Account Number>-<AWS Region> as a location of the database. The database will be used to store the metadata related to the data integrations performed by zero-ETL.
Update AWS Glue Data Catalog settings using the following IAM policy for fine-grained access control of the data catalog for zero-ETL.
Create an AWS Identity and Access Management (IAM) role named zero_etl_demo_role. The IAM role will be used by zero-ETL to access the Glue connector to read from ServiceNow and write the data into the target. Optionally, you can create two separate IAM roles (one associated with your source data and another associated with your target).
Make sure you have a ServiceNow instance named ServiceNowInstance, a user named ServiceNowUser, and a password ServiceNowPassword with the required permissions to read from ServiceNow. The instance name, user, and password are used in the AWS Glue connection to authenticate with ServiceNow using the BASIC authentication type. Optionally, you can choose OAUTH2 if your ServiceNow instance supports it.
Create the secret zero_etl_demo_secret in AWS Secrets Manager to store ServiceNow credentials.
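The bucket naming convention from the first prerequisite can be generated rather than typed by hand. The following is a minimal sketch; the function name `zero_etl_bucket_name` is illustrative, and the account ID is the example value used above.

```python
def zero_etl_bucket_name(account_id, region):
    """Build the bucket name zero-etl-demo-<account>-<region> used in this post."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return f"zero-etl-demo-{account_id}-{region}"


bucket = zero_etl_bucket_name("012345678901", "us-east-1")
print(bucket)
print(f"s3://{bucket}/")  # location to associate with the AWS Glue database
```

The account ID is handled as a string so that a leading zero, as in the example, is preserved.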

Build and verify the zero-ETL integration

Complete the following steps to create and validate zero-ETL integration:

Step 1: Set up a connector

Zero-ETL integration, when used with the application connectors that AWS Glue natively supports, provides a straightforward way to bring third-party data into an Amazon S3 transactional data lake or Amazon Redshift. Use the following steps to create a ServiceNow data connection:

Open the AWS Glue console.
In the navigation pane, under Data catalog, choose Connections.
Choose Create Connection.
In the Create Connection pane, enter ServiceNow in Data Sources.
Choose ServiceNow.
Choose Next.
For Instance Name, enter ServiceNowInstance (created as part of the prerequisites).
For IAM service role, choose the zero_etl_demo_role (created as part of the prerequisites).
For Authentication Type, choose the authentication type that you’re using for ServiceNow. In this example, we have chosen OAUTH2, which requires setting up Application Registries in ServiceNow.
For AWS Secret, choose the secret zero_etl_demo_secret (created as part of the prerequisites).
Choose Next.
In the Connection Properties section, for Name, enter zero_etl_demo_conn.
Choose Next.
Choose Create connection.
There will be a popup from ServiceNow after you choose Create connection. Choose Allow.

Step 2: Set up Zero-ETL integration

After creating the data connection to ServiceNow, use the following steps to create the zero-ETL integration:

Open the AWS Glue console.
In the navigation pane, under Data catalog, choose Zero-ETL integrations.
Choose Create zero-ETL integration.
In the Create integration pane, enter ServiceNow in Data Sources.
Choose ServiceNow.
Choose Next.
For ServiceNow connection, choose the data connection created in Step 1 (zero_etl_demo_conn).
For Source IAM role, choose the zero_etl_demo_role (from the prerequisites).
For ServiceNow objects, choose the objects that you want the zero-ETL integration to ingest. For this post, choose the problem and incident objects.
For Namespace or Database, choose <your database name>. In this example, we use the zero_etl_demo_db (from the prerequisites).
For Target IAM role, choose the zero_etl_demo_role (from the prerequisites).
Choose Next.
For Security and data encryption, you can choose either an AWS managed KMS key or a customer managed key in AWS Key Management Service (AWS KMS). For this post, choose Use AWS managed KMS key.
In the Integration details section, for Name, enter zero-etl-demo-integration.
Choose Next.
Review the details and choose Create and launch integration.
The newly created integration will show as Active in about a minute.

Step 3: Verify the initial SEED load

The SEED load refers to the initial loading of the tables that you want to ingest into Amazon SageMaker Lakehouse using zero-ETL integration. The status and statistics of the SEED load are published to CloudWatch, and the data ingested by zero-ETL integration can be accessed in AWS using services such as Amazon SageMaker Unified Studio, Amazon QuickSight, and others. Use the following steps to access zero-ETL integration logs and query the data:

Open the AWS Glue console.
In the navigation pane, choose Zero-ETL integrations.
In the Zero-ETL integrations section, choose zero-etl-demo-integration.
In the Activity summary (all time) section, choose CloudWatch logs.
Check CloudWatch log events for the SEED load. For each table ingested by the zero-ETL integration, two groups of logs are created: status and statistics. The statistics appear in IngestionTableStatistics, highlighted in the following screenshot. The insertCount represents how many rows were extracted and loaded by the zero-ETL integration. For the SEED load, you will see only insertCount because it’s the initial load. In addition, IngestionCompleted contains information about the zero-ETL integration such as status, load type, and message.
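As a rough illustration of how these statistics could be totaled once the log events are exported, the following Python sketch sums the counters named in this post (insertCount, updateCount, deleteCount). It assumes each log message is a flat JSON object carrying those fields; the actual event layout may be nested differently, so treat the parsing as hypothetical.

```python
import json

# Counter names as they appear in IngestionTableStatistics in this post.
COUNT_FIELDS = ("insertCount", "updateCount", "deleteCount")


def summarize_ingestion_stats(messages):
    """Sum the row counters found in a list of JSON log message strings.

    Assumption: each message is a flat JSON object; the real log layout
    may differ and would need its own parsing.
    """
    totals = {field: 0 for field in COUNT_FIELDS}
    for raw in messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(event, dict):
            continue  # skip JSON values that are not objects
        for field in COUNT_FIELDS:
            totals[field] += int(event.get(field, 0))
    return totals


# Hypothetical messages for a SEED load of two tables.
seed_totals = summarize_ingestion_stats(
    ['{"insertCount": 10}', "plain text line", '{"insertCount": 5}']
)
```

For a SEED load, only the insertCount total should be nonzero.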

To validate the SEED load, query the data using Amazon SageMaker Unified Studio.

Access Amazon SageMaker Unified Studio for your specific domain through the AWS Management Console.
Open the Amazon SageMaker Unified Studio URL.
Sign in with SSO or AWS IAM user.
Select your project.
Go to Data from the left menu, expand the Lakehouse AWSDataCatalog, expand your database, and select the incident table. Click the ⋮ icon and select Query with Athena.

For Query, enter the following statement:

SELECT count(*) AS incidents_count
FROM "zero_etl_demo_db"."incident"

Choose Run.
Let’s check an existing incident in ServiceNow. You will later update this incident’s description in ServiceNow to validate change data capture (CDC). In the query editor pane, for Query, enter the following statement:
```
SELECT number
, short_description
, description
FROM "zero_etl_demo_db"."incident"
WHERE number = 'INC0000003' -- update to your Incident number
```
Choose Run.
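If you prefer to run the same check outside the console, the query can also be submitted to Athena programmatically. The following sketch only builds the request for the boto3 Athena `start_query_execution` call; the output location is a hypothetical bucket, and the default catalog name is an assumption that may not match a SageMaker Unified Studio project.

```python
def build_athena_request(sql, database, output_location, catalog="AwsDataCatalog"):
    """Build keyword arguments for the Athena start_query_execution call.

    The catalog default and the output location are assumptions for
    illustration; adjust them to your own environment.
    """
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


INCIDENT_SQL = (
    "SELECT number, short_description, description "
    'FROM "zero_etl_demo_db"."incident" '
    "WHERE number = 'INC0000003'"  # update to your incident number
)

request = build_athena_request(
    INCIDENT_SQL,
    database="zero_etl_demo_db",
    output_location="s3://example-bucket/athena-results/",  # hypothetical bucket
)

# To submit the query (requires AWS credentials):
# import boto3
# boto3.client("athena").start_query_execution(**request)
```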

Step 4: Validate CDC

CDC is a technique used to identify and process only the data that has changed in a source system since the last extraction. Instead of reloading an entire dataset, CDC captures and transfers only the new, updated, or deleted records into the target system, making data processing more efficient and reducing load times. The status and statistics of the CDC load are published to CloudWatch. For this post, you will use Amazon SageMaker Unified Studio to query the ingested data. For the next step in this example, you will select an incident and perform an update in ServiceNow, changing the short_description and description of the incident. Then use the following steps to access the zero-ETL integration logs and query the ingested data.

To demonstrate a CDC event, in this post we edit one incident and delete one incident in ServiceNow.
Open the AWS Glue console.
In the navigation pane, under Data catalog, choose Zero-ETL integrations.
In the Zero-ETL integrations section, choose zero-etl-demo-integration.
In the Activity summary (all time) section, choose CloudWatch logs.
Zero-ETL integration replicates the changes to the Amazon S3 transactional data lake every 60 minutes by default. Check CloudWatch log events for the CDC load. In IngestionTableStatistics, shown in the following figure, review updateCount and deleteCount for each object managed by the zero-ETL integration. These counts reflect the updates and deletes from ServiceNow that were applied to the transactional data lake.
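To make the idea behind these counters concrete, the following Python sketch applies a batch of change events to an in-memory table and reports how many inserts, updates, and deletes were applied. It is a conceptual illustration only; the event shape (`op`, `number`, `fields`) is hypothetical and is not how the zero-ETL integration represents changes internally.

```python
def apply_cdc_events(target, events):
    """Apply change events to a dict keyed by incident number.

    Returns counters named like the ones in IngestionTableStatistics.
    The event format used here is an assumption for illustration.
    """
    stats = {"insertCount": 0, "updateCount": 0, "deleteCount": 0}
    for event in events:
        key = event["number"]
        if event["op"] == "delete":
            # Count a delete only if the record actually existed.
            if target.pop(key, None) is not None:
                stats["deleteCount"] += 1
        elif event["op"] == "upsert":
            if key in target:
                target[key] = {**target[key], **event["fields"]}
                stats["updateCount"] += 1
            else:
                target[key] = dict(event["fields"])
                stats["insertCount"] += 1
        else:
            raise ValueError(f"unknown operation: {event['op']}")
    return stats


# One incident edited and one deleted, as in this walkthrough.
incidents = {
    "INC0000003": {"short_description": "old text", "description": "old text"},
    "INC0000004": {"short_description": "to be deleted", "description": ""},
}
cdc_stats = apply_cdc_events(
    incidents,
    [
        {"op": "upsert", "number": "INC0000003", "fields": {"short_description": "new text"}},
        {"op": "delete", "number": "INC0000004"},
    ],
)
```

Only the two changed records are touched; every other row in the target stays as it was, which is what keeps CDC loads smaller than a full reload.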

To validate the CDC load, query the data using Amazon SageMaker Unified Studio.

You can go back to Amazon SageMaker Unified Studio.

For Query, enter the following statement:

SELECT count(*) AS incidents_count
FROM "zero_etl_demo_db"."incident"

For Query, enter the following statement to compare the incident with the snapshot results you recorded before CDC:

SELECT number
    , short_description
    , description
FROM "zero_etl_demo_db"."incident"
WHERE number = 'INC0000003' -- update to your Incident number

Choose Run and confirm that the short_description and description attributes of the record were updated.
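If you saved the row returned before the update, the comparison can be done mechanically. The following minimal Python sketch reports which attributes differ between two versions of a row; the sample values are made up for illustration.

```python
def changed_fields(before, after):
    """Return {field: (old_value, new_value)} for every field that differs."""
    fields = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in fields
        if before.get(name) != after.get(name)
    }


# Hypothetical row values captured before and after the CDC load.
row_before = {"number": "INC0000003", "short_description": "Wifi down", "description": "Old details"}
row_after = {"number": "INC0000003", "short_description": "Wifi restored", "description": "New details"}

diff = changed_fields(row_before, row_after)
```

An empty result would mean the CDC load has not yet picked up the change.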

By following these steps, you can effectively set up, build, and verify a zero-ETL job using the new AWS Glue application connector for ServiceNow. This process demonstrates the simplicity and efficiency of the zero-ETL approach in integrating application data into your AWS environment.

Apache Iceberg Time Travel: Enhancing data versioning in zero-ETL

One of the benefits of using Apache Iceberg in zero-ETL integration is the ability to perform Time Travel. This feature allows you to access and query historical versions of your data. With Iceberg Time Travel, you can roll back to previous data states, compare data across different points in time, or recover from accidental data changes. In the context of zero-ETL integrations, this capability becomes particularly valuable when dealing with rapidly changing application data.

To demonstrate this feature, let’s consider a scenario where you’re analyzing ServiceNow incident data ingested through zero-ETL integration using Amazon SageMaker Unified Studio. Here’s an example query that showcases Iceberg time travel:

-- Query incident data as of a particular timestamp before CDC
-- (update the timestamp value to a time before your CDC update)
SELECT number,
    short_description,
    description
FROM "zero_etl_demo_db"."incident"
FOR TIMESTAMP AS OF TIMESTAMP '2024-11-06 05:10:00 UTC'
WHERE number = 'INC0000003'; -- update to your Incident number

-- Compare with current data
SELECT number,
    short_description,
    description
FROM "zero_etl_demo_db"."incident"
WHERE number = 'INC0000003'; -- update to your Incident number

In this example:

The first query uses the FOR TIMESTAMP AS OF clause for time travel queries on Iceberg tables. It retrieves incident data as it existed before the CDC update for the specific incident number INC0000003.
The second query fetches the current state of the data for the same incident number.

This capability allows you to track the evolution of incidents, identify trends in resolution times, or recover information that may have been inadvertently altered.
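Because the timestamp literal has to be adjusted for every environment, it can help to generate the time travel query from a timezone-aware datetime. The following sketch builds the same FOR TIMESTAMP AS OF statement shown above; the helper name and its validation rules are assumptions added for illustration.

```python
from datetime import datetime, timezone


def time_travel_query(table, incident_number, as_of):
    """Build a time travel query for one incident as of a given instant.

    `table` is trusted input (for example '"zero_etl_demo_db"."incident"').
    `incident_number` is restricted to letters and digits because it is
    interpolated into the SQL text.
    """
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    if not incident_number.isalnum():
        raise ValueError("incident_number must be alphanumeric")
    # Normalize to UTC so the literal matches the '... UTC' suffix.
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return (
        "SELECT number, short_description, description\n"
        f"FROM {table}\n"
        f"FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC'\n"
        f"WHERE number = '{incident_number}'"
    )


query = time_travel_query(
    '"zero_etl_demo_db"."incident"',
    "INC0000003",  # update to your incident number
    datetime(2024, 11, 6, 5, 10, tzinfo=timezone.utc),  # a time before your CDC update
)
```

Requiring a timezone-aware value avoids silently querying the wrong snapshot when the local clock is not in UTC.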

Clean up

To avoid incurring future charges, clean up the resources used in this post from your AWS account by completing the following steps:

Delete zero-ETL integration zero-etl-demo-integration.
Delete content from the S3 bucket zeroetl-etl-demo-<your AWS Account Number>-<AWS Region>.
Delete the Data Catalog database zero_etl_demo_db.
Delete the Data Catalog connection zero_etl_demo_conn.
Delete the AWS Secrets Manager secret zero_etl_demo_secret.
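The cleanup steps above could also be scripted. The following Python sketch takes already-created boto3 clients for AWS Glue and AWS Secrets Manager and issues the delete calls in the same order as the list. The integration identifier must be supplied by you, emptying the S3 bucket is left out, and the sketch assumes your boto3 version is recent enough to include the Glue integration APIs.

```python
def clean_up(glue, secrets, integration_identifier,
             database="zero_etl_demo_db",
             connection="zero_etl_demo_conn",
             secret_id="zero_etl_demo_secret"):
    """Delete the demo resources using the given Glue and Secrets Manager clients.

    The S3 bucket content is not handled here and must be removed separately.
    """
    glue.delete_integration(IntegrationIdentifier=integration_identifier)
    glue.delete_database(Name=database)
    glue.delete_connection(ConnectionName=connection)
    # By default Secrets Manager schedules the deletion with a recovery window.
    secrets.delete_secret(SecretId=secret_id)


# Example call (requires AWS credentials; the identifier is a placeholder):
# import boto3
# clean_up(boto3.client("glue"), boto3.client("secretsmanager"),
#          integration_identifier="<your integration ARN>")
```

Passing the clients in as arguments keeps the function easy to dry-run with stand-in objects before pointing it at a real account.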

Conclusion

As the pace of business continues to accelerate, the ability to quickly and efficiently integrate data from various applications and enterprise platforms has become a critical competitive advantage. By adopting a zero-ETL integration powered by AWS Glue and its new set of managed connectors, your organization can unlock the full potential of its data across multiple platforms faster and stay ahead of the curve.

To learn more about how Amazon SageMaker Lakehouse can help your organization streamline its data integration efforts, visit Amazon SageMaker Lakehouse.

Get started with zero-ETL on AWS by creating a free account today!

About the authors

Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure and high-performance data solutions in the cloud.

Vivek Pinyani is a Data Architect at AWS Professional Services with expertise in Big Data technologies. He focuses on helping customers build robust and performant Data Analytics solutions and Data Lake migrations. In his free time, he loves to spend time with his family and enjoys playing cricket and running.

Kartikay Khator is a Solutions Architect within Global Life Sciences at AWS, where he dedicates his efforts to developing innovative and scalable solutions that cater to the evolving needs of customers. His expertise lies in harnessing the capabilities of AWS analytics services. Extending beyond his professional pursuits, he finds joy and fulfillment in the world of running and hiking. Having already completed multiple marathons, he is currently preparing for his next marathon challenge.

Caio Sgaraboto Montovani is a Sr. Specialist Solutions Architect, Data Lake and AI/ML within AWS Professional Services, developing scalable solutions according to customer needs. His vast experience has helped customers in different industries such as life sciences and healthcare, retail, banking, and aviation build solutions in data analytics, machine learning, and generative AI. He is passionate about rock and roll and cooking and loves to spend time with his family.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!

Catalog and govern Amazon Athena federated queries with Amazon SageMaker Lakehouse

2024-12-04 Sandeep Adwankar

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/catalog-and-govern-amazon-athena-federated-queries-with-amazon-sagemaker-lakehouse/

Yesterday, we announced Amazon SageMaker Unified Studio (Preview), an integrated experience for all your data and AI, and Amazon SageMaker Lakehouse to unify data, from Amazon Simple Storage Service (Amazon S3) to third-party sources such as Snowflake. We’re excited by how SageMaker Lakehouse helps break down data silos, but we also know customers don’t want to compromise on data governance or introduce security and compliance risks as they expand data access.

With this new capability, data analysts can now securely access and query data stored outside S3 data lakes, including Amazon Redshift data warehouses and Amazon DynamoDB databases, all through a single, unified experience. Administrators can now apply access controls at different levels of granularity to ensure sensitive data remains protected while expanding data access. This allows organizations to accelerate data initiatives while maintaining security and compliance, leading to faster, data-driven decision-making.

In this post, we show how to connect to, govern, and run federated queries on data stored in Redshift, DynamoDB (Preview), and Snowflake (Preview). To query our data, we use Athena, which is seamlessly integrated with SageMaker Unified Studio. We use SageMaker Lakehouse to present data to end-users as federated catalogs, a new type of catalog object. Finally, we demonstrate how to use column-level security permissions in AWS Lake Formation to give analysts access to the data they need while restricting access to sensitive information.

Background

As data volumes grow, organizations often employ specialized storage systems to achieve optimal performance and cost-efficiency with different use cases. However, this approach can result in data silos, and makes it challenging to gain insights from data for several reasons. First, end-users often have to set up connections to data sources on their own. This is challenging because of configuration details that vary by source and technical connectivity properties they may not have access to. Second, data sources often have their own built-in access controls, which fragments data governance. Lastly, copying data from one storage system to another for the purposes of analysis adds cost and creates duplication risks.

SageMaker Lakehouse streamlines connecting to, cataloging, and managing permissions on data from multiple sources. It integrates with SageMaker Unified Studio, Athena, and other popular tools to give flexibility to end-users to work with data from their preferred tools.

As you create connections to data, SageMaker Lakehouse creates the underlying catalogs, databases, and tables, and integrates these resources with Lake Formation. Administrators can then define and centrally manage fine-grained access controls on these resources, without having to learn different access management concepts for each data source.

With the right access permissions in place, data discovery and analytics workflows are streamlined. Data analysts no longer need to connect to data sources on their own, saving time and frustration from setting up connectors with configurations that vary by source. Instead, analysts can simply run SQL queries on federated data catalogs, seamlessly accessing diverse data for various needs, which accelerates insights and enhances productivity.

Solution overview

This post presents a solution where a company is using multiple data sources containing customer data. Analysts want to query this data for analytics and AI and machine learning (ML) workloads. However, regulations require personally identifiable information (PII) data to be secured. The following diagram illustrates the solution architecture.

In our use case, an administrator is responsible for data governance and has administrator-level access to data sources, including Redshift, DynamoDB, and Snowflake. Existing regulations require administrators to safeguard sensitive PII data, such as customer mobile phone numbers, which are stored in multiple places. At the same time, business stakeholders in data analyst job functions need access to these databases because they contain valuable business data for gaining insight into business health.

We will use an administrator account to create connections to Redshift, DynamoDB, and Snowflake, register these as catalogs in SageMaker Lakehouse, and then set up fine-grained access controls using Lake Formation. When complete, we use a data analyst account to query the data with Athena and confirm that the analyst cannot access data the role is not entitled to.

Prerequisites

Make sure you have the following prerequisites:

An AWS account with permission to create IAM roles and IAM policies
An AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI)
Administrator access to SageMaker Lakehouse and the following roles:
- Administrator role
- Data analyst role
A SageMaker Unified Studio domain and two projects using the SQL Analytics profile. To learn more, refer to the Amazon SageMaker Unified Studio Administrator Guide.
- An Admin project will be used to create connections
- A Data Analyst project will be used to analyze data and will include both administrator and analysts as members. Take note of the IAM role in the Data Analyst project from the Project Overview page. This IAM role will be referenced when granting access later on.
Administrator access to one or more of the following data sources, with the data sources set up as shown in Appendix A and Appendix B:
- Redshift
- DynamoDB
- Snowflake

Set up federated catalogs

The first step is to set up federated catalogs for our data sources using an administrator account. The section below walks you through the end-to-end process with DynamoDB and demonstrates how to query the data when setup is complete. When you are done setting up and exploring the DynamoDB data, repeat these steps for Redshift and Snowflake.

On the SageMaker Unified Studio console, open your project.
Choose Data in the navigation pane.
In the data explorer, choose the plus icon to add a data source.
Under Add a data source, choose Add connection, then choose Amazon DynamoDB.
Enter your connection details, and choose Add data source.

Next, SageMaker Unified Studio connects to your data source, registers the data source as a federated catalog with SageMaker Lakehouse, and displays it in your data explorer.

To explore and query your data, click any SageMaker Lakehouse catalog to view its contents. Use the data explorer to drill down to a table and use the Actions menu to select Query with Athena.

This brings you to the query editor where your sample query is executed. Here, try different SQL statements to better understand your data and to gain familiarity with query development features in SageMaker Unified Studio. To learn more, see SQL analytics in the Amazon SageMaker Unified Studio User Guide.

Similarly, you can set up data source connections for Redshift and Snowflake and query the data. Refer to Appendix B, which contains screenshots capturing the details needed to create the connection and data catalog for Redshift and Snowflake sources.

Set up fine-grained access permissions on federated catalogs

Our next step is to set up access permissions on our federated catalogs. As mentioned in the prerequisites, you have already set up an IAM role with data analyst permissions and a SageMaker Unified Studio data analyst project. We will grant permissions to the data analyst role and the SageMaker Unified Studio data analyst project role to ensure that the access controls you specify are enforced when the data is queried. The following steps show how to set up permissions on a Redshift federated catalog, but the steps are the same for each data source.

Navigate to Lake Formation in the AWS Management Console as an administrator.
In the Lake Formation console, under Data Catalog in the navigation pane, choose Catalogs. Here, you will see the federated catalogs that were set up previously in SageMaker Unified Studio.
Choose the federated catalog that you wish to set up permissions for. Here, you can see details for the catalog and any associated databases and tables, and manage permissions.
From the Actions menu, choose Grant to grant permissions to the data analyst role and the SageMaker Unified Studio data analyst project role.
In Catalogs, choose the federated catalog name for the source you wish to grant permissions on.
In Databases, choose your Redshift schema, Snowflake schema, or default for DynamoDB.
In Database permissions, select Describe.
Choose Grant.

The next step is to grant permissions on the tables to the data analyst role and the SageMaker Unified Studio data analyst project role. For this solution, assume you wish to restrict access to a sensitive column containing the mobile phone number for each customer.

In the Actions menu, choose Grant.
In Catalogs, choose your federated catalog.
In Databases, choose your Redshift schema, Snowflake schema, or default for DynamoDB.
In Tables, choose your tables.
In Table permissions, choose Select.
In Data permissions, choose Column-based access.
In Choose permission filter, choose Include columns.
In Select columns, choose one or more columns.
Choose Grant.
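The two console grants described in this section correspond to two calls to the Lake Formation `grant_permissions` API. The following Python sketch only builds the request arguments for boto3; the principal ARN, catalog identifier, and schema name are hypothetical, and the exact catalog identifier format for a federated catalog is an assumption you should verify in your account.

```python
def describe_database_grant(principal_arn, catalog_id, database):
    """Arguments for granting DESCRIBE on a database to a principal."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"CatalogId": catalog_id, "Name": database}},
        "Permissions": ["DESCRIBE"],
    }


def select_columns_grant(principal_arn, catalog_id, database, table, columns):
    """Arguments for granting SELECT on an explicit list of included columns."""
    if not columns:
        raise ValueError("at least one column must be included")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical identifiers; replace them with values from your environment.
ANALYST_ROLE = "arn:aws:iam::111122223333:role/DataAnalystRole"
CATALOG_ID = "111122223333:example_redshift_catalog"

db_grant = describe_database_grant(ANALYST_ROLE, CATALOG_ID, "public")
# Include every column except the sensitive mobile column.
table_grant = select_columns_grant(
    ANALYST_ROLE, CATALOG_ID, "public", "customer_rs", ["cust_id", "zipcode"]
)

# To apply the grants (requires AWS credentials):
# import boto3
# lf = boto3.client("lakeformation")
# lf.grant_permissions(**db_grant)
# lf.grant_permissions(**table_grant)
```

Listing the included columns explicitly mirrors the Include columns filter in the console: any column not named, such as mobile, stays hidden from the analyst.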

You have successfully set up fine-grained access permissions on your Redshift federated catalog. Repeat these steps to add permissions on your DynamoDB and Snowflake federated catalogs.

Validate fine-grained access permissions on federated catalogs

Now that you have set up federated catalogs with fine-grained access permissions, it’s time to run queries to confirm access permissions are working as expected.

First, access SageMaker Unified Studio using the data analyst role and navigate to your project. Select Query Editor from the Build menu, and choose the DynamoDB catalog in the Data explorer. Next, drill down to a table and choose Query with Athena to run a sample query. The permissions work as expected: the query result does not include the mobile phone number column that was visible before.

Next, query the Redshift data source and note how the mobile phone number is not included in the query result.

Lastly, query the Snowflake data source and, like the previous examples, note how the result does not include the mobile phone number column.
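The same check can be automated against the response of the Athena `get_query_results` API, which lists the returned columns under `ResultSet.ResultSetMetadata.ColumnInfo`. The following sketch inspects such a response for restricted columns; the sample response is hand-written for illustration and contains only the fields the helper reads.

```python
def visible_columns(response):
    """Return the column names present in an Athena get_query_results response."""
    column_info = response["ResultSet"]["ResultSetMetadata"]["ColumnInfo"]
    return [column["Name"] for column in column_info]


def leaked_columns(response, restricted):
    """Return restricted column names that still appear in the result."""
    restricted_lower = {name.lower() for name in restricted}
    return [name for name in visible_columns(response) if name.lower() in restricted_lower]


# Hand-written stand-in for a response seen by the data analyst role.
analyst_response = {
    "ResultSet": {
        "Rows": [],
        "ResultSetMetadata": {
            "ColumnInfo": [{"Name": "cust_id"}, {"Name": "zipcode"}]
        },
    }
}

leaks = leaked_columns(analyst_response, restricted=["mobile"])
```

An empty list means the column-level permission is being enforced for that source.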

In this example, we demonstrated how to set up a basic column-level filter to restrict access to sensitive data. However, SageMaker Lakehouse supports a broad range of fine-grained access control scenarios beyond column filters that allow you to meet complex security and compliance requirements across diverse data sources. To learn more, see Managing Permissions.

Clean up

Make sure you remove the SageMaker Lakehouse resources to mitigate any unexpected costs. Start by deleting the connections, catalogs, underlying data sources, projects, and domain that you created for this blog. For additional details, refer to the Amazon SageMaker Unified Studio Administrator Guide.

Conclusion

In this blog post, we utilized fine-grained access controls with federated queries in Athena. We demonstrated how this feature allows flexibility in choosing the right data storage solutions for your needs while securely expanding access to data. We showed how to create federated catalogs and set up access policies with Lake Formation, and then queried data with Athena where we saw permissions enforced on different sources. This approach unified data access controls and streamlined data discovery, saving end-users valuable time. To learn more about federated queries in Athena and the data sources that support fine-grained access controls today, see Register your connection as a Glue Data Catalog in the Athena User Guide.

We encourage you to try fine-grained access controls on federated queries today in SageMaker Unified Studio, and to share your feedback with us. To learn more, see Getting started in the Amazon SageMaker Unified Studio User Guide.

Appendix A: Set up data sources

In this section, we provide the steps to set up your data sources.

Redshift

You can create a new table customer_rs in your current database with columns cust_id, mobile, and zipcode and populate with sample data using the following SQL command:

CREATE TABLE "customer_rs" AS
SELECT 6 AS "cust_id",  66666666 AS "mobile", 6000 as "zipcode"
UNION ALL SELECT 7, 77777777, 7000
UNION ALL SELECT 8,  88888888, 8000
UNION ALL SELECT 9,  99999999, 9000
UNION ALL SELECT 10, 11112222, 1100

DynamoDB

You can create a new table in DynamoDB with the partition key cust_id and the sort key zipcode through AWS CloudShell with the following command:

aws dynamodb create-table \
    --table-name customer_ddb \
    --attribute-definitions \
        AttributeName=cust_id,AttributeType=N \
        AttributeName=zipcode,AttributeType=N \
    --key-schema \
        AttributeName=cust_id,KeyType=HASH \
        AttributeName=zipcode,KeyType=RANGE \
    --provisioned-throughput \
        ReadCapacityUnits=5,WriteCapacityUnits=5 \
    --table-class STANDARD

You can populate the DynamoDB table with the following commands:

aws dynamodb put-item \
    --table-name customer_ddb \
    --item \
        '{"cust_id": {"N": "11"}, "zipcode": {"N": "2000"}, "mobile": {"N": "11113333"}}'

aws dynamodb put-item \
    --table-name customer_ddb \
    --item \
        '{"cust_id": {"N": "12"}, "zipcode": {"N": "2000"}, "mobile": {"N": "22224444"}}'

aws dynamodb put-item \
    --table-name customer_ddb \
    --item \
        '{"cust_id": {"N": "13"}, "zipcode": {"N": "3000"}, "mobile": {"N": "33335555"}}'

aws dynamodb put-item \
    --table-name customer_ddb \
    --item \
        '{"cust_id": {"N": "14"}, "zipcode": {"N": "4000"}, "mobile": {"N": "55556666"}}'
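As an alternative to repeating the CLI command, the same four items can be generated in Python and written with the boto3 DynamoDB client. The sketch below builds the items in the low-level attribute-value format, where numbers are passed as strings under the `N` type; the helper name is hypothetical.

```python
def customer_item(cust_id, zipcode, mobile):
    """Build one item in DynamoDB attribute-value format (numbers as strings)."""
    return {
        "cust_id": {"N": str(cust_id)},
        "zipcode": {"N": str(zipcode)},
        "mobile": {"N": str(mobile)},
    }


# The same sample rows used in the CLI commands above.
SAMPLE_CUSTOMERS = [
    (11, 2000, 11113333),
    (12, 2000, 22224444),
    (13, 3000, 33335555),
    (14, 4000, 55556666),
]

items = [customer_item(*row) for row in SAMPLE_CUSTOMERS]

# To write the items (requires AWS credentials):
# import boto3
# dynamodb = boto3.client("dynamodb")
# for item in items:
#     dynamodb.put_item(TableName="customer_ddb", Item=item)
```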

Snowflake

You can create your database, schema, and tables in Snowflake with the following SQL queries:

use database tasty_bytes_sample_data;
create schema "sf_schema";

CREATE TABLE "customer_sf" AS
SELECT 1 AS "cust_id", 11111111 AS "mobile", 1000 AS "zipcode"
UNION ALL SELECT 2, 22222222, 2000
UNION ALL SELECT 3, 33333333, 3000
UNION ALL SELECT 4, 44444444, 4000
UNION ALL SELECT 5, 55555555, 5000
UNION ALL SELECT 21, 12341234, 1234;

Appendix B: Connection Properties for Redshift and Snowflake

Redshift Connection Properties:

Snowflake Connection Properties:

About the Authors

Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Praveen Kumar is a Principal Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-centered services. His areas of interests are serverless technology, modern cloud data warehouses, streaming, and generative AI applications.

Scott Rigney is a Senior Technical Product Manager with AWS and has expertise in analytics, data science, and machine learning. He is passionate about building software products that enable enterprises to make data-driven decisions and drive innovation.

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

2024-12-04 G2 Krishnamoorthy

Post Syndicated from G2 Krishnamoorthy original https://aws.amazon.com/blogs/big-data/the-next-generation-of-amazon-sagemaker-the-center-for-all-your-data-analytics-and-ai/

This week on the keynote stages at AWS re:Invent 2024, you heard Matt Garman, CEO, AWS, and Swami Sivasubramanian, VP of AI and Data, AWS, speak about the next generation of Amazon SageMaker, the center for all of your data, analytics, and AI.

The relationship between analytics and AI is rapidly evolving. Our customers are telling us that they are seeing their analytics and AI workloads increasingly converge around a lot of the same data, and this is changing how they are using analytics tools with their data. They aren’t using analytics and AI tools in isolation. They’re taking data they’ve historically used for analytics or business reporting and putting it to work in machine learning (ML) models and AI-powered applications.

We want to make it streamlined for our customers to work with their data, whether for analytics or AI, help them get to AI-ready data faster, and improve productivity of all data and AI workers. The next generation of SageMaker is set to do just that.

Introducing the next generation of SageMaker

The rise of generative AI is changing how data and AI teams work together. For example, when a retail data analyst creates customer segmentation reports, those same datasets are now being used by AI teams to train recommendation engines. Or customer service teams analyzing call logs to track common issues are now using that data to train AI chatbots to handle routine inquiries. Our customers tell us that they need tools that help data and AI teams collaborate seamlessly, but they face real challenges: data is siloed and scattered across systems, they have to build and maintain complex data pipelines, and teams struggle to access and use data efficiently due to inconsistent access controls. Customers also need to make sure that their data practices remain secure, reliable, and compliant with regulations. They need data that’s not just accessible, but also trustworthy and properly governed to keep up with growing business demands and AI opportunities.

The next generation of SageMaker, an integrated experience for data, analytics, and AI, addresses these challenges and more. SageMaker brings together widely adopted AWS ML and analytics capabilities—virtually all of the components you need for data exploration, preparation, and integration; petabyte-scale big data processing; fast SQL analytics; model development and training; governance; and generative AI development. SageMaker helps you work faster and smarter with your data and build powerful analytics and AI solutions that are deeply rooted in your unique data assets, giving you an edge over the competition.

Unified tools: Collaborate and build faster with one data and AI development environment

The rapid evolution of data and AI roles demands a revolution in the services and tools that power your work, driving a need for collaboration and teamwork across your entire organization. Amazon SageMaker Unified Studio (Preview) solves this challenge by providing an integrated authoring experience to use all your data and tools for analytics and AI. Collaborate and build faster using familiar AWS tools for model development, generative AI, data processing, and SQL analytics with Amazon Q Developer, the most capable generative AI assistant for software development, helping you along the way. All your favorite functionality and tools, like standalone studios, query editors, and visual tools, are now available in one place, helping you discover and prepare data with ease, author queries or code, and get to insights faster.

SageMaker also comes with built-in generative AI powered by Amazon Q Developer that guides you along the way of your data and AI journey, transforming complex tasks into intuitive conversations. Ask questions in plain English to find the right datasets, automatically generate SQL queries, or create data pipelines without writing code. This isn’t just about making data management effortless—it’s about using AI to make your data work harder for you, unlocking insights that might otherwise remain hidden, and enabling everyone in your organization to work with data confidently, regardless of their technical expertise.

SageMaker still includes all the existing ML and AI capabilities you’ve come to know and love for data wrangling, human-in-the-loop data labeling with Amazon SageMaker Ground Truth, experiments, MLOps, Amazon SageMaker HyperPod managed distributed training, and more. Moving forward, we’ll refer to this set of AI/ML capabilities as SageMaker AI, and we’ll continue to innovate and expand on them to make sure the new SageMaker remains the premier center for building, training, and deploying AI models. With improved access and collaboration, you’ll be able to create and securely share analytics and AI artifacts and bring data and AI products to market faster.

Unified data: Reduce data silos with an open lakehouse to unify all your data

We see organizations embarking on digital transformations and needing to quickly adapt to ever-evolving customer demands. In doing so, a unified view across all their data is required—one that breaks down data silos and simplifies data usage for teams, without sacrificing the depth and breadth of capabilities that make AWS tools unbelievably valuable. This balance between unification and maintaining advanced capabilities is key to supporting our customers’ ongoing innovation and adaptability in a rapidly changing technological landscape.

Amazon SageMaker Lakehouse, now generally available, unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. This innovation drives an important change: you’ll no longer have to copy or move data between data lakes and data warehouses. SageMaker Lakehouse enables seamless data access directly in the new SageMaker Unified Studio and provides the flexibility to access and query your data with all Apache Iceberg-compatible tools on a single copy of analytics data. With this launch, you can query data regardless of where it is stored with support for a wide range of use cases, including analytics, ad-hoc querying, data science, machine learning, and generative AI. You’ll get a single unified view of all your data for your data and AI workers, regardless of where the data sits, breaking down your data silos. We’ve simplified data architectures, saving you time and costs on unnecessary data movement, data duplication, and custom solutions.

Additionally, we are advancing towards a zero-ETL future by expanding integrations that make data from multiple operational, transactional, and application sources available in SageMaker Lakehouse and Amazon Redshift. Zero-ETL integrations simplify data movement and ingestion, enabling increased agility, reduced costs, and minimized operational overhead while providing near real-time insights for AI and ML initiatives. All the existing Amazon Redshift zero-ETL integrations are seamlessly available within SageMaker—you can move transactional data from databases like Amazon Aurora, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB into Amazon Redshift without performance impact and ingest high-volume real-time data from Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK) with native streaming services integrations. We announced SageMaker Lakehouse and Amazon Redshift support for zero-ETL integrations from eight applications, including Salesforce, Zendesk, ServiceNow, Zoho CRM, Salesforce Pardot, SAP, Facebook Ads, and Instagram Ads. This new capability streamlines data replication and ingestion into a unified process, minimizing the need for custom data replication pipelines. With automatic pipeline maintenance, the solution minimizes the complexity of building in-house connectors, reduces implementation and operational costs, and accelerates insights by unifying data from diverse applications.

“We have spent the last 18 months working with AWS to transform our data foundation to use best-in-class solutions that are cost-effective as well. With advancements like SageMaker Unified Studio and SageMaker Lakehouse, we expect to accelerate our velocity of delivery through seamless access to data and services, thus enabling our engineers, analysts, and scientists to surface insights that provide material value to our business.”

– Lee Slezak, SVP of Data and Analytics, Lennar

Unified governance: Meet your enterprise security needs with built-in data and AI governance

When it comes to data and AI governance, discipline equals freedom. The right governance practices can enable your teams to move faster. Data teams struggle to find a unified approach that enables effortless discovery, understanding, and assurance of data quality and security across various sources. Our customers tell us that the fragmented nature of permissions and access controls, managed separately within individual data sources and tools, leads to inconsistent implementation and potential security risks.

SageMaker simplifies the discovery, governance, and collaboration for data and AI across your lakehouse, AI models, and applications. With Amazon SageMaker Catalog, built on Amazon DataZone, you can define and enforce access policies consistently using a single permission model with fine-grained access controls. This unified catalog enables engineers, data scientists, and analysts to securely discover and access approved data and models using semantic search with generative AI-created metadata. Collaboration is seamless, with straightforward publishing and subscribing workflows, fostering a more connected and efficient work environment.

Having confidence in your data is key. SageMaker Catalog provides comprehensive data quality capabilities, including data profiling, data quality recommendations, monitoring of data quality rules, and alerts. By combining rule-based and ML approaches, we help you reconcile entities and deliver high-quality data, giving you the tools to make confident business decisions. You’ll have trust in your data, with real-time visibility of data quality and data and ML lineage, allowing you to resolve hard-to-find quality challenges.

Beyond discovery and collaboration, SageMaker takes AI governance to the next level by providing robust safeguards and tools to develop responsible AI policies. This holistic approach not only streamlines operations, but also builds and maintains trust throughout the organization, setting a new standard for responsible and efficient AI development and deployment.

Innovate faster with the convergence of data, analytics and AI

The next generation of SageMaker delivers an integrated experience to access, govern, and act on all your data by bringing together widely adopted AWS data, analytics, and AI capabilities. Collaborate and build faster from a unified studio using familiar AWS tools for model development, generative AI, data processing, and SQL analytics, with Amazon Q Developer assisting you along the way. Access all your data, whether it’s stored in data lakes, data warehouses, or third-party or federated data sources. And move with confidence and trust with built-in governance to address enterprise security needs. The tools to transform your business are here. We’re excited to see what you’ll build next!

To learn more, check out the following AWS News blog announcements:

About the authors

G2 Krishnamoorthy is VP of Analytics, leading AWS data lake services, data integration, Amazon OpenSearch Service, and Amazon QuickSight. Prior to his current role, G2 built and ran the Analytics and ML Platform at Facebook/Meta, and built various parts of the SQL Server database, Azure Analytics, and Azure ML at Microsoft.

Rahul Pathak is VP of Relational Database Engines, leading Amazon Aurora, Amazon Redshift, and Amazon QLDB. Prior to his current role, he was VP of Analytics at AWS, where he worked across the entire AWS database portfolio. He has co-founded two companies, one focused on digital media analytics and the other on IP-geolocation.

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

2024-12-04 Leo Ramsamy

Post Syndicated from Leo Ramsamy original https://aws.amazon.com/blogs/big-data/how-anz-institutional-division-built-a-federated-data-platform-to-enable-their-domain-teams-to-build-data-products-to-support-business-outcomes/

In today’s rapidly evolving financial landscape, data is the bedrock of innovation, enhancing customer and employee experiences and securing a competitive edge. Recognizing this paradigm shift, ANZ Institutional Division has embarked on a transformative journey to redefine its approach to managing and using data, and to extracting significant business value from data insights.

Like many large financial institutions, ANZ Institutional Division operated with siloed data practices and centralized data management teams. As time went on, the limitations of this approach became apparent due to rising data complexity, larger volumes, and the growing demand for swift, business-driven insights. Consequently, the bank encountered several challenges and needed to take the following actions:

Create business insights from untapped data potential, estimated to be approximately $150 million in the Institutional Division alone
Improve operational efficiency by removing manual data handling, the use of spreadsheets, and duplicate data entries
Increase agility by making data expertise more readily available, thereby improving time to market and overall customer experience
Address data quality
Standardize tooling and remove the Shadow IT culture, driving scalability, reducing risk, and minimizing overall operational inefficiencies

These challenges are not unique to ANZ Institutional Division. Globally, financial institutions have been experiencing similar issues, prompting a widespread reassessment of traditional data management approaches.

One major trend, embraced by many financial institutions, has been the adoption of the data mesh architecture and the shift towards treating data as a product. This paradigm, pioneered by thought leaders like Zhamak Dehghani, introduces a decentralized approach to data management that aligns closely with modern organizational structures and agile methodologies.

Some notable global examples of leading companies embracing and implementing this trend are JPMorgan Chase, Capital One, and Saxo Bank.

Inspired by these global trends and driven by its own unique challenges, ANZ’s Institutional Division decided to pivot from viewing data as a byproduct of projects to treating it as a valuable product in its own right.

This shift promises several business benefits:

Empowered domain expertise – By decentralizing data ownership to domain-based teams, ANZ can use the deep business knowledge within each unit to create more relevant and valuable data products
Increased agility – Domain teams can now respond more quickly to business needs, creating and iterating on data products without relying on a centralized bottleneck
Improved data quality – With domain experts overseeing their own data, there’s a greater likelihood of catching and correcting quality issues at the source
Scalability – The federated approach allows for greater scalability, enabling ANZ to handle increasing data volumes and complexity more effectively
Innovation catalyst – By democratizing data access and empowering teams to create data products, ANZ is fostering a culture of innovation and data-driven decision-making across the organization

This transition is not just about technology; it represents a fundamental shift in how ANZ views and values its data assets. By treating data as a product, the bank is positioned to not only overcome current challenges, but to unlock new opportunities for growth, customer service, and competitive advantage.

This post explores how the shift to a data product mindset is being implemented, the challenges faced, and the early wins that are shaping the future of data management in the Institutional Division.

ANZ’s federated data strategy

In response to the challenges, ANZ Group formulated a data strategy that focuses on empowering employees to securely use data to improve the sustainability and financial well-being of their customers. At its core are the following pillars:

Introducing new ways of working that focus on generating customer value first
New technology platforms and tooling that allow the bank to collect, share, archive, and dispose of data in a secure and controlled way
Achieving consistency in how data is produced and consumed across the entire bank through data products and better-connected systems
Supporting the bank’s risk and regulatory obligations by providing a secure and resilient data platform that provides fine-grained, controlled access to quality data products

ANZ has made the strategic decision to adopt an architectural and operational model aligned with the data mesh paradigm, which revolves around four key principles: domain ownership, data as a product, a self-serve data platform, and federated computational governance.

Domain ownership recognizes that the teams generating the data have the deepest understanding of it and are therefore best suited to manage, govern, and share it effectively. This principle makes sure data accountability remains close to the source, fostering higher data quality and relevance.

Treating data as a product instils a product-centric mindset, emphasizing that data must be secure, discoverable, understandable, interoperable, reusable, and managed throughout its lifecycle. This principle makes sure data consumers, both internal and external, derive consistent value from well-designed data products.

A self-serve data platform empowers domains to create, discover, and consume data products independently. It abstracts technical complexities and provides user-friendly tools, enabling a scalable, repeatable, and automated approach to producing high-quality data products.

Under the federated mesh architecture, each divisional mesh functions as a node within the broader enterprise data mesh, maintaining a degree of autonomy in managing its data products. To effectively coordinate these autonomous nodes and facilitate seamless integration, enterprise-wide standards, such as those related to data governance, interoperability, and security, are essential to maintain alignment and consistency across all nodes and the domains and teams within them.

With this approach, each node in ANZ maintains its divisional alignment and adherence to data risk and governance standards and policies to manage local data products and data assets. This enables global discoverability and collaboration without centralizing ownership or operations.

As a result, governance resides with the data products themselves, making sure standards and policies, such as access control, data quality, and compliance, are enforced where the data lives. In this regard, the enterprise data product catalog acts as a federated portal, facilitating cross-domain access and interoperability while maintaining alignment with governance principles. This model balances node or domain-level autonomy with enterprise-level oversight, creating a scalable and consistent framework across ANZ.

Within the ANZ enterprise data mesh strategy, aligning data mesh nodes with the ANZ Group’s divisional structure provides optimal alignment between data mesh principles and organizational structure, as shown in the following diagram.

Central to the success of this strategy is its support for each division’s autonomy and freedom to choose its own domain structure, closely aligned to its business needs. Divisions decide how many domains to have within their node; some may have one, others many. These nodes can implement analytical platforms like data lakehouses, data warehouses, or data marts, all united by producing data products. Nodes and domains serve business needs and are not technology mandated.

Under the federated computational governance model, the ANZ Group strategy defines guardrails that treat a node as a logical data container suitable for the following:

Ingestion and metadata management
Creating source-aligned data products complying with ANZ’s Data Product Specification (DPS)
Integrating source-aligned data products from other nodes
Producing consumer-aligned data products for specific business purposes
Publishing conforming data products to ANZ’s Data Product Catalog (DPC)

Following on from this strategy, each division organizes its domain structure to provide autonomy to various functional teams while preserving the core values of data mesh. The following diagram depicts an example of the possible structure.

For instance, Domain A will have the flexibility to create data products that can be published to the divisional catalog, while also maintaining the autonomy to develop data products that are exclusively accessible to teams within the domain. These products will not be available to others until they are deemed ready for broader enterprise use.

This strategy supports each division’s autonomy to implement their own data catalogs and decide which data products to publish to the group-level catalog. This flexibility extends to divisional domains, which can choose which data products to publish to the divisional catalog or keep visible only to domain consumers.
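
The publishing choices described above can be pictured as a small visibility model. The following is a minimal Python sketch, not ANZ’s implementation: the `Visibility` levels, the `DataProduct` and `Consumer` classes, the `can_discover` function, and the sample names are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Visibility(IntEnum):
    """Hypothetical publication scopes, ordered from narrowest to widest."""
    DOMAIN = 1    # visible only to consumers inside the owning domain
    DIVISION = 2  # published to the divisional catalog
    GROUP = 3     # published to the group-level catalog


@dataclass(frozen=True)
class DataProduct:
    name: str
    division: str
    domain: str
    visibility: Visibility


@dataclass(frozen=True)
class Consumer:
    division: str
    domain: str


def can_discover(product: DataProduct, consumer: Consumer) -> bool:
    """Return True if the consumer would see the product in a catalog."""
    if product.visibility == Visibility.GROUP:
        return True
    if product.visibility == Visibility.DIVISION:
        return consumer.division == product.division
    # DOMAIN scope: the consumer must sit in the same division and domain.
    return (consumer.division, consumer.domain) == (product.division, product.domain)


# Illustrative data: a product that Domain A has not yet released more widely.
draft = DataProduct("trade-positions", "Institutional", "Domain A", Visibility.DOMAIN)
same_domain = Consumer("Institutional", "Domain A")
other_domain = Consumer("Institutional", "Domain B")
```

In this sketch, promoting a product is a change to one field: moving `draft` from `Visibility.DOMAIN` to `Visibility.DIVISION` would make it discoverable to `other_domain` without changing who owns it.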

Institutional Data & AI Platform architecture

The Institutional Division has implemented a self-service data platform to enable the domain teams to build and manage data products autonomously. The Institutional Data & AI platform adopts a federated approach to data while centralizing the metadata to facilitate simpler discovery and sharing of data products. The following diagram illustrates the building blocks of the Institutional Data & AI Platform.

The building blocks are as follows:

Foundational Data & AI Platform capabilities – A dedicated data platform team provides domain-agnostic tools, systems, and capabilities to enable autonomous data product development across domains. This self-serve infrastructure allows domain teams to manage the full data lifecycle without relying on a centralized data team. Key capabilities include data storage, data onboarding and transformation, and data utilities that facilitate data sharing with interoperability between domains. These capabilities abstract the technical complexities associated with data management infrastructure, allowing domain experts to focus on creating valuable data products rather than infrastructure management.
Domain-owned data assets – The domain-oriented data ownership approach distributes responsibility for data across the business units within the Institutional Division. Domain teams are responsible for developing, deploying, and managing their own analytical data products alongside operational data services. Data contracts authored by data product owners automate data product creation and provide a standard to access data products. By treating the data as a product, the outcome is a reusable asset that outlives a project and meets the needs of the enterprise consumer. Consumer feedback and demand drive creation and maintenance of the data product.
Division-level metadata management and data governance – A centrally hosted service provides domain teams with the capability to publish their data products along with relevant metadata, like business definitions and lineage. Some of the key features implemented are:
1. Metadata management that centralizes metadata and presents it within the context of data products, such as data quality scores and data product lineage.
2. A data portal for consumers to discover data products and access associated metadata.
3. Subscription workflows that simplify access management to the data products.
4. Computational governance that enforces divisional and enterprise data policies and standards, such as data classification and business data models for aligning terminology.

The following diagram is a high-level example of the technical architecture approach towards the Institutional Data & AI Platform. The solution uses a building block approach, on a cloud-centered platform comprised of AWS services, with partner solutions and open standards like OpenLineage and Apache Iceberg.

Let’s look at the key services that enable the federated platform to operate at scale:

Data storage and processing:
- Apache Iceberg on Amazon Simple Storage Service (Amazon S3) offers an optimized way to store data assets and products and promotes interoperability across other services
- Amazon Redshift allows domain teams to create and manage fit-for-purpose data marts
- AWS Lambda and AWS Glue are used for data onboarding and processing, and data utilities created in Python and PySpark promote reusability and quality across the data processing pipelines
- dbt simplifies data transformation rules and allows sub-domain data analysts to build modeling logic as SQL statements
- Amazon Managed Workflows for Apache Airflow (Amazon MWAA) enables efficient management of workflows and data pipeline orchestration using out-of-the-box integrations with AWS services
Metadata management and data governance:
- To maintain data reliability and accuracy, a robust data quality framework built on Soda Core automates data quality checks defined in a data contract
- Amazon DataZone enables data product cataloging, discovery, metadata management, and implementing computational governance
- OpenLineage simplifies harvesting and collection of data and process-level lineage, which are then published to Amazon DataZone
- AWS Lake Formation, combined with AWS Glue Data Catalog, provides data governance and access management to data products that reside within sub-domains
Analytics:
- Tableau provides sub-domains with data visualization and business intelligence capabilities
Observability and security:
- Observability needs of the platform are built into all the processes using monitoring, with logging functionality provided by Amazon CloudWatch and AWS CloudTrail
- AWS Secrets Manager makes sure secrets are stored and made available for data pipelines to access services in a secure manner
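
The data quality item above relies on checks defined in a data contract. As a rough illustration of that idea, the following Python sketch models a contract as a list of required columns plus named row-level checks and counts failures per check. It does not use the Soda Core API and does not reflect ANZ’s Data Product Specification; `Check`, `DataContract`, `run_checks`, and the `customer_payments` example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class Check:
    """One named data quality rule applied to every row."""
    name: str
    passes: Callable[[Row], bool]


@dataclass
class DataContract:
    """Hypothetical contract: a product name, required columns, and checks."""
    product: str
    required_columns: list[str]
    checks: list[Check] = field(default_factory=list)


def run_checks(contract: DataContract, rows: list[Row]) -> dict[str, int]:
    """Count failing rows per check; a missing required value counts as a failure."""
    failures = {f"has_{col}": 0 for col in contract.required_columns}
    failures.update({check.name: 0 for check in contract.checks})
    for row in rows:
        for col in contract.required_columns:
            if row.get(col) is None:
                failures[f"has_{col}"] += 1
        for check in contract.checks:
            if not check.passes(row):
                failures[check.name] += 1
    return failures


contract = DataContract(
    product="customer_payments",  # illustrative product name
    required_columns=["payment_id", "amount"],
    checks=[Check("amount_non_negative", lambda r: (r.get("amount") or 0) >= 0)],
)
rows = [
    {"payment_id": "p1", "amount": 120.0},
    {"payment_id": "p2", "amount": -5.0},   # fails amount_non_negative
    {"payment_id": None, "amount": 40.0},   # fails has_payment_id
]
report = run_checks(contract, rows)
```

Keeping the checks inside the contract means the producer and the consumer share one definition of "good data", and a pipeline can use the failure counts to decide whether to publish the product.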

The technical implementation actualizes the data product strategy at ANZ Institutional Division. Amazon DataZone plays an essential role in facilitating data product management for the domain teams. The service addresses several critical aspects of the Institutional Division’s data product strategy, including:

Data cataloging and metadata management – Amazon DataZone provides comprehensive data cataloging and metadata management capabilities
Data governance and compliance – Effective data governance is essential for scaling data products
Self-service capabilities – Amazon DataZone empowers domain teams with self-service capabilities, enabling them to create, manage, and deploy data products independently
Integration and interoperability – One of the challenges in scaling data products is providing seamless integration across various data sources and systems
Collaboration and sharing – Amazon DataZone provides a platform for sharing data and metadata across teams and domains

Institutional Division’s delivery model to achieve scale

The Institutional Division has successfully used the federated architecture, and key to this delivery model is the implementation of Foundational Data & AI Platform capabilities that serve all domains within the division. This model promotes self-service and accelerates the delivery of subsequent initiatives by using the capabilities built for previous use cases.

To evaluate the success of the delivery model, ANZ has implemented key metrics, such as cost transparency and domain adoption, to guide the data mesh governance team in refining the delivery approach. For instance, one enhancement involves integrating cross-functional squads to support data literacy.

The following considerations are key to scaling the Institutional Division operating model:

Data as a product approach – Use techniques like event storming and domain-driven design to capture business events and their meanings.
Education and enablement – Conduct learning interventions to upskill teams on understanding and using the data as a product approach.
Iterative data platform delivery – Work backward from business initiative to iteratively deliver self-service data platform infrastructure capabilities.
Managing demand efficiently – Implement a feedback mechanism to manage demand on data products. Track and manage data debt using standard data contract specifications. Most importantly, adopt governance and standards to make sure data products are built and maintained with a long-term perspective, minimizing technical debt.

“The Institutional Data & Analytics Platform (IDAP) has allowed the Institutional team to establish a base foundation to allow various teams to aggregate and consume the wealth of data across the division. This self-service platform enables business leaders to both create and consume reusable data products, unlocking value across this division. It’s also an excellent proof point for our broader data mesh architecture, allowing us to connect this divisional data to broader enterprise data stores—further positioning us to put the customer at the center of everything we do.”

– Tim Hogarth, CTO ANZ

“AWS believes that democratizing data, while not compromising on security and fine-grained access, is a key component of any future-proof, scalable data platform, so we are pleased to be enabling ANZ bank’s IDAP metadata management and data governance capabilities through Amazon DataZone. This allows the diverse business functions at ANZ the autonomy to self-serve on their data needs with built-in governance.”

– Shikha Verma, Head of Product, Amazon DataZone

Conclusion

ANZ’s journey towards a data product approach has improved how the organization manages data and reduces data silos, and has positioned it to become a data-driven, customer-centric organization. By combining federated platform practices and adopting AWS services and open standards, ANZ Institutional Division is achieving its objectives in decentralization with a scalable data platform that enables its domain teams to make informed decisions, drive innovation, and maintain a competitive edge.

Special thanks: This implementation success is a result of close collaboration between ANZ Institutional Division, AWS ProServe, and the AWS account team. We want to thank ANZ Institutional Executives and the Leadership Team for the strong sponsorship and direction.

About the Authors

Leo Ramsamy is a Platform Architect specializing in data and analytics for ANZ’s Institutional division. He focuses on modern data practices, including Data Mesh architecture, data governance, quality management, and observability. His work aligns data strategies with business goals, improving accessibility and enabling better decision-making across ANZ.

Srinivasan Kuppusamy is a Senior Cloud Architect – Data at AWS ProServe, where he helps customers solve their business problems using the power of AWS Cloud technology. His areas of interests are data and analytics, data governance, and AI/ML.

Rada Stanic is a Chief Technologist at Amazon Web Services, where she helps ANZ customers across different segments solve their business problems using AWS Cloud technologies. Her special areas of interest are data analytics, machine learning/AI, and application modernization.

Solve complex problems with new scenario analysis capability in Amazon Q in QuickSight

2024-12-04 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/solve-complex-problems-with-new-scenario-analysis-capability-in-amazon-q-in-quicksight/

Today, we announced a new capability of Amazon Q in QuickSight that helps users perform scenario analyses to find answers to complex problems quickly. This AI-assisted data analysis experience guides business users step by step through in-depth data analysis using natural language prompts: suggesting analytical approaches, automatically analyzing data, and summarizing findings with suggested actions. This new capability eliminates hours of tedious and error-prone manual work traditionally required to perform analyses using spreadsheets or other alternatives. In fact, Amazon Q in QuickSight enables business users to perform complex scenario analysis up to 10x faster than spreadsheets. This capability expands upon existing data Q&A capabilities of Amazon QuickSight so business professionals can start their analysis by simply asking a question.

How it works
Business users are often faced with complex questions that have traditionally required specialized training and days or weeks of time analyzing data in spreadsheets or other tools to address. For example, let’s say you’re a franchisee with multiple locations to manage. You might use this new capability in Amazon Q in QuickSight to ask, “How can I help our new Chicago store perform as well as the flagship store in New York?” Using an agentic approach, Amazon Q would then suggest analytical approaches needed to address the underlying business goal, automatically analyze data, and present results complete with visualizations and suggested actions. You can conduct this multistep analysis in an expansive analysis canvas, giving you the flexibility to make changes, explore multiple analysis paths simultaneously, and adapt to situations over time.

This new analysis experience is part of Amazon QuickSight, meaning it can read from QuickSight dashboards, which connect to sources such as Amazon Athena, Amazon Aurora, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon OpenSearch Service. Specifically, this new experience is part of Amazon Q in QuickSight, which allows it to seamlessly integrate with other generative business intelligence (BI) capabilities such as data Q&A. You can also upload either a .csv or a single-table, single-sheet .xlsx file to incorporate into your analysis.

Here’s a visual walkthrough of this new analysis experience in Amazon Q in QuickSight.

I’m planning a customer event, and I’ve received an Excel spreadsheet of all who’ve registered to attend the event. I want to learn more about the attendees, so I analyze the spreadsheet and ask a few questions. I start by describing what I want to explore.

I upload the spreadsheet to start my analysis. Firstly, I want to understand how many people have registered for the event.

To design an agenda that’s suitable for the audience, I want to understand the various roles that will be attending. I select the + icon to add a new block for asking a question that follows the thread from the previous block.

I can continue to ask more questions. However, there are suggested questions for analyzing my data even further, and I now select one of these suggested questions. I want to increase marketing efforts at companies that don’t currently have a lot of attendees; in this case, companies with fewer than two attendees.

Amazon Q executes the required analysis and keeps me updated on the progress. Step 1 of the process identifies companies that have fewer than two attendees and lists them.

Step 2 gives an estimate of how many more attendees I might get from each company if marketing efforts are increased.

In Step 3 I can see the potential increase in total attendees (including the percentage increase) in line with the increase in marketing efforts.

Lastly, Step 4 goes even further to highlight companies I should prioritize for these increased marketing efforts.

To increase the potential number of attendees even more, I want to change the analysis to identify companies with fewer than three attendees instead of two. I choose the AI sparkle icon in the upper right to launch a modal that I then use to provide more context and make specific changes to the previous result.

This change resulted in new projections, and I can choose to consider them for my marketing efforts or keep to the previous projections.
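
For readers who want to see the shape of the four steps in this walkthrough, here is a minimal Python sketch of the same kind of analysis done by hand. It does not show how Amazon Q performs the analysis: the registration data, the `project` function, and especially the fixed uplift of two extra attendees per targeted company are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical registration list: one company name per registered attendee.
registrations = ["Acme", "Acme", "Acme", "Birch", "Cedar", "Cedar", "Dune"]

# Assumed uplift: extra attendees expected per targeted company (illustrative only).
ASSUMED_EXTRA_PER_COMPANY = 2


def project(registrations, threshold, extra_per_company=ASSUMED_EXTRA_PER_COMPANY):
    counts = Counter(registrations)
    # Step 1: companies with fewer than `threshold` attendees.
    targets = sorted(c for c, n in counts.items() if n < threshold)
    # Step 2: estimated additional attendees per targeted company.
    extra = {c: extra_per_company for c in targets}
    # Step 3: projected total and percentage increase.
    current = len(registrations)
    projected = current + sum(extra.values())
    pct_increase = 100 * (projected - current) / current
    # Step 4: prioritize companies with the fewest current attendees.
    priority = sorted(targets, key=lambda c: (counts[c], c))
    return {
        "targets": targets,
        "projected_total": projected,
        "pct_increase": round(pct_increase, 1),
        "priority": priority,
    }


fewer_than_two = project(registrations, threshold=2)
fewer_than_three = project(registrations, threshold=3)
```

Comparing `fewer_than_two` with `fewer_than_three` mirrors the change made in the walkthrough: raising the threshold widens the set of targeted companies and therefore the projected increase, and both projections remain available to choose between.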

Now available
Amazon Q in QuickSight Pro users can use this new capability in preview in the following AWS Regions at launch: US East (N. Virginia) and US West (Oregon). Get started with a free 30-day trial of QuickSight today. To learn more, visit the Amazon QuickSight User Guide. You can submit your questions to AWS re:Post for Amazon QuickSight, or through your usual AWS Support contacts.

– Veliswa.

Introducing AWS Glue Data Catalog automation for table statistics collection for improved query performance on Amazon Redshift and Amazon Athena

2024-12-04 Sotaro Hikita

Post Syndicated from Sotaro Hikita original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-data-catalog-automation-for-table-statistics-collection-for-improved-query-performance-on-amazon-redshift-and-amazon-athena/

The AWS Glue Data Catalog now automates generating statistics for new tables. These statistics are integrated with the cost-based optimizer (CBO) from Amazon Redshift Spectrum and Amazon Athena, resulting in improved query performance and potential cost savings.

Queries on large datasets often read extensive amounts of data and perform complex join operations across multiple datasets. When a query engine like Redshift Spectrum or Athena processes the query, the CBO uses table statistics to optimize it. For example, if the CBO knows the number of distinct values in a table column, it can choose the optimal join order and strategy. These statistics must be collected beforehand and should be kept up to date to reflect the latest data state.
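
To make the role of distinct-value counts concrete, the following Python sketch applies a common textbook estimate for equi-join size, rows(L) × rows(R) / max(NDV(L), NDV(R)), and picks the join that produces the smaller intermediate result. This is a simplification for illustration and not a description of how the Redshift Spectrum or Athena optimizers work internally; the `ColumnStats` class, the table names, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    row_count: int  # rows in the table
    ndv: int        # number of distinct values in the join column


def estimate_join_rows(left: ColumnStats, right: ColumnStats) -> float:
    """Textbook equi-join estimate: |L| * |R| / max(ndv_L, ndv_R)."""
    return left.row_count * right.row_count / max(left.ndv, right.ndv)


# Hypothetical statistics for one fact table and two smaller tables.
orders_customer_id = ColumnStats(row_count=1_000_000, ndv=50_000)
customers_id = ColumnStats(row_count=50_000, ndv=50_000)
orders_region = ColumnStats(row_count=1_000_000, ndv=10)
vip_regions_region = ColumnStats(row_count=2, ndv=2)

candidates = {
    "orders JOIN customers": estimate_join_rows(orders_customer_id, customers_id),
    "orders JOIN vip_regions": estimate_join_rows(orders_region, vip_regions_region),
}

# Joining first on the pair with the smallest estimate keeps intermediate data small.
first_join = min(candidates, key=candidates.get)
```

Without the NDV values, both joins look the same to the planner (a million-row table joined to a smaller one); with them, the estimate shows that joining to `vip_regions` first cuts the intermediate result to a fraction of the size.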

The Data Catalog already supported collecting table statistics used by the CBO for Redshift Spectrum and Athena for tables in Parquet, ORC, JSON, ION, CSV, and XML formats. We introduced this feature and its performance benefits in Enhance query performance using AWS Glue Data Catalog column-level statistics. The Data Catalog also supported Apache Iceberg tables, which we covered in detail in Accelerate query performance with Apache Iceberg statistics on the AWS Glue Data Catalog.

Previously, creating statistics for Iceberg tables in the Data Catalog required you to continuously monitor and update configurations for your tables. This meant undifferentiated heavy lifting, including the following:

Discover new tables with specific data table formats (such as Parquet, JSON, CSV, XML, ORC, ION) and specific transactional data table formats such as Iceberg and their individual bucket paths
Determine and set up compute tasks based on scan strategy (sampling percentage and schedules)
Configure AWS Identity and Access Management (IAM) and AWS Lake Formation roles for specific tasks to provide specific Amazon Simple Storage Service (Amazon S3) access, Amazon CloudWatch logs, AWS Key Management Service (AWS KMS) keys for CloudWatch encryption, and trust policies
Set up event notification systems to understand changes in data lakes
Set up specific optimizer configuration-based query performance and storage improvement strategies
Set up a scheduler or build your own event-based compute tasks with setup and teardown

Now, the Data Catalog lets you generate statistics automatically for updated and created tables with a one-time catalog configuration. You can get started by selecting the default catalog on the Lake Formation console and enabling table statistics on the table optimization configuration tab. As new tables are created, the number of distinct values (NDV) is collected for Iceberg tables, and additional statistics such as the number of nulls, maximum, minimum, and average length are collected for other file formats such as Parquet. Redshift Spectrum and Athena can use the updated statistics to optimize queries, using optimizations such as optimal join order or cost-based aggregation pushdown. The AWS Glue console provides you visibility into the updated statistics and statistics generation runs.
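As a rough illustration of what these column statistics represent, the following sketch computes them over an in-memory list of values. It is only an illustration of the definitions, not how AWS Glue computes statistics, and the function name and sample values are hypothetical.

```python
def column_statistics(values):
    """Compute simple column statistics; None represents a null value."""
    non_null = [v for v in values if v is not None]
    stats = {
        "ndv": len(set(non_null)),             # number of distinct values
        "nulls": len(values) - len(non_null),  # number of nulls
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }
    # Average length only makes sense for string columns.
    if non_null and all(isinstance(v, str) for v in non_null):
        stats["avg_length"] = sum(len(v) for v in non_null) / len(non_null)
    return stats


# Hypothetical column values
print(column_statistics(["US", "JP", None, "US", "DE"]))
print(column_statistics([10, 3, 7, None, 3]))
```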

Now, data lake administrators can configure weekly statistics collection across all databases and tables in their catalog. When the automation is enabled, the Data Catalog generates and updates column statistics for all columns in the tables on a weekly basis. This job analyzes 20% of records in the tables to calculate statistics. These statistics can be used by Redshift Spectrum and Athena CBO to optimize queries.

Furthermore, this new feature provides the flexibility to configure automation settings and scheduled collection configurations at the table level. Individual data owners can override catalog-level automation settings based on specific requirements. Data owners can customize settings for individual tables, including whether to enable automation, collection frequency, target columns, and sampling percentage. This flexibility allows administrators to maintain an optimized platform overall, while enabling data owners to fine-tune individual table statistics.

In this post, we discuss how the Data Catalog automates table statistics collection and how you can use it to enhance your data platform’s efficiency.

Enable catalog-level statistics collection

The data lake administrator can enable catalog-level statistics collection on the Lake Formation console. Complete the following steps:

On the Lake Formation console, choose Catalogs in the navigation pane.
Select the catalog that you want to configure, and choose Edit on the Actions menu.

Select Enable automatic statistics generation for the tables of the catalog and choose an IAM role. For the required permissions, see Prerequisites for generating column statistics.
Choose Submit.

You can also enable catalog-level statistics collection through the AWS Command Line Interface (AWS CLI):

aws glue update-catalog --cli-input-json '{
    "name": "123456789012",
    "catalogInput": {
        "description": "Updating root catalog with role arn",
        "catalogProperties": {
            "customProperties": {
                "ColumnStatistics.RoleArn": "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole",
                "ColumnStatistics.Enabled": "true"
            }
        }
    }
}'

The command calls the AWS Glue UpdateCatalog API, which takes in a CatalogProperties structure that expects the following key-value pairs for catalog-level statistics:

ColumnStatistics.RoleArn – The IAM role Amazon Resource Name (ARN) to be used for all jobs triggered for catalog-level statistics
ColumnStatistics.Enabled – A Boolean value indicating whether the catalog-level settings are enabled or disabled

Callers of UpdateCatalog must have UpdateCatalog IAM permissions and be granted ALTER on CATALOG permissions on the root catalog if using Lake Formation permissions. You can call the GetCatalog API to verify the properties that are set to your catalog properties. For the required permissions used by the role passed, see Prerequisites for generating column statistics.
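If you generate this input programmatically, a small helper can assemble the same JSON document and catch a malformed role ARN before the CLI call. This is a minimal sketch: the key names are copied from the CLI example above, while the function name and the ARN pattern are assumptions made for illustration.

```python
import json
import re

# Loose pattern for an IAM role ARN (an assumption, not an official validator).
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def build_catalog_stats_input(account_id, role_arn, enabled=True,
                              description="Updating root catalog with role arn"):
    """Build the JSON document passed to --cli-input-json in the example above."""
    if not ROLE_ARN_PATTERN.match(role_arn):
        raise ValueError(f"Not an IAM role ARN: {role_arn!r}")
    return {
        "name": account_id,
        "catalogInput": {
            "description": description,
            "catalogProperties": {
                "customProperties": {
                    "ColumnStatistics.RoleArn": role_arn,
                    # The property value is a string, not a JSON Boolean.
                    "ColumnStatistics.Enabled": "true" if enabled else "false",
                }
            },
        },
    }


payload = build_catalog_stats_input(
    "123456789012",
    "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole",
)
print(json.dumps(payload, indent=4))
```

The printed JSON can be passed as the --cli-input-json argument of the command shown earlier.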

By following these steps, catalog-level statistics collection is enabled. AWS Glue then automatically updates statistics for all columns in each table, sampling 20% of records on a weekly basis. This allows data lake administrators to effectively manage the data platform’s performance and cost-efficiency.

View automated table-level settings

When catalog-level statistics collection is enabled, an equivalent table-level setting is created each time an Apache Hive table or Iceberg table is created or updated using the AWS Glue CreateTable or UpdateTable APIs, whether through the AWS Glue console, AWS SDK, or AWS Glue crawlers.

Tables with automatic statistics generation enabled must use one of the following formats:

Hive table formats such as Parquet, Avro, ORC, JSON, ION, CSV, and XML
Apache Iceberg table format

After a table has been created or updated, you can confirm that a statistics collection setting has been set by checking the table description on the AWS Glue console. The setting should have the Schedule property set as Auto and Statistics configuration set as Inherited from catalog. AWS Glue internally triggers statistics generation for any table setting configured this way.

The following is an image of a Hive table where catalog-level statistics collection has been applied and statistics have been collected:

The following is an image of an Iceberg table where catalog-level statistics collection has been applied and statistics have been collected:

Configure table-level statistics collection

Data owners can customize statistics collection at the table level to meet specific needs. For frequently updated tables, statistics can be refreshed more often than weekly. You can also specify target columns to focus on those most commonly queried.

Moreover, you can set what percentage of table records to use when calculating statistics. Therefore, you can increase this percentage for tables that need more precise statistics, or decrease it for tables where a smaller sample is sufficient to optimize costs and statistics generation performance.

These table-level settings can override the catalog-level settings previously described.
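The override behavior can be pictured as a simple merge in which any table-level value wins over the catalog-level default. The sketch below only illustrates that precedence. It does not reflect how AWS Glue stores or resolves settings, and the dictionary keys are hypothetical; the default values (weekly, all columns, 20% sample) come from the catalog-level behavior described earlier.

```python
# Catalog-level defaults as described in this post.
CATALOG_DEFAULTS = {
    "enabled": True,
    "schedule": "weekly",
    "columns": "all",
    "sample_percent": 20.0,
}


def effective_settings(catalog_defaults, table_overrides=None):
    """Table-level values take precedence; everything else is inherited."""
    merged = dict(catalog_defaults)
    merged.update(table_overrides or {})
    return merged


# A frequently updated table that needs fresher, more precise statistics.
print(effective_settings(CATALOG_DEFAULTS, {"schedule": "daily", "sample_percent": 100.0}))
# A table with no overrides inherits the catalog-level settings unchanged.
print(effective_settings(CATALOG_DEFAULTS))
```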

To configure table-level statistics collection on the AWS Glue console, complete the following steps:

On the AWS Glue console, choose Databases under Data Catalog in the navigation pane.
Choose a database to view all available tables (for example, optimization_test).
Choose the table to be configured (for example, catalog_returns).
Go to Column statistics and choose Generate on schedule.
In the Schedule section, choose the frequency from Hourly, Daily, Weekly, Monthly and Custom (cron expression). In this example, for Frequency, choose Daily.
For Start time, enter 06:43 in UTC.

For Column options, select All columns.
For IAM role, choose an existing role, or create a new role. For the required permissions, see Prerequisites for generating column statistics.

Under Advanced configuration, for Security configuration, optionally choose your security configuration to enable at-rest encryption on the logs pushed to CloudWatch.
For Sample rows, enter 100 as the percentage of rows to sample.
Choose Generate statistics.

In the table description on the AWS Glue console, you can confirm that a statistics collection job has been scheduled for the specified date and time.

By following these steps, you have configured table-level statistics collection. This allows data owners to manage table statistics based on their specific requirements. Combining this with catalog-level settings by data lake administrators establishes a baseline for optimizing the entire data platform while flexibly addressing individual table requirements.

You can also create a column statistics generation schedule through the AWS CLI:

aws glue create-column-statistics-task-settings \
  --database-name 'database_name' \
  --table-name table_name \
  --role 'arn:aws:iam::123456789012:role/stats-role' \
  --schedule 'cron(8 0-5 14 * * ?)' \
  --column-name-list 'col-1' \
  --catalog-id '123456789012' \
  --sample-size '10.0' \
  --security-configuration 'test-security'

The required parameters are database-name, table-name, and role. You can also include optional parameters such as schedule, column-name-list, catalog-id, sample-size, and security-configuration. For more information, see Generating column statistics on a schedule.
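When you manage schedules for many tables, it can help to assemble the command from a small helper that enforces the required parameters. The following is a minimal sketch that only builds the command string and does not call AWS. The flag names are taken from the CLI example above; the function name, the sample values, and the six-field cron expression for 06:43 UTC daily are assumptions.

```python
import shlex


def build_task_settings_command(database_name, table_name, role, *,
                                schedule=None, column_names=None, catalog_id=None,
                                sample_size=None, security_configuration=None):
    """Build an 'aws glue create-column-statistics-task-settings' command line."""
    if not (database_name and table_name and role):
        raise ValueError("database_name, table_name, and role are required")
    args = [
        "aws", "glue", "create-column-statistics-task-settings",
        "--database-name", database_name,
        "--table-name", table_name,
        "--role", role,
    ]
    if schedule:
        args += ["--schedule", schedule]
    if column_names:
        args += ["--column-name-list", *column_names]
    if catalog_id:
        args += ["--catalog-id", catalog_id]
    if sample_size is not None:
        args += ["--sample-size", str(sample_size)]
    if security_configuration:
        args += ["--security-configuration", security_configuration]
    # shlex.join quotes arguments such as the cron expression for the shell.
    return shlex.join(args)


print(build_task_settings_command(
    "optimization_test",
    "catalog_returns",
    "arn:aws:iam::123456789012:role/stats-role",
    schedule="cron(43 6 * * ? *)",  # assumed equivalent of Daily at 06:43 UTC
    sample_size=100.0,
))
```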

Conclusion

This post introduced a new feature in the Data Catalog that enables automated statistics collection at the catalog level with flexible per-table controls. Organizations can effectively manage and maintain up-to-date column-level statistics. By incorporating these statistics, CBO in both Redshift Spectrum and Athena can optimize query processing and cost-efficiency.

Try out this feature for your own use case, and let us know your feedback in the comments.

About the Authors

Sotaro Hikita is an Analytics Solutions Architect. He supports customers across a wide range of industries in building and operating analytics platforms more effectively. He is particularly passionate about big data technologies and open source software.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Kyle Duong is a Senior Software Development Engineer on the AWS Glue and AWS Lake Formation team. He is passionate about building big data technologies and distributed systems.

Amazon SageMaker Lakehouse and Amazon Redshift support zero-ETL integrations from applications

2024-12-03 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/introducing-amazon-sagemaker-lakehouse-support-for-zero-etl-integrations-from-applications/

Today, we announced the general availability of Amazon SageMaker Lakehouse and Amazon Redshift support for zero-ETL integrations from applications. Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines. Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines for common ingestion and replication use cases. With zero-ETL integrations from applications such as Salesforce, SAP, and Zendesk, you can reduce time spent building data pipelines and focus on running unified analytics on all your data in Amazon SageMaker Lakehouse and Amazon Redshift.

As organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge. Valuable information is often scattered across multiple repositories, including databases, applications, and other platforms. To harness the full potential of their data, businesses must enable access and consolidation from these varied sources. In response to this challenge, users build data pipelines to extract and load (EL) from multiple applications into centralized data lakes and data warehouses. Using zero-ETL, you can efficiently replicate valuable data from your customer support, relationship management, and enterprise resource planning (ERP) applications for analytics and AI/ML to data lakes and data warehouses, saving you weeks of engineering effort needed to design, build, and test data pipelines.

Prerequisites

An Amazon SageMaker Lakehouse catalog configured through AWS Glue Data Catalog and AWS Lake Formation.
An AWS Glue database that is configured for Amazon S3 where the data will be stored.
A secret in AWS Secrets Manager to use for the connection to the data source. The credentials must contain the username and password that you use to sign in to your application.
An AWS Identity and Access Management (IAM) role for the Amazon SageMaker Lakehouse or Amazon Redshift job to use. The role must grant access to all resources used by the job, including Amazon S3 and AWS Secrets Manager.
A valid AWS Glue connection to the desired application.

How it works – creating a Glue connection prerequisite
I start by creating a connection using the AWS Glue console. I opt for a Salesforce integration as the data source.

Next, I provide the location of the Salesforce instance to be used for the connection, together with the rest of the required information. Be sure to use the .salesforce.com domain instead of .force.com. Users can choose between two authentication methods: JSON Web Token (JWT), which is obtained through Salesforce access tokens, or OAuth login through the browser.

I review all the information and then choose Create connection.

After I sign in to the Salesforce instance through a pop-up (not shown here), the connection is successfully created.

How it works – creating a zero-ETL integration
Now that I have a connection, I choose zero-ETL integrations from the left navigation panel, then choose Create zero-ETL integration.

First, I choose the source type for my integration – in this case, Salesforce, so I can use my recently created connection.

Next, I select objects from the data source that I want to replicate to the target database in AWS Glue.

While in the process of adding objects, I can quickly preview both data and metadata to confirm that I am selecting the correct object.

By default, zero-ETL integration will synchronize data from the source to the target every 60 minutes. However, you can change this interval to reduce the cost of replication for cases that do not require frequent updates.

I review and then choose Create and launch integration.

The data in the source (Salesforce instance) has now been replicated to the target database salesforcezeroETL in my AWS account. This integration has two phases. Phase 1: initial load ingests all the data for the selected objects and may take from 15 minutes to a few hours, depending on the size of the data in these objects. Phase 2: incremental load detects any changes (such as new records, updated records, or deleted records) and applies them to the target.
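Conceptually, the two phases behave like a full copy followed by a stream of change records applied to the target. The sketch below illustrates that idea with in-memory dictionaries. It is not how the managed integration is implemented, and the record shape (an "Id" key and an "op" field) is a hypothetical choice for the example.

```python
def initial_load(source_rows):
    """Phase 1: copy every selected record into the target, keyed by Id."""
    return {row["Id"]: dict(row) for row in source_rows}


def apply_changes(target, changes):
    """Phase 2: apply inserts, updates, and deletes detected at the source."""
    for change in changes:
        op, row = change["op"], change["row"]
        if op in ("insert", "update"):
            target[row["Id"]] = dict(row)
        elif op == "delete":
            target.pop(row["Id"], None)
        else:
            raise ValueError(f"Unknown operation: {op!r}")
    return target


# Hypothetical Salesforce-like records
target = initial_load([
    {"Id": "001", "Name": "Acme"},
    {"Id": "002", "Name": "Globex"},
])
apply_changes(target, [
    {"op": "update", "row": {"Id": "001", "Name": "Acme Corp"}},
    {"op": "insert", "row": {"Id": "003", "Name": "Initech"}},
    {"op": "delete", "row": {"Id": "002"}},
])
print(target)
```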

Each of the objects that I selected earlier has been stored in its respective table within the database. From here I can view the Table data for each of the objects that have been replicated from the data source.

Lastly, here’s a view of the data in Salesforce. As new entities are created, or existing entities are updated or changed in Salesforce, the data changes will synchronize to the target in AWS Glue automatically.

Now available
Amazon SageMaker Lakehouse and Amazon Redshift support for zero-ETL integrations from applications is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Hong Kong), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) AWS Regions. For pricing information, visit the AWS Glue pricing page.

To learn more, visit our AWS Glue User Guide. Send feedback to AWS re:Post for AWS Glue or through your usual AWS Support contacts. Get started by creating a new zero-ETL integration today.

– Veliswa

Simplify analytics and AI/ML with new Amazon SageMaker Lakehouse

2024-12-03 Esra Kayabali

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/simplify-analytics-and-aiml-with-new-amazon-sagemaker-lakehouse/

Today, I’m very excited to announce the general availability of Amazon SageMaker Lakehouse, a capability that unifies data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse is a part of the next generation of Amazon SageMaker, which is a unified platform for data, analytics and AI, that brings together widely-adopted AWS machine learning and analytics capabilities and delivers an integrated experience for analytics and AI.

Customers want to do more with data. To move faster with their analytics journey, they are picking the right storage and databases to store their data. The data is spread across data lakes, data warehouses, and different applications, creating data silos that make it difficult to access and utilize. This fragmentation leads to duplicate data copies and complex data pipelines, which in turn increases costs for the organization. Furthermore, customers are constrained to use specific query engines and tools, because how and where the data is stored limits their options. This restriction hinders their ability to work with the data as they would prefer. Lastly, the inconsistent data access makes it challenging for customers to make informed business decisions.

SageMaker Lakehouse addresses these challenges by helping you to unify data across Amazon S3 data lakes and Amazon Redshift data warehouses. It offers you the flexibility to access and query data in-place with all engines and tools compatible with Apache Iceberg. With SageMaker Lakehouse, you can define fine-grained permissions centrally and enforce them across multiple AWS services, simplifying data sharing and collaboration. Bringing data into your SageMaker Lakehouse is easy. In addition to seamlessly accessing data from your existing data lakes and data warehouses, you can use zero-ETL from operational databases such as Amazon Aurora, Amazon RDS for MySQL, Amazon DynamoDB, as well as applications such as Salesforce and SAP. SageMaker Lakehouse fits into your existing environments.

Get started with SageMaker Lakehouse
For this demonstration, I use a preconfigured environment that has multiple AWS data sources. I go to the Amazon SageMaker Unified Studio (preview) console, which provides an integrated development experience for all your data and AI. Using Unified Studio, you can seamlessly access and query data from various sources through SageMaker Lakehouse, while using familiar AWS tools for analytics and AI/ML.

This is where you can create and manage projects, which serve as shared workspaces. These projects allow team members to collaborate, work with data, and develop AI models together. Creating a project automatically sets up AWS Glue Data Catalog databases, establishes a catalog for Redshift Managed Storage (RMS) data, and provisions necessary permissions. You can get started by creating a new project or continue with an existing project.

To create a new project, I choose Create project.

I have two project profile options to build a lakehouse and interact with it. The first is Data analytics and AI-ML model development, where you can analyze data and build ML and generative AI models powered by Amazon EMR, AWS Glue, Amazon Athena, Amazon SageMaker AI, and SageMaker Lakehouse. The second is SQL analytics, where you can analyze your data in SageMaker Lakehouse using SQL. For this demo, I proceed with SQL analytics.

I enter a project name in the Project name field and choose SQL analytics under Project profile. I choose Continue.

I enter the values for all the parameters under Tooling. I enter the values to create my Lakehouse databases. I enter the values to create my Redshift Serverless resources. Finally, I enter a name for my catalog under Lakehouse Catalog.

On the next step, I review the resources and choose Create project.

After the project is created, I observe the project details.

I go to Data in the navigation pane and choose the + (plus) sign to Add data. I choose Create catalog to create a new catalog and choose Add data.

After the RMS catalog is created, I choose Build from the navigation pane and then choose Query Editor under Data Analysis & Integration to create a schema under the RMS catalog, create a table, and then load the table with sample sales data.

After entering the SQL queries into the designated cells, I choose Select data source from the right dropdown menu to establish a database connection to Amazon Redshift data warehouse. This connection allows me to execute the queries and retrieve the desired data from the database.

Once the database connection is successfully established, I choose Run all to execute all queries and monitor the execution progress until all results are displayed.

For this demonstration, I use two additional pre-configured catalogs. A catalog is a container that organizes your lakehouse object definitions such as schema and tables. The first is an Amazon S3 data lake catalog (test-s3-catalog) that stores customer records, containing detailed transactional and demographic information. The second is a lakehouse catalog (churn_lakehouse) dedicated to storing and managing customer churn data. This integration creates a unified environment where I can analyze customer behavior alongside churn predictions.

From the navigation pane, I choose Data and locate my catalogs under the Lakehouse section. SageMaker Lakehouse offers multiple analysis options, including Query with Athena, Query with Redshift, and Open in Jupyter Lab notebook.

Note that you need to choose the Data analytics and AI-ML model development profile when you create a project if you want to use the Open in Jupyter Lab notebook option. If you choose Open in Jupyter Lab notebook, you can interact with SageMaker Lakehouse using Apache Spark via EMR 7.5.0 or AWS Glue 5.0 by configuring the Iceberg REST catalog, enabling you to process data across your data lakes and data warehouses in a unified manner.
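For orientation, configuring an Iceberg REST catalog in Spark comes down to a handful of catalog properties. The sketch below only builds that property dictionary and does not start a Spark session. The property names follow Apache Iceberg's Spark and REST catalog conventions, while the endpoint format, the use of the account ID as the warehouse, and the catalog name are assumptions that you should verify against the current AWS Glue documentation.

```python
def iceberg_rest_catalog_conf(catalog_name, region, account_id):
    """Return Spark properties for an Iceberg REST catalog (values are assumptions)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        # Assumed AWS Glue Iceberg REST endpoint format.
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        # Assumed: the AWS account ID identifies the default catalog.
        f"{prefix}.warehouse": account_id,
        # Sign REST requests with SigV4 for the Glue service.
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "glue",
        f"{prefix}.rest.signing-region": region,
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }


# Hypothetical catalog name, Region, and account ID
conf = iceberg_rest_catalog_conf("lakehouse", "us-east-1", "123456789012")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

In a notebook, each key-value pair would be passed to the Spark session builder with its config method before the session is created.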

Here’s what querying with a Jupyter Lab notebook looks like:

I continue by choosing Query with Athena. With this option, I can use the serverless query capability of Amazon Athena to analyze the sales data directly within SageMaker Lakehouse. Upon selecting Query with Athena, the Query Editor launches automatically, providing a workspace where I can compose and execute SQL queries against the lakehouse. This integrated query environment offers a seamless experience for data exploration and analysis, complete with syntax highlighting and auto-completion features to enhance productivity.

I can also use the Query with Redshift option to run SQL queries against the lakehouse.

SageMaker Lakehouse offers a comprehensive solution for modern data management and analytics. By unifying access to data across multiple sources, supporting a wide range of analytics and ML engines, and providing fine-grained access controls, SageMaker Lakehouse helps you make the most of your data assets. Whether you’re working with data lakes in Amazon S3, data warehouses in Amazon Redshift, or operational databases and applications, SageMaker Lakehouse provides the flexibility and security you need to drive innovation and make data-driven decisions. You can use hundreds of connectors to integrate data from various sources. Additionally, you can access and query data in-place with federated query capabilities across third-party data sources.

Now available
You can access SageMaker Lakehouse through the AWS Management Console, APIs, AWS Command Line Interface (AWS CLI), or AWS SDKs. You can also access it through AWS Glue Data Catalog and AWS Lake Formation. SageMaker Lakehouse is available in US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), Europe (Frankfurt), Europe (Stockholm), Asia Pacific (Sydney), Asia Pacific (Hong Kong), Asia Pacific (Tokyo), and Asia Pacific (Singapore) AWS Regions.

For pricing information, visit the Amazon SageMaker Lakehouse pricing page.

For more information on Amazon SageMaker Lakehouse and how it can simplify your data analytics and AI/ML workflows, visit the Amazon SageMaker Lakehouse documentation.

— Esra

New Amazon DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse

2024-12-03 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/new-amazon-dynamodb-zero-etl-integration-with-amazon-sagemaker-lakehouse/

Amazon DynamoDB, a serverless NoSQL database, has been a go-to solution for over one million customers to build low-latency and high-scale applications. As data grows, organizations are constantly seeking ways to extract valuable insights from operational data, which is often stored in DynamoDB. However, to make the most of this data in Amazon DynamoDB for analytics and machine learning (ML) use cases, customers often build custom data pipelines—a time-consuming infrastructure task that adds little unique value to their core business.

Starting today, you can use Amazon DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse to run analytics and ML workloads in just a few clicks without consuming your DynamoDB table capacity. Amazon SageMaker Lakehouse unifies all your data across Amazon S3 data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data.

Zero-ETL is a set of integrations that eliminates or minimizes the need to build ETL data pipelines. This zero-ETL integration reduces the complexity of engineering efforts required to build and maintain data pipelines, benefiting users running analytics and ML workloads on operational data in Amazon DynamoDB without impacting production workflows.

Let’s get started
For the following demo, I need to set up zero-ETL integration for my data in Amazon DynamoDB with an Amazon Simple Storage Service (Amazon S3) data lake managed by Amazon SageMaker Lakehouse. Before setting up the zero-ETL integration, there are prerequisites to complete. To learn more about how to set them up, refer to this Amazon DynamoDB documentation page.

With all the prerequisites completed, I can get started with this integration. I navigate to the AWS Glue console and select Zero-ETL integrations under Data Integration and ETL. Then, I choose Create zero-ETL integration.

Here, I have options to select my data source. I choose Amazon DynamoDB and choose Next.

Next, I need to configure the source and target details. In the Source details section, I select my Amazon DynamoDB table. In the Target details section, I specify the S3 bucket that I’ve set up in the AWS Glue Data Catalog.

To set up this integration, I need an IAM role that grants AWS Glue the necessary permissions. For guidance on configuring IAM permissions, visit the Amazon DynamoDB documentation page. Also, if I haven’t configured a resource policy for my AWS Glue Data Catalog, I can select Fix it for me to automatically add the required resource policies.

Here, I have options to configure the output. Under Data partitioning, I can either use DynamoDB table keys for partitioning or specify custom partition keys. After completing the configuration, I choose Next.

Because I selected the Fix it for me checkbox, I need to review the required changes and choose Continue before I can proceed to the next step.

On the next page, I have the flexibility to configure data encryption. I can use AWS Key Management Service (AWS KMS) or a custom encryption key. Then, I assign a name to the integration and choose Next.

On the last step, I need to review the configurations. When I’m happy, I choose Next to create the zero-ETL integration.

After the initial data ingestion completes, my zero-ETL integration will be ready for use. The completion time varies depending on the size of my source DynamoDB table.

If I navigate to Tables under Data Catalog in the left navigation panel, I can observe more details including Schema. Under the hood, this zero-ETL integration uses Apache Iceberg to transform the format and structure of my DynamoDB data as it is written to Amazon S3.

Lastly, I can tell that all my data is available in my S3 bucket.

This zero-ETL integration significantly reduces the complexity and operational burden of data movement, and I can therefore focus on extracting insights rather than managing pipelines.

Available now
This new zero-ETL capability is available in the following AWS Regions: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Hong Kong, Singapore, Sydney, Tokyo), Europe (Frankfurt, Ireland, Stockholm).

Explore how to streamline your data analytics workflows using Amazon DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse. Learn more about how to get started on the Amazon DynamoDB documentation page.

Happy building!
— Donnie

Announcing the general availability of data lineage in the next generation of Amazon SageMaker and Amazon DataZone

2024-12-03 Esra Kayabali

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/announcing-the-general-availability-of-data-lineage-in-the-next-generation-of-amazon-sagemaker-and-amazon-datazone/

Today, I’m happy to announce the general availability of data lineage in Amazon DataZone, following its preview release in June 2024. This feature is also extended as part of the catalog capabilities in the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI.

Traditionally, business analysts have relied on manual documentation or personal connections to validate data origins, leading to inconsistent and time-consuming processes. Data engineers have struggled to evaluate the impact of changes to data assets, especially as self-service analytics adoption increases. Additionally, data governance teams have faced difficulties in enforcing practices and responding to auditor queries about data movement.

Data lineage in Amazon DataZone addresses the challenges faced by organizations striving to remain competitive by using their data for strategic analysis. It enhances data trust and validation by providing a visual, traceable history of data assets, enabling business analysts to quickly understand data origins without manual research. For data engineers, it facilitates impact analysis and troubleshooting by clearly showing relationships between assets and allowing easy tracing of data flows.

The feature supports data governance and compliance efforts by offering a comprehensive view of data movement, helping governance teams to quickly respond to compliance queries and enforce data policies. It improves data discovery and understanding, helping consumers grasp the context and relevance of data assets more efficiently. Additionally, data lineage contributes to better change management, increased data literacy, reduced data duplication, and enhanced cross-team collaboration. By tackling these challenges, data lineage in Amazon DataZone helps organizations build a more trustworthy, efficient, and compliant data ecosystem, ultimately enabling more effective data-driven decision-making.

Automated lineage capture is a key feature of the data lineage in Amazon DataZone, which focuses on automatically collecting and mapping lineage information from AWS Glue and Amazon Redshift. This automation significantly reduces the manual effort required to maintain accurate and up-to-date lineage information.

Get started with data lineage in Amazon DataZone
Data producers and domain administrators get started by setting up the data source run jobs for the AWS Glue Data Catalog and Amazon Redshift sources to Amazon DataZone to periodically collect metadata from the source catalog. Additionally, data producers can hydrate the lineage information programmatically by creating custom lineage nodes using APIs that accept OpenLineage-compatible events from existing pipeline components (such as schedulers, warehouses, analysis tools, and SQL engines) to send data about datasets, jobs, and runs directly to the Amazon DataZone API endpoint. With the information being sent, Amazon DataZone starts populating the lineage model and maps it to the assets already cataloged. As new lineage events are captured, Amazon DataZone maintains versions of events that were already captured, so users can navigate to previous versions if needed.
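To give a sense of what an OpenLineage-compatible event carries, the following sketch builds a minimal run event describing one job with its input and output datasets. The field names follow the OpenLineage RunEvent structure; the job, dataset, and producer values are hypothetical, the schema version in schemaURL is an assumption, and sending the event to Amazon DataZone is intentionally left out.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_namespace, job_name, inputs, outputs,
                    event_type="COMPLETE", run_id=None, event_time=None):
    """Build a minimal OpenLineage-style run event as a plain dictionary."""
    return {
        "eventType": event_type,
        "eventTime": (event_time or datetime.now(timezone.utc)).isoformat(),
        "producer": "https://example.com/my-pipeline",  # hypothetical producer URI
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        # Each dataset is identified by a (namespace, name) pair.
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


event = build_run_event(
    "my-scheduler",
    "nightly_sales_etl",
    inputs=[("s3://raw-bucket", "sales/orders")],
    outputs=[("s3://curated-bucket", "sales/orders_daily")],
)
print(json.dumps(event, indent=2))
```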

From the consumer’s perspective, lineage can help with three scenarios. First, a business analyst browsing for an asset can go to the Amazon DataZone portal, search for an asset by name, and select one that interests them to dive into the details. Initially, they’re presented with details in the Business Metadata tab and can move right to neighboring tabs. To view lineage, the analyst can go to the Lineage tab, which shows that asset’s lineage one level upstream and one level downstream. To find the source, the analyst can keep choosing upstream nodes until they reach the origin of the asset. When the analyst is sure that this is the correct asset, they can subscribe to the asset and continue with their work.

Second, if a data issue is reported—for instance, when a dashboard unexpectedly shows a significant increase in customer count—a data engineer can use the Amazon DataZone portal to locate and examine the relevant asset details. In the asset details page, the data engineer navigates to the Lineage tab to view the details of upstream nodes of the asset in question. The engineer can dive into the details of each node, its snapshots, column mapping between each table node, the jobs that ran in between, and view the query that was executed in the job run. Using this information, the data engineer can spot that a new input table was added to the pipeline, which has introduced an uptick in customer count, because they notice that this new table wasn’t part of the previous snapshots of the job runs. This helps them clarify that a new source was added and hence the data shown in the dashboard is accurate.
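The comparison the data engineer makes in this scenario amounts to a set difference between the inputs of the latest job run and those of earlier snapshots. The following is a small sketch of that idea; the table names and the list-of-lists representation of snapshots are hypothetical and don't reflect how Amazon DataZone stores lineage.

```python
def new_inputs(previous_runs, latest_run):
    """Return input tables in the latest run that no earlier run used."""
    seen = set()
    for run_inputs in previous_runs:
        seen.update(run_inputs)
    return sorted(set(latest_run) - seen)


# Hypothetical snapshots: each inner list holds the input tables of one job run.
previous = [["crm.customers"], ["crm.customers"]]
latest = ["crm.customers", "web.signups"]

print(new_inputs(previous, latest))  # ['web.signups']
```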

Lastly, a steward looking to respond to questions from an auditor can go to the asset in question and navigate to the Lineage tab of that asset. The steward traverses the graph upstream to see where the data is coming from and notices that the data is from two different teams (for instance, from two different on-premises databases), each with its own pipeline, until it reaches a point where the pipelines merge. While navigating through the lineage graph, the steward can expand the columns to make sure sensitive columns are dropped during the transformation processes and respond to the auditors with details in a timely manner.

How Amazon DataZone automates lineage collection
Amazon DataZone now enables automatic capture of lineage events, helping data producers and administrators to streamline the tracking of data relationships and transformations across their AWS Glue and Amazon Redshift resources. Automatic capture of lineage events from AWS Glue and Amazon Redshift is opt-in, because some of your jobs or connections might be for testing and you might not need any lineage to be captured. With the integrated experience, these services provide an option in your configuration settings to opt in to collecting and emitting lineage events directly to Amazon DataZone.

These events should capture the various data transformation operations you perform on tables and other objects, such as table creation with column definitions, schema changes, and transformation queries, including aggregations and filtering. By obtaining these lineage events directly from your processing engines, Amazon DataZone can build a foundation of accurate and consistent data lineage information. This will then help you, as a data producer, to further curate the lineage data as part of the broader business data catalog capabilities.

Administrators can enable lineage when setting up the built-in DefaultDataLake or the DefaultDataWarehouse blueprints.

Data producers can view the status of automated lineage while setting up the data source runs.

With the recent launch of the next generation of Amazon SageMaker, data lineage is available as one of the catalog capabilities in the Amazon SageMaker Unified Studio (preview). Data users can set up lineage using connections, and that configuration will automate the capture of lineage in the platform for all users to browse and understand the data. Here’s how data lineage in next generation Amazon SageMaker will look.

Now available
You can begin using this capability to gain deeper insights into your data ecosystem and drive more informed, data-driven decision-making.

Data lineage is generally available in all AWS Regions where Amazon DataZone is available. For a list of Regions where Amazon DataZone domains can be provisioned, visit AWS Services by Region.

Data lineage costs are dependent on storage usage and API requests, which are already included in the Amazon DataZone pricing model. For more details, visit Amazon DataZone pricing.

To get started with data lineage in Amazon DataZone, visit the Amazon DataZone User Guide.

— Esra

Introducing the next generation of Amazon SageMaker: The center for all your data, analytics, and AI

2024-12-03 Antje Barth

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/introducing-the-next-generation-of-amazon-sagemaker-the-center-for-all-your-data-analytics-and-ai/

Today, we’re announcing the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI. The all-new SageMaker includes virtually all of the components you need for data exploration, preparation and integration, big data processing, fast SQL analytics, machine learning (ML) model development and training, and generative AI application development.

The current Amazon SageMaker has been renamed to Amazon SageMaker AI. SageMaker AI is integrated within the next generation of SageMaker while also being available as a standalone service for those who wish to focus specifically on building, training, and deploying AI and ML models at scale.

Highlights of the new Amazon SageMaker
At its core is SageMaker Unified Studio (preview), a single data and AI development environment. It brings together functionality and tools from the range of standalone “studios,” query editors, and visual tools that we have today in Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon Managed Workflows for Apache Airflow (MWAA), and the existing SageMaker Studio. We’ve also integrated Amazon Bedrock IDE (preview), an updated version of Amazon Bedrock Studio, to build and customize generative AI applications. In addition, Amazon Q provides AI assistance throughout your workflows in SageMaker.

Here’s a list of key capabilities:

Amazon SageMaker Unified Studio (preview) – Build with all your data and tools for analytics and AI in a single environment.
Amazon SageMaker Lakehouse – Unify data across Amazon Simple Storage Service (Amazon S3) data lakes, Amazon Redshift data warehouses, and third-party and federated data sources with Amazon SageMaker Lakehouse.
Data and AI Governance – Securely discover, govern, and collaborate on data and AI with Amazon SageMaker Catalog, built on Amazon DataZone.
Data Processing – Analyze, prepare, and integrate data for analytics and AI using open source frameworks on Amazon Athena, Amazon EMR, and AWS Glue.
Model development – Build, train, and deploy ML and foundation models (FMs) with fully managed infrastructure, tools, and workflows with Amazon SageMaker AI.
Generative AI app development – Build and scale generative AI applications with Amazon Bedrock.
SQL analytics – Gain insights with Amazon Redshift, the most price-performant SQL engine.

In this post, I give you a quick tour of the new SageMaker Unified Studio experience and how to get started with data processing, model development, and generative AI app development.

Working with Amazon SageMaker Unified Studio (preview)
With SageMaker Unified Studio, you can discover your data and put it to work using familiar AWS tools to complete end-to-end development workflows, including data analysis, data processing, model training, and generative AI app building, in a single governed environment.

An integrated SQL editor lets you query data from multiple sources, and a visual extract, transform, and load (ETL) tool simplifies the creation of data integration and transformation workflows. New unified Jupyter notebooks enable seamless work across different compute services and clusters. With the new built-in data catalog functionality, you can find, access, and query data and AI assets across your organization. Amazon Q is integrated to streamline tasks across the development lifecycle.

Let’s explore the individual capabilities in more detail.

Data processing
SageMaker integrates with SageMaker Lakehouse and lets you analyze, prepare, integrate, and orchestrate your data in a unified experience. You can integrate and process data from various sources using the provided connectivity options.

Start by creating a project in SageMaker Unified Studio, choosing the SQL analytics or data analytics and AI-ML model development project profile. Projects are a place to collaborate with your colleagues, share data, and use tools to work with data in a secure way. Project profiles in SageMaker define the preconfigured set of resources and tools that are provisioned when you create a new project. In your project, choose Data in the left menu and start adding data sources.

The built-in SQL query editor lets you query your data stored in data lakes, data warehouses, databases, and applications directly within SageMaker Unified Studio. In the top menu of SageMaker Unified Studio, select Build and choose Query Editor to get started. Also, try creating SQL queries using natural language with Amazon Q while you’re at it.

You should also explore the built-in visual ETL tool to create data integration and transformation workflows using a visual, drag-and-drop interface. In the top menu, select Build and choose Visual ETL flow to get started.

If Amazon Q is enabled, you can also use generative AI to author flows. Visual ETL comes with a wide range of data connectors, pre-built transformations, and features such as scheduling, monitoring, and data previewing to streamline your data workflows.

Model development
SageMaker Unified Studio includes capabilities from SageMaker AI, which provides infrastructure, tools, and workflows for the entire ML lifecycle. From the top menu, select Build to access tools for data preparation, model training, experiment tracking, pipeline creation, and orchestration. You can also use these tools for model deployment and inference, machine learning operations (MLOps) implementation, model monitoring and evaluation, as well as governance and compliance.

To start your model development, create a project in SageMaker Unified Studio using the data analytics and AI-ML model development project profile and explore the new unified Jupyter notebooks. In the top menu, select Build and choose JupyterLab. You can use the new unified notebooks to seamlessly work across different compute services and clusters. You can use these notebooks to switch between environments without leaving your workspace, streamlining your model development process.

You can also use Amazon Q Developer to assist with tasks such as code generation, debugging, and optimization throughout your model development process.

Generative AI app development
Use the new Amazon Bedrock IDE to develop generative AI applications within Amazon SageMaker Unified Studio. The Amazon Bedrock IDE includes tools to build and customize generative AI applications using FMs and advanced capabilities such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, Amazon Bedrock Agents, and Amazon Bedrock Flows to create tailored solutions aligned with your requirements and responsible AI guidelines.

Choose Discover in the top menu of SageMaker Unified Studio to browse Amazon Bedrock models or experiment with the model playgrounds.

Create a project using the GenAI Application Development profile to start building generative AI applications. Choose Build in the top menu of SageMaker Unified Studio and select Chat agent.

With the Amazon Bedrock IDE, you can build chat agents and create knowledge bases from your proprietary data sources with just a few clicks, enabling Retrieval-Augmented Generation (RAG). You can add guardrails to promote safe AI interactions and create functions to integrate with any system. With built-in model evaluation features, you can test and optimize your AI applications’ performance while collaborating with your team. Design flows for deterministic genAI-powered workflows, and when ready, share your applications or prompts within the domain or export them for deployment anywhere—all while maintaining control of your project and domain assets.

For a detailed description of all Amazon SageMaker capabilities, check the SageMaker Unified Studio User Guide.

Getting started
To begin using SageMaker Unified Studio, administrators need to complete several setup steps. This includes setting up AWS IAM Identity Center, configuring the necessary virtual private cloud (VPC) and AWS Identity and Access Management (IAM) roles, creating a SageMaker domain, and enabling Amazon Q Developer Pro. Instead of IAM Identity Center, you can also configure SAML through IAM federation for user management.

After the environment is configured, users sign in through the provided SageMaker Unified Studio domain URL with single sign-on. You can create projects to collaborate with team members, choosing from pre-configured project profiles for different use cases. Each project connects to a Git repository for version control and includes an example unified Jupyter notebook to get you started.

For detailed setup instructions, check the SageMaker Unified Studio Administrator Guide.

Now available
The next generation of Amazon SageMaker is available today in the US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) AWS Regions. Amazon SageMaker Unified Studio and Amazon Bedrock IDE are available today in preview in these AWS Regions. Check the full Region list for future updates.

For pricing information, visit Amazon SageMaker pricing and Amazon Bedrock pricing. To learn more, visit Amazon SageMaker, SageMaker Unified Studio, and Amazon Bedrock IDE.

Existing Amazon Bedrock Studio preview domains will be available until February 28, 2025, but you may not create new workspaces. To experience the advanced features of Bedrock IDE, create a new SageMaker domain following the instructions in the Administrator Guide.

Give the new Amazon SageMaker a try in the console today and let us know what you think! Send feedback to AWS re:Post for Amazon SageMaker or through your usual AWS Support contacts.

— Antje

Introducing the HubSpot connector for AWS Glue

2024-12-02 Eric Bomarsi

Post Syndicated from Eric Bomarsi original https://aws.amazon.com/blogs/big-data/introducing-the-hubspot-connector-for-aws-glue/

Most companies have adopted a diverse set of software as a service (SaaS) platforms to support various applications. The rapid adoption has enabled them to quickly streamline operations, enhance collaboration, and gain more accessible, scalable solutions for managing their critical data and workflows.

More companies have realized there is an opportunity to integrate, enhance, and present this SaaS data to improve internal operations and gain valuable insights on their data. Using AWS Glue, a serverless data integration service, companies can streamline this process, integrating data from internal and external sources into a centralized AWS data lake. From there, they can perform meaningful analytics, gain valuable insights, and optionally push enriched data back to external SaaS platforms.

This post introduces the new HubSpot managed connector for AWS Glue, and demonstrates how you can integrate HubSpot data into your existing data lake on AWS. By consolidating HubSpot data with data from your AWS accounts and from other SaaS services, you can enhance, analyze, and optionally write the data back to HubSpot, creating a seamless and integrated data experience.

Solution overview

In this example, we use AWS Glue to extract, transform, and load (ETL) data from your HubSpot account into a transactional data lake on Amazon Simple Storage Service (Amazon S3), using Apache Iceberg format. We register the schema in the AWS Glue Data Catalog to make your data discoverable. Subsequently, we use Amazon Athena to validate that the HubSpot data has been successfully loaded to Amazon S3. The following diagram illustrates the solution architecture.

The following are key components and steps in the integration:

Configure your HubSpot account and app to enable access to your HubSpot data.
Prepare for data movement by securely storing your HubSpot OAuth credentials in AWS Secrets Manager, creating an S3 bucket to store your ingested data, and creating an AWS Identity and Access Management (IAM) role for AWS Glue.
Create an AWS Glue job to extract and load data from HubSpot to Amazon S3. AWS Glue establishes a secure connection to HubSpot using OAuth for authorization and TLS for data encryption in transit. AWS Glue also supports the ability to apply complex data transformations, enabling efficient data integration and preparation to meet your needs.
Schema and other metadata will be registered in the AWS Glue Data Catalog, a centralized metadata repository for all your data assets. This helps simplify schema management, and also makes the data discoverable by other services.
Run the AWS Glue job to extract data from HubSpot and write it to Amazon S3 using Iceberg format. Apache Iceberg is an open source, high-performance open table format designed for large-scale analytics, providing transactional consistency and seamless schema evolution. Although we use Iceberg in this example, AWS Glue offers robust support for various data formats, including other transactional formats such as Apache Hudi and Delta Lake.
The data loaded to Amazon S3 will be organized into partitioned folders to optimize for query performance and management. Amazon S3 will also store the AWS Glue scripts, logs, and other temporary data required during the ETL process.
Finally, Amazon Athena will be used to query the data loaded from HubSpot to Amazon S3, validating that all changes in the source system have been captured successfully.
Optionally, you can schedule the AWS Glue job to regularly synchronize HubSpot data to Amazon S3 and analyze data updates over time.

Set up your HubSpot account

This example requires you to create a HubSpot public app for AWS Glue in a HubSpot Developer account, and connect it to an associated HubSpot account. A HubSpot public app is a type of integration that can be installed in your HubSpot accounts or listed in the HubSpot Marketplace. In this example, you create a HubSpot app for the AWS Glue integration, and install it in a new test account. Although HubSpot calls it a public app, it will not be listed in their Marketplace and will only have access to your test account.

If you don’t already have one, sign up for a free HubSpot developer account.
Log in to your HubSpot developer account, where you’ll see options to create apps and test accounts.
Choose Create a test account and follow the instructions.

HubSpot test accounts have Enterprise versions of the HubSpot Marketing, Sales, and Service Hubs along with sample data, so you can test most HubSpot tools, create CRM data, and access it through APIs with AWS Glue. For more information about creating a test account, refer to Create a developer test account.

Create a HubSpot app

Complete the following steps to create a HubSpot app:

Switch back to your HubSpot developer account, and choose Create an app.
Fill in the App Info section with the name AWS Glue and a brief description.
Choose the Auth tab.
For Redirect URLs, enter the redirect URL for AWS Glue in the form: https://<region>.console.aws.amazon.com/gluestudio/oauth.

Be sure to replace <region> with your AWS Glue operating AWS Region. For instance, the code for the US East (N. Virginia) Region is us-east-1, so the AWS Glue redirect URL is https://us-east-1.console.aws.amazon.com/gluestudio/oauth.

In the Scopes section, choose Add new scope and select the following permissions:
- automation
- content
- crm.lists.read
- crm.lists.write
- crm.objects.companies.read
- crm.objects.companies.write
- crm.objects.contacts.read
- crm.objects.contacts.write
- crm.objects.custom.read
- crm.objects.custom.write
- crm.objects.deals.read
- crm.objects.deals.write
- crm.objects.owners.read
- crm.schemas.custom.read
- e-commerce
- forms
- oauth
- sales-email-read
- tickets
Review the Scopes and Redirect URL settings, then choose Create app.
Navigate back to your app Auth tab.
Take note of the values for Client ID, Client secret, and Install URL (OAuth). You will need these later to connect your AWS Glue instance.
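The redirect URL entered in the steps above follows a fixed pattern that varies only by Region. As a quick sketch, a small helper can build it and reject values that don't look like a Region code; the validation pattern is an assumption about how Region codes are typically formed, not an official rule.

```python
import re


def glue_redirect_url(region):
    """Return the AWS Glue OAuth redirect URL for a Region code such as us-east-1."""
    # Loose shape check (assumption): two letters, one or more words, a trailing number.
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d+", region):
        raise ValueError(f"not a Region code: {region!r}")
    return f"https://{region}.console.aws.amazon.com/gluestudio/oauth"


print(glue_redirect_url("us-east-1"))
# https://us-east-1.console.aws.amazon.com/gluestudio/oauth
```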

Select or create an Amazon S3 bucket where your HubSpot data will reside

Select an existing Amazon S3 bucket in your account, or create a new bucket to store your HubSpot data, as well as scripts, logs, and so on. For this example, the bucket name will follow the format aws-glue-hubspot-<account>-<region>, where <account> is the AWS account number and <region> is the operating Region. The bucket will be configured with all defaults: public access disabled, versioning disabled, and server-side encryption with Amazon S3 managed keys (SSE-S3).

If you attach the AWSGlueServiceRole managed policy to your IAM role as shown in this example, it will provide access to S3 buckets with names starting with aws-glue-.
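As an illustration of the naming convention above, the following sketch builds the bucket name from an account number and Region and checks that it starts with the aws-glue- prefix mentioned for the AWSGlueServiceRole policy. The function names and the sample account number are hypothetical.

```python
def hubspot_bucket_name(account_id, region):
    """Build a bucket name of the form aws-glue-hubspot-<account>-<region>."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account numbers are 12 digits")
    name = f"aws-glue-hubspot-{account_id}-{region}"
    if not 3 <= len(name) <= 63:
        raise ValueError("S3 bucket names must be 3 to 63 characters long")
    return name


def has_glue_prefix(bucket_name):
    """True if the bucket name starts with the aws-glue- prefix."""
    return bucket_name.startswith("aws-glue-")


bucket = hubspot_bucket_name("123456789012", "us-east-1")  # placeholder account number
print(bucket, has_glue_prefix(bucket))
```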

Create an IAM role for AWS Glue

Create an IAM role with permissions for the AWS Glue job. AWS Glue will assume this role when calling other services on your behalf.

On the IAM console, choose Roles in the navigation pane.
Choose Create role.
For Trusted entity type, choose AWS service.
For Use case, choose Glue.
Add the following AWS managed policies to the role:
1. AWSGlueServiceRole for accessing related services such as Amazon S3, Amazon Elastic Compute Cloud, Amazon CloudWatch, and IAM. This policy enables access to S3 buckets with names starting with aws-glue-.
2. SecretsManagerReadWrite for read/write access to AWS Secrets Manager.
Give the role a name, for instance AWSGlueServiceRole_blog.

For more information, see Getting started with AWS Glue and Create an IAM role for AWS Glue.
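For readers who script role creation instead of using the console, the console choices above correspond roughly to a trust policy that lets the AWS Glue service assume the role, plus the two managed policies. The following sketch only builds and prints those documents; it doesn't call IAM, and the variable names are illustrative.

```python
import json

# Trust policy allowing the AWS Glue service to assume the role.
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS managed policies named in the steps above.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/SecretsManagerReadWrite",
]

ROLE_NAME = "AWSGlueServiceRole_blog"

print(ROLE_NAME)
print(json.dumps(GLUE_TRUST_POLICY, indent=2))
for arn in MANAGED_POLICY_ARNS:
    print(arn)
```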

Create an AWS Secrets Manager secret

AWS Secrets Manager is used to securely store your HubSpot OAuth credentials. Complete the following steps to create a secret:

On the AWS Secrets Manager console, choose Secrets in the navigation pane.
Choose Store a new secret.
For Secret type, select Other type of secret.
Under Key/value pairs, enter the HubSpot client secret with the key USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET.
Choose Next.

Enter the secret name, such as HubSpot-Blog, a description, and continue.
Leave the secret rotation as default, and choose Next.
Review the secret configuration, and choose Store.
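The key/value pair entered in the steps above is stored as a small JSON document. The following sketch shows what that secret string looks like when built in code; the helper name and the sample value are hypothetical, and the string would be supplied as the secret value if you create the secret programmatically instead of in the console.

```python
import json

SECRET_KEY = "USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET"


def hubspot_secret_string(client_secret):
    """Serialize the HubSpot client secret under the key the connection expects."""
    if not client_secret:
        raise ValueError("client secret must not be empty")
    return json.dumps({SECRET_KEY: client_secret})


# Placeholder value only; never hard-code a real client secret.
secret_string = hubspot_secret_string("example-client-secret")
print(secret_string)
```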

Create an AWS Glue connection

Complete the following steps to create an AWS Glue connection to your HubSpot account:

On the AWS Glue console, choose Data connections in the navigation pane.
Choose Create connection.
For Data sources, search for and select HubSpot.
Choose Next.

On the Configure connection page, fill in the required information:
1. For IAM service role, choose the service role created previously. In this example, we use the role AWSGlueServiceRole_blog.
2. For Authentication URL, leave as default.
3. For User Managed Client Application ClientId, enter the OAuth client ID from HubSpot.
4. For AWS Secret, choose the OAuth client secret name configured previously in AWS Secrets Manager.
5. Choose Next.

Choose Test Connection to validate the connection to HubSpot.
This will bring up a new HubSpot connection window. Be sure to select your HubSpot test account (not your developer account) to test the connection.
If this is your first connection attempt, you will be redirected to another page where you are asked to confirm the access level granted to AWS Glue. Choose Connect App.

If successful, the HubSpot window will close and your AWS connection window will say Connection test successful.

Under Set properties, for Name, enter a name (for example, HubSpot_Connection_blog).
Choose Next.
Under Review and create, review your settings and then create the connection.

Create a database in AWS Glue Data Catalog

Complete the following steps to create a database in AWS Glue Data Catalog to organize your HubSpot data:

On the AWS Glue console, choose Databases in the navigation pane.
Create a new database.
Enter a name (for example, hubspot).
You can leave the location field blank.
Choose Create database.

Create an AWS Glue ETL job

Now that you have an AWS Glue data connection to your HubSpot account, you can create an AWS Glue ETL job to ingest HubSpot data into your AWS data lake. AWS Glue provides both visual and code-based interfaces to simplify data integration, depending on your expertise. In this example, we use the Script interface to ingest HubSpot data into the Amazon S3 location. Complete the following steps:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Choose the Script editor.
Choose Spark as the engine, and upload the following script.

The AWS Glue Spark job reads the HubSpot data and merges it into the S3 bucket in Iceberg format.

On the Job details tab, provide the following information:
For Name, enter a name, such as HubSpot_to_S3_blog.
For Description, enter a meaningful description of the job.
For IAM Role, choose the IAM role you created previously (for this post, AWSGlueServiceRole_blog).

Expand Advanced properties.
Under Connections, enter your HubSpot connection from the previous section (for this post, HubSpot_Connection_blog).

Under Job parameters, enter the following parameters:

- For --conf, enter spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse
- For --datalake-formats, enter iceberg
- For --db_name, enter the AWS Glue database to store your data lake (for this post, hubspot)
- For --table_name, enter the HubSpot table to be ingested (for this post, company)
- For --s3_bucket_name, enter where the ingested Iceberg table is stored, in this case aws-glue-hubspot-<account>-<region>
- For --connection_name, enter the AWS Glue connection name created, in this case HubSpot_Connection_blog

Choose Save to save the job, then choose Run.

Depending on the amount of data in your HubSpot account, the job can take a few minutes to complete. After a successful job run, you can choose Run details to see the job specifications and logs.
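The --conf job parameter above packs several Spark settings into one value by repeating --conf between them. The following sketch assembles that value and the other job parameters from a dictionary, which can help avoid typos when copying the long string. The helper names are hypothetical and the account number is a placeholder; the keys and values come from the list above.

```python
ICEBERG_SPARK_CONF = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.warehouse": "file:///tmp/spark-warehouse",
}


def chained_conf_value(settings):
    """Join key=value pairs with ' --conf ' so one job parameter carries them all."""
    return " --conf ".join(f"{key}={value}" for key, value in settings.items())


def job_parameters(db_name, table_name, s3_bucket_name, connection_name):
    """Return the job parameters from this post as a dict keyed by parameter name."""
    return {
        "--conf": chained_conf_value(ICEBERG_SPARK_CONF),
        "--datalake-formats": "iceberg",
        "--db_name": db_name,
        "--table_name": table_name,
        "--s3_bucket_name": s3_bucket_name,
        "--connection_name": connection_name,
    }


params = job_parameters(
    db_name="hubspot",
    table_name="company",
    s3_bucket_name="aws-glue-hubspot-123456789012-us-east-1",  # placeholder account
    connection_name="HubSpot_Connection_blog",
)
for name, value in params.items():
    print(name, "=", value)
```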

Use Athena to query data

Athena is an interactive and serverless query service that makes it straightforward to analyze data directly in Amazon S3 using standard SQL. In this example, we query the results of the HubSpot data ingested into Amazon S3.

On the Athena console, choose Query editor.
For Database, choose hubspot, and you should see your company table.
Select entries from the hubspot.company table to view the data captured from HubSpot.

You can try various queries on the HubSpot data, such as:

-- get sample of dataset
SELECT * FROM "hubspot"."company" limit 10;

-- get companies revenue
SELECT * FROM "hubspot"."company" A
WHERE A.annualrevenue IS NOT NULL;

-- get number of companies with revenue
SELECT COUNT(*) AS companies_count FROM "hubspot"."company" A
WHERE A.annualrevenue IS NOT NULL;

Over time, your HubSpot data may change. You can rerun your ETL job periodically, and the Iceberg data lake table will capture your changes. You can verify this by adding, removing, and changing companies in your HubSpot database, and then rerunning the ETL job. Your data lake should match your latest HubSpot data. With this capability, you can schedule the ETL job to run as often as you need.
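One common way for a rerun to fold changes into an Iceberg table is a MERGE INTO statement that updates matching rows and inserts new ones. The following sketch only builds such a statement as a string and is not the script used by the job in this post. The source view name hubspot_updates and the id key column are assumptions, and this form covers inserts and updates only, not deletions.

```python
def build_merge_sql(db_name, table_name, source_view="hubspot_updates", key="id"):
    """Build an Iceberg-style upsert statement for the glue_catalog catalog."""
    target = f"glue_catalog.{db_name}.{table_name}"
    return (
        f"MERGE INTO {target} t "
        f"USING {source_view} s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


print(build_merge_sql("hubspot", "company"))
```

In a Spark job with the Iceberg extensions enabled, a statement like this would be passed to spark.sql after registering the freshly read data as the source view.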

Extending the HubSpot connector with AWS services

The HubSpot connector for AWS Glue provides a powerful foundation for building comprehensive data pipelines and analytics workflows. By integrating HubSpot data into your AWS environment, you can use additional services like Amazon Redshift, Amazon QuickSight, and Amazon SageMaker to further process, transform, and analyze the data. This allows you to construct sophisticated, end-to-end data architectures that unlock the full value of your HubSpot data, without the need to manage complex infrastructure. The seamless integration between these AWS services makes it straightforward to build scalable analytics pipelines tailored to your specific requirements.

Considerations

You can set up AWS Glue job triggers to run the ETL jobs on a schedule, so that the data is regularly synchronized between HubSpot and Amazon S3. You can also integrate the ETL jobs with other AWS services, including AWS Step Functions, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Lambda, Amazon EventBridge, and Amazon Bedrock to create a more advanced data processing pipeline.
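As a sketch of what a scheduled trigger definition could look like, the following helper assembles the fields a scheduled AWS Glue trigger needs: a name, a cron schedule, and the job to start. The trigger name and the 02:00 UTC daily schedule are illustrative assumptions; the resulting dictionary is only printed here rather than sent to AWS Glue.

```python
def scheduled_trigger_request(trigger_name, job_name, cron_fields="0 2 * * ? *"):
    """Build a scheduled trigger definition that starts one AWS Glue job."""
    # AWS cron expressions use six fields:
    # minutes, hours, day-of-month, month, day-of-week, year.
    if len(cron_fields.split()) != 6:
        raise ValueError("expected six cron fields")
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_fields})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


request = scheduled_trigger_request("HubSpot_daily_sync", "HubSpot_to_S3_blog")
print(request)
```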

By default, the HubSpot connector doesn’t import deleted records. However, you can set the IMPORT_DELETED_RECORDS option to true to import all records, including the deleted ones.
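The following sketch shows one way the IMPORT_DELETED_RECORDS option might be carried alongside other connection options when reading from the connector. Only IMPORT_DELETED_RECORDS comes from this post; the other key names (connectionName, ENTITY_NAME, API_VERSION), the v3 value, and the string form of true are assumptions, so check the connector documentation before relying on them.

```python
def hubspot_connection_options(connection_name, entity_name, import_deleted=False):
    """Assemble connection options for reading one HubSpot entity."""
    options = {
        "connectionName": connection_name,  # assumed key name
        "ENTITY_NAME": entity_name,         # assumed key name
        "API_VERSION": "v3",                # assumed key name and value
    }
    if import_deleted:
        # Option named in this post; left unset, deleted records aren't imported.
        options["IMPORT_DELETED_RECORDS"] = "true"
    return options


print(hubspot_connection_options("HubSpot_Connection_blog", "company", import_deleted=True))
```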

Clean up

To avoid incurring charges, clean up the resources used in this post from your AWS account, including the AWS Glue jobs, HubSpot connection, AWS Secrets Manager secret, IAM role, and Amazon S3 bucket.

Conclusion

With the introduction of the AWS Glue connector for HubSpot, integrating HubSpot data with information from other data sources has become more streamlined than ever. This feature enables you to set up ongoing data integration from HubSpot to AWS, providing a unified view of data from across platforms and enabling more comprehensive analytics. The serverless nature of AWS Glue means there is no infrastructure management required, and you only pay for the resources consumed. By following the steps outlined in this post, you can make sure that up-to-date data from HubSpot is captured in your data lake, allowing teams to make faster data-driven decisions and uncover complex insights from across data sources.

To learn more about the AWS Glue connector for HubSpot, refer to Connecting to HubSpot in AWS Glue. This guide walks through the entire process, from setting up the connection to running the data transfer flow. For more information on AWS Glue, visit AWS Glue.

About the Authors

Eric Bomarsi is a Senior Solutions Architect in the ISV group at AWS, where he focuses on building scalable solutions for large customers. As a member of the AWS analytics community, he helps customers get strategic insights from their data. Outside of work, he enjoys playing ice hockey and traveling with his family.

Annie Nelson is a Senior Solutions Architect at AWS. She is a data enthusiast who enjoys problem solving and tackling complex architectural challenges with customers.

AWS Clean Rooms now supports multiple clouds and data sources

2024-12-02 Esra Kayabali

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/aws-clean-rooms-now-supports-multiple-clouds-and-data-sources/

Today, we are announcing support for Snowflake and Amazon Athena as new sources for AWS Clean Rooms data collaborations. AWS Clean Rooms helps you and your partners more seamlessly and securely analyze your collective datasets without sharing or copying one another’s underlying data. This enhancement helps you collaborate with datasets stored in Snowflake or those queryable through Athena features, such as AWS Lake Formation permissions or AWS Glue Data Catalog views, without moving or revealing the source data.

You often need to collaborate with partners to analyze datasets to get insights for research and development, investments, or marketing and advertising campaigns. In some cases, your partners’ datasets are stored or managed outside of Amazon Simple Storage Service (Amazon S3), and companies want to reduce or eliminate the complexity, cost, compliance risks, and delays that are associated with moving or copying data. Companies also find that copying data can result in them using outdated information, potentially reducing the quality of the insights gained.

This launch helps companies to collaborate on the most up-to-date collective datasets in an AWS Clean Rooms collaboration with zero extract, transform, and load (zero-ETL). This eliminates the cost and complexity associated with migrating datasets out of existing environments. For example, an advertiser with data stored in Amazon S3 and a media publisher with data stored in Snowflake can run an audience overlap analysis to determine the percentage of users present in their collective datasets without having to build ETL data pipelines, or share underlying data with one another. No underlying data from external data sources is permanently stored in AWS Clean Rooms during the collaboration process and any data temporarily read into the AWS Clean Rooms analysis environment is deleted upon query completion. You can now work with your partners regardless of where their data is stored, streamlining the process of generating insights.

Let me show you how to use this feature.

How to use multiple clouds and data sources in AWS Clean Rooms
To demonstrate this feature, I use a scenario between an advertiser, Company A, and a publisher, Company B. Company A wants to know how many of their high-value users can be reached on Company B’s website before running an ad campaign. Company A stores their data in Amazon S3. Company B stores their data in Snowflake. To use AWS Clean Rooms, both parties must have their own AWS accounts.

In this demo, Company A, the advertiser, is the collaboration creator. Company A creates the AWS Clean Rooms collaboration and invites Company B, who has data hosted in Snowflake, to collaborate. You can follow the specific steps to create a collaboration in the AWS Clean Rooms general availability announcement blog post.

Next, I show how Company B, the publisher, creates a configured table in AWS Clean Rooms, specifying Snowflake as the data source and providing the Secrets Manager Amazon Resource Name (ARN). AWS Secrets Manager helps you manage, retrieve, and rotate secrets such as database credentials throughout their lifecycles. Your secret must contain the credentials for a Snowflake user with read-only permission to the data you want to collaborate on. AWS Clean Rooms uses the secret ARN to read your secret and access the data stored in Snowflake. See the Secrets Manager documentation for step-by-step instructions for creating your secret.

Using Company B’s AWS account, I go to the AWS Clean Rooms console and choose Tables under Configured resources. I choose Configure new table. I choose Snowflake under Third-party clouds and data sources. I enter the Secret ARN for the secret that contains Snowflake credentials for a role with read access to the Snowflake dataset that I want to collaborate on. These are the credentials that you use to verify the identity of the entity trying to access the Snowflake table and schema. If you don’t have a secret ARN, you can create a new secret using the Store a new secret for this table option.

To define the table and schema details, I use the Import from file option and choose the Columns View Information Schema CSV file I exported from Snowflake to populate the information for me. You can also enter the information manually.

For this demo, I choose All columns under the Columns allowed in collaborations. Next, I choose Configure new table.

I go to the configured table and observe the table details, such as AWS accounts allowed to create queries and columns available for querying. On this page, I can edit the table name, description, and analysis rule.

As part of configuring a table to use in AWS Clean Rooms for collaboration analysis, I need to configure an analysis rule. An analysis rule is a privacy-enhancing control that each data owner sets up on a configured table. An analysis rule determines how the configured table can be analyzed. I choose Configure analysis rule to configure a custom analysis rule that allows custom queries to be run on the configured table.

In Step 1, I proceed with the selections. You can use the JSON editor to create, paste, or import an analysis rule definition in JSON format. I choose Next.

In Step 2, I choose Allow any queries created by specific collaborators to run without review on this table under Analyses for direct querying. With this option, only queries provided by the AWS accounts that I specify in the list of allowed accounts can be run on the table. All analysis templates created by the allowed accounts will automatically be allowed to be run on this table without requiring a review. I choose the allowed account under AWS account ID and choose Next.

In Step 3, I proceed with the selections. I choose None under Columns not allowed in output to allow all columns to be shown in the query output. I choose Not allowed under Additional analyses applied to output, so no additional analyses can be run on this table. I choose Next.

In the final step, I review the configuration and choose Configure analysis rule.

Next, I use Associate to collaboration to associate the table with the collaboration that Company A, the advertiser, created.

On the pop-up window, I choose a collaboration from the ones with active memberships and select Choose collaboration.

On the next page, I choose the Configured table name and enter the Name under Table associations details. I choose a method to authorize AWS Clean Rooms to give the permission to query the table. I choose Associate table.

Company A, the advertiser, and Company B, the publisher, can now run an audience overlap analysis to determine the percentage of users present in their collective datasets without accessing each other’s raw data. The analysis helps determine how much of the advertiser’s audience can be reached by the publisher. By evaluating the overlap, advertisers can determine whether the publisher provides unique reach or if the publisher’s audience predominantly overlaps with the advertiser’s existing audience, without either party having to move or share their source data. I switch to Company A’s account and go to the AWS Clean Rooms console. I choose the collaboration I created and run the following query to get the audience overlap analysis result:

select count(distinct emailaddress)
from customer_data_example as advertiser
inner join synthetic_customer_data as publisher
on emailaddress = publisher_hashed_email_address

In this example, I used Snowflake as a data source. You can also run queries on this data using Athena while following AWS Lake Formation permissions. This helps you do row- and column-level filtering with Lake Formation fine-grained access control and transform data using AWS Glue Data Catalog views before the datasets are associated to the collaboration.
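To make the overlap metric concrete, here is a minimal local sketch of the same calculation: a distinct count of matching identifiers and the share of the advertiser audience they represent. It only illustrates the arithmetic; AWS Clean Rooms runs the actual query without either party seeing the other's rows. The SHA-256 hashing, the normalization rules, and the sample addresses are assumptions made for this illustration.

```python
import hashlib


def hash_email(email: str) -> str:
    """Normalize and hash an email address (normalization rules are an assumption)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def audience_overlap(advertiser_hashes, publisher_hashes):
    """Return (distinct matches, share of the advertiser audience that matched)."""
    advertiser = set(advertiser_hashes)
    matched = advertiser & set(publisher_hashes)
    share = len(matched) / len(advertiser) if advertiser else 0.0
    return len(matched), share


# Made-up sample data for both parties.
advertiser = [hash_email(e) for e in
              ["ana@example.com", "bo@example.com", "cy@example.com", "di@example.com"]]
publisher = [hash_email(e) for e in
             ["BO@example.com ", "di@example.com", "eve@example.com"]]

count, share = audience_overlap(advertiser, publisher)
print(count, f"{share:.0%}")
```

Using sets mirrors the `count(distinct ...)` in the query: a user who appears several times on either side is still counted once.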

Customer and partner voices
“Data security and privacy is essential to our work at Kinective Media by United Airlines, the world’s first traveler media network,” said Khatidja Ajania, Director, Strategic Partnerships, Kinective Media by United Airlines. “AWS Clean Rooms support of source data in multiple clouds and AWS sources enables us to securely and seamlessly work with more brands to deliver on closed loop measurement and other key use cases. This enhancement will make it easier for us to securely deliver personalized experiences, content, and relevant offerings to millions of United travelers through privacy-enhanced collaboration with our advertisers and partners.”

“Snowflake recognizes the challenges of source data interoperability across tech stacks when using data clean room technology; we are excited to see the progress and one more step taken in the direction of a shared goal to empower users to unlock the full potential of their data partnerships through their solution of choice, safely and effectively” – Kamakshi Sivaramakrishnan, General Manager, Snowflake Data Clean Rooms

Now available
Support for Snowflake and Athena as data sources in AWS Clean Rooms offers significant benefits for cross-cloud collaboration. This launch eliminates the need for data movement across clouds and data sources and simplifies the collaboration process. This is a first step in our efforts to expand the ways in which customers can securely collaborate with any of their partners while protecting sensitive information, regardless of where their data is stored.

Get started with AWS Clean Rooms today. To learn more about collaborating with multiple data sources, visit the AWS Clean Rooms documentation.

— Esra

New Amazon CloudWatch and Amazon OpenSearch Service launch an integrated analytics experience

2024-12-01 Elizabeth Fuentes

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-and-amazon-opensearch-service-launch-an-integrated-analytics-experience/

Today, Amazon Web Services (AWS) announces a new integrated analytics experience and zero-ETL integration between Amazon CloudWatch and Amazon OpenSearch Service. This integration simplifies log data analysis and visualization without data duplication, streamlining log management while reducing technical overhead and operational costs. CloudWatch Logs customers now have access to two additional query languages beyond CloudWatch Logs Insights QL, while OpenSearch customers can query CloudWatch logs in place without creating separate extract, transform, and load (ETL) pipelines.

Organizations often need different analytics capabilities for their log data. Some teams prefer CloudWatch Logs for its scalability and simplicity in centralizing logs from all their systems, applications, and AWS services. Others require OpenSearch Service for advanced analytics and visualizations. Previously, integration between these services required maintaining separate ingestion pipelines or creating ETL processes. This new integration helps customers get the best of both services: it eliminates this complexity by bringing the power of OpenSearch analytics directly to CloudWatch Logs, without any data copy.

Amazon CloudWatch Logs now supports OpenSearch Piped Processing Language (PPL) and OpenSearch SQL directly within the CloudWatch Logs Insights console. You can use SQL to analyze data and correlate logs using JOIN. You can use SQL functions (such as JSON, mathematical, datetime, and string functions) for intuitive log analytics. You can also use the OpenSearch PPL to filter, aggregate, and analyze data. With a few clicks, you can access pre-built, out-of-the-box dashboards for vended logs, such as Amazon Virtual Private Cloud (VPC), AWS CloudTrail, and AWS WAF. These dashboards enable faster monitoring and troubleshooting through visualizations, such as analyzing flows over time, top talkers, megabytes, and packets transferred over time, without having to configure individual widgets or build specific queries. You can analyze VPC flows over time, identify top talkers, track network traffic metrics, monitor web request trends in AWS WAF, or analyze API activity patterns in AWS CloudTrail.

Additionally, OpenSearch Service users can now analyze CloudWatch logs using OpenSearch Discover and run SQL and PPL, similar to how they analyze data in Amazon Simple Storage Service (Amazon S3), and build indexes and create dashboards directly without any ETL operations or separate ingestion pipelines.

Let’s explore how this integration works
To demonstrate the new OpenSearch SQL and PPL query capabilities in CloudWatch, I start in the CloudWatch console. In the navigation pane, I choose Logs then Logs Insights. After selecting log groups for the query, I can now use OpenSearch PPL or OpenSearch SQL query languages directly within CloudWatch Logs Insights, with no additional setup or integration required. Using this new capability, I can write complex queries using familiar SQL syntax or OpenSearch PPL, making log analysis more intuitive and efficient. In the Query commands menu, you can find sample queries to help you get started.

This example demonstrates how to use SQL JOIN to combine data from two log groups: pet adoptions and pet availability. By filtering for specific customer IDs, you can analyze related log records and trace IDs for troubleshooting purposes.

One of the powerful features of this integration for CloudWatch Logs customers is the ability to create pre-built dashboards for Amazon VPC Flows, AWS CloudTrail and AWS WAF logs. Let’s explore this by creating a dashboard for AWS WAF logs. In the Analyze with OpenSearch tab, I choose Settings and follow the steps.

After a few minutes, my integration is ready and I go to Create an OpenSearch dashboard. In the options Select automatic dashboard type, I choose AWS WAF logs.

In the Dashboard data configuration tab, I can select Data synchronization frequency to occur every 15 minutes. I Select the log groups and View log samples of the selected log groups. I finish by choosing Create a dashboard.

After creating my dashboard, I can explore my logs. The AWS WAF logs dashboard provides comprehensive visibility into web application firewall metrics and events, with automatically configured visualizations that help you monitor and analyze security patterns.

Similarly, the CloudTrail dashboard offers deep insights into API activity across your AWS environment. It’s useful for monitoring API activity, auditing actions, and identifying potential security or compliance issues.

The VPC Flow Logs dashboard provides detailed visualization of key metrics from your logs for network traffic analysis. You can analyze network traffic, detect unusual patterns, and monitor resource usage. The dashboard currently supports only VPC v2 fields (default format). Custom formatted fields are not supported.
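As a rough illustration of what a "top talkers" view computes, the following sketch parses records in the default (version 2) flow log format and sums bytes per source address. It is not how the dashboard is implemented; the function names and the sample records are made up for this example.

```python
from collections import Counter

# Field order of the default (version 2) VPC flow log format.
V2_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_v2_record(line: str) -> dict:
    """Split one space-separated flow log line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(V2_FIELDS):
        raise ValueError(f"expected {len(V2_FIELDS)} fields, got {len(values)}")
    return dict(zip(V2_FIELDS, values))


def top_talkers(lines, n=3):
    """Return the n source addresses that sent the most bytes."""
    totals = Counter()
    for line in lines:
        record = parse_v2_record(line)
        if record["log_status"] != "OK":
            continue  # NODATA and SKIPDATA records carry "-" instead of traffic values
        totals[record["srcaddr"]] += int(record["bytes"])
    return totals.most_common(n)


# Made-up sample records.
sample = [
    "2 123456789012 eni-0abc 10.0.0.5 10.0.1.9 443 49152 6 10 8400 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc 10.0.0.7 10.0.1.9 22 49153 6 4 600 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0abc 10.0.0.5 10.0.1.8 443 49154 6 2 1600 1700000060 1700000120 ACCEPT OK",
    "2 123456789012 eni-0abc - - - - - - - 1700000060 1700000120 - NODATA",
]
print(top_talkers(sample))
```

A record in a custom format would have a different number or order of fields, which is why the parser rejects lines that do not match the 14-field default layout.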

With zero-ETL access to CloudWatch data from OpenSearch Service, I can also build an OpenSearch dashboard from the OpenSearch Service console without having to build and maintain an ETL process. For this, I go to Central management, select the new Connected data sources menu, choose Connect to create a new connected data source, and choose CloudWatch Logs.

In the next step, I name my data source and choose to Create a new role, which must have the necessary permissions to execute actions on OpenSearch Service. You can see them in the Sample custom policy.

In the Set up OpenSearch step, configure an OpenSearch data connection for CloudWatch Logs by selecting Create a new collection. As part of setting up the CloudWatch Logs source, a new OpenSearch Service serverless collection and OpenSearch UI application are created to store the indexed views and provide a user interface to analyze your CloudWatch Logs data. I create a new collection, name it, and configure the OpenSearch application and workspace within the application. After setting the Data retention days, I choose Next and finish with Review and connect.

When the integration with CloudWatch is ready, I can choose between Explore logs without indexing data, which takes me to a querying interface in Discover, or Explore vended logs by creating a dashboard for Amazon VPC Flows, CloudTrail, and AWS WAF logs.

After I select Explore logs, OpenSearch UI takes me to Discover in the application workspace I created during the data source setup. In Discover, I select the data picker and choose View all available data to access my CloudWatch Logs data source and log groups.

After I select the log groups, I can analyze my CloudWatch logs using OpenSearch SQL and PPL directly in Discover, without having to switch between applications.

To create a dashboard, I return to the Connected data sources overview page on the console. From there, I select Create dashboard, which allows me to visually analyze my CloudWatch data without having to define queries or build visualizations, as I previously did in the CloudWatch console.

After the dashboard is created, I navigate to OpenSearch resources where I can see the newly created indexes being populated with data in my Collection. After I have the data, I can go to the dashboard with the data from the CloudWatch logs that I selected in the configuration, and as more data comes in, it will be displayed in near real-time on the OpenSearch dashboard.

With this zero-ETL integration you can ingest data directly into OpenSearch, using its powerful query capabilities and visualization features while maintaining data consistency and reducing operational overhead.

Integration Highlights
For CloudWatch customers:

Query capabilities – Streamline log investigation by using OpenSearch SQL and PPL queries directly within the CloudWatch Logs Insights console.
Analytics features – With a few clicks, access pre-built, out-of-the-box dashboards for vended logs, such as VPC, AWS WAF, and CloudTrail logs. These dashboards enable faster monitoring and troubleshooting through visualizations for analyzing flows over time, top talkers, megabytes, and packets transferred over time, without having to configure individual widgets or build specific queries.
Getting started for CloudWatch users – Configure integration from CloudWatch Logs to OpenSearch Service. For more information refer to the Amazon CloudWatch Logs query capabilities and Amazon CloudWatch Logs vended dashboard documentation.

For OpenSearch Service customers:

Zero-ETL integration – Access and analyze CloudWatch data directly from OpenSearch Service without building or maintaining ETL processes. This integration eliminates separate ingestion pipelines while reducing storage costs and operational overhead through simplified data management and zero data duplication.
Getting started for OpenSearch users – Create a data connection selecting CloudWatch as a data source from OpenSearch Service. For more information, refer to the Amazon OpenSearch Service Developer Guide.

Regional availability and pricing
This integration is now available in AWS Regions where Amazon OpenSearch Service direct query is available. For pricing details and free trial information, you can visit the Amazon CloudWatch Pricing and Amazon OpenSearch Service Pricing pages.

PS: Writing a blog post at AWS is always a team effort, even when you see only one name under the post title. In this case, I want to thank Joshua Bright, Ashok Swaminathan, Abeetha Bala, Calvin Weng, and Ronil Prasad for their generous help with screenshots, technical guidance, and sharing their expertise in both services, which made this integration overview possible and comprehensive.

— Eli

Scaling RISE with SAP data and AWS Glue

2024-11-29 Allison Quinn

Post Syndicated from Allison Quinn original https://aws.amazon.com/blogs/big-data/scaling-rise-with-sap-data-and-aws-glue/

Customers often want to augment and enrich SAP source data with other non-SAP source data. Such analytic use cases can be enabled by building a data warehouse or data lake. Customers can now use the AWS Glue SAP OData connector to extract data from SAP. The SAP OData connector supports both on-premises and cloud-hosted (native and SAP RISE) deployments. By using the AWS Glue OData connector for SAP, you can work seamlessly with your data on AWS Glue and Apache Spark in a distributed fashion for efficient processing. AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.

AWS Glue OData connector for SAP uses the SAP ODP framework and OData protocol for data extraction. This framework acts in a provider-subscriber model to enable data transfers between SAP systems and non-SAP data targets. The ODP framework supports full data extraction and change data capture through the Operational Delta Queues (ODQ) mechanism. As a source for data extraction for SAP, you can use SAP data extractors, ABAP CDS views, SAP BW, or BW/4 HANA sources, HANA information views in SAP ABAP sources, or any ODP-enabled data sources.

SAP source systems can hold historical data, and can receive constant updates. For this reason, it’s important to enable incremental processing of source changes. This blog post details how you can extract data from SAP and implement incremental data transfer from your SAP source using the SAP ODP OData framework with source delta tokens.

Solution overview

Example Corp wants to analyze the product data stored in their SAP source system. They want to understand their current product offering, in particular the number of products that they have in each of their material groups. This requires joining data from the SAP material master and material group data sources from their SAP system. The material master data is available for incremental extraction, while the material group is only available as a full load. These data sources should be combined and available to query for analysis.

Prerequisites

To complete the solution presented in the post, start by completing the following prerequisite steps:

Configure operational data provisioning (ODP) data sources for extraction in the SAP Gateway of your SAP system.
Create an Amazon Simple Storage Service (Amazon S3) bucket to store your SAP data.
In an AWS Glue Data Catalog, create a database called sapgluedatabase.

Create an AWS Identity and Access Management (IAM) role for the AWS Glue extract, transform, and load (ETL) job to use. The role must grant access to all resources used by the job, including Amazon S3 and AWS Secrets Manager. For the solution in this post, name the role GlueServiceRoleforSAP. Use the following policies:

AWS managed policies:
- AWSGlueServiceRole
- SecretsManagerReadWrite

Inline policy:

{
       "Version": "2012-10-17",
       "Statement": [
              {
                      "Sid": "VisualEditor0",
                      "Effect": "Allow",
                      "Action": [
                             "s3:PutObject",
                             "s3:GetObjectAcl",
                             "s3:GetObject",
                             "s3:GetObjectAttributes",
                             "s3:ListBucket",
                             "s3:DeleteObject",
                             "s3:PutObjectAcl"],
                      "Resource": [
                             "arn:aws:s3:::<S3-BUCKET-NAME>",
                             "arn:aws:s3:::<S3-BUCKET-NAME>/*"
                      ]
              }
       ]
}
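If you manage several buckets or environments, it can help to generate this inline policy rather than edit the JSON by hand. The following is a minimal sketch that builds the same document for a given bucket name; the function name and the example bucket name are placeholders introduced for illustration.

```python
import json

# Actions copied from the inline policy above.
S3_ACTIONS = [
    "s3:PutObject",
    "s3:GetObjectAcl",
    "s3:GetObject",
    "s3:GetObjectAttributes",
    "s3:ListBucket",
    "s3:DeleteObject",
    "s3:PutObjectAcl",
]


def build_s3_inline_policy(bucket_name: str) -> dict:
    """Build the inline policy for one bucket (bucket-level and object-level ARNs)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": S3_ACTIONS,
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


# "example-sap-data-bucket" is a placeholder, not a real bucket.
policy = build_s3_inline_policy("example-sap-data-bucket")
print(json.dumps(policy, indent=2))
```

Both resource entries are needed: `s3:ListBucket` applies to the bucket ARN itself, while the object actions apply to the `/*` ARN.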

Create the AWS Glue connection for SAP

The SAP connector supports both CUSTOM (this is SAP BASIC authentication) and OAUTH authentication methods. For this example, you will be connecting with BASIC authentication.

Use the AWS Management Console for AWS Secrets Manager to create a secret called ODataGlueSecret for your SAP source. Details in AWS Secrets Manager should include the elements in the following code. You will need to enter your SAP system username in place of <your SAP username> and its password in place of <your SAP username password>.
```
{
   "basicAuthUsername": "<your SAP username>",
   "basicAuthPassword": "<your SAP username password>",
   "basicAuthDisableSSO": "True",
   "customAuthenticationType": "CustomBasicAuth"
}
```
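If you prefer to prepare the secret value programmatically, a small helper such as the following sketch can assemble the JSON string with the four keys shown above. The helper name and the sample credentials are made up for illustration; the resulting string is what you would store as the secret value.

```python
import json


def build_sap_basic_auth_secret(username: str, password: str) -> str:
    """Return the secret value as a JSON string with the keys shown above."""
    if not username or not password:
        raise ValueError("username and password must both be non-empty")
    return json.dumps(
        {
            "basicAuthUsername": username,
            "basicAuthPassword": password,
            # Both of the following values are strings, as in the example above.
            "basicAuthDisableSSO": "True",
            "customAuthenticationType": "CustomBasicAuth",
        }
    )


# Placeholder credentials for illustration only.
secret_string = build_sap_basic_auth_secret("EXAMPLE_USER", "example-password")
print(sorted(json.loads(secret_string)))
```

Avoid hardcoding real credentials in scripts; read them from a prompt or another secure source before building the string.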
Create the AWS Glue connection GlueSAPOData for your SAP system by selecting the new SAP OData data source.
Configure the connection with the appropriate values for your SAP source.
1. Application host URL: The host must have the SSL certificates for the authentication and validation of your SAP host name.
2. Application service path: /sap/opu/odata/iwfnd/catalogservice;v=2;
3. Port number: Port number of your SAP source system.
4. Client number: Client number of your SAP source system.
5. Logon language: Logon language of your SAP source system.
In the Authentication section, select CUSTOM as the Authentication Type.
Select the AWS Secret created in the preceding steps: ODataGlueSecret.
In the Network Options section enter the VPC, subnet and security group used for the connection to your SAP system. For more information on connecting to your SAP system, see Configure a VPC for your ETL job.

Create an ETL job to ingest data from SAP

In the AWS Glue console, create a new Visual Editor AWS Glue job.

Go to the AWS Glue console.
In the navigation pane under ETL Jobs choose Visual ETL.
Choose Visual ETL to create a job in the Visual Editor.
For this post, edit the default name to be Material Master Job and choose Save.

On your Visual Editor canvas, select your SAP sources.

Choose the Visual tab, then choose the plus sign to open the Add nodes menu. Search for SAP and add the SAP OData Source.
Choose the node you just added and name it Material Master Attributes.
1. For SAP OData connection, select the GlueSAPOData connection.
2. Select the material attributes, service and entity set from your SAP source.
3. For Entity Name and Sub Entity Name, select SAP OData entity from your SAP source.
4. From the Fields, select Material, Created on, Material Group, Material Type, Old Matl number, GLUE_FETCH_SQ, DELTA_TOKEN and DML_STATUS.
5. Enter limit 100 in the filter section, to limit the data for design time.

Note that this service supports delta extraction, so Incremental transfer is the default selected option.

After the AWS Glue service role details have been chosen, the data preview is available. You can adjust the preview to include the three new available fields, which are:

glue_fetch_sq: This is a sequence field, generated from the epoch timestamp in the order the record was received, and is unique for each record. This can be used if you need to know or establish the order of changes in the source system.
delta_token: All records will have this field value blank, except for the last passed record, which will contain the value for the ODQ token to capture any changed records (CDC). This record is not a transactional record from the source and is only there for the purpose of passing the delta token value.
dml_status: This will show UPDATED for all newly inserted and updated records from the source and DELETED for records that have been deleted from source.

For delta enabled extraction, the last record passed will contain the value DELTA_TOKEN and the delta_token field will be filled as mentioned above.
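The following sketch shows one way these three fields could be used downstream: separate the token-carrying record from the data records, then apply the changes to a target in glue_fetch_sq order. It works on plain Python dictionaries rather than a Glue DynamicFrame, and the key field matnr, the sample rows, and the token value are assumptions made for this illustration.

```python
CONTROL_FIELDS = {"glue_fetch_sq", "delta_token", "dml_status"}


def split_delta_token(records):
    """Separate the token-carrying record from the data records."""
    token = None
    data = []
    for record in records:
        if record.get("delta_token"):
            token = record["delta_token"]  # not a business record; keep only the token
        else:
            data.append(record)
    return data, token


def apply_changes(target, records, key_field="matnr"):
    """Apply UPDATED and DELETED records to a dict keyed by key_field, in fetch order."""
    for record in sorted(records, key=lambda r: r["glue_fetch_sq"]):
        key = record[key_field]
        if record["dml_status"] == "DELETED":
            target.pop(key, None)
        else:  # "UPDATED" covers both inserts and updates
            target[key] = {
                name: value
                for name, value in record.items()
                if name not in CONTROL_FIELDS and name != key_field
            }
    return target


# Made-up change records; the last one only carries the delta token.
records = [
    {"matnr": "M2", "matkl": "G2", "glue_fetch_sq": 2, "delta_token": "", "dml_status": "UPDATED"},
    {"matnr": "M1", "matkl": "G9", "glue_fetch_sq": 1, "delta_token": "", "dml_status": "UPDATED"},
    {"matnr": "M2", "matkl": "G2", "glue_fetch_sq": 3, "delta_token": "", "dml_status": "DELETED"},
    {"matnr": "", "matkl": "", "glue_fetch_sq": 4, "delta_token": "example-token-001", "dml_status": ""},
]

data, token = split_delta_token(records)
target = apply_changes({"M1": {"matkl": "G1"}}, data)
print(token, target)
```

Sorting by glue_fetch_sq matters when the same key is changed more than once in a batch, as with M2 here, which is inserted and then deleted.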

Add another SAP ODATA source connection to your canvas, and name this node Material Group Text.
1. Select the material group service and entity set from your SAP source
2. For Entity Name and Sub Entity Name, select the SAP OData entity from your SAP source

Note that this service supports full extraction, so Full transfer is the default selected option. You can also preview this dataset.

When previewing the data, notice the language key. SAP passes all languages, so add a filter of SPRAS = 'E' to extract only English. Note that this uses the SAP internal value of the field.
Add a Change Schema transform node to the canvas after the Material Group Text node.
- Rename the material group field in target key to matkl2, so it is different than your first source.
- Under Drop, select spras, odq_changemode, odq_entitycntr, dml_status, delta_token and glue_fetch_sq.
Add a join transform to your canvas, bringing together both source datasets.
1. Ensure the node parents of both Material Master Attributes and Change Schema have been chosen
2. Select the Join type of Left join
3. Select the join conditions as the key fields from each source
  - Under Material Master Attributes, select matkl
  - Under Change Schema, select matkl2

You can preview the output to ensure the correct data is being returned. Now, you are ready to store the result.
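To check your expectations for the preview, the join and the per-group product count that Example Corp is after can be sketched locally on plain dictionaries. The field names matkl and matkl2 come from the steps above; matnr, group_text, and the sample rows are assumptions made for this illustration, and the sketch assumes one text row per material group (which the English-only filter should give you).

```python
from collections import Counter


def left_join(materials, group_texts):
    """Left join materials to group texts on matkl = matkl2."""
    texts_by_group = {row["matkl2"]: row for row in group_texts}
    no_match = {"matkl2": None, "group_text": None}
    joined = []
    for material in materials:
        right = texts_by_group.get(material["matkl"], no_match)
        joined.append({**material, **right})  # every left row is kept
    return joined


def products_per_group(joined):
    """Count products per material group text."""
    return Counter(row["group_text"] or "(no group text)" for row in joined)


# Made-up sample rows.
materials = [
    {"matnr": "M1", "matkl": "G1"},
    {"matnr": "M2", "matkl": "G1"},
    {"matnr": "M3", "matkl": "G2"},
    {"matnr": "M4", "matkl": "G3"},
]
group_texts = [
    {"matkl2": "G1", "group_text": "Raw materials"},
    {"matkl2": "G2", "group_text": "Finished goods"},
]

joined = left_join(materials, group_texts)
counts = products_per_group(joined)
print(dict(counts))
```

A left join keeps material M4 even though its group has no text row, which is why the join type matters if some material groups lack an English description.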

Add the S3 bucket target to your canvas.
1. Ensure the node parent is Join.
2. For format, select Parquet.
3. For S3 Target Location, browse to the S3 bucket you created in the prerequisites and add materialmaster/ to the S3 target location.
4. For the Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
5. For Database, select the name of the AWS Glue database created earlier sapgluedatabase.
6. For Table name, enter materialmaster.
Choose Save to save your job. Your job should look like the following figure.

Clone your ETL job and make it incremental

After your ETL job has been created, it’s ready to clone and include incremental data handling using delta tokens.

To do this, you will need to modify the job script directly. You will modify the script to add a statement that retrieves the last delta token (to be stored on the job tag) and adds the delta token value to the request (or execution of the job), which will enable the Delta Enabled SAP OData Service when retrieving the data on the next job run.

The first execution of the job will not have a delta token value on the tag; therefore, the call will be an initial run and the delta token will subsequently be stored in the tags for future executions.
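The initial-then-delta behavior can be sketched without any AWS resources. In the following illustration, a plain dictionary stands in for the job tags and a fake extract function stands in for the SAP source. This is not the sap_odata_state_management library; the option name DELTA_TOKEN, the function names, and the token values are all hypothetical.

```python
def run_job(tags, key, extract):
    """Simulate one job run: read the stored token, extract, persist the new token."""
    options = {"ENABLE_CDC": "true"}
    stored = tags.get(key)
    if stored:
        options["DELTA_TOKEN"] = stored  # hypothetical option name for this sketch
    records, new_token = extract(options)
    if new_token:
        tags[key] = new_token  # available to the next run
    mode = "delta" if stored else "initial"
    return mode, records


def fake_extract(options):
    """Hypothetical source: full data without a token, only changes with one."""
    if "DELTA_TOKEN" in options:
        return ["changed-row"], "token-2"
    return ["row-1", "row-2"], "token-1"


tags = {}  # stands in for the job tags
first = run_job(tags, "MaterialMasterAttributes", fake_extract)
second = run_job(tags, "MaterialMasterAttributes", fake_extract)
print(first)
print(second)
print(tags)
```

Because the token is only written after a successful extract, a run that fails before that point would repeat the same delta request on the next attempt.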

Go to the AWS Glue console.
In the navigation pane under ETL Jobs choose Visual ETL.
Select the Material Master Job, choose Actions and select Clone job.
Change the name of the job to Material Master Job Delta, then choose the Script tab.
You need to add an additional Python library that will take care of storing and retrieving the delta tokens for each job execution. To do this, navigate to the Job Details tab, scroll down and expand the Advanced Properties section. In the Python library path add the following path:
s3://aws-blogs-artifacts-public/artifacts/BDB-4789/sap_odata_state_management.zip

Now choose the Script tab and choose Edit script on the top right corner. Choose Confirm to confirm that your job will be script-only.

Apply the following changes to the script to enable the delta token.

7. Import the SAP OData state management library classes you added in step 5 above, by adding the following code to row 8.
```
from sap_odata_state_management.state_manager import StateManagerFactory, StateManagerType, StateType
```
The next few steps will retrieve and persist the delta token in the job tags so it can be accessed by the subsequent job execution. The delta token is added to the request back to the SAP source, so the incremental changes are extracted. If there is no token passed, the load will run as an initial load and the token will be persisted for the next run, which will then be a delta load.

To initialize the sap_odata_state_management library, extract the connection options into a variable and update them using the state manager. Do this by adding the following code to line 16 (after the job.init statement).

You can find the <key of MaterialMasterAttributes node> and the <entityName for Material Attribute> in the existing generated script under # Script generated for node Material Master Attributes. Be sure to replace with the appropriate values.

key = "<key of MaterialMasterAttributes node>"
state_manager = StateManagerFactory.create_manager(
    manager_type=StateManagerType.JOB_TAG, state_type=StateType.DELTA_TOKEN, options={"job_name": args['JOB_NAME'], "logger": glueContext.get_logger()}
)
options = {
    "connectionName": "GlueSAPOData",
    "entityName": "<entityName for Material Attribute>",
    "ENABLE_CDC": "true"
}
connector_options = state_manager.get_connector_options(key)
options.update(connector_options)

9. Comment out the existing script generated for node Material Master Attributes by adding a #, and add the following replacement snippet.

<key of MaterialMasterAttributes node> = glueContext.create_dynamic_frame.from_options(connection_type="sapodata", connection_options=options, transformation_ctx="<key of MaterialMasterAttributes node>")

To extract the delta token from the dynamic frame and persist it in the job tags, add the following code snippet just above the last line in your script (before job.commit())
```
state_manager.update_state(key, <key of MaterialMasterAttributes node>.toDF())
```

This is what your final script should look like:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from sap_odata_state_management.state_manager import StateManagerFactory, StateManagerType, StateType

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

key = "MaterialMasterAttributes_node1730873953236"
state_manager = StateManagerFactory.create_manager(
    manager_type=StateManagerType.JOB_TAG, state_type=StateType.DELTA_TOKEN, options={"job_name": args['JOB_NAME'], "logger": glueContext.get_logger()}
)
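options = {
    "connectionName": "GlueSAPOData",
    "entityName": "/sap/opu/odata/sap/ZMATERIAL_ATTR_SRV/EntityOf0MATERIAL_ATTR",
    "ENABLE_CDC": "true"
}
# Merge in the persisted delta token (if any) so that the next run is a delta load
connector_options = state_manager.get_connector_options(key)
options.update(connector_options)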

# Script generated for node Material Group Text
MaterialGroupText_node1730874412841 = glueContext.create_dynamic_frame.from_options(connection_type="sapodata", connection_options={"ENABLE_CDC": "false", "connectionName": "GlueSAPOData", "FILTER_PREDICATE": "SPRAS = 'E'", "ENTITY_NAME": "/sap/opu/odata/sap/ZMATL_GROUP_SRV/EntityOf0MATL_GROUP_TEXT"}, transformation_ctx="MaterialGroupText_node1730874412841")

# Script generated for node Material Master Attributes
#MaterialMasterAttributes_node1730873953236 = glueContext.create_dynamic_frame.from_options(connection_type="sapodata", connection_options={"ENABLE_CDC": "true", "connectionName": "GlueSAPOdata", "FILTER_PREDICATE": "limit 100", "SELECTED_FIELDS": "MATNR,MTART,MATKL,BISMT,ERSDA,DML_STATUS,DELTA_TOKEN,GLUE_FETCH_SQ", "ENTITY_NAME": "/sap/opu/odata/sap/ZMATERIAL_ATTR_SRV/EntityOf0MATERIAL_ATTR"}, transformation_ctx="MaterialMasterAttributes_node1732755261264")
MaterialMasterAttributes_node1730873953236 = glueContext.create_dynamic_frame.from_options(connection_type="sapodata", connection_options=options, transformation_ctx="MaterialMasterAttributes_node1730873953236")

# Script generated for node Change Schema
ChangeSchema_node1730875214894 = ApplyMapping.apply(frame=MaterialGroupText_node1730874412841, mappings=[("matkl", "string", "matkl2", "string"), ("txtsh", "string", "txtsh", "string")], transformation_ctx="ChangeSchema_node1730875214894")

# Script generated for node Join
MaterialMasterAttributes_node1730873953236DF = MaterialMasterAttributes_node1730873953236.toDF()
ChangeSchema_node1730875214894DF = ChangeSchema_node1730875214894.toDF()
Join_node1730874996674 = DynamicFrame.fromDF(MaterialMasterAttributes_node1730873953236DF.join(ChangeSchema_node1730875214894DF, (MaterialMasterAttributes_node1730873953236DF['matkl'] == ChangeSchema_node1730875214894DF['matkl2']), "left"), glueContext, "Join_node1730874996674")

# Script generated for node Amazon S3
AmazonS3_node1730875848117 = glueContext.write_dynamic_frame.from_options(frame=Join_node1730874996674, connection_type="s3", format="json", connection_options={"path": "s3://sapglueodatabucket", "compression": "snappy", "partitionKeys": []}, transformation_ctx="AmazonS3_node1730875848117")
state_manager.update_state(key, MaterialMasterAttributes_node1730873953236.toDF())
job.commit()

Choose Save to save your changes.
Choose Run to run your job. Note that there are currently no tags in your job details.
Wait for your job run to be successfully completed. You can see the status on the Runs tab.
After your job run is complete, you will notice on the Job Details tab that a tag has been added. The next job run will read this token and run a delta load.
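
The initial-load versus delta-load behavior described above can be illustrated with a small, self-contained simulation. This is only a sketch of the idea: the `job_tags` dictionary, `fetch_from_source`, and `run_job` are hypothetical stand-ins, and the code does not reproduce the internals of the sap_odata_state_management library or of the SAP OData connector.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the AWS Glue job tags where the token is persisted.
job_tags: Dict[str, str] = {}


def fetch_from_source(delta_token: Optional[str]) -> Tuple[List[dict], str]:
    """Pretend SAP source: full extract without a token, changes only with one."""
    if delta_token is None:
        rows = [{"MATNR": "M-001"}, {"MATNR": "M-002"}]  # initial load: all records
        next_token = "token-1"
    else:
        rows = [{"MATNR": "M-002"}]  # delta load: only records changed since the token
        next_token = delta_token + "+"
    return rows, next_token


def run_job(key: str) -> Tuple[str, int]:
    """Read the persisted token, extract, then persist the new token for the next run."""
    token = job_tags.get(key)  # None on the very first run
    load_type = "initial" if token is None else "delta"
    rows, next_token = fetch_from_source(token)
    job_tags[key] = next_token  # the next run picks this up and runs as a delta load
    return load_type, len(rows)


first = run_job("MaterialMasterAttributes")
second = run_job("MaterialMasterAttributes")
print(first, second)
```

The first call finds no token and behaves like an initial load. The second call finds the token stored by the first and extracts only the changes.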

Query your SAP data source data

The AWS Glue job run has created an entry in the Data Catalog enabling you to query the data immediately.

Go to the Amazon Athena console.
Choose Launch Query Editor.
Make sure you have an appropriate workgroup assigned, or create a workgroup if required.
Select the sapgluedatabase and run a query (such as the following) to start analyzing your data.
```
select matkl, txtsh, count(*)
from materialmaster
group by 1, 2
order by 1, 2;
```

Clean up

To avoid incurring charges, clean up the resources used in this post from your AWS account, including the AWS Glue jobs, SAP OData connection, Glue Data Catalog entry, Secrets Manager secret, IAM role, the contents of the S3 bucket, and the S3 bucket.

Conclusion

In this post, we showed you how to create a serverless incremental data load process for multiple SAP data sources. The approach used AWS Glue to incrementally load the data from an SAP source using SAP ODP delta tokens and then load the data into Amazon S3.

The serverless nature of AWS Glue means that there is no infrastructure to manage, and you pay only for the resources consumed while your jobs are running (plus storage cost for outputs). As organizations become increasingly data driven, this SAP connector can provide an efficient, cost-effective, performant, and secure way to include SAP source data in your big data and analytics outcomes. For more information, see AWS Glue.

About the authors

Allison Quinn is a Sr. ANZ Analytics Specialist Solutions Architect for Data and AI based in Melbourne, Australia, working closely with Financial Service customers in the region. Allison worked over 15 years with SAP products before concentrating her Analytics technical specialty on AWS native services. She’s very passionate about all things data, and about democratizing data so that customers of all types can drive business benefit.

Pavol is an Innovation Solution Architect at AWS, specializing in SAP cloud adoption across EMEA. With over 20 years of experience, he helps global customers migrate and optimize SAP systems on AWS. Pavol develops tailored strategies to transition SAP environments to the cloud, leveraging AWS’s agility, resiliency, and performance. He assists clients in modernizing their SAP landscapes using AWS’s AI/ML, data analytics, and application services to enhance intelligence, automation, and performance.

Partha Pratim Sanyal is a Software Development Engineer with AWS Glue in Vancouver, Canada, specializing in Data Integration, Analytics, and Connectivity. With extensive backend development expertise, he is dedicated to crafting impactful, customer-centric solutions. His work focuses on building features that empower users to effortlessly analyze and understand their data. Partha’s commitment to addressing complex user needs drives him to create intuitive and value-driven experiences that elevate data accessibility and insights for customers.

Diego is an experienced Enterprise Solutions Architect with over 20 years’ experience across SAP technologies, specializing in SAP innovation and data and analytics. He has worked both as partner and as a customer, giving him a complete perspective on what it takes to sell, implement, and run systems and organizations. He is passionate about technology and innovation, focusing on customer outcomes and delivering business value.

Luis Alberto Herrera Gomez is a Software Development Engineer with AWS Glue in Vancouver, specializing in backend engineering, microservices, and cloud computing. With 7-8 years of experience, including roles as a backend and full-stack developer for multiple startups before joining Amazon and AWS, Luis focuses on developing scalable and efficient cloud-based applications. His expertise in AWS technologies enables him to design high-performance systems that handle complex data processing tasks. Luis is passionate about leveraging cloud computing to solve challenging business problems.

Develop a business chargeback model within your organization using Amazon Redshift multi-warehouse writes

2024-11-27 Raks Khare

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/develop-a-business-chargeback-model-within-your-organization-using-amazon-redshift-multi-warehouse-writes/

Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. Thousands of customers use Amazon Redshift data sharing to enable instant, granular, and fast data access shared across Redshift provisioned clusters and serverless workgroups. This allows you to scale your read workloads to thousands of concurrent users without having to move or copy data.

Now, we are announcing general availability (GA) of Amazon Redshift multi-data warehouse writes through data sharing. This new capability allows you to scale your write workloads and achieve better performance for extract, transform, and load (ETL) workloads by using different warehouses of different types and sizes based on your workload needs. You can make your ETL job runs more predictable by distributing them across different data warehouses with just a few clicks. Other benefits include the ability to monitor and control costs for each data warehouse, and enabling data collaboration across different teams because you can write to each other’s databases. The data is live and available across all warehouses as soon as it’s committed, even when it’s written across accounts or Regions. To learn more about the reasons for using multiple warehouses to write to the same databases, refer to this previous blog on multi-warehouse writes through data sharing.

As organizations continue to migrate workloads to AWS, they are also looking for mechanisms to manage costs efficiently. A good understanding of the cost of running your business workload, and the value that business workload brings to the organization, allows you to have confidence in the efficiency of your financial management strategy in AWS.

In this post, we demonstrate how you can develop a business chargeback model by adopting the multi-warehouse architecture of Amazon Redshift using data sharing. You can now attribute cost to different business units and at the same time gain more insights to drive efficient spending.

Use case

In this use case, we consider a fictional retail company (AnyCompany) that operates several Redshift provisioned clusters and serverless workgroups, each tailored to a particular business unit, such as the sales, marketing, and development teams. AnyCompany is a large enterprise organization that previously migrated large volumes of enterprise workloads into Amazon Redshift, and is now in the process of breaking down data silos by migrating business-owned workloads into Amazon Redshift. AnyCompany has a highly technical community of business users who want to keep autonomy over the pipelines that enrich the enterprise data with their business-centric data. The enterprise IT team wants to eliminate data silos and the resulting data duplication, so despite this segregation in workloads, they mandate that all business units access a shared centralized database, which also helps the centralized enterprise IT team with data governance. In this intended architecture, each team is responsible for data ingestion and transformation before writing to the same or different tables residing in the central database. To facilitate this, teams use their own Redshift workgroup or cluster for computation, enabling separate chargeback to the respective cost centers.

In the following sections, we walk you through how to use multi-warehouse writes to ingest data to the same databases using data sharing and develop an end-to-end business chargeback model. This chargeback model can help you attribute cost to individual business units, have higher visibility on your spending, and implement more cost control and optimizations.

Solution overview

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

Steps 1a, 1b, and 1c – In this section, we isolate ingestion from various sources by using separate Amazon Redshift Serverless workgroups and a Redshift provisioned cluster.
Steps 2a, 2b, and 2c – All producers write data to the primary ETL storage in their own respective schemas and tables. For example, the Sales workgroup writes data into the Sales schema, and the Marketing workgroup writes data into the Marketing schema, both belonging to the storage of the ETL provisioned cluster. They can also apply transformations at the schema object level depending on their business requirements.
Step 2d – Both the Redshift Serverless producer workgroups and the Redshift producer cluster can insert and update data into a common table, ETL_Audit, residing in the Audit schema in the primary ETL storage.
Steps 3a, 3b, and 3c – The same Redshift Serverless workgroups and provisioned cluster used for ingestion are also used for consumption and are maintained by different business teams and billed separately.

The high-level steps to implement this architecture are as follows:

Set up the primary ETL cluster (producer)
- Create the datashare
- Grant permissions on schemas and objects
- Grant permissions to the Sales and Marketing consumer namespaces
Set up the Sales warehouse (consumer)
- Create a sales database from the datashare
- Start writing to the etl and sales datashare
Set up the Marketing warehouse (consumer)
- Create a marketing database from the datashare
- Start writing to the etl and marketing datashare
Calculate the cost for chargeback to sales and marketing business units

Prerequisites

To follow along with this post, you should have the following prerequisites:

Three Redshift warehouses of desired sizes, with one as the provisioned cluster and another two as serverless workgroups in the same account and AWS Region.
Access to a superuser in all three warehouses.
An AWS Identity and Access Management (IAM) role that is able to ingest data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift.
For cross-account only, you need access to an IAM user or role that is allowed to authorize datashares. For the IAM policy, refer to Sharing datashares.

Refer to Getting started with multi-warehouse for the most up-to-date information.

Set up the primary ETL cluster (producer)

In this section, we show how to set up the primary ETL producer cluster to store your data.

Connect to the producer

Complete the following steps to connect to the producer:

On the Amazon Redshift console, choose Query editor v2 in the navigation pane.

In the query editor v2, you can see all the warehouses you have access to in the left pane. You can expand them to see their databases.

Connect to your primary ETL warehouse using a superuser.
Run the following command to create the prod database:

CREATE DATABASE prod;

Create the database objects to share

Complete the following steps to create your database objects to share:

After you create the prod database, switch your database connection to prod.

You may need to refresh your page to be able to see it.

Run the following commands to create the three schemas you intend to share:

CREATE SCHEMA prod.etl;
CREATE SCHEMA prod.sales;
CREATE SCHEMA prod.marketing;

Create the tables in the ETL schema to share with the Sales and Marketing consumer warehouses. These are standard DDL statements coming from the AWS Labs TPCDS DDL file with modified table names.

CREATE TABLE prod.etl.etl_audit_logs (
    id bigint identity(0, 1) not null,
    job_name varchar(100),
    creation_date timestamp,
    last_execution_date timestamp
);

create table prod.etl.inventory (
    inv_date_sk int4 not null,
    inv_item_sk int4 not null,
    inv_warehouse_sk int4 not null,
    inv_quantity_on_hand int4,
    primary key (inv_date_sk, inv_item_sk, inv_warehouse_sk)
) distkey(inv_item_sk) sortkey(inv_date_sk);

Create the tables in the SALES schema to share with the Sales consumer warehouse:

create table prod.sales.store_sales (
    ss_sold_date_sk int4,
    ss_sold_time_sk int4,
    ss_item_sk int4 not null,
    ss_customer_sk int4,
    ss_cdemo_sk int4,
    ss_hdemo_sk int4,
    ss_addr_sk int4,
    ss_store_sk int4,
    ss_promo_sk int4,
    ss_ticket_number int8 not null,
    ss_quantity int4,
    ss_wholesale_cost numeric(7, 2),
    ss_list_price numeric(7, 2),
    ss_sales_price numeric(7, 2),
    ss_ext_discount_amt numeric(7, 2),
    ss_ext_sales_price numeric(7, 2),
    ss_ext_wholesale_cost numeric(7, 2),
    ss_ext_list_price numeric(7, 2),
    ss_ext_tax numeric(7, 2),
    ss_coupon_amt numeric(7, 2),
    ss_net_paid numeric(7, 2),
    ss_net_paid_inc_tax numeric(7, 2),
    ss_net_profit numeric(7, 2),
    primary key (ss_item_sk, ss_ticket_number)
) distkey(ss_item_sk) sortkey(ss_sold_date_sk);

create table prod.sales.web_sales (
    ws_sold_date_sk int4,
    ws_sold_time_sk int4,
    ws_ship_date_sk int4,
    ws_item_sk int4 not null,
    ws_bill_customer_sk int4,
    ws_bill_cdemo_sk int4,
    ws_bill_hdemo_sk int4,
    ws_bill_addr_sk int4,
    ws_ship_customer_sk int4,
    ws_ship_cdemo_sk int4,
    ws_ship_hdemo_sk int4,
    ws_ship_addr_sk int4,
    ws_web_page_sk int4,
    ws_web_site_sk int4,
    ws_ship_mode_sk int4,
    ws_warehouse_sk int4,
    ws_promo_sk int4,
    ws_order_number int8 not null,
    ws_quantity int4,
    ws_wholesale_cost numeric(7, 2),
    ws_list_price numeric(7, 2),
    ws_sales_price numeric(7, 2),
    ws_ext_discount_amt numeric(7, 2),
    ws_ext_sales_price numeric(7, 2),
    ws_ext_wholesale_cost numeric(7, 2),
    ws_ext_list_price numeric(7, 2),
    ws_ext_tax numeric(7, 2),
    ws_coupon_amt numeric(7, 2),
    ws_ext_ship_cost numeric(7, 2),
    ws_net_paid numeric(7, 2),
    ws_net_paid_inc_tax numeric(7, 2),
    ws_net_paid_inc_ship numeric(7, 2),
    ws_net_paid_inc_ship_tax numeric(7, 2),
    ws_net_profit numeric(7, 2),
    primary key (ws_item_sk, ws_order_number)
) distkey(ws_order_number) sortkey(ws_sold_date_sk);

Create the tables in the MARKETING schema to share with the Marketing consumer warehouse:

create table prod.marketing.customer (
    c_customer_sk int4 not null,
    c_customer_id char(16) not null,
    c_current_cdemo_sk int4,
    c_current_hdemo_sk int4,
    c_current_addr_sk int4,
    c_first_shipto_date_sk int4,
    c_first_sales_date_sk int4,
    c_salutation char(10),
    c_first_name char(20),
    c_last_name char(30),
    c_preferred_cust_flag char(1),
    c_birth_day int4,
    c_birth_month int4,
    c_birth_year int4,
    c_birth_country varchar(20),
    c_login char(13),
    c_email_address char(50),
    c_last_review_date_sk int4,
    primary key (c_customer_sk)
) distkey(c_customer_sk);

create table prod.marketing.promotion (
    p_promo_sk integer not null,
    p_promo_id char(16) not null,
    p_start_date_sk integer,
    p_end_date_sk integer,
    p_item_sk integer,
    p_cost decimal(15, 2),
    p_response_target integer,
    p_promo_name char(50),
    p_channel_dmail char(1),
    p_channel_email char(1),
    p_channel_catalog char(1),
    p_channel_tv char(1),
    p_channel_radio char(1),
    p_channel_press char(1),
    p_channel_event char(1),
    p_channel_demo char(1),
    p_channel_details varchar(100),
    p_purpose char(15),
    p_discount_active char(1),
    primary key (p_promo_sk)
) diststyle all;

Create the datashare

Create datashares for the Sales and Marketing business units with the following commands:

CREATE DATASHARE sales_ds;
CREATE DATASHARE marketing_ds;

Grant permissions on schemas to the datashare

To add objects with permissions to the datashare, use the grant syntax, specifying the datashare you want to grant the permissions to.

Allow the datashare consumers (Sales and Marketing business units) to use objects added to the ETL schema:

GRANT USAGE ON SCHEMA prod.etl TO DATASHARE sales_ds;
GRANT USAGE ON SCHEMA prod.etl TO DATASHARE marketing_ds;

Allow the datashare consumer (Sales business unit) to use objects added to the SALES schema:

GRANT USAGE ON SCHEMA prod.sales TO DATASHARE sales_ds;

Allow the datashare consumer (Marketing business unit) to use objects added to the MARKETING schema:

GRANT USAGE ON SCHEMA prod.marketing TO DATASHARE marketing_ds;

Grant permissions on tables to the datashare

Now you can grant access to tables to the datashare using the grant syntax, specifying the permissions and the datashare.

Grant select and insert scoped privileges on the etl_audit_logs table to the Sales and Marketing datashares:

GRANT SELECT ON TABLE prod.etl.etl_audit_logs TO DATASHARE sales_ds;
GRANT SELECT ON TABLE prod.etl.etl_audit_logs TO DATASHARE marketing_ds;
GRANT INSERT ON TABLE prod.etl.etl_audit_logs TO DATASHARE sales_ds;
GRANT INSERT ON TABLE prod.etl.etl_audit_logs TO DATASHARE marketing_ds;

Grant all privileges on all tables in the SALES schema to the Sales datashare:

GRANT ALL ON ALL TABLES IN SCHEMA prod.sales TO DATASHARE sales_ds;

Grant all privileges on all tables in the MARKETING schema to the Marketing datashare:

GRANT ALL ON ALL TABLES IN SCHEMA prod.marketing TO DATASHARE marketing_ds;

You can optionally choose to include new objects to be automatically shared. The following code will automatically add new objects in the etl, sales, and marketing schemas to the two datashares:

ALTER DATASHARE sales_ds SET INCLUDENEW = TRUE FOR SCHEMA sales;
ALTER DATASHARE sales_ds SET INCLUDENEW = TRUE FOR SCHEMA etl;
ALTER DATASHARE marketing_ds SET INCLUDENEW = TRUE FOR SCHEMA marketing;
ALTER DATASHARE marketing_ds SET INCLUDENEW = TRUE FOR SCHEMA etl;

Grant permissions to the Sales and Marketing namespaces

You can grant permissions to the Sales and Marketing namespaces by specifying the namespace IDs. There are two ways to find namespace IDs:

On the Redshift Serverless console, find the namespace ID on the namespace details page
From the Redshift query editor v2, run select current_namespace; on both consumers

You can then grant access to the other namespace with the following command (change the consumer namespace to the namespace UID of your own Sales and Marketing warehouse):

-- Sales Redshift Serverless namespace
GRANT USAGE ON DATASHARE sales_ds TO namespace '<sales namespace>';

-- Marketing Redshift Serverless namespace
GRANT USAGE ON DATASHARE marketing_ds TO namespace '<marketing namespace>';
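
The producer-side setup follows the same pattern for each business unit, so the statements can be generated rather than typed by hand. The following is a minimal sketch, assuming the schema for each unit is named after the unit and the datashare is named `<unit>_ds`, as in this post. The helper `datashare_statements` is hypothetical and only builds SQL strings. It does not connect to Amazon Redshift or validate identifiers, so it should not be fed untrusted input.

```python
from typing import List


def datashare_statements(
    unit: str,
    consumer_namespace: str,
    database: str = "prod",
    shared_schema: str = "etl",
    audit_table: str = "etl_audit_logs",
) -> List[str]:
    """Build the producer-side SQL for one business unit's datashare."""
    share = f"{unit}_ds"
    return [
        f"CREATE DATASHARE {share};",
        # Schema-level usage: the shared ETL schema plus the unit's own schema
        f"GRANT USAGE ON SCHEMA {database}.{shared_schema} TO DATASHARE {share};",
        f"GRANT USAGE ON SCHEMA {database}.{unit} TO DATASHARE {share};",
        # Scoped privileges on the common audit table
        f"GRANT SELECT ON TABLE {database}.{shared_schema}.{audit_table} TO DATASHARE {share};",
        f"GRANT INSERT ON TABLE {database}.{shared_schema}.{audit_table} TO DATASHARE {share};",
        # Full privileges on the unit's own tables
        f"GRANT ALL ON ALL TABLES IN SCHEMA {database}.{unit} TO DATASHARE {share};",
        # Finally, expose the datashare to the consumer namespace
        f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{consumer_namespace}';",
    ]


# The namespace values are placeholders, exactly as in the statements above.
sales_sql = datashare_statements("sales", "<sales namespace>")
marketing_sql = datashare_statements("marketing", "<marketing namespace>")
print("\n".join(sales_sql + marketing_sql))
```

Adding another business unit then amounts to one more call with the new unit name and its consumer namespace ID.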

Set up and run an ETL job in the ETL producer

Complete the following steps to set up and run an ETL job:

Create a stored procedure to perform the following steps:
- Copy data from the S3 bucket to the inventory table in the ETL
- Insert an audit record in the etl_audit_logs table in the ETL

CREATE OR REPLACE PROCEDURE load_inventory() 
LANGUAGE plpgsql 
AS $$ 
BEGIN 
    COPY etl.inventory
    FROM 's3://redshift-downloads/TPC-DS/2.13/1TB/inventory/inventory_1_25.dat.gz' 
    iam_role default gzip delimiter '|' EMPTYASNULL region 'us-east-1';

    INSERT INTO etl.etl_audit_logs (job_name, creation_date, last_execution_date)
    values ('etl copy job', sysdate, sysdate);

END;
$$;

Run the stored procedure and validate data in the ETL logging table:

CALL load_inventory();

SELECT * from etl.etl_audit_logs order by last_execution_date desc;

Set up the Sales warehouse (consumer)

At this point, you’re ready to set up your Sales consumer warehouse to start writing data to the shared objects in the ETL producer namespace.

Create a database from the datashare

Complete the following steps to create your database:

In the query editor v2, switch to the Sales warehouse.
Run the command show datashares; to see etl and sales datashares as well as the datashare producer’s namespace.
Use that namespace to create a database from the datashare, as shown in the following code:

CREATE DATABASE sales_db WITH PERMISSIONS FROM DATASHARE sales_ds OF NAMESPACE '<<producer-namespace>>';

Specifying with permissions allows you to grant granular permissions to individual database users and roles. Without this, if you grant usage permissions on the datashare database, users and roles get all permissions on all objects within the datashare database.

Start writing to the datashare database

In this section, we show you how to write to the datashare database using the use <database_name> command and using three-part notation: <database_name>.<schema_name>.<table_name>.

Let’s try the use command method first. Run the following command:

use sales_db;

Ingest data into the datashare tables

Complete the following steps to ingest the data:

Copy the TPC-DS data from the AWS Labs public S3 bucket into the tables in the producer’s sales schema:

copy sales.store_sales from 's3://redshift-downloads/TPC-DS/2.13/3TB/store_sales/store_sales_9_4293.dat.gz' iam_role default gzip delimiter '|' EMPTYASNULL region 'us-east-1';

copy sales.web_sales from 's3://redshift-downloads/TPC-DS/2.13/3TB/web_sales/web_sales_9_1630.dat.gz' iam_role default gzip delimiter '|' EMPTYASNULL region 'us-east-1';

Insert an entry in the etl_audit_logs table in the producer’s etl schema. To insert the data, let’s try three-part notation this time:

INSERT INTO sales_db.etl.etl_audit_logs (job_name, creation_date, last_execution_date)
  values ('sales copy job', sysdate, sysdate);

Set up the Marketing warehouse (consumer)

Now, you’re ready to set up your Marketing consumer warehouse to start writing data to the shared objects in the ETL producer namespace. The following steps are similar to the ones previously completed while setting up the Sales warehouse consumer.

Create a database from the datashare

Complete the following steps to create your database:

In the query editor v2, switch to the Marketing warehouse.
Run the command show datashares; to see the etl and marketing datashares as well as the datashare producer’s namespace.
Use that namespace to create a database from the datashare, as shown in the following code:

CREATE DATABASE marketing_db WITH PERMISSIONS FROM DATASHARE marketing_ds OF NAMESPACE '<<producer-namespace>>';

Start writing to the datashare database

In this section, we show you how to write to the datashare database by calling a stored procedure.

Set up and run an ETL job in the ETL producer

Complete the following steps to set up and run an ETL job:

Create a stored procedure to perform the following steps:
1. Copy data from the S3 bucket to the customer and promotion tables in the MARKETING schema of the producer’s namespace.
2. Insert an audit record in the etl_audit_logs table in the ETL schema of the producer’s namespace.

CREATE OR REPLACE PROCEDURE load_marketing_data() 
LANGUAGE plpgsql 
AS $$ 
BEGIN 
    copy marketing_db.marketing.customer
    from 's3://redshift-downloads/TPC-DS/2.13/3TB/customer/' 
    iam_role default gzip delimiter '|' EMPTYASNULL region 'us-east-1';

    copy marketing_db.marketing.promotion
    from 's3://redshift-downloads/TPC-DS/2.13/3TB/promotion/' 
    iam_role default gzip delimiter '|' EMPTYASNULL region 'us-east-1';

    INSERT INTO marketing_db.etl.etl_audit_logs (job_name, creation_date, last_execution_date)
    values('marketing copy job', sysdate, sysdate);
END;
$$;

Run the stored procedure:

CALL load_marketing_data();

At this point, you’ve completed ingesting the data to the primary ETL namespace. You can query the tables in the etl, sales, and marketing schemas from both the ETL producer warehouse and Sales and Marketing consumer warehouses and see the same data.

Calculate chargeback to business units

Because the business units’ specific workloads have been isolated to dedicated consumers, you can now attribute the cost based on compute capacity utilization. The compute capacity in Redshift Serverless is measured in Redshift Processing Units (RPUs) and metered for the workloads that you run in RPU-seconds on a per-second basis. A Redshift administrator can use the SYS_SERVERLESS_USAGE view on individual consumer workgroups to view the details of Redshift Serverless usage of resources and related cost.

For example, to get the total charges for RPU hours used for a time interval, run the following query on the Sales and Marketing business units’ respective consumer workgroups:

select
    trunc(start_time) "Day",
    (sum(charged_seconds) / 3600 :: double precision) * < Price for 1 RPU > as cost_incurred
from
    sys_serverless_usage
group by 1
order by 1;
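
As a worked example of the arithmetic in this query, the following sketch converts charged RPU-seconds into a cost per business unit. The usage figures and the price per RPU-hour are made-up placeholders, not actual Amazon Redshift Serverless pricing, and the `chargeback` function is a hypothetical helper. In practice, the inputs would come from running the query on each consumer workgroup.

```python
from typing import Dict

SECONDS_PER_HOUR = 3600


def chargeback(charged_seconds: Dict[str, float], price_per_rpu_hour: float) -> Dict[str, float]:
    """Convert charged RPU-seconds per business unit into a cost per business unit."""
    return {
        unit: seconds / SECONDS_PER_HOUR * price_per_rpu_hour
        for unit, seconds in charged_seconds.items()
    }


# Hypothetical sum(charged_seconds) values read from each consumer workgroup
usage = {"sales": 7200, "marketing": 1800}

# Placeholder price; substitute the RPU-hour price for your Region
costs = chargeback(usage, price_per_rpu_hour=0.5)
print(costs)
```

With these placeholder inputs, Sales used 2 RPU-hours and Marketing used 0.5 RPU-hours, so the costs come to 1.0 and 0.25 in whatever currency the price is quoted in.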

Clean up

When you’re done, remove any resources that you no longer need to avoid ongoing charges:

Delete the Redshift provisioned cluster.
Delete Redshift serverless workgroups and namespaces.

Conclusion

In this post, we showed you how you can isolate business units’ specific workloads to multiple consumer warehouses writing the data to the same producer database. This solution has the following benefits:

Straightforward cost attribution and chargeback to business
Ability to use provisioned clusters and serverless workgroups of different sizes to write to the same databases
Ability to write across accounts and Regions
Data is live and available to all warehouses as soon as it’s committed
Writes work even if the producer warehouse (the warehouse that owns the database) is paused

You can engage an Amazon Redshift specialist to answer questions, and discuss how we can further help your organization.

About the authors

Raks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Saurav Das is part of the Amazon Redshift Product Management team. He has more than 16 years of experience in working with relational databases technologies and data protection. He has a deep interest in solving customer challenges centered around high availability and disaster recovery.