Tag Archives: Amazon Redshift

Geospatial data lakes with Amazon Redshift

Post Syndicated from Jeremy Spell original https://aws.amazon.com/blogs/big-data/geospatial-data-lakes-with-amazon-redshift/

Data lake architectures help organizations offload data from premium storage systems without losing the ability to query and analyze the data. This architecture can be useful for geospatial data, where builders might have terabytes of infrequently accessed data in their databases that they want to cost-effectively maintain. However, this requires for their data lake query engine to support geographic information systems (GIS) data types and functions.

Amazon Redshift supports querying spatial data, including the GEOMETRY and GEOGRAPHY data types and functions that are used in querying GIS systems. Additionally, Amazon Redshift lets you query geospatial data both in your data lakes on Amazon S3 and your Redshift data warehouse, giving you the choice of how you can access your data. Additionally, AWS Lake Formation and support for AWS Identity and Access Management (IAM) in Esri’s ArcGIS Pro gives you a way to securely bridge data between your geospatial data lakes and map visualization tools. You can set up, manage, and secure geospatial data lakes in the cloud with a few clicks.

In this post, we walk through how to set up a geospatial data lake using Lake Formation and query the data with ArcGIS Pro using Amazon Redshift Serverless.

Solution overview

In our example, a county public health department has used Lake Formation to secure their data lake that contains public health information (PHI) data. Epidemiologists within the county want to create a map for the clinics providing vaccination for their communities. The county’s GIS analysts need access to the data lake to create the required maps without being able to access the PHI data.

This solution uses Lake Formation tags to allow column-level access in the database to the public information that includes the clinic names, addresses, zip codes, and longitude/latitude coordinates without allowing access to the PHI data within the same tables. We use Redshift Serverless and Amazon Redshift Spectrum to access this data from ArcGIS Pro, a GIS mapping software from Esri, an AWS Partner.

The following diagram shows the architecture for this solution.

End-to-end architecture showing ArcGIS Pro data integration with AWS analytics services through Redshift connector

The following is a sample schema for this post.

Description Column Name Geoproperty Tag
Patient ID patient_id No
Clinic ID clinic_id Yes
Address of Clinic clinic_address Yes
Clinic Zip Code clinic_zip Yes
Clinic City clinic_city Yes
First Name Patient first_name No
Last Name Patient last_name No
Patient Address patient_address No
Patient Zip Code patient_zip No
Vaccination Type vaccination_type No
Latitude of Clinic clinic_lat Yes
Longitude of Clinic clinic_long Yes

In the following sections, we walk through the steps to set up the solution:

  1. Deploy the solution infrastructure using AWS CloudFormation.
  2. Upload a CSV with sample data to an Amazon Simple Storage Service (Amazon S3) bucket and run an AWS Glue crawler to crawl the data.
  3. Set up Lake Formation permissions.
  4. Configure the Amazon Redshift Query Editor v2.
  5. Set up the schemas in Amazon Redshift.
  6. Create a view in Amazon Redshift.
  7. Create a local database user in ArcGIS Pro.
  8. Connect ArcGIS Pro to the Redshift database.

Prerequisites

You should have the following prerequisites:

Set up the infrastructure with AWS CloudFormation

To create the environment for the demo, complete the following steps:

  1. Log in to the AWS Management Console as an AWS account administrator and a Lake Formation data lake administrator—this account needs to be both an account admin and a data lake admin for the template to complete.
  2. Open the AWS CloudFormation console
  3. Choose Launch Stack.

The CloudFormation template creates the following components:

  • S3 bucketsamp-clinic-db-{ACCOUNT_ID}
  • AWS Glue databasesamp-clinical-glue-db
  • AWS Glue crawler samp-glue-crawler
  • Redshift Serverless workgroupsamp-clinical-rs-wg
  • Redshift Serverless namespacesamp-clinical-rs-ns
  • IAM role for Amazon Redshiftdemo-RedshiftIAMRole-{UNIQUE_ID}
  • IAM role for AWS Gluesamp-clinical-glue-role
  • Lake Formation tag geoproperty

Upload a CSV to the S3 bucket and run the AWS Glue crawler

The next step is to create a data lake in our demo environment and then use an AWS Glue crawler to populate the AWS Glue database and update the schema and metadata in the AWS Glue Data Catalog.

The CloudFormation stack created the S3 bucket we will use as well as the AWS Glue database and crawler. We have provided a fictious test dataset that will represent the patient and clinical information. Download the file and complete the following steps:

  1. On the AWS CloudFormation console, open the stack you just launched.
  2. On the Resources tab, choose the link to the S3 bucket.
  3. Choose Upload and add the CSV file (data-with-geocode.csv), then choose Upload.
  4. On the AWS Glue console, choose Crawlers in the navigation pane.
  5. Select the crawler you created with the CloudFormation stack and choose Run.

The crawler run should only take a minute to complete, and will populate a table named clinic-sample-s3_ACCOUNT_ID with a fictious dataset.

  1. Choose Tables in the navigation pane and open the table the crawler populated.

You will see that the dataset contains fields that contain PHI and personally identifiable information (PII).

AWS Glue table 'clinic-sample_s3' schema definition with patient and clinic fields, input/output formats, and database properties

We now have a database set up and the Data Catalog populated with the schema and metadata we will use for the rest of the demo.

Set up Lake Formation permissions

In this next set of steps, we demonstrate how to secure PHI data to maintain compliance and empower GIS analysts to work effectively. To secure the data lake, we use AWS Lake Formation. In order to properly set up Lake Formation permissions, we need to gather details on how access to the data lake is established.

The Data Catalog provides metadata and schema information that enables services to access data within the data lake. To access the data lake from ArcGIS Pro, we use the ArcGIS Pro Redshift connector, which allows a connection from ArcGIS Pro to Amazon Redshift. Amazon Redshift can access the Data Catalog and provide connectivity to the data lake. The CloudFormation template created a Redshift Serverless instance and namespace and an IAM role that we will use to configure this connection. We still need to set up Lake Formation permissions so that GIS analysts can only access publicly available fields and not those containing PHI or PII. We will assign a Lake Formation tag on the columns containing the publicly available information and assign permissions to the GIS analysts to allow access to columns with this tag.

By default, the Lake Formation configuration allows Super access to IAMAllowedPrinciples; this is to maintain backward compatibility as detailed in Changing the default settings for your data lake. To demonstrate a more secure configuration, we will remove this default configuration.

  1. On the Lake Formation console, choose Administration in the navigation pane.
  2. In the Data Catalog settings section, make sure Use only IAM access control for new databases and Use only IAM access control for new tables in new databases are unchecked.

AWS Data Catalog settings interface showing unchecked IAM-only access control options for new databases and tables

  1. In the navigation pane, under Permissions, choose Data permissions.
  2. Select IAMAllowedPrincipals and choose Revoke.
  3. Choose Tables in the navigation pane.
  4. Open the table clinic-sample-s3_ACCOUNT_ID and choose Edit schema.
  5. Select the fields beginning with clinic_ and choose Edit LF-Tags.
  6. The CloudFormation stack created a Lake Formation tag named geoproperty. Assign geoproperty as the key and true for the value on all the clinic_ fields, then choose Save.

Next, we need to grant the Amazon Redshift IAM role permission to access fields tagged with geoproperty = true.

  1. Choose Data lake permissions, then choose Grant.
  2. For the IAM role, choose demo-RedshiftIAMRole-UNIQUE_ID.
  3. Select geoproperty for the key and true for the value.
  4. Under Database permissions, select Describe, and under Table permissions, select Select and Describe.

Configure the Amazon Redshift Query Editor v2

Next, we need to perform the initial configuration of Amazon Redshift required for database operations. We use an AWS Secrets Manager secret created by the template to make sure password access is managed securely in accordance with AWS best practices.

  1. On the Amazon Redshift console, choose Query editor v2.
  2. When you first start Amazon Redshift, a one-time configuration for the account appears. For this post, leave the options default and choose Configure account.

For more information about these options, refer to Configuring your AWS account.

Redshift query editor configuration interface with AWS KMS encryption settings and optional S3 bucket path input

The query editor will require credentials to connect to the serverless instance; these have been created by the template and stored in Secrets Manager.

  1. Select Other ways to connect, then select AWS Secrets Manager.
  2. For Secret, select (Redshift-admin-credentials).
  3. Choose Save.

Redshift connection interface displaying IAM Identity Center and AWS Secrets Manager authentication methods with credential selector

Set up schemas in Amazon Redshift

An external schema in Amazon Redshift is a feature used to reference schemas that exist in external data sources. For information on creating external schemas, see External schemas in Amazon Redshift Spectrum. We use an external schema to provide access to the data lake in Amazon Redshift. From ArcGIS Pro, we will connect to Amazon Redshift to access the geospatial data.

The IAM role used in the creation of the external schema needs to be associated with the Redshift namespace. This has already been set up by the CloudFormation template, but it’s a good practice to verify that the role is set up correctly before proceeding.

  1. On the Redshift Serverless console, choose Namespace configuration in the navigation pane.
  2. Choose the namespace (sample-rs-namespace).

Amazon Redshift Serverless console displaying namespace configuration with status, workgroup and creation details

On the Security and encryption tab, you should see the IAM role created by CloudFormation. If this role or the namespace isn’t present, verify the stack in AWS CloudFormation before proceeding.

  1. Copy the ARN of the role for use in a later step.

Redshift security configuration panel showing single synchronized IAM role with complete ARN and management options

  1. Choose Query data to return to the query editor.

Amazon Redshift Serverless interface displaying sample-rs-namespace configuration with management and query data controls

  1. In the query editor, enter the following SQL command; be sure to replace the example role ARN with your own. This SQL command will create an external schema that uses the same Redshift role associated with our namespace to attach to the AWS Glue database.
CREATE EXTERNAL SCHEMA samp_clinic_sch_ext FROM DATA CATALOG
database 'sample-glue-database'
IAM_ROLE 'arn:aws:iam::{ACCOUNT_ID}:role/demo-RedshiftIAMRole-{UNIQUE_ID}';
  1. In the query editor, perform a select query on sample-glue-database:
SELECT * FROM "dev"."samp_clinic_sch_ext"."clinic-sample_s3_{ACCOUNT_ID}";

Because the associated role has been granted access to columns tagged with geoproperty = true, only those fields will be returned, as shown in the following screenshot (the data in this example is fictionalized).

Query result displaying 20 medical clinics with details like name, address, and coordinates

  1. Use the following command to create a local schema in Amazon Redshift. The external schema can’t be updated; we will use this local schema to add a geometry field with a Redshift function.
CREATE SCHEMA samp_clinic_sch_local

Create a view in Amazon Redshift

For the data to be viewable from ArcGIS Pro, we will need to create a view. Now that the schemas have been established, we can create the view that can be accessed from ArcGIS Pro.

Amazon Redshift provides many geospatial functions that can be used to create views with fields used by ArcGIS Pro to add points onto a map. We will use one of these functions because the dataset contains latitude and longitude.

Use the following SQL code in the Amazon Redshift Query Editor to create a new view named clinic_location_view. Replace {ACCOUNT_ID} with your own account ID.

CREATE
OR REPLACE VIEW "samp_clinic_sch_local"."clinic_location_view" AS
SELECT
    clinic_id as id,
    clinic_lat as lat,
    clinic_long as long,
    ST_MAKEPOINT(long, lat) as geom
FROM
    “dev”."samp_clinic_sch_ext"."clinic-sample_s3_{ACCOUNT_ID}"
WITH NO SCHEMA BINDING;

The new view that is created under your local schema will have a column named geom containing map-based points that can be used by ArcGIS Pro to add points during map creation. The points in this example are for the clinics providing vaccines. In a real-world scenario, as new clinics are built and their data is added to the data lake, their locations would be added to the map created using this data.

Create a local database user for ArcGIS Pro

For this demo, we use a database user and group to provide access for ArcGIS Pro clients. Enter the following SQL code into the Amazon Redshift Query Editor to create a database user and group:

CREATE USER dbuser with PASSWORD ‘SET_PASSWORD_HERE’;
CREATE GROUP esri_developer_group;
ALTER GROUP esri_developer_group ADD USER dbuser;

After the commands are complete, use the following code to grant permissions to the group:

GRANT USAGE ON SCHEMA samp_clinic_sch_local TO GROUP esri_developer_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA samp_clinic_sch_local GRANT SELECT ON TABLES TO GROUP esri_developer_group;
GRANT SELECT ON ALL TABLES IN SCHEMA samp_clinic_sch_local TO GROUP esri_developer_group;

Connect ArcGIS Pro to the Redshift database

In order to add the database connection to ArcGIS Pro, you need the endpoint for the Redshift Serverless workgroup. You can access the endpoint information on the sample-rs-wg workgroup details page on the Redshift Serverless console. The Redshift namespaces and workgroups are listed by default, as shown in the following screenshot.

Amazon Redshift Serverless namespace and workgroup status dashboard with performance metrics

You can copy the endpoint in the General information section. This endpoint will need to modified; the :5439/dev will need to be removed when configuring the connector in ArcGIS Pro.

Amazon Redshift Serverless workgroup details showing configuration and connection information

  1. Open ArcGIS Pro with the project file you want to add the Redshift connection to.
Make sure the Amazon Redshift ODBC connector has already been installed; this is required in order to make the connection.
  1. On the menu, choose Insert and then Connections, Database, and New Database Connection.
  2. For Database Platform, choose Amazon Redshift.
  3. For Server, insert the endpoint you copied (remove everything following .com from the endpoint).
  4. For Database, choose your database.

Amazon Redshift Serverless connection settings with server, authentication, and database fields

If your ArcGIS Pro client doesn’t have access to the endpoint, you will receive an error during this step. A network path must exist between the ArcGIS Pro client and the Redshift Serverless endpoint. You can set up the network path with Direct Connect, AWS Site-to-Site VPN, or AWS Client VPN. Although it’s not recommended for security reasons, you can also configure Amazon Redshift with a publicly available endpoint. Be sure you consult your security and network teams for best practices and policy guidance before allowing public access to your Redshift Serverless instance.

If a network path exists and you’re having issues connecting, verify the security group rules allow communication inbound from your ArcGIS Pro subnet over the port your Redshift Serverless instance is running on. The default port is 5439, but you can configure a range of ports depending on your environment; see Connecting to Amazon Redshift Serverless for more information.

If connectivity is successful, ArcGIS Pro will add the Amazon Redshift connection under Connection File Name.

  1. Choose OK.
  2. Choose the connection to display the view that was created to include geometry (clinic_location_view).
  3. Choose (right-click) the view and choose Add To Current Map.

ArcGIS Pro will add the points from the view onto the map. The final map displayed has the symbology edited to use red crosses to represent the clinics instead of dots.

Professional GIS interface showing Houston metropolitan vaccination clinics with topographic base map, toolbars, and database connectivity

Clean up

After you have finished the demo, complete the following steps to clean up your resources:

  1. On the Amazon S3 console, open the bucket created by the CloudFormation stack and delete the data-with-geocode.csv file.
  2. On the AWS CloudFormation console, delete the demo stack to remove the resources it created.

Conclusion

In this post, we reviewed how to set up Redshift Serverless to use geospatial data contained within a data lake to enhance maps in ArcGIS Pro. This technique helps builders and GIS analysts use available datasets in data lakes and transform it in Amazon Redshift to further enrich the data before presenting it on a map. We also showed how to secure a data lake using Lake Formation, crawl a geospatial dataset with AWS Glue, and visualize the data in ArcGIS Pro.

For additional best practices for storing geospatial data in Amazon S3 and querying it with Amazon Redshift, see How to partition your geospatial data lake for analysis with Amazon Redshift. We invite you to leave feedback in the comments section.


About the authors

Jeremy Spell is a Cloud Infrastructure Architect working with Amazon Web Services (AWS) Professional Services. He enjoys architecting and building solutions for customers. In his free time Jeremy makes Texas style BBQ, and spends time with his family and church community.

Jeff Demuth is a solutions architect who joined Amazon Web Services (AWS) in 2016. He focuses on the geospatial community and is passionate about geographic information systems (GIS) and technology. Outside of work, Jeff enjoys traveling, building Internet of Things (IoT) applications, and tinkering with the latest gadgets.

Amazon Redshift Python user-defined functions will reach end of support after June 30, 2026

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/amazon-redshift-python-user-defined-functions-will-reach-end-of-support-after-june-30-2026/

The Amazon Redshift integration with AWS Lambda provides the capability to create Amazon Redshift Lambda user-defined functions (UDFs). This capability delivers flexibility, enhanced integrations, and security for functions defined in Lambda that can be run through SQL queries. Amazon Redshift Lambda UDFs offer many advantages:

  • Enhanced integration – You can connect to external services or APIs from within your UDF logic, enabling richer data enrichment and operational workflows.
  • Multiple Python runtimes – Lambda UDFs benefit from Lambda function support for multiple Python runtimes depending on specific use cases. In addition, the new versions and security patches are available within a month of their official release.
  • Independent scaling – Lambda UDFs use Lambda compute resources, so heavy compute or memory-intensive tasks don’t impact query performance or resource concurrency within Amazon Redshift.
  • Isolation and security – You can isolate custom code execution in a separate service boundary. This simplifies maintenance, monitoring, budgeting, and permission management.

Because Lambda UDFs provide these significant advantages in integration, flexibility, scalability, and security, we will be ending support for Python UDFs in Amazon Redshift. We recommend that you migrate your existing Python UDFs to Lambda UDFs by June 30, 2026.

  • October 30, 2025 – Creation of new Python UDFs will no longer be supported (existing functions can still be invoked)
  • June 30, 2026 – Execution of existing Python UDFs will be suspended

In this post, we walk you through how to migrate your existing Python UDFs to Lambda UDFs, set up monitoring and cost evaluations, and review key considerations for a smooth transition.

Solution overview

You can create UDFs for tasks such as tokenization, encryption and decryption, or data science functionality like the Levenshtein distance calculation. For this post, we provide examples for customers who have Python UDFs in place, demonstrating how to replace them with Lambda UDFs.

The Levenshtein function, also known as the Levenshtein distance or edit distance, is a string metric used to measure the difference between two sequences of characters. Although this functionality was previously implemented using Python UDFs using the Python library in Amazon Redshift, Lambda provides a more efficient and scalable solution. This post demonstrates how to migrate from Python UDFs to Lambda UDFs for calculating Levenshtein distances.

Prerequisites

You must have the following:

Prepare the data

To set up our use case, complete the following steps:

  1. On the Amazon Redshift console, choose Query editor v2 under Explorer in the navigation pane.
  2. Connect to your Redshift data warehouse.
  3. Create a table and load data. The following query loads 30,000,000 rows in the customer table:
DROP TABLE IF EXISTS customer;
CREATE TABLE customer
(
c_customer_sk int4 not null ,
c_customer_id char(16) not null ,
c_current_cdemo_sk int4 ,
c_current_hdemo_sk int4 ,
c_current_addr_sk int4 ,
c_first_shipto_date_sk int4 ,
c_first_sales_date_sk int4 ,
c_salutation char(10) ,
c_first_name char(20) ,
c_last_name char(30) ,
c_preferred_cust_flag char(1) ,
c_birth_day int4 ,
c_birth_month int4 ,
c_birth_year int4 ,
c_birth_country varchar(20) ,
c_login char(13) ,
c_email_address char(50) ,
c_last_review_date_sk int4 ,
primary key (c_customer_sk)
) distkey(c_customer_sk);

COPY customer from 's3://redshift-downloads/TPC-DS/2.13/3TB/customer/'
IAM_ROLE default gzip delimiter '|' EMPTYASNULL REGION 'us-east-1';

Identify existing Python UDFs

Run the following script to list existing Python UDFs:

SELECT 
    p.proname, 
    p.pronargs, 
    t.typname, 
    n.nspname, 
    l.lanname, 
    pg_get_functiondef(p.oid) 
FROM 
    pg_proc p, 
    pg_language l, 
    pg_type t, 
    pg_namespace n
WHERE 
    p.prolang = l.oid
    and p.prorettype = t.oid
    and l.lanname = 'plpythonu'
    and p.pronamespace = n.oid
    and nspname not in ('pg_catalog', 'information_schema')
ORDER BY 
    proname;

The following is our existing Python UDF definition for Levenshtein distance:

create or replace function fn_levenshtein_distance(a varchar, b varchar) returns integer as
$$

def levenshtein_distance(a, len_a, b, len_b):
    d = [[0] * (len_b + 1) for i in range(len_a + 1)]  

    for i in range(1, len_a + 1):
        d[i][0] = i

    for j in range(1, len_b + 1):
        d[0][j] = j
    
    for j in range(1, len_b + 1):
        for i in range(1, len_a + 1):
            if a[i - 1] == b[j - 1]:
                cost = 0
            else:
                cost = 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost) # substitution   

    return d[len_a][len_b]

def distance(a, b):
    len_a, len_b = len(a), len(b)
    if len_a == len_b:
        return 0
    elif len_a == 0:
        return len_b
    elif len_b == 0:
        return len_a
    else:
        return levenshtein_distance(a, len_a, b, len_b)

return distance(a, b)
$$ immutable;

Convert the Python UDF function to a Lambda UDF

You can simplify converting your Python UDF to a Lambda UDF using Amazon Q Developer, a generative AI-powered assistant. It handles code transformation, packaging, and integration logic, accelerating migration and improving scalability. Integrated with popular developer tools like VS Code, JetBrains, and others, Amazon Q streamlines workflows so teams can modernize analytics using serverless architectures with minimal effort.

Amazon Q Developer code suggestions are based on large language models (LLMs) trained on billions of lines of code, including open source and Amazon code. Always review a code suggestion before accepting it, and you might need to edit it to make sure that it does exactly what you intended.

Convert @python-udf.py Redshift Python UDF to Redshift Lambda UDF which batch processes data in the arguments array in a loop and returns json dump at the end. Refer to @lambda-context.py for reference and additional guidance on Lambda UDF.

Create a Lambda function

Complete the following steps to create a Lambda function:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Choose Author from scratch.
  4. For Function name, enter a custom name (for example, levenshtein_distance_func).
  5. For Runtime, choose your code environment. (The examples in this post are compatible with Python 3.12.)
  6. For Architecture, select your system architecture. (The examples in this post are compatible with x86_64.)

  1. For Execution role, select Create a new role with basic Lambda permissions.

  1. Choose Create function.
  2. Choose Code and add the following code:
import json

def lambda_handler(event, context):
    t1 = event['arguments']
    resp = [None]*len(t1)

    for i, x in enumerate(t1):
        if x[0] is not None and x[1] is not None:
            resp[i] = distance(x[0], x[1])

    ret = dict()
    ret['results'] = resp
    return json.dumps(ret)

def levenshtein_distance(a, len_a, b, len_b):
    d = [[0] * (len_b + 1) for i in range(len_a + 1)]  

    for i in range(1, len_a + 1):
        d[i][0] = i

    for j in range(1, len_b + 1):
        d[0][j] = j
    
    for j in range(1, len_b + 1):
        for i in range(1, len_a + 1):
            if a[i - 1] == b[j - 1]:
                cost = 0
            else:
                cost = 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost) # substitution   

    return d[len_a][len_b]

def distance(a, b):
    len_a, len_b = len(a), len(b)
    if len_a == len_b and a == b:
        return 0
    elif len_a == 0:
        return len_b
    elif len_b == 0:
        return len_a
    else:
        return levenshtein_distance(a, len_a, b, len_b)
  1. Choose configuration and update Timeout to 1 minute.

You can modify memory to optimize performance. To learn more, see Optimizing Levenshtein User-Defined Function in Amazon Redshift.

Create an Amazon Redshift IAM role

To allow your Amazon Redshift cluster to invoke the Lambda function, you must set up proper IAM permissions. Complete the following steps:

  1. Identify the IAM role associated with your Amazon Redshift cluster. If you don’t have one, create a new IAM role for Amazon Redshift.
  2. Add the following IAM policy to this role, providing your AWS Region and AWS account number:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "lambda:InvokeFunction",
"Resource": "arn:aws:lambda:<REGION>:<AWS account>:function:levenshtein_distance_func"
}
]
}

Create a Lambda UDF

Run following script to create your Lambda UDF:

CREATE or REPLACE EXTERNAL FUNCTION 
fn_lambda_levenshtein_distance(a varchar, b varchar) returns int
lambda 'levenshtein_distance_func' IAM_ROLE default 
STABLE
;

Test the solution

To test the solution, run the following script using the Python UDF:

SELECT c_customer_sk, c_customer_id, fn_levenshtein_distance(c_first_name, c_last_name) as distance
FROM customer
WHERE c_customer_sk in (1,2,3,4,5,31);

The following table shows our output.

Run the same script using the Lambda UDF:

SELECT c_customer_sk, c_customer_id, fn_lambda_levenshtein_distance(c_first_name, c_last_name) as distance
FROM customer
WHERE c_customer_sk in (1,2,3,4,5,31);

The results of both UDFs match.

Replace the Python UDF with the Lambda UDF

You can use the following steps in preproduction for testing:

  1. Revoke access for the Python UDF:
REVOKE execute on function fn_levenshtein_distance(varchar, varchar) from <group_name> or <role_name>
  1. Grant access to the Lambda UDF:
grant execute on function fn_lambda_levenshtein_distance(varchar, varchar) to <group_name> or <role_name>
  1. After full testing of the Lambda UDF has been performed, you can drop the Python UDF.
  2. Rename the Lambda UDF fn_lambda_levenshtein_distance to fn_levenshtein_distance so the end-user and application code doesn’t need to change:
ALTER FUNCTION fn_lambda_levenshtein_distance(varchar, varchar)
     RENAME TO fn_levenshtein_distance;
  1. Validate with the following query:
SELECT c_customer_sk, c_customer_id, fn_levenshtein_distance(c_first_name, c_last_name) as distance
FROM customer
WHERE c_customer_sk in (1,2,3,4,5,31);

Cost evaluation

To evaluate the cost of the Lambda UDF, complete the following steps:

  1. Run the following script to create a table using a SELECT query, which uses the Lambda UDF:
DROP TABLE IF EXISTS customer_lambda;
CREATE TABLE customer_lambda as 
SELECT c_customer_sk, c_customer_id, fn_levenshtein_distance(c_first_name, c_last_name) as distance
FROM customer;

You can inspect the query logs using CloudWatch Log Insights.

  1. On the CloudWatch console, choose Logs in the navigation pane, then choose Log Insights.
  2. Filter by the Lambda UDF and use the following query to identify the number of Lambda invocations.
fields @timestamp, @message, @logStream, @log
| filter @message like /^REPORT/
| sort @timestamp desc
| limit 10000

  1. Use following query to find the cost of the Lambda UDF for the specific duration you selected:
parse @message /Duration:\s*(?<@duration_ms>\d+\.\d+)\s*ms\s*Billed\s*Duration:\s*(?<@billed_duration_ms>\d+)\s*ms\s*Memory\s*Size:\s*(?<@memory_size_mb>\d+)\s*MB/
| filter @message like /REPORT RequestId/
| stats sum(@billed_duration_ms * @memory_size_mb * 1.6279296875e-11 + 2.0e-7) as @cost_dollars_total

For this example, we used the us-east-1 Region using ARM-based instances. For more details on Lambda pricing by Region and the Free Tier limit, see AWS Lambda pricing.

  1. Choose Summarize results.

The cost of this Lambda UDF invocation was $0.02329 for 30 million rows.

Monitor Lambda UDFs

Monitoring Lambda UDFs involves tracking both the Lambda function’s performance and the impact on the Redshift query execution. Because UDFs execute externally, a dual approach is necessary.

CloudWatch metrics and logs for Lambda functions

CloudWatch provides comprehensive monitoring for Lambda functions, such as the following key metrics:

  • Invocations – Tracks the number of times the Lambda function is called, indicating UDF usage frequency
  • Duration – Measures execution time, helping identify performance bottlenecks
  • Errors – Counts failed invocations, which is critical for detecting issues in UDF logic
  • Throttles – Indicates when Lambda limits invocations due to concurrency caps, which can delay query results
  • Logs – CloudWatch Logs capture detailed execution output, including errors and custom log messages, aiding in debugging
  • Alarms – Configures alarms for high error rates (for example, Errors > 0) or excessive duration (for example, Duration > 1 second) to receive proactive notifications

Redshift query performance

Within Amazon Redshift, system views provide comprehensive insights into Lambda UDF performance and errors:

  • SYS_QUERY_HISTORY – Identifies queries that have called your Lambda UDFs by filtering with the UDF name in the query_text column. This helps track usage patterns and execution frequency.
  • SYS_QUERY_DETAIL – Provides granular execution metrics for queries involving Lambda UDFs, helping identify performance bottlenecks at the step level.
  • Performance aggregation – Generates summary reports of Lambda UDF performance metrics, including execution count, average duration, and maximum duration to track performance trends over time.

The following table summarizes the monitoring tools available.

Monitoring Tool Purpose Key Metrics/Views
CloudWatch Metrics Track Lambda function performance Invocations, Duration, Errors, Throttles
CloudWatch Logs Debug Lambda execution issues Error messages, custom logs
SYS_QUERY_HISTORY Track Lambda UDF usage patterns Query execution times, status, user information, query text
SYS_QUERY_DETAIL Analyze Lambda UDF performance Step-level execution details, resource utilization, query plan information
Performance Summary Reports Track UDF performance trends Execution count, average/maximum duration, total elapsed time

Monitoring approach for Lambda UDFs in Amazon Redshift

For analyzing individual queries, you can use the following code to track how your Lambda UDFs are being used across your organization:

SELECT * FROM sys_query_history
WHERE query_text LIKE '%your_lambda_udf_name%'
ORDER BY start_time DESC
LIMIT 20;

This helps you do the following:

  • Identify frequent users
  • Monitor execution patterns
  • Track usage trends
  • Detect unauthorized access

You can also create comprehensive monitoring by using query history to monitor performance metrics at the user level:

SELECT 
    usename,
    DATE_TRUNC('day', start_time) as day,
    COUNT(*) as query_count,
    AVG(DATEDIFF(microsecond, start_time, end_time))/1000000.0 as avg_duration_seconds,
    MAX(DATEDIFF(microsecond, start_time, end_time))/1000000.0 as max_duration_seconds
FROM sys_query_history q
JOIN pg_user u ON q.user_id = u.usesysid
WHERE query_text LIKE '%your_lambda_udf_name%'
AND user_id > 1
GROUP BY usename, day
ORDER BY usename, query_count DESC;

Additionally, you can generate weekly performance reports using the following aggregation query:

SELECT 
    'your_lambda_udf_name' AS function_name,
    COUNT(DISTINCT q.query_id) AS execution_count,
    AVG(DATEDIFF(millisecond, q.start_time, q.end_time)) AS avg_duration_ms,
    MAX(DATEDIFF(millisecond, q.start_time, q.end_time)) AS max_duration_ms,
    SUM(q.elapsed_time) / 1000000 AS total_elapsed_time_sec
FROM 
    sys_query_history q
WHERE 
    q.query_text LIKE '%your_lambda_udf_name%'
GROUP BY 
    function_name
ORDER BY 
    execution_count DESC;

Considerations

To maximize the benefits of Lambda UDFs, consider the following aspects to optimize performance, provide reliability, secure data, and manage costs. If you have Python UDFs that don’t use Python libraries, consider whether they are candidates to convert to SQL UDFs.

The following are key performance considerations:

  • Batching – Amazon Redshift batches multiple rows into a single Lambda invocation to reduce call frequency, improving efficiency. Make sure the Lambda function handles batched inputs efficiently. For more details, see Accessing external components using Amazon Redshift Lambda UDFs.
  • Parallel invocations – Redshift cluster slices invoke Lambda functions in parallel, enhancing performance for large datasets. Design functions to support concurrent executions.
  • Cold starts – Lambda functions might experience cold start delays, particularly if infrequently used. Languages like Python or Node.js typically have faster startup times than Java, reducing latency.
  • Function optimization – Optimize Lambda code for quick execution, minimizing resource usage and latency. For example, avoid unnecessary computations or external API calls.

Consider the following error handling methods:

  • Robust lambda logic – Implement comprehensive error handling in the Lambda function to manage exceptions gracefully. Return clear error messages in the JSON response, as specified in the Amazon Redshift-Lambda interface. For more details, see Scalar Lambda UDFs.
  • Error propagation – Lambda errors can cause Redshift query failures. Monitor SYS_QUERY_HISTORY for query-level issues and CloudWatch Logs for detailed Lambda errors.
  • JSON interface – The Lambda function must return a JSON object with success, error_msg, num_records, and results fields. Use proper formatting to avoid query disruptions.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the Redshift provisioned or serverless endpoint.
  2. Delete the Lambda function.
  3. Delete the IAM roles you created.

Conclusion

Lambda UDFs unlock a new level of flexibility, performance, and maintainability for extending Amazon Redshift. By decoupling custom logic from the warehouse engine, teams can scale independently, adopt modern runtimes, and integrate external systems.

If you’re currently using Python UDFs in Amazon Redshift, it’s time to explore the benefits of migrating to Lambda UDFs. With the generative AI capabilities of Amazon Q Developer, you can automate much of this transformation and accelerate your modernization journey. To learn more, refer to the Lambda UDF examples GitHub repo and Data Tokenization with Amazon Redshift and Protegrity.


About the authors

Raks KhareRaks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Harshida Patel is a Analytics Specialist Principal Solutions Architect, with AWS.

Enhance data ingestion performance in Amazon Redshift with concurrent inserts

Post Syndicated from Raghu Kuppala original https://aws.amazon.com/blogs/big-data/enhance-data-ingestion-performance-in-amazon-redshift-with-concurrent-inserts/

Amazon Redshift is a fully managed petabyte data warehousing service in the cloud. Its massively parallel processing (MPP) architecture processes data by distributing queries across compute nodes. Each node executes identical query code on its data portion, enabling parallel processing.

Amazon Redshift employs columnar storage for database tables, reducing overall disk I/O requirements. This storage method significantly improves analytic query performance by minimizing data read during queries. Data has become many organizations’ most valuable asset, driving demand for real-time or near real-time analytics in data warehouses. This demand necessitates systems that support simultaneous data loading while maintaining query performance. This post showcases the key improvements in Amazon Redshift concurrent data ingestion operations.

Challenges and pain points for write workloads

In a data warehouse environment, managing concurrent access to data is crucial yet challenging. Customers using Amazon Redshift ingest data using various approaches. For example, you might commonly use INSERT and COPY statements to load data to a table, which are also called pure write operations. You might have requirements for low-latency ingestions to maximize data freshness. To achieve this, you can submit queries concurrently to the same table. To enable this, Amazon Redshift implements snapshot isolation by default. Snapshot isolation provides data consistency when multiple transactions are running simultaneously. Snapshot isolation guarantees that each transaction sees a consistent snapshot of the database as it existed at the start of the transaction, preventing read and write conflicts that could compromise data integrity. With snapshot isolation, read queries are able to execute in parallel, so you can take advantage of the full performance that the data warehouse has to offer.

However, pure write operations execute sequentially. Specifically, pure write operations need to acquire an exclusive lock during the entire transaction. They only release the lock when the transaction has committed the data. In these cases, the performance of the pure write operations is constrained by the speed of serial execution of the writes across sessions.

To understand this better, let’s look at how a pure write operation works. Every pure write operation includes pre-ingestion tasks such as scanning, sorting, and aggregation on the same table. After the pre-ingestion tasks are complete, the data is written to the table while maintaining data consistency. Because the pure write operations run serially, even the pre-ingestion steps ran serially due to lack of concurrency. This means that when multiple pure write operations are submitted concurrently, they are processed one after another, with no parallelization even for the pre-ingestion steps. To improve the concurrency of ingestion to the same table and meet low latency requirements for ingestion, customers often use workarounds through the use of staging tables. Specifically, you can submit INSERT ... VALUES(..) statements into staging tables. Then, you perform joins with other tables, such FACT and DIMENSION tables, prior to appending data using ALTER TABLE APPEND into your target tables. This approach isn’t desirable because it requires you to maintain staging tables and potentially have a larger storage footprint due to data block fragmentation from the use of ALTER TABLE APPEND statements.

In summary, the sequential execution of concurrent INSERT and COPY statements, due to their exclusive locking behavior, creates challenges if you want to maximize the performance and efficiency of your data ingestion workflows in Amazon Redshift. To overcome these limitations, you must adopt workaround solutions, introducing additional complexity and overhead. The following section outlines how Amazon Redshift has addressed these pain points with improvements to concurrent inserts.

Concurrent inserts and its benefits

With Amazon Redshift patch 187, Amazon Redshift has introduced significant improvement in concurrency for data ingestion with support for concurrent inserts. This improves concurrent execution of pure write operations such as COPY and INSERT statements, accelerating the time for you to load data into Amazon Redshift. Specifically, multiple pure write operations are able to progress simultaneously and complete pre-ingestion tasks such as scanning, sorting, and aggregation in parallel.

To visualize this improvement, let’s consider an example of two queries, executed concurrently from different transactions.

The following is query 1 in transaction 1:

INSERT INTO table_a SELECT * FROM table_b WHERE table_b.column_x = 'value_a';

The following is query 2 in transaction 2:

INSERT INTO table_a SELECT * FROM table_c WHERE table_c.column_y = 'value_b'

The following figure illustrates a simplified visualization of pure write operations without concurrent inserts.

Without concurrent inserts, the key components are as follows:

  • First, both pure write operations (INSERT) need to read data from table b and table c, respectively.
  • The segment in pink is the scan step (reading data) and the segment in green is write step (actually inserting the data).
  • In the “Before concurrent inserts” state, both queries would run sequentially. Specifically, the scan step in query 2 waits for the insert step in query 1 to complete before it begins.

For example, consider two identically sized queries across different transactions. Both queries need to scan the same amount of data and insert the same amount of data into the target table. Let’s say both are issued at 10:00 AM. First, query 1 would spend from 10:00 AM to 10:50 AM scanning the data and 10:50 AM to 11:00 AM inserting the data. Next, because query 2 is identical in scan and insertion volumes, query 2 would spend from 11:00 AM to 11:50 AM scanning the data and 11:50 AM to 12:00 PM inserting the data. Both transactions started at 10:00 AM. The end-to-end runtime is 2 hours (transaction 2 ends at 12:00 PM).The following figure illustrates a simplified visualization of pure write operations with concurrent inserts, compared with the previous example.

With concurrent inserts enabled, the scan step of query 1 and query 2 can progress simultaneously. When either of the queries need to insert data, they now do so serially. Let’s consider the same example, with two identically sized queries across different transactions. Both queries need to scan the same amount of data and insert the same amount of data into the target table. Again, let’s say both are issued at 10:00 AM. At 10:00 AM, query 1 and query 2 begin executing concurrently. From 10:00 AM to 10:50 AM, query 1 and query 2 are able to scan the data in parallel. From 10:50 AM to 11:00 AM, query 1 inserts the data into the target table. Next, from 11:00 AM to 11:10 AM, query 2 inserts the data into the target table. The total end-to-end runtime for both transactions is now reduced to 1 hour and 10 minutes, with query 2 completing at 11:10 AM. In this scenario, the pre-ingestion steps (scanning the data) for both queries are able to run concurrently, taking the same amount of time as in the previous example (50 minutes). However, the actual insertion of data into the target table is now executed serially, with query 1 completing the insertion first, followed by query 2. This demonstrates the performance benefits of the concurrent inserts feature in Amazon Redshift. By allowing the pre-ingestion steps to run concurrently, the overall runtime is improved by 50 minutes compared to the sequential execution before the feature was introduced.

With concurrent inserts, pre-ingestion steps are able to progress simultaneously. Pre-ingestion tasks could be one or a combination of tasks, such as scanning, sorting, and aggregation. There are significant performance benefits achieved in the end-to-end runtime of the queries.

Benefits

You can now benefit from these performance improvements without any additional configuration because the concurrent processing is handled automatically by the service. There are multiple benefits from the improvements in concurrent inserts. You can experience the improvement of end-to-end performance of ingestion workloads when you’re writing to the same table. Internal benchmarking shows that concurrent inserts can improve end-to-end runtime by up to 40% for concurrent insert transactions to the same tables. This feature is particularly beneficial for scan-heavy queries (queries that spend more time reading data than they spend time writing data). The higher the ratio of scan:insert in any query, higher the performance improvement expected.

This feature also improves the throughput and performance for multi-warehouse writes through data sharing. Multi-warehouse writes through data sharing helps you scale your write workloads across dedicated Redshift clusters or serverless workgroups, optimizing resource utilization and achieving more predictable performance for your extract, transform, and load (ETL) pipelines. Specifically, in multi-warehouse writes through data sharing, queries from different warehouses can write data on the same table. Concurrent inserts improve the end-to-end performance of these queries by reducing resource contention and enabling them to make progress simultaneously.

The following figure shows the performance improvements from internal tests from concurrent inserts, with the orange bar indicating the performance improvement for multi-warehouse writes through data sharing and the blue bar denoting the performance improvement for concurrent inserts on the same warehouse. As the graph indicates, queries with higher scan components relative to insert components benefit up to 40% with this new feature.

You can also experience additional benefits as a result of using concurrent inserts to manage your ingestion pipelines. When you directly write data to the same tables by using the benefit of concurrent inserts instead of using workarounds with ALTER TABLE APPEND statements, you can reduce your storage footprint. This comes in two forms: first from the elimination of temporary tables, and second from the reduction in table fragmentation from frequent ALTER TABLE APPEND statements. Additionally, you can avoid operational overhead of managing complex workarounds and rely on frequent background and customer-issued VACUUM DELETE operations to manage the fragmentation caused by appending temporary tables to your target tables.

Considerations

Although the concurrent insert enhancements in Amazon Redshift provide significant benefits, it’s important to be aware of potential deadlock scenarios that can arise in a snapshot isolation environment. Specifically, in a snapshot isolation environment, deadlocks can occur in certain conditions when running concurrent write transactions on the same table. The snapshot isolation deadlock happens when concurrent INSERT and COPY statements are sharing a lock and making progress, and another statement needs to perform an operation (UPDATE, DELETE, MERGE, or DDL operation) that requires an exclusive lock on the same table.

Consider the following scenario:

  • Transaction 1:
    INSERT/COPY INTO table_A;

  • Transaction 2:
    INSERT/COPY INTO table_A;
    <UPDATE/DELETE/MERGE/DDL statement> table_A

A deadlock can occur when multiple transactions with INSERT and COPY operations are running concurrently on the same table with a shared lock, and one of those transactions follows its pure write operation with an operation that requires an exclusive lock, such as an UPDATE, MERGE, DELETE, or DDL statement. To avoid the deadlock in these situations, you can separate statements requiring an exclusive lock (UPDATE, MERGE, DELETE, DDL statements) to a different transaction so that INSERT and COPY statements can progress simultaneously, and the statements requiring exclusive locks can execute after them. Alternatively, for transactions with INSERT and COPY statements and MERGE, UPDATE, and DELETE statements on same table, you can include retry logic in your applications to work around potential deadlocks. Refer to Potential deadlock situation for concurrent write transactions involving a single table for more information about deadlocks, and see Concurrent write examples for examples of concurrent transactions.

Conclusion

In this post, we demonstrated how Amazon Redshift has addressed a key challenge: improving concurrent data ingestion performance into a single table. This enhancement can help you meet your requirements for low latency and stricter SLAs when accessing the latest data. The update exemplifies our commitment to implementing critical features in Amazon Redshift based on customer feedback.


About the authors

Raghu Kuppala is an Analytics Specialist Solutions Architect experienced working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Sumant Nemmani is a Senior Technical Product Manager at AWS. He is focused on helping customers of Amazon Redshift benefit from features that use machine learning and intelligent mechanisms to enable the service to self-tune and optimize itself, ensuring Redshift remains price-performant as they scale their usage.

Gagan Goel is a Software Development Manager at AWS. He ensures that Amazon Redshift features meet customer needs by prioritising and guiding the team in delivering customer-centric solutions, monitor and enhance query performance for customer workloads.

Kshitij Batra is a Software Development Engineer at Amazon, specializing in building resilient, scalable, and high-performing software solutions.

Sanuj Basu is a Principal Engineer at AWS, driving the evolution of Amazon Redshift into a next-generation, exabyte-scale cloud data warehouse. He leads engineering for Redshift’s core data platform — including managed storage, transactions, and data sharing — enabling customers to power seamless multi-cluster analytics and modern data mesh architectures. Sanuj’s work helps Redshift customers break through th

How Skroutz handles real-time schema evolution in Amazon Redshift with Debezium

Post Syndicated from Konstantina Mavrodimitraki original https://aws.amazon.com/blogs/big-data/how-skroutz-handles-real-time-schema-evolution-in-amazon-redshift-with-debezium/

This guest post was co-authored with Kostas Diamantis from Skroutz.

At Skroutz, we are passionate about our product, and it is always our top priority. We are constantly working to improve and evolve it, supported by a large and talented team of software engineers. Our product’s continuous innovation and evolution lead to frequent updates, often necessitating changes and additions to the schemas of our operational databases.

When we decided to build our own data platform to meet our data needs, such as supporting reporting, business intelligence (BI), and decision-making, the main challenge—and also a strict requirement—was to make sure it wouldn’t block or delay our product development.

We chose Amazon Redshift to promote data democratization, empowering teams across the organization with seamless access to data, enabling faster insights and more informed decision-making. This choice supports a culture of transparency and collaboration, as data becomes readily available for analysis and innovation across all departments.

However, keeping up with schema changes from our operational databases, while updating the data warehouse without constantly coordinating with development teams, delaying releases, or risking data loss, became a new challenge for us.

In this post, we share how we handled real-time schema evolution in Amazon Redshift with Debezium.

Solution overview

Most of our data resides in our operational databases, such as MariaDB and MongoDB. Our approach involves using the change data capture (CDC) technique, which automatically handles the schema evolution of the data stores being captured. For this, we used Debezium along with a Kafka cluster. This solution enables schema changes to be propagated without disrupting the Kafka consumers.

However, handling schema evolution in Amazon Redshift became a bottleneck, prompting us to develop a strategy to address this challenge. It’s important to note that, in our case, changes in our operational databases primarily involve adding new columns rather than breaking changes like altering data types. Therefore, we have implemented a semi-manual process to resolve this issue, along with a mandatory alerting mechanism to notify us of any schema changes. This two-step process consists of handling schema evolution in real time and handling data updates in an asynchronous manual step. The following architectural diagram illustrates a hybrid deployment model, integrating both on-premises and cloud-based components.

End-to-end data migration workflow from on-premises databases to AWS cloud using CDC, messaging, and data warehouse services

The data flow begins with data from MariaDB and MongoDB, captured using Debezium for CDC in near real-time mode. The captured data is streamed to a Kafka cluster, where Kafka consumers (built on the Ruby Karafka framework) read and write them to the staging area, either in Amazon Redshift or Amazon Simple Storage Service (Amazon S3). From the staging area, DataLoaders promote the data to production tables in Amazon Redshift. At this stage, we apply the slowly changing dimension (SCD) concept to these tables, using Type 7 for most of them.

In data warehousing, an SCD is a dimension that stores data, and though it’s generally stable, it might change over time. Various methodologies address the complexities of SCD management. SCD Type 7 places both the surrogate key and the natural key into the fact table. This allows the user to select the appropriate dimension records based on:

  • The primary effective date on the fact record
  • The most recent or current information
  • Other dates associated with the fact record

Afterwards, analytical jobs are run to create reporting tables, enabling BI and reporting processes. The following diagram provides an example of the data modeling process from a staging table to a production table.

Database schema evolution: staging.shops to production.shops with added temporal and versioning columns

The architecture depicted in the diagram shows only our CDC pipeline, which fetches data from our operational databases and doesn’t include other pipelines, such as those for fetching data through APIs, scheduled batch processes, and many more. Also note that our convention is that dw_* columns are used to catch SCD metadata information and other metadata in general. In the following sections, we discuss the key components of the solution in more detail.

Real-time workflow

For the schema evolution part, we focus on the column dw_md_missing_data, which captures schema evolution changes in near real time that occur in the source databases. When a new change is produced to the Kafka cluster, the Kafka consumer is responsible for writing this change to the staging table in Amazon Redshift. For example, a message produced by Debezium to the Kafka cluster will have the following structure when a new shop entity is created:

{
  "before": null,
  "after": {
    "id": 1,
    "name": "shop1",
    "state": "hidden"
  },
  "source": {
    ...
    "ts_ms": "1704114000000",
    ...
  },
  "op": "c",
  ...
}

The Kafka consumer is responsible for preparing and executing the SQL INSERT statement:

INSERT INTO staging.shops (
  id,
  "name",
  state,
  dw_md_changed_at,
  dw_md_operation,
  dw_md_missing_data
)
VALUES
  (
    1,
    'shop1',
    'hidden',
    '2024-01-01 13:00:00',
    'create',
    NULL
  )
;

After that, let’s say a new column is added to the source table called new_column, with the value new_value.
The new message produced to the Kafka cluster will have the following format:

{
  "before": { ... },
  "after": {
    "id": 1,
    "name": "shop1",
    "state": "hidden",
    "new_column": "new_value"
  },
  "source": {
    ...
    "ts_ms": "1704121200000"
    ...
  },
  "op": "u"
  ...
}

Now the SQL INSERT statement executed by the Kafka consumer will be as follows:

INSERT INTO staging.shops (
  id,
  "name",
  state,
  dw_md_changed_at,
  dw_md_operation,
  dw_md_missing_data
)
VALUES
  (
    1,
    'shop1',
    'hidden',
    '2024-01-01 15:00:00',
    'update',
    JSON_PARSE('{"new_column": "new_value"}') /* <-- check this */
  )
;

The consumer performs an INSERT as it would for the known schema, and anything new is added to the dw_md_missing_data column as key-value JSON. After the data is promoted from the staging table to the production table, it will have the following structure.

Production.shops table displaying temporal data versioning with creation, update history, and current state indicators

At this point, the data flow continues running without any data loss or the need for communication with teams responsible for maintaining the schema in the operational databases. However, this data might not be easily accessible for the data consumers, analysts, or other personas. It’s worth noting that dw_md_missing_data is defined as a column of the SUPER data type, which was introduced in Amazon Redshift to store semistructured data or documents as values.

Monitoring mechanism

To track new columns added to a table, we have a scheduled process that runs weekly. This process checks for tables in Amazon Redshift with values in the dw_md_missing_data column and generates a list of tables requiring manual action to make this data available through a structured schema. A notification is then sent to the team.

Manual remediation steps

In the aforementioned example, the manual steps to make this column available would be:

  1. Add the new columns to both staging and production tables:
ALTER TABLE staging.shops ADD COLUMN new_column varchar(255);
ALTER TABLE production.shops ADD COLUMN new_column varchar(255);
  1. Update the Kafka consumer’s known schema. In this step, we just need to add the new column name to a simple array list. For example:
class ShopsConsumer < ApplicationConsumer
  SOURCE_COLUMNS = [
    'id',
    'name',
    'state',
    'new_column' # this one is the new column
  ]
 
  def consume
    # Ruby code for:
    #   1. data cleaning
    #   2. data transformation
    #   3. preparation of the SQL INSERT statement
 
    RedshiftClient.conn.exec <<~SQL
      /*
        generated SQL INSERT statement
      */
    SQL
  end
end
  1. Update the DataLoader’s SQL logic for the new column. A DataLoader is responsible for promoting the data from the staging area to the production table.
class DataLoader::ShopsTable < DataLoader::Base
  class << self
    def load
      RedshiftClient.conn.exec <<~SQL
        CREATE TABLE staging.shops_new (LIKE staging.shops);
      SQL
 
      RedshiftClient.conn.exec <<~SQL
        /*
          We move the data to a new table because in staging.shops
          the Kafka consumer will continue add new rows
        */
        ALTER TABLE staging.shops_new APPEND FROM staging.shops;
      SQL
 
      RedshiftClient.conn.exec <<~SQL
        BEGIN;
          /*
            SQL to handle
              * data deduplications etc
              * more transformations
              * all the necessary operations in order to apply the data modeling we need for this table
          */
 
          INSERT INTO production.shops (
            id,
            name,
            state,
            new_column, /* --> this one is the new column <-- */
            dw_start_date,
            dw_end_date,
            dw_current,
            dw_md_changed_at,
            dw_md_operation,
            dw_md_missing_data
          )
          SELECT
            id,
            name,
            state,
            new_column, /* --> this one is the new column <-- */
            /*
              here is the logic to apply the data modeling (type 1,2,3,4...7)
            */
          FROM
            staging.shops_new
          ;
 
          DROP TABLE staging.shops_new;
        END TRANSACTION;
      SQL
    end
  end
end
  1. Transfer the data that has been loaded in the meantime from the dw_md_missing_data SUPER column to the newly added column and then clean up. In this step, we just need to run a data migration like the following:
BEGIN;
 
  /*
    Transfer the data from the `dw_md_missing_data` to the corresponding column
  */
  UPDATE production.shops
  SET new_column = dw_md_missing_data.new_column::varchar(255)
  WHERE dw_md_missing_data.new_column IS NOT NULL;
 
  /*
    Clean up dw_md_missing_data column
  */
  UPDATE production.shops
  SET dw_md_missing_data = NULL
  WHERE dw_md_missing_data IS NOT NULL;
 
END TRANSACTION;

To perform the preceding operations, we make sure that no one else performs changes to the production.shops table because we want no new data to be added to the dw_md_missing_data column.

Conclusion

The solution discussed in this post enabled Skroutz to manage schema evolution in operational databases while seamlessly updating the data warehouse. This alleviated the need for constant development team coordination and removed risks of data loss during releases, ultimately fostering innovation rather than stifling it.

As the migration of Skroutz to the AWS Cloud approaches, discussions are underway on how the current architecture can be adapted to align more closely with AWS-centered principles. To that end, one of the changes being considered is Amazon Redshift streaming ingestion from Amazon Managed Streaming for Apache Kafka (Amazon MSK) or open source Kafka, which will make it possible for Skroutz to process large volumes of streaming data from multiple sources with low latency and high throughput to derive insights in seconds.

If you face similar challenges, discuss with an AWS representative and work backward from your use case to provide the most suitable solution.


About the authors

Konstantina Mavrodimitraki is a Senior Solutions Architect at Amazon Web Services, where she assists customers in designing scalable, robust, and secure systems in global markets. With deep expertise in data strategy, data warehousing, and big data systems, she helps organizations transform their data landscapes. A passionate technologist and people person, Konstantina loves exploring emerging technologies and supports the local tech communities. Additionally, she enjoys reading books and playing with her dog.

Kostas Diamantis is the Head of the Data Warehouse at Skroutz company. With a background in software engineering, he transitioned into data engineering, using his technical expertise to build scalable data solutions. Passionate about data-driven decision-making, he focuses on optimizing data pipelines, enhancing analytics capabilities, and driving business insights.

Build a multi-Region analytics solution with Amazon Redshift, Amazon S3, and Amazon QuickSight

Post Syndicated from Donatas Kuchalskis original https://aws.amazon.com/blogs/big-data/build-a-multi-region-analytics-solution-with-amazon-redshift-amazon-s3-and-amazon-quicksight/

Organizations increasingly face complex requirements balancing regional data sovereignty with global analytics needs. Regulatory frameworks like GDPR, HIPAA, and local data protection laws often mandate storing data in specific geographic regions, and business operations require global teams to access and analyze this data efficiently.

This post explores how to effectively architect a solution that addresses this specific challenge: enabling comprehensive analytics capabilities for global teams while making sure that your data remains in the AWS Regions required by your compliance framework. We use a variety of AWS services, including Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon QuickSight.

It’s important to note that this solution focuses primarily on data residency (where data is stored) and not on preventing data from being in transit between Regions. Organizations with strict data transit restrictions might need additional controls beyond what’s covered here. We show how you can configure AWS across Regions to help meet business needs and regulatory requirements simultaneously.

Cross-Region architecture requirements

Before implementing a cross-Region solution, it’s important to understand when this approach is actually necessary. Although single-Region deployments offer simplicity and cost advantages, several specific business and regulatory scenarios warrant a cross-Region approach:

  • Data sovereignty and residency requirements – When regulations like GDPR, HIPAA, or local data sovereignty laws require data to remain in specific geographic boundaries while still enabling global analytics capabilities
  • Global operations with local compliance – When your organization operates globally, but needs to adhere to regional compliance frameworks while maintaining unified analytics
  • Performance optimization for global users – When your organization needs to optimize analytics performance for users in different geographic areas while centralizing data governance
  • Enhanced business continuity – When your analytics capabilities need higher availability and Regional redundancy to support mission-critical business processes

Use case: Financial services analytics with Regional data residency

Consider a financial services company with the following business and regulatory requirements:

  • Data residency requirement – All customer financial data must remain in the Bahrain Region (me-south-1) to comply with local financial regulations.
  • Global analytics capability – The organization’s data science team operates from European offices and needs to access and analyze the financial data without moving it out of its mandated storage Region.
  • Advanced analytics requirements – Business leaders need interactive data exploration and natural language query capabilities to derive insights from financial data.
  • Performance requirement – Specific dashboard queries require subsecond response times for both local executives and the global management team.

This specific combination of requirements can’t be met with a single-Region deployment. Let’s explore how to architect a solution.

Solution overview

The following architecture is designed to address the specific challenge of using QuickSight in one Region while maintaining data in another Region.

As shown in the architecture diagram, data engineers based in Bahrain (me-south-1) work with local data, whereas data engineers in Stockholm (eu-north-1) and analysts in Ireland (eu-west-1) can securely access the same data through Redshift datashares and virtual private cloud (VPC) peering connections. This approach maintains data residency in me-south-1 while enabling global access.

The solution consists of the following key components:

  • Primary data Region (me-south-1):
    • Redshift cluster (primary data repository)
    • S3 buckets for data lake storage
    • Private and public subnets with appropriate security controls
    • Data must remain in this Region for compliance reasons
  • Analytics services Region (eu-west-1):
    • QuickSight deployment
    • Cross-Region VPC peering connection to the primary Region
    • Data access using Redshift datashares (no data replication)
  • Data engineering Region (eu-north-1):
    • Redshift consumer cluster for data engineering workloads
    • Data access using Redshift datashares from me-south-1
    • Makes it possible for data engineering teams in eu-north-1 to access and work with data while maintaining compliance

Before implementing this architecture, evaluate whether:

  • Your requirements actually necessitate a cross-Region approach
  • The performance impact is acceptable for your use case
  • The additional cost is justified by your business requirements

For most analytics workloads, a single-Region architecture remains the recommended approach for simplicity, performance, and cost-effectiveness. Consider cross-Region architectures only when specific business and compliance requirements make them necessary.

Establish cross-Region network connectivity: Amazon Redshift to QuickSight

The foundation of a cross-Region solution is secure, reliable network connectivity. VPC peering provides a straightforward approach for connecting VPCs across Regions. To implement VPC peering in Amazon Virtual Private Cloud (Amazon VPC), complete the following steps:

  1. Create a new VPC in the secondary Region (eu-west-1):
    1. Open the Amazon VPC console in the eu-west-1 Region.
    2. Choose Create VPC.
    3. Set IPv4 CIDR block to 172.32.0.0/16 (verify there is no overlap with the primary Region VPC).
    4. Select Auto-generate to create subnets automatically within this new VPC.
    5. Leave other settings as default and choose Create VPC.

  1. Set up VPC peering:
    1. On the Amazon VPC console, choose Peering connections in the navigation pane and choose Create peering connection.
    2. Select the new eu-west-1 VPC as the requester.
    3. For Select another VPC to peer with, select My account and Another Region.
    4. Choose the primary Region (me-south-1) and enter the VPC ID.
    5. Choose Create peering connection.

  1. Accept the VPC peering connection:
    1. Switch to the primary Region on the Amazon VPC console.
    2. Choose Peering connections in the navigation pane and select the pending connection.
    3. On the Actions dropdown menu, choose Accept request.

  1. Update the route tables:
    1. On the  secondary Region Amazon VPC console, choose Route tables in the navigation pane.
    2. Choose the route table for the new VPC.
    3. Choose Edit routes and add a new route:
      • Destination: Primary Region VPC CIDR (e.g., 172.31.0.0/16).
      • Target: Choose the peering connection.
    4. On the primary Region Amazon VPC console, repeat the process, adding a route to the secondary Region VPC CIDR (172.32.0.0/16) using the peering connection.

  1. Configure security groups:
    1. On the secondary Region Amazon VPC console, choose Security groups in the navigation pane and create a new security group.
    2. Add an outbound rule:
      • Type: Custom TCP
      • Port range: 5439
      • Destination: Primary Region VPC CIDR

    3. On the primary Region Amazon VPC console, locate the Redshift cluster’s security group.
    4. Add an inbound rule:
      • Type: Custom TCP
      • Port range: 5439
      • Source: Secondary Region VPC CIDR

  1. Configure DNS settings:
    1. On the Amazon VPC console for both Regions, choose Your VPCs in the navigation pane.
    2. Select each VPC, and on the Actions dropdown menu, choose Edit DNS hostnames.
    3. Select Enable DNS resolution and Enable DNS hostnames.

Implement cross-Region data sharing

Rather than replicating data, which could create compliance issues, you can use Redshift datashares to provide secure, read-only access to data across Regions. Complete the following steps to set up your datashares:

  1. Create producer datashares in the primary Region:
    1. On the Amazon Redshift console, choose Query editor v2 in the navigation pane to connect to your primary Region Redshift cluster (me-south-1).
    2. Run the following commands:
      -- In Primary Region Redshift
      
      CREATE DATASHARE datashare_1;
      ALTER DATASHARE datashare_1 ADD SCHEMA analytics;
      ALTER DATASHARE datashare_1 ADD TABLE analytics.customers;
      ALTER DATASHARE datashare_1 ADD TABLE analytics.transactions;
      
      -- Grant usage permissions
      	
      GRANT USAGE ON DATASHARE datashare_1 TO ACCOUNT '123456789012';

  1. Create a consumer database in the secondary Region:
  2. Connect to your secondary Region Redshift cluster (eu-west-1) using the query editor and run the following commands:
    -- In Secondary Region Redshift
    
    CREATE DATABASE consumer_db FROM DATASHARE datashare_1 OF ACCOUNT '123456789012'REGION 'me-south-1';
  3. Verify the datashare configuration with the following code:
    -- In Secondary Region Redshift
    
    SELECT * FROM SVV_DATASHARE_CONSUMERS;
    SELECT * FROM SVV_DATASHARE_OBJECTS; 

This approach maintains data residency in the primary Region while enabling analytics access from another Region, addressing the core challenge of Regional service limitations. For our financial services company example, this makes sure that customer financial data remains in Bahrain (me-south-1) while making it securely accessible to the data science team in Europe (eu-west-1).

Configure QuickSight in the analytics Region

With network connectivity and data sharing established, complete the following steps to configure QuickSight to securely access the Redshift data:

  1. Set up a QuickSight VPC connection:
    1. Open the QuickSight console in the secondary Region.
    2. Choose Manage QuickSight, VPC connections, and Add VPC connection.
    3. Configure the connection:
      • Name: Enter a name (for example, Cross-Region-Connection).
      • VPC: Choose the secondary Region VPC.
      • Subnet: Choose the automatically created subnets.
      • Security group: Choose the security group created for cross-Region access.

  1. Add a QuickSight IP range to the data source security group:
    1. Open the Amazon Elastic Compute Cloud (Amazon EC2) console in the primary Region.
    2. Choose Security groups in the navigation pane and find the security group for your data source.
    3. Edit the inbound rules.
    4. Add a new rule:
      • Type: HTTPS (443)
      • Protocol: TCP
      • Port range: 443
      • Source: QuickSight IP range for the secondary Region (for example, 52.210.255.224/27 for eu-west-1).

QuickSight IP ranges can change over time. Refer to AWS Regions, websites, IP address ranges, and endpoints for current IP ranges.

  1. Create a QuickSight data source:
    1. On the QuickSight console, choose Datasets in the navigation pane.
    2. Choose New dataset, then choose Redshift.
    3. Configure the connection:
      • Data source name: Enter a descriptive name.
      • Connection type: Choose the VPC connection.
      • Database server: Enter the Redshift cluster endpoint from the primary Region.
      • Port: 5439
      • Database name: Enter the consumer database name.
      • Username and Password: Enter credentials (consider using AWS Secrets Manager).
    4. Choose Validate connection to test.
    5. Choose Create data source.

  1. Verify the connection and create datasets:
    1. Choose the schema and tables from the consumer database.
    2. Configure appropriate refresh schedules.
    3. Create calculations and visualizations as needed.

Performance considerations for cross-Region analytics

When implementing a cross-Region analytics architecture, be aware of the following performance implications:

  • Query performance impact – Cross-Region queries can experience higher latency than single-Region queries. To mitigate this, consider the following:
    • Use SPICE for QuickSight – Import frequently-used datasets into SPICE (Super-fast, Parallel, In-memory Calculation Engine) to help avoid repeated cross-Region queries. SPICE is the QuickSight in-memory engine that enables fast, interactive visualizations by precomputing and storing datasets locally in the QuickSight Region.
    • Implement efficient query patterns – Minimize the amount of data transferred between Regions.
    • Use appropriate caching – Enable result caching where possible.
    • Monitoring cross-Region performance – Implement monitoring to identify and address performance issues:
      • Set up Amazon CloudWatch metrics to track cross-Region query performance
      • Create dashboards to visualize latency trends
      • Establish performance baselines and alerts for degradation

Security considerations

Maintaining security in a cross-Region architecture requires additional attention:

  • Network security:
    • Limit VPC peering connections to only necessary VPCs
    • Implement restrictive security groups that allow only required traffic
    • Consider using VPC endpoints for service access when possible
  • Data access controls:
    • Use AWS Identity and Access Management (IAM) policies consistently across Regions
    • Implement fine-grained access controls in Redshift datashares
    • Enable audit logging in relevant Regions
  • Compliance monitoring:
    • Implement AWS CloudTrail in all Regions
    • Create centralized logging for cross-Region activities
    • Regularly review cross-Region access patterns

Cost implications

Before implementing a cross-Region architecture, consider these cost factors:

  • Data transfer costs – Data transfer between Regions incurs charges
  • Additional infrastructure – You might need Redshift clusters in multiple Regions
  • VPC peering costs – Data transfer costs are associated with VPC peering
  • Operational overhead – Managing multi-Region deployments requires additional resources
  • Workload-based sizing – You should size each Regional Redshift cluster according to the specific workloads it will handle

Conclusion

The cross-Region architecture described in this post addresses specific challenges related to Regional compliance requirements and global analytics needs, particularly in the following scenarios:

  • Your data must remain in a specific Region for compliance reasons
  • You have teams in different Regions who need to access and analyze this data
  • Different user groups have distinct workload requirements

The datasharing capabilities of Amazon Redshift and Regional storage options in Amazon S3 are key enablers of this solution, allowing data to remain in the required Region while still being accessible for analytics across Regions. However, it’s worth emphasizing that this architecture supports data storage in specific Regions but doesn’t prevent data from traveling between Regions during processing. Organizations concerned about data transit restrictions should evaluate additional controls to address those specific requirements. Combined with secure VPC peering connections and QuickSight visualizations, this architecture creates a complete solution that satisfies both compliance requirements and business needs.

For our financial services example, this architecture successfully enables the company to keep its customer financial data in Bahrain while providing seamless analytics capabilities to the European data science team and delivering interactive dashboards to global business leaders.

For more information, refer to Building a Cloud Security Posture Dashboard with Amazon QuickSight. For hands-on experience, explore the Amazon QuickSight Workshops. Visit the Amazon Redshift console or Amazon QuickSight console to start building your first dashboard, and explore our AWS Big Data Blog for more customer success stories and implementation patterns

Try out this solution for your own use case, and share your thoughts in the comments.


About the Authors

Donatas Kuchalskis is a Cloud Operations Architect at AWS, based in London, focusing on Financial Services customers in the UK. He helps customers optimize their AWS environments for cost, security, and resiliency while providing strategic cloud guidance. Prior to this role, he served as a Prototyping Architect specializing in Big Data and as a Specialist Solutions Architect for Retail. Before joining AWS, Donatas spent 6 years as a technical consultant in the retail sector.

Jumana Nagaria is a Prototyping Architect at AWS. She builds innovative prototypes with customers to solve their business challenges. She is passionate about cloud computing and data analytics. Outside of work, Jumana enjoys travelling, reading, painting, and spending quality time with friends and family.

Reduce time to access your transactional data for analytical processing using the power of Amazon SageMaker Lakehouse and zero-ETL

Post Syndicated from Avijit Goswami original https://aws.amazon.com/blogs/big-data/reduce-time-to-access-your-transactional-data-for-analytical-processing-using-the-power-of-amazon-sagemaker-lakehouse-and-zero-etl/

As the lines between analytics and AI continue to blur, organizations find themselves dealing with converging workloads and data needs. Historical analytics data is now being used to train machine learning models and power generative AI applications. This shift requires shorter time to value and tighter collaboration among data analysts, data scientists, machine learning (ML) engineers, and application developers. However, the reality of scattered data across various systems—from data lakes to data warehouses and applications—makes it difficult to access and use data efficiently. Moreover, organizations attempting to consolidate disparate data sources into a data lakehouse have historically relied on extract, transform, and load (ETL) processes, which have become a significant bottleneck in their data analytics and machine learning initiatives. Traditional ETL processes are often complex, requiring significant time and resources to build and maintain. As data volumes grow, so do the costs associated with ETL, leading to delayed insights and increased operational overhead. Many organizations find themselves struggling to efficiently onboard transactional data into their data lakes and warehouses, hindering their ability to derive timely insights and make data-driven decisions. In this post, we address these challenges with a two-pronged approach:

  • Unified data management: Using Amazon SageMaker Lakehouse to get unified access to all your data across multiple sources for analytics and AI initiatives with a single copy of data, regardless of how and where the data is stored. SageMaker Lakehouse is powered by AWS Glue Data Catalog and AWS Lake Formation and brings together your existing data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses with integrated access controls. In addition, you can ingest data from operational databases and enterprise applications to the lakehouse in near real-time using zero-ETL which is a set of fully-managed integrations by AWS that eliminates or minimizes the need to build ETL data pipelines.
  • Unified development experience: Using Amazon SageMaker Unified Studio to discover your data and put it to work using familiar AWS tools for complete development workflows, including model development, generative AI application development, data processing, and SQL analytics, in a single governed environment.

In this post, we demonstrate how you can bring transactional data from AWS OLTP data stores like Amazon Relational Database Service (Amazon RDS) and Amazon Aurora flowing into Redshift using zero-ETL integrations to SageMaker Lakehouse Federated Catalog (Bring your own Amazon Redshift into SageMaker Lakehouse). With this integration, you can now seamlessly onboard the changed data from OLTP systems to a unified lakehouse and expose the same to analytical applications for consumptions using Apache Iceberg APIs from new SageMaker Unified Studio. Through this integrated environment, data analysts, data scientists, and ML engineers can use SageMaker Unified Studio to perform advanced SQL analytics on the transactional data.

Architecture patterns for a unified data management and unified development experience

In this architecture pattern, we show you how to use zero-ETL integrations to seamlessly replicate transactional data from Amazon Aurora MySQL-Compatible Edition, an operational database, into the Redshift Managed Storage layer. This zero-ETL approach eliminates the need for complex data extraction, transformation, and loading processes, enabling near real-time access to operational data for analytics. The transferred data is then cataloged using a federated catalog in the SageMaker Lakehouse Catalog and exposed through the Iceberg Rest Catalog API, facilitating comprehensive data analysis by consumer applications.

You then use SageMaker Unified Studio, to perform advanced analytics on the transactional data bridging the gap between operational databases and advanced analytics capabilities.

Prerequisites

Make sure that you have the following prerequisites:

Deployment steps

In this section, we share steps for deploying resources needed for Zero-ETL integration using AWS CloudFormation.

Setup resources with CloudFormation

This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use. The CloudFormation template provisions the following components:

  1. An Aurora MySQL provisioned cluster (source).
  2. An Amazon Redshift Serverless data warehouse (target).
  3. Zero-ETL integration between the source (Aurora MySQL) and target (Amazon Redshift Serverless). See Aurora zero-ETL integrations with Amazon Redshift for more information.

Create your resources

To create resources using AWS Cloudformation, follow these steps:

  1. Sign in to the AWS Management Console.
  2. Select the us-east-1 AWS Region in which to create the stack.
  3. Open the AWS CloudFormation
  4. Choose Launch Stack
    https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/template?templateURL=https://aws-blogs-artifacts-public.s3.us-east-1.amazonaws.com/BDB-4866/aurora-zero-etl-redshift-lakehouse-cfn.yaml
  5. Choose Next.
    This automatically launches CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console.
  6. For Stack name, enter a stack name, for example UnifiedLHBlogpost.
  7. Keep the default values for the rest of the Parameters and choose Next.
  8. On the next screen, choose Next.
  9. Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Submit.

Stack creation can take up to 30 minutes.

  1. After the stack creation is complete, go to the Outputs tab of the stack and record the values of the keys for the following components, which you will use in a later step:
    • NamespaceName
    • PortNumber
    • RDSPassword
    • RDSUsername
    • RedshiftClusterSecurityGroupName
    • RedshiftPassword
    • RedshiftUsername
    • VPC
    • Workgroupname
    • ZeroETLServicesRoleNameArn

Implementation steps

To implement this solution, follow these steps:

Setting up zero-ETL integration

A zero-ETL integration is already created as a part of CloudFormation template provided. Use the following steps from the Zero-ETL integration post to complete setting up the integration.:

  1. Create a database from integration in Amazon Redshift
  2. Populate source data in Aurora MySQL
  3. Validate the source data in your Amazon Redshift data warehouse

Bring Amazon Redshift metadata to the SageMaker Lakehouse catalog

Now that transactional data from Aurora MySQL is replicating into Redshift tables through zero-ETL integration, you next bring the data into SageMaker Lakehouse, so that operational data can co-exist and be accessed and governed together with other data sources in the data lake. You do this by registering an existing Amazon Redshift Serverless namespace that has Zero-ETL tables as a federated catalog in SageMaker Lakehouse.

Before starting the next steps, you need to configure data lake administrators in AWS Lake Formation.

  1. Go to the Lake Formation console and in the navigation pane, choose Administration roles and then choose Tasks under Administration. Under Data lake administrators, choose Add.
  2. In the Add administrators page, under Access type, select Data Lake administrator.
  3. Under IAM users and roles, select Admin. Choose Confirm.

Add AWS Lake Formation Administrators

  1. On the Add administrators page, for Access type, select Read-only administrators. Under IAM users and roles, select AWSServiceRoleForRedshift and choose Confirm. This step enables Amazon Redshift to discover and access catalog objects in AWS Glue Data Catalog.

Add AWS Lake Formation Administrators 2

With the data lake administrators configured, you’re ready to bring your existing Amazon Redshift metadata to SageMaker Lakehouse catalog:

  1. From the Amazon Redshift Serverless console, choose Namespace configuration in the navigation pane.
  2. Under Actions, choose Register with AWS Glue Data Catalog. You can find more details on registering a federated Amazon Redshift catalog in Registering namespaces to the AWS Glue Data Catalog.

  1. Choose Register. This will register the namespace to AWS Glue Data Catalog

  1. After registration is complete, the Namespace register status will change to Registered to AWS Glue Data Catalog.
  2. Navigate to the Lake Formation console and choose Catalogs New under Data Catalog in the navigation pane. Here you can see a pending catalog invitation is available for the Amazon Redshift namespace registered in Data Catalog.

  1. Select the pending invitation and choose Approve and create catalog. For more information, see Creating Amazon Redshift federated catalogs.

  1. Enter the Name, Description, and IAM role (created by the CloudFormation template). Choose Next.

  1. Grant permissions using a principal that is eligible to provide all permissions (an admin user).
    • Select IAM users and rules and choose Admin.
    • Under Catalog permissions, select Super user to grant super user permissions.

  1. Assigning super user permissions grants the user unrestricted permissions to the resources (databases, tables, views) within this catalog. Follow the principal of least privilege to grant users only the permissions required to perform a task wherever applicable as a security best practice.

  1. As final step, review all settings and choose Create Catalog

After the catalog is created, you will see two objects under Catalogs. dev refers to the local dev database inside Amazon Redshift, and aurora_zeroetl_integration is the database created for Aurora to Amazon Redshift ZeroETL tables

Fine-grained access control

To set up fine-grained access control, follow these steps:

  1. To grant permission to individual objects, choose Action and then select Grant.

  1. On the Principals page, grant access to individual objects or more than one object to different principals under the federated catalog.

Access lakehouse data using SageMaker Unified Studio

SageMaker Unified Studio provides an integrated experience outside the console to use all your data for analytics and AI applications. In this post, we show you how to use the new experience through the Amazon SageMaker management console to create a SageMaker platform domain using the quick setup method. To do this, you set up IAM Identity Center, a SageMaker Unified Studio domain, and then access data through SageMaker Unified Studio.

Set up IAM Identity Center

Before creating the domain, makes sure that your data admins and data workers are ready to use the Unified Studio experience by enabling IAM Identity Center for single sign-on following the steps in Setting up Amazon SageMaker Unified Studio. You can use Identity Center to set up single sign-on for individual accounts and for accounts managed through AWS Organizations. Add users or groups to the IAM instance as appropriate. The following screenshot shows an example email sent to a user through which they can activate their account in IAM Identity Center.

Set up SageMaker Unified domain

Follow steps in Create a Amazon SageMaker Unified Studio domain – quick setup to set up a SageMaker Unified Studio domain. You need to choose the VPC that was created by the CloudFormation stack earlier.

The quick setup method also has a Create VPC option that sets up a new VPC, subnets, NAT Gateway, VPC endpoints, and so on, and is meant for testing purposes. There are charges associated with this, so delete the domain after testing.

If you see the No models accessible, you can use the Grant model access button to grant access to Amazon Bedrock serverless models for use in SageMaker Unified Studio, for AI/ML use-cases

  1. Fill in the sections for Domain Name. For example, MyOLTPDomain. In the VPC section, select the VPC that was provisioned by the CloudFormation stack, for example UnifiedLHBlogpost-VPC. Select subnets and choose Continue.

  1. In the IAM Identity Center User section, look up the newly created user from (for example, Data User1) and add them to the domain. Choose Create Domain. You should see the new domain along with a link to open Unified Studio.

Access data using SageMaker Unified Studio

To access and analyze your data in SageMaker Unified Studio, follow these steps:

    1. Select the URL for SageMaker Unified Studio. Choose Sign in with SSO and sign in using the IAM user, for example datauser1, and you will be prompted to select a multi-factor authentication (MFA) method.
    2. Select Authenticator App and proceed with next steps. For more information about SSO setup, see Managing users in Amazon SageMaker Unified Studio.After you have signed in to the Unified Studio domain, you need to set up a new project. For this illustration, we created a new sample project called MyOLTPDataProject using the project profile for SQL Analytics as shown here.A project profile is a template for a project that defines what blueprints are applied to the project, including underlying AWS compute and data resources. Wait for the new project to be set up, and when status is Active, open the project in Unified Studio.By default, the project will have access to the default Data Catalog (AWSDataCatalog). For the federated redshift catalog redshift-consumer-catalog to be visible, you need to grant permissions to the project IAM role using Lake Formation. For this example, using the Lake Formation console, we have granted below access to the demodb database that is part of the Zero-ETL catalog to the Unified Studio project IAM role. Follow steps in Adding existing databases and catalogs using AWS Lake Formation permissions.In your SageMaker Unified Studio Project’s Data section, connect to the Lakehouse Federated catalog that you created and registered earlier (for example redshift-zetl-auroramysql-catalog/aurora_zeroetl_integration). Select the objects that you want to query and execute them using the Redshift Query Editor integrated with SageMaker Unified Studio.If you select Redshift, you will be transferred to the Query editor where you can execute the SQL and see the results as shown in the following figure.

With this integration of Amazon Redshift metadata with SageMaker Lakehouse federated catalog, you have access to your existing Redshift data warehouse objects in your organizations centralized catalog managed by SageMaker Lakehouse catalog and join the existing Redshift data seamlessly with the data stored in your Amazon S3 data lake. This solution helps you avoid unnecessary ETL processes to copy data between the data lake and the data warehouse and minimize data redundancy.

You can further integrate more data sources serving transactional workloads such as Amazon DynamoDB and enterprise applications such as Salesforce and ServiceNow. The architecture shared in this post for accelerated analytical processing using Zero-ETL and SageMaker Lakehouse can be further expanded by adding Zero-ETL integrations for DynamoDB using DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse and for enterprise applications by following the instructions in Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

Clean up

When you’re finished, delete the CloudFormation stack to avoid incurring costs for some of the AWS resources used in this walkthrough incur a cost. Complete the following steps:

  1. On the CloudFormation console, choose Stacks.
  2. Choose the stack you launched in this walkthrough. The stack must be currently running.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack.
  5. On the Sagemaker console, choose Domains and delete the domain created for testing.

Summary

In this post, you’ve learned how to bring data from operational databases and applications into your lake house in near real-time through Zero-ETL integrations. You’ve also learned about a unified development experience to create a project and bring in the operational data to the lakehouse, which is accessible through SageMaker Unified Studio, and query the data using integration with Amazon Redshift Query Editor. You can use the following resources in addition to this post to quickly start your journey to make your transactional data available for analytical processing.

  1. AWS zero-ETL
  2. SageMaker Unified Studio
  3. SageMaker Lakehouse
  4. Getting started with Amazon SageMaker Lakehouse


About the authors

Avijit Goswami is a Principal Data Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open-source solutions. Outside of his work, Avijit likes to travel, hike in the San Francisco Bay Area trails, watch sports, and listen to music.

Saman Irfan is a Senior Specialist Solutions Architect focusing on Data Analytics at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

Sudarshan Narasimhan is a Principal Solutions Architect at AWS specialized in data, analytics and databases. With over 19 years of experience in Data roles, he is currently helping AWS Partners & customers build modern data architectures. As a specialist & trusted advisor he helps partners build & GTM with scalable, secure and high performing data solutions on AWS. In his spare time, he enjoys spending time with his family, travelling, avidly consuming podcasts and being heartbroken about Man United’s current state.

Architecture patterns to optimize Amazon Redshift performance at scale

Post Syndicated from Eddie Yao original https://aws.amazon.com/blogs/big-data/architecture-patterns-to-optimize-amazon-redshift-performance-at-scale/

Tens of thousands of customers use Amazon Redshift as a fully managed, petabyte-scale data warehouse service in the cloud. As an organization’s business data grows in volume, the data analytics need also grows. Amazon Redshift performance needs to be optimized at scale to achieve faster, near real-time business intelligence (BI). You might also consider optimizing Amazon Redshift performance when your data analytics workloads or user base increases, or to meet a data analytics performance service level agreement (SLA). You can also look for ways to optimize Amazon Redshift data warehouse performance after you complete an online analytical processing (OLAP) migration from another system to Amazon Redshift.

In this post, we will show you five Amazon Redshift architecture patterns that you can consider to optimize your Amazon Redshift data warehouse performance at scale using features such as Amazon Redshift Serverless, Amazon Redshift data sharing, Amazon Redshift Spectrum, zero-ETL integrations, and Amazon Redshift streaming ingestion.

Use Amazon Redshift Serverless to automatically provision and scale your data warehouse capacity

To start, let’s review using Amazon Redshift Serverless to automatically provision and scale your data warehouse capacity. The architecture is shown in the following diagram and includes different components within Amazon Redshift Serverless like ML-based workload monitoring and automatic workload management.

Amazon Redshift Serverless architecture diagram

Amazon Redshift Serverless architecture diagram

Amazon Redshift Serverless is a deployment model that you can use to run and scale your Redshift data warehouse without managing infrastructure. Amazon Redshift Serverless will automatically provision and scale your data warehouse capacity to deliver fast performance for even the most demanding, unpredictable, or massive workloads.

Amazon Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis. You can optionally configure your Base, Max RPU-Hours, and MaxRPU parameters to modify your warehouse performance costs. This post dives deep into understanding cost mechanisms to consider when managing Amazon Redshift Serverless.

Amazon Redshift Serverless scaling is automatic and based on your RPU capacity. To further optimize scaling operations for large scale datasets, Amazon Redshift Serverless has AI-driven scaling and optimization. It uses AI to scale automatically with workload changes across key metrics such as data volume changes, concurrent users, and query complexity, accurately meeting your price performance targets.

There is no maintenance window in Amazon Redshift Serverless, because software version updates are applied automatically. This maintenance occurs with no interruptions for any existing connections or query executions. Make sure to consult the considerations guide to better understand the operation of Amazon Redshift Serverless.

You can migrate from an existing provisioned Amazon Redshift data warehouse to Amazon Redshift Serverless by creating a snapshot of your current provisioned data warehouse and then restoring that snapshot in Amazon Redshift Serverless. Amazon Redshift will automatically convert interleaved keys to compound keys when you restore a provisioned data warehouse snapshot to a Serverless namespace. You can also get started with a new Amazon Redshift Serverless data warehouse.

Amazon Redshift Serverless use cases

You can use Amazon Redshift Serverless for:

  • Self-service analytics
  • Auto scaling for unpredictable or variable workloads
  • New applications
  • Multi-tenant applications

With Amazon Redshift, you can access and query data stored in Amazon S3 Tables – fully managed Apache Iceberg tables optimized for analytics workloads. Amazon Redshift also supports querying data stored using Apache Iceberg tables, and other open table formats like Apache Hudi and Linux Foundation Delta Lake, for more information see External tables for Redshift Spectrum and Expand data access through Apache Iceberg using Delta Lake UniForm on AWS.

You can also use Amazon Redshift Serverless with Amazon Redshift data sharing, which can automatically scale your large dataset in independent datashares and maintain workload isolation controls.

Amazon Redshift data sharing to share live data between separate Amazon Redshift data warehouses

Next, we will look at an Amazon Redshift data sharing architecture pattern, shown in below diagram, to share data between a hub Amazon Redshift data warehouse and spoke Amazon Redshift data warehouses , and to share data across multiple Amazon Redshift data warehouses with each other.

Amazon Redshift data sharing architecture patterns diagram

Amazon Redshift data sharing architecture patterns diagram

With Amazon Redshift data sharing, you can securely share access to live data between separate Amazon Redshift data warehouses without manually moving or copying the data. Because the data is live, all users can see the most up-to-date and consistent information in Amazon Redshift as soon as it’s updated using separate dedicated resources. Because the compute accessing the data is isolated, you can size the data warehouse configurations to individual workload price performance requirements rather than the aggregate of all workloads. This also provides additional flexibility to scale with new workloads without affecting the workloads already being run on Amazon Redshift.

A datashare is the unit of sharing data in Amazon Redshift. A producer data warehouse administrator can create datashares and add datashare objects to share data with other data warehouses, referred to as outbound shares. A consumer data warehouse administrator can receive datashares from other data warehouses, referred to as inbound shares.

To get started, a producer data warehouse needs to add all objects (and potential permissions) that need to be accessed by another data warehouse to a datashare, and share that datashare with a consumer. After that consumer creates a database from the datashare, the shared objects can be accessed using three-part notation consumer_database_name.schema_name.table_name on the consumer, using the consumer’s compute.

Amazon Redshift data sharing use cases

Amazon Redshift data sharing, including multi-warehouse writes in Amazon Redshift, can be used to:

  • Support different kinds of business-critical workloads, including workload isolation and chargeback for individual workloads.
  • Enable cross-group collaboration across teams for broader analytics, data science, and cross-product impact analysis.
  • Deliver data as a service.
  • Share data between environments to improve team agility by sharing data at different granularity levels such as development, test, and production.
  • License access to data in Amazon Redshift by listing Amazon Redshift data sets in the AWS Data Exchange catalog so that customers can find, subscribe to, and query the data in minutes.
  • Update business source data on the producer. You can share data as a service across your organization, but then consumers can also perform actions on the source data.
  • Insert additional records on the producer. Consumers can add records to the original source data.

The following articles provide examples of how you can use Amazon Redshift data sharing to scale performance:

Amazon Redshift Spectrum to query data in Amazon S3

You can use Amazon Redshift Spectrum to query data in , as shown in below diagram using AWS Glue Data Catalog.

Amazon Redshift Spectrum architecture diagram

Amazon Redshift Spectrum architecture diagram

You can use Amazon Redshift Spectrum to efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to directly load data into Amazon Redshift tables. Using the large, parallel scale of the Amazon Redshift Spectrum layer, you can run massive, fast, parallel queries against large datasets while most of the data remains in Amazon S3. This can significantly improve the performance and cost-effectiveness of massive analytics workloads, because you can use the scalable storage of Amazon S3 to handle large volumes of data while still benefiting from the powerful query processing capabilities of Amazon Redshift.

Amazon Redshift Spectrum uses separate infrastructure independent of your Amazon Redshift data warehouse, offloading many compute-intensive tasks, such as predicate filtering and aggregation. This means that you can use significantly less data warehouse processing capacity than other queries. Amazon Redshift Spectrum can also automatically scale to potentially thousands of instances, based on the demands of your queries.

When implementing Amazon Redshift Spectrum, make sure to consult the considerations guide which details how to configure your networking, external table creation, and permissions requirements.

Review this best practices guide and this blog post, which outlines recommendations on how to optimize performance including the impact of different file types, how to design around the scaling behavior, and how you can efficiently partition files. You can check out an example architecture in Accelerate self-service analytics with Amazon Redshift Query Editor V2.

To get started with Amazon Redshift Spectrum, you define the structure for your files and register them as an external table in an external data catalog (AWS Glue, Amazon Athena, and Apache Hive metastore are supported). After creating your external table, you can query your data in Amazon S3 directly from Amazon Redshift.

Amazon Redshift Spectrum use cases

You can use Amazon Redshift Spectrum in the following use cases:

  • Huge volume but less frequently accessed data, build lake house architecture to query exabytes of data in an S3 data lake
  • Heavy scan- and aggregation-intensive queries
  • Selective queries that can use partition pruning and predicate pushdown, so the output is fairly small

Zero-ETL to unify all data and achieve near real-time analytics

You can use Zero-ETL integration with Amazon Redshift to integrate with your transactional databases like Amazon Aurora MySQL-Compatible Edition, so you can run near real-time analytics in Amazon Redshift, or BI in Amazon QuickSight, or machine learning workload in Amazon SageMaker AI, shown in below diagram.

Zero-ETL integration with Amazon Redshift architecture diagram

Zero-ETL integration with Amazon Redshift architecture diagram

Zero-ETL integration with Amazon Redshift removes the undifferentiated heavy lifting to build and manage complex extract, transform, and load (ETL) data pipelines; unifies data across databases, data lakes, and data warehouses; and makes data available in Amazon Redshift in near real time for analytics, artificial intelligence (AI) and machine learning (ML) workloads.

Currently Amazon Redshift supports the following zero-ETL integrations:

To create a zero-ETL integration, you specify an integration source, such as an Amazon Aurora DB cluster, and an Amazon Redshift data warehouse, such as Amazon Redshift Serverless workgroup or a provisioned data warehouse (including Multi-AZ deployment on RA3 clusters to automatically recover from any infrastructure or Availability Zone failures and help ensure that your workloads remain uninterrupted), as the target. The integration replicates data from the source to the target and makes data available in the target data warehouse within seconds. The integration also monitors the health of the integration pipeline and recovers from issues when possible.

Make sure to review considerations, limitations, and quotas on both the data source and target when using zero-ETL integrations with Amazon Redshift.

Zero-ETL integration use cases

You can use zero-ETL integration with Amazon Redshift as an architecture pattern to boost analytical query performance at scale, enable a straightforward and secure way to create near real-time analytics on petabytes of transactional data, with continuous change-data-capture (CDC). Plus, you can use other Amazon Redshift capabilities such as built-in machine learning, materialized views, data sharing, and federated access to multiple data stores and data lakes. You can see more other zero-ETL integrations use cases at What is ETL.

Ingest streaming data into Amazon Redshift data warehouse for near real-time analytics

You can ingest streaming data with Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) to Amazon Redshift and run near real-time analytics in Amazon Redshift, as shown in the following diagram.

Amazon Redshift data streaming architecture diagram

Amazon Redshift data streaming architecture diagram

Amazon Redshift streaming ingestion provides low-latency, high-speed data ingestion directly from Amazon Kinesis Data Streams or Amazon MSK to an Amazon Redshift provisioned or Amazon Redshift Serverless data warehouse, without staging data in Amazon S3. You can connect to and access the data from the stream using standard SQL and simplify data pipelines by creating materialized views in Amazon Redshift on top of the data stream. For best practices, you can review these blog posts:

To get started on Amazon Redshift streaming ingestion, you create an external schema that maps to the streaming data source and create a materialized view that references the external schema. For details on how to set up Amazon Redshift streaming ingestion for Amazon KDS, see Getting started with streaming ingestion from Amazon Kinesis Data Streams. For details on how to set up Amazon Redshift streaming ingestion for Amazon MSK, see Getting started with streaming ingestion from Apache Kafka sources.

Amazon Redshift streaming ingestion use cases

You can use Amazon Redshift streaming ingestion to:

  • Improve gaming experience by analyzing real-time data from gamers
  • Analyze real-time IoT data and use machine learning (ML) within Amazon Redshift to improve operations, predict customer churn, and grow your business
  • Analyze clickstream user data
  • Conduct real-time troubleshooting by analyzing streaming data from log files
  • Perform near real-time retail analytics on streaming point of sale (POS) data

Other Amazon Redshift features to optimize performance

There are other Amazon Redshift features that you can use to optimize performance.

  • You can resize Amazon Redshift provisioned clusters to optimize data warehouse compute and storage use.
  • You can use concurrency scaling, where Amazon Redshift provisioning automatically adds additional capacity to process increases in read, such as dashboard queries; and write operations, such as data ingestion and processing.
  • You can also consider materialized views in Amazon Redshift, applicable to both provisioned and serverless data warehouses, which contains a precomputed result set, based on an SQL query over one or more base tables. They are especially useful for speeding up queries that are predictable and repeated.
  • You can use auto-copy for Amazon Redshift to set up continuous file ingestion from your Amazon S3 prefix and automatically load new files to tables in your Amazon Redshift data warehouse without the need for additional tools or custom solutions.

Cloud security at AWS is the highest priority. Amazon Redshift offers broad security-related configurations and controls to help ensure information is appropriately protected. See Amazon Redshift Security Best Practices for a comprehensive guide to Amazon Redshift security best practices.

Conclusion

In this post, we reviewed Amazon Redshift architecture patterns and features that you can use to help scale your data warehouse to dynamically accommodate different workload combinations, volumes, and data sources to achieve optimal price performance. You can use them alone or together—choosing the best infrastructural set up for your use case requirements—and scale to accommodate for any future growth.

Get started with these Amazon Redshift architecture patterns and features today by following the instructions provided in each section. If you have questions or suggestions, leave a comment below.


About the authors

Eddie Yao is a Principal Technical Account Manager (TAM) at AWS. He helps enterprise customers build scalable, high-performance cloud applications and optimize cloud operations. With over a decade of experience in web application engineering, digital solutions, and cloud architecture, Eddie currently focuses on Media & Entertainment (M&E) and Sports industries and AI/ML and generative AI.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.

Scott St. Martin is a Solutions Architect at AWS who is passionate about helping customers build modern applications. Scott uses his decade of experience in the cloud to guide organizations in adopting best practices around operational excellence and reliability, with a focus the manufacturing and financial services spaces. Outside of work, Scott enjoys traveling, spending time with family, and playing piano.

Powering global payout intelligence: How MassPay uses Amazon Redshift Serverless and zero-ETL to drive deeper analytics.

Post Syndicated from Yossi Shlomo original https://aws.amazon.com/blogs/big-data/powering-global-payout-intelligence-how-masspay-uses-amazon-redshift-serverless-and-zero-etl-to-drive-deeper-analytics/

Since the company was founded in 2019, MassPay’s singular objective has been to deliver frictionless global payments that power innovation and lift people, businesses, and quality of life worldwide. Today, the MassPay payment orchestration offering empowers companies to move money across borders effortlessly; enabling local payment experiences in over 175 countries and 70 currencies—including digital wallets, locally preferred alternative payment methods, and cryptocurrencies. From hyper-localized checkout experiences to instant global payouts, we orchestrate seamless financial experiences that reflect how people and businesses transact around the world.

As we have expanded globally, so has the complexity of our data. In this blog post we shall cover how understanding real-time payout performance, identifying customer behavior patterns across regions, and optimizing internal operations required more than traditional business intelligence and analytics tools. And how since implementing Amazon Redshift and Zero-ETL, we’ve seen 90% reduction in data availability latency, payments data available for analytics 1.5x faster, leading to 45% reduction in time-to-insight and 37% fewer support tickets related to transaction visibility and payment inquiries.

Unlocking deeper payout intelligence and global insights

To continue our innovation—and to continue to exceed our partners’ and customers’ expectations—we knew we needed to go beyond basic reporting. We know success is dependent upon developing a truly data-driven organization. This means tracking granular KPIs across payout success rates, payment method adoption, transaction velocity, customer onboarding funnel drop-off, and support ticket correlation. We also wanted to better forecast customer payment expectations, monitor foreign exchange cost trends, and understand market-specific nuances such as how payout timing impacts seller satisfaction in social commerce ecosystems.

We didn’t just want more data. We wanted faster, smarter insights that would shape decisions in real time. Being a data-driven organization means our teams don’t guess. They know. And that gives us, our partners, and our customers real operational and competitive advantages.

– Yossi Schlomo, Director of Payment Systems Architecture

MySQL databases, CSV exports, and third-party reporting tools wouldn’t support the scale or speed we needed to deliver.

Choosing AWS: A scalable and integrated analytics foundation

We chose Amazon Web Services (AWS) for our data infrastructure and to accelerate our analytics capabilities.

At the core of our stack is Amazon Redshift Serverless with AI-driven scaling and optimizations enabled, which gives us scalable, fast, and cost-efficient analytics without the burden of managing infrastructure. Coupled with Amazon Aurora MySQL-Compatible Edition as our transactional data store and Amazon Redshift zero-ETL integration, we eliminated manual data pipelines altogether. Transactional data flows into Amazon Redshift in near real-time, instantly powering dashboards, alerts, and machine learning (ML) models.

This data feeds interactive dashboards—both internally and embedded within our platform for customers. Now, executives, operations leads, and customer success teams can drill into payout performance by region, merchant, or payment method, while customers get real-time visibility into their own payout analytics as part of our platform experience. The architecture is shown in the following figure.

MassPay Zero-ETL architecture with Amazon Redshift Serverless

MassPay Zero-ETL architecture with Amazon Redshift Serverless

Why it’s different and what it unlocked

Without Amazon Redshift Serverless and zero-ETL, we would have had to invest in costly custom data pipelines, maintain separate exchange, transform, and load (ETL) infrastructure, and manually manage data freshness. The integration with Aurora MySQL-Compatible is seamless and reduces our analytics latency from minutes to seconds.

Our differentiator is simple: We operationalize not just transactions but analytics for global payments. Most platforms can tell you if a transaction went through. For payments and payouts, MassPay can tell you how fast it went, what it cost, what method was most effective, and what that means for your business in real time.

– Yossi Schlomo, Director of Payment Systems Architecture

Embedded intelligence, built for scale

Every MassPay customer gets access to comprehensive payment analytics. These are accessed using our API or through a white-label dashboard (shown in the following figure). This detail is core to our product and central to our value proposition. As part of our go-to-market strategy, we showcase these capabilities in every demo, and they’ve proven to be key drivers in conversion and upsell conversations, especially with platforms targeting high-growth ecosystems.We use tiered pricing models based on transaction volume, and our embedded intelligence helps our partners and customers optimize usage and scale efficiently.

MassPay Dashboard

MassPay Dashboard

What we’ve gained

Since implementing Amazon Redshift and Zero-ETL, we’ve seen measurable results including:

  • 90% reduction in data availability latency and data available for analytics 1.5x faster
  • 45% reduction in time-to-insight across payment and payout intelligence reports
  • 37% fewer support tickets related to transaction visibility and payment inquiries
  • Real-time Net Promoter Score (NPS) tracking correlates with payout success metrics, driving faster resolution paths

What’s next

We’re now extending our analytics model to include more advanced ML-based payout failure prediction and ML-based payment authorization prediction, FX optimization alerts, partner-level and network-level benchmarking, and much more.

Conclusion

MassPay isn’t just payments. We aren’t just payouts. We are the engine powering modern commerce. With AWS, we’re turning complex global payments infrastructure into a smart, transparent, and scalable platform for insights. For our partners, and for our customers, this means better decisions, faster payment processing, faster payouts, and truly global reach without guesswork.

We encourage you to leverage below resources to explore these features further


About the authors

Yossi Shlomo serves as the Director of Payment Systems Architecture at MassPay. Yossi is an expert in credit card payment systems, PCI compliance, and secure transaction architecture, helping global platforms process payments at scale with confidence. He specializes in building scalable, cloud-based transaction systems and optimizing global payment gateways for performance and reliability.

Milind Oke is a Amazon Redshift and SageMaker Lakehouse specialist Solutions Architect as AWS. He is based out of New York and has been building enterprise data platforms, data warehousing, and analytics solutions for customers across various domains over two decades. In the 5 years with AWS, Milind has been a speaker at worldwide technical conferences and is co-author of Amazon Redshift: The Definitive Guide: Jump-Start Analytics Using Cloud Data Warehousing 1st Edition.

Scalable analytics and centralized governance for Apache Iceberg tables using Amazon S3 Tables and Amazon Redshift

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/scalable-analytics-and-centralized-governance-for-apache-iceberg-tables-using-amazon-s3-tables-and-amazon-redshift/

Amazon Redshift supports querying data stored in Apache Iceberg tables managed by Amazon S3 Tables, which we previously covered as part of getting started blog post. While this blog post helps you to get started using Amazon Redshift with Amazon S3 Tables, there are additional steps you need to consider when working with your data in production environments, including who has access to your data and with what level of permissions.

In this post, we’ll build on the first post in this series to show you how to set up an Apache Iceberg data lake catalog using Amazon S3 Tables and provide different levels of access control to your data. Through this example, you’ll set up fine-grained access controls for multiple users and see how this works using Amazon Redshift. We’ll also review an example with simultaneously using data that resides both in Amazon Redshift and Amazon S3 Tables, enabling a unified analytics experience.

Solution overview

In this solution, we show how to query a dataset stored in Amazon S3 Tables for further analysis using data managed in Amazon Redshift. Specifically, we go through the steps shown in the following figure to load a dataset into Amazon S3 Tables, grant appropriate permissions, and finally execute queries to analyze our dataset for trends and insights.

Solution Architecture

In this post, you walk through the following steps:

  1. Creating an Amazon S3 Table bucket: In AWS Management Console for Amazon S3, create an Amazon S3 Table bucket and integrate with other AWS analytics services
  2. Creating an S3 Table and loading data: Run spark SQL in Amazon EMR to create a namespace and an S3 Table and load diabetic patients’ visit data
  3. Granting permissions: Granting fine-grained access controls in AWS Lake Formation
  4. Running SQL analytics: Querying S3 Tables using the auto mounted S3 Table catalog.

This post uses data from a healthcare use case to analyze information about diabetic patients and identify the frequency of age groups admitted to the hospital. You’ll use the preceding steps to perform this analysis.

Prerequisites

To begin, you need to add an Amazon Redshift service-linked role—AWSServiceRoleForRedshift—as a read-only administrator in Lake Formation. You can run following AWS Command Line Interface (AWS CLI) command to add the role.

Replace <account_number> with your account number and replace <region> with the AWS Region that you are using. You can run this command from AWS CloudShell or through AWS CLI configured in your environment.

aws lakeformation put-data-lake-settings \
        --region <region> \
        --data-lake-settings \
 '{
   "DataLakeAdmins": [{"DataLakePrincipalIdentifier":"arn:aws:iam::<account_number>:role/Admin"}],
   "ReadOnlyAdmins":[{"DataLakePrincipalIdentifier":"arn:aws:iam:: <account_number>:role/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift"}],
   "CreateDatabaseDefaultPermissions":[],
   "CreateTableDefaultPermissions":[],
   "Parameters":{"CROSS_ACCOUNT_VERSION":"4","SET_CONTEXT":"TRUE"}
  }'

You also need to create or use an existing Amazon Elastic Compute Cloud (Amazon EC2) key pair that will be used for SSH connections to cluster instances. For more information, see Amazon EC2 key pairs.

The examples in this post require the following AWS services and features:

The CloudFormation template that follows creates the following resources:

  • An Amazon EMR 7.6.0 cluster with Apache Iceberg packages
  • An Amazon Redshift Serverless instance
  • An AWS Identity and Access Management (IAM) instance profile, service role, and security groups
  • IAM roles with required policies
  • Two IAM users: nurse and analyst

Download the CloudFormation template, or you can use the Launch Stack button to automatically download it to your AWS environment. Note that network routes are directed to 255.255.255.255/32 for security reasons. Replace the routes with your organization’s IP addresses. Also enter your IP or VPN range for Jupyter Notebook access in the SourceCidrForNotebook parameter in CloudFormation.

Launch CloudFormation Stack

Download the diabetic encounters and patient datasets and upload it into your S3 bucket. These files are from a publicly available open dataset.

This sample dataset is used to highlight this use case, the techniques covered can be adapted to your workflows. The following are more details about this dataset:

diabetic_encounters_s3.csv: Contains information about patient visits for diabetic treatment.

  • encounter_id: Unique number to refer to an encounter with a patient who has diabetes.
  • patient_nbr: Unique number to identify a patient.
  • num_procedures: Number of medical procedures administered.
  • num_medications: Number of medications provided during the visit
  • insulin: Insulin level observed. Valid values are steady, up, and no.
  • time_in_hospital: Duration of time in hospital in days.
  • readmitted: Readmitted to hospital within 30 days or after 30 days.

diabetic_patients_rs.csv: Contains patient information such as age group, gender, race, and number of visits.

  • patient_nbr: Unique number to identify a patient
  • race: Patient’s race
  • gender: Patient’s gender
  • age_grp: Patient’s age group. Valid values are 0-10, 10-20, 20-30, and so on
  • number_outpatient: Number of outpatient visits
  • number_emergency: Number of emergency room visits
  • number_inpatient: Number of inpatient visits

Now that you’ve set up the prerequisites, you’re ready to connect Amazon Redshift to query Apache Iceberg data stored in Amazon S3 Tables.

Create an S3 Table bucket

Before you can use Amazon Redshift to query the data in an Amazon S3 Table, you must create an Amazon S3 Table.

  1. Sign in to the AWS Management Console and go to Amazon S3.
  2. Go to Amazon S3 Table buckets. This is an option in the Amazon S3 console.
  3. In the Table buckets view, there’s a section that describes Integration with AWS analytics services. Choose Enable Integration if you haven’t previously set this up. This sets up the integration with AWS analytics services, including Amazon Redshift, Amazon EMR, and Amazon Athena.
    Enable Integration
  4. Wait a few seconds for the status to change to Enabled.
    Integration Enabled
  5. Choose Create table bucket and enter a bucket name. You can use any name that follows the naming conventions. In this example, we used the bucket name patient-encounter. When you’re finished, choose Create table bucket.Create Table Bucket
  6. After the S3 Table bucket is created, you’ll be redirected to the Table buckets list. Copy the Amazon Resource Name (ARN) of the table bucket you just created to use in the next section.Table Bucket List

Now that your S3 Table bucket is set up, you can load data.

Create S3 Table and load data

The CloudFormation template in the prerequisites created an Apache Spark cluster using Amazon EMR. You’ll use the Amazon EMR cluster to load data into Amazon S3 Tables.

  1. Connect to the Apache Spark primary node using SSH or through Jupyter Notebooks. Note that an Amazon EMR cluster was launched when you deployed the CloudFormation template.
  2. Enter the following command to launch the Spark shell and initialize a Spark session for Iceberg that connects to your S3 Table bucket. Replace <Region>, <accountID> and <bucketname><bucket arn> with the information your region, account and bucket name.
    spark-shell \
      --packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160" \
      --master "local[*]" \
      --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
      --conf "spark.sql.defaultCatalog=spark_catalog" \
       --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog" \
      --conf "spark.sql.catalog.spark_catalog.type=rest" \
      --conf "spark.sql.catalog.spark_catalog.uri=https://s3tables.<Region>.amazonaws.com/iceberg" \
      --conf "spark.sql.catalog.spark_catalog.warehouse=arn:aws:s3tables:<Region>:<accountID>:bucket/<bucketname>" \
      --conf "spark.sql.catalog.spark_catalog.rest.sigv4-enabled=true" \
      --conf "spark.sql.catalog.spark_catalog.rest.signing-name=s3tables" \
      --conf "spark.sql.catalog.spark_catalog.rest.signing-region=<Region>" \
      --conf "spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \
      --conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialProvider" \
      --conf "spark.sql.catalog.spark_catalog.rest-metrics-reporting-enabled=false"               

See Accessing Amazon S3 Tables with Amazon EMR for upgrades to software.amazon.s3tables package versions.

  1. Next, create a namespace that will link your S3 Table bucket with your Amazon Redshift Serverless workgroup. We chose encounters as the namespace for this example, but you can use a different name. Use the following SparkSQL command:
    spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.encounters")

  2. Create an Apache Iceberg table with name diabetic_encounters.
    spark.sql( 
    """ CREATE TABLE IF NOT EXISTS s3tablesbucket.encounters.`diabetic_encounters` ( 
    encounter_id INT, 
    patient_nbr INT,
    num_procedures INT,
    num_medications INT,
    insulin STRING,
    time_in_hospital INT,
    readmitted STRING 
    ) 
    USING iceberg """
    )

  3. Load csv into the S3 Table encounters.diabetic_encounters. Replace <diabetic_encounters_s3.csv file location> with the Amazon S3 file path of the diabetic_encounters_s3.csv file you uploaded earlier.
    val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("<diabetic_encounters_s3.csv file location> ")
    
    df.writeTo("s3tablesbucket.encounters.diabetic_encounters").using("Iceberg").tableProperty ("format-version", "2").createOrReplace()

  4. Query the data to validate it using Spark shell.
    spark.sql(""" SELECT * FROM s3tablesbucket.encounters.diabetic_encounters """).show()

Grant permissions

In this section, you grant fine-grained access control to the two IAM users created as part of the prerequisites.

  • nurse: Grant access to all columns in the diabetic_encounters table
  • analyst: Grant access to only {encounter_id, patient_nbr, readmitted} columns

First, grant access to the diabetic_encounters table for nurse user.

  1. In AWS Lake Formation, Choose Data Permissions.
  2. On the Grant Permissions page, under Principals, select IAM users and roles.
  3. Select the IAM user nurse.
  4. For Catalogs, select <accoundID>:s3tablescatalog/patient-encounter.
  5. For Databases, select encounterGrant Database Permissions
  6. Scroll down. For Tables, select diabetic_encounters.
  7. For Table permissions, select Select.
  8. For Data permissions, select All data access.Grant Table Permissions
  9. Choose Grant. This will grant select access on all the columns in diabetic_encounters to the nurse

Now grant access to the diabetic_encounters table for the analyst user.

  1. Repeat the same steps that you followed for nurse user up to step 7 in the previous section.
  2. For Data permissions, select Column-based access. Select Include columns and select the encounter_id, patient_nbr, and readmitted columns
    Grant Column Permissions
  3. Choose Grant. This will grant select access on the encounter_id, patient_nbr, and readmitted columns in diabetic_encounters to the analyst

Run SQL analytics

In this section, you will access the data in the diabetic_encounters S3 Table using nurse and analyst to learn how fine-grain access control works. You will also combine data from the S3 Table data with a local table in Amazon Redshift using a single query.

  1. In the Amazon Redshift Query Editor V2, connect to serverless:rs-demo-wg, an Amazon Redshift Serverless instance created by the CloudFormation template.
  2. Select Database user name and password as the connection method and connect using super user awsuser. Provide the password you gave as an input parameter to the CloudFormation stack.Database Connection
  3. Run the following commands to create the IAM users nurse and analyst in Amazon Redshift.
    CREATE USER IAM:nurse password disable;
    CREATE USER IAM:analyst password disable;

  4. Amazon Redshift automatically mounts the Data Catalog as an external database named awsdatacatalog to simplify accessing your tables in Data Catalog. You can grant usage access to this database for the IAM users:
    GRANT USAGE ON DATABASE awsdatacatalog to "IAM:nurse";
    GRANT USAGE ON DATABASE awsdatacatalog to "IAM:analyst";

For the next steps, you must first sign in to the AWS Console as the nurse IAM user. You can find the IAM user’s password in the AWS Secrets Manager console and retrieving the value from the secret ending with iam-users-credentials. See Get a secret value using the AWS console for more information.

  1. After you’ve signed in to the console, navigate to the Amazon Redshift Query Editor V2.
  2. Sign in to your Amazon Redshift cluster using the IAM:nurse. You can do this by connecting to serverless:rs-demo-wg as Federated user. This applies the permission provided in Lake Formation for accessing your data in Amazon S3 Tables:
    Federated Connection
  3. Run following SQL to query S3 Table diabetic_encounters.
    SELECT * FROM patient-encounter@s3tablescatalog"."encounters"."diabetic_encounters";

This returns all the data in the S3 Table for diabetic_encounters across every column in the table, as shown in the following figure:

Diabetic Encounters Output

Recall that you also created an IAM user called analyst that only has access to the encounter_id, patient_nbr, and readmitted columns. Let’s verify that analyst user can only access those columns.

  1. Sign in to the AWS console as the analyst IAM user and open the Amazon Redshift Query Editor v2 using the same steps as above. Run the same query as before:
    SELECT * FROM patient-encounter@s3tablescatalog"."encounters"."diabetic_encounters";
    

This time, you should only the encounter_id, patient_nbr, and readmitted columns:

Diabetic Encounters Output restricted

Now that you’ve seen how you can access data in Amazon S3 Tables from Amazon Redshift while setting the levels of access required for your users, let’s see how we can join data in S3 Tables to tables that already exist in Amazon Redshift.

Combine data from an S3 Table and a local table in Amazon Redshift

For this section, you’ll load data into your local Amazon Redshift cluster. After this is complete, you can analyze the datasets in both Amazon Redshift and S3 Tables.

  1. First, as the analytics federated user, sign in to your Amazon Redshift cluster using Amazon Redshift Query Editor v2.
  2. Use the following SQL command to create a table that contains patient information.:
    CREATE TABLE public.patient_info (
        patient_nbr integer ENCODE az64,
        race character varying(256) ENCODE lzo,
        gender character varying(256) ENCODE lzo,
        age_grp character varying(256) ENCODE lzo,
        number_outpatient integer ENCODE az64,
        number_emergency integer ENCODE az64,
        number_inpatient integer ENCODE az64);

  3. Copy patient information from the file csv that’s stored in your Amazon S3 object bucket. Replace <diabetic_patients_rs.csv file S3 location> with the location of the file in your S3 bucket.
    COPY dev.public.patient_info FROM 's3://<diabetic_patients_rs.csv file S3 location>' 
    IAM_ROLE default 
    FORMAT AS CSV DELIMITER ',' 
    IGNOREHEADER 1;

  4. Use the following query to review the sample data to verify that the command was successful. This will show information from 10 patients, as shown in the following figure.
    SELECT * FROM public.patient_info limit 10;

    Patient Information

  5. Now combine data from the Amazon S3 Table diabetic_encounters and the Amazon Redshift patient_info. In this example, the query fetches information about what age group was most frequently readmitted to the hospital within 30 days of an initial hospital visit:
    SELECT
        age_grp,
        count(*) readmission_count
    FROM
        "patient-encounter@s3tablescatalog"."encounters"."diabetic_encounters" a
    JOIN public.patient_info b ON b.patient_nbr = a.patient_nbr
    WHERE
        a.readmitted='<30'
    GROUP BY age_grp
    ORDER BY readmission_count DESC
    LIMIT 1;

This query returns results showing an age group and the number of re-admissions, as shown in the following figure.

Redamissions Output

Cleanup

To clean up your resources, delete the stack you deployed using AWS CloudFormation. For instructions, see Deleting a stack on the AWS CloudFormation console.

Conclusion

In this post, you walked through an end-to-end process for setting up security and governance controls for Apache Iceberg data stored in Amazon S3 Tables and accessing it from Amazon Redshift. This includes creating S3 Tables, loading data into them, registering the tables in a data lake catalog, setting up access controls, and querying the data using Amazon Redshift. You also learned how to combine data from Amazon S3 Tables and local Amazon Redshift tables stored in Redshift Managed Storage in a single query, enabling a seamless, unified analytics experience. Try out these features and see Working with Amazon S3 Tables and table buckets for more details. We welcome your feedback in the comments section.


About the Authors

Satesh SontiSatesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

JonathanJonathan Katz is a Principal Product Manager – Technical on the Amazon Redshift team and is based in New York. He is a Core Team member of the open source PostgreSQL project and an active open source contributor, including PostgreSQL and the pgvector project.

Empower financial analytics by creating structured knowledge bases using Amazon Bedrock and Amazon Redshift

Post Syndicated from Nita Shah original https://aws.amazon.com/blogs/big-data/empower-financial-analytics-by-creating-structured-knowledge-bases-using-amazon-bedrock-and-amazon-redshift/

Traditionally, financial data analysis could require deep SQL expertise and database knowledge. Now with Amazon Bedrock Knowledge Bases integration with structured data, you can use simple, natural language prompts to query complex financial datasets. By combining the AI capabilities of Amazon Bedrock with an Amazon Redshift data warehouse, individuals with varied levels of technical expertise can quickly generate valuable insights, making sure that data-driven decision-making is no longer limited to those with specialized programming skills.

With the support for structured data retrieval using Amazon Bedrock Knowledge Bases, you can now use natural language querying to retrieve structured data from your data sources, such as Amazon Redshift. This enables applications to seamlessly integrate natural language processing capabilities on structured data through simple API calls. Developers can rapidly implement sophisticated data querying features without complex coding—just connect to the API endpoints and let users explore financial data using plain English. From customer portals to internal dashboards and mobile apps, this API-driven approach makes enterprise-grade data analysis accessible to everyone in your organization. Using structured data from a Redshift data warehouse, you can efficiently and quickly build generative AI applications for tasks such as text generation, sentiment analysis, or data translation.

In this post, we showcase how financial planners, advisors, or bankers can now ask questions in natural language, such as, “Give me the name of the customer with the highest number of accounts?” or “Give me details of all accounts for a specific customer.” These prompts will receive precise data from the customer databases for accounts, investments, loans, and transactions. Amazon Bedrock Knowledge Bases automatically translates these natural language queries into optimized SQL statements, thereby accelerating time to insight, enabling faster discoveries and efficient decision-making.

Solution overview

To illustrate the new Amazon Bedrock Knowledge Bases integration with structured data in Amazon Redshift, we will build a conversational AI-powered assistant for financial assistance that is designed to help answer financial inquiries, like “Who has the most accounts?” or “Give details of the customer with the highest loan amount.”

We will build a solution using sample financial datasets and set up Amazon Redshift as the knowledge base. Users and applications will be able to access this information using natural language prompts.

The following diagram provides an overview of the solution.

For building and running this solution, the steps include:

  1. Load sample financial datasets.
  2. Enable Amazon Bedrock large language model (LLM) access for Amazon Nova Pro.
  3. Create an Amazon Bedrock knowledge base referencing structured data in Amazon Redshift.
  4. Ask queries and get responses in natural language.

To implement the solution, we use a sample financial dataset that is for demonstration purposes only. The same implementation approach can be adapted to your specific datasets and use cases.

Download the SQL script to run the implementation steps in Amazon Redshift Query Editor V2. If you’re using another SQL editor, you can copy and paste the SQL queries either from this post or from the downloaded notebook.

Prerequisites

Make sure your meet the following prerequisites:

  1. Have an AWS account.
  2. Create an Amazon Redshift Serverless workgroup or provisioned cluster. For setup instructions, see Creating a workgroup with a namespace or Create a sample Amazon Redshift database, respectively. The Amazon Bedrock integration feature is supported in both Amazon Redshift provisioned and serverless.
  3. Create an AWS Identity and Access Management (IAM) role. For instructions, see Creating or updating an IAM role for Amazon Redshift ML integration with Amazon Bedrock.
  4. Associate the IAM role to a Redshift instance.
  5. Set up the required permissions for Amazon Bedrock Knowledge Bases to connect with Amazon Redshift.

Load sample financial data

To load the finance datasets to Amazon Redshift, complete the following steps:

  1. Open the Amazon Redshift Query Editor V2 or another SQL editor of your choice and connect to the Redshift database.
  2. Run the following SQL to create the finance data tables and load sample data:
    -- Create table
    CREATE TABLE accounts (
        id integer ,
        account_id integer PRIMARY KEY,
        customer_id integer,
        account_type character varying(256),
        opening_date date,
        balance bigint,
        currency character varying(256)
    );
    
    CREATE TABLE customer (
        id integer,
        customer_id integer PRIMARY KEY ,
        name character varying(256) ,
        age integer,
        gender character varying(256) ,
        address character varying(256) ,
        phone character varying(256) ,
        email character varying(256)
    );
    
    CREATE TABLE investments (
        id integer ,
        investment_id integer PRIMARY KEY,
        customer_id integer ,
        investment_type character varying(256) ,
        investment_name character varying(256) ,
        purchase_date date ,
        purchase_price bigint ,
        quantity integer 
    );
    
    
    CREATE TABLE loans (
        id integer ,
        loan_id integer PRIMARY KEY,
        customer_id integer ,
        loan_type character varying(256) ,
        loan_amount bigint ,
        interest_rate integer ,
        start_date date ,
        end_date date 
    );
    
    CREATE TABLE orders (
        id integer ,
        order_id integer PRIMARY KEY,
        customer_id integer ,
        order_type character varying(256) ,
        order_date date ,
        investment_id integer ,
        quantity integer ,
        price integer 
    );
    
    CREATE TABLE transactions (
        id integer ,
        transaction_id integer PRIMARY KEY ,
        account_id integer REFERENCES accounts(account_id),
        transaction_type character varying(256) ,
        transaction_date date ,
        amount integer ,
        description character varying(256) 
    );

  3. Download the sample financial dataset to your local storage and unzip the zipped folder.
  4. Create an Amazon Simple Storage Service (Amazon S3) bucket with a unique name. For instructions, refer to Creating a general purpose bucket.
  5. Upload the downloaded files into your newly created S3 bucket.
  6. Using the following COPY command statements, load the datasets from Amazon S3 into the new tables you created in Amazon Redshift. Replace <<your_s3_bucket>> with the name of your S3 bucket and <<your_region>> with your AWS Region.
    -- Load sample data
    COPY accounts FROM 's3://<<your_s3_bucket>>/accounts.csv' IAM_ROLE DEFAULT FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS '<<your_region>>';
    
    COPY customer FROM 's3://<<your_s3_bucket>>/customer.csv' IAM_ROLE DEFAULT FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS '<<your_region>>';
    COPY investments FROM 's3://<<your_s3_bucket>>/investments.csv' IAM_ROLE DEFAULT FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS '<<your_region>>';
    COPY loans FROM 's3://<<your_s3_bucket>>/loans.csv' IAM_ROLE DEFAULT FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS '<<your_region>>';
    COPY orders FROM 's3://<<your_s3_bucket>>/orders.csv' IAM_ROLE DEFAULT FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS '<<your_region>>';
    COPY transactions FROM 's3://<<your_s3_bucket>>/transactions.csv' IAM_ROLE DEFAULT FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1 REGION AS '<<your_region>>';

Enable LLM access

With Amazon Bedrock, you can access state-of-the-art AI models from providers like Anthropic, AI21 Labs, Stability AI, and Amazon’s own foundation models (FMs). These include Anthropic’s Claude 2, which excels at complex reasoning and content generation; Jurassic-2 from AI21 Labs, known for its multilingual capabilities; Stable Diffusion from Stability AI for image generation; and Amazon Titan models for various text and embedding tasks. For this demo, we use Amazon Bedrock to access the Amazon Nova FMs. Specifically, we use the Amazon Nova Pro model, which is a highly capable multimodal model designed for a wide range of tasks like video summarization, Q&A, mathematical reasoning, software development, and AI agents, including high speed and accuracy for text summarization tasks.

Make sure you have the required IAM permissions to enable access to available Amazon Bedrock Nova FMs. Then complete the following steps to enable model access in Amazon Bedrock:

  1. On the Amazon Bedrock console, in the navigation pane, choose Model access.
  2. Choose Enable specific models.
  3. Search for Amazon Nova models, select Nova Pro, and choose Next.
  4. Review the selection and choose Submit.

Create an Amazon Bedrock knowledge base referencing structured data in Amazon Redshift

Amazon Bedrock Knowledge Bases uses Amazon Redshift as the query engine to query your data. It reads metadata from your structured data store to generate SQL queries. There are different supported authentication methods to create the Amazon Bedrock knowledge base using Amazon Redshift. For more information, refer to the Set up query engine for your structured data store in Amazon Bedrock Knowledge Bases.

For this post, we create an Amazon Bedrock knowledge base for the Redshift database and sync the data using IAM authentication.

If you’re creating an Amazon Bedrock knowledge base through the AWS Management Console, you can skip the service role setup mentioned in the previous section. It automatically creates one with the necessary permissions for Amazon Bedrock Knowledge Bases to retrieve data from your new knowledge base and generate SQL queries for structured data stores.

When creating an Amazon Bedrock knowledge base using an API, you must attach IAM policies that grant permissions to create and manage knowledge bases with connected data stores. Refer to Prerequisites for creating an Amazon Bedrock Knowledge Base with a structured data store for instructions.

Complete the following steps to create an Amazon Bedrock knowledge base using structured data:

  1. On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
  2. Choose Create and choose Knowledge Base with structure data store from the dropdown menu.
  3. Provide the following details for your knowledge base:
    1. Enter a name and optional description.
    2. Select Amazon Redshift as the query engine.
    3. Select Create and use a new service role for resource management.
    4. Make note of this newly created IAM role.
    5. Choose Next to proceed to the next part of the setup process.
    6. Configure the query engine:
      • Select Redshift Serverless (Amazon Redshift provisioned is also supported).
      • Choose your Redshift workgroup.
      • Use the IAM role created earlier.
      • Under Default storage metadata, select Amazon Redshift databases and for Database, choose dev.
      • You can customize settings by adding specific contexts to enhance the accuracy of the results.
      • Choose Next.
    7. Complete creating your knowledge base.
    8. Record the generated service role details.
    9. Next, grant appropriate access to the service role for Amazon Bedrock Knowledge Bases through the Amazon Redshift Query Editor V2. Update <your Service Role name> in the following statements with your service role, and update the value for <your schema>.
      CREATE USER "IAMR:<your Service Role name>" WITH PASSWORD DISABLE;
      SELECT * FROM PG_USER; -- To verify that the user is created.
      GRANT SELECT ON ALL TABLES IN SCHEMA <your schema> TO "IAMR:<your Service Role name>";
      --You can also Restricting access to certain tables for finer-grained control on the tables that can be accessed as shown below
      GRANT SELECT ON TABLE customer to "IAMR:<your Service Role name>";
      GRANT SELECT ON TABLE loan to "IAMR:<your Service Role name>";

Now you can update the knowledge base with the Redshift database.

  1. On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
  2. Open the knowledge base you created.
  3. Select the dev Redshift database and choose Sync.

It may take a few minutes for the status to display as COMPLETE.

Ask queries and get responses in natural language

You can set up your application to query the knowledge base or attach the knowledge base to an agent by deploying your knowledge base for your AI application. For this demo, we use a native testing interface on the Amazon Bedrock Knowledge Bases console.

To ask questions in natural language on the knowledge base for Redshift data, complete the following steps:

  1. On the Amazon Bedrock console, open the details page for your knowledge base.
  2. Choose Test.
  3. Choose your category (Amazon), model (Nova Pro), and inference settings (On demand), and choose Apply.
  4. In the right pane of the console, test the knowledge base setup with Amazon Redshift by asking a few simple questions in natural language, such as “How many tables do I have in the database?” or “Give me list of all tables in the database.

The following screenshot shows our results.

  1. To view the generated query from your Amazon Redshift based knowledge base, choose Show details next to the response.
  2. Next, ask questions related to the financial datasets loaded in Amazon Redshift using natural language prompts, such as, “Give me the name of the customer with the highest number of accounts” or “Give the details of all accounts for customer Deanna McCoy.

The following screenshot shows the responses in natural language.

Using natural language queries in Amazon Bedrock, you were able to retrieve responses from the structured financial data stored in Amazon Redshift.

Considerations

In this section, we discuss some important considerations when using this solution.

Security and compliance

When integrating Amazon Bedrock with Amazon Redshift, implementing robust security measures is crucial. To protect your systems and data, implement essential safeguards including restricted database roles, read-only database instances, and proper input validation. These measures help prevent unauthorized access and potential system vulnerabilities. For more information, see Allow your Amazon Bedrock Knowledge Bases service role to access your data store.

Cost

You incur a cost for converting natural language to text based on SQL. To learn more, refer to Amazon Bedrock pricing.

Use custom contexts

To improve query accuracy, you can enhance SQL generation by providing custom context in two key ways. First, specify which tables to include or exclude, focusing the model on relevant data structures. Second, supply curated queries as examples, demonstrating the types of SQL queries you expect. These curated queries serve as valuable reference points, guiding the model to generate more accurate and relevant SQL outputs tailored to your specific needs. For more information, refer to Create a knowledge base by connecting to a structured data store.

For different workgroups, you can create separate knowledge bases for each group, with access only to their specific tables. Control data access by setting up role-based permissions in Amazon Redshift, verifying each role can only view and query authorized tables.

Clean up

To avoid incurring future charges, delete the Redshift Serverless instance or provisioned data warehouse created as part of the prerequisite steps.

Conclusion

Generative AI applications provide significant advantages in structured data management and analysis. The key benefits include:

  • Using natural language processing – This makes data warehouses more accessible and user-friendly
  • Enhancing customer experience – By providing more intuitive data interactions, it boosts overall customer satisfaction and engagement
  • Simplifying data warehouse navigation – Users can understand and explore data warehouse content through natural language interactions, improving ease of use
  • Improving operational efficiency – By automating routine tasks, it allows human resources to focus on more complex and strategic activities

In this post, we showed how the natural language querying capabilities of Amazon Bedrock Knowledge Bases when integrated with Amazon Redshift enables rapid solution development. This is particularly valuable for the finance industry, where financial planners, advisors, or bankers face challenges in accessing and analyzing large volumes of financial data in a secured and performant manner.

By enabling natural language interactions, you can bypass the traditional barriers of understanding database structures and SQL queries, and quickly access insights and provide real-time support. This streamlined approach accelerates decision-making and drives innovation by making complex data analysis accessible to non-technical users.

For additional details on Amazon Bedrock and Amazon Redshift integration, refer to Amazon Redshift ML integration with Amazon Bedrock.


About the authors

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Sushmita Barthakur is a Senior Data Solutions Architect at Amazon Web Services (AWS), supporting Strategic customers architect their data workloads on AWS. With a background in data analytics, she has extensive experience helping customers architect and build enterprise data lakes, ETL workloads, data warehouses and data analytics solutions, both on-premises and the cloud. Sushmita is based in Florida and enjoys traveling, reading and playing tennis.

Jonathan Katz is a Principal Product Manager – Technical on the Amazon Redshift team and is based in New York. He is a Core Team member of the open source PostgreSQL project and an active open source contributor, including PostgreSQL and the pgvector project.

Simplify enterprise data access using the Amazon Redshift integration with Amazon S3 Access Grants

Post Syndicated from Maneesh Sharma original https://aws.amazon.com/blogs/big-data/simplify-enterprise-data-access-using-the-amazon-redshift-integration-with-amazon-s3-access-grants/

Scaling data access securely while maintaining operational efficiency is a critical challenge for organizations. Access rights are often fragmented across various AWS services, as different business units own and manage different data stores, such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift. As data grows, modeling access in AWS Identity and Access Management (IAM) policies becomes challenging for data owners, as they try to manage access for different groups and users across accounts in the organization. Managing these distributed access rights requires substantial overhead, because security teams and data owners must collaborate to update and monitor permissions to make sure data is only accessible to authorized users.

Recognizing this challenge, the Amazon S3 Access Grants integration with Amazon Redshift allows centralized user authentication through AWS IAM Identity Center, providing unified identity across the organization. S3 Access Grants allows specific IAM Identity Center users or groups to access registered Amazon S3 locations through a grant. Creating a grant with a group as grantee lets the group members access only the S3 bucket, prefix, or object within the grant’s scope. This means that access can be managed by simply creating a grant for a group and adding or removing the user from the group, reducing administrative overhead.

In this post, we show how to grant Amazon S3 permissions to IAM Identity Center users and groups using S3 Access Grants. We also test the integration using an IAM Identity Center federated user to unload data from Amazon Redshift to Amazon S3 and load data from Amazon S3 to Amazon Redshift.

Solution overview

This post covers a use case where a large organization manages thousands of corporate users across multiple business units through their identity provider (IdP). These users regularly interact with vast amounts of data stored across numerous S3 buckets, frequently performing extract, transform, and load (ETL) operations through Amazon Redshift. Their goal is to have a simpler ETL process of data loading and unloading operations in Amazon Redshift without managing multiple IAM roles and policies for Amazon S3 access. Also, they want a centralized access management solution that seamlessly integrates their corporate identities from existing IdP with AWS services.

For this solution, AWS Organizations is enabled and IAM Identity Center is configured in the delegated administration account. The organization has two member accounts: Member Account 1 runs analytical workloads on Amazon Redshift, with all the services enabled with trusted identity propagation, and Member Account 2 manages data stored in Amazon S3; here you will set up S3 Access Grants. Amazon Redshift will load the user-specific data from Amazon S3 stored in Member Account 2 using access control based on IAM Identity Center users and groups. This improves the user experience maintaining a single authentication mechanism within an organization, retaining access control, and resource separation using AWS accounts as a boundary per business units.

The following diagram illustrates the solution architecture.

Figure 1: Architecture showing the solution

Figure 1: Architecture showing the solution

To run this solution in a single account, configure Amazon Redshift and S3 Access Grants with account instances of IAM Identity Center. Review When to use account instances for more information.

The solution workflow includes the following steps:

  1. The user configures and connects with their respective clients (such as Amazon Redshift Query Editor v2 or a SQL client) to access Amazon Redshift using IAM Identity Center.
  2. A new browser windows opens and is redirected to the login page of the IdP.
  3. The user logs in with their IdP user name and password.
  4. After the login is successful, the user is redirected to the client application, such as the Amazon Redshift Query Editor.
  5. When the user tries to access data in Amazon S3 using the LOAD or UNLOAD SQL command, Amazon Redshift in Member Account 1 will request credentials from the S3 Access Grants instance from Member Account 2, where the Amazon S3 data is stored. This request will contain the user context.
  6. S3 Access Grants will then evaluate the request against the grants it has, matching the identity specified in the grant with the one received in the request. If there is a match, the requestor will receive temporary access to the Amazon S3 locations specified in the grant’s scope.

To implement the solution, we walk you through the following steps:

  1. Enable S3 Access Grants in your Amazon Redshift managed application.
  2. Update IAM role permissions used in the application.
  3. Create a bucket for S3 Access Grants.
  4. Create an IAM policy and role for S3 Access Grants.
  5. Set up S3 Access Grants.
  6. Allow cross-account access of resources.
  7. Create Redshift tables.
  8. Unload and load data in Amazon Redshift.

Prerequisites

You should have the following prerequisites already set up:

Enable S3 Access Grants from the Amazon Redshift managed application

After you have created your Redshift application in IAM Identity Center, you need to perform the following steps to enable S3 Access Grants in the account where Amazon Redshift exists. For this post, we use Member Account 1:

  1. Log in to the AWS Management Console as admin.
  2. On the Amazon Redshift console, choose IAM Identity Center connection in the navigation pane.
  3. Select the managed Redshift application and choose Edit.
  4. Choose Amazon S3 access grants in Trusted identity propagation.
  5. Choose Save changes.

The following screenshot shows the updated configuration.

Figure 2: Redshift managed application

Figure 2: Redshift managed application

Update the IAM role permission attached to the Amazon Redshift managed application

The Amazon Redshift managed application has an IAM role attached (in the preceding screenshot, you can see the role called IAMIDCRedshiftRole under IAM role for IAM Identity Center access. We now need to modify the policy on this role and add permissions to allow interaction with Amazon S3. Edit the role and add s3:GetAccessGrantsInstanceForPrefix and s3:GetDataAccess as shown in the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowGetRedsfhitInformation",
            "Effect": "Allow",
            "Action": [
                "redshift-serverless:ListNamespaces",
                "redshift-serverless:ListWorkgroups",
                "redshift:DescribeQev2IdcApplications",
                "redshift-serverless:GetWorkgroup"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowDescribeIdentityCenter",
            "Effect": "Allow",
            "Action": [
                "sso:DescribeApplication",
                "sso:DescribeInstance"
            ],
            "Resource": [
                "arn:aws:sso:::instance/<IAM Identity Center Instance ID>",
                "arn:aws:sso::<Delegated Adminstration AWS Account ID>:application/<IAM Identity Center Instance ID>/*"
            ]
        },
        {
            "Sid": "RetrieveAGinstanceforParticularPrefix",
            "Effect": "Allow",
            "Action": 
                      "s3:GetAccessGrantsInstanceForPrefix",
            "Resource": "*"
        },
        {
            "Sid": "CrossAccountAccessGrantsPolicy",
            "Effect": "Allow",
            "Action": [
                "s3:GetDataAccess"
            ],
            "Resource": "arn:aws:s3:<region>:<AWS Account of S3 Access Grant>:access-grants/default"
        }
    ]
}

Replace <IAM Identity Center Instance ID> with your IAM Identity Center instance ID and <Delegated Adminstration AWS Account ID> with the account ID where IAM Identity Center is set up. You also need to replace the resource in CrossAccountAccessGrantscasePolicy with your S3 Access Grants instance information.

Create an S3 bucket for S3 Access Grants

In this step, you create a S3 bucket that you want to grant access to or use an existing bucket. For this post, we create a bucket called amzn-s3-demo-bucket. You can choose another appropriate name. For more information, see Creating a general purpose bucket.

The bucket must be located in the same AWS Region as your S3 Access Grants instance and IAM Identity Center.

Next, create two folders in the newly created S3 bucket. If you’re using an existing S3 bucket, identify two folders to use for this walkthrough. For this blog post, we create two folders: awssso-sales and awssso-finance, under a bucket named amzn-s3-demo-bucket. The purpose of creating two folders is so that users from different groups have access only to their respective folder.

Create an IAM policy and role for S3 Access Grants

Complete the following steps to create an IAM policy to scope the permissions for a specific access grant:

  1. Create an IAM policy with the following permissions. For more information on creating IAM policy, see Create IAM policies. To get additional information on the following specific policy, refer to Register a location.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ObjectLevelReadPermissions",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:GetObjectAcl",
                    "s3:GetObjectVersionAcl",
                    "s3:ListMultipartUploadParts"
                ],
                "Resource": "arn:aws:s3:::<bucket-name>/*",
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceAccount": "<AWS Account of S3 Access Grant>"
                    },
                    "ArnEquals": {
                        "s3:AccessGrantsInstanceArn": [
                            "arn:aws:s3:<region>:<AWS Account of S3 Access Grant>:access-grants/default"
                        ]
                    }
                }
            },
            {
                "Sid": "ObjectLevelWritePermissions",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:PutObjectAcl",
                    "s3:PutObjectVersionAcl",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:AbortMultipartUpload"
                ],
                "Resource": "arn:aws:s3:::<bucket-name>/*",
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceAccount": "<AWS Account of S3 Access Grant>"
                    },
                    "ArnEquals": {
                        "s3:AccessGrantsInstanceArn": "arn:aws:s3:<region>:<AWS Account of S3 Access Grant>:access-grants/default"
                    }
                }
            },
            {
                "Sid": "BucketLevelReadPermissions",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": "arn:aws:s3:::<bucket-name>",
                "Condition": {
                    "StringEquals": {
                        "aws:ResourceAccount": "<AWS Account of S3 Access Grant>"
                    },
                    "ArnEquals": {
                        "s3:AccessGrantsInstanceArn": "arn:aws:s3:<region>:<AWS Account of S3 Access Grant>:access-grants/default"
                    }
                }
            }
        ]
    }

  2. Create an IAM role that has permission to access your S3 data in the Region. For more information, see IAM role creation. In this example, we create an IAM role called iamidcs3accessgrant. You need to attach the preceding policy to the IAM role.
  3. Use the following trust policy for the IAM role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ForAccessGrants",
                "Effect": "Allow",
                "Principal": {
                    "Service": "access-grants.s3.amazonaws.com"
                },
                "Action": [
                    "sts:AssumeRole",
                    "sts:SetContext",
                    "sts:SetSourceIdentity"
                ],
                "Condition": {
            "StringEquals": {
              "aws:SourceAccount":"<accountId>",
              "aws:SourceArn":"arn:aws:s3:<region>:<accountId>:access-grants/default"
            }
          }
            }
        ]
    }

Set up S3 Access Grants

The S3 Access Grants instance serves as the container for your S3 Access Grants resources, which include registered locations and grants. You can create only one S3 Access Grants instance per Region per account. You can associate this S3 Access Grants instance to your corporate directory with your IAM Identity Center instance. After you’ve done so, you can create grants for your corporate users and groups. S3 Access Grants requires registering a location to map an S3 bucket or prefix to an IAM role, enabling secure access by providing temporary credentials to grantees for that specific location.

Complete the following steps to set up S3 Access Grants:

  1. On the Amazon S3 console, choose your preferred Region.
  2. In the navigation pane, choose Access Grants.
  3. Choose Create S3 Access Grants instance.
  4. Select Add IAM Identity Center instance in <region> and enter the IAM Identity Center instance Amazon Resource Name (ARN). For this post, we use the delegated administration account IAM Identity Center ARN.
  5. Choose Next.

    Figure 3: S3 Access Grants instance

    Figure 3: S3 Access Grants instance

  6. After you create an Amazon S3 Access Grants instance in a Region in your account, you register an Amazon S3 location in that instance. For Location scope, choose Browse S3 or enter the S3 URI path to the location that you want to register. After you enter a URI, you can choose View to browse the location. In this example, we provide the scope as s3://amzn-s3-demo-bucket.
  7. For IAM role, select Choose from existing IAM roles and choose the IAM role you previously created (iamidcs3accessgrant).
  8. Choose Next.

This will register a location in your S3 Access Grants instance.

Figure 4: S3 Access Grants instance location scope

Figure 4: S3 Access Grants instance location scope

  1. You will now create a grant.
    1. If you selected the default Amazon S3 location, use the Subprefix box to narrow the scope of the access grant. For more information, see Working with grants in S3 Access Grants.
    2. If you’re granting access only to an object, select Grant scope is an object. In our example, we register the location as s3://amzn-s3-demo-bucket and then for the subprefix, we specify the folder name followed by an asterisk (awssso-sales/*).
  2. Under Permissions and access, select the Permission level, either Read, Write, or both. In this example, we select both because we will first unload from Amazon S3 to Amazon Redshift and then copy from the same bucket to Amazon Redshift.
  3. For Grantee type, choose Directory identity from IAM Identity Center.
  4. For Directory identity type, you can choose either User or Group. In this example, we choose Group.
  5. For IAM Identity Center group ID, enter the group ID from IAM Identity Center where user and group information belongs.

To get this value, open the IAM Identity Center console and choose Groups in the navigation pane, then choose one of the groups you want to provide access and copy the value under Group ID. In the following example, we collect the group ID information from the delegated administration account.

Figure 5: IAM Identity Center group information

Figure 5: IAM Identity Center group information

  1. Choose Next.

    Figure 6: S3 Access Grants instance permissions and access

    Figure 6: S3 Access Grants instance permissions and access

  2. Choose Finish.

    Figure 7: S3 Access Grants instance review information page

    Figure 7: S3 Access Grants instance review information page

You can view the details of the access grant on the Amazon S3 console, as shown in the following screenshot. For more information, see View a grant.

Figure 8: S3 Access Grants grants

Figure 8: S3 Access Grants grants

Similarly, you can get the details of a location that’s registered in your S3 Access Grants instance. For more information, see View the details of a registered location.

Figure 9: S3 Access Grants locations

Figure 9: S3 Access Grants locations

Allow cross-account access of resources and create initial tables

Now we want to share resources to make our cross-account scenario work. This step is only needed if your Amazon Redshift and Amazon S3 resources are in different accounts. This should be done in the account where Amazon S3 is set up. Complete the following steps:

  1. On the AWS RAM console, in the navigation pane, choose Resource shares.
  2. Choose Create resource share.
  3. For Name, enter a descriptive name for the resource share (for example, s3accessgrant).
  4. For Resources – optional, choose S3 Access Grants. The S3 Access Grants instance you created will be shown; select the default S3 Access Grant instance ARN.
  5. Choose Next.
  6. Under Managed permission for s3:AccessGrants, you can choose to associate a managed permission created by AWS with the resource type, choose an existing customer managed permission, or create your own customer managed permission for supported resource types. In this post, we choose the existing permission named AWSRAMPermissionAccessGrantsData.
  7. Choose Next.
  8. For Grant access to principals, choose Allow sharing only within your organization and enter the account ID where the Redshift instance exists.
  9. Choose Add.
  10. Choose Next.
  11. Choose Create resource share.

The following screenshot shows the new resource share details.

Figure 10: AWS RAM - create resource share wizard

Figure 10: AWS RAM – create resource share wizard

Create tables in Amazon Redshift

As an Amazon Redshift admin user, you need to first create the tables you will use to unload data. In the following code, we create a new store_sales_s3access table:

CREATE TABLE IF NOT EXISTS 
sales_schema.store_sales_s3access ( 
ID INTEGER ENCODE az64, 
Product varchar(20), 
Sales_Amount INTEGER ENCODE az64 
) 
DISTSTYLE AUTO ;

Also make sure the following permissions are applied on the respective IAM Identity Center group; this group is represented in Amazon Redshift as a Redshift role. For this post, we grant permissions to the awssso-sales group:

grant usage on schema sales_schema to role "awsidc:awssso-sales";
grant select,insert  for tables in schema sales_schema to role "awsidc:awssso-sales";

As an Amazon Redshift admin user, you have created a Redshift table and assigned relevant permissions to the Redshift database role awsidc:awssso-sales. Now when an authenticated user that belongs to the group awssso-sales runs a query in Amazon Redshift to access Amazon S3 (such as a COPY, UNLOAD, or Amazon Redshift Spectrum operation), Amazon Redshift retrieves temporary Amazon S3 access credentials scoped to that IAM Identity Center user from S3 Access Grants. Amazon Redshift then uses the retrieved temporary credentials to access the authorized Amazon S3 locations for that query.

Unload and load data in Amazon Redshift

In this step, we log in to the Amazon Redshift Query Editor using IAM Identity Center authentication and run an UNLOAD command to unload data from the table created earlier into the S3 bucket. After that, we run the COPY command to copy information from Amazon S3 into the same table in the same directory we unloaded the data from.

Complete the following steps to access the Amazon Redshift Query Editor with an IAM Identity Center user:

  1. On the Amazon Redshift console, open the Amazon Redshift Query Editor.
  2. Choose (right-click) your Redshift instance and choose Create connection.
  3. Choose IAM Identity Center as your authentication method.
  4. A pop-up will appear. Because your IdP credentials are already cached, it uses the same credentials and connects to the Amazon Redshift Query Editor using IAM Identity Center authentication.

Now you’re ready to run the SQL queries in Amazon Redshift.

Unload data

As a federated user, you will first run an unload command from the table store_sales in the bucket s3://amzn-s3-demo-bucket/awssso-sales/.

In this post, we run an UNLOAD command as a federated IAM Identity Center user (Ethan), where we will be unloading the data from a Redshift table. Replace the S3 bucket name with the one you created.

UNLOAD ('SELECT * FROM "dev"."sales_schema"."store_sales"')
TO 's3://amzn-s3-demo-bucket/awssso-sales/';

The preceding command doesn’t include an IAM role ARN. This simplified syntax not only makes your code more readable, but also reduces the potential for configuration errors. The underlying permissions are handled automatically through S3 Access Grants and trusted identity propagation, maintaining robust security while simplifying permissions management.

Load data

Now we demonstrate a common data workflow using the same federated IAM Identity Center user (Ethan), where we will be running the COPY command accessing the same Amazon S3 location where we previously unloaded our data. Use to following command to load data into a separate table called store_sales_s3access:

copy dev.sales_schema.store_sales_s3access 
from 's3://amzn-s3-demo-bucket/awssso-sales/' delimiter '|'

If user Ethan tries to unload "sales_schema"."store_sales" in sales_schema to a different folder in the S3 bucket (awssso-finance), they get a permission denied error. This is because access is controlled by S3 Access Grants, and this user doesn’t have a grant to the awssso-finance folder. Use the following command to test the access denied use case:

UNLOAD ('SELECT * FROM "dev"."sales_schema"."store_sales"')
TO 's3://amzn-s3-demo-bucket/awssso-finance/';
Figure 11: QEv2 query result error

Figure 11: QEv2 query result error

IAM Identity Center related operations are automatically captured and logged in AWS CloudTrail, offering enhanced visibility and comprehensive audit capabilities. To view detailed error information on the CloudTrail console, choose Event history in the navigation pane, then specify s3.amazonaws.com as the event source and open GetDataAccess.

The following screenshot shows the snippet from the CloudTrail logs showing that user access is denied.

Figure 12: Amazon CloudTrail

Figure 12: Amazon CloudTrail

Clean up

Complete the following steps to clean up your resources:

  1. Delete the IdP applications that you created to integrate with IAM Identity Center.
  2. Delete the IAM Identity Center configuration.
  3. Delete the Redshift application and the Amazon Redshift provisioned cluster or serverless instance that you created for testing.
  4. Delete the IAM role and IAM policies that you created in this post.
  5. Delete the permission set from IAM Identity Center that you created for the Amazon Redshift Query Editor in the management account.
  6. Delete the S3 bucket and associated S3 Access Grants instance.

Conclusion

In this post, we explored how to integrate Amazon Redshift with S3 Access Grants using IAM Identity Center. We established cross-account access to enable centralized user authentication through IAM Identity Center in the delegated administrator account, while keeping Amazon Redshift and Amazon S3 isolated by business unit in separate member accounts. We also showed simplified versions of running COPY and UNLOAD commands as a federated IAM Identity Center user without using an IAM role ARN. This setup creates a robust and secure analytics environment that streamlines data access for business users.

For additional guidance and detailed documentation, refer to the following key resources:


About the Authors

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Laura is an Identity Solutions Architect at AWS, where she thrives on helping customers overcome security and identity challenges. In her free time, she enjoys wreck diving and traveling around the world.

Praveen Kumar Ramakrishnan is a Senior Software Engineer at AWS. He has nearly 20 years of experience spanning various domains including filesystems, storage virtualization and network security. At AWS, he focuses on enhancing the Redshift data security.

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Access Amazon Redshift Managed Storage tables through Apache Spark on AWS Glue and Amazon EMR using Amazon SageMaker Lakehouse

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/access-amazon-redshift-managed-storage-tables-through-apache-spark-on-aws-glue-and-amazon-emr-using-amazon-sagemaker-lakehouse/

Data environments in data-driven organizations are changing to meet the growing demands for analytics, including business intelligence (BI) dashboarding, one-time querying, data science, machine learning (ML), and generative AI. These organizations have a huge demand for lakehouse solutions that combine the best of data warehouses and data lakes to simplify data management with easy access to all data from their preferred engines.

Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data  in place with all Apache Iceberg compatible tools and engines. It secures your data in the lakehouse by defining fine-grained permissions, which are consistently applied across all analytics and ML tools and engines. You can bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations. It accesses and queries data in-place with federated query capabilities across third-party data sources through Amazon Athena.

With SageMaker Lakehouse, you can access tables stored in Amazon Redshift managed storage (RMS) through Iceberg APIs, using the Iceberg REST catalog backed by AWS Glue Data Catalog. This expands your data integration workload across data lakes and data warehouses, enabling seamless access to diverse data sources.

Amazon SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0 natively support SageMaker Lakehouse. This post describes how to integrate data on RMS tables through Apache Spark using SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0.

How to access RMS tables through Apache Spark on AWS Glue and Amazon EMR

With SageMaker Lakehouse, RMS tables are accessible through the Apache Iceberg REST catalog. Open source engines such as Apache Spark are compatible with Apache Iceberg, and they can interact with RMS tables by configuring this Iceberg REST catalog. You can learn more in Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint.

Note that the Iceberg REST extensions endpoint is used when you access RMS tables. This endpoint is accessible through the Apache Iceberg AWS Glue Data Catalog extensions, which comes preinstalled on AWS Glue 5.0 and Amazon EMR 7.5.0 or higher. The extension library enables access to RMS tables using the Amazon Redshift connector for Apache Spark.

To access RMS backed catalog databases from Spark, each RMS database requires its own Spark session catalog configuration. Here are the required Spark configurations:

Spark config key Value
spark.sql.catalog.{catalog_name} org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.{catalog_name}.type glue
spark.sql.catalog.{catalog_name}.glue.id {account_id}:{rms_catalog_name}/{database_name}
spark.sql.catalog.{catalog_name}.client.region {aws_region}
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Configuration parameters:

  • {catalog_name}: Your chosen name for referencing the RMS catalog database in your application code
  • {rms_catalog_name}: The RMS catalog name as shown in the AWS Lake Formation catalogs section
  • {database_name}: The RMS database name
  • {aws_region}: The AWS Region where the RMS catalog is located

For a deeper understanding of how the Amazon Redshift hierarchy (databases, schemas, and tables) is mapped to the AWS Glue multilevel catalogs, you can refer to the Bringing Amazon Redshift data into the AWS Glue Data Catalog documentation.

In the following section, we demonstrate how to access RMS tables through Apache Spark using SageMaker Unified Studio JupyterLab notebooks with the AWS Glue 5.0 runtime and Amazon EMR Serverless.

Although we can bring existing Amazon Redshift tables into the AWS Glue Data catalog by creating a Lakehouse Redshift catalog from an existing Redshift namespace and provide access to a SageMaker Unified Studio project, in the following example, you’ll create a managed Amazon Redshift Lakehouse catalog directly from SageMaker Unified Studio and work with that.

Prerequisites

To follow these instructions, you must have the following prerequisites:

Create a SageMaker Unified Studio project

Complete the following steps to create a SageMaker Unified Studio project:

  1. Sign in to SageMaker Unified Studio.
  2. Choose Select a project on the top menu and choose Create project.
  3. For Project name, enter demo.
  4. For Project profile, choose All capabilities.
  5. Choose Continue.

  1. Leave the default values and choose Continue.
  2. Review the configurations and choose Create project.

You need to wait for the project to be created. Project creation can take about 5 minutes. When the project status changes to Active, select the project name to access the project’s home page.

  1. Make note of the Project role ARN because you’ll need it for next steps.

You’ve successfully created the project and noted the project role ARN. The next step is to configure a Lakehouse catalog for your RMS.

Configure a Lakehouse catalog for your RMS

Complete the following steps to configure a Lakehouse catalog for your RMS:

  1. In the navigation pane, choose Data.
  2. Choose the + (plus) sign.
  3. Select Create Lakehouse catalog to create a new catalog and choose Next.

  1. For Lakehouse catalog name, enter rms-catalog-demo.
  2. Choose Add catalog.

  1. Wait for the catalog to be created.

  1. In SageMaker Unified Studio, choose Data in the left navigation pane, then select the three vertical dots next to Redshift (Lakehouse) and choose Refresh to make sure the Amazon Redshift compute is active.

Create a new table in the RMS Lakehouse catalog:

  1. In SageMaker Unified Studio, on the top menu, under Build, choose Query Editor.
  2. On the top right, choose Select data source.
  3. For CONNECTIONS, choose Redshift (Lakehouse).
  4. For DATABASES, choose dev@rms-catalog-demo.
  5. For SCHEMAS, choose public.
  6. Choose Choose.

  1. In the query cell, enter and execute the following query to create a new schema:
create schema "dev@rms-catalog-demo".salesdb

  1. In a new cell, enter and execute the following query to create a new table:
create table salesdb.store_sales (ss_sold_timestamp timestamp, ss_item text, ss_sales_price float);

  1. In a new cell, enter and execute the following query to populate the table with sample data:
insert into salesdb.store_sales values ('2024-12-01T09:00:00Z', 'Product 1', 100.0),
('2024-12-01T11:00:00Z', 'Product 2', 500.0),
('2024-12-01T15:00:00Z', 'Product 3', 20.0),
('2024-12-01T17:00:00Z', 'Product 4', 1000.0),
('2024-12-01T18:00:00Z', 'Product 5', 30.0),
('2024-12-02T10:00:00Z', 'Product 6', 5000.0),
('2024-12-02T16:00:00Z', 'Product 7', 5.0);

  1. In a new cell, enter and run the following query to verify the table contents:
select * from salesdb.store_sales;

(Optional) Create an Amazon EMR Serverless application

IMPORTANT: This section is only required if you plan to test also using Amazon EMR Serverless. If you intend to use AWS Glue exclusively, you can skip this section entirely.

  1. Navigate to the project page. In the left navigation pane, select Compute, then select the Data processing Choose Add compute.

  1. Choose Create new compute resources, then choose Next.

  1. Select EMR Serverless.

  1. Specify emr_serverless_application as Compute name, select Compatibility as Permission mode, and choose Add compute.

  1. Monitor the deployment progress. Wait for the Amazon EMR Serverless application to complete its deployment. This process can take a minute.

Access Amazon Redshift Managed Storage tables through Apache Spark

In this section, we demonstrate how to query tables stored in RMS using a SageMaker Unified Studio notebook.

  1. In the navigation pane, choose Data
  2. Under Lakehouse, select the down arrow next to rms-catalog-demo
  3. Under dev, select the down arrow next salesdb, choose store_sales, and choose the three dots

SageMaker Lakehouse offers multiple analysis options: Query with Athena, Query with Redshift, and Open in Jupyter Lab notebook.

  1. Choose Open in Jupyter Lab notebook
  2. On the Launcher tab, choose Python 3 (ipykernel)

In SageMaker Unified Studio JupyterLab, you can specify different compute types for each notebook cell. Although this example demonstrates using AWS Glue compute (project.spark.compatibility), the same code can be executed using Amazon EMR Serverless by selecting the appropriate compute in the cell settings. The following table shows the connection type and compute values to specify when running PySpark code or Spark SQL code with different engines:

Compute option Pyspark code Spark SQL
Connection type Compute Connection type Compute
AWS Glue Pyspark project.spark.compatibility SQL project.spark.compatibility
Amazon EMR Serverless Pyspark emr-s.emr_serverless_application SQL emr-s.emr_serverless_application
  1. In the notebook cell’s top left corner, set Connection Type to PySpark and select spark.compatibility (AWS Glue 5.0) as Compute
  2. Execute the following code to initialize the SparkSession and configure rmscatalog as the session catalog for accessing the dev database under the rms-catalog-demo RMS catalog:
from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
#Change <your_account_id> with your AWS account ID
rms_catalog_id = "<your_account_id>:rms-catalog-demo/dev"

#Change with your AWS region
aws_region="us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
    .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
    .config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_id) \
    .config(f'spark.sql.catalog.{catalog_name}.client.region', aws_region) \
    .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .getOrCreate()

  1. Create a new cell and switch the connection type from PySpark to SQL to execute Spark SQL commands directly
  2. Enter the following SQL statement to view all tables under salesdb (RMS schema) within rmscatalog:
SHOW TABLES IN rmscatalog.salesdb

  1. In a new SQL cell, enter the following DESCRIBE EXTENDED statement to view detailed information about the store_sales table in the salesdb schema:
DESCRIBE EXTENDED rmscatalog.salesdb.store_sales

In the output, you’ll observe that the Provider is set to iceberg. This indicates that the table is recognized as an Iceberg table, despite being stored in Amazon Redshift managed storage.

  1. In a new SQL cell, enter the following SELECT statement to view the content of the table
SELECT * FROM rmscatalog.salesdb.store_sales

Throughout this example, we demonstrated how to create a table in Amazon Redshift Serverless and seamlessly query it as an Iceberg table using Apache Spark within a SageMaker Unified Studio notebook.

Clean up

To avoid incurring future charges, clean up all created resources:

  1. Delete the created SageMaker Unified Studio project. This step will automatically delete Amazon EMR compute (for example, the Amazon EMR Serverless application) that was provisioned from the project:
    1. Inside SageMaker Studio, navigate to the demo project’s Project overview section.
    2. Choose Actions, then select Delete project.
    3. Type confirm and choose Delete project.
  1. Delete the created Lakehouse catalog:
    1. Navigate to the AWS Lake Formation page in the Catalogs section.
    2. Select the rms-catalog-demo catalog, choose Actions, then select Delete.
    3. In the confirmation window type rms-catalog-demo and then choose Drop.

Conclusion

In this post, we demonstrated how to use Apache Spark to interact with Amazon Redshift Managed Storage tables through Amazon SageMaker Lakehouse using the Iceberg REST catalog. This integration provides a unified view of your data across Amazon S3 data lakes and Amazon Redshift data warehouses, so you can build powerful analytics and AI/ML applications while maintaining a single copy of your data.

For additional workloads and implementations, visit Simplify data access for your enterprise using Amazon SageMaker Lakehouse.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect with Amazon Web Services (AWS) Analytics services. He’s responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Stefano Sandonà is a Senior Big Data Specialist Solution Architect at Amazon Web Services (AWS). Passionate about data, distributed systems, and security, he helps customers worldwide architect high-performance, efficient, and secure data solutions.

Derek Liu is a Senior Solutions Architect based out of Vancouver, BC. He enjoys helping customers solve big data challenges through Amazon Web Services (AWS) analytic services.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services (AWS). He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Angel Conde Manjon is a Sr. EMEA Data & AI PSA, based in Madrid. He has previously worked on research related to data analytics and AI in diverse European research projects. In his current role, Angel helps partners develop businesses centered on data and AI.


Appendix: Sample script for Lake Formation FGAC enabled Spark cluster

If you want to access RMS tables from Lake Formation FGAC enabled Spark cluster on AWS Glue or Amazon EMR, refer to the following code example:

from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
rms_catalog_name = "123456789012:rms-catalog-demo/dev"
account_id = "123456789012"
region = "us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
.config('spark.sql.defaultCatalog', catalog_name) \
.config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
.config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
.config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_name) \
.config(f'spark.sql.catalog.{catalog_name}.client.region', region) \
.config(f'spark.sql.catalog.{catalog_name}.glue.account-id', account_id) \
.config(f'spark.sql.catalog.{catalog_name}.glue.catalog-arn',f'arn:aws:glue:{region}:{rms_catalog_name}') \
.config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
.getOrCreate()

Configure cross-account access of Amazon SageMaker Lakehouse multi-catalog tables using AWS Glue 5.0 Spark

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/configure-cross-account-access-of-amazon-sagemaker-lakehouse-multi-catalog-tables-using-aws-glue-5-0-spark/

Many organizations build and operate enterprise-wide data mesh architectures using the AWS Glue Data Catalog and AWS Lake Formation for their Amazon Simple Storage Service (Amazon S3) based data lakes. Now, with Amazon SageMaker Lakehouse, these organizations can unify their data analytics and AI/ML workflows while maintaining secure cross-account access without data replication. By centralizing access to a single copy of data and using the secure fine-grained permissions of Lake Formation, enterprises can accelerate their analytics initiatives while reducing operational complexity across business units.

SageMaker Lakehouse organizes data using logical containers called catalogs, enabling teams to seamlessly query and analyze data across their entire ecosystem—from S3 data lakes to Amazon Redshift warehouses—using familiar Apache Iceberg compatible tools. Organizations can either mount their existing data warehouse to the lakehouse or create new catalogs using Amazon Redshift managed storage. Built-in zero-ETL connectors reduce data silos by integrating various data sources, enabling unified analytics across teams. This seamless integration particularly benefits existing AWS customers who already use the Data Catalog and Lake Formation, because they can immediately take advantage of SageMaker Lakehouse capabilities.

AWS Glue is a serverless service that makes data integration simpler, faster, and cheaper. We launched AWS Glue 5.0 with upgraded Apache Spark 3.5.4 and Python 3.11. AWS Glue 5.0 adds support for SageMaker Lakehouse to unify your data across S3 data lakes and Redshift data warehouses.

In our previous blog post, we demonstrated the process of creating tables in both the Amazon Redshift managed catalog and Amazon Redshift federated catalog within a single AWS account. In this post, we show you how to share a Redshift table and Amazon S3 based Iceberg table from the account that owns the data to another account that consumes the data. In the recipient account, we run a join query on the shared data lake and data warehouse tables using Spark in AWS Glue 5.0. We walk you through the complete cross-account setup and provide the Spark configuration in a Python notebook.

Solution overview

To demonstrate the functionality of SageMaker Lakehouse multi-catalog tables using AWS Glue 5.0 Spark, let’s assume the retail company Example Retail Corp launches a campaign to understand their market and drive growth by country of operation. Their infrastructure consists of a Redshift data warehouse for structured data and an S3 data lake for structured and semi-structured data. The marketing team realizes that customer data is spread across those two systems and wants to use the support of their data engineering and analysts to analyze and provide insights. As a company, they prefer unified governance for managing data access while enabling a secure sharing mechanism for business and engineering teams.

Let’s see how they can achieve the goal using SageMaker Lakehouse. The solution is represented in the following diagram.

001-BDB 5089

The setup could be extended to enterprise data meshes where a data producer account will own the Redshift clusters, catalog the tables in a central governance account, and share with any number of consumer accounts from the central account. Multiple consumer accounts could analyze the shared Redshift tables using the SageMaker Lakehouse integrated analytics engines.

The solution also works for cross-Region table access. You would create a resource link for the catalog tables in an AWS Region where you want to run your analyses and create dashboards. For cross-Region resource link setup, refer to Setting up cross-Region table access.

Prerequisites

To implement this solution, you need the following prerequisites:

  • Two AWS accounts with Lake Formation cross-account sharing version 4 and Lake Formation administrator configured. Refer to the Lake Formation data administrator permissions and initial setup of Lake Formation.
  • Permissions from Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog granted to the Lake Formation administrator role on both accounts.
  • An S3 bucket in the producer account to host the sample Iceberg table data.
  • An AWS Identity and Access Management (IAM) role, LakeFormationS3Registration_custom, in the producer account to register your Iceberg table’s Amazon S3 location with Lake Formation. For details, refer to Registering an Amazon S3 location and Requirements for roles used to register locations.
  • An Amazon Redshift Serverless namespace in the producer account. Follow the instructions in Creating a data warehouse with Amazon Redshift Serverless to launch a serverless namespace with default settings.
  • Two sample datasets, orders and returns, in CSV format. This is Example Retail Corp’s data on their customer purchase and return trends. Their marketing team has collected these data in a Redshift table and Amazon S3 from various systems. The instructions to create these tables are provided in the appendix at the end of this post. After completing the steps in the appendix, you should have customerdb.returnstbl_iceberg in your default catalog and ordersdb.orderstbl in your Redshift Serverless application default namespace.
  • An IAM role, Glue-execution-role, in the consumer account, with the following policies:
    1. AWS managed policies AWSGlueServiceRole and AmazonRedshiftDataFullAccess.
    2. Create a new in-line policy with the following permissions and attach it:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "LFandRSserverlessAccess",
                  "Effect": "Allow",
                  "Action": [
                      "lakeformation:GetDataAccess",
                      "redshift-serverless:GetCredentials"
                  ],
                  "Resource": "*"
              },
              {
                  "Effect": "Allow",
                  "Action": "iam:PassRole",
                  "Resource": "*",
                  "Condition": {
                      "StringEquals": {
                          "iam:PassedToService": "glue.amazonaws.com"
                      }
                  }
              }
          ]
      }

    3. Add the following trust policy to Glue-execution-role, allowing AWS Glue to assume this role:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": [
                          "glue.amazonaws.com"
                      ]
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }

    Steps for producer account setup

    For the producer account setup, you can either use your IAM administrator role added as Lake Formation administrator or use a Lake Formation administrator role with permissions added as discussed in the prerequisites. For illustration purposes, we use the IAM admin role Admin added as Lake Formation administrator.

    002-BDB 5089

    Configure your catalog

    Complete the following steps to set up your catalog:

    1. Log in to AWS Management Console as Admin.
    2. On the Amazon Redshift console, follow the instructions in Registering Amazon Redshift clusters and namespaces to the AWS Glue Data Catalog.
    3. After the registration is initiated, you will see the invite from Amazon Redshift on the Lake Formation console.
    4. Select the pending catalog invitation and choose Approve and create catalog.

    003-BDB 5089

    1. On the Set catalog details page, configure your catalog:
      1. For Name, enter a name (for this post, redshiftserverless1-uswest2).
      2. Select Access this catalog from Apache Iceberg compatible engines.
      3. Choose the IAM role you created for the data transfer.
      4. Choose Next.

      004-BDB 5089

    2. On the Grant permissions – optional page, choose Add permissions.
      1. Grant the Admin user Super user permissions for Catalog permissions and Grantable permissions.
      2. Choose Add.

      005-BDB 5089

    3. Verify the granted permission on the next page and choose Next.
      006-BDB 5089
    4. Review the details on the Review and create page and choose Create catalog.
      007-BDB 5089

    Wait a few seconds for the catalog to show up.

    1. Choose Catalogs in the navigation pane and verify that the redshiftserverless1-uswest2 catalog is created.
      008-BDB 5089
    2. Explore the catalog detail page to verify the ordersdb.public database.
      009-BDB 5089
    3. On the database View dropdown menu, view the table and verify that the orderstbl table shows up.
      010-BDB 5089

    As the Admin role, you can also query the orderstbl in Amazon Athena and confirm the data is available.

    011-BDB 5089

    Grant permissions on the tables from the producer account to the consumer account

    In this step, we share the Amazon Redshift federated catalog database redshiftserverless1-uswest2:ordersdb.public and table orderstbl as well as the Amazon S3 based Iceberg table returnstbl_iceberg and its database customerdb from the default catalog to the consumer account. We can’t share the entire catalog to external accounts as a catalog-level permission; we just share the database and table.

    1. On the Lake Formation console, choose Data permissions in the navigation pane.
    2. Choose Grant.
      012-BDB 5089
    3. Under Principals, select External accounts.
    4. Provide the consumer account ID.
    5. Under LF-Tags or catalog resources, select Named Data Catalog resources.
    6. For Catalogs, choose the account ID that represents the default catalog.
    7. For Databases, choose customerdb.
      013-BDB 5089
    8. Under Database permissions, select Describe under Database permissions and Grantable permissions.
    9. Choose Grant.
      014-BDB 5089
    10. Repeat these steps and grant table-level Select and Describe permissions on returnstbl_iceberg.
    11. Repeat these steps again to grant database- and table-level permissions for the ordertbl table of the federated catalog database redshiftserverless1-uswest2/ordersdb.

    The following screenshots show the configuration for database-level permissions.

    015-BDB 5089

    016-BDB 5089

    The following screenshots show the configuration for table-level permissions.

    017-BDB 5089

    018-BDB 5089

    1. Choose Data permissions in the navigation pane and verify that the consumer account has been granted database- and table-level permissions for both orderstbl from the federated catalog and returnstbl_iceberg from the default catalog.
      019-BDB 5089

    Register the Amazon S3 location of the returnstbl_iceberg with Lake Formation.

    In this step, we register the Amazon S3 based Iceberg table returnstbl_iceberg data location with Lake Formation to be managed by Lake Formation permissions. Complete the following steps:

    1. On the Lake Formation console, choose Data lake locations in the navigation pane.
    2. Choose Register location.
      020-BDB 5089
    3. For Amazon S3 path, enter the path for your S3 bucket that you provided while creating the Iceberg table returnstbl_iceberg.
    4. For IAM role, provide the user-defined role LakeFormationS3Registration_custom that you created as a prerequisite.
    5. For Permission mode, select Lake Formation.
    6. Choose Register location.
      021-BDB 5089
    7. Choose Data lake locations in the navigation pane to verify the Amazon S3 registration.
      022-BDB 5089

    With this step, the producer account setup is complete.

    Steps for consumer account setup

    For the consumer account setup, we use the IAM admin role Admin, added as a Lake Formation administrator.

    The steps in the consumer account are quite involved. In the consumer account, a Lake Formation administrator will accept the AWS Resource Access Manager (AWS RAM) shares and create the required resource links that point to the shared catalog, database, and tables. The Lake Formation admin verifies that the shared resources are accessible by running test queries in Athena. The admin further grants permissions to the role Glue-execution-role on the resource links, database, and tables. The admin then runs a join query in AWS Glue 5.0 Spark using Glue-execution-role.

    Accept and verify the shared resources

    Lake Formation uses AWS RAM shares to enable cross-account sharing with Data Catalog resource policies in the AWS RAM policies. To view and verify the shared resources from producer account, complete the following steps:

    1. Log in to the consumer AWS console and set the AWS Region to match the producer’s shared resource Region. For this post, we use us-west-2.
    2. Open the Lake Formation console. You will see a message indicating there is a pending invite and asking you accept it on the AWS RAM console.
      023-BDB 5089
    3. Follow the instructions in Accepting a resource share invitation from AWS RAM to review and accept the pending invites.
    4. When the invite status changes to Accepted, choose Shared resources under Shared with me in the navigation pane.
    5. Verify that the Redshift Serverless federated catalog redshiftserverless1-uswest2, the default catalog database customerdb, the table returnstbl_iceberg, and the producer account ID under Owner ID column display correctly.
      024-BDB 5089
    6. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
    7. Search by the producer account ID.
      You should see the customerdb and public databases. You can further select each database and choose View tables on the Actions dropdown menu and verify the table names

    025-BDB 5089

    You will not see an AWS RAM share invite for the catalog level on the Lake Formation console, because catalog-level sharing isn’t possible. You can review the shared federated catalog and Amazon Redshift managed catalog names on the AWS RAM console, or using the AWS Command Line Interface (AWS CLI) or SDK.

    Create a catalog link container and resource links

    A catalog link container is a Data Catalog object that references a local or cross-account federated database-level catalog from other AWS accounts. For more details, refer to Accessing a shared federated catalog. Catalog link containers are essentially Lake Formation resource links at the catalog level that reference or point to a Redshift cluster federated catalog or Amazon Redshift managed catalog object from other accounts.

    In the following steps, we create a catalog link container that points to the producer shared federated catalog redshiftserverless1-uswest2. Inside the catalog link container, we create a database. Inside the database, we create a resource link for the table that points to the shared federated catalog table <<producer account id>>:redshiftserverless1-uswest2/ordersdb.public.orderstbl.

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Catalogs.
    2. Choose Create catalog.

    026-BDB 5089

    1. Provide the following details for the catalog:
      1. For Name, enter a name for the catalog (for this post, rl_link_container_ordersdb).
      2. For Type, choose Catalog Link container.
      3. For Source, choose Redshift.
      4. For Target Redshift Catalog, enter the Amazon Resource Name (ARN) of the producer federated catalog (arn:aws:glue:us-west-2:<<producer account id>>:catalog/redshiftserverless1-uswest2/ordersdb).
      5. Under Access from engines, select Access this catalog from Apache Iceberg compatible engines.
      6. For IAM role, provide the Redshift-S3 data transfer role that you had created in the prerequisites.
      7. Choose Next.

    027-BDB 5089

    1. On the Grant permissions – optional page, choose Add permissions.
      1. Grant the Admin user Super user permissions for Catalog permissions and Grantable permissions.
      2. Choose Add and then choose Next.

    028-BDB 5089

    1. Review the details on the Review and create page and choose Create catalog.

    Wait a few seconds for the catalog to show up.

    029-BDB 5089

    1. In the navigation pane, choose Catalogs.
    2. Verify that rl_link_container_ordersdb is created.

    030-BDB 5089

    Create a database under rl_link_container_ordersdb

    Complete the following steps:

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
    2. On the Choose catalog dropdown menu, choose rl_link_container_ordersdb.
    3. Choose Create database.

    Alternatively, you can choose the Create dropdown menu and then choose Database.

    1. Provide details for the database:
      1. For Name, enter a name (for this post, public_db).
      2. For Catalog, choose rl_link_container_ordersdb.
      3. Leave Location – optional as blank.
      4. Under Default permissions for newly created tables, deselect Use only IAM access control for new tables in this database.
      5. Choose Create database.

    031-BDB 5089

    1. Choose Catalogs in the navigation pane to verify that public_db is created under rl_link_container_ordersdb.

    032-BDB 5089

    Create a table resource link for the shared federated catalog table

    A resource link to a shared federated catalog table can reside only inside the database of a catalog link container. A resource link for such tables will not work if created inside the default catalog. For more details on resource links, refer to Creating a resource link to a shared Data Catalog table.

    Complete the following steps to create a table resource link:

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Tables.
    2. On the Create dropdown menu, choose Resource link.

    033-BDB 5089

    1. Provide details for the table resource link:
      1. For Resource link name, enter a name (for this post, rl_orderstbl).
      2. For Destination catalog, choose rl_link_container_ordersdb.
      3. For Database, choose public_db.
      4. For Shared table’s region, choose US West (Oregon).
      5. For Shared table, choose orderstbl.
      6. After the Shared table is selected, Shared table’s database and Shared table’s catalog ID should get automatically populated.
      7. Choose Create.

    034-BDB 5089

    1. In the navigation pane, choose Databases to verify that rl_orderstbl is created under public_db, inside rl_link_container_ordersdb.

    035-BDB 5089

    036-BDB 5089

    Create a database resource link for the shared default catalog database.

    Now we create a database resource link in the default catalog to query the Amazon S3 based Iceberg table shared from the producer. For details on database resource links, refer Creating a resource link to a shared Data Catalog database.

    Though we are able to see the shared database in the default catalog of the consumer, a resource link is required to query from analytics engines, such as Athena, Amazon EMR, and AWS Glue. When using AWS Glue with Lake Formation tables, the resource link needs to be named identically to the source account’s resource. For additional details on using AWS Glue with Lake Formation, refer to Considerations and limitations.

    Complete the following steps to create a database resource link:

    1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
    2. On the Choose catalog dropdown menu, choose the account ID to choose the default catalog.
    3. Search for customerdb.

    You should see the shared database name customerdb with the Owner account ID as that of your producer account ID.

    1. Select customerdb, and on the Create dropdown menu, choose Resource link.
    2. Provide details for the resource link:
      1. For Resource link name, enter a name (for this post, customerdb).
      2. The rest of the fields should be already populated.
      3. Choose Create.
    3. In the navigation pane, choose Databases and verify that customerdb is created under the default catalog. Resource link names will show in italicized font.

    037-BDB 5089

    Verify access as Admin using Athena

    Now you can verify your access using Athena. Complete the following steps:

    1. Open the Athena console.
    2. Make sure an S3 bucket is provided to store the Athena query results. For details, refer to Specify a query result location using the Athena console.
    3. In the navigation pane, verify both the default catalog and federated catalog tables by previewing them.
    4. You can also run a join query as follows. Pay attention to the three-point notation for referring to the tables from two different catalogs:
    SELECT
    returns_tb.market as Market,
    sum(orders_tb.quantity) as Total_Quantity
    FROM rl_link_container_ordersdb.public_db.rl_orderstbl as orders_tb
    JOIN awsdatacatalog.customerdb.returnstbl_iceberg as returns_tb
    ON orders_tb.order_id = returns_tb.order_id
    GROUP BY returns_tb.market;

    038-BDB 5089

    This verifies the new capability of SageMaker Lakehouse, which enables accessing Redshift cluster tables and Amazon S3 based Iceberg tables in the same query, across AWS accounts, through the Data Catalog, using Lake Formation permissions.

    Grant permissions to Glue-execution-role

    Now we will share the resources from the producer account with additional IAM principals in the consumer account. Usually, the data lake admin grants permissions to data analysts, data scientists, and data engineers in the consumer account to do their job functions, such as processing and analyzing the data.

    We set up Lake Formation permissions on the catalog link container, databases, tables, and resource links to the AWS Glue job execution role Glue-execution-role that we created in the prerequisites.

    Resource links allow only Describe and Drop permissions. You need to use the Grant on target configuration to provide database Describe and table Select permissions.

    Complete the following steps:

    1. On the Lake Formation console, choose Data permissions in the navigation pane.
    2. Choose Grant.
    3. Under Principals, select IAM users and roles.
    4. For IAM users and roles, enter Glue-execution-role.
    5. Under LF-Tags or catalog resources, select Named Data Catalog resources.
    6. For Catalogs, choose rl_link_container_ordersdb and the consumer account ID, which indicates the default catalog.
    7. Under Catalog permissions, select Describe for Catalog permissions.
    8. Choose Grant.

    039-BDB 5089

    040-BDB 5089

    1. Repeat these steps for the catalog rl_link_container_ordersdb:
      1. On the Databases dropdown menu, choose public_db.
      2. Under Database permissions, select Describe.
      3. Choose Grant.
    2. Repeat these steps again, but after choosing rl_link_container_ordersdb and public_db, on the Tables dropdown menu, choose rl_orderstbl.
      1. Under Resource link permissions, select Describe.
      2. Choose Grant.
    3. Repeat these steps to grant additional permissions to Glue-execution-role.
      1. For this iteration, grant Describe permissions on the default catalog databases public and customerdb.
      2. Grant Describe permission on the resource link customerdb.
      3. Grant Select permission on the tables returnstbl_iceberg and orderstbl.

    The following screenshots show the configuration for database public and customerdb permissions.

    041-BDB 5089

    042-BDB 5089

    The following screenshots show the configuration for resource link customerdb permissions.

    043-BDB 5089

    044-BDB 5089

    The following screenshots show the configuration for table returnstbl_iceberg permissions.

    045-BDB 5089

    046-BDB 5089

    The following screenshots show the configuration for table orderstbl permissions.

    047-BDB 5089

    048-BDB 5089

    1. In the navigation pane, choose Data permissions and verify permissions on Glue-execution-role.

    049-BDB 5089

    Run a PySpark job in AWS Glue 5.0

    Download the PySpark script LakeHouseGlueSparkJob.py. This AWS Glue PySpark script runs Spark SQL by joining the producer shared federated orderstbl table and Amazon S3 based returns table in the consumer account to analyze the data and identify the total orders placed per market.

    Replace <<consumer_account_id>> in the script with your consumer account ID. Complete the following steps to create and run an AWS Glue job:

    1. On the AWS Glue console, in the navigation pane, choose ETL jobs.
    2. Choose Create job, then choose Script editor.

    050-BDB 5089

    1. For Engine, choose Spark.
    2. For Options, choose Start fresh.
    3. Choose Upload script.
    4. Browse to the location where you downloaded and edited the script, select the script, and choose Open.
    5. On the Job details tab, provide the following information:
      1. For Name, enter a name (for this post, LakeHouseGlueSparkJob).
      2. Under Basic properties, for IAM role, choose Glue-execution-role.
      3. For Glue version, select Glue 5.0.
      4. Under Advanced properties, for Job parameters, choose Add new parameter.
      5. Add the parameters --datalake-formats = iceberg and --enable-lakeformation-fine-grained-access = true.
    6. Save the job.
    7. Choose Run to execute the AWS Glue job, and wait for the job to complete.
    8. Review the job run details from the Output logs

    051-BDB 5089

    052-BDB 5089

    Clean up

    To avoid incurring costs on your AWS accounts, clean up the resources you created:

    1. Delete the Lake Formation permissions, catalog link container, database, and tables in the consumer account.
    2. Delete the AWS Glue job in the consumer account.
    3. Delete the federated catalog, database, and table resources in the producer account.
    4. Delete the Redshift Serverless namespace in the producer account.
    5. Delete the S3 buckets you created as part of data transfer in both accounts and the Athena query results bucket in the consumer account.
    6. Clean up the IAM roles you created for the SageMaker Lakehouse setup as part of the prerequisites.

    Conclusion

    In this post, we illustrated how to bring your existing Redshift tables to SageMaker Lakehouse and share them securely with external AWS accounts. We also showed how to query the shared data warehouse and data lakehouse tables in the same Spark session, from a recipient account, using Spark in AWS Glue 5.0.

    We hope you find this useful to integrate your Redshift tables with an existing data mesh and access the tables using AWS Glue Spark. Test this solution in your accounts and share feedback in the comments section. Stay tuned for more updates and feel free to explore the features of SageMaker Lakehouse and AWS Glue versions.

    Appendix: Table creation

    Complete the following steps to create a returns table in the Amazon S3 based default catalog and an orders table in Amazon Redshift:

    1. Download the CSV format datasets orders and returns.
    2. Upload them to your S3 bucket under the corresponding table prefix path.
    3. Use the following SQL statements in Athena. First-time users of Athena should refer to Specify a query result location.
    CREATE DATABASE customerdb;
    CREATE EXTERNAL TABLE customerdb.returnstbl_csv(
      `returned` string, 
      `order_id` string, 
      `market` string)
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY '\;' 
    LOCATION
      's3://<your-S3-bucket>/<prefix-for-returns-table-data>/'
    TBLPROPERTIES (
      'skip.header.line.count'='1'
    );
    
    select * from customerdb.returnstbl_csv limit 10; 
    

    053-BDB 5089

    1. Create an Iceberg format table in the default catalog and insert data from the CSV format table:
    CREATE TABLE customerdb.returnstbl_iceberg(
      `returned` string, 
      `order_id` string, 
      `market` string)
    LOCATION 's3://<your-producer-account-bucket>/returnstbl_iceberg/' 
    TBLPROPERTIES (
      'table_type'='ICEBERG'
    );
    
    INSERT INTO customerdb.returnstbl_iceberg
    SELECT *
    FROM returnstbl_csv;  
    
    SELECT * FROM customerdb.returnstbl_iceberg LIMIT 10; 
    

    054-BDB 5089

    1. To create the orders table in the Redshift Serverless namespace, open the Query Editor v2 on the Amazon Redshift console.
    2. Connect to the default namespace using your database admin user credentials.
    3. Run the following commands in the SQL editor to create the database ordersdb and table orderstbl in it. Copy the data from your S3 location of the orders data to the orderstbl:
    create database ordersdb;
    use ordersdb;
    
    create table orderstbl(
      row_id int, 
      order_id VARCHAR, 
      order_date VARCHAR, 
      ship_date VARCHAR, 
      ship_mode VARCHAR, 
      customer_id VARCHAR, 
      customer_name VARCHAR, 
      segment VARCHAR, 
      city VARCHAR, 
      state VARCHAR, 
      country VARCHAR, 
      postal_code int, 
      market VARCHAR, 
      region VARCHAR, 
      product_id VARCHAR, 
      category VARCHAR, 
      sub_category VARCHAR, 
      product_name VARCHAR, 
      sales VARCHAR, 
      quantity bigint, 
      discount VARCHAR, 
      profit VARCHAR, 
      shipping_cost VARCHAR, 
      order_priority VARCHAR
      );
    
    copy orderstbl
    from 's3://<your-s3-bucket>/ordersdatacsv/orders.csv' 
    iam_role 'arn:aws:iam::<producer-account-id>:role/service-role/<your-Redshift-Role>'
    CSV 
    DELIMITER ';'
    IGNOREHEADER 1
    ;
    
    select * from ordersdb.orderstbl limit 5;
    


    About the Authors

    055-BDB 5089Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. She collaborates with the service team to enhance product features, works with AWS customers and partners to architect lakehouse solutions, and establishes best practices for data governance.

    056-BDB 5089Subhasis Sarkar is a Senior Data Engineer with Amazon. Subhasis thrives on solving complex technological challenges with innovative solutions. He specializes in AWS data architectures, particularly data mesh implementations using AWS CDK components.

Amazon SageMaker Lakehouse now supports attribute-based access control

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/amazon-sagemaker-lakehouse-now-supports-attribute-based-access-control/

Amazon SageMaker Lakehouse now supports attribute-based access control (ABAC) with AWS Lake Formation, using AWS Identity and Access Management (IAM) principals and session tags to simplify data access, grant creation, and maintenance. With ABAC, you can manage business attributes associated with user identities and enable organizations to create dynamic access control policies that adapt to the specific context.

SageMaker Lakehouse is a unified, open, and secure data lakehouse that now supports ABAC to provide unified access to general purpose Amazon S3 buckets, Amazon S3 Tables, Amazon Redshift data warehouses, and data sources such as Amazon DynamoDB or PostgreSQL. You can then query, analyze, and join the data using Redshift, Amazon AthenaAmazon EMR, and AWS Glue. You can secure and centrally manage your data in the lakehouse by defining fine-grained permissions with Lake Formation that are consistently applied across all analytics and machine learning(ML) tools and engines. In addition to its support for role-based and tag-based access control, Lake Formation extends support to attribute-based access to simplify data access management for SageMaker Lakehouse, with the following benefits:

  • Flexibility – ABAC policies are flexible and can be updated to meet changing business needs. Instead of creating new rigid roles, ABAC systems allow access rules to be modified by simply changing user or resource attributes.
  • Efficiency – Managing a smaller number of roles and policies is more straightforward than managing a large number of roles, reducing administrative overhead.
  • Scalability – ABAC systems are more scalable for larger enterprises because they can handle a large number of users and resources without requiring a large number of roles.

Attribute-based access control overview

Previously, within SageMaker Lakehouse, Lake Formation granted access to resources based on the identity of a requesting user. Our customers were requesting the capability to express the full complexity required for access control rules in organizations. ABAC allows for more flexible and nuanced access policies that can better reflect real-world needs. Organizations can now grant permissions on a resource based on user attribute and is context-driven. This allows administrators to grant permissions on a resource with conditions that specify user attribute keys and values. IAM principals with matching IAM or session tag key-value pairs will gain access to the resource.

Instead of creating a separate role for each team member’s access to a specific project, you can set up ABAC policies to grant access based on attributes like membership and user role, reducing the number of roles required. For instance, without ABAC, a company with an account manager role that covers five different geographical territories needs to create five different IAM roles and grant data access for only the specific territory for which the IAM role is meant. With ABAC, they can simply add those territory attributes as keys/values to the principal tag and provide data access grants based on those attributes. If the value of the attribute for a user changes, access to the dataset will automatically be invalidated.

With ABAC, you can use attributes such as department or country and use IAM or sessions tags to determine access to data, making it more straightforward to create and maintain data access grants. Administrators can define fine-grained access permissions with ABAC to limit access to databases, tables, rows, columns, or table cells.

In this post, we demonstrate how to get started with ABAC in SageMaker Lakehouse and use with various analytics services.

Solution overview

To illustrate the solution, we are going to consider a fictional company called Example Retail Corp. Example Retail’s leadership is interested in analyzing sales data in Amazon S3 to determine in-demand products, understand customer behavior, and identify trends, for better decision-making and increased profitability. The sales department sets up a team for sales analysis with the following data access requirements:

  • All data analysts in the Sales department in the US get access to only sales-specific data in only US regions
  • All BI analysts in the Sales department have full access to data in only US regions
  • All scientists in the Sales department get access to only sales-specific data across all regions
  • Anyone outside of Sales department have no access to sales data

For this post, we consider the database salesdb, which contains the store_sales table that has store sales details. The table store_sales has the following schema.

To demonstrate the product sales analysis use case, we will consider the following personas from the Example Retail Corp:

  • Ava is a data administrator in Example Retail Corp who is responsible for supporting team members with specific data permission policies
  • Alice is a data analyst who should be able to access sales specific US store data to perform product sales analysis
  • Bob is a BI analyst who should be able to access all data from US store sales to generate reports
  • Charlie is a data scientist who should be able to access sales specific across all regions to explore and find patterns for trend analysis

Ava decides to use SageMaker Lakehouse to unify data across various data sources while setting up fine-grained access control using ABAC. Alice is excited about this decision as she can now build daily reports using her expertise with Athena. Bob now knows that he can quickly build Amazon QuickSight dashboards with queries that are optimized using Redshift’s cost-based optimizer. Charlie, being an open source Apache Spark contributor, is excited that he can build Spark based processing with Amazon EMR to build ML forecasting models.

Ava defines the user attributes as static IAM tags that could also include attributes stored in the identity provider (IdP) or as session tags dynamically to represent the user metadata. These tags are assigned to IAM users or roles and can be used to define or restrict access to specific resources or data. For more details, refer to Tags for AWS Identity and Access Management resources and Pass session tags in AWS STS.

For this post, Ava assigns users with static IAM tags to represent the user attributes, including their department membership, Region assignment, and current role relationship. The following table summarizes the tags that represent user attributes and user assignment.

User Persona Attributes Access
Alice Data Analyst Department=sales
Region=US
Role=Analyst
Sales specific data in US and no access to customer data
Bob BI Analyst Department=sales
Region=US
Role=BIAnalyst
All data in US
Charlie Data Scientist Department=sales
Region=ALL
Role=Scientist
Sales specific data in All regions and no access to customer data

Ava then defines access control policies in Lake Formation that grant or restrict access to certain resources based on predefined criteria (user attributes defined using IAM tags) being satisfied. This allows for flexible and context-aware security policies where access privileges can be adjusted dynamically by modifying the user attribute assignment without changing the policy rules. The following table summarizes the policies in the Sales department.

Access User Attributes Policy
All analysts (including Alice) in US get access to sales specific data in US regions Department=sales
Region=US
Role=Analyst
Table: store_sales (store_id, transaction_date, product_name, country, sales_price, quantity columns)
Row filter: country='US'
All BI analysts (including Bob) in US get access to all data in US regions Department=sales
Region=US
Role=BIAnalyst
Table: store_sales (all columns)
Row filter: country='US'
All scientists (including Charlie) get access to sales-specific data from all regions Department=sales
Region=ALL
Role=Scientist
Table: store_sales (all rows)
Column filter: store_id, transaction_date, product_name, country, sales_price,quantity

The following diagram illustrates the solution architecture.

Implementing this solution consists of the following high-level steps. For Example Retail, Ava as a data Administrator performs these steps:

  1. Define the user attributes and assign them to the principal.
  2. Grant permission on the resources (database and table) to the principal based on user attributes.
  3. Verify the permissions by querying the data using various analytics services.

Prerequisites

To follow the steps in this post, you must complete the following prerequisites:

  1. AWS account with access to the following AWS services:
    • Amazon S3
    • AWS Lake Formation and AWS Glue Data Catalog
    • Amazon Redshift
    • Amazon Athena
    • Amazon EMR
    • AWS Identity and Access Management (IAM)
  1. Set up an admin user for Ava. For instructions, see Create a user with administrative access.
  2. Setup S3 bucket for uploading script.
  3. Set up a data lake admin. For instructions, see Create a data lake administrator.
  4. Create IAM user named Alice and attach permissions for Athena access. For instructions, refer to Data analyst permissions.
  5. Create IAM user Bob and attach permissions for Redshift access.
  6. Create IAM user Charlie and attach permissions for EMR Serverless access.
  7. Create job runtime role: scientist_role and that will be used by Charlie. For instruction refer to: Job runtime roles for Amazon EMR Serverless
  8. Setup EMR Serverless application with Lake Formation enabled. For instruction refer to: Using EMR Serverless with AWS Lake Formation for fine-grained access control
  9. Have an existing AWS Glue database or table and Amazon Simple Storage Service (Amazon) S3 bucket that holds the table data. For this post, we use salesdb as our database, store_sales as our table, and data is stored in an S3 bucket.

Define attributes for the IAM principals Alice, Bob, Charlie

Ava completes the following steps to define the attributes for the IAM principal:

  1. Log in as an admin user and navigate to the IAM console.
  2. Choose Users under Access management in the navigation pane and search for the user Alice.
  3. Choose the user and choose the Tags tab.
  4. Choose Add new tag and provide the following key pairs:
    • Key: Department and value: sales
    • Key: Region and value: US
    • Key: Role and value: Analyst
  5. Choose Save changes.
  6. Repeat the process for the user Bob and provide the following key pairs:
    • Key: Department and value: sales
    • Key: Region and value: US
    • Key: Role and value: BIAnalyst
  7. Repeat the process for the user Charlie and IAM role scientist_role and provide the following key pairs:
    • Key: Department and value: sales
    • Key: Region and value: ALL
    • Key: Role and value: Scientist

Grant permissions to Alice, Bob, Charlie using ABAC

Ava now grants database and table permissions to users with ABAC.

Grant database permissions

Complete the following steps:

  1. Ava logs in as data lake admin and navigate to the Lake Formation console.
  2. In the navigation pane, under Permissions, choose Data lake permissions.
  3. Choose Grant.
  4. On the Grant permissions page, choose Principals by attribute.
  5. Specify the following attributes:
    • Key: Department  and value: sales
    • Key: Role and value: Analyst,Scientist
  6. Review the resulting policy expression.
  7. For Permission scope, select This account.
  8. Next, choose the catalog resources to grant access:
    • For Catalogs, enter the account ID.
    • For Databases, enter salesdb.
  9. For Database permissions, select Describe.
  10. Choose Grant.

Ava now verifies the database permission by navigating to the Databases tab under the Data Catalog and searching for salesdb. Select salesdb and choose View under Actions.

Grant table permissions to Alice

Complete the following steps to create a data filter to view sales specific columns in store_sales records whose country=US:

  1. On the Lake Formation console, choose Data filters under Data Catalog in the navigation pane.
  2. Choose Create new filter.
  3. Provide the data filter name as us_sales_salesonlydata.
  4. For Target catalog, enter the account ID.
  5. For Target database, choose salesdb.
  6. For Target table, choose store_sales.
  7. For column-level access, choose Include columns: store_id, item_code, transaction_date, product_name, country, sales_price, and quantity.
  8. For Row-level access, choose Filter rows and enter the row filter country='US'.
  9. Choose Create data filter.
  1. On the Grant permissions page, choose Principals by attribute.
  2. Specify the attributes:
    • Key: Department and value: sales
    • Key: Role as value: Analyst
    • Key: Region and value: US
  3. Review the resulting policy expression.
  4. For Permission scope, select This account.
  5. Choose the catalog resources to grant access:
    • Catalogs: Account ID
    • Databases: salesdb
    • Table: store_sales
    • Data filters: us_sales
  6. For Data filter permissions, select Select.
  7. Choose Grant.

Grant table permissions to Bob

Complete the following steps to create a data filter to view only store_sales records whose country=US:

  1. On the Lake Formation console, choose Data filters under Data Catalog in the navigation pane.
  2. Choose Create new filter.
  3. Provide the data filter name as us_sales.
  4. For Target catalog, enter the account ID.
  5. For Target database, choose salesdb.
  6. For Target table, choose store_sales.
  7. Leave Column-level access as Access to all columns.
  8. For Row-level access, enter the row filter country='US'.
  9. Choose Create data filter.

Complete the following steps to grant table permissions to Bob:

  1. On the Grant permissions page, choose Principals by attribute.
  2. Specify the attributes:
    • Key: Department and value: sales
    • Key: Role as value: BIAnalyst
    • Key: Region and value: US
  3. Review the resulting policy expression.
  4. For Permission scope, select This account.
  5. Choose the catalog resources to grant access:
    • Catalogs: Account ID
    • Databases: salesdb
    • Table: store_sales
  6. For Data filter permissions, select Select.
  7. Choose Grant.

Grant table permissions to Charlie

Complete the following steps to grant table permissions to Charlie:

  1. On the Grant permissions page, choose Principals by attribute.
  2. Specify the attributes:
    1. Key: Department and value: sales
    2. Key: Role as value: Scientist
    3. Key: Region and value: ALL
  3. Review the resulting policy expression.
  4. For Permission scope, select This account
  5. Choose the catalog resources to grant access:
    1. Catalogs: Account ID
    2. Databases: salesdb
    3. Table: store_sales
  6. For Table permissions, select Select.
  7. For Data permissions, specify the following columns: store_id, transaction_date, product_name, country, sales_price, and quantity.
  8. Choose Grant.

Alice now verifies the table permission by navigating to the Tables tab under the Data Catalog and searching for store_sales. Select store_sales and choose View under Actions. The following screenshots show the details for both sets of permissions.

Data Analyst uses Athena for building daily sales reports

Alice, the data analyst logs in to the Athena console and run the following query:

select * from "salesdb"."store_sales" limit 5

Alice has the user attributes as Department=sales, Role=Analyst, Region=US, and this attribute combination allows her access to US sales data to specific sales only column, without access to customer data as shown in the following screenshot.

BI Analyst uses Redshift for building sales dashboards

Bob, the BI Analyst, logs in to the Redshift console and run the following query:

select * from "salesdb"."store_sales" limit 10

Bob has the user attributes Department=sales, Role=BIAnalyst, Region=US, and this attribute combination allows him access to all columns including customer data for US sales data.

Data Scientist uses Amazon EMR to process sales data

Finally, Charlie logs in to the EMR console and submit the EMR job with runtime role as scientist_role. Charlie uses  the script sales_analysis.py that is uploaded to s3 bucket created for the script. He chooses the EMR Serverless application created with Lake Formation enabled.

Charlie submits batch job runs by choosing the following values:

  • Name: sales_analysis_Charlie
  • Runtime_role: scientist_role
  • Script location: <s3_script_path>/sales_analysis.py
  • For spark properties, provide key as spark.emr-serverless.lakeformation.enabled and value as true.
  • Additional configurations: Under Metastore configuration select Use AWS Glue Data Catalog as metastore. Charlie keeps rest of the configuration as default.

Once the job run is completed, Charlie can view the output by selecting stdout under Driver log files.

Charlie uses scientist_role as job runtime role with the attributes Department=sales, Role=Scientist, Region=ALL, and this attribute combination allows him access to select columns of all sales data.

Clean up

Complete the following steps to delete the resources you created to avoid unexpected costs:

  1. Delete the IAM users created.
  2. Delete the AWS Glue database and table resources created for the post, if any.
  3. Delete the Athena, Redshift and EMR resources created for the post.

Conclusion

In this post, we showcased how you can use SageMaker Lakehouse attribute-based access control, using IAM principals and session tags to simplify data access, grant creation, and maintenance. With attribute-based access control, you can manage permissions using dynamic business attributes associated with user identities and secure your data in the lakehouse by defining fine-grained permissions in the Lake Formation that are enforced across analytics and ML tools and engines.

For more information, refer to documentation. We encourage you to try out the SageMaker Lakehouse with ABAC and share your feedback with us.


About the authors

Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Accelerate your analytics with Amazon S3 Tables and Amazon SageMaker Lakehouse

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/accelerate-your-analytics-with-amazon-s3-tables-and-amazon-sagemaker-lakehouse/

Amazon SageMaker Lakehouse is a unified, open, and secure data lakehouse that now seamlessly integrates with Amazon S3 Tables, the first cloud object store with built-in Apache Iceberg support. With this integration, SageMaker Lakehouse provides unified access to S3 Tables, general purpose Amazon S3 buckets, Amazon Redshift data warehouses, and data sources such as Amazon DynamoDB or PostgreSQL. You can then query, analyze, and join the data using Redshift, Amazon Athena, Amazon EMR, and AWS Glue. In addition to your familiar AWS services, you can access and query your data in-place with your choice of Iceberg-compatible tools and engines, providing you the flexibility to use SQL or Spark-based tools and collaborate on this data the way you like. You can secure and centrally manage your data in the lakehouse by defining fine-grained permissions with AWS Lake Formation that are consistently applied across all analytics and machine learning(ML) tools and engines.

Organizations are becoming increasingly data driven, and as data becomes a differentiator in business, organizations need faster access to all their data in all locations, using preferred engines to support rapidly expanding analytics and AI/ML use cases. Let’s take an example of a retail company that started by storing their customer sales and churn data in their data warehouse for business intelligence reports. With massive growth in business, they need to manage a variety of data sources as well as exponential growth in data volume. The company builds a data lake using Apache Iceberg to store new data such as customer reviews and social media interactions.

This enables them to cater to their end customers with new personalized marketing campaigns and understand its impact on sales and churn. However, data distributed across data lakes and warehouses limits their ability to move quickly, as it may require them to set up specialized connectors, manage multiple access policies, and often resort to copying data, that can increase cost in both managing the separate datasets as well as redundant data stored. SageMaker Lakehouse addresses these challenges by providing secure and centralized management of data in data lakes, data warehouses, and data sources such as MySQL, and SQL Server by defining fine-grained permissions that are consistently applied across data in all analytics engines.

In this post, we guide you how to use various analytics services using the integration of SageMaker Lakehouse with S3 Tables. We begin by enabling integration of S3 Tables with AWS analytics services. We create S3 Tables and Redshift tables and populate them with data. We then set up SageMaker Unified Studio by creating a company specific domain, new project with users, and fine-grained permissions. This lets us unify data lakes and data warehouses and use them with analytics services such as Athena, Redshift, Glue, and EMR.

Solution overview

To illustrate the solution, we are going to consider a fictional company called Example Retail Corp. Example Retail’s leadership is interested in understanding customer and business insights across thousands of customer touchpoints for millions of their customers that will help them build sales, marketing, and investment plans. Leadership wants to conduct an analysis across all their data to identify at-risk customers, understand impact of personalized marketing campaigns on customer churn, and develop targeted retention and sales strategies.

Alice is a data administrator in Example Retail Corp who has embarked on an initiative to consolidate customer information from multiple touchpoints, including social media, sales, and support requests. She decides to use S3 Tables with Iceberg transactional capability to achieve scalability as updates are streamed across billions of customer interactions, while providing same durability, availability, and performance characteristics that S3 is known for. Alice already has built a large warehouse with Redshift, which contains historical and current data about sales, customers prospects, and churn information.

Alice supports an extended team of developers, engineers, and data scientists who require access to the data environment to develop business insights, dashboards, ML models, and knowledge bases. This team includes:

Bob, a data analyst who needs to access to S3 Tables and warehouse data to automate building customer interactions growth and churn across various customer touchpoints for daily reports sent to leadership.

Charlie, a Business Intelligence analyst who is tasked to build interactive dashboards for funnel of customer prospects and their conversions across multiple touchpoints and make those available to thousands of Sales team members.

Doug, a data engineer responsible for building ML forecasting models for sales growth using the pipeline and/or customer conversion across multiple touchpoints and make those available to finance and planning teams.

Alice decides to use SageMaker Lakehouse to unify data across S3 Tables and Redshift data warehouse. Bob is excited about this decision as he can now build daily reports using his expertise with Athena. Charlie now knows that he can quickly build Amazon QuickSight dashboards with queries that are optimized using Redshift’s cost-based optimizer. Doug, being an open source Apache Spark contributor, is excited that he can build Spark based processing with AWS Glue or Amazon EMR to build ML forecasting models.

The following diagram illustrates the solution architecture.

Implementing this solution consists of the following high-level steps. For Example Retail, Alice as a data Administrator performs these steps:

  1. Create a table bucket. S3 Tables stores Apache Iceberg tables as S3 resources, and customer details are managed in S3 Tables. You can then enable integration with AWS analytics services, which automatically sets up the SageMaker Lakehouse integration so that the tables bucket is shown as a child catalog under the federated s3tablescatalog in the AWS Glue Data Catalog and is registered with AWS Lake Formation for access control. Next, you create a table namespace or database which is a logical construct that you group tables under and create a table using Athena SQL CREATE TABLE statement.
  2. Publish your data warehouse to Glue Data Catalog. Churn data is managed in a Redshift data warehouse, which is published to the Data Catalog as a federated catalog and is available in SageMaker Lakehouse.
  3. Create a SageMaker Unified Studio project. SageMaker Unified Studio integrates with SageMaker Lakehouse and simplifies analytics and AI with a unified experience. Start by creating a domain and adding all users (Bob, Charlie, Doug). Then create a project in the domain, choosing project profile that provisions various resources and the project AWS Identity and Access Management (IAM) role that manages resource access. Alice adds Bob, Charlie, and Doug to the project as members.
  4. Onboard S3 Tables and Redshift tables to SageMaker Unified Studio. To onboard the S3 Tables to the project, in Lake Formation, you grant permission on the resource to the SageMaker Unified Studio project role. This enables the catalog to be discoverable within the lakehouse data explorer for users (Bob, Charlie, and Doug) to start querying tables .SageMaker Lakehouse resources can now be accessed from computes like Athena, Redshift, and Apache Spark based computes like Glue to derive churn analysis insights, with Lake Formation managing the data permissions.

Prerequisites

To follow the steps in this post, you must complete the following prerequisites:

Alice completes the following steps to create the S3 Table bucket for the new data she plans to add/import into an S3 Table.

  1. AWS account with access to the following AWS services:
    • Amazon S3 including S3 Tables
    • Amazon Redshift
    • AWS Identity and Access Management (IAM)
    • Amazon SageMaker Unified Studio
    • AWS Lake Formation and AWS Glue Data Catalog
    • AWS Glue
  2. Create a user with administrative access.
  3. Have access to an IAM role that is a Lake Formation data lake administrator. For instructions, refer to Create a data lake administrator.
  4. Enable AWS IAM Identity Center in the same AWS Region where you want to create your SageMaker Unified Studio domain. Set up your identity provider (IdP) and synchronize identities and groups with AWS IAM Identity Center. For more information, refer to IAM Identity Center Identity source tutorials.
  5. Create a read-only administrator role to discover the Amazon Redshift federated catalogs in the Data Catalog. For instructions, refer to Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.
  6. Create an IAM role named DataTransferRole. For instructions, refer to Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.
  7. Create an Amazon Redshift Serverless namespace called churnwg. For more information, see Get started with Amazon Redshift Serverless data warehouses.

Create a table bucket and enable integration with analytics services

Alice completes the following steps to create the S3 Table bucket for the new data she plans to add/import into an S3 Tables.

Follow the below steps to create a table bucket to enable integration with SageMaker Lakehouse:

  1. Sign in to the S3 console as user created in prerequisite step 2.
  2. Choose Table buckets in the navigation pane and choose Enable integration.
  3. Choose Table buckets in the navigation pane and choose Create table bucket.
  4. For Table bucket name, enter a name such as blog-customer-bucket.
  5. Choose Create table bucket.
  6. Choose Create table with Athena.
  7. Select Create a namespace and provide a namespace (for example, customernamespace).
  8. Choose Create namespace.
  9. Choose Create table with Athena.
  10. On the Athena console, run the following SQL script to create a table:
    CREATE TABLE customer (
      `c_salutation` string, 
      `c_preferred_cust_flag` string, 
      `c_first_sales_date_sk` int, 
      `c_customer_sk` int, 
      `c_login` string, 
      `c_current_cdemo_sk` int, 
      `c_first_name` string, 
      `c_current_hdemo_sk` int, 
      `c_current_addr_sk` int, 
      `c_last_name` string, 
      `c_customer_id` string, 
      `c_last_review_date_sk` int, 
      `c_birth_month` int, 
      `c_birth_country` string, 
      `c_birth_year` int, 
      `c_birth_day` int, 
      `c_first_shipto_date_sk` int, 
      `c_email_address` string)
      TBLPROPERTIES ('table_type' = 'iceberg')
      
    
    INSERT INTO customer VALUES
    ('Dr.','N',2452077,13251813,'Y',1381546,'Joyce',2645,2255449,'Deaton','AAAAAAAAFOEDKMAA',2452543,1,'GREECE',1987,29,2250667,'[email protected]'),
    ('Dr.','N',2450637,12755125,'Y',1581546,'Daniel',9745,4922716,'Dow','AAAAAAAAFLAKCMAA',2432545,1,'INDIA',1952,3,2450667,'[email protected]'),
    ('Dr.','N',2452342,26009249,'Y',1581536,'Marie',8734,1331639,'Lange','AAAAAAAABKONMIBA',2455549,1,'CANADA',1934,5,2472372,'[email protected]'),
    ('Dr.','N',2452342,3270685,'Y',1827661,'Wesley',1548,11108235,'Harris','AAAAAAAANBIOBDAA',2452548,1,'ROME',1986,13,2450667,'[email protected]'),
    ('Dr.','N',2452342,29033279,'Y',1581536,'Alexandar',8262,8059919,'Salyer','AAAAAAAAPDDALLBA',2952543,1,'SWISS',1980,6,2650667,'[email protected]'),
    ('Miss','N',2452342,6520539,'Y',3581536,'Jerry',1874,36370,'Tracy','AAAAAAAALNOHDGAA',2452385,1,'ITALY',1957,8,2450667,'[email protected]')

This is just an example of adding a few rows to the table, but generally for production use cases, customers use engines such as Spark to add data to the table.

S3 Tables customer is now created, populated with data and integrated with SageMaker Lakehouse.

Set up Redshift tables and publish to the Data Catalog

Alice completes the following steps to connect the data in Redshift to be published into the data catalog. We’ll also demonstrate how the Redshift table is created and populated, but in Alice’s case Redshift table already exists with all the historic data on sales revenue.

  1. Sign in to the Redshift endpoint churnwg as an admin user.
  2. Run the following script to create a table under the dev database under the public schema:
    CREATE TABLE customer_churn (
    customer_id BIGINT,
    tenure INT,
    monthly_charges DECIMAL(5,1),
    total_charges DECIMAL(5,1),
    contract_type VARCHAR(100),
    payment_method VARCHAR(100),
    internet_service VARCHAR(100),
    has_phone_service BOOLEAN,
    is_churned BOOLEAN
    );
    
    INSERT INTO customer_churn VALUES
    (10251783, 12, 70.5, 850.0, 'Month-to-Month', 'Credit Card', 'Fiber Optic', true, true),
    (13251813, 36, 55.0, 1980.0, 'One Year', 'Bank Transfer', 'DSL', true, false),
    (12755125, 6, 90.0, 540.0, 'Month-to-Month', 'Mailed Check', 'Fiber Optic', false, true),
    (26009249, 12, 70.5, 850.0, 'One Year', 'Credit Card', 'DSL', true, false),
    (3270685, 36, 55.0, 1980.0, 'One Year', 'Bank Transfer', 'DSL', true, false),
    (29033279, 6, 90.0, 540.0, 'Month-to-Month', 'Mailed Check', 'Fiber Optic', false, true),
    (6520539, 24, 60.0, 1440.0, 'Two Year', 'Electronic Check', 'DSL', true, false);

    This is just an example of adding a few rows to the table, but generally for production use cases, customers use several ways to add data to the table as documented in Loading data in Amazon Redshift.

  3. On the Redshift Serverless console, navigate to the namespace.
  4. On the Action dropdown menu, choose Register with AWS Glue Data Catalog to integrate with SageMaker Lakehouse.
  5. Choose Register.
  6. Sign in to the Lake Formation console as the data lake administrator.
  7. Under Data Catalog in the navigation pane, choose Catalogs and Pending catalog invitations.
  8. Select the pending invitation and choose Approve and create catalog.
  9. Provide a name for the catalog (for example, churn_lakehouse).
  10. Under Access from engines, select Access this catalog from Iceberg-compatible engines and choose DataTransferRole for the IAM role.
  11. Choose Next.
  12. Choose Add permissions.
  13. Under Principals, choose the datalakeadmin role for IAM users and roles, Super user for Catalog permissions, and choose Add.
  14. Choose Create catalog.

Redshift Table customer_churn is now created, populated with data and integrated with SageMaker Lakehouse.

Create a SageMaker Unified Studio domain and project

Alice now sets up SageMaker Unified Studio domain and projects so that she can bring users (Bob, Charlie and Doug) together in the new project.

Complete the following steps to create a SageMaker domain and project using SageMaker Unified Studio:

  1. On the SageMaker Unified Studio console, create a SageMaker Unified Studio domain and project using the All Capabilities profile template. For more details, refer to Setting up Amazon SageMaker Unified Studio. For this post, we create a project named churn_analysis.
  2. Setup AWS Identity center with users Bob, Charlie and Doug, Add them to domain and project.
  3. From SageMaker Unified Studio, navigate to the project overview and on the Project details tab, note the project role Amazon Resource Name (ARN).
  4. Sign in to the IAM console as an admin user.
  5. In the navigation pane, choose Roles.
  6. Search for the project role and add AmazonS3TablesReadOnlyAccess by choosing Add permissions.

SageMaker Unified Studio is now setup with domain, project and users.

Onboard S3 Tables and Redshift tables to the SageMaker Unified Studio project

Alice now configures SageMaker Unified Studio project role for fine-grained access control to determine who on her team gets to access what data sets.

Grant the project role full table access on customer dataset. For that, complete the following steps:

  1. Sign in to the Lake Formation console as the data lake administrator.
  2. In the navigation pane, choose Data lake permissions, then choose Grant.
  3. In the Principals section, for IAM users and roles, choose the project role ARN noted earlier.
  4. In the LF-Tags or catalog resources section, select Named Data Catalog resources:
    • Choose <account_id>:s3tablescatalog/blog-customer-bucket for Catalogs.
    • Choose customernamespace for Databases.
    • Choose customer for Tables.
  5. In the Table permissions section, select Select and Describe for permissions.
  6. Choose Grant.

Now grant the project role access to subset of columns  from customer_churn dataset.

  1. In the navigation pane, choose Data lake permissions, then choose Grant.
  2. In the Principals section, for IAM users and roles, choose the project role ARN noted earlier.
  3. In the LF-Tags or catalog resources section, select Named Data Catalog resources:
    • Choose <account_id>:churn_lakehouse/dev for Catalogs.
    • Choose public for Databases.
    • Choose customer_churn for Tables.
  4. In the Table Permissions section, select Select.
  5. In the Data Permissions section, select Column-based access.
  6. For Choose permission filter, select Include columns and choose customer_id, internet_service, and is_churned.
  7. Choose Grant.

All users in the project churn_analysis in SageMaker Unified Studio are now setup. They have access to all columns in the table and fine-grained access permissions for Redshift table where they have access to only three columns.

Verify data access in SageMaker Unified Studio

Alice can now do a final verification if the data is all available to ensure that each of her team members are set up to access the datasets.

Now you can verify data access for different users in SageMaker Unified Studio.

  1. Sign in to SageMaker Unified Studio as Bob and choose the churn_analysis
  2. Navigate to the Data explorer to view s3tablescatalog and churn_lakehouse under Lakehouse.

Data Analyst uses Athena for analyzing customer churn

Bob, the data analyst can now logs into to the SageMaker Unified Studio, chooses the churn_analysis project and navigates to the Build options and choose Query Editor under Data Analysis & Integration.

Bob chooses the connection as Athena (Lakehouse), the catalog as s3tablescatalog/blog-customer-bucket, and the database as customernamespace. And runs the following SQL to analyze the data for customer churn:

select * from "churn_lakehouse/dev"."public"."customer_churn" a, 
"s3tablescatalog/blog-customer-bucket"."customernamespace"."customer" b
where a.customer_id=b.c_customer_sk limit 10;

Bob can now join the data across S3 Tables and Redshift in Athena and now can proceed to build full SQL analytics capability to automate building customer growth and churn leadership daily reports.

BI Analyst uses Redshift engine for analyzing customer data

Charlie, the BI Analyst can now logs into the SageMaker Unified Studio and chooses the churn_analysis project. He navigates to the Build options and choose Query Editor under Data Analysis & Integration. He chooses the connection as Redshift (Lakehouse), Databases as dev, Schemas as public.

He then runs the follow SQL to perform his specific analysis.

select * from "dev@churn_lakehouse"."public"."customer_churn" a, 
"blog-customer-bucket@s3tablescatalog"."customernamespace"."customer" b
where a.customer_id=b.c_customer_sk limit 10;

Charlie can now further update the SQL query and use it to power QuickSight dashboards that can be shared with Sales team members.

Data engineer uses AWS Glue Spark engine to process customer data

Finally, Doug logs in to SageMaker Unified Studio as Doug and chooses the churn_analysis project to perform his analysis. He navigates to the Build options and choose JupyterLab under IDE & Applications. He downloads the churn_analysis.ipynb notebook and upload it into the explorer. He then runs the cells by selecting compute as project.spark.compatibility.

He runs the following SQL to analyze the data for customer churn:

Doug, now can use Spark SQL and start processing data from both S3 tables and Redshift tables and start  building forecasting models for customer growth and churn

Cleaning up

If you implemented the example and want to remove the resources, complete the following steps:

  1. Clean up S3 Tables resources:
    1. Delete the table.
    2. Delete the namespace in the table bucket.
    3. Delete the table bucket.
  2. Clean up the Redshift data resources:
    1. On the Lake Formation console, choose Catalogs in the navigation pane.
    2. Delete the churn_lakehouse catalog.
  3. Delete SageMaker project, IAM roles, Glue resources, Athena workgroup, S3 buckets created for domain.
  4. Delete SageMaker domain and VPC created for the setup.

Conclusion

In this post, we showed how you can use SageMaker Lakehouse to unify data across S3 Tables and Redshift data warehouses, which can help you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with Iceberg-compatible tools and engines. You can secure your data in the lakehouse by defining fine-grained permissions that are enforced across analytics and ML tools and engines.

For more information, refer to Tutorial: Getting started with S3 Tables, S3 Tables integration, and Connecting to the Data Catalog using AWS Glue Iceberg REST endpoint. We encourage you to try out the S3 Tables integration with SageMaker Lakehouse integration and share your feedback with us.


About the authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.

Aditya Kalyanakrishnan is a Senior Product Manager on the Amazon S3 team at AWS. He enjoys learning from customers about how they use Amazon S3 and helping them scale performance. Adi’s based in Seattle, and in his spare time enjoys hiking and occasionally brewing beer.

Integrate ThoughtSpot with Amazon Redshift using AWS IAM Identity Center

Post Syndicated from Maneesh Sharma original https://aws.amazon.com/blogs/big-data/integrate-thoughtspot-with-amazon-redshift-using-aws-iam-identity-center/

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.

The combination of Amazon Redshift and ThoughtSpot’s AI-powered analytics service enables organizations to transform their raw data into actionable insights with unprecedented speed and efficiency. Through this collaboration, Amazon Redshift now supports AWS IAM Identity Center integration with ThoughtSpot, enabling seamless and secure data access with streamlined authentication and authorization workflows. This single sign-on (SSO) integration is available across ThoughtSpot’s cloud landscape and can be used for both embedded and standalone analytics implementations.

Prior to the IAM Identity Center integration, ThoughtSpot users didn’t have native connectivity to integrate Amazon Redshift with their identity providers (IdPs), which can provide unified governance and identity propagation across multiple AWS services like AWS Lake Formation and Amazon Simple Storage Service (Amazon S3).

Now, ThoughtSpot users can natively connect to Amazon Redshift using the IAM Identity Center integration, which streamlines data analytics access management while maintaining robust security. By configuring Amazon Redshift as an AWS managed application, organizations benefit from SSO capabilities with trusted identity propagation and a trusted token issuer (TTI). The IAM Identity Center integration with Amazon Redshift provides centralized user management, automatically synchronizing access permissions with organizational changes—whether employees join, transition roles, or leave the organization. The solution uses Amazon Redshift role-based access control features that align with IdP groups synced in IAM Identity Center. Organizations can further enhance their security posture by using Lake Formation to define granular access control permissions on catalog resources for IdP identities. From a compliance and security standpoint, the integration offers comprehensive audit trails by logging end-user identities both in Amazon Redshift and AWS CloudTrail, providing visibility into data access patterns and user activities.

Dime Dimovski, a Data Warehousing Architect at Merck, shares:

“The recent integration of Amazon Redshift with our identity access management center will significantly enhance our data access management because we can propagate user identities across various tools. By using OAuth authentication from ThoughtSpot to Amazon Redshift, we will benefit from a seamless single sign-on experience—giving us granular access controls as well as the security and efficiency we need.”

In this post, we walk you through the process of setting up ThoughtSpot integration with Amazon Redshift using IAM Identity Center authentication. The solution provides a secure, streamlined analytics environment that empowers your team to focus on what matters most: discovering and sharing valuable business insights.

Solution overview

The following diagram illustrates the architecture of the ThoughtSpot SSO integration with Amazon Redshift, IAM Identity Center, and your IdP.

The solution includes the following steps:

  1. The user configures ThoughtSpot to access Amazon Redshift using IAM Identity Center.
  2. When a user attempts to sign in, ThoughtSpot initiates a browser-based OAuth flow and redirects the user to their preferred IdP (such as Okta or Microsoft EntraID) sign-in page to enter their credentials.
  3. Following successful authentication, IdP issues authentication tokens (ID and access token) to ThoughtSpot.
  4. The Amazon Redshift driver then makes a call to the Amazon Redshift enabled AWS Identity Center application and forwards the access token.
  5. Amazon Redshift passes the token to IAM Identity Center for validation.
  6. IAM Identity Center first validates the token using the OpenID Connect (OIDC) discovery connection to the TTI and returns an IAM Identity Center generated access token for the same user. The TTI enables you to use trusted identity propagation with applications that authenticate outside of AWS. In the preceding figure, the IdP authorization server is the TTI.
  7. Amazon Redshift uses IAM Identity Center APIs to obtain the user and group membership information from AWS Identity Center.
  8. The ThoughtSpot user can now connect with Amazon Redshift and access data based on the user and group membership returned from IAM Identity Center.

In this post, you will use the following steps to build the solution:

  1. Set up an OIDC application.
  2. Set up a TTI in IAM Identity Center.
  3. Set up client connections and TTIs in Amazon Redshift.
  4. Federate to Amazon Redshift from ThoughtSpot using IAM Identity Center.

Prerequisites

Before you begin implementing the solution, you must have the following in place:

Set up an OIDC application

In this section, we’ll show you the step-by-step process to set up an OIDC application using both Okta and EntraID as the identity providers.

Set up an Okta OIDC application

Complete the following steps to set up an Okta OIDC application:

  1. Sign in to your Okta organization as a user with administrative privileges.
  2. On the admin console, under Applications in the navigation pane, choose Applications.
  3. Choose Create App Integration.
  4. Select OIDC – OpenID Connect for Sign-in method and Web Application for Application type.
  5. Choose Next.
  6. On the General tab, provide the following information:
    1. For App integration name, enter a name for your app integration. For example, ThoughtSpot_Redshift_App.
    2. For Grant type, select Authorization Code and Refresh Token.
    3. For Sign-in redirect URIs, choose Add URI and along with the default URI, add the URI https://<your_okta_instance_name>/callosum/v1/connection/generateTokens. The sign-in redirect URI is where Okta sends the authentication response and ID token for the sign-in request. The URIs must be absolute URIs.
    4. For Sign-out redirect URIs, keep the default value as http://localhost:8080.
    5. Skip the Trusted Origins section and for Assignments, select Skip group assignment for now.
    6. Choose Save.
  7. Choose the Assignments tab and then choose Assign to Groups. In this example, we’re assigning awssso-finance and awssso-sales.
  8. Choose Done.

Set up an EntraID OIDC application

To create your EntraID application, follow these steps:

  1. Sign in to the Microsoft Entra admin center as Cloud Application Administrator (or higher level of access).
  2. Browse to App registrations under Manage, and choose New registration.
  3. Enter a name for the application. For example, ThoughtSpot-OIDC-App.
  4. Select a supported account type, which determines who can use the application. For this example, select the first option in the list.
  5. Under Redirect URI, choose Web for the type of application you want to create. Enter the URI where the access token is sent to. Your redirect URL will be in the format https://<your_instance_name>/callosum/v1/connection/generateTokens.
  6. Choose Register.
  7. In the navigation pane, choose Certificates & secrets.
  8. Choose New client secret.
  9. Enter a description and select an expiration for the secret or specify a custom lifetime. For this example, keep the Microsoft recommended default expiration value of 6 months.
  10. Choose Add.
  11. Copy the secret value.

The secret value will only be presented one time; after that you can’t read it. Make sure to copy it now. If you fail to save it, you must generate a new client secret.

  1. In the navigation pane, under Manage, choose Expose an API.

If you’re setting up for the first time, you can see Add to the right of the application ID URI.

  1. Choose Save.
  2. After the application ID URI is set up, choose Add a scope.
  3. For Scope name, enter a name. For example, redshift_login.
  4. For Admin consent display name, enter a display name. For example, redshift_login.
  5. For Admin consent description, enter a description of the scope.
  6. Choose Add scope.
  7. In the navigation pane, choose API permissions.
  8. Choose Add a permission and choose Microsoft Graph.
  9. Choose Delegated Permission.
  10. Under OpenId permissions, choose email, offlines_access, openid, and profile, and choose Add permissions.

Set up a TTI in IAM Identity Center

Assuming you have completed the prerequisites, you will establish your IdP as a TTI in your delegated administration account. To create a TTI, refer to How to add a trusted token issuer to the IAM Identity Center console. In this post, we walk through the steps to set up a TTI for both Okta and EntraID.

Set up a TTI for Okta

To get the issuer URL from Okta, complete the following steps:

  1. Sign in as an admin to Okta and navigate to Security and then to API.
  2. Choose Default on the Authorization Servers tab and copy the Issuer
    url.
  3. In the Map attributes section, choose which IdP attributes correspond to Identity Center attributes. For example, in the following screenshot, we mapped Okta’s Subject attribute to the Email attribute in IAM Identity Center.
  4. Choose Create trusted token issuer.

Set up a TTI for EntraID

Complete the following steps to set up a TTI for EntraID:

  1. To find out which token your application is using, under Manage, choose Manifest.
  2. Locate the accessTokenAcceptedVersion parameter: null or 1 indicate v1.0 tokens, and 2 indicates v2.0 tokens.

Next, you need to find the tenant ID value from EntraID.

  1. Go to the EntraID application, choose Overview, and a new page will appear containing the Essentials
  2. You can find the tenant ID value as shown in the following screenshot. If you’re using the v1.0 token, the issuer URL will be https://sts.windows.net/<Directory (tenant) ID>/. If you’re using the v2.0 token, the issuer URL will be https://login.microsoftonline.com/<Directory (tenantid) ID>/v2.0.
  3. For Map attributes, the following example uses Other, where we’re specifying the user principal name (upn) as the IdP attribute to map with Email from the IAM identity Center attribute.
  4. Choose Create trusted token issuer.

Set up client connections and TTIs in Amazon Redshift

In this step, you configure the Amazon Redshift applications that exchange externally generated tokens to use the TTI you created in the previous step. Also, the audience claim (or aud claim) from your IdP must be specified. You need to collect the audience value from the respective IdP.

Acquire the audience value from Okta

To acquire the audience value from Okta, complete the following steps:

  1. Sign in as an admin to Okta and navigate to Security and then to API.
  2. Choose Default on the Authorization Servers tab and copy the Audience value.

Acquire the audience value from EntraID

Similarly, to get the audience value EntraID, complete the following steps:

  1. Go to the EntraID application, choose Overview, and a new page will appear containing the Essentials
  2. You can find the audience value (Application ID URI) as shown in the following screenshot.

Configure the application

After you collect the audience value from the respective IdP, you need to configure the Amazon Redshift application in the member account where the Amazon Redshift cluster or serverless instance exists.

  1. Choose IAM Identity Center connection in the navigation pane on the Amazon Redshift console.
  2. Choose the Amazon Redshift application that you created as part of the prerequisites.
  3. Choose the Client connections tab and choose Edit.
  4. Choose Yes under Configure client connections that use third-party IdPs.
  5. Select the check box for Trusted token issuer that you created in the previous section.
  6. For Aud claim, enter the audience claim value under Configure selected trusted token issuers.
  7. Choose Save.

Your IAM Identity Center, Amazon Redshift, and IdP configuration is complete. Next, you need to configure ThoughtSpot.

Federate to Amazon Redshift from ThoughtSpot using IAM Identity Center

Complete the following steps in ThoughtSpot to federate with Amazon Redshift using IAM Identity Center authentication:

  1. Sign in to ThoughtSpot cloud.
  2. Choose Data in the top navigation bar.
  3. Open the Connections tab in the navigation pane, and select the Redshift

Alternatively, you can choose Create new in the navigation pane, choose Connection, and select the Redshift tile.

  1. Create a name for your connection and a description (optional), then choose Continue.
  2. Under Authentication Type, choose AWS IDC OAuth and enter following details:
    1. For Host, enter the Redshift endpoint. For example, test-cluster.ab6yejheyhgf.us-east-1.redshift.amazonaws.com.
    2. For Port, enter 5439.
    3. For OAuth Client ID, enter the client ID from the IdP OIDC application.
    4. For OAuth Client Secret, enter the client secret from the IdP OIDC application.
    5. For Scope, enter the scope from the IdP application:
      • For Okta, use openid offline_access openid profile. You can use the Okta scope values shared earlier as is on ThoughtSpot. You can modify the scope according to your requirements.
      • For EntraID, use the API scope and API permissions. For example, api://1230a234-b456-7890-99c9-a12345bcc123/redshift_login offline_access.
    6. For API scope value, go to the OIDC application, and under Manage, choose Expose an API to acquire the value.
    7. For API permissions, go to the OIDC application, and under Manage, choose API permissions to acquire the permissions.
    8. For Auth Url, enter the authorization endpoint URI:
      • For Okta use https:// <okta-hostname>/oauth2/default/v1/authorize. For example, https://prod-1234567.okta.com/oauth2/default/v1/authorize.
      • For EntraID, use https://login.microsoftonline.com/<Directory (tenantid) ID>/oauth2/v2.0/authorize. For example, https://login.microsoftonline.com/e12a1ab3-1234-12ab-12b3-1a5012221d12/oauth2/v2.0/authorize.
    9. For Access token Url, enter the token endpoint URI:
      • For Okta, use https://<okta-hostname>/oauth2/default/v1/token. For example, https://prod-1234567.okta.com/oauth2/default/v1/token.
      • For EntraID, use https://login.microsoftonline.com/<Directory (tenantid) ID>/oauth2/v2.0/token. For example, https://login.microsoftonline.com/e12a1ab3-1234-12ab-12b3-1a5012221d12/oauth2/v2.0/token.
    10. For AWS Identity Namespace, enter the namespace configured in your Amazon Redshift IAM Identity Center application. The default value is AWSIDC unless previously customized. For this example, we use awsidc.
    11. For Database, enter the database name you want to connect. For example, dev.
  3. Choose Continue.
  4. Enter your IdP user credentials in the browser pop-up window.

The following screenshot illustrates the ThoughtSpot integration with Amazon Redshift using Okta as the IdP.

The following screenshot shows the ThoughtSpot integration with Amazon Redshift using EntraID as the IdP.

Upon a successful authentication, you will be redirected back to ThoughtSpot and logged in as an IAM Identity Center authenticated user.

Congratulations! You’ve logged in through IAM Identity Center and Amazon Redshift, and you’re ready to dive into your data analysis with ThoughtSpot.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the IdP applications that you created to integrate with IAM Identity Center.
  2. Delete the IAM Identity Center configuration.
  3. Delete the Amazon Redshift application and the Amazon Redshift provisioned cluster or serverless instance that you created for testing.
  4. Delete the IAM role and IAM policy that you created for IAM Identity Center and Amazon Redshift integration.
  5. Delete the permission set from IAM Identity Center that you created for Amazon Redshift Query Editor V2 in the management account.
  6. Delete the ThoughtSpot connection to integrate with Amazon Redshift using AWS IDC OAuth.

Conclusion

In this post, we explored how to integrate ThoughtSpot with Amazon Redshift using IAM Identity Center. The process consisted of registering an OIDC application, setting up an IAM Identity Center TTI, and finally configuring ThoughtSpot for IAM Identity Center authentication. This setup creates a robust and secure analytics environment that streamlines data access for business users.

For additional guidance and detailed documentation, refer to the following key resources:


About the authors

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

BP Yau is a Sr Partner Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Ali Alladin is the Senior Director of Product Management and Partner Solutions at ThoughtSpot. In this role, Ali oversees Cloud Engineering and Operations, ensuring seamless integration and optimal performance of ThoughtSpot’s cloud-based services. Additionally, Ali spearheads the development of AI-powered solutions in augmented and embedded analytics, collaborating closely with technology partners to drive innovation and deliver cutting-edge analytics capabilities. With a robust background in product management and a keen understanding of AI technologies, Ali is dedicated to pushing the boundaries of what’s possible in the analytics space, helping organizations harness the full potential of their data.

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Enhance Agentforce data security with Private Connect for Salesforce Data Cloud and Amazon Redshift – Part 3

Post Syndicated from Yogesh Dhimate original https://aws.amazon.com/blogs/big-data/enhance-agentforce-data-security-with-private-connect-for-salesforce-data-cloud-and-amazon-redshift-part-3/

Data protection is a high priority, particularly as organizations face increasing cybersecurity threats. Maintaining the security of customer data is top priority for AWS and Salesforce. With AWS PrivateLink, Salesforce Private Connect eliminates common security risks associated with public endpoints. Salesforce Private Connect now works with Salesforce Data Cloud to keep your customer data secure when using with key services like Agentforce.

In Part 2 of this series, we discussed the architecture and implementation details of cross-Region data sharing between Salesforce Data Cloud and AWS accounts. In this post, we discuss how to create AWS endpoint services to improve data security with Private Connect for Salesforce Data Cloud.

Solution overview

In this example, we configure PrivateLink for an Amazon Redshift instance to enable direct, private connectivity from Salesforce Data Cloud. AWS recommends that organizations use an Amazon Redshift managed VPC endpoint (powered by PrivateLink) to privately access a Redshift cluster or serverless workgroup. For details about best practices, refer to Enable private access to Amazon Redshift from your client applications in another VPC.

However, some organizations might prefer to use PrivateLink managed by themselves—for example, a Redshift managed VPC endpoint is not yet available in Salesforce Data Cloud, and you need to manage your PrivateLink connection. This post focuses on the solution to configure self-managed PrivateLink between Salesforce Data Cloud and Amazon Redshift in your AWS account to establish private connectivity.

The following architecture diagram shows the steps for setting up private connectivity between Salesforce Data Cloud and Amazon Redshift in your AWS account.

To set up private connectivity between Salesforce Data Cloud and Amazon Redshift, we use the following resources:

Prerequisites

To complete the steps in this post, you must already have Amazon Redshift running in a private subnet and have the permissions to manage it.

Create a security group for the Network Load Balancer

The security group acts as a virtual firewall. The only traffic that reaches the instance is the traffic allowed by the security group rules. To enhance the security posture, you only want to allow traffic to Redshift instances. Complete the following steps to create a security group for your Network Load Balancer (NLB):

  1. On the Amazon VPC console, choose Security groups in the navigation pane.
  2. Choose Create security group.
  3. Enter a name and description for the security group.
  4. For VPC, use the same virtual private cloud (VPC) as your Redshift cluster.
  5. For Inbound rules, add a rule to allow traffic to ingress the listening port 5439 on the load balancer.

  1. For Outbound rules, add a rule to allow traffic to your Redshift instance.

  1. Choose Create security group.

Create a target group

Complete the following steps to create a target group:

  1. On the Amazon EC2 console, under Load balancing in the navigation pane, choose Target groups.
  2. Choose Create target group.
  3. For Choose a target type, select IP addresses.

  1. For Protocol: Port, choose TCP and port 5436 (if your Redshift cluster runs on a different port, change the port accordingly).
  2. For IP address type, select IPv4.
  3. For VPC, choose the same VPC as your Redshift cluster.
  4. Choose Next.

  1. For Enter an IPv4 address from a VPC subnet, enter your Amazon Redshift IP address.

To locate this address, navigate to your cluster details on the Amazon Redshift console, choose the Properties tab, and under Network and security settings, expand VPC endpoint connection details and copy the private address of the network interface. If you’re using Amazon Redshift Serverless, navigate to the workgroup home page. The Amazon Redshift IPv4 addresses can be located in the Network and security section under Data access when you choose VPC endpoint ID.

  1. After you add the IP address, choose Include as pending below, then choose Create target group.

Create a load balancer

Complete the following steps to create a load balancer:

  1. On the Amazon EC2 console, choose Load balancers in the navigation pane.
  2. Choose Create load balancer.
  3. Choose Network.
  4. For Load balancer name, enter a name.
  5. For Scheme, select Internal.
  6. For Load balancer address type, select IPv4.
  7. For VPC, use the VPC that your target group is in.

  1. For Availably Zones, select the Availability Zone where the Redshift cluster is running.
  2. For Security groups, choose the security group you created in the previous step.
  3. For Listener details, add a listener that points to the target group created in the last step:
    1. For Protocol, choose TCP.
    2. For Port, use 5439.
    3. For Default action, choose Redshift-TargetGroup.
  4. Choose Create load balancer.

Make sure that the registered targets in the target group are healthy before proceeding. Also make sure that the target group has a target for all Availability Zones in your AWS Region or the NLB has the Cross-zone load balancing attribute enabled.

In the load balancer’s security setting, make sure that Enforce inbound rules on PrivateLink traffic is off.

Create an endpoint service

Complete the following steps to create an endpoint service:

  1. On the Amazon VPC console, choose Endpoint services in the navigation pane.
  2. Choose Create endpoint service.
  3. For Load balancer type, choose Network.
  4. For Available load balancers, select the load balancer you created in the last step
  5. From Supported Regions, select an additional region if Data Cloud isn’t hosted in the same AWS region as the Redshift instance.  For additional settings leave Acceptance required.

If this is selected, later, when the Salesforce Data Cloud endpoint is created to connect to the endpoint service, you will need to come back to this page to accept the connection. If not selected, the connection will be built directly.

  1. For Supported IP address type, select IPv4.
  2. Choose Create.

Next, you need to allow Salesforce principals.

  1. After you create the endpoint service, choose Allow principals.
  2. In another browser, navigate to Salesforce Data Cloud Setup.
  3. Under External Integrations, access the new Private Connect menu item.
  4. Create a new private network route to Amazon Redshift.

  1. Copy the principal ID.

  1. Return to the endpoint service creation page.
  2. For Principals to add, enter the principal ID.
  3. Copy the endpoint service name.
  4. Choose Allow principals.

  1. Return to the Salesforce Data Cloud private network configuration page.
  2. For Route Name, enter the endpoint service name.
  3. Choose Save.

The route status should show as Allocating.

If you opted to accept connections in the previous step, you will now need to accept the connection from Salesforce Data Cloud.

  1. On the Amazon VPC console, navigate to the endpoint service.
  2. On the Endpoint connections tab, locate your pending connection request.

  1. Accept the endpoint connection request from Salesforce Data Cloud.

Navigate to the Salesforce Data Cloud setup and wait 30 seconds, then refresh the private connect route so the status shows as Ready.

You can now use this route when creating a connection with Amazon Redshift. For additional details, refer to Part 1 of this series.

Amazon Redshift federation PrivateLink failover

Now that we have discussed how to configure PrivateLink to use with Private Connect for Salesforce Data Cloud, let’s discuss Amazon Redshift federation PrivateLink failover scenarios.

You can choose to deploy your Redshift clusters in three different deployment modes:

  • Amazon Redshift provisioned in a Single-AZ RA3 cluster
  • Amazon Redshift provisioned in a Multi-AZ RA3 cluster
  • Amazon Redshift Serverless

PrivateLink relies on a customer managed NLB connected to service endpoints using IP address target groups. The target group has the IP addresses of your Redshift instance. If there is a change in IP address targets, the NLB target group must be updated to the new IP addresses associated with the service. Failover behavior for Amazon Redshift will differ based on the deployment mode you employ.

This section describes PrivateLink failover scenarios for these three deployment modes.

Amazon Redshift provisioned in a Single-AZ RA3 cluster

RA3 nodes support provisioned cluster VPC endpoints, which decouple the backend infrastructure from the cluster endpoint used for access. When you create or restore an RA3 cluster, Amazon Redshift uses a port within the ranges of 5431–5455 or 8191–8215. When the cluster is set to a port in one of these ranges, Amazon Redshift automatically creates a VPC endpoint in your AWS account for the cluster and attaches network interfaces with a private IP for each Availability Zone in the cluster. For the PrivateLink configuration, you use the IP associated with the VPC endpoint as the target for the frontend NLB. You can identify the IP address of the VPC endpoint on the Amazon Redshift console or by doing a describe-clusters query on the Redshift cluster.

Amazon Redshift will not remove a network interface associated with a VPC endpoint unless you add an additional subnet to an existing Availability Zone or remove a subnet using Amazon Redshift APIs. We recommend that you don’t add multiple subnets to an Availability Zone to avoid disruption. There might be failover scenarios where additional network interfaces are added to a VPC endpoint.

In RA3 clusters, the nodes are automatically recovered and replaced as needed by Amazon Redshift. The cluster’s VPC endpoint will not change even if the leader node is replaced.

Cluster relocation is an optional feature that allows Amazon Redshift to move a cluster to another Availability Zone without any loss of data or changes to your applications. When cluster relocation is turned on, Amazon Redshift might choose to relocate clusters in some situations. In particular, this happens where issues in the current Availability Zone prevent optimal cluster operation or to improve service availability. You can also invoke the relocation function in cases where resource constraints in a given Availability Zone are disrupting cluster operations. When a Redshift cluster is relocated to a new Availability Zone, the new cluster has the same VPC endpoint but a new network interface is added in the new Availability Zone. The new private address should be added to the NLB’s target group to optimize availability and performance.

In the case that a cluster has failed and can’t be recovered automatically, you have to initiate a restore of the cluster from a previous snapshot. This action generates a new cluster with a new DNS name, connection string, and VPC endpoint and IP address for the cluster. You have to update the NLB with the new IP for the VPC endpoint of the new cluster.

Amazon Redshift provisioned in a Multi-AZ RA3 cluster

Amazon Redshift supports Multi-AZ deployments for provisioned RA3 clusters. By using Multi-AZ deployments, your Redshift data warehouse can continue operating in failure scenarios when an unexpected event happens in an Availability Zone. A Multi-AZ deployment deploys compute resources in two Availability Zones, and these compute resources can be accessed through a single endpoint. In the case of a failure of the primary nodes, Multi-AZ clusters will make secondary nodes primary and deploy a new secondary stack in another Availability Zone. The following diagram illustrates this architecture.

Multi-AZ clusters deploy VPC endpoints that point to network interfaces in two Availability Zones, which should be configured as a part of the NLB target group. To configure the VPC endpoints in the NLB target group, you can identify the IP addresses of the VPC endpoint using the Amazon Redshift console or by doing a describe-clusters query on the Redshift cluster. In a failover scenario, VPC endpoint IPs will not change and the NLB doesn’t require an update.

Amazon Redshift will not remove a network interface associated with a VPC endpoint unless you add an additional subnet in to an existing Availability Zone or remove a subnet using Amazon Redshift APIs. We recommend that you don’t add multiple subnets to an Availability Zone to avoid disruption.

Amazon Redshift Serverless

Redshift Serverless provides managed infrastructure. You can perform the get-workgroup query to get the workgroup’s VpcEndpoint IPs. IPs should be configured in the target group of the PrivateLink NLB. Because this is a managed service, the failover is managed by AWS. During the event of an underlying Availability Zone failure, the workgroup might get a new set of IPs. You can frequently query the workgroup configuration or DNS record for the Redshift cluster to check if IP addresses have changed and update the NLB accordingly.

Automating IP address management

In scenarios where Amazon Redshift operations might change the IP address of the endpoint needed for Amazon Redshift connectivity, you can automate the update of NLB network targets by monitoring the results for cluster DNS resolution, using describe-cluster or get-workgroup queries, and using an AWS Lambda function to update the NLB target group configuration.

You can periodically (on a schedule) query the DNS of the Redshift cluster for IP address resolution. Use a Lambda function to compare and update the IP target groups for the NLB. For an example of this solution, see Hostname-as-Target for Network Load Balancers.

For legacy DS2 clusters where the IP address of the leader node must be explicitly monitored, you can configure Amazon CloudWatch metrics to monitor the HealthStatus of the leader node. You can configure the metric to trigger an alarm, which alerts an Amazon Simple Notification Service (Amazon SNS) topic and invokes a Lambda function to reconcile the NLB target group.

For backup and restore patterns, you can create a rule in Amazon EventBridge triggered on the RestoreFromClusterSnapshot API action, which invokes a Lambda function to update the NLB with the new IP addresses of the cluster.

For a cluster relocation pattern, you can trigger an event based on the Amazon Redshift ModifyCluster availability-zone-relocation API action.

Conclusion

In this post, we discussed how to use AWS endpoint services to improve data security with Private Connect for Salesforce Data Cloud. If you are currently using the Salesforce Data Cloud zero-copy integration with Amazon Redshift, we recommend you follow the steps provided in this post to make the network connection between Salesforce and AWS secure. Reach out to your Salesforce and AWS support teams if you need additional support to implement this solution.


About the authors

Yogesh Dhimate is a Sr. Partner Solutions Architect at AWS, leading technology partnership with Salesforce. Prior to joining AWS, Yogesh worked with leading companies including Salesforce driving their industry solution initiatives. With over 20 years of experience in product management and solutions architecture Yogesh brings unique perspective in cloud computing and artificial intelligence.

Avijit Goswami is a Principal Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open source solutions. Outside of his work, Avijit likes to travel, hike, watch sports, and listen to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Mike Patterson is a Senior Customer Solutions Manager in the Strategic ISV segment at AWS. He has partnered with Salesforce Data Cloud to align business objectives with innovative AWS solutions to achieve impactful customer experiences. In his spare time, he enjoys spending time with his family, sports, and outdoor activities.

Drew Loika is a Director of Product Management at Salesforce and has spent over 15 years delivering customer value via data platforms and services. When not diving deep with customers on what would help them be more successful, he enjoys the acts of making, growing, and exploring the great outdoors.

AWS Weekly Roundup: Omdia recognition, Amazon Bedrock RAG evaluation, International Women’s Day events, and more (March 24, 2025)

Post Syndicated from Betty Zheng (郑予彬) original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-omdia-recognition-amazon-bedrock-rag-evaluation-international-womens-day-events-and-more-march-24-2025/

As we celebrate International Women’s Day (IWD) this March, I had the privilege of attending the ‘Women in Tech’ User Group meetup in Shenzhen last weekend. I was inspired to see over 100 women in tech from different industries come together to discuss AI ethics from a female perspective. Together, we explored strategies such as reducing gender bias in AI systems and promoting diverse representation in model training data. In the AWS Cloud Lab, participants used Amazon Bedrock with large language models (LLMs) to generate rose bloom videos, which was the most popular part of this meetup.

These gatherings are crucial to our efforts to engage more women in AI technology exploration and development, and to help make sure that the generative AI era evolves without gender bias. The collaborative spirit and technical curiosity displayed throughout the event is further proof that diverse teams truly build inclusive and effective solutions.

Speaking of vibrant community engagement, I also had the honor of presenting at Kubernetes Community Day (KCD) Beijing 2025 this weekend. The enthusiasm Omdia Universe: Cloud Container Management & Services 2024-25 reportfor container technologies was remarkable, with nearly 300 developers gathering to share experiences and best practices. During my keynote introducing the DoEKS project from Amazon Web Services (AWS), I was struck by the depth of interest in managed Kubernetes services. The audience’s questions revealed how widely adopted services such as Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Elastic Container Service (Amazon ECS) have become among Chinese developers building mission-critical applications.This strong community interest aligns perfectly with findings from the Omdia Universe: Cloud Container Management & Services 2024–25 report. In this comprehensive evaluation of container management solutions hosted on public clouds, AWS was recognized as a Leader. The report specifically highlights that AWS offers “widest range of options for working with Kubernetes or its own container management service, across cloud, edge, and on-premises environments.” You can read the full report about AWS offerings to learn more about our comprehensive container portfolio and how we’re helping builders deploy scalable, reliable containerized applications.

Last Week’s launches

In addition to the inspiring community events, here are some AWS launches that caught my attention.

Amazon Q Business browser extension gets upgrades – The Amazon Q Business browser extension now features significant enhancements designed to streamline browser-based tasks. Users gain access to their company’s indexed knowledge alongside web content, direct PDF support within the browser, image file attachment capabilities, and controls to remove irrelevant attachments from conversation context. The expanded context window accommodates larger web pages and more detailed prompts, resulting in more helpful responses. For advanced needs, the extension offers seamless transition to the full Amazon Q Business web experience with access to Actions and Amazon Q Apps. Review the Enhancing web browsing with Amazon Q Business in the documentation for detailed setup instructions and feature descriptions to learn more about this announcement.

Amazon Bedrock RAG evaluation is now generally available – Offering comprehensive assessment of both Bedrock Knowledge Bases and custom Retrieval Augmented Generation (RAG) systems through LLM-as-a-judge methodology. The service evaluates retrieval quality and end-to-end generation with metrics for relevance, correctness, and hallucination detection, and the newly added support for custom RAG pipeline evaluations lets you bring your own input-output pairs and retrieved contexts directly into the evaluation job, along with new citation precision metrics and Amazon Bedrock Guardrails integration for more flexible RAG system optimization. To learn more, visit the Amazon Bedrock Evaluations page and What is Amazon Bedrock? in the documentation.

Amazon Nova expands Tool Choice options for Converse API – We’ve enhanced Amazon Nova with expanded Tool Choice capabilities for the Converse API, giving developers more flexibility in building sophisticated AI applications. This update allows models to determine when to use tools to fulfill user requests more effectively. Learn more in the announcement about expands Tool Choice options.

Amazon Bedrock Guardrails adds policy-based enforcement for responsible AI – Our builders can now enforce responsible AI policies at scale with Amazon Bedrock Guardrails’ new AWS Identity and Access Management (IAM) policy-based enforcement capabilities. This feature helps you to specify required guardrails through IAM policies using the bedrock:GuardrailIdentifiercondition key, so that all model inference calls comply with your organization’s AI safety standards. When your teams make Amazon Bedrock Invoke or Converse API calls, requests are automatically rejected if they don’t include the mandated guardrails, providing consistent protection against undesirable content, sensitive information exposure, and model hallucinations. Refer to the Set up permissions to use Guaidrails for content filtering in the technical documentation and the Amazon Bedrock Guardrails product page to learn more about the announcement about policy based enforcement for responsible AI.

Next generation of Amazon Connect released – We’ve launched the next generation of Amazon Connect, featuring AI-powered interactions designed to strengthen customer relationships and improve business outcomes. This major update brings enhanced agent experiences, smarter customer interactions, and deeper operational insights to contact centers of all sizes. Learn more from the new launch post in the AWS Contact Center Blog.

Amazon Redshift Serverless introduces Current and Trailing release tracksAmazon Redshift Serverless now offers two release tracks to give users more control over their update cadence. The Current track delivers the most up-to-date certified release with the latest features and security updates, while the Trailing track remains on the previous certified release. This dual-track approach allows organizations to validate new releases on select workgroups before implementing them across production environments. Users can easily switch between tracks through the Amazon Redshift console, providing the flexibility to balance innovation with stability for mission-critical workloads. This capability is available in all AWS Regions where Amazon Redshift Serverless is offered. Refer to Tracks for Amazon Redshift provisioned cluster and serverless work groups to learn more about the Current and Trailing tracks in Amazon Redshift Serverless.

AWS WAF now supports URI fragment field matchingAWS WAF has expanded its capability to include URI fragment field matching, allowing security teams to create rules that inspect and match against the fragment portion of URLs. This enhancement enables more precise security controls for web applications that use URI fragments to identify specific sections within pages. Security professionals can now implement more targeted protections, such as restricting access to sensitive page elements, detecting suspicious navigation patterns, and enhancing bot mitigation by analyzing fragment usage patterns characteristic of automated attacks. This feature is available in all AWS Regions where AWS WAF is supported. For more information about URI field for matching, visit the AWS WAF Developer Guide.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS.

Other AWS news

Here are some other additional projects and blog posts that you might find interesting.

Build your generative AI skills at AWS Gen AI Lofts – AWS has established more than 10 global hubs offering training and networking for developers and startups in 2025, where you can gain practical, hands-on experience with the latest AI technologies. These revamped spaces feature dedicated zones where you can participate in workshops on prompt engineering, foundation model (FM) selection, and implementing AI in production environments. If you’re near San Francisco, New York, Tokyo, or other major tech hubs with AWS Gen AI Lofts, stop by to access these free resources and accelerate your generative AI development skills. Check out all of the AWS Gen AI Loft locations and events and to read 5 ways to build your AI skills on AWS Gen AI Loft to learn more.

AWS Lambda‘s architecture for billions of asynchronous invocations – A recent technical article reveals how AWS Lambda handles massive scale through sophisticated engineering approaches. The Lambda asynchronous invocation path employs multiple queuing strategies, consistent hashing for intelligent partitioning, and shuffle-sharding techniques to minimize noisy neighbor effects. The system relies on key observability metrics (AsyncEventReceived, AsyncEventAge, and AsyncEventDropped) to maintain optimal performance. These architectural decisions enable Lambda to process tens of trillions of monthly invocations across 1.5 million active customers while providing reliable scalability and performance isolation. For details read Handling billions of invocations – best practices from AWS Lambda in the AWS computing blog.

AWS is reducing prices by more than 11% for its high-memory U7i instances across all Regions and pricing models. The reduction applies to four instances: u7i-12tb.224xlarge, u7in-16tb.224xlarge, u7in-24tb.224xlarge, and u7in-32tb.224xlarge. The new On-Demand pricing, which covers shared, dedicated, and host tenancy options is retroactive, to March 1, 2025. For new Savings Plan purchases, pricing is effective immediately.

Create your AWS Builder ID and reserve your alias – Builder ID is a universal login credential that gives you access beyond the AWS Management Console to AWS tools and resources, including over 600 free training courses, community features, and developer tools such as Amazon Q Developer.

From community.aws
Here are some of my favorite posts from community.aws.

Model Context Protocol (MCP): why it matters – The recently introduced Model Context Protocol (MCP) creates a standardized way for AI applications to communicate with multiple FMs using consistent prompts and tools.

Build serverless GenAI Apps faster with Amazon Q Developer CLI agent – Discover how Amazon Q Developer CLI Agent revolutionizes cloud development by building a complete serverless generative AI application in minutes instead of days.

Automating code reviews with Amazon Q and GitHub actions – A new developer tutorial demonstrates how to integrate Amazon Q Developer with GitHub Actions to automatically analyze pull requests and provide AI-powered code feedback.

DeepSeek on AWS – A new technical guide demonstrates how to deploy DeepSeek’s powerful open-source AI models on AWS infrastructure. The tutorial provides step-by-step instructions for setting up these cutting-edge models using Amazon SageMaker, Amazon Elastic Compute Cloud (Amazon EC2) instances with GPUs, or through integration with Amazon Bedrock. The guide covers optimization techniques, sample applications, and best practices for balancing performance with cost efficiency.

Upcoming AWS events
Check your calendars and sign up for these upcoming AWS events.

Empowering Futures – Women Leading the Way in Tech and Non-Tech Careers – Whether you’re here to expand your professional circle, learn about the AWS Cloud or gain wisdom from inspiring speakers, this event has something for everyone. This is a public event open to everyone in the Seattle area—for free—on March 27, 2025.

AWS at KubeCon + CloudNativeCon London 2025 – Join us at KubeCon London on April 1 – April 4 , at Excel booth S300 for live product demonstrations that help you simplify Kubernetes operations, optimize costs and performance, harness the power of artificial learning and machine learning (AI/ML), and build scalable platform strategies.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Betty

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!


How is the News Blog doing? Take this 1 minute survey!

(This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.)

Using Amazon S3 Tables with Amazon Redshift to query Apache Iceberg tables

Post Syndicated from Jonathan Katz original https://aws.amazon.com/blogs/big-data/using-amazon-s3-tables-with-amazon-redshift-to-query-apache-iceberg-tables/

Amazon Redshift supports querying data stored using Apache Iceberg tables, an open table format that simplifies management of tabular data residing in data lakes on Amazon Simple Storage Service (Amazon S3). Amazon S3 Tables delivers the first cloud object store with built-in Iceberg support and streamlines storing tabular data at scale, including continual table optimizations that help improve query performance. Amazon SageMaker Lakehouse unifies your data across S3 data lakes, including S3 Tables, and Amazon Redshift data warehouses, helps you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data, querying data stored in S3 Tables without the need for complex extract, transform, and load (ETL) or data movement processes. You can take advantage of the scalability of S3 Tables to store and manage large volumes of data, optimize costs by avoiding additional data movement steps, and simplify data management through centralized fine-grained access control from SageMaker Lakehouse.

In this post, we demonstrate how to get started with S3 Tables and Amazon Redshift Serverless for querying data in Iceberg tables. We show how to set up S3 Tables, load data, register them in the unified data lake catalog, set up basic access controls in SageMaker Lakehouse through AWS Lake Formation, and query the data using Amazon Redshift.

Note – Amazon Redshift is just one option for querying data stored in S3 Tables. You can learn more about S3 Tables and additional ways to query and analyze data on the S3 Tables product page.

Solution overview

In this solution, we show how to query Iceberg tables managed in S3 Tables using Amazon Redshift. Specifically, we load a dataset into S3 Tables, link the data in S3 Tables to a Redshift Serverless workgroup with appropriate permissions, and finally run queries to analyze our dataset for trends and insights. The following diagram illustrates this workflow.

In this post, we will walk through the following steps:

  1. Create a table bucket in S3 Tables and integrate with other AWS analytics services.
  2. Set up permissions and create Iceberg tables with SageMaker Lakehouse using Lake Formation.
  3. Load data with Amazon Athena. There are different ways to ingest data into S3 Tables, but for this post, we show how we can quickly get started with Athena.
  4. Use Amazon Redshift to query your Iceberg tables stored in S3 Tables through the auto mounted catalog.

Prerequisites

The examples in this post require you to use the following AWS services and features:

Create a table bucket in S3 Tables

Before you can use Amazon Redshift to query the data in S3 Tables, you must first create a table bucket. Complete the following steps:

  1. In the Amazon S3 console, choose Table buckets on the left navigation pane.
  2. In the Integration with AWS analytics services section, choose Enable integration if you haven’t previously set this up.

This sets up the integration with AWS analytics services, including Amazon Redshift, Amazon EMR, and Athena.

After a few seconds, the status will change to Enabled.

  1. Choose Create table bucket.
  2. Enter a bucket name. For this example, we use the bucket name redshifticeberg.
  3. Choose Create table bucket.

After the S3 table bucket is created, you will be redirected to the table buckets list.

Now that your table bucket is created, the next step is to configure the unified catalog in SageMaker Lakehouse through the Lake Formation console. This will make the table bucket in S3 Tables available to Amazon Redshift for querying Iceberg tables.

Publishing Iceberg tables in S3 Tables to SageMaker Lakehouse

Before you can query Iceberg tables in S3 Tables with Amazon Redshift, you must first make the table bucket available in the unified catalog in SageMaker Lakehouse. You can do this through the Lake Formation console, which lets you publish catalogs and manage tables through the catalogs feature, and assign permissions to users. The following steps show you how to set up Lake Formation so you can use Amazon Redshift to query Iceberg tables in your table bucket:

  1. If you’ve never visited the Lake Formation console before, you must first do so as an AWS user with admin permissions to activate Lake Formation.

You will be redirected to the Catalogs page on the Lake Formation console. You will see that one of the catalogs available is the s3tablescatalog, which maintains a catalog of the table buckets you’ve created. The following steps will configure Lake Formation to make data in the s3tablescatalog catalog available to Amazon Redshift.

Next, you need to create a database in Lake Formation. The Lake Formation database maps to a Redshift schema.

  1. Choose Databases under Data Catalog in the navigation pane.
  2. On the Create menu, choose Database.

  1. Enter a name for this database. This example uses icebergsons3.
  2. For Catalog, choose the table bucket that you created. In this example, the name will have the format <ACCOUNT ID>:s3tablescatalog/redshifticeberg.
  3. Choose Create database.

You will be redirected on the Lake Formation console to a page with more information about your new database. Now you can create an Iceberg table in S3 Tables.

  1. On the database details page, on the View menu, choose Tables.

This will open up a new browser window with the table editor for this database.

  1. After the table view loads, choose Create table to start creating the table.

  1. In the editor, enter the name of the table. We call this table examples.
  2. Choose the catalog (<ACCOUNT ID>:s3tablescatalog/redshifticeberg) and database (icebergsons3).

Next, add columns to your table.

  1. In the Schema section, choose Add column, and add a column that represents an ID.

  1. Repeat this step and add columns for additional data:
    1. category_id (long)
    2. insert_date (date)
    3. data (string)

The final schema looks like the following screenshot.

  1. Choose Submit to create the table.

Next, you need to set up a read-only permission so you can query Iceberg data in S3 Tables using the Amazon Redshift Query Editor v2. For more information, see Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.

  1. Under Administration in the navigation pane, choose Administrative roles and tasks.
  2. In the Data lake administrators section, choose Add.

  1. For Access type, select Read-only administrator.
  2. For IAM users and roles, enter AWSServiceRoleForRedshift.

AWSServiceRoleForRedshift is a service-linked role that’s managed by AWS.

  1. Choose Confirm.

You have now configured SageMaker Lakehouse using Lake Formation to allow Amazon Redshift to query Iceberg tables in S3 Tables. Next, you populate some data into the Iceberg table, and query it with Amazon Redshift.

Use SQL to query Iceberg data with Amazon Redshift

For this example, we use Athena to load data into our Iceberg table. This is one option for ingesting data into an Iceberg table; see Using Amazon S3 Tables with AWS analytics services for other options, including Amazon EMR with Spark, Amazon Data Firehose, and AWS Glue ETL.

  1. On the Athena console, navigate to the query editor.
  2. If this is your first time using Athena, you must first specify a query result location before executing your first query.
  3. In the query editor, under Data, choose your data source (AwsDataCatalog).
  4. For Catalog, choose the table bucket you created (s3tablescatalog/redshifticeberg).
  5. For Database, choose the database you created (icebergsons3).

  1. Let’s execute a query to generate data for the examples table. The following query generates over 1.5 million rows corresponding to 30 days of data. Enter the query and choose Run.
INSERT INTO icebergsons3.examples
SELECT
    b.id * (date_diff('day', CURRENT_DATE, a.insert_date) + 1),
    b.id % 1000, a.insert_date,
    CAST(random() AS varchar)
FROM
    unnest(
        sequence(CURRENT_DATE, CURRENT_DATE + INTERVAL '30' DAY, INTERVAL '1' DAY)
    ) AS a(insert_date),
    unnest(sequence(1, 50000)) AS b(id);

The following screenshot shows our query.

The query takes about 10 seconds to execute.

Now you can use Redshift Serverless to query the data.

  1. On the Redshift Serverless console, provision a Redshift Serverless workgroup if you haven’t already done so. For instructions, see Get started with Amazon Redshift Serverless data warehouses guide. In this example, we use a Redshift Serverless workgroup called iceberg.
  2. Make sure that your Amazon Redshift patch version is patch 188 or higher.

  1. Choose Query data to open the Amazon Redshift Query Editor v2.

  1. In the query editor, choose the workgroup you want to use.

A pop-up window will appear, prompting what user to use.

  1. Select Federated user, which will use your current account, and choose Create connection.

It will take a few seconds to start the connection. When you’re connected, you will see a list of available databases.

  1. Choose External databases.

You will see the table bucket from S3 Tables in the view (in this example, this is redshifticeberg@s3tablescatalog).

  1. If you continue clicking through the tree, you will see the examples table, which is the Iceberg table you previously created that’s stored in the table bucket.

You can now use Amazon Redshift to query the Iceberg table in S3 Tables.

Before you execute the query, review the Amazon Redshift syntax for querying catalogs registered in SageMaker Lakehouse. Amazon Redshift uses the following syntax to reference a table: [email protected] or database@namespace".schema.table.

In this example, we use the following syntax to query the examples table in the table bucket: r[email protected].

Learn more about this mapping in Using Amazon S3 Tables with AWS analytics services.

Let’s run some queries. First, let’s see how many rows are in the examples table.

  1. Run the following query in the query editor:
SELECT count(*)
FROM [email protected]; 

The query will take a few seconds to execute. You will see the following result.

Let’s try a slightly more complicated query. In this case, we want to find all the days that had example data starting with 0.2 and a category_id between 50–75 with at least 130 rows. We will order the results from most to least.

  1. Run the following query:
SELECT examples.insert_date, count(*)
FROM [email protected]
WHERE
    examples.data LIKE '0.2%' AND
    examples.category_id BETWEEN 50 AND 75
GROUP BY examples.insert_date
HAVING count(*) > 130
ORDER BY count DESC;

You might see different results than the following screenshot due the randomly generated source data.

Congratulations, you have set up and queried Iceberg data in S3 Tables from Amazon Redshift!

Clean up

If you implemented the example and want to remove the resources, complete the following steps:

  1. If you no longer need your Redshift Serverless workgroup, delete the workgroup.
  2. If you don’t need to access your SageMaker Lakehouse data from the Amazon Redshift Query Editor v2, remove the data lake administrator:
    1. On the Lake Formation console, choose Administrative roles and tasks in the navigation pane.
    2. Remove the read-only data lake administrator that has the AWSServiceRoleForRedshift privilege.
  3. If you want to permanently delete the data from this post, delete the database:
    1. On the Lake Formation console, choose Databases in the navigation pane.
    2. Delete the icebergsahead database.
  4. If you no longer need the table bucket, delete the table bucket.
  5. In you want to deactivate the integration between S3 Tables and AWS analytics services, see Migrating to the updated integration process.

Conclusion

In this post, we showed how to get started with Amazon Redshift to query Iceberg tables stored in S3 Tables. This is just the beginning for how you can use Amazon Redshift to analyze your Iceberg data that’s stored in S3 Tables—you can combine this with other Amazon Redshift features, including writing queries that join data from Iceberg tables stored in S3 Tables and Redshift Managed Storage (RMS), or implement data access controls that give you fine-granted access control rules for different users across the S3 Tables. Additionally, you can use features like Redshift Serverless to automatically select the amount of compute for analyzing your Iceberg tables, and use AI to intelligently scale on demand and optimize query performance characteristics for your analytical workload.

We invite you to leave feedback in the comments.


About the Authors

Jonathan Katz is a Principal Product Manager – Technical on the Amazon Redshift team and is based in New York. He is a Core Team member of the open source PostgreSQL project and an active open source contributor, including PostgreSQL and the pgvector project.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Unlock the power of optimization in Amazon Redshift Serverless

Post Syndicated from Ricardo Serafim original https://aws.amazon.com/blogs/big-data/unlock-the-power-of-optimization-in-amazon-redshift-serverless/

Amazon Redshift Serverless automatically scales compute capacity to match workload demands, measuring this capacity in Redshift Processing Units (RPUs). Although traditional scaling primarily responds to query queue times, the new AI-driven scaling and optimization feature offers a more sophisticated approach by considering multiple factors including query complexity and data volume. Intelligent scaling addresses key data warehouse challenges by preventing both over-provisioning of resources for performance and under-provisioning to save costs, particularly for workloads that fluctuate based on daily patterns or monthly cycles.

Amazon Redshift serverless now offers enhanced flexibility in configuring workgroups through two primary methods. Users can either set a base capacity, specifying the baseline RPUs for query execution, with options ranging from 8 to 1024 RPUs and each RPU providing 16 GB of memory, or they can opt for the price-performance target. Amazon Redshift Serverless AI-driven scaling and optimization can adapt more precisely to diverse workload requirements and employs intelligent resource management, automatically adjusting resources during query execution for optimal performance. Consider using AI-driven scaling and optimization if your current workload requires 32 to 512 base RPUs. We don’t recommend using this feature for less than 32 base RPU or more than 512 base RPU workloads.

In this post, we demonstrate how Amazon Redshift Serverless AI-driven scaling and optimization impacts performance and cost across different optimization profiles.

Options in AI-driven scaling and optimization

Amazon Redshift Serverless AI-driven scaling and optimization offers an intuitive slider interface, letting you balance price and performance goals. You can select from five optimization profiles, ranging from Optimized for Cost to Optimized for Performance, as shown in the following diagram. Your slider position determines how Amazon Redshift allocates resources and implements AI-driven scaling and optimizations, to achieve your desired price-performance target.

Sliding bar

The slider offers the following options:

  1. Optimized for Cost (1)
    • Prioritizes cost savings over performance
    • Allocates minimum resources in favor of saving on costs
    • Best for workloads where performance isn’t time-critical
  2. Cost-Balanced (25)
    • Balances towards cost savings while maintaining reasonable performance
    • Allocates moderate resources
    • Suitable for mixed workloads with some flexibility in query time
  3. Balanced (50)
    • Provides equal emphasis on cost efficiency and performance
    • Allocates optimal resources for most use cases
    • Ideal for general-purpose workloads
  4. Performance-Balanced (75)
    • Favors performance while maintaining some cost control
    • Allocates additional resources when needed
    • Suitable for workloads requiring consistently fast query elapsed time
  5. Optimized for Performance (100)
    • Maximizes performance regardless of cost
    • Provides maximum available resources
    • Best for time-critical workloads requiring fastest possible query delivery

Which workloads to consider for AI-driven scaling and optimizations

The Amazon Redshift Serverless AI-driven scaling and optimization capabilities can be applied to almost every analytical workload. Amazon Redshift will assess and apply optimizations according to your price-performance target—cost, balance, or performance.

Most analytical workloads operate on millions or even billions of rows and generate aggregations and complex calculations. These workloads have high variability for query patterns and number of queries. The Amazon Redshift Serverless AI-driven scaling and optimization will improve the price, performance, or both because it learns the patterns (the repeatability of your workload) and will allocate more resources towards performance improvements if you’re performance-focused or fewer resources if you’re cost-focused.

Cost-effectiveness of AI-driven scaling and optimization

To effectively determine the effectiveness of Amazon Redshift Serverless AI-driven scaling and optimization we need to be able to measure your current state of price-performance. We encourage you to measure your current price-performance by using sys_query_history to calculate the total elapsed time of your workload and note the start time and end time. Then use sys_serverless_usage to calculate the cost. You can use the query from the Amazon Redshift documentation and add the same start and end times. This will establish your current price performance, and now you have a baseline to compare against.

If such measurement isn’t practical because your workloads are continuously running and it’s impractical for you to determine a fixed start and end time, then another way is to compare holistically, check your month over month cost, check your user sentiment towards performance, towards system stability, improvements in data delivery, or reduction in overall monthly processing times.

Benchmark conducted and results

We evaluated the optimization options using the TPCDS 3TB dataset from the AWS Labs GitHub repository (amazon-redshift-utils). We deployed this dataset across three Amazon Redshift Serverless workgroups configured as Optimized for Cost, Balanced, and Optimized for Performance. To create a realistic reporting environment, we configured three Amazon Elastic Compute Cloud (Amazon EC2) instances with JMeter (one per endpoint) and ran 15 selected TPCDS queries concurrently for approximately 1 hour, as shown in the following screenshot.

We disabled the result cache to make sure Amazon Redshift Serverless ran all queries directly, providing accurate measurements. This setup helped us capture authentic performance characteristics across each optimization profile. Also, we designed our test environment without setting the Amazon Redshift Serverless workgroup max capacity parameter—a key configuration that controls the maximum RPUs available to your data warehouse. By removing this limit, we could clearly showcase how different configurations affect scaling behavior in our test endpoints.

Jmeter

Our comprehensive test plan included running each of the 15 queries 355 times, generating 5,325 queries per test cycle. The AI-driven scaling and optimization needs multiple iterations to identify patterns and optimize RPUs, so we ran this workload 10 times. Through these repetitions, the AI learned and adapted its behavior, processing a total of 53,250 queries throughout our testing period.

The testing revealed how the AI-driven scaling and optimization system adapts and optimizes performance across three distinct configuration profiles: Optimized for Cost, Balanced, and Optimized for Performance.

Queries and elapsed time

Although we ran the same core workload repeatedly, we used variable parameters in JMeter to generate different values for the WHERE clause conditions. This approach created similar but not identical workloads, introducing natural variations that showed how the system handles real-world scenarios with varying query patterns.

Our elapsed time analysis demonstrates how each configuration achieved its performance objectives, as shown by the average consumption metrics for each endpoint, as shown in the following screenshot.

Average Elapsed Time per Endpoint

The results matched our expectations: the Optimized for Performance configuration delivered significant speed improvements, running queries approximately two times as the Balanced configuration and four times as the Optimized for Cost setup.

The following screenshots show the elapsed time breakdown for each test.

Optimized for Cost - Elapsed Time Balanced - Elapsed Time Optimized for Performance - Elapsed Time

The following screenshot shows tenth and final test iteration demonstrates distinct performance differences across configurations.

Per Configuration - Elapsed Time

To clarify more, we categorized our query elapsed times into three groups:

  • Short queries – Less than 10 seconds
  • Medium queries – From 10 seconds to 10 minutes
  • Long queries: More than 10 minutes

Considering our last test, the analysis shows:

Duration per configuration Optimized for Cost Balanced Optimized for Performance
Short queries (<10 sec) 1488 1743 3290
Medium queries (10 sec – 10 min) 3633 3579 2035
Long queries (>10 min) 204 3 0
TOTAL 5325 5325 5325

The configuration’s capacity directly impacts query elapsed time. The Optimized for Cost configuration limits resources to save money, resulting in longer query times, making it best suited for workloads that aren’t time critical, where cost savings are prioritized. The Balanced configuration provides moderate resource allocation, striking a middle ground by effectively handling medium-duration queries and maintaining reasonable performance for short queries while nearly eliminating long-running queries. In contrast, the Optimized for Performance configuration allocates more resources, which increases costs but delivers faster query results, making it best for latency-sensitive workloads where query speed is critical.

Capacity used during the tests

Our comparison of the three configurations reveals how Amazon Redshift Serverless AI-driven scaling and optimization technology adapts resource allocation to meet user expectations. The monitoring showed both Base RPU variations and distinct scaling patterns across configurations—scaling up aggressively for faster performance or maintaining lower RPUs to optimize costs.

The Optimized for Cost configuration starts at 128 RPUs and increases to 256 RPUs after three tests. To maintain cost-efficiency, this setup limits the maximum RPU allocation during scaling, even when facing query queuing.

In the following table, we can observe the costs for this Optimized for Cost configuration.

Test# Starting RPUs Scaled up to Cost incurred
1 128 1408  $254.17
2 128 1408  $258.39
3 128 1408  $261.92
4 256 1408  $245.57
5 256 1408  $247.11
6 256 1408  $257.25
7 256 1408  $254.27
8 256 1408  $254.27
9 256 1408  $254.11
10 256 1408  $256.15

The strategic RPU allocation by Amazon Redshift Serverless helps optimize costs, as demonstrated in tests 3 and 4, where we observed significant cost savings. This is shown in the following graph.

Optimized for Cost - Cost Average

Although the optimization for cost changed the base RPU, the balanced configuration didn’t change the base RPUs but scaled up to 2176, further than the 1408 RPUs that were the maximum used by the cost optimization setup. The following table shows the figures for the Balanced configuration.

Test# Starting RPUs Scaled up to Cost incurred
1 192 2176  $261.48
2 192 2112  $270.90
3 192 2112  $265.26
4 192 2112  $260.20
5 192 2112  $262.12
6 192 2112  $253.18
7 192 2112  $272.80
8 192 2112  $272.80
9 192 2112  $263.72
10 192 2112  $243.28

The Balanced configuration, averaging $262.57 per test, delivered significantly better performance while costing only 3% more than the Optimized for Cost configuration, which averaged $254.32 per test. As demonstrated in the previous section, this performance advantage is evident in the elapsed time comparisons. The following graph shows the costs for the Balanced configuration.

Balanced - Cost Average

As expected from the Optimized for Performance configuration, the usage of resources was higher to attend the high performance. In this configuration, we can also observe that after two tests, the engine adapted itself to start with a higher number of RPUs to attend the queries faster.

Test# Starting RPUs Scaled Up to Cost incurred
1 512 2753  $295.07
2 512 2327  $280.29
3 768 2560  $333.52
4 768 2991  $295.36
5 768 2479  $308.72
6 768 2816  $324.08
7 768 2413  $300.45
8 768 2413  $300.45
9 768 2107  $321.07
10 768 2304  $284.93

Despite a 19% cost increase in the third test, most subsequent tests remained below the $304.39 average cost.

Optimized for Performance - Cost Average

The Optimized for Performance configuration maximizes resource usage to achieve faster query times, prioritizing speed over cost efficiency.

The final cost-performance analysis reveals compelling results:

  • The Balanced configuration delivered twofold better performance while costing only 3.25% more than the Optimized for Cost setup
  • The Optimized for Performance configuration achieved fourfold faster elapsed time with a 19.39% cost increase compared to the Optimized for Cost option.

The following chart illustrates our cost-performance findings:

Average Billing and Elapsed Time per Endpoint

It’s important to note that these results reflect our specific test scenario. Each workload has unique characteristics, and the performance and cost differences between configurations might vary significantly in other use cases. Our findings serve as a reference point rather than a universal benchmark. Additionally, we didn’t test two intermediate configurations available in Amazon Redshift Serverless: one between Optimized for Cost and Balanced, and another between Balanced and Optimized for Performance.

Conclusion

The test results demonstrate the effectiveness of Amazon Redshift Serverless AI-driven scaling and optimization across different workload requirements. These findings highlight how Amazon Redshift Serverless AI-driven scaling and optimization can help organizations find their ideal balance between cost and performance. Although our test results serve as a reference point, each organization should evaluate their specific workload requirements and price-performance targets. The flexibility of five different optimization profiles, combined with intelligent resource allocation, enables teams to fine-tune their data warehouse operations for optimal efficiency.

To get started with Amazon Redshift Serverless AI-driven scaling and optimization, we recommend:

  1. Establishing your current price-performance baseline
  2. Identifying your workload patterns and requirements
  3. Testing different optimization profiles with your specific workloads
  4. Monitoring and adjusting based on your results

By using these capabilities, organizations can achieve better resource utilization while meeting their specific performance and cost objectives.

Ready to optimize your Amazon Redshift Serverless workloads? Visit the AWS Management Console today to create your own Amazon Redshift Serverless AI-driven scaling and optimization to start exploring the different optimization profiles. For more information, check out our documentation on Amazon Redshift Serverless AI-driven scaling and optimization, or contact your AWS account team to discuss your specific use case.


About the Authors

Ricardo Serafim Ricardo Serafim is a Senior Analytics Specialist Solutions Architect at AWS. He has been helping companies with Data Warehouse solutions since 2007.

Milind Oke Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Andre HassAndre Hass is a Senior Technical Account Manager at AWS, specialized in AWS Data Analytics workloads. With more than 20 years of experience in databases and data analytics, he helps customers optimize their data solutions and navigate complex technical challenges. When not immersed in the world of data, Andre can be found pursuing his passion for outdoor adventures. He enjoys camping, hiking, and exploring new destinations with his family on weekends or whenever an opportunity arises.