All posts by Rajkumar Irudayaraj

Harness Zero Copy data sharing from Salesforce Data Cloud to Amazon Redshift for Unified Analytics – Part 2

2024-09-12 Rajkumar Irudayaraj

Post Syndicated from Rajkumar Irudayaraj original https://aws.amazon.com/blogs/big-data/harness-zero-copy-data-sharing-from-salesforce-data-cloud-to-amazon-redshift-for-unified-analytics-part-2/

In the era of digital transformation and data-driven decision making, organizations must rapidly harness insights from their data to deliver exceptional customer experiences and gain competitive advantage. Salesforce and Amazon have collaborated to help customers unlock value from unified data and accelerate time to insights with bidirectional Zero Copy data sharing between Salesforce Data Cloud and Amazon Redshift.

In the Part 1 of this series, we discussed how to configure data sharing between Salesforce Data Cloud and customers’ AWS accounts in the same AWS Region. In this post, we discuss the architecture and implementation details of cross-Region data sharing between Salesforce Data Cloud and customers’ AWS accounts.

Solution overview

Salesforce Data Cloud provides a point-and-click experience to share data with a customer’s AWS account. On the AWS Lake Formation console, you can accept the datashare, create the resource link, mount Salesforce Data Cloud objects as data catalog views, and grant permissions to query the live and unified data in Amazon Redshift. Cross-Region data sharing between Salesforce Data Cloud and a customer’s AWS accounts is supported for two deployment scenarios: Amazon Redshift Serverless and Redshift provisioned clusters (RA3).

Cross-Region data sharing with Redshift Serverless

The following architecture diagram depicts the steps for setting up a cross-Region datashare between a Data Cloud instance in US-WEST-2 with Redshift Serverless in US-EAST-1.

Cross-Region data sharing set up consists of the following steps:

The Data Cloud admin identifies the objects to be shared and creates a Data Share in the data cloud provisioned in the US-WEST-2
The Data Cloud admin links the Data Share with the Amazon Redshift Data Share target. This creates an AWS Glue Data Catalog view and a cross-account Lake Formation resource share using the AWS Resource Access Manager (RAM) with the customer’s AWS account in US-WEST-2.
The customer’s Lake Formation admin accepts the datashare invitation in US-WEST-2 from the Lake Formation console and grants default (select and describe) permissions to an AWS Identity and Access Management (IAM) principal.
The Lake Formation admin switches to US-EAST-1 and creates a resource link pointing to the shared database in the US-WEST-2 Region.
The IAM principal can log in to the Amazon Redshift query editor in US-EAST-1 and creates an external schema referencing the datashare resource link. The data can be queried through these external tables.

Cross-Region data sharing with a Redshift provisioned cluster

Cross-Region data sharing across Salesforce Data Cloud and a Redshift provisioned cluster requires additional steps on top of the Serverless set up. Based on the Amazon Redshift Spectrum considerations, the provisioned cluster and the Amazon Simple Storage Service (Amazon S3) bucket must be in the same Region for Redshift external tables. The following architecture depicts a design pattern and steps to share data with Redshift provisioned clusters.

Steps 1–5 in the set up remain the same across Redshift Serverless and provisioned cluster cross-Region sharing. Encryption must be enabled on both Redshift Serverless and the provisioned cluster. Listed below are the additional steps:

Create a table from datashare data with the CREATE TABLE AS SELECT Create a datashare in Redshift serverless and grant access to the Redshift provisioned cluster.
Create a database in the Redshift provisioned cluster and grant access to the target IAM principals. The datashare is ready for query.

The new table needs to be refreshed periodically to get the latest data from the shared Data Cloud objects with this solution.

Considerations when using data sharing in Amazon Redshift

For a comprehensive list of considerations and limitations of data sharing, refer to Considerations when using data sharing in Amazon Redshift. Some of the important ones for Zero Copy data sharing includes:

Data sharing is supported for all provisioned RA3 instance types (ra3.16xlarge, ra3.4xlarge, and ra3.xlplus) and Redshift Serverless. It isn’t supported for clusters with DC and DS node types.
For cross-account and cross-Region data sharing, both the producer and consumer clusters and serverless namespaces must be encrypted. However, they don’t need to share the same encryption key.
Data Catalog multi-engine views are generally available in commercial Regions where Lake Formation, the Data Catalog, Amazon Redshift, and Amazon Athena are available.
Cross-Region sharing is available in all LakeFormation supported regions.

Prerequisites

The prerequisites remain the same across same-Region and cross-Region data sharing, which are required before proceeding with the setup.

Configure cross-Region data sharing

The steps to create a datashare, create a datashare target, link the datashare target to the datashare, and accept the datashare in Lake Formation remain the same across same-Region and cross-Region data sharing. Refer to Part 1 of this series to complete the setup.

Cross-Region data sharing with Redshift Serverless

If you’re using Redshift Serverless, complete the following steps:

On the Lake Formation console, choose Databases in the navigation pane.
Choose Create database.
Under Database details¸ select Resource link.
For Resource link name, enter a name for the resource link.
For Shared database’s region, choose the Data Catalog view source Region.
The Shared database and Shared database’s owner ID fields are populated manually from the database metadata.
Choose Create to complete the setup.

The resource link appears on the Databases page on the Lake Formation console, as shown in the following screenshot.

Launch Redshift Query Editor v2 for the Redshift Serverless workspace The cross-region data share tables are auto-mounted and appear under awsdatacatalog. To query, run the following command and create an external schema. Specify the resource link as the Data Catalog database, the Redshift Serverless Region, and the AWS account ID.
```
CREATE external SCHEMA cross_region_data_share --<<SCHEMA_NAME>>
FROM DATA CATALOG DATABASE 'cross-region-data-share' --<<RESOURCE_LINK_NAME>>
REGION 'us-east-1' --<TARGET_REGION>
IAM_ROLE 'SESSION' CATALOG_ID '<<aws_account_id>>'; --<<REDSHIFT AWS ACCOUNT ID>>
```
Refresh the schemas to view the external schema created in the dev database
Run the show tables command to check the shared objects under the external database:
```
SHOW TABLES FROM SCHEMA dev.cross_region_data_share --<<schema name>>
```

Query the datashare as shown in the following screenshot.

SELECT * FROM dev.cross_region_data_share.churn_modellingcsv_tableaus3_dlm; --<<change schema name & table name>>

Cross-Region data sharing with Redshift provisioned cluster

This section is a continuation of the previous section with additional steps needed for data sharing to work when the consumer is a provisioned Redshift cluster. Refer to Sharing data in Amazon Redshift and Sharing datashares for a deeper understanding of concepts and the implementation steps.

Create a new schema and table in the Redshift Serverless in the consumer Region:

CREATE SCHEMA customer360_data_share;
CREATE TABLE customer360_data_share. customer_churn as
SELECT * from dev.cross_region_data_share.churn_modellingcsv_tableaus3__dlm;

Get the namespace for the Redshift Serverless (producer) and Redshift provisioned cluster (consumer) by running the following query in each cluster:
```
select current_namespace
```

Create a datashare in the Redshift Serverless (producer) and grant usage to the Redshift provisioned cluster (consumer). Set the datashare, schema, and table names to the appropriate values, and set the namespace to the consumer namespace.

CREATE DATASHARE customer360_redshift_data_share;
ALTER DATASHARE customer360_redshift_data_share ADD SCHEMA customer360_data_share;
ALTER DATASHARE customer360_redshift_data_share ADD TABLE customer360_data_share.customer_churn; 
GRANT USAGE ON DATASHARE customer360_redshift_data_share 
TO NAMESPACE '5709a006-6ac3-4a0c-a609-d740640d3080'; --<<Data Share Consumer Namespace>>

Log in as a superuser in the Redshift provisioned cluster, create a database from the datashare, and grant permissions. Refer to managing permissions for Amazon Redshift datashare for detailed guidance.

The datashare is now ready for query.

You can periodically refresh the table you created to get the latest data from the data cloud based on your business requirement.

Conclusion

Zero Copy data sharing between Salesforce Data Cloud and Amazon Redshift represents a significant advancement in how organizations can use their customer 360 data. By eliminating the need for data movement, this approach offers real-time insights, reduced costs, and enhanced security. As businesses continue to prioritize data-driven decision-making, Zero Copy data sharing will play a crucial role in unlocking the full potential of customer data across platforms.

This integration empowers organizations to break down data silos, accelerate analytics, and drive more agile customer-centric strategies. To learn more, refer to the following resources:

About the Authors

Rajkumar Irudayaraj is a Senior Product Director at Salesforce with over 20 years of experience in data platforms and services, with a passion for delivering data-powered experiences to customers.

Sriram Sethuraman is a Senior Manager in Salesforce Data Cloud product management. He has been building products for over 9 years using big data technologies. In his current role at Salesforce, Sriram works on Zero Copy integration with major data lake partners and helps customers deliver value with their data strategies.

Jason Berkowitz is a Senior Product Manager with AWS Lake Formation. He comes from a background in machine learning and data lake architectures. He helps customers become data-driven.

Ravi Bhattiprolu is a Senior Partner Solutions Architect at AWS. Ravi works with strategic ISV partners, Salesforce and Tableau, to deliver innovative and well-architected products and solutions that help joint customers achieve their business and technical objectives.

Avijit Goswami is a Principal Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open source solutions. Outside of his work, Avijit likes to travel, hike, watch sports, and listen to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Michael Chess is a Technical Product Manager at AWS Lake Formation. He focuses on improving data permissions across the data lake. He is passionate about enabling customers to build and optimize their data lakes to meet stringent security requirements.

Mike Patterson is a Senior Customer Solutions Manager in the Strategic ISV segment at AWS. He has partnered with Salesforce Data Cloud to align business objectives with innovative AWS solutions to achieve impactful customer experiences. In his spare time, he enjoys spending time with his family, sports, and outdoor activities.

Harness Zero Copy data sharing from Salesforce Data Cloud to Amazon Redshift for Unified Analytics – Part 1

2024-08-28 Rajkumar Irudayaraj

Post Syndicated from Rajkumar Irudayaraj original https://aws.amazon.com/blogs/big-data/harness-zero-copy-data-sharing-from-salesforce-data-cloud-to-amazon-redshift-for-unified-analytics-part-1/

This post is co-authored by Rajkumar Irudayaraj, Sr. Director of Product, Salesforce Data Cloud.

In today’s ever-evolving business landscape, organizations must harness and act on data to fuel analytics, generate insights, and make informed decisions to deliver exceptional customer experiences. Salesforce and Amazon have collaborated to help customers unlock value from unified data and accelerate time to insights with bidirectional Zero Copy data sharing between Salesforce Data Cloud and Amazon Redshift.

In a previous post, we showed how Zero Copy data federation empowers businesses to access Amazon Redshift data within the Salesforce Data Cloud to enrich customer 360 data with operational data. This two-part series explores how analytics teams can access customer 360 data from Salesforce Data Cloud within Amazon Redshift to generate insights on unified data without the overhead of extract, transform, and load (ETL) pipelines. In this post, we cover data sharing between Salesforce Data Cloud and customers’ AWS accounts in the same AWS Region. Part 2 covers cross-Region data sharing between Salesforce Data Cloud and customers’ AWS accounts.

What is Salesforce Data Cloud?

Salesforce Data Cloud is a data platform that unifies all of your company’s data into Salesforce’s Einstein 1 Platform, giving every team a 360-degree view of the customer to drive automation, create analytics, personalize engagement, and power trusted artificial intelligence (AI). Salesforce Data Cloud creates a holistic customer view by turning volumes of disconnected data into a unified customer profile that’s straightforward to access and understand. This unified view helps your sales, service, and marketing teams build personalized customer experiences, invoke data-driven actions and workflows, and safely drive AI across all Salesforce applications.

What is Amazon Redshift?

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence (BI) tools. It’s optimized for datasets ranging from a few hundred gigabytes to petabytes and delivers better price-performance compared to other data warehousing solutions. With a fully managed, AI-powered, massively parallel processing (MPP) architecture, Amazon Redshift makes business decision-making quick and cost-effective. Amazon Redshift Spectrum enables querying structured and semi-structured data in Amazon Simple Storage Service (Amazon S3) without having to load the data into Redshift tables. Redshift Spectrum integration with AWS Lake Formation enables querying auto-mounted AWS Glue Data Catalog tables with AWS Identity and Access Management (IAM) credentials and harnessing Lake Formation for permission grants and access control policies on Data Catalog views. Salesforce Data Cloud Data sharing with Amazon Redshift leverages AWS Glue Data Catalog support for multi-engine views and Redshift Spectrum integration with Lake Formation.

What is Zero Copy data sharing?

Zero Copy data sharing enables Amazon Redshift customers to query customer 360 data stored in Salesforce Data Cloud without the need for traditional ETL to move or copy the data. Instead, you simply connect and use the data in place, unlocking its value immediately with on demand access to the most recent data. Data sharing is supported with both Amazon Redshift Serverless and provisioned RA3 clusters. Data can be shared with a Redshift Serverless or provisioned cluster in the same Region or with a Redshift Serverless cluster in a different Region. To get an overview of Salesforce Zero Copy integration with Amazon Redshift, please refer to this Salesforce Blog.

Solution overview

Salesforce Data Cloud provides a point-and-click experience to share data with a customer’s AWS account. On the Lake Formation console, you can accept the data share, create the resource link, mount Salesforce Data Cloud objects as data catalog views, and grant permissions to query the live and unified data in Amazon Redshift.

The following diagram depicts the end-to-end process involved for sharing Salesforce Data Cloud data with Amazon Redshift in the same Region using a Zero Copy architecture. This architecture follows the pattern documented in Cross-account data sharing best practices and considerations.

The data share setup consists of the following high-level steps:

The Salesforce Data Cloud admin creates the data share target with the target account for the data share.
The Salesforce Data Cloud admin selects the data cloud objects to be shared with Amazon Redshift and creates a data share.
The Salesforce Data Cloud admin links the data share to the data share target, which invokes the following operations to create a cross-account resource share:
1. Create a Data Catalog view for the Salesforce Data Cloud Apache Iceberg tables by invoking the Catalog API.
2. Use Lake Formation sharing to create a cross-account Data Catalog share.
In the customer AWS account, the Lake Formation admin logs in to the Lake Formation console to accept the resource share, create a resource link, and grant access permissions to the Redshift role.
The data analyst launches the Amazon Redshift Query Editor with the appropriate role to query the data share and join with native Redshift tables.

Prerequisites

The following are the prerequisites to enable data sharing:

A Salesforce Data Cloud account.
An AWS account with AWS Glue and Lake Formation enabled.
Either a Redshift Serverless or a Redshift provisioned cluster with RA3 instance types (ra3.16xlarge, ra3.4xlarge, ra3.xlplus). Data sharing is not supported for other provisioned instance types like DC2 or DS2 and must be set up before accessing the data share. If you don’t have an existing provisioned Redshift RA3 cluster, we recommend using a Redshift Serverless namespace for ease of operations and maintenance.
The Amazon Redshift service must be running in the same Region where the Salesforce Data Cloud is running.
AWS admin roles for Lake Formation and Amazon Redshift:
- Lake Formation – A data lake admin for accepting the share and providing access to users. For more details, see Lake Formation personas and IAM permissions reference.
- Amazon Redshift – A Redshift database owner, admin, or superuser who creates the database and provides access to developers or analysts. For more details, see Default database user permissions.

Create the data share target

Complete the following steps to create the data share target:

In Salesforce Data Cloud, choose App Launcher and choose Data Share Targets.

Choose New and choose Amazon Redshift, then choose Next.

Enter the details for Label, API Name, and Account for the data share target.
Choose Save.

After you save these settings, the S3 Tenant Folder value is populated.

Choose the S3 Tenant Folder link and copy the verification token.

If you’re not signed in to the AWS Management Console, you’ll be redirected to the login page.

Enter the verification token and choose Save.

The data share target turns to active status.

Create a data share

Complete the following steps to create a data share:

Navigate to the Data Share tab in your Salesforce org.
Choose App Launcher and choose Data Shares.

Alternatively, you can navigate to the Data Share tab from your org’s home page.

Choose New, then choose Next.

Provide a label, name, data space, and description, then choose Next.

Select the objects to be included in the share and choose Save.

Link the data share target to the data share

To link the data share target to the data share, complete the following steps:

On the data share record home page, choose Link/Unlink Data Share Target.
Select the data share target you want to link to the data share and choose Save.

The data share must be active before you can accept the resource share on the Lake Formation console.

Accept the data share in Lake Formation

This section provides the detailed steps for accepting the data share invite and configuration steps to mount the data share with Amazon Redshift.

After the data share is successfully linked to the data share target, navigate to the Lake Formation console.

The data share invitation banner is displayed.

Choose Accept and create.

The Accept and create page shows a resource link and provides the option to set up IAM permissions.

In the Principals section, choose the IAM users and roles to grant the default permissions (describe and select) for the data share resource link.

Choose Create.

The resource link created in the previous step appears next to the AWS Glue database resource share on the Lake Formation console.

Query the data share from Redshift Serverless

Launch the query editor for Redshift Serverless and log in as a federated user with the role that has describe and select permissions for the resource link.

The data share tables are auto-mounted, appear under awsdatacatalog, and can be queried as shown in the following screenshot.

Query the data share from the Redshift provisioned cluster

To query the data share from the Redshift provisioned cluster, log in to the provisioned cluster as the superuser.

On an editor tab, run the following SQL statement to grant an IAM user access to the Data Catalog:

GRANT USAGE ON DATABASE awsdatacatalog to "IAM:myIAMUser"

IAM:myIAMUser is an IAM user that you want to grant usage privilege to the Data Catalog. Alternatively, you can grant usage privilege to IAMR:myIAMRole for an IAM role. For more details, refer to Querying the AWS Glue Data Catalog.

You should be able to expand awsdatacatalog and query the data share tables as shown in the following screenshot.

Conclusion

This integration empowers organizations to break down data silos, accelerate analytics, and drive more agile customer-centric strategies. To learn more, refer to the following resources:

About the Authors

Jason Berkowitz is a Senior Product Manager with AWS Lake Formation. He comes from a background in machine learning and data lake architectures. He helps customers become data-driven.

Ravi Bhattiprolu is a Senior Partner Solutions Architect at AWS. Ravi works with strategic ISV partners, Salesforce and Tableau, to deliver innovative and well-architected products & solutions that help joint customers achieve their business and technical objectives.

Michael Chess is a Technical Product Manager at AWS Lake Formation. He focuses on improving data permissions across the data lake. He is passionate about ensuring customers can build and optimize their data lakes to meet stringent security requirements.

Noise

All posts by Rajkumar Irudayaraj

Harness Zero Copy data sharing from Salesforce Data Cloud to Amazon Redshift for Unified Analytics – Part 2

Solution overview

Cross-Region data sharing with Redshift Serverless

Cross-Region data sharing with a Redshift provisioned cluster

Considerations when using data sharing in Amazon Redshift

Prerequisites

Configure cross-Region data sharing

Cross-Region data sharing with Redshift Serverless

Cross-Region data sharing with Redshift provisioned cluster

Conclusion

About the Authors

Harness Zero Copy data sharing from Salesforce Data Cloud to Amazon Redshift for Unified Analytics – Part 1

What is Salesforce Data Cloud?

What is Amazon Redshift?

What is Zero Copy data sharing?

Solution overview

Prerequisites

Create the data share target

Create a data share

Link the data share target to the data share

Accept the data share in Lake Formation

Query the data share from Redshift Serverless

Query the data share from the Redshift provisioned cluster

Conclusion

About the Authors

The collective thoughts of the interwebz