Tag Archives: Amazon Redshift data sharing

Build a multi-Region analytics solution with Amazon Redshift, Amazon S3, and Amazon QuickSight

2025-06-19 Donatas Kuchalskis

Post Syndicated from Donatas Kuchalskis original https://aws.amazon.com/blogs/big-data/build-a-multi-region-analytics-solution-with-amazon-redshift-amazon-s3-and-amazon-quicksight/

Organizations increasingly face complex requirements balancing regional data sovereignty with global analytics needs. Regulatory frameworks like GDPR, HIPAA, and local data protection laws often mandate storing data in specific geographic regions, and business operations require global teams to access and analyze this data efficiently.

This post explores how to effectively architect a solution that addresses this specific challenge: enabling comprehensive analytics capabilities for global teams while making sure that your data remains in the AWS Regions required by your compliance framework. We use a variety of AWS services, including Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon QuickSight.

It’s important to note that this solution focuses primarily on data residency (where data is stored) and not on preventing data from being in transit between Regions. Organizations with strict data transit restrictions might need additional controls beyond what’s covered here. We show how you can configure AWS across Regions to help meet business needs and regulatory requirements simultaneously.

Cross-Region architecture requirements

Before implementing a cross-Region solution, it’s important to understand when this approach is actually necessary. Although single-Region deployments offer simplicity and cost advantages, several specific business and regulatory scenarios warrant a cross-Region approach:

Data sovereignty and residency requirements – When regulations like GDPR, HIPAA, or local data sovereignty laws require data to remain in specific geographic boundaries while still enabling global analytics capabilities
Global operations with local compliance – When your organization operates globally, but needs to adhere to regional compliance frameworks while maintaining unified analytics
Performance optimization for global users – When your organization needs to optimize analytics performance for users in different geographic areas while centralizing data governance
Enhanced business continuity – When your analytics capabilities need higher availability and Regional redundancy to support mission-critical business processes

Use case: Financial services analytics with Regional data residency

Consider a financial services company with the following business and regulatory requirements:

Data residency requirement – All customer financial data must remain in the Bahrain Region (me-south-1) to comply with local financial regulations.
Global analytics capability – The organization’s data science team operates from European offices and needs to access and analyze the financial data without moving it out of its mandated storage Region.
Advanced analytics requirements – Business leaders need interactive data exploration and natural language query capabilities to derive insights from financial data.
Performance requirement – Specific dashboard queries require subsecond response times for both local executives and the global management team.

This specific combination of requirements can’t be met with a single-Region deployment. Let’s explore how to architect a solution.

Solution overview

The following architecture is designed to address the specific challenge of using QuickSight in one Region while maintaining data in another Region.

As shown in the architecture diagram, data engineers based in Bahrain (me-south-1) work with local data, whereas data engineers in Stockholm (eu-north-1) and analysts in Ireland (eu-west-1) can securely access the same data through Redshift datashares and virtual private cloud (VPC) peering connections. This approach maintains data residency in me-south-1 while enabling global access.

The solution consists of the following key components:

Primary data Region (me-south-1):
- Redshift cluster (primary data repository)
- S3 buckets for data lake storage
- Private and public subnets with appropriate security controls
- Data must remain in this Region for compliance reasons
Analytics services Region (eu-west-1):
- QuickSight deployment
- Cross-Region VPC peering connection to the primary Region
- Data access using Redshift datashares (no data replication)
Data engineering Region (eu-north-1):
- Redshift consumer cluster for data engineering workloads
- Data access using Redshift datashares from me-south-1
- Makes it possible for data engineering teams in eu-north-1 to access and work with data while maintaining compliance

Before implementing this architecture, evaluate whether:

Your requirements actually necessitate a cross-Region approach
The performance impact is acceptable for your use case
The additional cost is justified by your business requirements

For most analytics workloads, a single-Region architecture remains the recommended approach for simplicity, performance, and cost-effectiveness. Consider cross-Region architectures only when specific business and compliance requirements make them necessary.

Establish cross-Region network connectivity: Amazon Redshift to QuickSight

The foundation of a cross-Region solution is secure, reliable network connectivity. VPC peering provides a straightforward approach for connecting VPCs across Regions. To implement VPC peering in Amazon Virtual Private Cloud (Amazon VPC), complete the following steps:

Create a new VPC in the secondary Region (eu-west-1):
1. Open the Amazon VPC console in the eu-west-1 Region.
2. Choose Create VPC.
3. Set IPv4 CIDR block to 172.32.0.0/16 (verify there is no overlap with the primary Region VPC).
4. Select Auto-generate to create subnets automatically within this new VPC.
5. Leave other settings as default and choose Create VPC.

Set up VPC peering:
1. On the Amazon VPC console, choose Peering connections in the navigation pane and choose Create peering connection.
2. Select the new eu-west-1 VPC as the requester.
3. For Select another VPC to peer with, select My account and Another Region.
4. Choose the primary Region (me-south-1) and enter the VPC ID.
5. Choose Create peering connection.

Accept the VPC peering connection:
1. Switch to the primary Region on the Amazon VPC console.
2. Choose Peering connections in the navigation pane and select the pending connection.
3. On the Actions dropdown menu, choose Accept request.

Update the route tables:
1. On the secondary Region Amazon VPC console, choose Route tables in the navigation pane.
2. Choose the route table for the new VPC.
3. Choose Edit routes and add a new route:
  - Destination: Primary Region VPC CIDR (e.g., 172.31.0.0/16).
  - Target: Choose the peering connection.
4. On the primary Region Amazon VPC console, repeat the process, adding a route to the secondary Region VPC CIDR (172.32.0.0/16) using the peering connection.

Configure security groups:
1. On the secondary Region Amazon VPC console, choose Security groups in the navigation pane and create a new security group.
2. Add an outbound rule:
  - Type: Custom TCP
  - Port range: 5439
  - Destination: Primary Region VPC CIDR
3. On the primary Region Amazon VPC console, locate the Redshift cluster’s security group.
4. Add an inbound rule:
  - Type: Custom TCP
  - Port range: 5439
  - Source: Secondary Region VPC CIDR

Configure DNS settings:
1. On the Amazon VPC console for both Regions, choose Your VPCs in the navigation pane.
2. Select each VPC, and on the Actions dropdown menu, choose Edit DNS hostnames.
3. Select Enable DNS resolution and Enable DNS hostnames.

Implement cross-Region data sharing

Rather than replicating data, which could create compliance issues, you can use Redshift datashares to provide secure, read-only access to data across Regions. Complete the following steps to set up your datashares:

Create producer datashares in the primary Region:

On the Amazon Redshift console, choose Query editor v2 in the navigation pane to connect to your primary Region Redshift cluster (me-south-1).

Run the following commands:

-- In Primary Region Redshift

CREATE DATASHARE datashare_1;
ALTER DATASHARE datashare_1 ADD SCHEMA analytics;
ALTER DATASHARE datashare_1 ADD TABLE analytics.customers;
ALTER DATASHARE datashare_1 ADD TABLE analytics.transactions;

-- Grant usage permissions
	
GRANT USAGE ON DATASHARE datashare_1 TO ACCOUNT '123456789012';

Create a consumer database in the secondary Region:

Connect to your secondary Region Redshift cluster (eu-west-1) using the query editor and run the following commands:

-- In Secondary Region Redshift

CREATE DATABASE consumer_db FROM DATASHARE datashare_1 OF ACCOUNT '123456789012'REGION 'me-south-1';

Verify the datashare configuration with the following code:

-- In Secondary Region Redshift

SELECT * FROM SVV_DATASHARE_CONSUMERS;
SELECT * FROM SVV_DATASHARE_OBJECTS;

This approach maintains data residency in the primary Region while enabling analytics access from another Region, addressing the core challenge of Regional service limitations. For our financial services company example, this makes sure that customer financial data remains in Bahrain (me-south-1) while making it securely accessible to the data science team in Europe (eu-west-1).

Configure QuickSight in the analytics Region

With network connectivity and data sharing established, complete the following steps to configure QuickSight to securely access the Redshift data:

Set up a QuickSight VPC connection:
1. Open the QuickSight console in the secondary Region.
2. Choose Manage QuickSight, VPC connections, and Add VPC connection.
3. Configure the connection:
  - Name: Enter a name (for example, Cross-Region-Connection).
  - VPC: Choose the secondary Region VPC.
  - Subnet: Choose the automatically created subnets.
  - Security group: Choose the security group created for cross-Region access.

Add a QuickSight IP range to the data source security group:
1. Open the Amazon Elastic Compute Cloud (Amazon EC2) console in the primary Region.
2. Choose Security groups in the navigation pane and find the security group for your data source.
3. Edit the inbound rules.
4. Add a new rule:
  - Type: HTTPS (443)
  - Protocol: TCP
  - Port range: 443
  - Source: QuickSight IP range for the secondary Region (for example, 52.210.255.224/27 for eu-west-1).

QuickSight IP ranges can change over time. Refer to AWS Regions, websites, IP address ranges, and endpoints for current IP ranges.

Create a QuickSight data source:
1. On the QuickSight console, choose Datasets in the navigation pane.
2. Choose New dataset, then choose Redshift.
3. Configure the connection:
  - Data source name: Enter a descriptive name.
  - Connection type: Choose the VPC connection.
  - Database server: Enter the Redshift cluster endpoint from the primary Region.
  - Port: 5439
  - Database name: Enter the consumer database name.
  - Username and Password: Enter credentials (consider using AWS Secrets Manager).
4. Choose Validate connection to test.
5. Choose Create data source.

Verify the connection and create datasets:
1. Choose the schema and tables from the consumer database.
2. Configure appropriate refresh schedules.
3. Create calculations and visualizations as needed.

Performance considerations for cross-Region analytics

When implementing a cross-Region analytics architecture, be aware of the following performance implications:

Query performance impact – Cross-Region queries can experience higher latency than single-Region queries. To mitigate this, consider the following:
- Use SPICE for QuickSight – Import frequently-used datasets into SPICE (Super-fast, Parallel, In-memory Calculation Engine) to help avoid repeated cross-Region queries. SPICE is the QuickSight in-memory engine that enables fast, interactive visualizations by precomputing and storing datasets locally in the QuickSight Region.
- Implement efficient query patterns – Minimize the amount of data transferred between Regions.
- Use appropriate caching – Enable result caching where possible.
- Monitoring cross-Region performance – Implement monitoring to identify and address performance issues:
  - Set up Amazon CloudWatch metrics to track cross-Region query performance
  - Create dashboards to visualize latency trends
  - Establish performance baselines and alerts for degradation

Security considerations

Maintaining security in a cross-Region architecture requires additional attention:

Network security:
- Limit VPC peering connections to only necessary VPCs
- Implement restrictive security groups that allow only required traffic
- Consider using VPC endpoints for service access when possible
Data access controls:
- Use AWS Identity and Access Management (IAM) policies consistently across Regions
- Implement fine-grained access controls in Redshift datashares
- Enable audit logging in relevant Regions
Compliance monitoring:
- Implement AWS CloudTrail in all Regions
- Create centralized logging for cross-Region activities
- Regularly review cross-Region access patterns

Cost implications

Before implementing a cross-Region architecture, consider these cost factors:

Data transfer costs – Data transfer between Regions incurs charges
Additional infrastructure – You might need Redshift clusters in multiple Regions
VPC peering costs – Data transfer costs are associated with VPC peering
Operational overhead – Managing multi-Region deployments requires additional resources
Workload-based sizing – You should size each Regional Redshift cluster according to the specific workloads it will handle

Conclusion

The cross-Region architecture described in this post addresses specific challenges related to Regional compliance requirements and global analytics needs, particularly in the following scenarios:

Your data must remain in a specific Region for compliance reasons
You have teams in different Regions who need to access and analyze this data
Different user groups have distinct workload requirements

The datasharing capabilities of Amazon Redshift and Regional storage options in Amazon S3 are key enablers of this solution, allowing data to remain in the required Region while still being accessible for analytics across Regions. However, it’s worth emphasizing that this architecture supports data storage in specific Regions but doesn’t prevent data from traveling between Regions during processing. Organizations concerned about data transit restrictions should evaluate additional controls to address those specific requirements. Combined with secure VPC peering connections and QuickSight visualizations, this architecture creates a complete solution that satisfies both compliance requirements and business needs.

For our financial services example, this architecture successfully enables the company to keep its customer financial data in Bahrain while providing seamless analytics capabilities to the European data science team and delivering interactive dashboards to global business leaders.

For more information, refer to Building a Cloud Security Posture Dashboard with Amazon QuickSight. For hands-on experience, explore the Amazon QuickSight Workshops. Visit the Amazon Redshift console or Amazon QuickSight console to start building your first dashboard, and explore our AWS Big Data Blog for more customer success stories and implementation patterns

Try out this solution for your own use case, and share your thoughts in the comments.

About the Authors

Donatas Kuchalskis is a Cloud Operations Architect at AWS, based in London, focusing on Financial Services customers in the UK. He helps customers optimize their AWS environments for cost, security, and resiliency while providing strategic cloud guidance. Prior to this role, he served as a Prototyping Architect specializing in Big Data and as a Specialist Solutions Architect for Retail. Before joining AWS, Donatas spent 6 years as a technical consultant in the retail sector.

Jumana Nagaria is a Prototyping Architect at AWS. She builds innovative prototypes with customers to solve their business challenges. She is passionate about cloud computing and data analytics. Outside of work, Jumana enjoys travelling, reading, painting, and spending quality time with friends and family.

How Getir unleashed data democratization using a data mesh architecture with Amazon Redshift

2024-10-23 Asser Moustafa

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/how-getir-unleashed-data-democratization-using-a-data-mesh-architecture-with-amazon-redshift/

This blog post is co-written with Pinar Yasar from Getir.

Amazon Redshift is a fully managed cloud data warehouse that’s used by tens of thousands of customers for price-performance, scale, and advanced data analytics. Amazon Redshift enables data warehousing by seamlessly integrating with other data stores and services in the modern data organization through features such as Zero-ETL, data sharing, streaming ingestion, data lake integration, and Redshift ML.

In this post, we explain how ultrafast delivery pioneer, Getir, unleashed the power of data democratization on a large scale through their data mesh architecture using Amazon Redshift.

We start by introducing Getir and their vision—to seamlessly, securely, and efficiently share business data across different teams within the organization for BI, extract, transform, and load (ETL), and other use cases. We’ll then explore how Amazon Redshift data sharing powered the data mesh architecture that allowed Getir to achieve this transformative vision. We will also explain how Getir’s data mesh architecture enabled data democratization, shorter time-to-market, and cost-efficiencies. Next, we’ll provide a broader overview of modern data trends reinforced by Getir’s vision. In conclusion, we’ll offer some thoughts on how you can apply a similar approach to eliminate costly and barrier-inducing data silos using Amazon Redshift.

Who is Getir?

Getir is an ultrafast delivery pioneer that revolutionized last-mile delivery in 2015 with its 10-minute grocery delivery proposition.Getir’s story started in Istanbul, and they have launched multiple products since inception: GetirFood, GetirMore, GetirWater, GetirLocals, GetirBitaksi (taxi service), GetirDrive (car rental service), and GetirJobs (recruitment).

Getir serves dozens of cities throughout the world with more than 30,000 employees. The following figure shows the Getir app.

Figure 1: Getir app

Overview of Getir’s main use case

Getir’s business is characterized by a tremendous volume of data generation and growth, in addition to ample opportunities to gain valuable insights. However, siloing this data and creating friction for teams trying to access the information they needed wasn’t a viable option. Allowing teams to duplicate data wherever required can be an anti-pattern, leading to operational complexity, cost overruns, and fragile data storage bloat.

Similarly, relying on dedicated teams to create data extracts or insights for downstream consumers introduces bottlenecks, stifles innovation, and increases the time-to-market. This approach isn’t optimal for a data-driven organization like Getir, which needs to empower its teams with seamless access to the information they require to drive the business forward. The various business lines within the organization made it abundantly clear that they wanted unfettered access to the company’s entire data ecosystem in a secure, cost-efficient, near real-time, and well-governed manner.

Furthermore, the organization was anticipating the emergence of data-as-a-serviceservice and generative AI use cases in the near future. This would necessitate the ability to securely share and potentially monetize the company’s data with external partners, such as franchises.

Overview of Getir’s use of Amazon Redshift and modern data architecture

To strike a balance that addresses these concerns and enables Getir teams to effectively use the wealth of data to generate meaningful insights and drive strategic decision-making across the organization, we chose a data mesh architecture.

Getir’s data analytics environment encompasses hundreds of terabytes of data, thousands of tables, and billions upon billions of data rows. Additionally, it processes millions of messaging events daily, all of which must be ingested, refined, and made available to analysts querying multiple Amazon Redshift warehouses. The end-to-end service level agreements (SLAs) for this data ecosystem can be extremely aggressive, with requirements that can be as stringent as single-digit minutes to single-digit seconds. This underscores the scale and complexity of Getir’s data analytics capabilities, which must operate with the utmost efficiency and responsiveness to meet the demands of the business. We were able to easily implement the envisioned data mesh architecture using Amazon Redshift’s native data sharing capabilities.

Figure 2: Data mesh architecture using Amazon Redshift data sharing

As the preceding diagram shows, at the heart of Getir’s architecture, was an ETL Redshift data warehouse that was used for various data sets from all over the organization, creating a refined 360-degree view of critical assets. It also was a producer for downstream Redshift data warehouses.

The demand was quite heavy on this main ETL cluster, so we relied on data sharing to isolate noisy workloads on a different Redshift data warehouse without having to duplicate the data on the main ETL cluster.

Using Redshift data sharing, individual business line teams could now rely solely on their dedicated Redshift cluster to provide them with their own data and analytics capabilities, but also the refined 360-degree views of data generated from all over the organization—without any data duplication or overstepping compute boundaries. BI analysts gained access to all of the data they needed to power their most complex dashboards with consistent performance free of noisy jobs. Additional warehouses were integrated into the data mesh for visualization, reporting, and machine learning.

Another benefit of Amazon Redshift data sharing and the data mesh architecture, was the relative ease with which we were able to maintain a chargeback model for ensuring costs were spread fairly across different teams.

Finally, the data sharing capability also enabled the seamless propagation of newly created tables within a schema to the subscribed consumers.

Modern data trends reinforced by Getir’s case study

Getir’s case study showcases the strategic uses of a data mesh architecture and Amazon Redshift, but more importantly provides tremendous insights into five key trends across all industries as modern data organizations move away from costly data silos that hinder collaboration, business insights, and time-to-market. As highlighted in the following diagram, those trends are 1/interconnected, purpose-built data stores that enable users to access data regardless of its physical location, 2/data democratization empowering users with self-service analytics capabilities, 3/real-time insights to drive greater value from data, 4/resilient data services ensuring business continuity, 5/leveraging generative AI to extract even deeper insights from data more expeditiously.

Figure 3: Key trends in the modern data organization reinforced by Getir’s use case and solution

As Getir showed, the modern data organization is adopting data architectures that democratize data securely and enable self-service analytics. To realize data’s true potential, the modern data organization has progressed beyond basic dashboarding and reporting on limited, point-in-time data sets, and evolved to use more sophisticated ETL processes that can ingest data from diverse sources. Near real-time analytics in addition to predictive models have become standard fare, significantly reducing the time to actionable insights.

Furthermore, the data landscape has been democratized to empower analysts in numerous ways through the rise of transactional data lakes powered by open table formats such as Apache Iceberg and the assistance of generative AI. This holistic approach has elevated data organizations’ capabilities well beyond traditional reporting, unlocking greater business value from the wealth of data available.

Using generative AI with data mesh architecture

In addition to the five key trends previously mentioned, the present-day data landscape is characterized by three key facts that are leading data organizations like Getir to increasingly harness the power of generative AI to drive the next evolution of data-informed decision-making.

Data is an organization’s most valuable asset and the ability to effectively use data is central to an organization’s success and growth. Data analytics and insights are absolutely crucial to strengthening and expanding the business. Deriving meaningful insights from data is essential for making informed, strategic decisions. Democratizing data and enabling self-service analytics can greatly expand the range of business insights, while reducing the time to market for those insights. Empowering users across the organization to access and analyze data can unlock tremendous value. Generative AI’s ability to respond to natural language prompts, explore and analyze complex data, and summarize lengthy content makes it a valuable tool for translating large amounts of data into valuable insights. However, the true potential of generative AI for organizations lies in Retrieval Augmented Generation (RAG).

Out of the box, generative AI models start with a relatively generic knowledge base, which can lead to unreliable or inaccurate information. RAG addresses this by introducing the model to additional datasets that are specific to the organization or context. This allows generative AI models to produce far more accurate, attributable, and highly contextualized outputs to support decision-making.

Data mesh architecture can play a crucial role in enabling and facilitating RAG. By facilitating access to multiple data sources within the organization, the data mesh provides the necessary fuel for the generative AI model to draw from, resulting in more reliable and insightful information. This, in turn, empowers data-driven decision-making and helps organizations harness the full potential of their data assets.

Conclusion

In this post, we examined how Getir implemented a data mesh architecture and Amazon Redshift data sharing to meet their evolving data requirements. This entailed dedicated data warehouses tailored to different business lines and needs, while maintaining robust data governance and secure data access. Additionally, we highlighted the key industry trends that Getir’s case study reinforces across the broader data landscape. For more information, contact AWS or connect with your AWS Technical Account Manager or Solutions Architect, who will be happy to provide more detailed guidance and support.

About the Authors

Asser Moustafa is a Principal Worldwide Specialist Solutions Architect at AWS, based in Dallas, Texas, USA. He partners with customers worldwide, advising them on all aspects of their data architectures, migrations, and strategic data visions to help organizations adopt cloud-based solutions, maximize the value of their data assets, modernize legacy infrastructures, and implement cutting-edge capabilities like machine learning and advanced analytics. Prior to joining AWS, Asser held various data and analytics leadership roles, completing an MBA from New York University and an MS in Computer Science from Columbia University in New York. He is passionate about empowering organizations to become truly data-driven and unlock the transformative potential of their data.

Pinar Yasar is the Data Engineering Manager at Getir. Her passion is to accelerate self-service analytics for her internal customers and build highly scalable and cost-effective solutions in the cloud.

How to get best price performance from your Amazon Redshift Data Sharing deployment

2022-12-20 BP Yau

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/how-to-get-best-price-performance-from-your-amazon-redshift-data-sharing-deployment/

Amazon Redshift is a fast, scalable, secure, and fully-managed data warehouse that enables you to analyze all of your data using standard SQL easily and cost-effectively. Amazon Redshift Data Sharing allows customers to securely share live, transactionally consistent data in one Amazon Redshift cluster with another Amazon Redshift cluster across accounts and regions without needing to copy or move data from one cluster to another.

Amazon Redshift Data Sharing was initially launched in March 2021, and added support for cross-account data sharing was added in August 2021. The cross-region support became generally available in February 2022. This provides full flexibility and agility to share data across Redshift clusters in the same AWS account, different accounts, or different regions.

Amazon Redshift Data Sharing is used to fundamentally redefine Amazon Redshift deployment architectures into a hub-spoke, data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, easily onboard new use cases, and most importantly do all of this without the complexity of data movement and data copies. Some of the most common questions asked during data sharing deployment are, “How big should my consumer clusters and producer clusters be?”, and “How do I get the best price performance for workload isolation?”. As workload characteristics like data size, ingestion rate, query pattern, and maintenance activities can impact data sharing performance, a continuous strategy to size both consumer and producer clusters to maximize the performance and minimize cost should be implemented. In this post, we provide a step-by-step approach to help you determine your producer and consumer clusters sizes for the best price performance based on your specific workload.

Generic consumer sizing guidance

The following steps show the generic strategy to size your producer and consumer clusters. You can use it as a starting point and modify accordingly to cater your specific use case scenario.

Size your producer cluster

You should always make sure that you properly size your producer cluster to get the performance that you need to meet your SLA. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation for the producer cluster based on the size of your data and query characteristic. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started, and you should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Size and setup initial consumer cluster

You should always size your consumer cluster based on your compute needs. One way to get started is to follow the generic cluster sizing guide similar to the producer cluster above.

Setup Amazon Redshift data sharing

Setup data sharing from producer to consumer once you have both the producer and consumer cluster setup. Refer to this post for guidance on how to setup data sharing.

Test consumer only workload on initial consumer cluster

Test consumer only workload on the new initial consumer cluster. This can be done by pointing consumer applications, for example ETL tools, BI applications, and SQL clients, to the new consumer cluster and rerunning the workload to evaluate the performance against your requirements.

Test consumer only workload on different consumer cluster configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either continue to use this cluster configuration or you can test on smaller configurations to see if you can further reduce the cost and still get the performance that you need.

On the other hand, if the initial size consumer cluster fails to meet your workload performance requirements, then you can further test larger configurations to get the configuration that meets your SLA.

As a rule of thumb, size up the consumer cluster by 2x the initial cluster configuration incrementally until it meets your workload requirements.

Once you plan out what configuration you want to test, use elastic resize to resize the initial cluster to the target cluster configuration. After elastic resize is completed, perform the same workload test and evaluate the performance against your SLA. Select the configuration that meets your price performance target.

Test producer only workload on different producer cluster configurations

Once you move your consumer workload to the consumer cluster with the optimum price performance, there might be an opportunity to reduce the compute resource on the producer to save on costs.

To achieve this, you can rerun the producer only workload on 1/2x of the original producer size and evaluate the workload performance. Resizing the cluster up and down accordingly depends on the result, and then you select the minimum producer configuration that meets your workload performance requirements.

Re-evaluate after a full workload run over time

As Amazon Redshift continues evolving, and there are continuous performance and scalability improvement releases, data sharing performance will continue improving. Furthermore, numerous variables might impact the performance of data sharing queries. The following are just some examples:

Ingestion rate and amount of data change
Query pattern and characteristic
Workload changes
Concurrency
Maintenance activities, for example vacuum, analyze, and ATO

This is why you must re-evaluate the producer and consumer cluster sizing using the strategy above on occasion, especially after a full workload deployment, to gain the new best price performance from your cluster’s configuration.

Automated sizing solutions

If your environment involved more complex architecture, for example with multiple tools or applications (BI, ingestion or streaming, ETL, data science), then it might not feasible to use the manual method from the generic guidance above. Instead, you can leverage solutions in this section to automatically replay the workload from your production cluster on the test consumer and producer clusters to evaluate the performance.

Simple Replay utility will be leveraged as the automated solution to guide you through the process of getting the right producer and consumer clusters size for the best price performance.

Simple Replay is a tool for conducting a what-if analysis and evaluating how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. It also includes enhanced support for replaying data ingestion and export pipelines with COPY and UNLOAD statements. To get started and replay your workloads, download the tool from the Amazon Redshift GitHub repository.

Here we walk through the steps to extract your workload logs from the source production cluster and replay them in an isolated environment. This lets you perform a direct comparison between these Amazon Redshift clusters seamlessly and select the clusters configuration that best meet your price performance target.

The following diagram shows the solution architecture.

Architecutre for testing simple replay

Solution walkthrough

Follow these steps to go through the solution to size your consumer and producer clusters.

Size your production cluster

You should always make sure to properly size your existing production cluster to get the performance that you need to meet your workload requirements. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation on the production cluster based on the size of your data and query characteristic. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started. You should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Identify the workload to be isolated

You might have different workloads running on your original cluster, but the first step is to identify the most critical workload to the business that we want to isolate. This is because we want to make sure that the new architecture can meet your workload requirements. This post is a good reference on a data sharing workload isolation use case that can help you decide which workload can be isolated.

Setup Simple Replay

Once you know your critical workload, you must enable audit logging in your production cluster where the critical workload identified above is running to capture query activities and store in Amazon Simple Storage Service (Amazon S3). Note that it may take up to three hours for the audit logs to be delivered to Amazon S3. Once the audit log is available, proceed to setup Simple Replay and then extract the critical workload from the audit log. Note that start_time and end_time could be used as parameters to filter out the critical workload if those workloads run in certain time periods, for example 9am to 11am. Otherwise it will extract all of the logged activities.

Baseline workload

Create a baseline cluster with the same configuration as the producer cluster by restoring from the production snapshot. The purpose of starting with the same configuration is to baseline the performance with an isolated environment.

Once the baseline cluster is available, replay the extracted workload in the baseline cluster. The output from this replay will be the baseline used to compare against subsequent replays on different consumer configurations.

Setup initial producer and consumer test clusters

Create a producer cluster with the same production cluster configuration by restoring from the production snapshot. Create a consumer cluster with the recommended initial consumer size from the previous guidance. Furthermore, setup data sharing between the producer and consumer.

Replay workload on initial producer and consumer

Replay the producer only workload on the initial size producer cluster. This can be achieved using the “Exclude” filter parameter to exclude consumer queries, for example the user that runs consumer queries.

Replay the consumer only workload on the initial size consumer cluster. This can be achieved using the “Include” filter parameter to exclude consumer queries, for example the user that runs consumer queries.

Evaluate the performance of these replays against the baseline and workload performance requirements.

Replay consumer workload on different configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either use this cluster configuration or you can follow these steps to test on smaller configurations to see if you can further reduce costs and still get the performance that you need.

Compare initial consumer performance results against your workload requirements:

If the result exceeds your workload performance requirements, then you can reduce the size of the consumer cluster incrementally, starting with 1/2x, retry the replay and evaluate the performance, then resize up or down accordingly based on the result until it meets your workload requirements. The purpose is to get a sweet spot where you’re comfortable with the performance requirements and get the lowest price possible.
If the result fails to meet your workload performance requirements, then you can increase the size of the cluster incrementally, starting with 2x the original size, retry the replay and evaluate the performance until it meets your workload performance requirements.

Replay producer workload on different configurations

Once you split your workloads out to consumer clusters, the load on the producer cluster should be reduced and you should evaluate your producer cluster’s workload performance to seek the opportunity to downsize to save on costs.

The steps are similar to consumer replay. Elastic resize the producer cluster incrementally starting with 1/2x the original size, replay the producer only workload and evaluate the performance, and then further resize up or down until it meets your workload performance requirements. The purpose is to get a sweet spot where you’re comfortable with the workload performance requirements and get the lowest price possible. Once you have the desired producer cluster configuration, retry replay consumer workloads on the consumer cluster to make sure that the performance wasn’t impacted by producer cluster configuration changes. Finally, you should replay both producer and consumer workloads concurrently to make sure that the performance is achieved in a full workload scenario.

Re-evaluate after a full workload run over time

Similar to the generic guidance, you should re-evaluate the producer and consumer clusters sizing using the previous strategy on occasion, especially after full workload deployment to gain the new best price performance from your cluster’s configuration.

Clean up

Running these sizing tests in your AWS account may have some cost implications because it provisions new Amazon Redshift clusters, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the Amazon Redshift clusters to save on costs. We also recommend pausing your clusters when they’re not in use.

Applying Amazon Redshift and data sharing best practices

Proper sizing of both your producer and consumer clusters will give you a good start to get the best price performance from your Amazon Redshift deployment. However, sizing isn’t the only factor that can maximize your performance. In this case, understanding and following best practices are equally important.

General Amazon Redshift performance tuning best practices are applicable to data sharing deployment. Make sure that your deployment follows these best practices.

There numerous data sharing specific best practices that you should follow to make sure that you maximize the performance. Refer to this post for more details.

Summary

There is no one-size-fits-all recommendation on producer and consumer cluster sizes. It varies by workloads and your performance SLA. The purpose of this post is to provide you with guidance for how you can evaluate your specific data sharing workload performance to determine both consumer and producer cluster sizes to get the best price performance. Consider testing your workloads on producer and consumer using simple replay before adopting it in production to get the best price performance.

About the Authors

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for costs, reliability, performance and operational excellence at scale in their cloud journey. He has a keen interest in Data Analytics as well.

From centralized architecture to decentralized architecture: How data sharing fine-tunes Amazon Redshift workloads

2022-08-16 Jingbin Ma

Post Syndicated from Jingbin Ma original https://aws.amazon.com/blogs/big-data/from-centralized-architecture-to-decentralized-architecture-how-data-sharing-fine-tunes-amazon-redshift-workloads/

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Today, Amazon Redshift has become the most widely used cloud data warehouse.

With the significant growth of data for big data analytics over the years, some customers have asked how they should optimize Amazon Redshift workloads. In this post, we explore how to optimize workloads on Amazon Redshift clusters using Amazon Redshift RA3 nodes, data sharing, and pausing and resuming clusters. For more cost-optimization methods, refer to Getting the most out of your analytics stack with Amazon Redshift.

Key features of Amazon Redshift

First, let’s review some key features:

RA3 nodes – Amazon Redshift RA3 nodes are backed by a new managed storage model that gives you the power to separately optimize your compute power and your storage. They bring a few very important features, one of which is data sharing. RA3 nodes also support the ability to pause and resume, which allows you to easily suspend on-demand billing while the cluster is not being used.
Data sharing – Amazon Redshift data sharing offers you to extend the ease of use, performance, and cost benefits of Amazon Redshift in a single cluster to multi-cluster deployments while being able to share data. Data sharing enables instant, granular, and fast data access across Redshift clusters without the need to copy or move it. You can securely share live data with Amazon Redshift clusters in the same or different AWS accounts, and across regions. You can share data at many levels, including schemas, tables, views, and user-defined functions. You can also share the most up-to-date and consistent information as it’s updated in Amazon Redshift Serverless. It also provides fine-grained access controls that you can tailor for different users and businesses that all need access to the data. However, data sharing in Amazon Redshift has a few limitations.

Solution overview

In this use case, our customer is heavily using Amazon Redshift as their data warehouse for their analytics workloads, and they have been enjoying the possibility and convenience that Amazon Redshift brought to their business. They mainly use Amazon Redshift to store and process user behavioral data for BI purposes. The data has increased by hundreds of gigabytes daily in recent months, and employees from departments continuously run queries against the Amazon Redshift cluster on their BI platform during business hours.

The company runs four major analytics workloads on a single Amazon Redshift cluster, because some data is used by all workloads:

Queries from the BI platform – Various queries run mainly during business hours.
Hourly ETL – This extract, transform, and load (ETL) job runs in the first few minutes of each hour. It generally takes about 40 minutes.
Daily ETL – This job runs twice a day during business hours, because the operation team needs to get daily reports before the end of the day. Each job normally takes between 1.5–3 hours. It’s the second-most resource-heavy workload.
Weekly ETL – This job runs in the early morning every Sunday. It’s the most resource-heavy workload. The job normally takes 3–4 hours.

The analytics team has migrated to the RA3 family and increased the number of nodes of the Amazon Redshift cluster to 12 over the years to keep the average runtime of queries from their BI tool within an acceptable time due to the data size, especially when other workloads are running.

However, they have noticed that performance is reduced while running ETL tasks, and the duration of ETL tasks is long. Therefore, the analytics team wants to explore solutions to optimize their Amazon Redshift cluster.

Because CPU utilization spikes appear while the ETL tasks are running, the AWS team’s first thought was to separate workloads and relevant data into multiple Amazon Redshift clusters with different cluster sizes. By reducing the total number of nodes, we hoped to reduce the cost of Amazon Redshift.

After a series of conversations, the AWS team found that one of the reasons that the customer keeps all workloads on the 12-node Amazon Redshift cluster is to manage the performance of queries from their BI platform, especially while running ETL workloads, which have a big impact on the performance of all workloads on the Amazon Redshift cluster. The obstacle is that many tables in the data warehouse are required to be read and written by multiple workloads, and only the producer of a data share can update the shared data.

The challenge of dividing the Amazon Redshift cluster into multiple clusters is data consistency. Some tables need to be read by ETL workloads and written by BI workloads, and some tables are the opposite. Therefore, if we duplicate data into two Amazon Redshift clusters or only create a data share from the BI cluster to the reporting cluster, the customer will have to develop a data synchronization process to keep the data consistent between all Amazon Redshift clusters, and this process could be very complicated and unmaintainable.

After more analysis to gain an in-depth understanding of the customer’s workloads, the AWS team found that we could put tables into four groups, and proposed a multi-cluster, two-way data sharing solution. The purpose of the solution is to divide the workloads into separate Amazon Redshift clusters so that we can use Amazon Redshift to pause and resume clusters for periodic workloads to reduce the Amazon Redshift running costs, because clusters can still access a single copy of data that is required for workloads. The solution should meet the data consistency requirements without building a complicated data synchronization process.

The following diagram illustrates the old architecture (left) compared to the new multi-cluster solution (right).

Improve the old architecture (left) to the new multi-cluster solution (right)

Dividing workloads and data

Due to the characteristics of the four major workloads, we categorized workloads into two categories: long-running workloads and periodic-running workloads.

The long-running workloads are for the BI platform and hourly ETL jobs. Because the hourly ETL workload requires about 40 minutes to run, the gain is small even if we migrate it to an isolated Amazon Redshift cluster and pause and resume it every hour. Therefore, we leave it with the BI platform.

The periodic-running workloads are the daily and weekly ETL jobs. The daily job generally takes about 1 hour and 40 minutes to 3 hours, and the weekly job generally takes 3–4 hours.

Data sharing plan

The next step is identifying all data (tables) access patterns of each category. We identified four types of tables:

Type 1 – Tables are only read and written by long-running workloads
Type 2 – Tables are read and written by long-running workloads, and are also read by periodic-running workloads
Type 3 – Tables are read and written by periodic-running workloads, and are also read by long-running workloads
Type 4 – Tables are only read and written by periodic-running workloads

Fortunately, there is no table that is required to be written by all workloads. Therefore, we can separate the Amazon Redshift cluster into two Amazon Redshift clusters: one for the long-running workloads, and the other for periodic-running workloads with 20 RA3 nodes.

We created a two-way data share between the long-running cluster and the periodic-running cluster. For type 2 tables, we created a data share on the long-running cluster as the producer and the periodic-running cluster as the consumer. For type 3 tables, we created a data share on the periodic-running cluster as the producer and the long-running cluster as the consumer.

The following diagram illustrates this data sharing configuration.

The long-running cluster (producer) shares type 2 tables to the periodic-running cluster (consumer). The periodic-running cluster (producer’) shares type 3 tables to the long-running cluster (consumer’)

Build two-way data share across Amazon Redshift clusters

In this section, we walk through the steps to build a two-way data share across Amazon Redshift clusters. First, let’s take a snapshot of the original Amazon Redshift cluster, which became the long-running cluster later.

Take a snapshot of the long-running-cluster from the Amazon Redshift console

Now, let’s create a new Amazon Redshift cluster with 20 RA3 nodes for periodic-running workloads. Then we migrate the type 3 and type 4 tables to the periodic-running cluster. Make sure you choose the ra3 node type. (Amazon Redshift Serverless supports data sharing too, and it becomes generally available in July 2022, so it is also an option now.)

Create the periodic-running-cluster. Make sure you select the ra3 node type.

Create the long-to-periodic data share

The next step is to create the long-to-periodic data share. Complete the following steps:

On the periodic-running cluster, get the namespace by running the following query:

SELECT current_namespace;

Make sure record the namespace.

On the long-running cluster, we run queries similar to the following:

CREATE DATASHARE ltop_share SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE ltop_share ADD SCHEMA public_long;
ALTER DATASHARE ltop_share ADD ALL TABLES IN SCHEMA public_long;
GRANT USAGE ON DATASHARE ltop_share TO NAMESPACE '[periodic-running-cluster-namespace]';

We can validate the long-to-periodic data share using the following command:

SHOW datashares;

After we validate the data share, we get the long-running cluster namespace with the following query:

SELECT current-namespace;

Make sure record the namespace.

On the periodic-running cluster, run the following command to load the data from the long-to-periodic data share with the long-running cluster namespace:

CREATE DATABASE ltop FROM DATASHARE ltop_share OF NAMESPACE '[long-running-cluster-namespace]';

Confirm that we have read access to tables in the long-to-periodic data share.

Create the periodic-to-long data share

The next step is to create the periodic-to-long data share. We use the namespaces of the long-running cluster and the periodic-running cluster that we collected in the previous step.

On the periodic-running cluster, run queries similar to the following to create the periodic-to-long data share:

CREATE DATASHARE ptol_share SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE ptol_share ADD SCHEMA public_periodic;
ALTER DATASHARE ptol_share ADD ALL TABLES IN SCHEMA public_periodic;
GRANT USAGE ON DATASHARE ptol_share TO NAMESPACE '[long-running-cluster-namespace]';

Validate the data share using the following command:

SHOW datashares;

On the long-running cluster, run the following command to load the data from the periodic-to-long data using the periodic-running cluster namespace:

CREATE DATABASE ptol FROM DATASHARE ptol_share OF NAMESPACE '[periodic-running-cluster-namespace]';

Check that we have read access to the tables in the periodic-to-long data share.

At this stage, we have separated workloads into two Amazon Redshift clusters and built a two-way data share across two Amazon Redshift clusters.

The next step is updating the code of different workloads to use the correct endpoints of two Amazon Redshift clusters and perform consolidated tests.

Pause and resume the periodic-running Amazon Redshift cluster

Let’s update the crontab scripts, which run periodic-running workloads. We make two updates.

When the scripts start, call the Amazon Redshift check and resume cluster APIs to resume the periodic-running Amazon Redshift cluster when the cluster is paused:
```
aws redshift resume-cluster --cluster-identifier [periodic-running-cluster-id]
```
After the workloads are finished, call the Amazon Redshift pause cluster API with the cluster ID to pause the cluster:
```
aws redshift pause-cluster --cluster-identifier [periodic-running-cluster-id]
```

Results

After we migrated the workloads to the new architecture, the company’s analytics team ran some tests to verify the results.

According to tests, the performance of all workloads improved:

The BI workload is about 100% faster during the ETL workload running periods
The hourly ETL workload is about 50% faster
The daily workload duration reduced to approximately 40 minutes, from a maximum of 3 hours
The weekly workload duration reduced to approximately 1.5 hours, from a maximum of 4 hours

All functionalities work properly, and cost of the new architecture only increased approximately 13%, while over 10% of new data had been added during the testing period.

Learnings and limitations

After we separated the workloads into different Amazon Redshift clusters, we discovered a few things:

The performance of the BI workloads was 100% faster because there was no resource competition with daily and weekly ETL workloads anymore
The duration of ETL workloads on the periodic-running cluster was reduced significantly because there were more nodes and no resource competition from the BI and hourly ETL workloads
Even when over 10% new data was added, the overall cost of the Amazon Redshift clusters only increased by 13%, due to using the cluster pause and resume function of the Amazon Redshift RA3 family

As a result, we saw a 70% price-performance improvement of the Amazon Redshift cluster.

However, there are some limitations of the solution:

To use the Amazon Redshift pause and resume function, the code for calling the Amazon Redshift pause and resume APIs must be added to all scheduled scripts that run ETL workloads on the periodic-running cluster
Amazon Redshift clusters require several minutes to finish pausing and resuming, although you’re not charged during these processes
The size of Amazon Redshift clusters can’t automatically scale in and out depending on workloads

Next steps

After improving performance significantly, we can explore the possibility of reducing the number of nodes of the long-running cluster to reduce Amazon Redshift costs.

Another possible optimization is using Amazon Redshift Spectrum to reduce the cost of Amazon Redshift on cluster storage. With Redshift Spectrum, multiple Amazon Redshift clusters can concurrently query and retrieve the same structured and semistructured dataset in Amazon Simple Storage Service (Amazon S3) without the need to make copies of the data for each cluster or having to load the data into Amazon Redshift tables.

Amazon Redshift Serverless was announced for preview in AWS re:Invent 2021 and became generally available in July 2022. Redshift Serverless automatically provisions and intelligently scales your data warehouse capacity to deliver best-in-class performance for all your analytics. You only pay for the compute used for the duration of the workloads on a per-second basis. You can benefit from this simplicity without making any changes to your existing analytics and BI applications. You can also share data for read purposes across different Amazon Redshift Serverless instances within or across AWS accounts.

Therefore, we can explore the possibility of removing the need to script for pausing and resuming the periodic-running cluster by using Redshift Serverless to make the management easier. We can also explore the possibility of improving the granularity of workloads.

Conclusion

In this post, we discussed how to optimize workloads on Amazon Redshift clusters using RA3 nodes, data sharing, and pausing and resuming clusters. We also explored a use case implementing a multi-cluster two-way data share solution to improve workload performance with a minimum code change. If you have any questions or feedback, please leave them in the comments section.

About the authors

Jingbin Ma

Jingbin Ma is a Sr. Solutions Architect at Amazon Web Services. He helps customers build well-architected applications using AWS services. He has many years of experience working in the internet industry, and his last role was CTO of a New Zealand IT company before joining AWS. He is passionate about serverless and infrastructure as code.

Chao Pan is a Data Analytics Solutions Architect at Amazon Web Services. He’s responsible for the consultation and design of customers’ big data solution architectures. He has extensive experience in open-source big data. Outside of work, he enjoys hiking.

Cross-Region architecture requirements

Use case: Financial services analytics with Regional data residency

Solution overview

Establish cross-Region network connectivity: Amazon Redshift to QuickSight

Implement cross-Region data sharing

Configure QuickSight in the analytics Region

Performance considerations for cross-Region analytics

Security considerations

Cost implications

Conclusion

About the Authors

Who is Getir?

Overview of Getir’s main use case

Overview of Getir’s use of Amazon Redshift and modern data architecture

Modern data trends reinforced by Getir’s case study

Using generative AI with data mesh architecture

Conclusion

About the Authors

Generic consumer sizing guidance

Size your producer cluster

Size and setup initial consumer cluster

Setup Amazon Redshift data sharing

Test consumer only workload on initial consumer cluster

Test consumer only workload on different consumer cluster configurations

Test producer only workload on different producer cluster configurations

Re-evaluate after a full workload run over time

Automated sizing solutions

Solution walkthrough

Size your production cluster

Identify the workload to be isolated

Setup Simple Replay

Baseline workload

Setup initial producer and consumer test clusters

Replay workload on initial producer and consumer

Replay consumer workload on different configurations

Replay producer workload on different configurations

Re-evaluate after a full workload run over time

Clean up

Applying Amazon Redshift and data sharing best practices

Summary

About the Authors

Key features of Amazon Redshift

Solution overview

Dividing workloads and data

Data sharing plan

Build two-way data share across Amazon Redshift clusters

Create the long-to-periodic data share

Create the periodic-to-long data share

Pause and resume the periodic-running Amazon Redshift cluster

Results

Learnings and limitations

Next steps

Conclusion

About the authors

The collective thoughts of the interwebz