All posts by Neeraja Rentachintala

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/amazon-redshift-announcements-at-aws-reinvent-2023-to-enable-analytics-on-all-your-data/

In 2013, Amazon Web Services revolutionized the data warehousing industry by launching Amazon Redshift, the first fully-managed, petabyte-scale, enterprise-grade cloud data warehouse. Amazon Redshift made it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This cloud service was a significant leap from the traditional data warehousing solutions, which were expensive, not elastic, and required significant expertise to tune and operate. Since then, customer demands for better scale, higher throughput, and agility in handling a wide variety of changing, but increasingly business critical analytics and machine learning use cases has exploded, and we have been keeping pace. Today, tens of thousands of customers use Amazon Redshift in AWS global infrastructure to collectively process exabytes of data daily and employs Amazon Redshift as a key component of their data architecture to drive use cases from typical dashboarding to self-service analytics, real-time analytics, machine learning, data sharing and monetization, and more

The advancements to Amazon Redshift announced at AWS re:Invent 2023 further accelerates modernization of your cloud analytics environments, keeping our core tenet to help you achieve the best price-performance at any scale. These announcements drive forward the AWS Zero-ETL vision to unify all your data, enabling you to better maximize the value of your data with comprehensive analytics and ML capabilities, and innovate faster with secure data collaboration within and across organizations. From price-performance improvements to zero-ETL, to generative AI capabilities, we have something for everyone. Let’s dive into the highlights.

Modernizing analytics for scale, performance, and reliability

“Our migration from legacy on-premises platform to Amazon Redshift allows us to ingest data 88% faster, query data 3x faster, and load daily data to the cloud 6x faster. Amazon Redshift enabled us to optimize performance, availability, and reliability—significantly easing operational complexity, while increasing the velocity of our end-users’ decision-making experience on the Fab floor.”

– Sunil Narayan, Sr Dir, Analytics at GlobalFoundries

Diligently driving the best price-performance at scale with new improvements

Since day 1, Amazon Redshift has been building innovative capabilities to help you get to optimum performance, while keeping costs lower. Amazon Redshift continues to lead on the price-performance front with up to 6x better price-performance than alternative cloud data warehouse and for dash boarding applications with high concurrency and low latency. We closely analyze query patterns in the fleet and look for opportunities to drive customer-focused innovation. For example, earlier in the year, we announced speed ups for string-based data processing up to 63x compared to alternative compression encodings such as LZO (Lempel-Ziv-Oberhumer) or ZStandard. At AWS re:Invent 2023, we introduced more performance enhancements in query planning and execution such as enhanced bloom filters , query rewrites, and support for write operations in auto scaling . For more information about performance improvement capabilities, refer to the list of announcements below.

Amazon Redshift Serverless is more intelligent than ever with new AI-driven scaling and optimizations

Speaking of price-performance, new next generation AI-driven scaling and optimizations capabilities in Amazon Redshift Serverless can deliver up to 10x better price-performance for variable workloads (based on internal testing), without manual intervention. Amazon Redshift Serverless, generally available since 2021, allows you to run and scale analytics without having to provision and manage the data warehouse. Since GA, Redshift Serverless executed over a billion queries to power data insights for thousands of customers. With these new AI optimizations, Amazon Redshift Serverless scales proactively and automatically with workload changes across all key dimensions —such as data volume, concurrent users, and query complexity. You just specify your desired price-performance targets to either optimize for cost or optimize for performance or balanced and serverless does the rest. Learn more about additional improvements in Redshift Serverless, under the list of announcements below.

Multi-data warehouse writes through data sharing

Data sharing is a widely adopted feature in Amazon Redshift with customers running tens of millions of queries on shared data daily. Customers share live transactionally consistent data within and across organizations and regions for read purposes without data copies or data movement. Customers are using data sharing to modernize their analytics architectures from monolithic architectures to multi-cluster, data mesh deployments that enable seamless and secure access across organizations to drive data collaboration and powerful insights. At AWS re:Invent 2023, we extended data sharing capabilities to launch multi-data warehouse writes in preview. You can now start writing to Redshift databases from other Redshift data warehouses in just a few clicks, further enabling data collaboration, flexible scaling of compute for ETL/data processing workloads by adding warehouses of different types and sizes based on price-performance needs. Experience greater transparency of compute usage as each warehouse is billed for its own compute and consequently keep your costs under control.

Multidimensional data layouts

Amazon Redshift offers industry leading predictive optimizations that continuously monitor your workloads and seamlessly accelerate performance and maximize concurrency by adjusting data layout and compute management as you use the data warehouse more. In addition to the powerful optimizations Redshift already offers, such as Automatic Table Sort, Automatic sort and distribution keys, we are introducing Multidimensional Data Layouts, a new powerful table sorting mechanism that improves performance of repetitive queries by automatically sorting data based on the incoming query filters (for example: Sales in a specific region). This method significantly accelerates the performance of table scans compared to traditional methods.

Unifying all your data with zero-ETL approaches

“Using the Aurora MySQL zero-ETL integration, we experience near real-time data synchronization between Aurora MySQL databases and Amazon Redshift, making it possible to build an analysis environment in just three hours instead of the month of developer time it used to take before”

– MoneyForward

JOYME uses Amazon Redshift’s streaming ingestion and other Amazon services for risk control over users’ financial activity such as recharge, refund, and rewards.

“With Redshift, we are able to view risk counterparts and data in near real time—
instead of on an hourly basis. Redshift significantly improved our business ROI efficiency.”

– PengBo Yang, CTO, JOYME

Data pipelines can be challenging and costly to build and manage and can create hours-long delays to obtain transactional data for analytics. These delays can lead to missed business opportunities, especially when the insights derived from analytics on transactional data are relevant for only a limited amount of time. Amazon Redshift employs AWS’s zero-ETL approach that enables interoperability and integration between the data warehouse and operational databases and even your streaming data services, so that the data is easily and automatically ingested into the warehouse for you, or you can access the data in place, where it lives.

Zero-ETL integrations with operational databases

We delivered zero-ETL integration between Amazon Aurora MySQL Amazon Redshift (general availability) this year, to enable near real-time analytics and machine learning (ML) using Amazon Redshift on petabytes of transactional data from Amazon Aurora. Within seconds of transactional data being written into Aurora, the data is available in Amazon Redshift, so you don’t have to build and maintain complex data pipelines to perform extract, transform, and load (ETL) operations. At AWS re:Invent, we extended zero-ETL integration to additional sources specifically Aurora PostgreSQL, Dynamo DB, and Amazon RDS MySQL. Zero-ETL integration also enables you to load and analyze data from multiple operational database clusters in a new or existing Amazon Redshift instance to derive holistic insights across many applications.

Data lake querying with support for Apache Iceberg tables

Amazon Redshift allows customers to run a wide range of workloads on data warehouse and data lakes using its support for various open file and table formats. At AWS re:Invent, we announced the general availability of support for Apache Iceberg tables, so you can easily access your Apache Iceberg tables on your data lake from Amazon Redshift and join it with the data in your data warehouse when needed. Use one click to access your data lake tables using auto-mounted AWS Glue data catalogs on Amazon Redshift for a simplified experience. We have improved data lake query performance by integrating with AWS Glue statistics and introduce preview of incremental refresh for materialized views on data lake data to accelerate repeated queries.

Learn more about the zero-ETL integrations, data lake performance enhancements, and other announcements below.

Maximize value with comprehensive analytics and ML capabilities

“Amazon Redshift is one of the most important tools we had in growing Jobcase as a company.”

– Ajay Joshi, Distinguished Engineer, Jobcase

With all your data integrated and available, you can easily build and run near real-time analytics to AI/ML/Generative AI applications. Here’s a couple of highlights from this week and for the full list, see below.

Amazon Q Generative SQL capability

Query Editor, an out-of-the-box web-based SQL experience in Amazon Redshift is a popular tool for data exploration, visual analysis, and data collaboration. At AWS re:Invent, we introduced Amazon Q Generative SQL capabilities in Amazon Redshift Query Editor (preview), to simplify query authoring and increase your productivity by allowing you to express queries in natural language and receive SQL code recommendations. Generative SQL uses AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly allowing you to get insights faster in a conversational format without extensive knowledge of your organization’s complex database metadata.

Amazon Redshift ML large language model (LLM) integration

Amazon Redshift ML enables customers to create, train, and deploy machine learning models using familiar SQL commands. Customers use Redshift ML to run an average of over 10 billion predictions a day within their data warehouses. At AWS re:Invent, we announced support for LLMs as preview. Now, you can use pre-trained open source LLMs in Amazon SageMaker JumpStart as part of Redshift ML, allowing you to bring the power of LLMs to analytics. For example, you can make inferences on your product feedback data in Amazon Redshift, use LLMs to summarize feedback, perform entity extraction, sentiment analysis and product feedback classification.

Innovate faster with secure data collaboration within and across the organizations

“Millions of companies use Stripe’s software and APIs to accept payments, send payouts, and manage their businesses online.  Access to their Stripe data via leading data warehouses like Amazon Redshift has been a top request from our customers. Our customers needed secure, fast, and integrated analytics at scale without building complex data pipelines or moving and copying data around. With Stripe Data Pipeline for Amazon Redshift, we’re helping our customers set up a direct and reliable data pipeline in a few clicks. Stripe Data Pipeline enables our customers to automatically share their complete, up-to-date Stripe data with their Amazon Redshift data warehouse, and take their business analytics and reporting to the next level.”

– Tony Petrossian, Head of Engineering, Revenue & Financial Management at Stripe

With Amazon Redshift, you can easily and securely share data and collaborate no matter where your teams or data is located. And have the confidence that your data is secure no matter where you operate or how highly regulated your industries are. We have enabled fine grained permissions, an easy authentication experience with single sign-on for your organizational identity—all provided at no additional cost to you.

Unified identity with IAM identity center integration

We announced Amazon Redshift integration with AWS IAM Identity Center to enable organizations to support trusted identity propagation between Amazon QuickSight,, Amazon Redshift Query Editor, and Amazon Redshift,  . Customers can use their organization identities to access Amazon Redshift in a single sign-on experience using third party identity providers (IdP), such as Microsoft Entra ID, Okta, Ping, OneLogin, etc. from Amazon QuickSight and Amazon Redshift Query Editor. Administrators can use third-party identity provider users and groups to manage fine grained access to data across services and audit user level access in AWS CloudTrail. With trusted identity propagation, a user’s identity is passed seamlessly between Amazon QuickSight, Amazon Redshift reducing time to insights and enabling a friction free analytics experience.

For the full set of announcements, see the following:

Learn more: https://aws.amazon.com/redshift


About the authors

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned Product Management and GTM leader, bringing over 20 years of experience in product vision, strategy and leadership roles in data products and platforms. Neeraja delivered products in analytics, databases, data Integration, application integration, AI/Machine Learning, large scale distributed systems across On-Premise and Cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica and Expedia.com.

Sunaina AbdulSalah leads product marketing for Amazon Redshift. She focuses on educating customers about the impact of data warehousing and analytics and sharing AWS customer stories. She has a deep background in marketing and GTM functions in the B2B technology and cloud computing domains. Outside of work, she spends time with her family and friends and enjoys traveling.

Announcing Amazon Redshift data sharing (preview)

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-data-sharing-preview/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to 3x better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as high-performance business intelligence (BI) reporting, dashboarding applications, data exploration, and real-time analytics.

We’re excited to launch the new Amazon Redshift data sharing capability, which enables you to securely and easily share live data across Amazon Redshift clusters. Data sharing allows you to extend the ease of use, performance, and cost benefits that Amazon Redshift offers in a single cluster to multi-cluster deployments while being able to share data. Data sharing enables instant, granular, and high-performance data access across Amazon Redshift clusters without the need to copy or move it. Data sharing provides live access to data so that your users always see the most up-to-date and consistent information as it’s updated in the data warehouse. There is no additional cost to use data sharing on your Amazon Redshift clusters.

In this post, we discuss the needs for data sharing in organizations, current challenges when it comes to sharing data, and how Amazon Redshift data sharing addresses these needs.

Need for sharing data in organizations and current challenges

We hear from our customers that they want to share data at many levels to enable broad and deep insights but also minimize complexity and cost. For example, data needs to be shared from a central data warehouse that loads and transforms constant streams of updates with BI and analytics clusters that serve a variety of workloads, such as dashboarding applications, ad-hoc queries, and data science. Multiple teams and business groups within an organization want to share and collaborate on data to gain differentiated insights that can help unlock new market opportunities, or analyze cross-group impact. As data becomes more valuable, many organizations are becoming data providers too and want to share data across organizations and offer analytics services to external consumers.

Before the launch of Amazon Redshift data sharing, in order to share data, you needed to unload data from one system (the producer) and copy it into another (the consumer). This approach can be expensive and introduce delays, especially as the number of consumers grow. It requires building and maintaining separate extract, transform, and load (ETL) jobs to provide relevant subsets of the data for each consumer. The complexity increases with the security practices that must be kept in place to regularly monitor business-critical data usage and ensure compliance. Additionally, this way of sharing doesn’t provide users with complete and up-to-date views of the data and limits insights. Organizations want a simple and secure way to share fresh, complete, and consistent views of data with any number of consumers.

Introduction to Amazon Redshift data sharing

Amazon Redshift data sharing allows you to securely and easily share data for read purposes across different Amazon Redshift clusters without the complexity and delays associated with data copies and data movement. Data can be shared at many levels, including schemas, tables, views, and user-defined functions, providing fine-grained access controls that can be tailored for different users and businesses that all need access to the data.

Consumers can discover shared data using standard SQL interfaces and query it with high performance from familiar BI and analytics tools. Users connect to an Amazon Redshift database in their cluster, and can perform queries by referring to objects from any other Amazon Redshift database that they have permissions to access, including the databases shared from remote clusters. Amazon Redshift enables sharing live and transactionally consistent views of the data, meaning that consumers always view the most up-to-date data, even when it’s continuously updated on the producer clusters. Amazon Redshift clusters that share data can be in the same or different AWS accounts, making it possible for you to share data across organizations and collaborate with external parties in a secure and governed manner. Amazon Redshift offers comprehensive auditing capabilities using system tables and AWS CloudTrail to allow you to monitor the data sharing permissions and the usage across all the consumers and revoke access instantly, when necessary.

Data sharing provides high-performance access and workload isolation

Data sharing builds on Amazon Redshift RA3 managed storage, which decouples storage and compute, allowing either of them to scale independently. With data sharing, workloads accessing shared data are isolated from each other. Queries accessing shared data run on the consumer cluster and read data from the Amazon Redshift managed storage layer directly without impacting the performance of the producer cluster.

You can now rapidly onboard any number of workloads with diverse data access patterns and SLA requirements and not be concerned about resource contention. Workloads accessing shared data can be provisioned with flexible compute resources that meet their workload-specific price performance requirements and be scaled independently as needed in a self-service fashion. You can optionally allow these teams and business groups to pay for what they use with charge-back. With data sharing, the producer cluster pays for the managed storage cost of the shared data, and the consumer cluster pays for the compute. Data sharing itself doesn’t have any cost associated with it.

Consumer clusters are regular Amazon Redshift clusters. Although consumer clusters can only read the shared data, they can write their own private data and if desired, share it with other clusters, including back to the producer cluster. Building on a managed storage foundation also enables high-performance access to the data. The frequently accessed datasets are cached on the local compute nodes of the consumer cluster to speed up those queries. The following diagram illustrates this data sharing architecture.

How data sharing works

Data sharing between Amazon Redshift clusters is a two-step process. First, the administrator of the producer cluster that wants to share data creates a data share, a new named object that serves as a unit of sharing. The administrator of the producer cluster and other privileged users add the needed database objects such as schemas, tables, views (regular, late binding, and materialized) to the data share and specify a list of consumers for sharing purposes. Consumers can be other Amazon Redshift clusters in the same AWS account or separate AWS accounts.

Next, the administrator of the consumer cluster looks at the shared datasets and reviews the contents of each share. To consume shared data, the consumer cluster administrator creates an Amazon Redshift database from the data share object and assigns permissions to appropriate users and groups in the consumer cluster. Users and groups that have access to shared data can discover and query it using standard SQL and analytics tools. You can join shared data with local data and perform cross-database queries. We have introduced new metadata views and modified JDBC/ODBC drivers so that tools can seamlessly integrate with shared data. The following diagram illustrates this data sharing architecture.

Data sharing use cases

In this section, we discuss the common data sharing use cases:

  • Sharing data from a central ETL cluster with multiple, isolated BI and analytics clusters in a hub-spoke architecture to provide read workload isolation and optional charge-back for costs.
  • Sharing data among multiple business groups so they can collaborate for broader analytics and data science. Each Amazon Redshift cluster can be a producer of some data but also can be a consumer of other datasets.
  • Sharing data in order to offer data and analytics as a service across the organization and with external parties.
  • Sharing data between development, test, and production environments, at any granularity.

Amazon Redshift offers unparalleled compute flexibility

Amazon Redshift offers you the most flexibility than any other cloud data warehouse when it comes to organizing your workloads based on price performance, isolation, and charge-ability. Amazon Redshift efficiently utilizes compute resources in a cluster to maximize performance and throughput with automatic workload management (WLM). Specifying query priorities allows you to influence the usage of resources across multiple workloads based on business priorities. With Amazon Redshift concurrency scaling, you can elastically scale one or more workloads in a cluster automatically with extra capacity to handle high concurrency and query spikes without any application changes. And with data sharing, you can now handle diverse business-critical workloads that need flexible compute resources, isolation, and charge-ability with multi-cluster deployments while sharing data. You can use WLM and concurrency scaling on both data sharing producer and consumer clusters.

The combination of these capabilities allows you to evolve from single cluster to multi-cluster deployments easily and cost-efficiently.

Beyond sharing data across Amazon Redshift clusters

Along with sharing data across Amazon Redshift clusters, our vision is to evolve data sharing so you can share Amazon Redshift data with the Amazon Simple Storage Service (Amazon S3) data lake by publishing data shares to the AWS Lake Formation catalog. This enables other AWS analytics services to discover and access live and transactionally consistent data in Amazon Redshift managed storage. For example, you could incorporate live data from Amazon Redshift in an Amazon EMR Spark data pipeline, perform ad-hoc queries on Amazon Redshift data using Amazon Athena, or use clean and curated data in Amazon Redshift to build machine learning models with Amazon SageMaker. The following diagram illustrates this architecture.

Amazon Redshift data sharing in customer environments

Data sharing across Amazon Redshift clusters within the same account is now available for preview. Sharing across clusters that are in different AWS accounts is coming soon. We worked closely with several customers and partners for an early preview of data sharing and they are excited about the results and possibilities.

“At Warner Bros. Games, we build and maintain complex data mobility infrastructures to manage data movements across single game clusters and consolidated business function clusters. However, developing and maintaining this system monopolizes valuable team resources and introduces delays that impede our ability to act on the data with agility and speed. Using the Redshift data sharing feature, we can remove the entire subsystem we built for data copying, movement, and loading between Redshift clusters. This will empower all of our business teams to make decisions on the right datasets more quickly and efficiently. Additionally, Redshift data sharing will also allow us to re-architect compute provisioning to more closely align with the resources needed to execute those functions’ SQL workloads—ultimately enabling simpler infrastructure operations.”

— Kurt Larson, Technical Director, Warner Bros. Analytics

“The data sharing feature seamlessly allows multiple Redshift clusters to query data located in our RA3 clusters and their managed storage. This eliminates our concerns with delays in making data available for our teams, and reduces the amount of data duplication and associated backfill headache. We now can concentrate even more of our time making use of our data in Redshift and enable better collaboration instead of data orchestration.”

— Steven Moy, Engineer, Yelp

“At Fannie Mae, we adopted a de-centralized approach to data warehouse management with tens of Amazon Redshift clusters across many applications. While each team manages their own dataset, we often have use cases where an application needs to query the datasets from other applications and join with the data available locally. We currently unload and move data from one cluster to another cluster, and this introduces delays in providing timely access to data to our teams. We have had issues with unload operations spiking resource consumption on producer clusters, and data sharing allows us to skip this intermediate unload to Amazon S3, saving time and lowering consumption. Many applications are performing unloads currently in order to share datasets, and we plan to convert all such processes to leveraging the new data sharing feature. With data sharing, we can enable seamless sharing of data across application teams and give them common views of data without having to do ETL. We are also able to avoid the data copies between pre-prod, research, and production environments for each application. Data sharing made us more agile and gave us the flexibility to scale analytics in highly distributed environments like Fannie Mae.”

— Amy Tseng, Enterprise Databases Manager, Fannie Mae

“Shared storage allowed us to focus on what matters: making data available to end-users. Data is no longer stuck in a myriad of storage mediums or formats, or accessible only through select APIs, but rather in a single flavor of SQL.”

— Marco Couperus, Engineering Manager, home24

“We’re excited to launch our integration with Amazon Redshift data sharing, which is a game changer for analytics teams that are looking to improve analytics performance SLAs and reduce data processing delays for diverse workloads across organizations. By isolating workloads that have different access patterns, different SLA requirements such as BI reporting, dashboarding applications, ETL jobs, and data science workloads into separate clusters, customers will now get more control over compute resources and more predictability in their workload SLAs while still sharing common data. The integrations of Etleap with Amazon Redshift data sharing will make sharing data between clusters seamless as part of their existing data pipelines. We are thrilled to offer this integrated experience to our joint customers.”

— Christian Romming, Founder and CEO, Etleap

“We’re excited to partner with AWS to enable data sharing for Amazon Redshift within Aginity Pro. Amazon Redshift users can now navigate and explore shared data within Aginity products just like local data within their clusters. Users can then build reusable analytics that combine both local and shared data seamlessly using cross-database queries. Importantly, shared data query performance has the same high performance as local queries without the cost and performance penalty of traditional federation solutions. We’re thrilled to see how our customers will leverage the ability to securely share data across clusters.”

— Matthew Mullins, CTO, Aginity

Next steps

The preview for within account data sharing is available on Amazon Redshift RA3 node types in the following regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (N. California)
  • US West (Oregon)
  • Asia Pacific (Seoul)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • Europe (Frankfurt)
  • Europe (Ireland)

For more information about how to get started with the preview, see documentation.


About the Authors

 Neeraja Rentachintala is a principal product manager with Amazon Web Services who is focused on building Amazon Redshift – the world’s most popular, highest performing, and most scalable cloud data warehouse. Neeraja earned a Bachelor of Technology in Electronics and Communication Engineering from the National Institute of Technology in India and various business program certifications from the University of Washington, MIT Sloan School of Management, and Stanford University.

 

 

Ippokratis Pandis is a senior principal engineer at AWS. Ippokratis leads initiatives in AWS analytics and data lakes, especially in Amazon Redshift. He holds a Ph.D. in electrical engineering from Carnegie Mellon University.

 

 

 

Naresh Chainani is a senior software development manager with Amazon Redshift, where he leads Query Processing, Query Performance, Distributed Systems, and Workload Management with a strong team. He is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.

 

 

 

 

 

 

Get started with Amazon Redshift cross-database queries (preview)

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/get-started-with-amazon-redshift-cross-database-queries-preview/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL, business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

We’re excited to announce the public preview of the new cross-database queries capability to query across databases in an Amazon Redshift cluster. In this post, we provide an overview of the cross-database queries and a walkthrough of the key functionality that allows you to manage data and analytics at scale in your organization.

What are cross-database queries?

With cross-database queries, you can seamlessly query data from any database in your Amazon Redshift cluster, regardless of which database you’re connected to. Cross-database queries eliminate data copies and simplify your data organization to support multiple business groups on the same cluster. Support for cross-database queries is available on Amazon Redshift RA3 node types.

Data is organized across multiple databases in Amazon Redshift clusters to support multi-tenant configurations. However, you often need to query and join across these datasets by allowing read access. For example, different business groups and teams that own and manage their datasets in a specific database in the data warehouse need to collaborate with other groups. You might want to perform common ETL staging and processing while your raw data is spread across multiple databases. Organizing data in multiple Amazon Redshift databases is also a common scenario when migrating from traditional data warehouse systems.

With cross-database queries, you can now access data from any database on the Amazon Redshift cluster without having to connect to that specific database. You can also join datasets from multiple databases in a single query. You can access database objects such as tables, views with a simple three-part notation of <database>.<schema>.<object>, and analyze the objects using business intelligence (BI) or analytics tools. You can continue to set up granular access controls for users with standard Amazon Redshift SQL commands and ensure that users can only see the relevant subsets of the data they have permissions for.

Walkthrough overview

In this post, we walk through an end-to-end use case to illustrate cross-database queries, comprising the following steps:

  1. Set up permissions on the data.
  2. Access data and perform several cross-database queries.
  3. Connect from tools.

For this walkthrough, we use SQL Workbench, a SQL query tool, to perform queries on Amazon Redshift. For more information about connecting SQL Workbench to an Amazon Redshift cluster, see Connect to your cluster by using SQL Workbench/J .

Setting up permissions for cross-database queries

You can use standard Redshift SQL GRANT and REVOKE commands to configure appropriate permissions for users and groups. To configure permissions, we connect as an administrator to a database named TPCH_100G on an Amazon Redshift cluster that we set up with an industry standard dataset, TPC-H. You can set up this dataset in your environment using the code and scripts for this dataset on GitHub and the accompanying dataset hosted in a public Amazon Simple Storage Service (Amazon S3) bucket.

The following screenshot shows the configuration for your connection profile.

The TPCH_100G database consists of eight tables loaded in the schema PUBLIC, as shown in the following screenshot.

The following screenshot shows a test query on one of the TPC-H tables, customer.

The database administrator provides read permissions on the three of the tables, customer, orders, and lineitem, to an Amazon Redshift user called demouser. The user typically connects to and operates in their own team’s database TPCH_CONSUMERDB on the same Amazon Redshift cluster.

Performing cross-database queries using three-part notation

In this section, we see how cross-database queries work in action. With cross-database queries, you can connect to any database and query from all the other databases in the cluster without having to reconnect. In this use case, the user demouser connects to their database TPCH_CONSUMERDB (see the following screenshot).

While connected to TPCH_CONSUMERDB, demouser can also perform queries on the data in TPCH_100gG database objects that they have permissions to, referring to them using the simple and intuitive three-part notation TPCH_100G.PUBLIC.CUSTOMER (see the following screenshot).

You can refer to and query objects in any other database in the cluster using this <database>.<schema>.<object> notation as long as you have permissions to do so. The objects can be tables or views (including regular, late binding and materialized views).

In addition to performing queries on objects, you can create views on top of objects in other databases and apply granular access controls as relevant.

Joining data across databases

With cross-database queries, you can join datasets across databases. In the following screenshot, demouser queries and performs joins across the customer, lineitem, and orders tables in the TPCH_100G database.

You can also span joins on objects across databases. In the following query, demouser seamlessly joins the datasets from TPCH_100G (customer, lineitem, and orders tables) with the datasets in TPCH_CONSUMERDB (nation and supplier tables).

With cross-database queries, you get a consistent view of the data irrespective of the database you’re connected to.

Securely accessing relevant datasets by connecting from tools

To support the database hierarchy navigation and exploration introduced with cross-database queries, Amazon Redshift is introducing a new set of metadata views and modified versions of JDBC and ODBC drivers.

In addition, you can create aliases from one database to schemas in any other databases on the Amazon Redshift cluster. You create the aliases using the CREATE EXTERNAL SCHEMA command, which allows you to refer to the objects in cross-database queries with the two-part notation <external schema name>.<object>. For example, in the following screenshot, the database administrator connects to TPCH_CONSUMERDB and creates an external schema alias for the PUBLIC schema in TPC_100G database called TPC_100G_PUBLIC and grants the usage access on the schema to demouser.

Now, when demouser connects to TPCH_CONSUMERDB, they see the external schema in the object hierarchy (as in the following screenshot) with only the relevant objects that they have permissions to: CUSTOMER, LINEITEM, and ORDERS.

Now they can perform queries using the schema alias as if the data is local rather than using a three-part notation.

Summary and next steps

We provided you a glimpse into what you can accomplish with cross-database queries in Amazon Redshift. Cross-database queries allow you to organize and manage data across databases to effectively support multi-tenant data warehouse deployments for a wide variety of use cases. You can get started with your use case leveraging cross-database queries capability by trying out the preview. For more information, refer to the documentation cross-database queries.


About the Authors

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned Product Management and GTM leader, bringing over 20 years of experience in product vision, strategy and leadership roles in data products and platforms. Neeraja delivered products in analytics, databases, data Integration, application integration, AI/Machine Learning, large scale distributed systems across On-Premise and Cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica and Expedia.com.

 

 

Jenny Chen is a senior database engineer at Amazon Redshift focusing on all aspects of Redshift performance, like Query Processing, Concurrency, Distributed system, Storage, OS and many more. She works together with development team to ensure of delivering highest performance, scalable and easy-of-use database for customer. Prior to her career in cloud data warehouse, she has 10-year of experience in enterprise database DB2 for z/OS in IBM with focus on query optimization, query performance and system performance.

 

Sushim Mitra is a software development engineer on the Amazon Redshift query processing team. His interest areas are Query Optimization problems, SQL Language features and Database security. When not at work, he enjoys reading fiction from all over the world.

 

 

 

Suzhen Lin is a senior software development engineer on the Amazon Redshift transaction processing and storage team. Suzhen Lin has over 15 years of experiences in industry leading analytical database products including AWS Redshift, Gauss MPPDB, Azure SQL Data Warehouse and Teradata as senior architect and developer. Her experiences cover storage, transaction processing, query processing, memory/disk caching and etc in on-premise/cloud database management systems.