All posts by Neeraja Rentachintala

Recap of Amazon Redshift key product announcements in 2024

2024-12-17 Neeraja Rentachintala

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/recap-of-amazon-redshift-key-product-announcements-in-2024/

Amazon Redshift, launched in 2013, has undergone significant evolution since its inception, allowing customers to expand the horizons of data warehousing and SQL analytics. Today, Amazon Redshift is used by customers across all industries for a variety of use cases, including data warehouse migration and modernization, near real-time analytics, self-service analytics, data lake analytics, machine learning (ML), and data monetization.

Amazon Redshift made significant strides in 2024, rolling out over 100 features and enhancements. These improvements enhanced price-performance, enabled data lakehouse architectures by blurring the boundaries between data lakes and data warehouses, simplified ingestion and accelerated near real-time analytics, and incorporated generative AI capabilities to build natural language-based applications and boost user productivity.

Figure1: Summary of the features and enhancements in 2024

Let’s walk through some of the recent key launches, including the new announcements at AWS re:Invent 2024.

Industry-leading price-performance

Amazon Redshift offers up to three times better price-performance than alternative cloud data warehouses. Amazon Redshift scales linearly with the number of users and volume of data, making it an ideal solution for both growing businesses and enterprises. For example, dashboarding applications are a very common use case in Redshift customer environments where there is high concurrency and queries require quick, low-latency responses. In these scenarios, Amazon Redshift offers up to seven times better throughput per dollar than alternative cloud data warehouses, demonstrating its exceptional value and predictable costs.

Performance improvements

Over the past few months, we have introduced a number of performance improvements to Redshift. First query response times for dashboard queries have significantly improved by optimizing code execution and reducing compilation overhead. We have enhanced data sharing performance with improved metadata handling, resulting in data sharing first query execution that is up to four times faster when the data sharing producer’s data is being updated. We have enhanced autonomics algorithms to generate and implement smarter and quicker optimal data layout recommendations for distribution and sort keys, further optimizing performance. We have launched new RA3.large instances, a new smaller size RA3 node type, to offer better flexibility in price-performance and provide a cost-effective migration option for customers using DC2.large instances. Additionally, we have rolled out AWS Graviton in Serverless, offering up to 30% better price-performance, and expanded concurrency scaling to support more types of write queries, enabling an even greater ability to maintain consistent performance at scale. These improvements collectively reinforce Amazon Redshift’s focus as a leading cloud data warehouse solution, offering unparalleled performance and value to customers.

General availability of multi-data warehouse writes

Amazon Redshift allows you to seamlessly scale with multi-cluster deployments. With the introduction of RA3 nodes with managed storage in 2019, customers obtained flexibility to scale and pay for compute and storage independently. Redshift data sharing, launched in 2020, enabled seamless cross-account and cross-Region data collaboration and live access without physically moving the data, while maintaining transactional consistency. This allowed customers to scale read analytics workloads and offered isolation to help maintain SLAs for business-critical applications. At re:Invent 2024, we announced the general availability of multi-data warehouse writes through data sharing for Amazon Redshift RA3 nodes and Serverless. You can now start writing to shared Redshift databases from multiple Redshift data warehouses in just a few clicks. The written data is available to all the data warehouses as soon as it’s committed. This allows your teams to flexibly scale write workloads such as extract, transform, and load (ETL) and data processing by adding compute resources of different types and sizes based on individual workloads’ price-performance requirements, as well as securely collaborate with other teams on live data for use cases such as customer 360.

General availability of AI-driven scaling and optimizations

The launch of Amazon Redshift Serverless in 2021 marked a significant shift, eliminating the need for cluster management while paying for what you use. Redshift Serverless and data sharing enabled customers to easily implement distributed multi-cluster architectures for scaling analytics workloads. In 2024, we launched Serverless in 10 more regions, improved functionality, and added support for a capacity configuration of 1024 RPUs, allowing you to bring larger workloads onto Redshift. Redshift Serverless is also now even more intelligent and dynamic with the new AI-driven scaling and optimization capabilities. As a customer, you choose whether you want to optimize your workloads for cost, performance, or keep it balanced, and that’s it. Redshift Serverless works behind the scenes to scale the compute up and down and deploys optimizations to meet and maintain the performance levels, even when workload demands change. In internal tests, AI-driven scaling and optimizations showcased up to 10 times price-performance improvements for variable workloads.

Seamless Lakehouse architectures

Lakehouse brings together flexibility and openness of data lakes with the performance and transactional capabilities of data warehouses. Lakehouse allows you to use preferred analytics engines and AI models of your choice with consistent governance across all your data. At re:Invent 2024, we unveiled the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI. This launch brings together widely adopted AWS ML and analytics capabilities, providing an integrated experience for analytics and AI with a re-imagined lakehouse and built-in governance.

General availability of Amazon SageMaker Lakehouse

Amazon SageMaker Lakehouse unifies your data across Amazon S3 data lakes and Redshift data warehouses, enabling you to build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse provides the flexibility to access and query your data using Apache Iceberg open standards so that you can use your preferred AWS, open source, or third-party Iceberg-compatible engines and tools. SageMaker Lakehouse offers integrated access controls and fine-grained permissions that are consistently applied across all analytics engines and AI models and tools. Existing Redshift data warehouses can be made available through SageMaker Lakehouse in just a simple publish step, opening up all your data warehouse data with Iceberg REST API. You can also create new data lake tables using Redshift Managed Storage (RMS) as a native storage option. Check out the Amazon SageMaker Lakehouse: Accelerate analytics & AI presented at re:Invent 2024.

Preview of Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio is an integrated data and AI development environment that enables collaboration and helps teams build data products faster. SageMaker Unified Studio brings together functionality and tools from a mix of standalone studios, query editors, and visual tools available today in Amazon EMR, AWS Glue, Amazon Redshift, Amazon Bedrock, and the existing Amazon SageMaker Studio, into one unified experience. With SageMaker Unified Studio, various users such as developers, analysts, data scientists, and business stakeholders can seamlessly work together, share resources, perform analytics, and build and iterate on models, fostering a streamlined and efficient analytics and AI journey.

Amazon Redshift SQL analytics on Amazon S3 Tables

At re:Invent 2024, Amazon S3 introduced Amazon S3 Tables, a new bucket type that is purpose-built to store tabular data at scale with built-in Iceberg support. With table buckets, you can quickly create tables and set up table-level permissions to manage access to your data lake. Amazon Redshift introduced support for querying Iceberg data in data lakes last year, and now this capability is extended to seamlessly querying S3 Tables. S3 Tables customers create are also available as part of the Lakehouse for consumption by other AWS and third-party engines.

Data lake query performance

Amazon Redshift offers high-performance SQL capabilities on SageMaker Lakehouse, whether the data is in other Redshift warehouses or in open formats. We enhanced support for querying Apache Iceberg data and improved the performance of querying Iceberg up to threefold year-over-year. A number of optimizations contribute to these speed-ups in performance, including integration with AWS Glue Data Catalog statistics, improved data and metadata filtering, dynamic partition elimination, faster/parallel processing of Iceberg manifest files, and scanner improvements. In addition, Amazon Redshift now supports incremental refresh support for materialized views on data lake tables to eliminate the need for recomputing the materialized view when new data arrives, simplifying how you build interactive applications on S3 data lakes.

Simplified ingestion and near real-time analytics

In this section, we share the improvements regarding simplified ingestion and near real-time analytics that enable you to get faster insights over fresher data.

Zero-ETL integration with AWS databases and third-party enterprise applications

Amazon Redshift first launched zero-ETL integration between Amazon Aurora MySQL-Compatible Edition, enabling near real-time analytics on petabytes of transactional data from Aurora. This capability has since expanded to support Amazon Aurora PostgreSQL-Compatible Edition, Amazon Relational Database Service (Amazon RDS) for MySQL, and Amazon DynamoDB, and includes additional features such as data filtering to selectively extract tables and schemas using regular expressions, support for incremental and auto-refresh materialized views on replicated data, and configurable change data capture (CDC) refresh rates.

Building on this innovation, at re:Invent 2024, we launched support for zero-ETL integration with eight enterprise applications, specifically Salesforce, Zendesk, ServiceNow, SAP, Facebook Ads, Instagram Ads, Pardot, and Zoho CRM. With this new capability, you can efficiently extract and load valuable data from your customer support, relationship management, and Enterprise Resource Planning (ERP) applications directly into your Redshift data warehouse for analysis. This seamless integration eliminates the need for complex, custom ingestion pipelines for ingesting the data, accelerating time to insights.

General availability of auto-copy

Auto-copy simplifies data ingestion from Amazon S3 into Amazon Redshift. This new feature enables you to set up continuous file ingestion from your Amazon S3 prefix and automatically load new files to tables in your Redshift data warehouse without the need for additional tools or custom solutions.

Streaming ingestion from Confluent Managed Cloud and self-managed Apache Kafka clusters

Amazon Redshift now supports streaming ingestion from Confluent Managed Cloud and self-managed Apache Kafka clusters on Amazon EC2instances, expanding its capabilities beyond Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). With this update, you can ingest data from a wider range of streaming sources directly into your Redshift data warehouses for near real-time analytics use cases such as fraud detection, logistics monitoring and clickstream analysis.

Generative AI capabilities

In this section, we share the improvements generative AI capabilities.

Amazon Q generative SQL for Amazon Redshift

We announced the general availability of Amazon Q generative SQL for Amazon Redshift feature in the Redshift Query Editor. Amazon Q generative SQL boosts productivity by allowing users to express queries in natural language and receive SQL code recommendations based on their intent, query patterns, and schema metadata. The conversational interface enables users to get insights faster without extensive knowledge of the database schema. It leverages generative AI to analyze user input, query history, and custom context like table/column descriptions and sample queries to provide more relevant and accurate SQL recommendations. This feature accelerates the query authoring process and reduces the time required to derive actionable data insights.

Amazon Redshift integration with Amazon Bedrock

We announced integration of Amazon Redshift with Amazon Bedrock, enabling you to invoke large language models (LLMs) from simple SQL commands on your data in Amazon Redshift. With this new feature, you can now effortlessly perform generative AI tasks such as language translation, text generation, summarization, customer classification, and sentiment analysis on your Redshift data using popular foundation models (FMs) like Anthropic’s Claude, Amazon Titan, Meta’s Llama 2, and Mistral AI. You can invoke these models using familiar SQL commands, making it simpler than ever to integrate generative AI capabilities into your data analytics workflows.

Amazon Redshift as a knowledge base in Amazon Bedrock

Amazon Bedrock Knowledge Bases now supports natural language querying to retrieve structured data from your Redshift data warehouses. Using advanced natural language processing, Amazon Bedrock Knowledge Bases can transform natural language queries into SQL queries, allowing users to retrieve data directly from the source without the need to move or preprocess the data. A retail analyst can now simply ask “What were my top 5 selling products last month?”, and Amazon Bedrock Knowledge Bases automatically translates that query into SQL, runs the query against Redshift, and returns the results—or even provides a summarized narrative response. To generate accurate SQL queries, Amazon Bedrock Knowledge Bases uses database schema, previous query history, and other contextual information that is provided about the data sources.

Launch summary

Following is the launch summary which provides the announcement links and reference blogs for the key announcements.

Industry-leading price-performance:

Reference Blogs:

Seamless Lakehouse architectures:

Reference Blogs:

Simplified ingestion and near real-time analytics:

Reference Blogs:

Generative AI:

Reference Blogs:

Conclusion

We continue to innovate and evolve Amazon Redshift to meet your evolving data analytics needs. We encourage you to try out the latest features and capabilities. Watch the Innovations in AWS analytics: Data warehousing and SQL analytics session from re:Invent 2024 for further details. If you need any support, reach out to us. We are happy to provide architectural and design guidance, as well as support for proof of concepts and implementation. It’s Day 1!

About the Author

Neeraja Rentachintala is Director, Product Management with AWS Analytics, leading Amazon Redshift and Amazon SageMaker Lakehouse. Neeraja is a seasoned technology leader, bringing over 25 years of experience in product vision, strategy, and leadership roles in data products and platforms. She has delivered products in analytics, databases, data integration, application integration, AI/ML, and large-scale distributed systems across on-premises and the cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica, and Expedia.com

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

2023-11-29 Neeraja Rentachintala

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/amazon-redshift-announcements-at-aws-reinvent-2023-to-enable-analytics-on-all-your-data/

In 2013, Amazon Web Services revolutionized the data warehousing industry by launching Amazon Redshift, the first fully-managed, petabyte-scale, enterprise-grade cloud data warehouse. Amazon Redshift made it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This cloud service was a significant leap from the traditional data warehousing solutions, which were expensive, not elastic, and required significant expertise to tune and operate. Since then, customer demands for better scale, higher throughput, and agility in handling a wide variety of changing, but increasingly business critical analytics and machine learning use cases has exploded, and we have been keeping pace. Today, tens of thousands of customers use Amazon Redshift in AWS global infrastructure to collectively process exabytes of data daily and employs Amazon Redshift as a key component of their data architecture to drive use cases from typical dashboarding to self-service analytics, real-time analytics, machine learning, data sharing and monetization, and more

The advancements to Amazon Redshift announced at AWS re:Invent 2023 further accelerates modernization of your cloud analytics environments, keeping our core tenet to help you achieve the best price-performance at any scale. These announcements drive forward the AWS Zero-ETL vision to unify all your data, enabling you to better maximize the value of your data with comprehensive analytics and ML capabilities, and innovate faster with secure data collaboration within and across organizations. From price-performance improvements to zero-ETL, to generative AI capabilities, we have something for everyone. Let’s dive into the highlights.

Modernizing analytics for scale, performance, and reliability

“Our migration from legacy on-premises platform to Amazon Redshift allows us to ingest data 88% faster, query data 3x faster, and load daily data to the cloud 6x faster. Amazon Redshift enabled us to optimize performance, availability, and reliability—significantly easing operational complexity, while increasing the velocity of our end-users’ decision-making experience on the Fab floor.”

– Sunil Narayan, Sr Dir, Analytics at GlobalFoundries

Diligently driving the best price-performance at scale with new improvements

Since day 1, Amazon Redshift has been building innovative capabilities to help you get to optimum performance, while keeping costs lower. Amazon Redshift continues to lead on the price-performance front with up to 6x better price-performance than alternative cloud data warehouse and for dash boarding applications with high concurrency and low latency. We closely analyze query patterns in the fleet and look for opportunities to drive customer-focused innovation. For example, earlier in the year, we announced speed ups for string-based data processing up to 63x compared to alternative compression encodings such as LZO (Lempel-Ziv-Oberhumer) or ZStandard. At AWS re:Invent 2023, we introduced more performance enhancements in query planning and execution such as enhanced bloom filters , query rewrites, and support for write operations in auto scaling . For more information about performance improvement capabilities, refer to the list of announcements below.

Amazon Redshift Serverless is more intelligent than ever with new AI-driven scaling and optimizations

Speaking of price-performance, new next generation AI-driven scaling and optimizations capabilities in Amazon Redshift Serverless can deliver up to 10x better price-performance for variable workloads (based on internal testing), without manual intervention. Amazon Redshift Serverless, generally available since 2021, allows you to run and scale analytics without having to provision and manage the data warehouse. Since GA, Redshift Serverless executed over a billion queries to power data insights for thousands of customers. With these new AI optimizations, Amazon Redshift Serverless scales proactively and automatically with workload changes across all key dimensions —such as data volume, concurrent users, and query complexity. You just specify your desired price-performance targets to either optimize for cost or optimize for performance or balanced and serverless does the rest. Learn more about additional improvements in Redshift Serverless, under the list of announcements below.

Multi-data warehouse writes through data sharing

Data sharing is a widely adopted feature in Amazon Redshift with customers running tens of millions of queries on shared data daily. Customers share live transactionally consistent data within and across organizations and regions for read purposes without data copies or data movement. Customers are using data sharing to modernize their analytics architectures from monolithic architectures to multi-cluster, data mesh deployments that enable seamless and secure access across organizations to drive data collaboration and powerful insights. At AWS re:Invent 2023, we extended data sharing capabilities to launch multi-data warehouse writes in preview. You can now start writing to Redshift databases from other Redshift data warehouses in just a few clicks, further enabling data collaboration, flexible scaling of compute for ETL/data processing workloads by adding warehouses of different types and sizes based on price-performance needs. Experience greater transparency of compute usage as each warehouse is billed for its own compute and consequently keep your costs under control.

Multidimensional data layouts

Amazon Redshift offers industry leading predictive optimizations that continuously monitor your workloads and seamlessly accelerate performance and maximize concurrency by adjusting data layout and compute management as you use the data warehouse more. In addition to the powerful optimizations Redshift already offers, such as Automatic Table Sort, Automatic sort and distribution keys, we are introducing Multidimensional Data Layouts, a new powerful table sorting mechanism that improves performance of repetitive queries by automatically sorting data based on the incoming query filters (for example: Sales in a specific region). This method significantly accelerates the performance of table scans compared to traditional methods.

Unifying all your data with zero-ETL approaches

“Using the Aurora MySQL zero-ETL integration, we experience near real-time data synchronization between Aurora MySQL databases and Amazon Redshift, making it possible to build an analysis environment in just three hours instead of the month of developer time it used to take before”

– MoneyForward

JOYME uses Amazon Redshift’s streaming ingestion and other Amazon services for risk control over users’ financial activity such as recharge, refund, and rewards.

“With Redshift, we are able to view risk counterparts and data in near real time—
instead of on an hourly basis. Redshift significantly improved our business ROI efficiency.”

– PengBo Yang, CTO, JOYME

Data pipelines can be challenging and costly to build and manage and can create hours-long delays to obtain transactional data for analytics. These delays can lead to missed business opportunities, especially when the insights derived from analytics on transactional data are relevant for only a limited amount of time. Amazon Redshift employs AWS’s zero-ETL approach that enables interoperability and integration between the data warehouse and operational databases and even your streaming data services, so that the data is easily and automatically ingested into the warehouse for you, or you can access the data in place, where it lives.

Zero-ETL integrations with operational databases

We delivered zero-ETL integration between Amazon Aurora MySQL Amazon Redshift (general availability) this year, to enable near real-time analytics and machine learning (ML) using Amazon Redshift on petabytes of transactional data from Amazon Aurora. Within seconds of transactional data being written into Aurora, the data is available in Amazon Redshift, so you don’t have to build and maintain complex data pipelines to perform extract, transform, and load (ETL) operations. At AWS re:Invent, we extended zero-ETL integration to additional sources specifically Aurora PostgreSQL, Dynamo DB, and Amazon RDS MySQL. Zero-ETL integration also enables you to load and analyze data from multiple operational database clusters in a new or existing Amazon Redshift instance to derive holistic insights across many applications.

Data lake querying with support for Apache Iceberg tables

Amazon Redshift allows customers to run a wide range of workloads on data warehouse and data lakes using its support for various open file and table formats. At AWS re:Invent, we announced the general availability of support for Apache Iceberg tables, so you can easily access your Apache Iceberg tables on your data lake from Amazon Redshift and join it with the data in your data warehouse when needed. Use one click to access your data lake tables using auto-mounted AWS Glue data catalogs on Amazon Redshift for a simplified experience. We have improved data lake query performance by integrating with AWS Glue statistics and introduce preview of incremental refresh for materialized views on data lake data to accelerate repeated queries.

Learn more about the zero-ETL integrations, data lake performance enhancements, and other announcements below.

Maximize value with comprehensive analytics and ML capabilities

“Amazon Redshift is one of the most important tools we had in growing Jobcase as a company.”

– Ajay Joshi, Distinguished Engineer, Jobcase

With all your data integrated and available, you can easily build and run near real-time analytics to AI/ML/Generative AI applications. Here’s a couple of highlights from this week and for the full list, see below.

Amazon Q Generative SQL capability

Query Editor, an out-of-the-box web-based SQL experience in Amazon Redshift is a popular tool for data exploration, visual analysis, and data collaboration. At AWS re:Invent, we introduced Amazon Q Generative SQL capabilities in Amazon Redshift Query Editor (preview), to simplify query authoring and increase your productivity by allowing you to express queries in natural language and receive SQL code recommendations. Generative SQL uses AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly allowing you to get insights faster in a conversational format without extensive knowledge of your organization’s complex database metadata.

Amazon Redshift ML large language model (LLM) integration

Amazon Redshift ML enables customers to create, train, and deploy machine learning models using familiar SQL commands. Customers use Redshift ML to run an average of over 10 billion predictions a day within their data warehouses. At AWS re:Invent, we announced support for LLMs as preview. Now, you can use pre-trained open source LLMs in Amazon SageMaker JumpStart as part of Redshift ML, allowing you to bring the power of LLMs to analytics. For example, you can make inferences on your product feedback data in Amazon Redshift, use LLMs to summarize feedback, perform entity extraction, sentiment analysis and product feedback classification.

Innovate faster with secure data collaboration within and across the organizations

“Millions of companies use Stripe’s software and APIs to accept payments, send payouts, and manage their businesses online. Access to their Stripe data via leading data warehouses like Amazon Redshift has been a top request from our customers. Our customers needed secure, fast, and integrated analytics at scale without building complex data pipelines or moving and copying data around. With Stripe Data Pipeline for Amazon Redshift, we’re helping our customers set up a direct and reliable data pipeline in a few clicks. Stripe Data Pipeline enables our customers to automatically share their complete, up-to-date Stripe data with their Amazon Redshift data warehouse, and take their business analytics and reporting to the next level.”

– Tony Petrossian, Head of Engineering, Revenue & Financial Management at Stripe

With Amazon Redshift, you can easily and securely share data and collaborate no matter where your teams or data is located. And have the confidence that your data is secure no matter where you operate or how highly regulated your industries are. We have enabled fine grained permissions, an easy authentication experience with single sign-on for your organizational identity—all provided at no additional cost to you.

Unified identity with IAM identity center integration

We announced Amazon Redshift integration with AWS IAM Identity Center to enable organizations to support trusted identity propagation between Amazon QuickSight,, Amazon Redshift Query Editor, and Amazon Redshift, . Customers can use their organization identities to access Amazon Redshift in a single sign-on experience using third party identity providers (IdP), such as Microsoft Entra ID, Okta, Ping, OneLogin, etc. from Amazon QuickSight and Amazon Redshift Query Editor. Administrators can use third-party identity provider users and groups to manage fine grained access to data across services and audit user level access in AWS CloudTrail. With trusted identity propagation, a user’s identity is passed seamlessly between Amazon QuickSight, Amazon Redshift reducing time to insights and enabling a friction free analytics experience.

For the full set of announcements, see the following:

Modernizing analytics for scale, performance, and reliability
- What’s new – New AI-driven scaling and optimizations in Amazon Redshift Serverless (Preview)
- What’s new – Multi-data warehouse writes through data sharing (Preview)
- What’s new – Multi-Dimensional Data Layouts (Preview)
- What’s new – Support for Multi-AZ deployments in GA
- What’s new – Concurrency scaling now supports Create Table As Select
- Redshift price-performance improvements
Unifying all your data with zero-ETL approaches
- What’s new – Amazon Aurora PostgreSQL zero-ETL integration with Amazon Redshift (Preview)
- What’s new – Amazon RDS for MySQL zero-ETL integration with Amazon Redshift (Preview)
- What’s new – Amazon DynamoDB zero-ETL integration with Amazon Redshift (Preview)
- What’s new – Support for Apache Iceberg tables in GA
- What’s new – Incremental refreshes on materialized views (Preview)
- What’s new – Integration with AWS Glue column-level statistics
Maximize value with comprehensive analytics and ML capabilities
- What’s new – Amazon Q Generative SQL in Amazon Redshift (Preview)
- What’s new – Large Language Model support in Amazon Redshift (Preview)
- What’s new – AWS Glue support for multi engine views with AWS Analytics Engines
- What’s new – Integration with visual studio code
- What’s new – Autocomplete suggestions in Amazon Redshift Query Editor V2
Innovate faster with secure data collaboration within and across the organizations
- What’s new – Trusted Identity propagation with IAM Identity Center
- What’s new – New Fine grained access control capabilities (Preview)
- What’s new – Integration with secrets manager
- What’s new – Support for Canonical Name (CNAME)

Learn more: https://aws.amazon.com/redshift

About the authors

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned Product Management and GTM leader, bringing over 20 years of experience in product vision, strategy and leadership roles in data products and platforms. Neeraja delivered products in analytics, databases, data Integration, application integration, AI/Machine Learning, large scale distributed systems across On-Premise and Cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica and Expedia.com.

Sunaina AbdulSalah leads product marketing for Amazon Redshift. She focuses on educating customers about the impact of data warehousing and analytics and sharing AWS customer stories. She has a deep background in marketing and GTM functions in the B2B technology and cloud computing domains. Outside of work, she spends time with her family and friends and enjoys traveling.

Announcing Amazon Redshift data sharing (preview)

2020-12-09 Neeraja Rentachintala

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-data-sharing-preview/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to 3x better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as high-performance business intelligence (BI) reporting, dashboarding applications, data exploration, and real-time analytics.

We’re excited to launch the new Amazon Redshift data sharing capability, which enables you to securely and easily share live data across Amazon Redshift clusters. Data sharing allows you to extend the ease of use, performance, and cost benefits that Amazon Redshift offers in a single cluster to multi-cluster deployments while being able to share data. Data sharing enables instant, granular, and high-performance data access across Amazon Redshift clusters without the need to copy or move it. Data sharing provides live access to data so that your users always see the most up-to-date and consistent information as it’s updated in the data warehouse. There is no additional cost to use data sharing on your Amazon Redshift clusters.

In this post, we discuss the needs for data sharing in organizations, current challenges when it comes to sharing data, and how Amazon Redshift data sharing addresses these needs.

Need for sharing data in organizations and current challenges

We hear from our customers that they want to share data at many levels to enable broad and deep insights but also minimize complexity and cost. For example, data needs to be shared from a central data warehouse that loads and transforms constant streams of updates with BI and analytics clusters that serve a variety of workloads, such as dashboarding applications, ad-hoc queries, and data science. Multiple teams and business groups within an organization want to share and collaborate on data to gain differentiated insights that can help unlock new market opportunities, or analyze cross-group impact. As data becomes more valuable, many organizations are becoming data providers too and want to share data across organizations and offer analytics services to external consumers.

Before the launch of Amazon Redshift data sharing, in order to share data, you needed to unload data from one system (the producer) and copy it into another (the consumer). This approach can be expensive and introduce delays, especially as the number of consumers grow. It requires building and maintaining separate extract, transform, and load (ETL) jobs to provide relevant subsets of the data for each consumer. The complexity increases with the security practices that must be kept in place to regularly monitor business-critical data usage and ensure compliance. Additionally, this way of sharing doesn’t provide users with complete and up-to-date views of the data and limits insights. Organizations want a simple and secure way to share fresh, complete, and consistent views of data with any number of consumers.

Introduction to Amazon Redshift data sharing

Amazon Redshift data sharing allows you to securely and easily share data for read purposes across different Amazon Redshift clusters without the complexity and delays associated with data copies and data movement. Data can be shared at many levels, including schemas, tables, views, and user-defined functions, providing fine-grained access controls that can be tailored for different users and businesses that all need access to the data.

Consumers can discover shared data using standard SQL interfaces and query it with high performance from familiar BI and analytics tools. Users connect to an Amazon Redshift database in their cluster, and can perform queries by referring to objects from any other Amazon Redshift database that they have permissions to access, including the databases shared from remote clusters. Amazon Redshift enables sharing live and transactionally consistent views of the data, meaning that consumers always view the most up-to-date data, even when it’s continuously updated on the producer clusters. Amazon Redshift clusters that share data can be in the same or different AWS accounts, making it possible for you to share data across organizations and collaborate with external parties in a secure and governed manner. Amazon Redshift offers comprehensive auditing capabilities using system tables and AWS CloudTrail to allow you to monitor the data sharing permissions and the usage across all the consumers and revoke access instantly, when necessary.

Data sharing provides high-performance access and workload isolation

Data sharing builds on Amazon Redshift RA3 managed storage, which decouples storage and compute, allowing either of them to scale independently. With data sharing, workloads accessing shared data are isolated from each other. Queries accessing shared data run on the consumer cluster and read data from the Amazon Redshift managed storage layer directly without impacting the performance of the producer cluster.

You can now rapidly onboard any number of workloads with diverse data access patterns and SLA requirements and not be concerned about resource contention. Workloads accessing shared data can be provisioned with flexible compute resources that meet their workload-specific price performance requirements and be scaled independently as needed in a self-service fashion. You can optionally allow these teams and business groups to pay for what they use with charge-back. With data sharing, the producer cluster pays for the managed storage cost of the shared data, and the consumer cluster pays for the compute. Data sharing itself doesn’t have any cost associated with it.

Consumer clusters are regular Amazon Redshift clusters. Although consumer clusters can only read the shared data, they can write their own private data and if desired, share it with other clusters, including back to the producer cluster. Building on a managed storage foundation also enables high-performance access to the data. The frequently accessed datasets are cached on the local compute nodes of the consumer cluster to speed up those queries. The following diagram illustrates this data sharing architecture.

How data sharing works

Data sharing between Amazon Redshift clusters is a two-step process. First, the administrator of the producer cluster that wants to share data creates a data share, a new named object that serves as a unit of sharing. The administrator of the producer cluster and other privileged users add the needed database objects such as schemas, tables, views (regular, late binding, and materialized) to the data share and specify a list of consumers for sharing purposes. Consumers can be other Amazon Redshift clusters in the same AWS account or separate AWS accounts.

Next, the administrator of the consumer cluster looks at the shared datasets and reviews the contents of each share. To consume shared data, the consumer cluster administrator creates an Amazon Redshift database from the data share object and assigns permissions to appropriate users and groups in the consumer cluster. Users and groups that have access to shared data can discover and query it using standard SQL and analytics tools. You can join shared data with local data and perform cross-database queries. We have introduced new metadata views and modified JDBC/ODBC drivers so that tools can seamlessly integrate with shared data. The following diagram illustrates this data sharing architecture.

Data sharing use cases

In this section, we discuss the common data sharing use cases:

Sharing data from a central ETL cluster with multiple, isolated BI and analytics clusters in a hub-spoke architecture to provide read workload isolation and optional charge-back for costs.
Sharing data among multiple business groups so they can collaborate for broader analytics and data science. Each Amazon Redshift cluster can be a producer of some data but also can be a consumer of other datasets.
Sharing data in order to offer data and analytics as a service across the organization and with external parties.
Sharing data between development, test, and production environments, at any granularity.

Amazon Redshift offers unparalleled compute flexibility

Amazon Redshift offers you the most flexibility than any other cloud data warehouse when it comes to organizing your workloads based on price performance, isolation, and charge-ability. Amazon Redshift efficiently utilizes compute resources in a cluster to maximize performance and throughput with automatic workload management (WLM). Specifying query priorities allows you to influence the usage of resources across multiple workloads based on business priorities. With Amazon Redshift concurrency scaling, you can elastically scale one or more workloads in a cluster automatically with extra capacity to handle high concurrency and query spikes without any application changes. And with data sharing, you can now handle diverse business-critical workloads that need flexible compute resources, isolation, and charge-ability with multi-cluster deployments while sharing data. You can use WLM and concurrency scaling on both data sharing producer and consumer clusters.

The combination of these capabilities allows you to evolve from single cluster to multi-cluster deployments easily and cost-efficiently.

Beyond sharing data across Amazon Redshift clusters

Along with sharing data across Amazon Redshift clusters, our vision is to evolve data sharing so you can share Amazon Redshift data with the Amazon Simple Storage Service (Amazon S3) data lake by publishing data shares to the AWS Lake Formation catalog. This enables other AWS analytics services to discover and access live and transactionally consistent data in Amazon Redshift managed storage. For example, you could incorporate live data from Amazon Redshift in an Amazon EMR Spark data pipeline, perform ad-hoc queries on Amazon Redshift data using Amazon Athena, or use clean and curated data in Amazon Redshift to build machine learning models with Amazon SageMaker. The following diagram illustrates this architecture.

Amazon Redshift data sharing in customer environments

Data sharing across Amazon Redshift clusters within the same account is now available for preview. Sharing across clusters that are in different AWS accounts is coming soon. We worked closely with several customers and partners for an early preview of data sharing and they are excited about the results and possibilities.

“At Warner Bros. Games, we build and maintain complex data mobility infrastructures to manage data movements across single game clusters and consolidated business function clusters. However, developing and maintaining this system monopolizes valuable team resources and introduces delays that impede our ability to act on the data with agility and speed. Using the Redshift data sharing feature, we can remove the entire subsystem we built for data copying, movement, and loading between Redshift clusters. This will empower all of our business teams to make decisions on the right datasets more quickly and efficiently. Additionally, Redshift data sharing will also allow us to re-architect compute provisioning to more closely align with the resources needed to execute those functions’ SQL workloads—ultimately enabling simpler infrastructure operations.”

— Kurt Larson, Technical Director, Warner Bros. Analytics

“The data sharing feature seamlessly allows multiple Redshift clusters to query data located in our RA3 clusters and their managed storage. This eliminates our concerns with delays in making data available for our teams, and reduces the amount of data duplication and associated backfill headache. We now can concentrate even more of our time making use of our data in Redshift and enable better collaboration instead of data orchestration.”

— Steven Moy, Engineer, Yelp

“At Fannie Mae, we adopted a de-centralized approach to data warehouse management with tens of Amazon Redshift clusters across many applications. While each team manages their own dataset, we often have use cases where an application needs to query the datasets from other applications and join with the data available locally. We currently unload and move data from one cluster to another cluster, and this introduces delays in providing timely access to data to our teams. We have had issues with unload operations spiking resource consumption on producer clusters, and data sharing allows us to skip this intermediate unload to Amazon S3, saving time and lowering consumption. Many applications are performing unloads currently in order to share datasets, and we plan to convert all such processes to leveraging the new data sharing feature. With data sharing, we can enable seamless sharing of data across application teams and give them common views of data without having to do ETL. We are also able to avoid the data copies between pre-prod, research, and production environments for each application. Data sharing made us more agile and gave us the flexibility to scale analytics in highly distributed environments like Fannie Mae.”

— Amy Tseng, Enterprise Databases Manager, Fannie Mae

“Shared storage allowed us to focus on what matters: making data available to end-users. Data is no longer stuck in a myriad of storage mediums or formats, or accessible only through select APIs, but rather in a single flavor of SQL.”

— Marco Couperus, Engineering Manager, home24

“We’re excited to launch our integration with Amazon Redshift data sharing, which is a game changer for analytics teams that are looking to improve analytics performance SLAs and reduce data processing delays for diverse workloads across organizations. By isolating workloads that have different access patterns, different SLA requirements such as BI reporting, dashboarding applications, ETL jobs, and data science workloads into separate clusters, customers will now get more control over compute resources and more predictability in their workload SLAs while still sharing common data. The integrations of Etleap with Amazon Redshift data sharing will make sharing data between clusters seamless as part of their existing data pipelines. We are thrilled to offer this integrated experience to our joint customers.”

— Christian Romming, Founder and CEO, Etleap

“We’re excited to partner with AWS to enable data sharing for Amazon Redshift within Aginity Pro. Amazon Redshift users can now navigate and explore shared data within Aginity products just like local data within their clusters. Users can then build reusable analytics that combine both local and shared data seamlessly using cross-database queries. Importantly, shared data query performance has the same high performance as local queries without the cost and performance penalty of traditional federation solutions. We’re thrilled to see how our customers will leverage the ability to securely share data across clusters.”

— Matthew Mullins, CTO, Aginity

Next steps

The preview for within account data sharing is available on Amazon Redshift RA3 node types in the following regions:

US East (Ohio)
US East (N. Virginia)
US West (N. California)
US West (Oregon)
Asia Pacific (Seoul)
Asia Pacific (Sydney)
Asia Pacific (Tokyo)
Europe (Frankfurt)
Europe (Ireland)

For more information about how to get started with the preview, see documentation.

About the Authors

Neeraja Rentachintala is a principal product manager with Amazon Web Services who is focused on building Amazon Redshift – the world’s most popular, highest performing, and most scalable cloud data warehouse. Neeraja earned a Bachelor of Technology in Electronics and Communication Engineering from the National Institute of Technology in India and various business program certifications from the University of Washington, MIT Sloan School of Management, and Stanford University.

Ippokratis Pandis is a senior principal engineer at AWS. Ippokratis leads initiatives in AWS analytics and data lakes, especially in Amazon Redshift. He holds a Ph.D. in electrical engineering from Carnegie Mellon University.

Naresh Chainani is a senior software development manager with Amazon Redshift, where he leads Query Processing, Query Performance, Distributed Systems, and Workload Management with a strong team. He is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.

Get started with Amazon Redshift cross-database queries (preview)

2020-10-16 Neeraja Rentachintala

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/get-started-with-amazon-redshift-cross-database-queries-preview/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL, business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

We’re excited to announce the public preview of the new cross-database queries capability to query across databases in an Amazon Redshift cluster. In this post, we provide an overview of the cross-database queries and a walkthrough of the key functionality that allows you to manage data and analytics at scale in your organization.

What are cross-database queries?

With cross-database queries, you can seamlessly query data from any database in your Amazon Redshift cluster, regardless of which database you’re connected to. Cross-database queries eliminate data copies and simplify your data organization to support multiple business groups on the same cluster. Support for cross-database queries is available on Amazon Redshift RA3 node types.

Data is organized across multiple databases in Amazon Redshift clusters to support multi-tenant configurations. However, you often need to query and join across these datasets by allowing read access. For example, different business groups and teams that own and manage their datasets in a specific database in the data warehouse need to collaborate with other groups. You might want to perform common ETL staging and processing while your raw data is spread across multiple databases. Organizing data in multiple Amazon Redshift databases is also a common scenario when migrating from traditional data warehouse systems.

With cross-database queries, you can now access data from any database on the Amazon Redshift cluster without having to connect to that specific database. You can also join datasets from multiple databases in a single query. You can access database objects such as tables, views with a simple three-part notation of <database>.<schema>.<object>, and analyze the objects using business intelligence (BI) or analytics tools. You can continue to set up granular access controls for users with standard Amazon Redshift SQL commands and ensure that users can only see the relevant subsets of the data they have permissions for.

Walkthrough overview

In this post, we walk through an end-to-end use case to illustrate cross-database queries, comprising the following steps:

Set up permissions on the data.
Access data and perform several cross-database queries.
Connect from tools.

For this walkthrough, we use SQL Workbench, a SQL query tool, to perform queries on Amazon Redshift. For more information about connecting SQL Workbench to an Amazon Redshift cluster, see Connect to your cluster by using SQL Workbench/J .

Setting up permissions for cross-database queries

You can use standard Redshift SQL GRANT and REVOKE commands to configure appropriate permissions for users and groups. To configure permissions, we connect as an administrator to a database named TPCH_100G on an Amazon Redshift cluster that we set up with an industry standard dataset, TPC-H. You can set up this dataset in your environment using the code and scripts for this dataset on GitHub and the accompanying dataset hosted in a public Amazon Simple Storage Service (Amazon S3) bucket.

The following screenshot shows the configuration for your connection profile.

The TPCH_100G database consists of eight tables loaded in the schema PUBLIC, as shown in the following screenshot.

The following screenshot shows a test query on one of the TPC-H tables, customer.

The database administrator provides read permissions on the three of the tables, customer, orders, and lineitem, to an Amazon Redshift user called demouser. The user typically connects to and operates in their own team’s database TPCH_CONSUMERDB on the same Amazon Redshift cluster.

Performing cross-database queries using three-part notation

In this section, we see how cross-database queries work in action. With cross-database queries, you can connect to any database and query from all the other databases in the cluster without having to reconnect. In this use case, the user demouser connects to their database TPCH_CONSUMERDB (see the following screenshot).

While connected to TPCH_CONSUMERDB, demouser can also perform queries on the data in TPCH_100gG database objects that they have permissions to, referring to them using the simple and intuitive three-part notation TPCH_100G.PUBLIC.CUSTOMER (see the following screenshot).

You can refer to and query objects in any other database in the cluster using this <database>.<schema>.<object> notation as long as you have permissions to do so. The objects can be tables or views (including regular, late binding and materialized views).

In addition to performing queries on objects, you can create views on top of objects in other databases and apply granular access controls as relevant.

Joining data across databases

With cross-database queries, you can join datasets across databases. In the following screenshot, demouser queries and performs joins across the customer, lineitem, and orders tables in the TPCH_100G database.

You can also span joins on objects across databases. In the following query, demouser seamlessly joins the datasets from TPCH_100G (customer, lineitem, and orders tables) with the datasets in TPCH_CONSUMERDB (nation and supplier tables).

With cross-database queries, you get a consistent view of the data irrespective of the database you’re connected to.

Securely accessing relevant datasets by connecting from tools

To support the database hierarchy navigation and exploration introduced with cross-database queries, Amazon Redshift is introducing a new set of metadata views and modified versions of JDBC and ODBC drivers.

In addition, you can create aliases from one database to schemas in any other databases on the Amazon Redshift cluster. You create the aliases using the CREATE EXTERNAL SCHEMA command, which allows you to refer to the objects in cross-database queries with the two-part notation <external schema name>.<object>. For example, in the following screenshot, the database administrator connects to TPCH_CONSUMERDB and creates an external schema alias for the PUBLIC schema in TPC_100G database called TPC_100G_PUBLIC and grants the usage access on the schema to demouser.

Now, when demouser connects to TPCH_CONSUMERDB, they see the external schema in the object hierarchy (as in the following screenshot) with only the relevant objects that they have permissions to: CUSTOMER, LINEITEM, and ORDERS.

Now they can perform queries using the schema alias as if the data is local rather than using a three-part notation.

Summary and next steps

We provided you a glimpse into what you can accomplish with cross-database queries in Amazon Redshift. Cross-database queries allow you to organize and manage data across databases to effectively support multi-tenant data warehouse deployments for a wide variety of use cases. You can get started with your use case leveraging cross-database queries capability by trying out the preview. For more information, refer to the documentation cross-database queries.

About the Authors

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned Product Management and GTM leader, bringing over 20 years of experience in product vision, strategy and leadership roles in data products and platforms. Neeraja delivered products in analytics, databases, data Integration, application integration, AI/Machine Learning, large scale distributed systems across On-Premise and Cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica and Expedia.com.

Jenny Chen is a senior database engineer at Amazon Redshift focusing on all aspects of Redshift performance, like Query Processing, Concurrency, Distributed system, Storage, OS and many more. She works together with development team to ensure of delivering highest performance, scalable and easy-of-use database for customer. Prior to her career in cloud data warehouse, she has 10-year of experience in enterprise database DB2 for z/OS in IBM with focus on query optimization, query performance and system performance.

Sushim Mitra is a software development engineer on the Amazon Redshift query processing team. His interest areas are Query Optimization problems, SQL Language features and Database security. When not at work, he enjoys reading fiction from all over the world.

Suzhen Lin is a senior software development engineer on the Amazon Redshift transaction processing and storage team. Suzhen Lin has over 15 years of experiences in industry leading analytical database products including AWS Redshift, Gauss MPPDB, Azure SQL Data Warehouse and Teradata as senior architect and developer. Her experiences cover storage, transaction processing, query processing, memory/disk caching and etc in on-premise/cloud database management systems.

Industry-leading price-performance

Performance improvements

General availability of multi-data warehouse writes

General availability of AI-driven scaling and optimizations

Seamless Lakehouse architectures

General availability of Amazon SageMaker Lakehouse

Preview of Amazon SageMaker Unified Studio

Amazon Redshift SQL analytics on Amazon S3 Tables

Data lake query performance

Simplified ingestion and near real-time analytics

Zero-ETL integration with AWS databases and third-party enterprise applications

General availability of auto-copy

Streaming ingestion from Confluent Managed Cloud and self-managed Apache Kafka clusters

Generative AI capabilities

Amazon Redshift integration with Amazon Bedrock

Amazon Redshift as a knowledge base in Amazon Bedrock

Launch summary

Industry-leading price-performance:

Reference Blogs:

Seamless Lakehouse architectures:

Reference Blogs:

Simplified ingestion and near real-time analytics:

Reference Blogs:

Generative AI:

Reference Blogs:

Conclusion

About the Author

Modernizing analytics for scale, performance, and reliability

Diligently driving the best price-performance at scale with new improvements

Amazon Redshift Serverless is more intelligent than ever with new AI-driven scaling and optimizations

Multi-data warehouse writes through data sharing

Multidimensional data layouts

Unifying all your data with zero-ETL approaches

Zero-ETL integrations with operational databases

Data lake querying with support for Apache Iceberg tables

Maximize value with comprehensive analytics and ML capabilities

Amazon Q Generative SQL capability

Amazon Redshift ML large language model (LLM) integration

Innovate faster with secure data collaboration within and across the organizations

Unified identity with IAM identity center integration

For the full set of announcements, see the following:

Modernizing analytics for scale, performance, and reliability

Unifying all your data with zero-ETL approaches

Maximize value with comprehensive analytics and ML capabilities

Innovate faster with secure data collaboration within and across the organizations

About the authors

Need for sharing data in organizations and current challenges

Introduction to Amazon Redshift data sharing

Data sharing provides high-performance access and workload isolation

How data sharing works

Data sharing use cases

Amazon Redshift offers unparalleled compute flexibility

Beyond sharing data across Amazon Redshift clusters

Amazon Redshift data sharing in customer environments

Next steps

About the Authors

What are cross-database queries?

Walkthrough overview

Setting up permissions for cross-database queries

Performing cross-database queries using three-part notation

Joining data across databases

Securely accessing relevant datasets by connecting from tools

Summary and next steps

About the Authors

The collective thoughts of the interwebz