Tag Archives: Amazon Redshift

How Tophatter improved stability and lowered costs by migrating to Amazon Redshift RA3

Post Syndicated from Julien DeFrance original https://aws.amazon.com/blogs/big-data/how-tophatter-improved-stability-and-lowered-costs-by-migrating-to-amazon-redshift-ra3/

This is a guest post co-written by Julien DeFrance of Tophatter and Jordan Myers of Etleap. Tophatter is a mobile discovery marketplace that hosts live auctions for products spanning every major category. Etleap, an AWS Advanced Tier Data & Analytics partner, is an extract, transform, load, and transform (ETLT) service built for AWS.

As a company grows, it continually seeks out solutions that help its teams achieve better performance and scale of their data analytics, especially when business growth has eclipsed current capabilities. Migrating to a new architecture is often a key component of this. However, a migration path that is painless, flexible, and supported is not always available.

In this post, we walk through how Tophatter—a virtual auction house where buyers and sellers interact, chat, and transact in diverse categories—recently migrated from DS2 to RA3 nodes in Amazon Redshift. We highlight the steps they took, how they improved stability and lowered costs, and the lessons other companies can follow.

Tophatter’s data storage and ETL architecture

Tophatter stores the majority of their product data in MySQL databases, while sending some webhook and web events to Amazon Simple Storage Service (Amazon S3). Additionally, some of their vendors drop data directly into dedicated S3 buckets. Etleap integrates with these sources. Every hour (according to the schedule configured by Tophatter), Etleap extracts all the new data that has been added or changed in the source, transforms the new data according to the pipeline rules defined by the user in the UI, and loads the resulting data into Amazon Redshift.

Tophatter relies on Mode Analytics and Looker for data analysis, and uses Etleap’s model feature based on Amazon Redshift materialized views to persist the results of frequently used business intelligence (BI) queries. Tophatter configures the update schedule of the model to happen at defined times or when certain source tables have been updated with new data.
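
Under the hood, a model of this kind comes down to standard Amazon Redshift materialized view DDL that is refreshed on a schedule. The following is a minimal sketch of the pattern; the view, table, and column names are hypothetical, not Tophatter’s actual schema.

--persist a frequently used BI aggregation as a materialized view
CREATE MATERIALIZED VIEW mdl_daily_auction_stats AS
SELECT a.category,
       DATE_TRUNC('day', b.bid_time) AS bid_day,
       COUNT(*) AS bid_count,
       SUM(b.bid_amount) AS total_bid_amount
FROM auction_bids b
JOIN auctions a ON a.auction_id = b.auction_id
GROUP BY a.category, DATE_TRUNC('day', b.bid_time);

--refreshed on the configured schedule or after new source data lands;
--BI tools such as Mode and Looker query the view instead of recomputing the aggregation
REFRESH MATERIALIZED VIEW mdl_daily_auction_stats;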

Ultimately, these critical data pipelines fuel Tophatter dashboards that both internal analysts and users interact with.

The following diagram illustrates how Tophatter uses Etleap’s AWS-native extract, transform, and load (ETL) tool to ingest data from their operational databases, applications, and Amazon S3 into Amazon Redshift.

Company growth leads to data latency

Before the migration, Tophatter’s team operated four DS2 Reserved Instance (RI) nodes (ds2.xlarge) in Amazon Redshift; DS2 nodes use HDDs rather than faster SSDs.

As their user base expanded and online auction activity increased exponentially, Tophatter’s ETL needs grew. In response, Etleap seamlessly scaled to support their increased volume of ingestion pipelines and materialized data models. But Tophatter’s Amazon Redshift cluster—which they managed internally—wasn’t as easy to scale. When Amazon Redshift usage increased, Tophatter had to resize the cluster manually or reduce the frequency of certain analytics queries or models. Finding the optimal cluster size often required multiple iterations.

Due to the time-sensitive nature of data needed for live online auctions, Tophatter used automated monitoring to notify on-call engineers when data pipeline latency had exceeded the desired threshold. Latencies and errors began to pop up more frequently—at least once or twice a week. These events caused distress for the on-call engineers. When the issue couldn’t be resolved internally, they notified Etleap support, who typically recommended either canceling or reducing the frequency of certain long-running model queries.

While an issue was being resolved, the latencies caused downstream problems for the analytics team, such as certain tables falling out of sync with others, which led to incorrect query results.

Migrating to Amazon Redshift RA3

To improve stability and reduce engineering maintenance, Tophatter decided to migrate from DS2 to RA3 nodes. Amazon Redshift RA3 with managed storage is the latest generation node type and would allow Tophatter to scale compute and storage independently.

With DS2 nodes, there was pressure to offload or archive historical data to other storage because of fixed storage limits. RA3 nodes with managed storage are an excellent fit for analytics workloads that require high storage capacity, such as operational analytics, where the subset of data that’s most important continually evolves over time.

Moving to the RA3 instance type would also enable Tophatter to capitalize on the latest features of Amazon Redshift, such as AQUA (Advanced Query Accelerator), Data Sharing, Amazon Redshift ML, and cross-VPC support.

RA3 migration upgrade program

Tophatter understood the benefits of migrating to RA3, but worried that their 3-year DS2 Reserved Instances commitment would present a roadblock. They still had 2 years remaining in their agreement and were unsure if an option was available that would allow them to change course.

They learned about the AWS RA3 upgrade program from their AWS account team; the program helps customers convert their DS2 Reserved Instance commitments to RA3 without breaking the commitment agreement. This path enables you to seamlessly migrate from your legacy Amazon Redshift clusters to one of three RA3 node types: ra3.xlplus, ra3.4xlarge, or ra3.16xlarge.

Tophatter’s engagement with the program consisted of five steps:

  1. Engage with the AWS account team.
  2. Receive pricing information from the AWS account team.
  3. Schedule the migration.
  4. Purchase RA3 Reserved Instances.
  5. Submit a case to cancel their DS2 Reserved Instances.

Tophatter had the opportunity to evaluate three options for their RA3 migration:

  • Elastic resize – This is the most efficient way to change the instance type and update the nodes in your Amazon Redshift cluster. The cluster endpoint doesn’t change and the downtime during resize is minimal.
  • Snapshot and restore method – Choose the snapshot and restore method if elastic resize is unavailable (for example, because of a mismatch between slice and node count), or to minimize the amount of time during which you can’t write to your production database.
  • Classic resize – Choose the classic resize method if it’s the only option available. For single-node DS2 clusters, only a classic resize can be performed to convert the cluster into a multi-node cluster.

Achieving operational goals

Tophatter worked closely with their AWS account team for the migration and chose elastic resize due to the minimal downtime that option presented as well as prior experience using it. They completed the migration in under 2 hours, which included the requisite pre- and post-testing. As a pre-migration step, they took a snapshot of the cluster and were prepared to restore it if something went wrong.

After migrating to a two-node ra3.4xlarge cluster, the Tophatter team realized numerous benefits:

  • Up to 256 TB of Amazon Redshift managed storage for their cluster
  • Dramatically reduced latency of ingestion and data modeling
  • Control of the compute and storage capacities and costs, independently
  • Better system stability, with latency incidents requiring on-call engineer responses dropping to near zero

The following graph illustrates the average latency of a typical ingestion pipeline before and after the migration from DS2 to RA3.

Conclusion

When Tophatter set out to migrate their DS2 instances to RA3, they were unaware of the AWS RA3 upgrade program and its benefits. They were happy to learn that the program avoided the administrative overhead of getting approvals ahead of time for various specific configurations, and let them try several options in order to find a stable configuration.

Tophatter migrated from DS2 to RA3 without breaking their current commitment and enhanced their analytics to be agile and versatile. Now, Tophatter is aiming to realize greater scaling benefits to support its exponential growth by exploring new RA3 features such as:

  • Amazon Redshift Data Sharing – Provides instant, granular, high-performance data access without data copies or movement
  • Amazon Redshift ML – Allows you to create, train, and apply machine learning models using SQL commands in Amazon Redshift
  • AQUA – Provides a new distributed and hardware-accelerated cache that brings compute to the storage layer for Amazon Redshift and delivers up to 10 times faster query performance than other enterprise cloud data warehouses
  • Cross-VPC support for Amazon Redshift – With an Amazon Redshift-managed VPC endpoint, you can privately access your Amazon Redshift data warehouse within your VPC from your client applications in another VPC within the same AWS account, another AWS account, or running on-premises without using public IPs or requiring encrypted traffic to traverse the internet

We hope Tophatter’s migration journey can help other AWS customers reap the benefits from the AWS RA3 upgrade program from DS2 or DC2 cluster families to RA3. We believe this enables better performance and cost benefits while unlocking valuable new Amazon Redshift features.


About the Authors

Julien DeFrance is a Principal Software Engineer at Tophatter, based out of San Francisco. With a strong focus on backend and cloud infrastructure, he is part of the Logistics Engineering Squad, building and supporting integrations with third parties such as sellers, ERP systems, and carriers, and architecting and implementing solutions that help optimize cost efficiency and service quality. Julien holds two AWS Certifications (Cloud Practitioner, Solutions Architect – Associate).

Jordan Myers is an engineer at Etleap with 5 years of experience in programming ETL software. In addition to programming, he provides deep-level technical customer support and writes technical documentation.

Jobin George is a Big Data Solutions Architect with more than a decade of experience designing and implementing large-scale big data and analytics solutions. He provides technical guidance, design advice, and thought leadership to some of the key AWS customers and big data partners.

Maneesh Sharma is a Senior Database Engineer with Amazon Redshift. He works and collaborates with various Amazon Redshift Partners to drive better integration. In his spare time, he likes running, playing ping pong, and exploring new travel destinations.

How to Accelerate Building a Lake House Architecture with AWS Glue

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/how-to-accelerate-building-a-lake-house-architecture-with-aws-glue/

Customers are building databases, data warehouses, and data lake solutions in isolation from each other, each with its own separate data ingestion, storage, management, and governance layers. Often these disjointed efforts to build separate data stores end up creating data silos, data integration complexities, excessive data movement, and data consistency issues. These issues prevent customers from getting deeper insights. To overcome them and easily move data around, a Lake House approach on AWS was introduced.

In this blog post, we illustrate the AWS Glue integration components that you can use to accelerate building a Lake House architecture on AWS. We will also discuss how to derive persona-centric insights from your Lake House using AWS Glue.

Components of the AWS Glue integration system

AWS Glue is a serverless data integration service that facilitates the discovery, preparation, and combination of data. It can be used for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration, so you can start analyzing your data and putting it to use in minutes rather than months.

The following diagram illustrates the various components of the AWS Glue integration system.

Figure 1. AWS Glue integration components

Connect – AWS Glue allows you to connect to various data sources anywhere

Glue connector: AWS Glue provides built-in support for the most commonly used data stores. You can connect to Amazon Redshift, Amazon RDS, Amazon Aurora, Microsoft SQL Server, MySQL, MongoDB, or PostgreSQL using JDBC connections. AWS Glue also allows you to use custom JDBC drivers in your extract, transform, and load (ETL) jobs. For data stores that are not natively supported, such as SaaS applications, you can use connectors. You can also subscribe to several connectors offered in the AWS Marketplace.

Glue crawlers: You can use a crawler to populate the AWS Glue Data Catalog with tables. A crawler can crawl multiple data stores in a single pass. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
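
For example, after a crawler registers a table in the Data Catalog, you can immediately query it with standard SQL in Athena and reference it as a source or target in a Glue ETL job. The following is a minimal sketch; the database, table, and partition column names are hypothetical.

--query a crawler-created Data Catalog table over Parquet data in Amazon S3
SELECT order_id,
       customer_id,
       order_total
FROM sales_lake.orders
WHERE order_date = DATE '2021-06-01'   --partition column registered by the crawler
LIMIT 100;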

Catalog – AWS Glue simplifies data discovery and governance

Glue Data Catalog: The Data Catalog serves as the central metadata catalog for the entire data landscape.

Glue Schema Registry: The AWS Glue Schema Registry allows you to centrally discover, control, and evolve data stream schemas. With AWS Glue Schema Registry, you can manage and enforce schemas on your data streaming applications.

Data quality – AWS Glue helps you author and monitor data quality rules

Glue DataBrew: AWS Glue DataBrew allows data scientists and data analysts to clean and normalize data. You can use a visual interface, reducing the time it takes to prepare data by up to 80%. With Glue DataBrew, you can visualize, clean, and normalize data directly from your data lake, data warehouses, and databases.

Curate data: You can use either an AWS Glue development endpoint or AWS Glue Studio to curate your data.

An AWS Glue development endpoint is an environment that you can use to develop and test your AWS Glue scripts. You can choose either an Amazon SageMaker notebook or an Apache Zeppelin notebook as the environment.

AWS Glue Studio is a new visual interface for AWS Glue that supports extract, transform, and load (ETL) developers. You can author, run, and monitor AWS Glue ETL jobs, and use the visual interface to compose jobs that move and transform data and run them on AWS Glue.

AWS Data Exchange makes it easy for AWS customers to securely exchange and use third-party data in AWS. This is for data providers who want to structure their data across multiple datasets or enrich their products with additional data. You can publish additional datasets to your products using the AWS Data Exchange.

Deequ is an open-source data quality library developed internally at Amazon. It provides multiple features such as automatic constraint suggestions and verification, metrics computation, and data profiling.

Build a Lake House architecture faster, using AWS Glue

Figure 2 illustrates how you can build a Lake House using AWS Glue components.

Figure 2. Building Lake House architectures with AWS Glue

The architecture flow follows these general steps:

  1. Glue crawlers scan the data from various data sources and populate the Data Catalog for your Lake House.
  2. The Data Catalog serves as the central metadata catalog for the entire data landscape.
  3. Once data is cataloged, fine-grained access control is applied to the tables through AWS Lake Formation.
  4. Curate your data with business and data quality rules by using Glue Studio, Glue development endpoints, or Glue DataBrew. Place the transformed data in a curated Amazon S3 zone for purpose-built analytics downstream (see the Amazon Redshift Spectrum sketch after this list).
  5. Facilitate data movement with AWS Glue to and from your data lake, databases, and data warehouse by using Glue connections. Use AWS Glue Elastic Views to replicate the data across the Lake House.
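
As noted in step 4, after the curated data lands in Amazon S3 and is cataloged, Amazon Redshift can query it in place through Redshift Spectrum by mounting the Glue Data Catalog database as an external schema. The following is a minimal sketch; the IAM role, catalog database, and table names are hypothetical.

--expose a Glue Data Catalog database to Amazon Redshift as an external schema
CREATE EXTERNAL SCHEMA curated_lake
FROM DATA CATALOG
DATABASE 'curated_zone'
IAM_ROLE 'arn:aws:iam::<AWSAccount>:role/MySpectrumRole';

--query curated S3 data in place, joined with a local warehouse table
SELECT c.region,
       SUM(o.order_total) AS revenue
FROM curated_lake.orders o
JOIN dim_customer c ON c.customer_id = o.customer_id
GROUP BY c.region;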

Derive persona-centric insights from your Lake House using AWS Glue

Many organizations want to gather observations from increasingly larger volumes of acquired data. These insights help them make data-driven decisions with speed and agility. They must use a central data lake, a ring of purpose-built data services, and data warehouses based on persona or job function.

Figure 3 illustrates the Lake House inside-out data movement with AWS Glue DataBrew, Amazon Athena, Amazon Redshift, and Amazon QuickSight to perform persona-centric data analytics.

Figure 3. Lake House persona-centric data analytics using AWS Glue

This shows how Lake House components serve various personas in an organization:

  1. Data ingestion: Data is ingested into Amazon Simple Storage Service (Amazon S3) from different sources.
  2. Data processing: Data curators and data scientists use DataBrew to validate, clean, and enrich the data. Amazon Athena is also used to run ad hoc queries to analyze the data in the lake. The transformation is shared with data engineers to set up batch processing.
  3. Batch data processing: Data engineers or developers set up batch jobs in AWS Glue and AWS Glue DataBrew. Jobs can be initiated by an event, or can be scheduled to run periodically.
  4. Data analytics: Data and business analysts can now analyze the prepared datasets in Amazon Redshift or in Amazon S3 using Athena.
  5. Data visualizations: Business analysts can create visuals in QuickSight. Data curators can enrich data from multiple sources. Admins can enforce security and data governance. Developers can embed QuickSight dashboards in applications.

Conclusion

Using a Lake House architecture will help you get persona-centric insights quickly from all of your data based on user role or job function. In this blog post, we describe several AWS Glue components and AWS purpose-built services that you can use to build Lake House architectures on AWS. We have also presented persona-centric Lake House analytics architecture using AWS Glue, to help you derive insights from your Lake House.

Read more and get started on building Lake House Architectures on AWS.

How MEDHOST’s cardiac risk prediction successfully leveraged AWS analytic services

Post Syndicated from Pandian Velayutham original https://aws.amazon.com/blogs/big-data/how-medhosts-cardiac-risk-prediction-successfully-leveraged-aws-analytic-services/

MEDHOST has been providing products and services to healthcare facilities of all types and sizes for over 35 years. Today, more than 1,000 healthcare facilities are partnering with MEDHOST and enhancing their patient care and operational excellence with its integrated clinical and financial EHR solutions. MEDHOST also offers a comprehensive Emergency Department Information System with business and reporting tools. Since 2013, MEDHOST’s cloud solutions have been utilizing Amazon Web Services (AWS) infrastructure, data source, and computing power to solve complex healthcare business cases.

MEDHOST can utilize the data available in the cloud to provide value-added solutions for hospitals solving complex problems, like predicting sepsis, cardiac risk, and length of stay (LOS), as well as reducing re-admission rates. This requires a solid data lake foundation and an elastic data pipeline to keep up with multi-terabyte data from thousands of hospitals. MEDHOST has invested a significant amount of time evaluating numerous vendors to determine the best solution for its data needs. Ultimately, MEDHOST designed and implemented machine learning/artificial intelligence capabilities by leveraging AWS Data Lab and an end-to-end data lake platform that enables a variety of use cases such as data warehousing for analytics and reporting.


Getting started

MEDHOST’s initial objectives in evaluating vendors were to:

  • Build a low-cost data lake solution to provide cardiac risk prediction for patients based on health records
  • Provide an analytical solution for hospital staff to improve operational efficiency
  • Implement a proof of concept to extend to other machine learning/artificial intelligence solutions

The AWS team proposed AWS Data Lab to architect, develop, and test a solution to meet these objectives. The collaborative relationship between AWS and MEDHOST, along with AWS’s continuous innovation, excellent support, and technical solutions architects, helped MEDHOST select AWS over other vendors and products. AWS Data Lab’s well-structured engagement helped MEDHOST define clear, measurable success criteria that drove the implementation of the cardiac risk prediction and analytical solution platform. The MEDHOST team consisted of architects, builders, and subject matter experts (SMEs). By connecting MEDHOST experts directly to AWS technical experts, the MEDHOST team gained a quick understanding of industry best practices and available services, allowing the team to achieve most of the success criteria at the end of a four-day design session. MEDHOST is now in the process of moving this work from its lower environment to its upper environment to make the solution available for its customers.

Solution

For this solution, MEDHOST and AWS built a layered pipeline consisting of ingestion, processing, storage, analytics, machine learning, and reinforcement components. The following diagram illustrates the Proof of Concept (POC) that was implemented during the four-day AWS Data Lab engagement.

Ingestion layer

The ingestion layer is responsible for moving data from hospital production databases to the landing zone of the pipeline.

The hospital data was stored in an Amazon RDS for PostgreSQL instance and moved to the landing zone of the data lake using AWS Database Migration Service (AWS DMS). AWS DMS made migrating databases to the cloud simple and secure. Using its ongoing replication feature, MEDHOST and AWS implemented change data capture (CDC) quickly and efficiently, so the MEDHOST team could spend more time focusing on the most interesting parts of the pipeline.

Processing layer

The processing layer was responsible for performing extract, transform, and load (ETL) on the data to curate it for subsequent uses.

MEDHOST used AWS Glue within its data pipeline for crawling its data layers and performing ETL tasks. The hospital data copied from Amazon RDS to Amazon S3 was cleaned, curated, enriched, denormalized, and stored in Parquet format to act as the heart of the MEDHOST data lake and a single source of truth for any further data needs. During the four-day Data Lab, MEDHOST and AWS targeted two needs: powering MEDHOST’s data warehouse used for analytics and feeding training data to the machine learning prediction model. Data curation is a critical task that requires an SME, and there were multiple challenges along the way. AWS Glue’s serverless nature, along with the SME’s support during the Data Lab, made developing the required transformations cost-efficient and uncomplicated. Scaling and cluster management were handled by the service, which allowed the developers to focus on cleaning data coming from homogeneous hospital sources and translating the business logic to code.

Storage layer

The storage layer provided low-cost, secure, and efficient storage infrastructure.

MEDHOST used Amazon S3 as a core component of its data lake. AWS DMS migration tasks saved data to Amazon S3 in .csv format. Crawling the data with AWS Glue made this landing zone data queryable and available for further processing. The initial AWS Glue ETL job stored the Parquet-formatted data in the data lake’s curated zone bucket. MEDHOST also used Amazon S3 to store the .csv-formatted dataset that would be used to train, test, and validate its machine learning prediction model.

Analytics layer

The analytics layer gave MEDHOST pipeline reporting and dashboarding capabilities.

The data was in Parquet format and partitioned in the curated zone bucket populated by the processing layer. This made querying with Amazon Athena or Amazon Redshift Spectrum fast and cost-efficient.

From the Amazon Redshift cluster, MEDHOST created external tables that were used as staging tables for the MEDHOST data warehouse, and implemented UPSERT logic to merge new data into its production tables. To showcase the reporting potential unlocked by the MEDHOST analytics layer, a connection was made from the Amazon Redshift cluster to Amazon QuickSight. Within minutes, MEDHOST was able to create interactive analytics dashboards with filtering and drill-down capabilities, such as a chart that showed the number of confirmed disease cases per US state.
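
The post doesn’t show MEDHOST’s actual tables, but the UPSERT logic mentioned above is commonly implemented in Amazon Redshift as a staging-table merge: stage the new batch, delete the production rows it replaces, and insert the new versions, all in one transaction. The following is a minimal sketch of that pattern with hypothetical schema, table, and key names.

BEGIN;

--stage the latest batch locally by reading from the Spectrum external table
CREATE TEMP TABLE vitals_stage AS
SELECT * FROM spectrum_ext.patient_vitals_new;

--delete production rows that this batch replaces, then insert the new versions
DELETE FROM patient_vitals
USING vitals_stage
WHERE patient_vitals.encounter_id = vitals_stage.encounter_id;

INSERT INTO patient_vitals
SELECT * FROM vitals_stage;

COMMIT;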

Machine learning layer

The machine learning layer used MEDHOST’s existing data sets to train its cardiac risk prediction model and make it accessible via an endpoint.

Before getting into the Data Lab, the MEDHOST team was not intimately familiar with machine learning. AWS Data Lab architects helped MEDHOST quickly understand machine learning concepts and select a model appropriate for its use case. MEDHOST selected XGBoost as its model because cardiac risk prediction falls within the regression class of techniques. MEDHOST’s well-architected data lake enabled it to quickly generate training, testing, and validation datasets using AWS Glue.

Amazon SageMaker abstracted the underlying complexity of setting up infrastructure for machine learning. With a few clicks, MEDHOST started a Jupyter notebook and coded the components leading to fitting and deploying its machine learning prediction model. Finally, MEDHOST created the endpoint for the model and ran REST calls to validate the endpoint and trained model. As a result, MEDHOST achieved the goal of predicting cardiac risk. Additionally, with Amazon QuickSight’s SageMaker integration, AWS made it easy to use SageMaker models directly in visualizations. QuickSight can call the model’s endpoint, send the input data to it, and put the inference results into the existing QuickSight datasets. This capability makes it easy to display the results of the models directly in dashboards. Read more about QuickSight’s SageMaker integration here.

Reinforcement layer

Finally, the reinforcement layer guaranteed that the results of the MEDHOST model were captured and processed to improve performance of the model.

The MEDHOST team went beyond the original goal: they created an inference microservice to interact with the endpoint for prediction, abstracted the machine learning endpoint behind a well-defined domain REST endpoint, and added a standard security layer to the MEDHOST application.

When there is a real-time call from the facility, the inference microservice gets an inference from the SageMaker endpoint. Records containing input and inference data are fed to the data pipeline again. MEDHOST used Amazon Kinesis Data Streams to push records in real time. However, because retraining the machine learning model doesn’t need to happen in real time, Amazon Kinesis Data Firehose enabled MEDHOST to micro-batch records and efficiently save them to the landing zone bucket so that the data could be reprocessed.

Conclusion

Collaborating with AWS Data Lab enabled MEDHOST to:

  • Store a single source of truth with a low-cost storage solution (the data lake)
  • Build a complete data pipeline for a low-cost data analytics solution
  • Create almost production-ready code for cardiac risk prediction

The MEDHOST team learned many concepts related to data analytics and machine learning within four days. AWS Data Lab truly helped MEDHOST deliver results in an accelerated manner.


About the Authors

Pandian Velayutham is the Director of Engineering at MEDHOST. His team is responsible for delivering cloud solutions, integration and interoperability, and business analytics solutions. MEDHOST utilizes a modern technology stack to provide innovative solutions to its customers. Pandian Velayutham is a technology evangelist and public cloud technology speaker.

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas into production-ready data products. Before AWS, he spent 3 years as a data engineer in the Alexa Information domain. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

Cross-Account Data Sharing for Amazon Redshift

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/cross-account-data-sharing-for-amazon-redshift/

To be successful in today’s fast-moving world, businesses need to analyze data quickly and take meaningful action. Many of our customers embrace this concept to become data-driven organizations.

Data-driven organizations treat data as an asset and use it to improve their insights and make better decisions. They unleash the power of data by using secure systems to collect, store, and process data and share it with people in their organization. Some even offer their data and analytics as a service to their customers, partners, and external parties to create new revenue streams.

All stakeholders want to share and consume the same accurate data as a single source of truth. They want to be able to query live views of the data concurrently while experiencing no performance degradation and access the right information exactly when it’s needed.

Amazon Redshift, the first data warehouse built for the cloud, has become popular as the data warehouse component of many of our customers’ data architectures.

Amazon Redshift users have been able to share data with users in the same AWS account, but to share and collaborate on data with other AWS accounts, they needed to extract it from one system and load it into another.

There is a lot of manual work involved in building and maintaining the different extract, transform, and load jobs required to make this work. As your data sharing scales and more stakeholders need data, the complexity increases. As a result, it can become hard to maintain the monitoring, compliance, and security best practices required to keep your data safe.

This way of sharing does not provide complete and up-to-date views of the data, either, because the manual processes introduce delays and data inconsistencies that result in stale data, lower-quality business results, and slow responses to customers.

That’s why we created cross-account data sharing for Amazon Redshift.

Introducing Cross-Account Data Sharing for Amazon Redshift

This new feature gives you a simple and secure way to share fresh, complete, and consistent data in your Amazon Redshift data warehouse with any number of stakeholders across AWS accounts. It makes it possible for you to share data across organizations and collaborate with external parties while meeting compliance and security requirements.

Amazon Redshift offers comprehensive security controls and auditing capabilities using IAM integration, system tables and AWS CloudTrail. These allow customers to control and monitor data sharing permissions and usage across consumers and revoke access instantly when necessary.

You can share data at many levels, including databases, schemas, tables, views, columns, and user-defined functions, to provide fine-grained access controls tailored to users and businesses who need access to Amazon Redshift data.

Let’s take a look at how cross-account data sharing works.

Sharing Data Across Two Accounts

Cross-account data sharing is a two-step process. First, a producer cluster administrator creates a datashare, adds objects, and gives access to the consumer account. Second, the producer account administrator authorizes sharing data for the specified consumer. You can do this from the Amazon Redshift console.
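
The walkthrough below uses the Amazon Redshift console, but the same flow can be expressed in SQL. The following is a minimal sketch; the schema, table names, account IDs, and namespace are placeholders.

--on the producer cluster: create the datashare, add objects, and grant it to the consumer account
CREATE DATASHARE news_blog_datashare SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE news_blog_datashare ADD SCHEMA public;
ALTER DATASHARE news_blog_datashare ADD ALL TABLES IN SCHEMA public;
GRANT USAGE ON DATASHARE news_blog_datashare TO ACCOUNT '<consumer-account-id>';
--(the producer account administrator then authorizes the datashare, for example from the console)

--on the consumer cluster: create a local database from the datashare and query it
CREATE DATABASE shared_sales
FROM DATASHARE news_blog_datashare
OF ACCOUNT '<producer-account-id>' NAMESPACE '<producer-cluster-namespace>';

SELECT COUNT(*) FROM shared_sales.public.sales;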

To get started, in the Amazon Redshift console, I create an Amazon Redshift cluster and then import some sample data. When the cluster is available, I navigate to the cluster details page, choose the Datashares tab, and then choose Create datashare.


On the Create datashare page, I enter a datashare name and then choose a database. Under Publicly accessible, I choose Enable because I want the datashare to be shared with publicly accessible clusters.

I then choose the objects from the database I want to include in the datashare. I have granular control of what I choose to share with others. For simplicity, I will share all the tables. In practice, though, you might choose one or more tables, views, or user-defined functions.

The last thing I need to do is add an AWS account to the datashare. I add my second AWS account ID and then choose Create datashare.

To authorize the data consumer I just created, in the Datashares section of the console, I choose Authorize. The Consumer status will change from Pending authorization to Authorized. Now that the datashare is set up, I’ll switch to my secondary account to show you how to consume the datashare in the consumer AWS account. It’s important to note that I need to use the same Region in the secondary account, as cross-account data sharing does not work across Regions.

Similar to the producer, there is a process for consuming data. First, you need to associate the data share with one or more clusters in the consumer account. You can also associate the data share to the entire consumer account so that the current and future clusters in the consumer account get access to the share.

I sign in to my secondary account and go to the Datashares section of the console.  I choose the From other accounts tab and then select the news_blog_datashare that I shared from the producer AWS account. I then choose Associate to associate the datashare with a cluster in my account.

On the details page of the cluster, I choose Create database from datashare and then enter a name for my new database.

In the query editor, I select my database and run queries against all the objects that have been made available as part of the datashare.

When I choose Run, data is returned from the query. What’s important to remember is that this is a live view of the data. Any changes in the producer database will be reflected in my queries. No copying or manual transfers are required.

Things to Know

Here are a few interesting facts about cross-account data sharing:

Security – All of the permissions required for authorization and association are managed with AWS Identity and Access Management (IAM), so you can create IAM policies to control which operations each user can complete. For security considerations, see Controlling access for cross-account datashares.

Encryption – Both the producer and consumer clusters must be encrypted and in the same AWS Region.

Regions – Cross-account data sharing is available for all Amazon Redshift RA3 node types in US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), and South America (São Paulo).

Pricing – Cross-account data sharing is available across clusters that are in the same Region. There is no cost to share data. Customers just pay for the Redshift clusters that participate in sharing.

Try Cross-Account Data Sharing for Amazon Redshift today.

This new feature is available right now so why not create a cluster and take cross-account data sharing for a spin? For information about how to get started, see Sharing data across AWS accounts. Don’t forget to let me know how you get on.

Happy sharing!

— Martin

How Magellan Rx Management used Amazon Redshift ML to predict drug therapeutic conditions

Post Syndicated from Karim Prasla original https://aws.amazon.com/blogs/big-data/how-magellan-rx-management-used-amazon-redshift-ml-to-predict-drug-therapeutic-conditions/

This post is co-written with Karim Prasla and Deepti Bhanti from Magellan Rx Management as the lead authors.

Amazon Redshift ML makes it easy for data scientists, data analysts, and database developers to create, train, and use machine learning (ML) models using familiar SQL commands in Amazon Redshift data warehouses. The ML feature can be used by various personas; in this post we discuss how data analysts at Magellan Rx Management used Redshift ML to predict and classify drugs into various therapeutic conditions.

About Magellan Rx Management

Magellan Rx Management, a division of Magellan Health, Inc., is shaping the future of pharmacy. As a next-generation pharmacy organization, we deliver meaningful solutions to the people we serve. As pioneers in specialty drug management, industry leaders in Medicaid pharmacy programs, and disruptors in pharmacy benefit management, we partner with our customers and members to deliver a best-in-class healthcare experience.

Use case

Magellan Rx Management uses data and analytics to deliver clinical solutions that improve patient care, contain costs, and improve outcomes. We utilize predictive analytics to forecast future drugs costs, identify drugs that will drive future trends, including pipeline medications, and proactively identify patients at risk for becoming non-adherent to their medications.

We develop and deliver these analytics via our MRx Predict solution. This solution utilizes a variety of data, including pharmacy and medical claims, compendia information, census data, and many other sources, to optimize predictive model development and deployment and, most importantly, to maximize predictive accuracy.

Part of the process involves stratification of drugs and patient population based on therapeutic conditions. These conditions are derived by utilizing compendia drug attributes to come up with a classification system focused on disease categories, such as diabetes, hypertension, spinal muscular atrophy, and many others, and ensuring drugs are appropriately categorized as opposed to being included in the “miscellaneous” bucket, which is the case for many specialty medications. Also, this broader approach to classifying drugs, beyond just mechanism of action, has allowed us to effectively develop downstream reporting solutions, patient segmentation, and outcomes analytics.

Although we use different technologies for predictive analytics, we wanted to use Redshift ML to predict appropriate drug therapeutic conditions, because it would allow us to use this functionality within our Amazon Redshift data warehouse by making the predictions using standard SQL programming. Prior to Redshift ML, we had data analysts and clinicians manually categorize any new drugs into the appropriate therapeutic conditions. The end goal of using Redshift ML was to assess how well we could improve our operational efficiency while maintaining a high level of clinical accuracy.

Redshift ML allowed us to create models using standard SQL without having to use external systems, technologies, or APIs. With data already in Amazon Redshift, our data scientists and analysts, who are skilled in SQL, seamlessly created ML models to effectively predict therapeutic conditions for individual drugs.

“At Magellan Rx Management, we leverage data, analytics, and proactive insights, to help solve complex pharmacy challenges, while focusing on improving clinical and economic outcomes for customers and members,” said Karim Prasla, Vice President of Clinical Outcomes Advanced Analytics and Research at Magellan Rx Management. “We use predictive analytics and machine learning to improve operational and clinical efficiencies and effectiveness. With Redshift ML, we were able to enable our data and outcomes analysts to classify new drugs to market into appropriate therapeutic conditions by creating and utilizing ML models with minimal effort. The efficiency gained through leveraging Redshift ML to support this process improved our productivity and optimized our resources while generating a high degree of predictive accuracy.”

Magellan Rx Management continues to focus on using data analytics to identify opportunities to improve patient care and outcomes, to deliver proactive insights to our customers so sound data-driven decisions can be made, and to provide end-user tools that give customers and support staff readily available insights. By incorporating predictive analytics along with descriptive and diagnostic insights, we at Magellan Rx Management are able to partner with our customers to ensure we evaluate the “what,” the “why,” and the “what will” to more effectively manage the pharmacy programs.

Key benefits of using Redshift ML

Redshift ML enables you to train models with a single SQL CREATE MODEL command. The CREATE MODEL command creates a model that Amazon Redshift uses to generate model-based predictions with familiar SQL constructs.

Redshift ML provides simple, optimized, and secure integration between Amazon Redshift and Amazon SageMaker, and enables inference within the Amazon Redshift cluster, making it easy to use predictions generated by ML-based models in queries and applications. An analyst with SQL skills can easily create and train models with no expertise in ML programming languages, algorithms, and APIs.

With Redshift ML, you don’t have to perform any of the undifferentiated heavy lifting required for integrating with an external ML service. Redshift ML saves you the time to format and move data, manage permission controls, or build custom integrations, workflows, and scripts. You can easily use popular ML algorithms and simplify training needs that require frequent iteration from training to prediction. Amazon Redshift automatically discovers the best algorithm and tunes the best model for your problem. You can simply make predictions and predictive analytics from within your Amazon Redshift cluster without the need to move data out of Amazon Redshift.

Simplifying integration between Amazon Redshift and SageMaker

Before Redshift ML, a data scientist had to go through a series of steps with various tools to arrive at a prediction—identify the appropriate ML algorithms in SageMaker or use Amazon SageMaker Autopilot, export the data from the data warehouse, and prepare the training data to work with these models. When the model is deployed, the scientist goes through various iterations with new data for making predictions (also known as inference). This involves moving data back and forth between Amazon Redshift and SageMaker through a series of manual steps:

  1. Export training data to Amazon Simple Storage Service (Amazon S3).
  2. Train the model in SageMaker.
  3. Export prediction input data to Amazon S3.
  4. Generate predictions in SageMaker.
  5. Import the predictions back into the database.

The following diagram depicts these steps.

Amazon Redshift ML Architecture

This iterative process is time-consuming and prone to errors, and automating the data movement can take weeks or months of custom coding that then needs to be maintained.

Redshift ML enables you to use ML with your data in Amazon Redshift without this complexity. Without movement of data, you don’t have any overhead of additional security and governance of data that you export from your data warehouse.

Prerequisites

To take advantage of Redshift ML, you need an Amazon Redshift cluster with the ML feature enabled.

The cluster needs to have an AWS Identity and Access Management (IAM) role attached with sufficient privileges on Amazon S3 and SageMaker. For more information on setup and an introduction to Redshift ML, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML. The IAM role is required for Amazon Redshift to interact with SageMaker and Amazon S3. An S3 bucket is also required to export the training dataset and store other ML-related intermediate artifacts.

Setup and data preparation

We use a sample dataset for this use case; the last column mrx_therapeutic_condition is the one we’re going to predict using the multi-classification model. We ingest this health care drug dataset into the Amazon Redshift cluster, and use it to train and test the model.

To get randomness in the training dataset, we used a random record number column (record_nbr), which can be used to separate the training dataset from the testing dataset. If your dataset already has a unique or primary key column, you can use it as a predicate to separate the training data from the testing data. The random record_nbr column was added to this dataset to serve as a WHERE clause predicate in the SELECT statement of the CREATE MODEL command (discussed later in this post). A classic example: if your prediction column is True or False, you want to make sure that the training dataset has a balanced mix of both values, not just one of them. You can achieve this by using a neutral column as a predicate, which in our case is record_nbr. After loading the data, we validated that the column values are distributed evenly enough to train the model. A very important factor for getting better prediction accuracy is to train on a balanced dataset in which the distribution of values covers all characteristics.
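
The post doesn’t show the exact validation queries, but a simple per-class distribution check along these lines illustrates the idea (the table and columns come from the DDL and CREATE MODEL statements shown later in this post):

--check class balance in the training portion of the data
SELECT mrx_therapeutic_condition,
       COUNT(*) AS record_cnt
FROM drug_ml.ml_drug_data
WHERE record_nbr < 130001   --same training predicate used in CREATE MODEL
GROUP BY mrx_therapeutic_condition
ORDER BY record_cnt;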

In this example, we kept the entire dataset in a single table but used the WHERE predicate on the record_nbr column to differentiate the training and testing datasets.

The following are the DDL and COPY commands for this use case.

  1. Sign in to Amazon Redshift using the query editor or your preferred SQL client (DBeaver, DbVisualizer, SQL Workbench/J) to run the SQL statements to create the database tables.
  2. Create the schema and tables using the following DDL statements:
    --create schema
    CREATE SCHEMA drug_ml;
    
    --train table
    CREATE TABLE drug_ml.ml_drug_data
    (
    record_nbr int,
    -- feature columns (brand_nm, gnrc_nm, drug_prod_strgth, and so on) with their data types
    mrx_therapeutic_condition character varying(47)
    ) 
    DISTSTYLE EVEN;

  3. Use the COPY commands to prepare and load data for model training:
    --Load 
    COPY drug_ml.ml_drug_rs_raw
    FROM 's3://<s3bucket>/poc/drug_data_000.gz'
    IAM_ROLE 'arn:aws:iam::<AWSAccount>:role/RedshiftMLRole' 
    gzip;

Create model

Now that the data setup is done, let’s create the model in Redshift ML. The CREATE MODEL command runs in the background, and you can track its progress using SHOW MODEL model_name;.

For a data analyst who is not conversant with machine learning, the CREATE MODEL statement is a powerhouse that offers flexibility in the options used to create the model. In our use case, we haven’t specified the model type (multi-class classification) or any other options; based on the input data, the Redshift ML CREATE MODEL statement selects the correct problem type and other associated parameters using Autopilot. Multi-class classification is a problem type that predicts one of many outcomes, such as predicting a rating a customer might give for a product. In our example, we predict therapeutic conditions such as acne, anti-clotting therapy, asthma, or COPD for different drugs. Data scientists and ML experts can use it to perform supervised learning to tackle problems ranging from forecasting and personalization to customer churn prediction. See the following code:

--Create model using 130 K records (80% Training Data)
CREATE MODEL drug_ml.predict_model_drug_therapeutic_condition
FROM (
	SELECT *
	FROM drug_ml.ml_drug_data
	WHERE mrx_therapeutic_condition IS NOT NULL
		AND record_nbr < 130001
	) TARGET mrx_therapeutic_condition
	FUNCTION predict_drug_therapeutic_condition
	IAM_ROLE 'Replace with your IAM Role ARN'
	SETTINGS (S3_BUCKET 'Replace with your bucket name');

The target (label) in the preceding code is the mrx_therapeutic_condition column, which is used for prediction, and the function that is created out of this CREATE MODEL command is named predict_drug_therapeutic_condition.

The training data comes from the table ml_drug_data, and Autopilot automatically detects the multi-class classification problem type based on the input data.

Show model

The command SHOW MODEL model_name; is used to track progress and to get more details about a model (see the following example code). This section contains representative output from SHOW MODEL after CREATE MODEL is complete. It shows the inputs used in the CREATE MODEL statement along with additional information such as the model state (TRAINING, READY, FAILED) and the maximum estimated runtime to finish. For better prediction accuracy, we recommend letting the model train to completion by increasing the default timeout value (the default is relatively short). There is also a CREATE MODEL setting (MAX_RUNTIME) that lets you cap the maximum runtime at a lower number, but the more runtime CREATE MODEL has, the more it can train and iterate over candidate models before settling on the best one.

--to show the model

SHOW MODEL drug_ml.predict_model_drug_therapeutic_condition;

The following table shows our results.

Inference

For binary and multi-class classification problems, we compute the accuracy as the model metric. We use the following formula to determine the accuracy:

accuracy = (sum(actual == predicted) / total) * 100

We use the newly created function predict_drug_therapeutic_condition for the prediction and use the columns other than the target (label) as the input:

-- check accuracy
WITH infer_data AS (
SELECT mrx_therapeutic_condition AS label
,drug_ml.predict_drug_therapeutic_condition(record_nbr , brand_nm, gnrc_nm, drug_prod_strgth , gnrc_seq_num, lbl_nm , drug_dsge_fmt_typ_desc , cmpd_rte_of_admn_typ_desc , specif_thrputic_ingrd_clss_typ , hicl_seq_num, ndc_id , mfc_obsolete_dt , hcfa_termnatn_dt , specif_thrputic_ingrd_clss_desc, gpi14code, druggroup  , drugclass , drugsubclass, brandname) AS predicted,
CASE
WHEN label IS NULL
THEN 'N/A'
ELSE label
END AS actual,
CASE
WHEN actual = predicted
THEN 1::INT
ELSE 0::INT
END AS correct
FROM DRUG_ML.ML_DRUG_DATA),
aggr_data AS (
SELECT SUM(correct) AS num_correct,
COUNT(*) AS total
FROM infer_data)
SELECT (num_correct::FLOAT / total::FLOAT) AS accuracy FROM aggr_data;

--output of above query
accuracy
-------------------
0.975466558808845
(1 row)

The inference query output (0.9755 * 100 = 97.55%) matches the accuracy reported by the SHOW MODEL command.

Let’s run the prediction query on therapeutic condition to get the count of original vs. ML-predicted values for accuracy comparison:

--check prediction 
WITH infer_data
AS (
	SELECT mrx_therapeutic_condition
		,drug_ml.predict_drug_therapeutic_condition(m.record_nbr, m.brand_nm, m.gnrc_nm, m.drug_prod_strgth, m.gnrc_seq_num, Other Params) AS predicted
	FROM "DRUG_ML"."ML_DRUG_DATA" m
	WHERE mrx_therapeutic_condition IS NOT NULL
		AND record_nbr >= 130001
	)
SELECT CASE 
		WHEN mrx_therapeutic_condition = predicted
			THEN 'Match'
		ELSE 'Un-Match'
		END AS match_ind
	,count(1) AS record_cnt
FROM infer_data
GROUP BY 1;

The following table shows our output.

Conclusion

Redshift ML allows us to develop greater efficiency and enhanced ability to generate MRx therapeutic conditions, especially when new drugs come to market. Redshift ML provides an easy and seamless platform for database users to create, train, and tune models using a SQL interface. ML capabilities empower Amazon Redshift users to create models and deploy them locally on large datasets, which was an arduous task before. Different user personas, from data analysts to advanced data science and ML experts, can take advantage of Redshift ML for machine learning based predictions and gain valuable business insights.


About the Authors

Karim Prasla, Pharm D, MS, BCPS, is a clinician and a health economist focused on optimizing use of data, analytics, and insights to improve patient care and drive positive clinical and economic outcomes. Karim leads a talented team of clinicians, economists, health outcomes scientists, and data analysts at Magellan Rx Management who are passionate about leveraging data analytics to develop products and solutions that provide proactive insights to deliver value to clinicians, customers, and members.


Deepti Bhanti, PhD, is an IT leader at Magellan Rx Management with a focus on delivering big data solutions to meet the data and analytics needs for our customers. Deepti leads a full stack team of IT engineers who are passionate about architecting and building scalable solutions while driving business value and market differentiation through technological innovations.


Rajesh Francis is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build scalable analytic solutions. Rajesh worked with Magellan Rx Management to help implement a lake house architecture and leverage Redshift ML for various use cases.


Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs. Satish worked with Magellan Rx Management to help jump start ML adoption.


Debu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 25 years of experience in the IT world. He was responsible for driving the Amazon Redshift ML feature and working closely with Magellan Rx Management and other customers to incorporate their feedback into the service.

How Comcast uses AWS to rapidly store and analyze large-scale telemetry data

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/how-comcast-uses-aws-to-rapidly-store-and-analyze-large-scale-telemetry-data/

This blog post is co-written by Russell Harlin from Comcast Corporation.

Comcast Corporation creates incredible technology and entertainment that connects millions of people to the moments and experiences that matter most. At the core of this is Comcast’s high-speed data network, providing tens of millions of customers across the country with reliable internet connectivity. This mission has become more important now than ever.

This post walks through how Comcast used AWS to rapidly store and analyze large-scale telemetry data.

Background

At Comcast, we’re constantly looking for ways to gain new insights into our network and improve the overall quality of service. Doing this effectively can involve scaling solutions to support analytics across our entire network footprint. For this particular project, we wanted an extensible and scalable solution that could process, store, and analyze telemetry reports, one per network device every 5 minutes. This data would then be used to help measure quality of experience and determine where network improvements could be made.

Scaling big data solutions is always challenging, but perhaps the biggest challenge of this project was the accelerated timeline. With 2 weeks to deliver a prototype and an additional month to scale it, we knew we couldn’t go through the traditional bake-off of different technologies, so we had to go either with technologies we were comfortable with or with proven managed solutions.

For the data streaming pipeline, we already had the telemetry data coming in on an Apache Kafka topic, and had significant prior experience using Kafka combined with Apache Flink to implement and scale streaming pipelines, so we decided to go with what we knew. For the data storage and analytics, we needed a suite of solutions that could scale quickly, had plenty of support, and had an ecosystem of well-integrated tools to solve any problem that might arise. This is where AWS was able to meet our needs with technologies like Amazon Simple Storage Service (Amazon S3), AWS Glue, Amazon Athena, and Amazon Redshift.

Initial architecture

Our initial prototype architecture for the data store needed to be fast and simple so that we could unblock the development of the other elements of the budding telemetry solution. We needed three key things out of it:

  • The ability to easily fetch raw telemetry records and run more complex analytical queries
  • The capacity to integrate seamlessly with the other pieces of the pipeline
  • The possibility that it could serve as a springboard to a more scalable long-term solution

The first instinct was to explore solutions we used in the past. We had positive experiences with using NoSQL databases, like Cassandra, to store and serve raw data records, but it was clear these wouldn’t meet our need for running ad hoc analytical queries. Likewise, we had experience with more flexible RDBMSs, like PostgreSQL, for handling more complicated queries, but we knew that those wouldn’t scale to meet our requirement to store tens to hundreds of billions of rows. Therefore, any prototyping with one of these approaches would be considered throwaway work.

After moving on from these solutions, we quickly settled on using Amazon S3 with Athena. Amazon S3 provides low-cost storage with near-limitless scaling, so we knew we could store as much historical data as required and Athena would provide serverless, on-demand querying of our data. Additionally, Amazon S3 is known to be a launching pad to many other data store solutions both inside and outside the AWS ecosystem. This was perfect for the exploratory prototyping phase.

Integrating it into the rest of our pipeline would also prove simple. Writing the data to Amazon S3 from our Flink job was straightforward and could be done using the readily available Flink streaming file sink with an Amazon S3 bucket as the destination. When the data was available in Amazon S3, we ran AWS Glue to index our Parquet-formatted data and generate schemas in the AWS Glue metastore for searching using Athena with standard SQL.

The following diagram illustrates this architecture.

Using Amazon S3 and Athena allowed us to quickly unblock development of our Flink pipeline and ensure that the data being passed through was correct. Additionally, we used the AWS SDK to connect to Athena from our northbound Golang microservice and provide REST API access to our data for our custom web application. This allowed us to prove out an end-to-end solution with almost no upfront cost and very minimal infrastructure.
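
A quick boto3 sketch of that catalog-and-query step follows; the crawler, database, table, partition, and results-bucket names are placeholders rather than the actual ones used in this solution.

import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Crawl the Parquet output written by the Flink job so its schema lands in
# the AWS Glue Data Catalog (crawler and catalog names are placeholders).
glue.start_crawler(Name="telemetry-parquet-crawler")

# After the crawler has populated the catalog, query the data with Athena.
query = athena.start_query_execution(
    QueryString="SELECT device_id, report_time, metric_value "
                "FROM telemetry_reports WHERE dt = '2021-06-01' LIMIT 100",
    QueryExecutionContext={"Database": "telemetry_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(query["QueryExecutionId"])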

Updated architecture

As application and service development proceeded, it became apparent that Amazon Athena performed well for developers running ad hoc queries, but wasn't going to work as a long-term responsive backend for our microservices and user interface requirements.

One of the primary use cases of this solution was to look at device-level telemetry reports for a period of time and plot and track different aspects of their quality of experience. Because this most often involves solving problems happening in the now, we needed an improved data store for the most recent hot data.

This led us to Amazon Redshift. Amazon Redshift requires loading the data into a dedicated cluster and formulating a schema tuned for your use cases.

The following diagram illustrates this updated architecture.

Data loading and storage requirements

For loading and storing the data in Amazon Redshift, we had a few fundamental requirements:

  • Because our Amazon Redshift solution would be for querying data to troubleshoot problems happening as recent as the current hour, we needed to minimize the latency of the data load and keep up with our scale while doing it. We couldn’t live with nightly loads.
  • The pipeline had to be robust and recover from failures automatically.

There’s a lot of nuance that goes into making this happen, and we didn’t want to worry about handling these basic things ourselves, because this wasn’t where we were going to add value. Luckily, because we were already loading the data into Amazon S3, AWS Glue ETL satisfied these requirements and provided a fast, reliable, and scalable solution to do periodic loads from our Amazon S3 data store to our Amazon Redshift cluster.

A huge benefit of AWS Glue ETL is that it provides many opportunities to tune your ETL pipeline to meet your scaling needs. One of our biggest challenges was that we write multiple files to Amazon S3 from different regions every 5 minutes, which results in many small files. If you’re doing infrequent nightly loads, this may not pose a problem, but for our specific use case, we wanted to load data at least every hour and multiple times an hour if possible. This required some specific tuning of the default ETL job:

  • Amazon S3 list implementation – This allows the Spark job to handle files in batches and optimizes reads for a large number of files, preventing out-of-memory issues.
  • Pushdown predicates – This tells the load to skip listing any partitions in Amazon S3 that you know won't be a part of the current run. For frequent loads, this can mean skipping a lot of unnecessary file listing during each job run.
  • File grouping – This groups files together in batches when reading from Amazon S3, which greatly improves performance when reading a large number of small files.
  • AWS Glue 2.0 – When we were starting our development, only AWS Glue 1.0 was available, and we'd frequently see Spark cluster start times of over 10 minutes. This becomes problematic if you want to run the ETL job more frequently, because you have to account for the cluster startup time in your trigger timings. When AWS Glue 2.0 came out, those start times consistently dropped to under 1 minute and became an afterthought.

With these tunings, as well as increasing the parallelism of the job, we could meet our requirement of loading data multiple times an hour. This made relevant data available for analysis sooner.
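
For illustration, the following minimal AWS Glue (PySpark) sketch shows how these options can be combined in a job; the database, table, partition columns, connection name, and values are placeholders, and the right tuning values depend on your own workload.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read only the partitions in the current load window (pushdown predicate),
# use the Amazon S3 list implementation for large listings, and group small
# files into larger read tasks. All names and values are placeholders.
telemetry = glue_context.create_dynamic_frame.from_catalog(
    database="telemetry_db",
    table_name="telemetry_reports",
    push_down_predicate="dt = '2021-06-01' and hour = '13'",
    additional_options={
        "useS3ListImplementation": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # roughly 128 MB per read group
    },
)

# Load the batch into Amazon Redshift through a Glue connection (placeholder names).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=telemetry,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "telemetry_reports", "database": "telemetry"},
    redshift_tmp_dir="s3://example-glue-temp/",
)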

Modeling, distributing, and sorting the data

Aside from getting the data into the Amazon Redshift cluster in a timely manner, the next consideration was how to model, distribute, and sort the data when it was in the cluster. For our data, we didn’t have a complex setup with tens of tables requiring extensive joins. We simply had two tables: one for the device-level telemetry records and one for records aggregated at a logical grouping.

The bulk of the initial query load would be centered around serving raw records from these tables to our web application. These types of raw record queries aren’t difficult to handle from a query standpoint, but do present challenges when dealing with tens of millions of unique devices and a report granularity of 5 minutes. So we knew we had to tune the database to handle these efficiently. Additionally, we also needed to be able to run more complex ad hoc queries, like getting daily summaries of each table so that higher-level problem areas could be more easily tracked and spotted in the network. These queries, however, were less time sensitive and could be run on an ad hoc, batch-like basis where responsiveness wasn’t as important.

The schema fields themselves were more or less one-to-one mappings from the respective Parquet schemas. The challenge came, however, in picking partition keys and sorting columns. For partition keys, we identified a logical device grouping column present in both our tables as the one column we were likely to join on. This seemed like a natural fit to partition on and had good enough cardinality that our distribution would be adequate.

For the sorting keys, we knew we'd be searching by the device identifier and the logical grouping for the respective tables, and we knew we'd be searching temporally. So the primary identifier column of each table and the timestamp made sense to sort on. The documented sort key order suggestion was to use the timestamp column as the first value in the sort key, because it could provide dataset filtering on a specific time period. This initially worked well enough and we were able to get a performance improvement over Athena, but as we scaled and added more data, our raw record retrieval queries were rapidly slowing down. To help with this, we made two adjustments.

The first adjustment came with a change to the sort key. The first part of this involved swapping the order of the timestamp and the primary identifier column. This allowed us to filter down to the device and then search through the range of timestamps on just that device, skipping over all irrelevant devices. This provided significant performance gains and cut our raw record query times by several multiples. The second part of the sort key adjustment involved adding another column (a node-level identifier) to the beginning of the sort key. This allowed us to have one more level of data filtering, which further improved raw record query times.
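
As a rough sketch of the resulting physical design, the following submits DDL of this shape through the Amazon Redshift Data API; the cluster, table, and column names are hypothetical stand-ins, not the actual schema.

import boto3

redshift_data = boto3.client("redshift-data")

# Distribute on the logical device grouping and sort by node identifier,
# then device identifier, then timestamp, mirroring the adjustments above.
ddl = """
CREATE TABLE device_telemetry (
    node_id       VARCHAR(64),
    device_group  VARCHAR(64),
    device_id     VARCHAR(64),
    report_time   TIMESTAMP,
    metric_value  DOUBLE PRECISION
)
DISTKEY (device_group)
COMPOUND SORTKEY (node_id, device_id, report_time);
"""

redshift_data.execute_statement(
    ClusterIdentifier="telemetry-cluster",  # placeholder cluster name
    Database="telemetry",
    DbUser="etl_user",
    Sql=ddl,
)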

One trade-off made while making these sort key adjustments was that our more complex aggregation queries had a noticeable decline in performance. This was because they were typically run across the entire footprint of devices and could no longer filter as easily based on time being the first column in the sort key. Fortunately, because these were less frequent and could be run offline if necessary, this performance trade-off was considered acceptable.

If the frequency of these workloads increases, we can use materialized views in Amazon Redshift, which can help avoid unnecessary reruns of the complex aggregations if minimal-to-no data changes in the underlying base tables have occurred since the last run.
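
A minimal sketch of such a materialized view, again with hypothetical names, using auto refresh so the precomputed aggregate is kept up to date as new data lands in the base table:

import boto3

redshift_data = boto3.client("redshift-data")

# Precompute a daily summary so ad hoc aggregation queries can reuse the
# stored result instead of rescanning the raw records (placeholder names).
mv_sql = """
CREATE MATERIALIZED VIEW daily_device_summary
AUTO REFRESH YES
AS
SELECT device_group,
       DATE_TRUNC('day', report_time) AS report_day,
       AVG(metric_value) AS avg_metric,
       COUNT(*) AS report_count
FROM device_telemetry
GROUP BY device_group, DATE_TRUNC('day', report_time);
"""

redshift_data.execute_statement(
    ClusterIdentifier="telemetry-cluster", Database="telemetry",
    DbUser="etl_user", Sql=mv_sql,
)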

The final adjustment was cluster scaling. We chose to use the Amazon Redshift next-generation RA3 nodes for a number of benefits, three of which were especially key:

  • RA3 clusters allow for practically unlimited storage in our cluster.
  • The RA3 ability to scale storage and compute independently paired really well with our expectations and use cases. We fully expected our Amazon Redshift storage footprint to continue to grow, as well as the number, shape, and sizes of our use cases and users, but data and workloads wouldn’t necessarily grow in lockstep. Being able to scale the cluster’s compute power independent of storage (or vice versa) was a key technical requirement and cost-optimization for us.
  • RA3 clusters come with Amazon Redshift managed storage, which places the burden on Amazon Redshift to automatically situate data based on its temperature for consistently peak performance. With managed storage, hot data was cached on a large local SSD cache in each node, and cold data was kept in the Amazon Redshift persistent store on Amazon S3.

After conducting performance benchmarks, we determined that our cluster was under-powered for the amount of data and workloads it was serving, and we would benefit from greater distribution and parallelism (compute power). We easily resized our Amazon Redshift cluster to double the number of nodes within minutes, and immediately saw a significant performance boost. With this, we were able to recognize that as our data and workloads scaled, so too should our cluster.
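
That resize can also be scripted; a minimal boto3 sketch with a placeholder cluster identifier and node count:

import boto3

redshift = boto3.client("redshift")

# Elastic resize to double the node count; identifier and sizes are placeholders.
redshift.resize_cluster(
    ClusterIdentifier="telemetry-cluster",
    NodeType="ra3.4xlarge",
    NumberOfNodes=8,   # for example, from 4 to 8 nodes
    Classic=False,     # elastic resize, which completes in minutes
)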

Looking forward, we expect a relatively small population of ad hoc and experimental workloads that will require access to additional datasets sitting in our data lake, outside of Amazon Redshift, similar to the Athena workloads we previously observed. To serve that small customer base, we can leverage Amazon Redshift Spectrum, which empowers users to run SQL queries on external tables in our data lake just as they would on any other table within Amazon Redshift, while allowing us to keep costs as lean as possible.
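
A minimal sketch of that Redshift Spectrum pattern, with a hypothetical external schema, IAM role, and table name:

import boto3

redshift_data = boto3.client("redshift-data")

# Expose the AWS Glue Data Catalog database as an external schema, then
# query the data lake table like any other table (placeholder names).
statements = [
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
    FROM DATA CATALOG
    DATABASE 'telemetry_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role';
    """,
    """
    SELECT device_group, COUNT(*) AS lake_reports
    FROM lake.telemetry_reports
    GROUP BY device_group;
    """,
]

for sql in statements:
    redshift_data.execute_statement(
        ClusterIdentifier="telemetry-cluster", Database="telemetry",
        DbUser="analyst", Sql=sql,
    )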

This final architecture provided us with the solid foundation of price, performance, and flexibility for our current set of analytical use cases—and, just as important, the future use cases that haven’t shown themselves yet.

Summary

This post details how Comcast leveraged AWS data store technologies to prototype and scale the serving and analysis of large-scale telemetry data. We hope to continue to scale the solution as our customer base grows. We’re currently working on identifying more telemetry-related metrics to give us increased insight into our network and deliver the best quality of experience possible to our customers.


About the Authors

Russell Harlin is a Senior Software Engineer at Comcast based out of the San Francisco Bay Area. He works in the Network and Communications Engineering group designing and implementing data streaming and analytics solutions.

 

 

Asser Moustafa is an Analytics Specialist Solutions Architect at AWS based out of Dallas, Texas. He advises customers in the Americas on their Amazon Redshift and data lake architectures and migrations, starting from the POC stage to actual production deployment and maintenance.

 

Amit Kalawat is a Senior Solutions Architect at Amazon Web Services based out of New York. He works with enterprise customers as they transform their business and journey to the cloud.

Benchmark the performance of the new Auto WLM with adaptive concurrency in Amazon Redshift

Post Syndicated from Raj Sett original https://aws.amazon.com/blogs/big-data/benchmarking-the-performance-of-the-new-auto-wlm-with-adaptive-concurrency-in-amazon-redshift/

With Amazon Redshift, you can run a complex mix of workloads on your data warehouse clusters. For example, frequent data loads run alongside business-critical dashboard queries and complex transformation jobs. We also see more and more data science and machine learning (ML) workloads. Each workload type has different resource needs and different service level agreements. How does Amazon Redshift give you a consistent experience for each of your workloads? Amazon Redshift workload management (WLM) helps you maximize query throughput and get consistent performance for the most demanding analytics workloads, all while optimally using the resources of your existing cluster.

Amazon Redshift has recently made significant improvements to automatic WLM (Auto WLM) to optimize performance for the most demanding analytics workloads. With the release of Amazon Redshift Auto WLM with adaptive concurrency, Amazon Redshift can now dynamically predict and allocate the amount of memory to queries needed to run optimally. Amazon Redshift dynamically schedules queries for best performance based on their run characteristics to maximize cluster resource utilization.

In this post, we discuss what’s new with WLM and the benefits of adaptive concurrency in a typical environment. We synthesized a mixed read/write workload based on TPC-H to show the performance characteristics of a workload with a highly tuned manual WLM configuration versus one with Auto WLM. In this experiment, Auto WLM configuration outperformed manual configuration by a great margin. From a throughput standpoint (queries per hour), Auto WLM was 15% better than the manual workload configuration. Overall, we observed 26% lower average response times (runtime + queue wait) with Auto WLM.

What’s new with Amazon Redshift WLM?

Workload management allows you to route queries to a set of defined queues to manage the concurrency and resource utilization of the cluster. Today, Amazon Redshift has both automatic and manual configuration types. With manual WLM configurations, you're responsible for defining the amount of memory allocated to each queue and the maximum number of queries, each of which gets a fraction of that memory, that can run in each of their queues. Manual WLM configurations don't adapt to changes in your workload and require an intimate knowledge of your queries' resource utilization to get right. Amazon Redshift Auto WLM doesn't require you to define the memory utilization or concurrency for queues. Auto WLM adjusts the concurrency dynamically to optimize for throughput. Optionally, you can define queue priorities to give queries preferential resource allocation based on your business priority.

Auto WLM also provides powerful tools to let you manage your workload. Query priorities let you define priorities for workloads so they can get preferential treatment in Amazon Redshift, including more resources during busy times for consistent query performance, and query monitoring rules offer ways to manage unexpected situations like detecting and preventing runaway or expensive queries from consuming system resources.
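
As an illustration of what such a configuration can look like, the following sketch applies an automatic WLM definition with a high-priority query group to a cluster parameter group; the parameter group and query group names are placeholders, and the exact JSON shape should be verified against the Amazon Redshift WLM documentation before use.

import json
import boto3

redshift = boto3.client("redshift")

# Illustrative automatic WLM definition: a high-priority queue for queries
# tagged with the "dashboard" query group, a normal-priority default queue,
# and short query acceleration enabled. Verify the key names against the docs.
wlm_config = [
    {"user_group": [], "query_group": ["dashboard"], "priority": "high",
     "queue_type": "auto", "auto_wlm": True},
    {"user_group": [], "query_group": [], "priority": "normal",
     "queue_type": "auto", "auto_wlm": True},
    {"short_query_queue": True},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="auto-wlm-params",  # placeholder parameter group
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)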

Our initial release of Auto WLM in 2019 greatly improved the out-of-the-box experience and throughput for the majority of customers. However, in a small number of situations, some customers with highly demanding workloads had developed highly tuned manual WLM configurations for which Auto WLM didn’t demonstrate a significant improvement. Over the past 12 months, we worked closely with those customers to enhance Auto WLM technology with the goal of improving performance beyond the highly tuned manual configuration. One of our main innovations is adaptive concurrency. With adaptive concurrency, Amazon Redshift uses ML to predict and assign memory to the queries on demand, which improves the overall throughput of the system by maximizing resource utilization and reducing waste.

Electronic Arts, Inc. is a global leader in digital interactive entertainment. EA develops and delivers games, content, and online services for internet-connected consoles, mobile devices, and personal computers. EA has more than 300 million registered players around the world. Electronic Arts uses Amazon Redshift to gather player insights and has immediately benefited from the new Amazon Redshift Auto WLM.

By adopting Auto WLM, our Amazon Redshift cluster throughput increased by at least 15% on the same hardware footprint. Our average concurrency increased by 20%, allowing approximately 15,000 more queries per week now. All this with marginal impact to the rest of the query buckets or customers. Because Auto WLM removed hard walled resource partitions, we realized higher throughput during peak periods, delivering data sooner to our game studios.

– Alex Ignatius, Director of Analytics Engineering and Architecture for the EA Digital Platform.

Benefits of Amazon Redshift Auto WLM with adaptive concurrency

Amazon Redshift has implemented an advanced ML predictor to predict the resource utilization and runtime for each query. The model continuously receives feedback about prediction accuracy and adapts for future runs. Higher prediction accuracy means resources are allocated based on query needs. This allows for higher concurrency of light queries and more resources for intensive queries. The latter leads to improved query and cluster performance because less temporary data is written to storage during a complex query’s processing. A unit of concurrency (slot) is created on the fly by the predictor with the estimated amount of memory required, and the query is scheduled to run. If you have a backlog of queued queries, you can reorder them across queues to minimize the queue time of short, less resource-intensive queries while also ensuring that long-running queries aren’t being starved. We also make sure that queries across WLM queues are scheduled to run both fairly and based on their priorities.

The following are key areas of Auto WLM with adaptive concurrency performance improvements:

  • Proper allocation of memory – Reduction of over-allocation of memory creates more room for other queries to run and increases concurrency. Additionally, reduction of under-allocation reduces spill to disk and therefore improves query performance.
  • Elimination of static partitioning of memory between queues – This frees up the entire available memory, which is then available for queries.
  • Improved throughput – You can pack more queries into the system due to more efficient memory utilization.

The following diagram shows how a query moves through the Amazon Redshift query run path to take advantage of the improvements of Auto WLM with adaptive concurrency.

Benchmark test

To assess the efficiency of Auto WLM, we designed the following benchmark test. It’s a synthetic read/write mixed workload using TPC-H 3T and TPC-H 100 GB datasets to mimic real-world workloads like ad hoc queries for business analysis.

In this modified benchmark test, the set of 22 TPC-H queries was broken down into three categories based on their run timings. The shortest queries were categorized as DASHBOARD, medium ones as REPORT, and the longest-running queries were marked as the DATASCIENCE group. The DASHBOARD queries were pointed to a smaller TPC-H 100 GB dataset to mimic a datamart set of tables. The COPY jobs loaded a TPC-H 100 GB dataset on top of the existing TPC-H 3T dataset tables. The REPORT and DATASCIENCE queries were run against the larger TPC-H 3T dataset as if those were ad hoc and analyst-generated workloads against a larger dataset. Also, the TPC-H 3T dataset was constantly getting larger through the hourly COPY jobs, as if extract, transform, and load (ETL) was running against this dataset.

The following table summarizes the synthesized workload components.

Workload type   Schema / dataset           Description
DASH            tpch100g (TPC-H 100 GB)    16 dashboard queries running every 2 seconds
REPORT          tpch3t (TPC-H 3T)          6 report queries running every 15 minutes
DATASCIENCE     tpch3t (TPC-H 3T)          4 data science queries running every 30 minutes
COPY            tpch3t (TPC-H 3T)          3 COPY jobs every hour loading TPC-H 100 GB data onto the TPC-H 3T tables

The following table summarizes the manual and Auto WLM configurations we used.

Manual configuration:

Queues/Query groups   Memory %   Max concurrency   Concurrency scaling   Priority
Dashboard             24         5                 Off                   NA
Report                25         6                 Off                   NA
DataScience           25         4                 Off                   NA
COPY                  25         3                 Off                   NA
Default               1          1                 Off                   NA

Auto configuration: every queue uses Auto for Memory % and Max concurrency, with Concurrency scaling Off and Priority Normal.

We ran the benchmark test using two 8-node ra3.4xlarge clusters, one for each configuration. The same exact workload ran on both clusters for 12 hours.

Summary of results

We noted that manual and Auto WLM had similar response times for COPY, but Auto WLM made a significant boost to the DATASCIENCE, REPORT, and DASHBOARD query response times, which resulted in a high throughput for DASHBOARD queries (frequent short queries).

Given the same controlled environment (cluster, dataset, queries, concurrency), Auto WLM with adaptive concurrency managed the workload more efficiently and provided higher throughput than the manual WLM configuration. Better and more efficient memory management enabled Auto WLM with adaptive concurrency to improve the overall throughput. Elimination of the static memory partition created an opportunity for higher parallelism. More short queries were processed through Auto WLM, whereas longer-running queries had similar throughput. To optimize the overall throughput, adaptive concurrency control kept the number of longer-running queries at the same level but allowed more short-running queries to run in parallel.

Detailed results

In this section, we review the results in more detail.

Throughput and average response times

The following table summarizes the throughput and average response times, over a runtime of 12 hours. Response time is runtime + queue wait time.

WLM Configuration Query Type Count of Queries Total Response Time (Secs) Average Response Time (Secs)
Auto COPY 72 1329 18.46
Manual COPY 72 1271 17.65
Auto DASH 126102 271691 2.15
Manual DASH 109774 304551 2.77
Auto DATASCIENCE 166 20768 125.11
Manual DATASCIENCE 160 32603 203.77
Auto REPORT 247 38986 157.84
Manual REPORT 230 55693 242.14
Auto Total 126587 332774 2.63
Manual Total 110236 394118 3.58
Auto Over Manual (%) +14.83% (count of queries) -26.47% (average response time)

The following chart shows the throughput (queries per hour) gain (automatic throughput) over manual (higher is better).

The following chart shows the average response time of each query (lower is better).

Bucket by query completion times

The following results show a clear shift toward the left for Auto WLM: more queries completed in a shorter amount of time with Auto WLM.

% of queries completed in
WLM Configuration Total Queries 0-5 seconds 6-30 seconds 31-60 seconds 61-120 seconds 121-300 seconds 301-900 seconds Over 900 seconds
Manual 110155 87.14 11.37 1.2 0.09 0.1 0.09 0.01
Auto 126477 92.82 6.06 0.85 0.13 0.09 0.03 0.01

The following chart visualizes these results.

Query latency and count over time

As we can see from the following charts, Auto WLM significantly reduces the queue wait times on the cluster.

The following chart shows the count of queries processed per hour (higher is better).

The following chart shows the count of queued queries (lower is better).

The following chart shows the total queue wait time per hour (lower is better).

Temporary data spill to disk

Because it correctly estimated the query runtime memory requirements, the Auto WLM configuration was able to reduce the runtime spill of temporary blocks to disk. A larger portion of the queries had enough memory while running that they didn't have to write temporary blocks to disk, which is a good thing. This in turn improves query performance.

The following chart shows that DASHBOARD queries had no spill, and COPY queries had a little spill.

Auto WLM outperforms the manual configuration

Based on these tests, Auto WLM was a better choice than manual configuration. If we look at the three main aspects where Auto WLM provides greater benefits, a mixed workload (manual WLM with multiple queues) reaps the most benefits using Auto WLM. The majority of the large data warehouse workloads consists of a well-defined mixture of short, medium, and long queries, with some ETL process on top of it. So large data warehouse systems have multiple queues to streamline the resources for those specific workloads. Also, overlap of these workloads can occur throughout a typical day. If the Amazon Redshift cluster has a good mixture of workloads and they don’t overlap with each other 100% of the time, Auto WLM can use those underutilized resources and provide better performance for other queues.

Conclusion

Our test demonstrated that Auto WLM with adaptive concurrency outperforms well-tuned manual WLM for mixed workloads. If you’re using manual WLM with your Amazon Redshift clusters, we recommend using Auto WLM to take advantage of its benefits. Moreover, Auto WLM provides the query priorities feature, which aligns the workload schedule with your business-critical needs.

For more information about Auto WLM, see Implementing automatic WLM and the definition and workload scripts for the benchmark.


About the Authors

Raj Sett is a Database Engineer at Amazon Redshift. He is passionate about optimizing workload and collaborating with customers to get the best out of Redshift. Outside of work, he loves to drive and explore new places.

 

 

 

Paul Lappas is a Principal Product Manager at Amazon Redshift. Paul is passionate about helping customers leverage their data to gain insights and make critical business decisions. In his spare time Paul enjoys playing tennis, cooking, and spending time with his wife and two boys.

 

 

Gaurav Saxena is a software engineer on the Amazon Redshift query processing team. He works on several aspects of workload management and performance improvements for Amazon Redshift. In his spare time, he loves to play games on his PlayStation.

 

 

Mohammad Rezaur Rahman is a software engineer on the Amazon Redshift query processing team. He focuses on workload management and query scheduling. In his spare time, he loves to spend time outdoor with family.

How GE Healthcare modernized their data platform using a Lake House Architecture

Post Syndicated from Krishna Prakash original https://aws.amazon.com/blogs/big-data/how-ge-healthcare-modernized-their-data-platform-using-a-lake-house-architecture/

GE Healthcare (GEHC) operates as a subsidiary of General Electric. The company is headquartered in the US and serves customers in over 160 countries. As a leading global medical technology, diagnostics, and digital solutions innovator, GE Healthcare enables clinicians to make faster, more informed decisions through intelligent devices, data analytics, applications, and services, supported by its Edison intelligence platform.

GE Healthcare's legacy enterprise analytics platform used a traditional PostgreSQL-based on-premises data warehouse from a big data vendor to run a significant part of its analytics workloads. The data warehouse is key for GE Healthcare; it enables users across units to gather data and generate the daily reports and insights required to run key business functions. In the last few years, the number of teams accessing the cluster increased by almost three times, with twice the initial number of users running four times the number of daily queries for which the cluster had been designed. The legacy data warehouse wasn't able to scale to support GE Healthcare's business needs and would require significant investment to maintain, update, and secure.

Searching for a modern data platform

After several years of working with a database-focused approach, rapid data growth made GEHC's on-premises system unviable from a cost and maintenance perspective. This presented GE Healthcare with an opportunity to take a holistic look at the emerging and strategic needs for data and analytics. With this in mind, GE Healthcare decided to adopt a Lake House Architecture using AWS services:

  • Use Amazon Simple Storage Service (Amazon S3) to store raw enterprise and event data
  • Use familiar SQL statements to combine and process data across all data stores in Amazon S3 and Amazon Redshift
  • Apply the “best fit” concept of using the appropriate AWS technology to meet specific business needs

Architecture

The following diagram illustrates GE Healthcare’s architecture.

Choosing Amazon Redshift for the enterprise cloud data warehouse

Choosing the right data store is just as important as how you collect the data for analytics. Amazon Redshift provided the best value because it was easy to launch, easy to access and store data in, and able to scale to meet our business needs on demand. The following are a few additional reasons why GE Healthcare made Amazon Redshift our cloud data warehouse:

  • The ability to store petabyte-scale data in Amazon S3 and query the data in Amazon S3 and Amazon Redshift with little preprocessing was important because GE Healthcare’s data needs are expanding at a significant pace.
  • The AWS-based strategy provides financial flexibility. GE Healthcare moved from a fixed-cost/fixed-asset on-premises model to a consumption-based model. In addition, the total cost of ownership (TCO) of the AWS-based architecture was less than the solution it was replacing.
  • Native integration with the AWS Analytics ecosystem made it easier to handle end-to-end analytics workflows without friction. GE Healthcare took a hybrid approach to extract, transform, and load (ETL) jobs by using a combination of AWS Glue, Amazon Redshift SQL, and stored procedures based on complexity, scale, and cost.
  • The resilient platform makes recovery from failure easier, with errors further down the pipeline less likely to affect production environments because all historical data is on Amazon S3.
  • The idea of a Lake House Architecture is that taking a one-size-fits-all approach to analytics eventually leads to compromises. It’s not simply about integrating a data lake with a data warehouse, but rather about integrating a data lake, a data warehouse, and purpose-built data stores and enabling unified governance and easy data movement. (For more information about the Lake House Architecture, see Harness the power of your data with AWS Analytics.)
  • It’s easy to implement new capabilities to support emerging business needs such as artificial intelligence, machine learning, graph databases, and more because of the extensive product capabilities of AWS and the Amazon Redshift Lake House Architecture.

Implementation steps and best practices

As part of this journey, GE Healthcare partnered with AWS Professional Services to accelerate their momentum. AWS Professional Services was instrumental in following the Working Backwards (WB) process, which is Amazon’s customer-centric product development process.

Here’s how it worked:

  • AWS Professional Services guided GE Healthcare and partner teams through the Lake House Architecture, sharing AWS standards and best practices.
  • The teams accelerated Amazon Redshift stored procedure migration, including data structure selection and table design.
  • The teams delivered performant code at scale to enable a timely go live.
  • They implemented a framework for batch file extract to support downstream data consumption via a data as a service (DaaS) solution.

Conclusion

Modernizing to a Lake House Architecture with Amazon Redshift allowed GEHC to speed up, innovate faster, and better solve customers’ needs. At the time of writing, GE Healthcare workloads are running at full scale in production. We retired our on-premises infrastructure. Amazon Redshift RA3 instances with managed storage enabled us to scale compute and storage separately based on our customers’ needs. Furthermore, with the concurrency scaling feature of Amazon Redshift, we don’t have to worry about peak times affecting user performance any more. Amazon Redshift scales out and in automatically.

We also look forward to realizing the benefits of Amazon Redshift data sharing and AQUA (Advanced Query Accelerator) for Amazon Redshift as we continue to increase the performance and scale of our Amazon Redshift data warehouse. We appreciate AWS’s continual innovation on behalf of its customers.


About the Authors

Krishna Prakash (KP) Bhat is a Sr. Director in Data & Analytics at GE Healthcare. In this role, he’s responsible for architecture and data management of data and analytics solutions within GE Healthcare Digital Technologies Organization. In his spare time, KP enjoys spending time with family. Connect him on LinkedIn.

 

Suresh Patnam is a Solutions Architect at AWS, specialized in big data and AI/ML. He works with customers in their journey to the cloud with a focus on big data, data lakes, and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family. Connect him on LinkedIn.

Benefits of Modernizing On-premise Analytics with an AWS Lake House

Post Syndicated from Vikas Nambiar original https://aws.amazon.com/blogs/architecture/benefits-of-modernizing-on-premise-analytics-with-an-aws-lake-house/

Organizational analytics systems have shifted from running in the background of IT systems to being critical to an organization’s health.

Analytics systems help businesses make better decisions, but they tend to be complex and are often not agile enough to scale quickly. To help with this, customers upgrade their traditional on-premises online analytic processing (OLAP) databases to hyperconverged infrastructure (HCI) solutions. However, these systems incur operational overhead, are limited by proprietary formats, have limited elasticity, and tie customers into costly and inhibiting licensing agreements. These all bind an organization's growth to the growth of the appliance provider.

In this post, we provide a reference architecture and show you how an AWS lake house can help you overcome these limitations. Our solution gives you the ability to scale, integrate with multiple sources, improve business agility, and future-proof your analytics investment.

High-level architecture for implementing an AWS lake house

Lake house architecture uses a ring of purpose-built data consumers and services centered around a data lake. This approach acknowledges that a one-size-fits-all approach to analytics eventually leads to compromises. These compromises can include reduced agility in change management and the impact that different business domains' reporting requirements have on data served from a central platform. As such, simply integrating a data lake with a data warehouse is not sufficient.

Each step in Figure 1 needs to be de-coupled to build a lake house.


Figure 1. Data flow in a lake house

 


Figure 2. High-level design for an AWS lake house implementation

Building a lake house on AWS

These steps summarize building a lake house on AWS:

  1. Identify source system extraction capabilities to define an ingestion layer that loads data into a data lake.
  2. Build data ingestion layer using services that support source systems extraction capabilities.
  3. Build a governance and transformation layer to manipulate data.
  4. Provide capability to consume and visualize information via purpose-built consumption/value layer.

This lake house architecture is de-coupled. Services can be added, removed, and updated independently when new data sources are identified, such as sources that enrich data via AWS Data Exchange. This can happen while services in the purpose-built consumption layer address individual business unit requirements.

Building the data ingestion layer

Services in this layer work directly with the source systems based on their supported data extraction patterns. Data is then placed into a data lake.

Figure 3 shows the following services to be included in this layer:

  • AWS Transfer Family integrates with source systems to extract data using SSH File Transfer Protocol (SFTP), FTPS, and FTP. This service is for systems that support batch transfer modes and have no real-time requirements, such as external data entities.
  • AWS Glue connects to real-time data streams to extract, load, transform, clean, and enrich data.
  • AWS Database Migration Service (AWS DMS) connects and migrates data from relational databases, data warehouses, and NoSQL databases.

Figure 3. Ingestion layer against source systems

Services in this layer are managed services that provide operational excellence by removing patching and upgrade overheads. Being managed services, they will also detect extraction spikes and scale automatically or on-demand based on your specifications.

Building the data lake layer

A data lake built on Amazon Simple Storage Service (Amazon S3) provides the ideal target layer to store, process, and cycle data over time. As the central aspect of the architecture, Amazon S3 allows the data lake to hold multiple data formats and datasets. It can also be integrated with most if not all AWS services and third-party applications.

Figure 4 shows the following services to be included in this layer:

  • Amazon S3 acts as the data lake to hold multiple data formats.
  • Amazon S3 Glacier provides the data archiving and long-term backup storage layer for processed data. It also reduces the amount of data indexed by transformation layer services.

Figure 4. Data lake integrated to ingestion layer

The data lake layer provides 99.999999999% data durability and supports various data formats, allowing you to future-proof the data lake. Data lakes on Amazon S3 also integrate with other AWS ecosystem services (for example, Amazon Athena for interactive querying or third-party tools running on Amazon Elastic Compute Cloud (Amazon EC2) instances).

Defining the governance and transformation layer

Services in this layer transform raw data in the data lake to a business consumable format, along with providing operational monitoring and governance capabilities.

Figure 5 shows the following services to be included in this layer:

  1. AWS Glue discovers and transforms data, making it available for search and querying.
  2. Amazon Redshift (Transient) functions as an extract, transform, and load (ETL) cluster built on RA3 nodes, which can be paused outside ETL windows. Once paused, Amazon Redshift's data sharing capability allows for live data sharing for read purposes, which reduces costs to customers. It also allows for the creation of separate, smaller read-intensive business intelligence (BI) clusters from the larger write-intensive ETL clusters required during ETL runs.
  3. Amazon CloudWatch monitors and observes your enabled services. It integrates with existing IT service management and change management systems such as ServiceNow for alerting and monitoring.
  4. AWS Security Hub implements a single security pane by aggregating, organizing, and prioritizing security alerts from services used, such as Amazon GuardDuty, Amazon Inspector, Amazon Macie, AWS Identity and Access Management (IAM) Access Analyzer, AWS Systems Manager, and AWS Firewall Manager.
  5. Amazon Managed Workflows for Apache Airflow (MWAA) sequences your workflow events to ingest, transform, and load data.
  6. AWS Lake Formation standardizes data lake provisioning.
  7. AWS Lambda runs custom transformation jobs if required, including jobs developed over time that hold custom business logic IP.

Figure 5. Governance and transformation layer prepares data in the lake

This layer provides operational isolation wherein least privilege access control can be implemented to keep operational staff separate from the core services. It also lets you implement custom transformation tasks using Lambda. This allows you to consistently build lakes across all environments and maintain a single view of security via AWS Security Hub.

Building the value layer

This layer generates value for your business by provisioning decoupled, purpose-built visualization services, which decouples business units from change management impacts of other units.

Figure 6 shows the following services to be included in this value layer:

  1. Amazon Redshift (BI cluster) acts as the final store for data processed by the governance and transformation layer.
  2. Amazon Elasticsearch Service (Amazon ES) conducts log analytics and provides real-time application and clickstream analysis, including for data from previous layers.
  3. Amazon SageMaker prepares, builds, trains, and deploys machine learning models that provide businesses insights on possible scenarios such as predictive maintenance, churn predictions, demand forecasting, etc.
  4. Amazon QuickSight acts as the visualization layer, allowing business users and support resources to create reports and dashboards that are accessible across devices and can be embedded into other business applications, portals, and websites.

Figure 6. Value layer with services for purpose-built consumption

Conclusion

By using services managed by AWS as a starting point, you can build a data lake house on AWS. This open-standards-based, pay-as-you-go lake house will help future-proof your analytics platform. With the AWS lake house architecture provided in this post, you can expand your architecture and avoid the excessive license costs associated with proprietary software and infrastructure, along with their ongoing support costs. These capabilities are typically unavailable in on-premises OLAP/HCI-based data analytics platforms.


Access Amazon Location Service from Amazon Redshift

Post Syndicated from Parag Srivastava original https://aws.amazon.com/blogs/big-data/access-amazon-location-service-from-amazon-redshift/

Organizations typically store business and customer data in databases like Amazon Relational Database Service (Amazon RDS) and Amazon Redshift, and often want to enrich this data by integrating with external services. One such enrichment is to add spatial attributes such as location coordinates for an address. With the introduction of Amazon Location Service, you now have access to geospatial functionalities such as map visualization, geocoding, and reverse geocoding using data providers such as Esri and HERE.

Moreover, with the ability of Amazon Redshift and RDS databases to call AWS Lambda functions through user-defined functions (UDFs), you can now integrate these databases with geospatial functionality provided by Amazon Location Service. For more information about user-defined functions in Amazon Redshift, see Creating a scalar Lambda UDF.

In this post, we take customer addresses stored in an Amazon Redshift database and call Amazon Location Service geocoding APIs to find the coordinates of the addresses and persist them in the database.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • An understanding of basic concepts such as AWS Identity and Access Management (IAM) roles and policies, Lambda functions, and Amazon Redshift.
  • An Amazon Redshift cluster with a table that contains customer addresses and the following details: street number, street name, street type, municipality name, state or province code, postal or zip code, and country code. You also need a column to save the coordinates for the address.
  • A SQL developer tool of your choice to connect to the Amazon Redshift cluster.
  • An Amazon QuickSight account with access to Amazon Redshift.

Solution overview

Our solution consists of the following components:

  • A Python Lambda function to call the search_place_index_for_text function. This function takes a text input and returns the coordinates as longitude and latitude values for each search result.
  • An IAM role to allow Lambda to call the SearchPlaceIndexForText operation for Amazon Location Service.
  • A Lambda UDF in the Amazon Redshift database to invoke the Lambda function.
  • An IAM role to allow Amazon Redshift to invoke the Lambda function.
  • SQL statements to update and select the coordinate data for records in the Amazon Redshift database by calling the Lambda UDF.
  • A QuickSight dataset that uses the SQL statement to access the coordinate data.
  • A QuickSight analysis that displays the address location in a geospatial chart.

The following diagram illustrates our solution architecture:

To implement the solution, we complete the following steps:

  1. Set up a place index resource for Amazon Location Service.
  2. Create a common address geocoding function.
  3. Invoke the Amazon Location Service API from Amazon Redshift.
  4. Create a Lambda function.
  5. Set up Amazon Redshift.
  6. Run SQL statements to invoke the Lambda function.
  7. Visualize the address locations in QuickSight.

Pay close attention to names and parameters used in this post; they must match and be consistent across all solution components.

The full code is available in GitHub. The code also includes an AWS CloudFormation template.

Set up a place index resource for Amazon Location Service

Amazon Location Service uses a place index resource for providing geocoding and reverse geocoding functionality. Let’s start by creating a new place index resource. Your account may have a default place index but we don’t use it for this post because it isn’t configured for storage.

  1. On the Amazon Location Service console, use the navigation pane to launch the wizard to create a new place index.
  2. For Name, enter placeindex.redshift.
  3. For Data provider, select either of the data providers for your place index.
  4. For Data storage options, select Yes, results will be stored because we’re saving the geocoding results in a database field.
  5. Leave all other values as their default and choose Create place index.

We use the place index name in the Lambda code as a parameter to the search_place_index_for_text API call.
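
If you prefer to script this step instead of using the console, the following boto3 sketch creates an equivalent place index; the data provider shown here is just one of the available choices.

import boto3

location = boto3.client("location")

# IntendedUse must be Storage because the geocoding results are persisted
# in Amazon Redshift. Depending on your SDK version, a PricingPlan argument
# may also be required.
location.create_place_index(
    IndexName="placeindex.redshift",
    DataSource="Esri",
    DataSourceConfiguration={"IntendedUse": "Storage"},
)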

Create a common address geocoding Lambda function

Next, let’s create a Lambda function to call the Amazon Location Service API. We reuse this function code for both Amazon Redshift and Amazon RDS because the underlying service call is the same.

Error handling has been skipped in this code snippet for brevity. The full code is available in GitHub.

We use country_code to limit the search_place_index_for_text API to a particular country using the ISO 3166-1 alpha-3 country code. Amazon Location Service throws an exception if an input is provided but doesn’t contain a valid value.

Create a new file geocode.py with the following code snippet:

import boto3
import os

# Amazon Location Service client used for the geocoding calls
location = boto3.client("location")

def geocode_address(address_line, municipality_name, state_code, post_code, country_code):

    # Build a single free-form address string and search the place index,
    # restricting results to the provided ISO 3166-1 alpha-3 country code
    text = ("%s, %s %s %s" % (address_line, municipality_name, state_code, post_code))
    response = location.search_place_index_for_text(IndexName="placeindex.redshift", FilterCountries=[country_code], Text=text)

    data = response["Results"]
    if len(data) >= 1:
        # Use the top-ranked match and flag whether more than one candidate was returned
        response = {
            "Longitude": data[0]["Place"]["Geometry"]["Point"][0],
            "Latitude": data[0]["Place"]["Geometry"]["Point"][1],
            "Label": data[0]["Place"]["Label"],
            "MultipleMatch": False
        }

        if len(data) > 1:
            response["MultipleMatch"] = True
    else:
        response = {
            "Error": "No geocoding results found"
        }

    return response

Invoke the Amazon Location Service API from Amazon Redshift

Each Amazon Redshift compute node invokes Lambda functions in parallel in order to return the results quickly. The function returns a success or fail status code along with the results. In case of a failure, an error message can be returned.

The Amazon Redshift Lambda UDF invocation provides an arguments list containing an ordered list of input parameters representing fields of the database records (see the following code). The function reads the input parameters by the list position and maps them to variables for processing the request.

{
   "arguments":[
      [
         "410 TERRY AVE N",
         "SEATTLE",
         "WA",
         "98109",
         "USA"
      ],
      [
         "510 W GEORGIA ST",
         "VANCOUVER",
         "BC",
         "V6B0M3",
         "CAN"
      ]
   ],
   "num_records":2
}

The interface expects results to be returned in the same order and containing the same number of items as the request. See the following code:

{
   "success":true,
   "results":[
      {
         "Longitude":-122.33664798659117,
         "Latitude":47.62231899780329,
         "Label":"410 Terry Ave N, Seattle, WA, 98109, USA",
         "MultipleMatch":false
      },
      {
         "Longitude":-123.11693997267491,
         "Latitude":49.281259993687854,
         "Label":"510 W Georgia St, Vancouver, British Columbia, V6B 0M3, CAN",
         "MultipleMatch":true
      }
   ]
}

In case of an exception, the function can return the fail status along with an error message:

{
  "success": false,
  "error_msg": "Error in processing request"
}

Create a Lambda function

We now create a Lambda function GeocodeAddresses-Redshift using the Python runtime.

Error handling has been skipped in this code snippet for brevity. The full code is available in GitHub.

  1. Create the file geocode.py as described in the previous section.

It’s important to set the timeout on the Lambda function to a longer duration to process multiple records in one call.

  2. Replace the default generated code in lambda_function.py with the following code:
import json
from geocode import *

def lambda_handler(event, context):

    try:
        response = dict()
        # Amazon Redshift batches multiple table rows into a single invocation
        records = event["arguments"]
        results = []

        for record in records:
            # Each record is an ordered list matching the UDF argument order
            address_line, municipality_name, state_code, post_code, country_code = record
            try:
                results.append(json.dumps(geocode_address(address_line, municipality_name, state_code, post_code, country_code)))
            except:
                # Return NULL for records that fail so the result count matches the request
                results.append(None)

        response['success'] = True
        response['results'] = results
    except Exception as e:
        response['success'] = False
        response['error_msg'] = str(e)

    return json.dumps(response)

This function requires permission to call the search_place_index_for_text API to geocode addresses using the place index placeindex.redshift that we created earlier.

  3. Update the IAM role for the Lambda function to add the following inline policy GeocodeAddresses-Redshift-policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "geo:SearchPlaceIndexForText"
            ],
            "Resource": "arn:aws:geo:<Region>:<AccountId>:place-index/placeindex.redshift",
            "Effect": "Allow"
        }
    ]
}

Set up Amazon Redshift

Now we create a new Lambda UDF in Amazon Redshift and configure it to use an IAM role that gives the permissions to invoke the specific Lambda function. For more information, see Creating a scalar Lambda UDF.

  1. Create an IAM role Redshift-Lambda-role and add the following inline policy Redshift-Lambda-policy to this role to allow Amazon Redshift to invoke the GeocodeAddresses-Redshift function that we created:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": [
                "arn:aws:lambda:<Region>:<AccountId>:function:GeocodeAddresses-Redshift"
            ]
        }
    ]
}

We need to associate this IAM role with the Amazon Redshift cluster.

  1. On the Amazon Redshift console, choose the cluster to use.
  2. Under Manage IAM Roles, add the Redshift-Lambda-role role to the cluster.
  3. Wait for the cluster to be modified and be in Available status.
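
Alternatively, you can attach the role programmatically instead of through the console; the following boto3 sketch uses placeholder account and cluster identifiers.

import boto3

redshift = boto3.client("redshift")

# Attach the invocation role to the cluster (placeholders for account and cluster).
redshift.modify_cluster_iam_roles(
    ClusterIdentifier="my-redshift-cluster",
    AddIamRoles=["arn:aws:iam::123456789012:role/Redshift-Lambda-role"],
)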

Now we create an Amazon Redshift external function to invoke a Lambda function.

  1. Use your preferred SQL editor to connect to your Amazon Redshift cluster and create a new external function f_geocode_address in Amazon Redshift using the following code:
CREATE OR REPLACE EXTERNAL FUNCTION public.f_geocode_address(VARCHAR, VARCHAR,
                                                             VARCHAR, VARCHAR,
                                                             VARCHAR)
RETURNS VARCHAR
STABLE
LAMBDA 'GeocodeAddresses-Redshift'
IAM_ROLE 'arn:aws:iam::<AccountId>:role/Redshift-Lambda-role';

This external function calls the Lambda function using the permissions provided in the Redshift-Lambda-role role that we created.

Run SQL statements to invoke the Lambda function

We’re now ready to run SQL statements that can take address records from the customer_address table in the Amazon Redshift database and geocode them using Amazon Location Service.

If you don’t have a customer_address table, you can create it using the script included in the full code in GitHub.

You may choose to simply select the geocoding results as part of the SQL statement, as in the following code. It’s important to understand that Lambda functions called from SELECT SQL statements contain fields from multiple database records in each call. Therefore, the timeout on the function should be set to a longer duration to process multiple records in one call.

SELECT customer_id,
       f_geocode_address(address_line,
                         municipality_name,
                         state_code,
                         post_code,
                         country_code)
FROM customer_address
WHERE address_status = 'NEW'
  AND country_code IS NOT NULL;

Geocoding results can be persisted in a separate database table in Amazon Redshift. We can then parse the saved JSON results to extract the geographic coordinates.

CREATE TABLE customer_address_geocode AS
SELECT customer_id,
       address_line,
       municipality_name,
       state_code,
       post_code,
       country_code,
       f_geocode_address(address_line,
                         municipality_name,
                         state_code,
                         post_code,
                         country_code) AS geocode_result
FROM customer_address
WHERE address_status = 'NEW'
  AND country_code IS NOT NULL;

Now let’s extract the geographic coordinates from the results field.

SELECT customer_id,
    cast(json_extract_path_text(geocode_result, 'Longitude') as float) as Longitude,
    cast(json_extract_path_text(geocode_result, 'Latitude') as float) as Latitude
FROM customer_address_geocode;

Visualize the address location using QuickSight

Let’s configure a dataset in QuickSight and create an analysis for this dataset.

  1. Create a new Amazon Redshift data source Redshift-Geocode either using an auto-discovered or manually configured endpoint.
  2. Create a new dataset for the data source using the preceding SELECT statement as a custom SQL.

Now we’re ready to configure our analysis.

  1. On the QuickSight console, create a new QuickSight analysis using the address location dataset.
  2. Choose the Point on map visual type.

  1. Choose the longitude and latitude fields from the fields list and drag them to the Geospatial field wells.

You should now see points on the map representing the customer address locations.

Conclusion

Congratulations! We have successfully integrated Amazon Location Service with Amazon Redshift and geocoded address data in an Amazon Redshift table without leaving our SQL environment. We enriched the address data by adding location coordinates. We also successfully visualized the address locations on a map in QuickSight. You can now explore extending the solution to other functionalities of Amazon Location Service such as reverse geocoding or even integrating with Lambda functions for any other custom functionality.


About the author

Parag Srivastava is a Solutions Architect at Amazon Web Services (AWS), helping enterprise customers with successful cloud adoption and migration. During his professional career, he has been extensively involved in complex digital transformation projects. He is also passionate about building innovative solutions around geospatial aspects of addresses.

Synchronize and control your Amazon Redshift clusters maintenance windows

Post Syndicated from Ahmed Gamaleldin original https://aws.amazon.com/blogs/big-data/synchronize-and-control-your-amazon-redshift-clusters-maintenance-windows/

Amazon Redshift is a data warehouse that can expand to exabyte-scale. Today, tens of thousands of AWS customers (including NTT DOCOMO, Finra, and Johnson & Johnson) use Amazon Redshift to run mission-critical business intelligence dashboards, analyze real-time streaming data, and run predictive analytics jobs.

Amazon Redshift powers analytical workloads for Fortune 500 companies, startups, and everything in between. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business. Availability, therefore, is key in continuing to drive customer success, and AWS will use commercially reasonable efforts to make Amazon Redshift available with a Monthly Uptime Percentage for each multi-node cluster, during any monthly billing cycle, of at least 99.9% (our “Service Commitment”).

In this post, we present a solution to help you provide a predictable and repeatable experience to your Amazon Redshift end-users by taking control of recurring Amazon Redshift maintenance windows.

Amazon Redshift maintenance windows

Amazon Redshift periodically performs maintenance to apply fixes, enhancements, and new features to your cluster. This maintenance occurs during a 30-minute maintenance window that Amazon Redshift assigns by default at random from an 8-hour block of time per Region, on a random day of the week. You can change the scheduled maintenance window according to your business needs by modifying the cluster, either programmatically or by using the Amazon Redshift console. The window must be at least 30 minutes and not longer than 24 hours. For more information, see Managing clusters using the console.
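
For reference, the following is a minimal Python (boto3) sketch of the programmatic option; the cluster identifier and window value are hypothetical placeholders.

import boto3

redshift = boto3.client('redshift')

# Move the 30-minute maintenance window to Saturday 06:00-06:30 UTC.
# The window format is ddd:hh24:mi-ddd:hh24:mi and must span at least 30 minutes.
redshift.modify_cluster(
    ClusterIdentifier='my-cluster',                      # hypothetical cluster name
    PreferredMaintenanceWindow='sat:06:00-sat:06:30'
)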

If a maintenance event is scheduled for a given week, it starts during the assigned 30-minute maintenance window. While Amazon Redshift performs maintenance, it terminates any queries or other operations that are in progress. If there are no maintenance tasks to perform during the scheduled maintenance window, your cluster continues to operate normally until the next scheduled maintenance window. Amazon Redshift uses Amazon Simple Notification Service (Amazon SNS) to send notifications of Amazon Redshift events. You enable these notifications by creating an Amazon Redshift event subscription, so you’re notified when an event occurs for a given cluster.
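
As an illustration, the following boto3 sketch creates such a subscription for pending events on a single cluster; the subscription name, SNS topic ARN, and cluster identifier are hypothetical.

import boto3

redshift = boto3.client('redshift')

# Notify an existing SNS topic about upcoming ("pending") events for one cluster.
redshift.create_event_subscription(
    SubscriptionName='redshift-maintenance-events',                    # hypothetical name
    SnsTopicArn='arn:aws:sns:us-east-1:111122223333:redshift-events',  # hypothetical topic
    SourceType='cluster',
    SourceIds=['my-cluster'],                                          # hypothetical cluster
    EventCategories=['pending'],
    Severity='INFO',
    Enabled=True
)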

When an Amazon Redshift cluster is scheduled for maintenance, you receive an Amazon Redshift “Pending” event, as described in the following table.

Amazon Redshift category Event ID Event severity Description
Pending REDSHIFT-EVENT-2025 INFO Your database for cluster <cluster name> will be updated between <start time> and <end time>. Your cluster will not be accessible. Plan accordingly.
Pending REDSHIFT-EVENT-2026 INFO Your cluster <cluster name> will be updated between <start time> and <end time>. Your cluster will not be accessible. Plan accordingly.

Amazon Redshift also gives you the option to reschedule your cluster’s maintenance window by deferring your upcoming maintenance by up to 45 days. This option is particularly helpful if you want to maximize your cluster uptime by deferring a future maintenance window. For example, if your cluster’s maintenance window is set to Wednesday 8:30–9:00 UTC and you need to have nonstop access to your cluster for the next 2 weeks, you can defer maintenance to a date 2 weeks from now. We don’t perform any maintenance on your cluster when you have specified a deferment.
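
The following boto3 sketch defers scheduled maintenance for the next 30 days; the cluster identifier is a hypothetical placeholder.

import boto3
from datetime import datetime, timedelta, timezone

redshift = boto3.client('redshift')

# Defer scheduled maintenance for the next 30 days
# (Amazon Redshift allows deferring by up to 45 days).
start = datetime.now(timezone.utc)
redshift.modify_cluster_maintenance(
    ClusterIdentifier='my-cluster',            # hypothetical cluster name
    DeferMaintenance=True,
    DeferMaintenanceStartTime=start,
    DeferMaintenanceEndTime=start + timedelta(days=30)
)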

Deferring a maintenance window doesn’t apply to mandatory Amazon Redshift updates, such as vital security patches. Scheduled maintenance is different from Amazon Redshift mandatory maintenance. If Amazon Redshift needs to update hardware or make other mandatory updates during your period of deferment, we notify you and make the required changes. Your cluster isn’t available during these updates and such maintenance can’t be deferred. If a hardware replacement is required, you receive an event notification through the AWS Management Console and your SNS subscription as a “Pending” item, as shown in the following table.

Amazon Redshift category Event ID Event severity Description
Pending REDSHIFT-EVENT-3601 INFO A node on your cluster <cluster name> will be replaced between <start time> and <end time>. You can’t defer this maintenance. Plan accordingly.
Pending REDSHIFT-EVENT-3602 INFO A node on your cluster <cluster name> is scheduled to be replaced between <start time> and <end time>. Your cluster will not be accessible. Plan accordingly.

Challenge

As the number of the Amazon Redshift clusters you manage for an analytics application grows, the end-user experience becomes highly dependent on recurring maintenance. This means that you want to make sure maintenance windows across all clusters happen at the same time and fall on the same day each month. You want to avoid a situation in which, for example, your five Amazon Redshift clusters are each updated on a different day or week. Ultimately, you want to give your Amazon Redshift end-users an uninterrupted number of days where the clusters aren’t subject to any scheduled maintenance. This gives you the opportunity to announce a scheduled maintenance to your users well ahead of the maintenance date.

Amazon Redshift clusters are scheduled for maintenance depending on several factors, including when a cluster was created and its Region. For example, you may have two clusters on the same Amazon Redshift version that are scheduled for different maintenance windows. Regions roll out the latest Amazon Redshift versions and patches at different times; a cluster running in us-east-1 might be scheduled for the same maintenance 1 week before another cluster in eu-west-1. This makes it harder for you to provide your end-users with a predictable maintenance schedule.

To solve this issue, you typically need to frequently check and synchronize the maintenance windows across all your clusters to a specific time, for example sat:06:00-sat:06:30. Additionally, you want to avoid having your clusters scheduled for maintenance at different intervals. For that, you need to defer maintenance across all your clusters to fall on the same exact day. For example, you can defer all maintenance across all clusters to happen 1 month from now regardless of when the cluster was last updated. This way you know that your clusters aren’t scheduled for maintenance for the next 30 days. This gives you enough time to announce the maintenance schedule to your Amazon Redshift users.

Solution overview

Having one day per month when Amazon Redshift scheduled maintenance occurs across all your clusters provides your users with a seamless experience. It gives you control and predictability over when clusters aren’t available. You can announce this maintenance window ahead of time to avoid sudden interruptions.

The following solution deploys an AWS Lambda function (RedshiftMaintenanceSynchronizer) to your AWS account. The function runs on a schedule that you configure through the AWS CloudFormation template parameter frequency: every 6, 12, or 24 hours. This function synchronizes maintenance schedules across all your Amazon Redshift clusters so that maintenance happens at the same time on the same day. It also gives you the option to defer all future maintenance windows across all clusters by a number of days (deferment days), up to 45 days. This enables you to provide your users with an uninterrupted number of days when your clusters aren’t subject to maintenance.

We use the following input parameters:

  • Deferment days – The number of days to defer all future scheduled maintenance windows. Amazon Redshift can defer maintenance windows by up to 45 days. The solution adds this number to the date of the last successfully completed maintenance across any of your clusters. The resulting date is the new maintenance date for all your clusters.
  • Frequency – The frequency to run this solution. You can configure it to run every 6, 12, or 24 hours.
  • Day – The preferred day of the week to schedule maintenance windows.
  • Hour – The preferred hour of the day to schedule maintenance windows (24-hour format).
  • Minute – The preferred minute of the hour to schedule maintenance windows.

The Lambda function performs the following steps (a simplified sketch of this logic follows the list):

  1. Lists all your Amazon Redshift clusters in the Region.
  2. Updates the maintenance window for all clusters to the same value of the day/hour/minute input parameters from the CloudFormation template.
  3. Checks for the last successfully completed maintenance across all clusters.
  4. Calculates the deferment maintenance date by adding the deferment input parameter to the date of the last successfully completed maintenance. For example, if the deferment parameter is 30 days and the last successful maintenance window completed on July 1, 2020, then the next deferment maintenance date is July 31, 2020.
  5. Defers the next maintenance window across all clusters to the deferment date calculated in the previous step.
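
The following is a simplified Python (boto3) sketch of this logic, not the actual function shipped by the template. The environment variable names are hypothetical, and it computes the deferment date from the current date instead of the last successfully completed maintenance.

import os
from datetime import datetime, timedelta, timezone

import boto3

redshift = boto3.client('redshift')

# Hypothetical environment variables standing in for the CloudFormation parameters.
PREFERRED_WINDOW = os.environ.get('PREFERRED_WINDOW', 'sat:06:00-sat:06:30')
DEFERMENT_DAYS = int(os.environ.get('DEFERMENT_DAYS', '30'))


def lambda_handler(event, context):
    clusters = redshift.describe_clusters()['Clusters']

    # Don't touch an existing deferment; exit instead.
    if any(cluster.get('DeferredMaintenanceWindows') for cluster in clusters):
        return 'Existing deferment found; nothing to do.'

    # Simplification: defer relative to the current date rather than the last
    # successfully completed maintenance across the fleet.
    now = datetime.now(timezone.utc)
    deferment_end = now + timedelta(days=DEFERMENT_DAYS)

    for cluster in clusters:
        cluster_id = cluster['ClusterIdentifier']
        # Sync every cluster to the same preferred maintenance window ...
        redshift.modify_cluster(
            ClusterIdentifier=cluster_id,
            PreferredMaintenanceWindow=PREFERRED_WINDOW
        )
        # ... and defer maintenance so all clusters share the same deferment date.
        redshift.modify_cluster_maintenance(
            ClusterIdentifier=cluster_id,
            DeferMaintenance=True,
            DeferMaintenanceStartTime=now,
            DeferMaintenanceEndTime=deferment_end
        )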

Launch the solution

To get started, deploy the CloudFormation template to your AWS account.

  1. For Stack name, enter a name for your stack for easy reference.
  2. For Day, choose the day of week for the maintenance window to occur.
  3. For Hour, choose the hour for the window to start.
  4. For Minute, choose the minute within the hour for the window to start.

Your window should be a time with the least cluster activity and during off-peak work hours.

  5. For Deferment Days, choose the number of days to defer all future scheduled maintenance windows.
  6. For Solution Run Frequency, choose the frequency of 6, 12, or 24 hours.


After the template has successfully deployed, the following resources are available:

  • The RedshiftMaintenanceSync Lambda function. This is a Python 3.8 Lambda function that syncs and defers the maintenance windows across all your Amazon Redshift clusters.


  • The RedshiftMaintenanceSyncEventRule Amazon CloudWatch event rule. This rule triggers on a schedule based on the Frequency input parameter. It triggers the RedshiftMaintenanceSync Lambda function to run the solution logic.

If the Lambda function detects an existing deferment on any of your clusters when it starts, it doesn’t modify that deferment and exits instead.

The solution logs any deferment it performs on any cluster in the associated CloudWatch log group.

Synchronize and defer maintenance windows

To demonstrate this solution, I have two Amazon Redshift clusters with different preferred maintenance windows and scheduled intervals.

Cluster A (see the following screenshot) has a preferred maintenance window set to Friday at 4:30–5:00 PM. It’s scheduled for maintenance in 2 days.

Cluster B has a preferred maintenance window set to Tuesday at 9:45–10:15 AM. It’s scheduled for maintenance in 6 days.

This means that my Amazon Redshift users face two 30-minute interruptions, one in 2 days and another in 6 days, and these interruptions happen at completely different times.

Let’s launch the solution and see how the cluster maintenance windows and scheduling intervals change.

The Lambda function does the following every time it runs:

  1. Lists all the clusters.
  2. Checks if any of the clusters have a deferment enabled and exits if it finds any.
  3. Syncs all preferred maintenance windows across all clusters to fall on the same time and day of week.
  4. Checks for the latest successfully completed maintenance date.
  5. Adds the deferment days to the last maintenance date. This becomes the new deferment date.
  6. Applies the new deferment date to all clusters.

Cluster A’s maintenance window was synced to the input parameters I passed to the CloudFormation template. Additionally, the next maintenance window was deferred until June 21, 2021, 8:06 AM (UTC +02:00).

Cluster B’s maintenance window is also synced to the same value from Cluster A, and the next maintenance window was deferred to the same exact day as Cluster A: June 21, 2021, 8:06 AM (UTC +02:00).

Finally, let’s check the CloudWatch log group to understand what the solution did.

Now both clusters have the same maintenance window and next scheduled maintenance, which is a month from now. In a production scenario, this gives you enough lead time to announce the maintenance window to your Amazon Redshift end-users and provide them with the exact day and time when clusters aren’t available.

Summary

This solution can help you provide a predictable and repeatable experience to your Amazon Redshift end-users by taking control of recurring Amazon Redshift maintenance windows. It enables you to provide your users with an uninterrupted number of days when clusters are always available, barring any scheduled mandatory upgrades. Get started with Amazon Redshift today.


About the Author


Ahmed Gamaleldin is a Senior Technical Account Manager (TAM) at Amazon Web Services. Ahmed helps customers run optimized workloads on AWS and make the best out of their cloud journey.

Accelerate your data warehouse migration to Amazon Redshift – Part 2

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-2-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the second post in a multi-part series. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially reduce your overall cost to migrate to Amazon Redshift. Check out the first post, Accelerate your data warehouse migration to Amazon Redshift – Part 1, to learn more about automated macro conversion, case-insensitive string comparison, case-sensitive identifiers, and other exciting new features.

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other services like Amazon EMR, Amazon Athena, and Amazon SageMaker to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating their self-managed data warehouse engines to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of historical data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.

Until now, migrating a data warehouse to AWS was complex and involved a significant amount of manual effort.

Today, we’re happy to share additional enhancements to the AWS Schema Conversion Tool (AWS SCT) to automate your migrations to Amazon Redshift. These enhancements reduce the recoding needed for your data tables, and more importantly, the manual work needed for views, stored procedures, scripts, and other application code that use those tables.

In this post, we introduce automation for INTERVAL and PERIOD data types, automatic type casting, binary data support, and some other enhancements that have been requested by customers. We show you how to use AWS SCT to convert objects from a Teradata data warehouse and provide links to relevant documentation so you can continue exploring these new capabilities.

INTERVAL data types

An INTERVAL is an unanchored duration of time, like “1 year” or “2 hours,” that doesn’t have a specific start or end time. In Teradata, INTERVAL data is implemented as 13 distinct data types depending on the granularity of time being represented. The following table summarizes these types.

Year intervals: INTERVAL YEAR, INTERVAL YEAR TO MONTH
Month intervals: INTERVAL MONTH
Day intervals: INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND
Hour intervals: INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND
Minute intervals: INTERVAL MINUTE, INTERVAL MINUTE TO SECOND
Second intervals: INTERVAL SECOND

Amazon Redshift doesn’t support INTERVAL data types natively. Previously, if you used INTERVAL types in your data warehouse, you had to develop custom code as part of the database conversion process.

Now, AWS SCT automatically converts INTERVAL data types for you. AWS SCT converts an INTERVAL column into a CHARACTER VARYING column in Amazon Redshift. Then AWS SCT converts your application code that uses the column to emulate the INTERVAL semantics.

For example, consider the following Teradata table, which has a MONTH interval column. This table stores different types of leaves of absence and the allowable duration for each.

CREATE TABLE testschema.loa_durations (
  loa_type_id INTEGER
, loa_name VARCHAR(100) CHARACTER SET LATIN
, loa_duration INTERVAL MONTH(2))
PRIMARY INDEX (loa_type_id);

AWS SCT converts the table to Amazon Redshift as follows. Because Amazon Redshift doesn’t have a native INTERVAL data type, AWS SCT replaces it with a VARCHAR data type.

CREATE TABLE testschema.loa_durations(
  loa_type_id INTEGER
, loa_name VARCHAR(100)
, loa_duration VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY
(
loa_type_id
)
SORTKEY
(
loa_type_id
);

Now, let’s suppose your application code uses the loa_duration column, like the following Teradata view. Here, the INTERVAL MONTH field is added to the current date to compute when a leave of absence ends if it starts today.

REPLACE VIEW testschema.loa_projected_end_date AS
SELECT
  loa_type_id loa_type_id
, loa_name loa_name
, loa_duration
, current_date AS today
, current_date + loa_duration AS end_date
FROM
testschema.loa_durations
;

Because the data is stored as CHARACTER VARYING, AWS SCT injects the proper type CAST into the Amazon Redshift code to interpret the string values as a MONTH interval. It then converts the arithmetic using Amazon Redshift date functions.

CREATE OR REPLACE VIEW testschema.loa_projected_end_date (loa_type_id, loa_name, loa_duration, today, end_date) AS
SELECT
  loa_type_id AS loa_type_id
, loa_name AS loa_name
, loa_duration
, CURRENT_DATE AS today
, dateadd(MONTH, CAST (loa_duration AS INTEGER),CURRENT_DATE)::DATE AS end_date
FROM testschema.loa_durations
;

Also, as a bonus, AWS SCT automatically converts any literal INTERVAL values that you might be using in your code.

For example, consider the following Teradata table. The table contains a DATE column, which records the last date when an employee was promoted.

CREATE TABLE TESTSCHEMA.employees (
  id INTEGER
, name VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC
, manager_id INTEGER
, last_promo_date DATE FORMAT 'YY/MM/DD'
)
UNIQUE PRIMARY INDEX ( id );

Now, suppose the database contains a view that computes the next date an employee is eligible for a promotion. We implement a business rule that employees who have never been promoted are eligible for promotion in 1.5 years. All other employees become eligible 2.5 years after their last promotion. See the following code:

REPLACE VIEW testschema.eligible_for_promo AS
SELECT 
  id
, name
, last_promo_date
, CASE WHEN last_promo_date is NULL THEN current_date + INTERVAL '18' MONTH
       ELSE last_promo_date + INTERVAL '2-06' YEAR TO MONTH
  END eligible_date
FROM employees
;

AWS SCT converts the INTERVAL values used in the CASE statement and translates the date expressions accordingly:

CREATE OR REPLACE VIEW testschema.eligible_for_promo (id, name, last_promo_date, eligible_date) AS
SELECT
  id
, name
, last_promo_date
, CASE
    WHEN last_promo_date IS NULL THEN dateadd(MONTH, 18, CURRENT_DATE)::DATE
    ELSE dateadd(MONTH, 30, last_promo_date)::DATE
  END AS eligible_date
FROM testschema.employees
;

We’re excited about INTERVAL automation in AWS SCT and encourage you to give it a try. For more information about getting started with AWS SCT, see Installing, verifying, and updating AWS SCT.

PERIOD data type

A PERIOD data value represents a duration of time with a specified start and end. For example, the Teradata literal “(2021-01-01 to 2021-01-31)” is a period with a duration of 31 days that starts and ends on the first and last day of January 2021, respectively. PERIOD data types can have three different granularities: DATE, TIME, or TIMESTAMP. The following table provides some examples.

Period Type Example
PERIOD(DATE) “(2021-01-01 to 2021-01-31) “
PERIOD(TIME) “(12:00:00 to 13:00:00)”
PERIOD(TIMESTAMP) “(2021-01-31 00:00:00 to 2021-01-31 23:59:59)”

As with INTERVAL, the PERIOD data type isn’t natively supported by Amazon Redshift. Previously, if you used these data types in your tables, you had to write custom code as part of the database conversion process.

Now, AWS SCT automatically converts PERIOD data types for you. AWS SCT converts a PERIOD column into two DATE (or TIME or TIMESTAMP) columns as appropriate on Amazon Redshift. Then AWS SCT converts your application code that uses the column to emulate the source engine semantics.

For example, consider the following Teradata table:

CREATE SET TABLE testschema.period_table (
  id INTEGER
, period_col PERIOD(timestamp)) 
UNIQUE PRIMARY INDEX (id);

AWS SCT converts the PERIOD(TIMESTAMP) column into two TIMESTAMP columns in Amazon Redshift:

CREATE TABLE IF NOT EXISTS testschema.period_table(
  id INTEGER
, period_col_begin TIMESTAMP
, period_col_end TIMESTAMP
)
DISTSTYLE KEY
DISTKEY
(id)
SORTKEY
(id);

Now, let’s look at a simple example of how you can use AWS SCT to convert your application code. A common operation in Teradata is to extract the starting (or ending) timestamps in a PERIOD value using the BEGIN and END built-in functions:

REPLACE VIEW testschema.period_view_begin_end AS 
SELECT 
  BEGIN(period_col) AS period_start
, END(period_col) AS period_end 
FROM testschema.period_table
;

AWS SCT converts the view to reference the transformed table columns:

CREATE OR REPLACE VIEW testschema.period_view_begin_end (period_start, period_end) AS
SELECT
  period_col_begin AS period_start
, period_col_end AS period_end
FROM testschema.period_table;

We’ll continue to build automation for PERIOD data conversion, so stay tuned for more improvements. In the meantime, you can try out the PERIOD data type conversion features in AWS SCT now. For more information, see Installing, verifying, and updating AWS SCT.

Type casting

Some data warehouse engines, like Teradata, provide an extensive set of rules to cast data values in expressions. These rules permit implicit casts, where the target data type is inferred from the expression, and explicit casts, which typically use a function to perform the type conversion.

Previously, you had to manually convert implicit cast operations in your SQL code. Now, we’re happy to share that AWS SCT automatically converts implicit casts as needed. This feature is available now for the following set of high-impact Teradata data types.

Category Source data type Target data types
Numeric CHAR BIGINT, NUMBER, TIMESTAMP
Numeric VARCHAR NUMBER, NUMERIC, DEC, CHAR, GEOMETRY
Numeric INTEGER DATE, DEC
Numeric BIGINT DATE
Numeric NUMBER CHARACTER, VARCHAR, DEC
Numeric DECIMAL DATE, TIMESTAMP, SMALLINT, DOUBLE PRECISION
Numeric FLOAT DEC
Time DATE BIGINT, INTEGER, DECIMAL, FLOAT, NUMBER, CHARACTER, TIMESTAMP
Time INTERVAL NUMBER, BIGINT, INTEGER
Other GEOMETRY DECIMAL

Let’s look at how to cast numbers to DATE. Many Teradata applications treat numbers and DATE as equivalent values. Internally, Teradata stores DATE values as INTEGER. The rules to convert between an INTEGER and a DATE are well-known and developers have commonly exploited this information to perform date calculations using INTEGER arithmetic.
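
For reference, Teradata encodes a DATE internally as the integer (year - 1900) * 10000 + month * 100 + day. The following short Python helper (illustrative only, not part of the migration tooling) shows how the value 1410330 used in the macro below maps to the date 2041-03-30.

from datetime import date

def teradata_date_int(d: date) -> int:
    # Teradata's internal DATE encoding: (year - 1900) * 10000 + month * 100 + day
    return (d.year - 1900) * 10000 + d.month * 100 + d.day

print(teradata_date_int(date(2041, 3, 30)))  # prints 1410330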

For example, consider the following Teradata table:

CREATE TABLE testschema.employees (
  id INTEGER
, name VARCHAR(20) CHARACTER SET LATIN
, manager_id INTEGER
, last_promo_date DATE FORMAT 'YY/MM/DD')
UNIQUE PRIMARY INDEX ( id );

We insert a single row of data into the table:

select * from employees;

 *** Query completed. One row found. 4 columns returned. 
 *** Total elapsed time was 1 second.

         id  name                   manager_id  last_promo_date
-----------  --------------------  -----------  ---------------
        112  Britney                       201                ?

We use a macro to update the last_promo_date field for id = 112. The macro accepts a BIGINT parameter to populate the DATE field.

replace macro testschema.set_last_promo_date(emp_id integer, emp_promo_date bigint) AS (
update testschema.employees
set last_promo_date = :emp_promo_date
where id = :emp_id;
);

Now, we run the macro and check the value of the last_promo_date attribute:

exec testschema.set_last_promo_date(112, 1410330);

 *** Update completed. One row changed. 
 *** Total elapsed time was 1 second.


select * from employees;

 *** Query completed. One row found. 4 columns returned. 
 *** Total elapsed time was 1 second.

         id  name                   manager_id  last_promo_date
-----------  --------------------  -----------  ---------------
        112  Britney                       201         41/03/30

You can see the last_promo_date attribute is set to the date March 30, 2041.

Now, let’s use AWS SCT to convert the table and macro to Amazon Redshift. As we saw in Part 1 of this series, AWS SCT converts the Teradata macro into an Amazon Redshift stored procedure:

CREATE TABLE IF NOT EXISTS testschema.employees(
  id INTEGER
, name CHARACTER VARYING(20) 
, manager_id INTEGER
, last_promo_date DATE
)
DISTSTYLE KEY
DISTKEY
(id)
SORTKEY
(id);

CREATE OR REPLACE PROCEDURE testschema.set_last_promo_date(par_emp_id INTEGER, par_emp_promo_date BIGINT)
AS $BODY$
BEGIN
    UPDATE testschema.employees
    SET last_promo_date = TO_DATE((par_emp_promo_date + 19000000), 'YYYYMMDD')
        WHERE id = par_emp_id;
END;
$BODY$
LANGUAGE plpgsql;

Note that 20410330 = 1410330 + 19000000; so adding 19,000,000 to the input returns the correct date value 2041-03-30.

Now, when we run the stored procedure, it updates the last_promo_date as expected:

myredshift=# select * from testschema.employees;
 id  |  name   | manager_id | last_promo_date
 112 | Britney |        201 |
(1 row)

myredshift=# call testschema.set_last_promo_date(112, 1410330);
CALL

myredshift=# select * from testschema.employees;
 id  |  name   | manager_id | last_promo_date
 112 | Britney |        201 | 2041-03-30
(1 row)

Automatic data type casting is available in AWS SCT now. You can download the latest version and try it out.

BLOB data

Amazon Redshift doesn’t have native support for BLOB columns, which you use to store large binary objects like text or images.

Previously, if you were migrating a table with a BLOB column, you had to manually move the BLOB values to file storage, like Amazon Simple Storage Service (Amazon S3), then add a reference to the S3 file in the table. Using Amazon S3 as the storage target for binary objects is a best practice because these objects are large and typically have low analytic value.

We’re happy to share that AWS SCT now automates this process for you. AWS SCT replaces the BLOB column with a CHARACTER VARYING column on the target table. Then, when you use the AWS SCT data extractors to migrate your data, the extractors upload the BLOB value to Amazon S3 and insert a reference to the BLOB into the target table.

For example, let’s create a table in Teradata and populate it with some data:

CREATE SET TABLE TESTSCHEMA.blob_table (
  id INTEGER
, blob_col BLOB(10485760))
PRIMARY INDEX ( id );

select * from blob_table;

 *** Query completed. 2 rows found. 2 columns returned. 
 *** Total elapsed time was 1 second.

         id blob_col
----------- ---------------------------------------------------------------
          1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
          2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Now, we convert the table with AWS SCT and build it on Amazon Redshift:

myredshift=# \d testschema.blob_table;
                  Table "testschema.blob_table"
  Column   |           Type           | Collation | Nullable | Default
 id        | integer                  |           |          |
 blob_col  | character varying(1300)  |           |          |

Then we use the AWS SCT data extractors to migrate the table data from Teradata to Amazon Redshift.

When we look at the table in Amazon Redshift, you can see the paths to the S3 files that contain the BLOB values:

myredshift=# select * from testschema.blob_table;
(2 rows)

 id |                                                               blob_col                                                               
  2 | s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/308b6f0a902941e793212058570cdda5.dat
  1 | s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/a7686067af5549479b52d81e83c3871e.dat

And on Amazon S3, you can see the actual data files. There are two, one for each BLOB value:

$ aws s3 ls s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/
2021-05-13 23:59:47         23 522fee54fda5472fbae790f43e36cba1.dat
2021-05-13 23:59:47         24 5de6c53831f741629476e2c2cbc6b226.dat

BLOB support is available now in AWS SCT and the AWS SCT data extractors. Download the latest version of the application and try it out today.

Multi-byte CHARACTER conversion

Teradata supports multibyte characters in CHARACTER data columns, which are fixed length fields. Amazon Redshift supports multibyte characters in CHARACTER VARYING fields but not in fixed-length CHARACTER columns.

Previously, if you had fixed-length CHARACTER columns, you had to determine if they contained multibyte character data, and increase the target column size as appropriate.

AWS SCT now bridges this gap for you. If your Teradata tables contain CHARACTER columns with multibyte characters, AWS SCT automatically converts these columns to Amazon Redshift CHARACTER VARYING fields and sets the column sizes accordingly. Consider the following example, which contains four columns: a LATIN column that contains only single-byte characters, and UNICODE, GRAPHIC, and KANJISJIS columns that can contain multibyte characters:

create table testschema.char_table (
  latin_col char(70) character set latin
, unicode_col char(70) character set unicode
, graphic_col char(70) character set graphic
, kanjisjis_col char(70) character set kanjisjis
);

AWS SCT translates the LATIN column to a fixed length CHARACTER column. The multi-byte columns are upsized and converted to CHARACTER VARYING:

CREATE TABLE IF NOT EXISTS testschema.char_table (
  latin_col CHARACTER(70)
, unicode_col CHARACTER VARYING(210)
, graphic_col CHARACTER VARYING(210)
, kanjisjis_col CHARACTER VARYING(210)
)
DISTSTYLE KEY
DISTKEY
(latin_col)
SORTKEY
(latin_col);

Automatic conversion for multibyte CHARACTER columns is available in AWS SCT now.

GEOMETRY data type size

Amazon Redshift has long supported geospatial data with a GEOMETRY data type and associated spatial functions.

Previously, Amazon Redshift restricted the maximum size of a GEOMETRY column to 64 KB, which constrained some customers with large objects. Now, we’re happy to share that the maximum size of GEOMETRY objects has been increased to just under 1 MB (specifically, 1,048,447 bytes).

For example, consider the following Teradata table:

create table geometry_table (
 id INTEGER
, geometry_col1 ST_GEOMETRY 
, geometry_col2 ST_GEOMETRY(1000)
, geometry_col3 ST_GEOMETRY(1048447) 
, geometry_col4 ST_GEOMETRY(10484470)
, geometry_col5 ST_GEOMETRY INLINE LENGTH 1000
)
;

You can use AWS SCT to convert it to Amazon Redshift. The converted table definition is as follows. A size specification isn’t needed on the converted columns because Amazon Redshift implicitly sets the column size.

CREATE TABLE IF NOT EXISTS testschema.geometry_table(
id INTEGER,
geometry_col1 GEOMETRY,
geometry_col2 GEOMETRY,
geometry_col3 GEOMETRY,
geometry_col4 GEOMETRY,
geometry_col5 GEOMETRY
)
DISTSTYLE KEY
DISTKEY
(
id
)
SORTKEY
(
id
);
ALTER TABLE testschema.geometry_table ALTER DISTSTYLE AUTO;
ALTER TABLE testschema.geometry_table ALTER SORTKEY AUTO;

Large GEOMETRY columns are available in Amazon Redshift now. For more information, see Querying spatial data in Amazon Redshift.

Conclusion

We’re happy to share these new features with you. If you’re contemplating a migration to Amazon Redshift, these capabilities can help automate your schema conversion and preserve your investment in existing reports, applications, and ETL, as well as accelerate your query performance.

This post described a few of the dozens of new features we have recently introduced to automate your data warehouse migrations to Amazon Redshift. We will share more in upcoming posts. You’ll hear about additional SQL automation, a purpose-built scripting language for Amazon Redshift with BTEQ compatibility, and automated support for proprietary SQL features.

Check back soon for more information. Until then, you can learn more about Amazon Redshift and the AWS Schema Conversion Tool on the AWS website. Happy migrating!


About the Author

Michael Soo is a database engineer with the AWS DMS and AWS SCT team at Amazon Web Services.

How Jobcase is using Amazon Redshift ML to recommend job search content at scale

Post Syndicated from Clay Martin original https://aws.amazon.com/blogs/big-data/how-jobcase-is-using-amazon-redshift-ml-to-recommend-job-search-content-at-scale/

This post is co-written with Clay Martin and Ajay Joshi from Jobcase as the lead authors.

Jobcase is an online community dedicated to empowering and advocating for the world’s workers. We’re the third-largest destination for job search in the United States, and connect millions of Jobcasers to relevant job opportunities, companies, and other resources on a daily basis. Recommendation is central to everything we do.

In this post, Jobcase data scientists discuss how Amazon Redshift ML helped us generate billions of job search recommendations in record time and with improved relevance.

The challenge: Scaling job recommendations

At Jobcase, we have used Amazon Redshift as our primary data warehouse for 8 years. Over the years, we built up a significant amount of historical job seeker and job content interaction data, which is stored in highly optimized compressed tables.

Our recommender system applies machine learning (ML) models to these big datasets, which poses a familiar problem: the data and ML models aren’t colocated on the same compute clusters, so large amounts of data must move across networks. In many cases, at batch inference time, data must be shuttled out of, and eventually brought back into, a data warehouse, which can be a time-consuming, expensive, and sometimes error-prone process. This also requires data engineering resources to set up data pipelines, and often becomes a bottleneck for data scientists who want to experiment quickly and drive business value. Until now, this data/model colocation issue has proven to be a big obstacle to applying ML to batch recommendation at scale.

How Amazon Redshift ML helped solve this challenge

Amazon Redshift ML, powered by Amazon SageMaker Autopilot, makes it easy for data analysts and database developers to create, train, and apply ML models using familiar SQL commands in Amazon Redshift data warehouses.

Amazon Redshift ML has proven to be a great solution to some of these problems at Jobcase. With Amazon Redshift ML’s in-database local inference capability, we now perform model inference on billions of records in a matter of minutes, directly in our Amazon Redshift data warehouse. In this post, we talk about our journey to Amazon Redshift ML, our previous attempts to use ML for recommendations, and where we go from here.

What we’re trying to solve

Job search is a unique and challenging domain for recommender system design and implementation. There are extraordinary variables to consider. For instance, many job openings only last for days or weeks, and they must be within a reasonable commute for job seekers. Some jobs require skills that only a subset of our members possess. These constraints don’t necessarily apply to, say, movie recommendations. On the other hand, job preferences are relatively stable; if a member is interested in truck driver jobs on Monday, there’s a good chance they’re still interested on Tuesday.

The Jobcase recommender system is responsible for generating job search content for over 10 million active Jobcasers per day. On average, every day we have about 20–30 million unique job listings in the eligible pool for recommendations. The system runs overnight and generates predictions in batch mode, and is expected to be completed by early morning hours. These recommendations are used throughout the day to engage with Jobcase members through various communication channels like email, SMS, and push notifications.

Our recommendation system

Each communication channel has its own peculiarities and deliverability constraints. To handle these constraints, our recommender system is broken into multiple steps and phases, where the final step is used to fine-tune the model for a particular channel. All of this multi-channel job seeker interaction data resides in our Amazon Redshift cluster.

In phase one, we apply unsupervised learning techniques to reduce the candidate set of items per member. We calculate item-item similarity scores from members’ engagement histories, and use these scores to generate the N most similar items per user. This collaborative filtering phase poses an important design trade-off: filter out too many relevant candidate items, and member engagement drops; filter out too few, and downstream inference remains computationally infeasible.
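
Jobcase doesn’t describe its exact similarity measure here, but the following minimal NumPy sketch illustrates the general idea of item-item scores computed from engagement histories, using cosine similarity on a tiny hypothetical member-by-item matrix.

import numpy as np

# Hypothetical member x item engagement matrix (1 = engaged, 0 = not).
interactions = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
], dtype=float)

# Item-item cosine similarity derived from members' engagement histories.
norms = np.linalg.norm(interactions, axis=0, keepdims=True)
similarity = (interactions.T @ interactions) / (norms.T @ norms + 1e-9)
np.fill_diagonal(similarity, 0.0)

# N most similar items per item; these feed the per-member candidate sets.
N = 2
top_n = np.argsort(-similarity, axis=1)[:, :N]
print(top_n)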

The second phase is a channel-specific supervised learning phase. It uses the similarity score from the first phase along with other predicted metrics as features and attributes, and tries to directly predict member engagement for that channel. In this example, let’s assume email is the channel and member engagement is captured by the dependent variable email click-through rate (CTR) = email clicks/email sends.

Here we also include job seeker features, such as educational attainment, commute preferences, location, and so on, as well as item or job content features, such as channel-specific historical macro or local engagement rates. Generating predictions for over 10 million job seekers paired to 200–300 items per member requires 2–3 billion predictions for just one channel.

Simplifying ML from within Amazon Redshift without any data movement

Until now, our data/model colocation problem has been challenging to solve from a cost and performance perspective. This is where Amazon Redshift ML has been instrumental in significantly improving the recommender system by enabling billions of non-linear model predictions in just a few minutes.

Before Amazon Redshift ML, we needed to write custom data pipelines to get data out of the data warehouse to Amazon Simple Storage Service (Amazon S3), then to ML inference instances, and finally pipe predictions back into the data warehouse for consumption. This added additional time delays and cost. Historically, it has been a challenge to improve this phase, and we had to rely on relatively simple linear models, optimized via A/B testing and hard-coded into SQL statements.

With Amazon Redshift ML, we were able to bring cutting-edge model classes with in-database local inference capabilities directly into our data warehouse. Therefore, the expressive power of the models that we could fit vastly increased. The following architecture diagram shows how we simplified our data pipeline with Amazon Redshift ML.

Our success story with Amazon Redshift ML

We have previously attempted to move one or both phases of our system out of Amazon Redshift. To improve our collaborative filtering phase, we tried using open-source libraries on Amazon Elastic Compute Cloud (Amazon EC2) instances that implement matrix factorization algorithms and natural language processing (NLP) inspired techniques such as Global Vectors (GloVe), which are distributed word representations. None of these solutions generated enough improvement in terms of member engagement to justify the increased data pipeline complexity, operational time delays, and operational expense. Pipelines to improve supervised user-item scoring had similar difficulties.

When Amazon Redshift ML was released in preview in December 2020, we spun up a small Amazon Redshift cluster to test its capabilities against our use cases. We were immediately struck by the fact that Amazon Redshift ML makes fitting an XGBoost model or a feed-forward neural network as easy as writing a SQL query. When Amazon Redshift ML became generally available at the end of May 2021, we set it up in production within a day, and deployed a production model within a week. The following is a sample model that we trained and predicted with Amazon Redshift ML.

The following is the training code:

CREATE MODEL f_ml_predict_email_content1_ctr
FROM (SELECT * FROM Email_content1_click_history)
TARGET is_click
FUNCTION f_ml_predict_email_content1_ctr
PROBLEM_TYPE REGRESSION
OBJECTIVE 'MSE';

The following is the prediction code:

UPDATE email_content1_new_data
SET ctr_score = f_ml_predict_email_content1_ctr(
      feature1
    , feature2
    , . . .
    , feature_n
);

Now we have several models in production, each performing billions of predictions in Amazon Redshift. The following are some of the key benefits we realized with Amazon Redshift ML:

  • Running model predictions at scale, performing billions of predictions in minutes, which we couldn’t achieve before implementing Amazon Redshift ML
  • Significant reduction in the model development cycle by eliminating the data pipelines
  • Significant reduction in model testing cycles by testing bigger cohort sizes, which helped reach our desired statistical significance quickly
  • Reduced cost by using Amazon Redshift ML’s local in-database inference capability, which saved cost on external ML frameworks and compute cost
  • 5–10% improvement in member engagement rates across several different email template types, resulting in increased revenue

Conclusion

In this post, we described how Amazon Redshift ML helped Jobcase effectively match millions of jobs to over 10 million active Jobcase members on a daily basis.

If you’re an Amazon Redshift user, Amazon Redshift ML provides an immediate and significant value add, with local in-database inference at no additional cost. It gives data scientists the ability to experiment quickly without data engineering dependencies. Although Amazon Redshift ML currently supports only regression and classification model types, it still achieves a great balance between speed, accuracy, complexity, and cost.


About Authors

Clay Martin is a Data Scientist at Jobcase Inc. He designs recommender systems to connect Jobcase members with the most relevant job content.

Ajay Joshi is Senior Software Engineer at Jobcase Inc. Ajay supports the data analytics and machine learning infrastructure at Jobcase and helps them design and maintain data warehouses powered by Amazon Redshift, Amazon SageMaker, and other AWS services.

Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data-driven solutions in different database and data warehousing technologies for over 15 years. He loves to learn new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS.

 

 

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Case-insensitive collation support for string processing in Amazon Redshift

Post Syndicated from Mengchu Cai original https://aws.amazon.com/blogs/big-data/case-insensitive-collation-support-for-string-processing-in-amazon-redshift/

Amazon Redshift is a fast, fully managed, cloud-native data warehouse. Tens of thousands of customers have successfully migrated their workloads to Amazon Redshift. We hear from customers that they need case-insensitive collation for strings in Amazon Redshift in order to maintain the same functionality and meet their performance goals when they migrate their existing workloads from legacy, on-premises data warehouses like Teradata, Oracle, or IBM. With that goal in mind, AWS provides an option to create case-insensitive and case-sensitive collation.

In this post, we discuss how to use case-insensitive collation and how to override the default collation. Also, we specifically explain the process to migrate your existing Teradata database using the native Amazon Redshift collation capability.

What is collation?

Collation is a set of rules that tells a database engine how to compare and sort CHAR and VARCHAR column data in SQL. A case-insensitive collation ignores the differences between uppercase and lowercase letters for string comparison and sorting, whereas a case-sensitive collation does not. For example, in case-insensitive collation, “A” and “a” are equal. Operations such as LIKE predicates, GROUP BY, ORDER BY, regular expressions, and SIMILAR TO behave according to the defined collation, while the stored data keeps its original case.

We can define collation at three levels:

  • Database
  • Column
  • Expression

Database-level collation

You can specify collation when creating a new database. All VARCHAR and CHAR columns in the current database pick up the database-level collation as default if no column-level override exists. If you don’t specify collation for a new database, the default collation continues to be the current semantic of case-sensitive.

To create a database with collation in Amazon Redshift, use the following syntax:

CREATE DATABASE database_name [ WITH ]
[ COLLATE CASE_SENSITIVE | COLLATE CASE_INSENSITIVE ]

To alter a database with collation in Amazon Redshift, use the following syntax:

ALTER DATABASE database_name
COLLATE CASE_SENSITIVE | COLLATE CASE_INSENSITIVE

This action only works when the database is empty. Therefore, it’s important to use collation as a first step before any database objects are created.

To find the collation of an existing database, use the following query.

select db_collation();

This provides output as case_sensitive or case_insensitive.

Column-level collation

You can specify collation for a VARCHAR or CHAR column when creating a new table. This overrides the database-level collation.

To define a table with column-level collation in Amazon Redshift, use the following syntax:

create table T1 (
col1 varchar(20) collate case_sensitive encode zstd,
col2 char(20) collate case_insensitive encode zstd,
col3 varchar(20) encode zstd /* not specified, use collation of current database */
);

To find the collation defined at the individual column level, use the following query:

 SELECT TABLE_NAME, COLUMN_NAME, data_type, COLLATION_NAME
FROM svv_columns WHERE TABLE_NAME = 't1' ORDER BY COLUMN_NAME;

Expression-level collation

You can use the collate function in a SQL query to override collation at the column level and database level.

To use the collate function in Amazon Redshift, use the following syntax. This example code converts col1 to case-insensitive and compares with the value john:

select * from T1 where collate(col1, 'case_insensitive') = 'john';

Solution overview

In the following use case, we discuss how to convert a legacy Teradata database’s collation syntax to Amazon Redshift syntax. In Teradata, the default CHAR and VARCHAR column collation depends on the session mode. We can override this default collation at the column and expression level. In Amazon Redshift, we can define collation at the database, column, and expression levels.

In this use case, let’s consider the migration of two tables, invoice and customer, and the corresponding queries built using these two tables from the Teradata database ci_db. Make sure that ci_db is not an existing database in Amazon Redshift. We perform the following steps to walk through the migration process:

  1. Identify the collation in Teradata.
  2. Set up the Amazon Redshift database and schema.
  3. Set up the Amazon Redshift DDL.
  4. Load sample data.
  5. Test the Reporting queries for collation syntax.

Identify the collation in Teradata

In Teradata, the default CHAR and VARCHAR column collation depends on the session mode. If the Teradata session is in ANSI mode, the default is case-sensitive; if it’s in Teradata mode, the default depends on the DefaultCaseSpec parameter at the cluster level. If the DefaultCaseSpec parameter is TRUE, the default is case-sensitive; if it’s FALSE, it’s case-insensitive. By default, the DefaultCaseSpec parameter is set to FALSE. We can override this default collation at the column and expression level.

SELECT transaction_mode FROM dbc.sessioninfo
WHERE sessionno = SESSION;

If the output is A, the session is in ANSI mode; if it’s T, it’s in Teradata mode. For this use case, let’s assume that the session is in Teradata mode.

To identify the collation override at the column level, run the following commands on the Teradata environment:

show table ci_db.invoice;
show table ci_db.customer;

We receive the following outputs:

CREATE MULTISET TABLE ci_db.invoice ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      invoice_id VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
      cust_nbr VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC
      )
PRIMARY INDEX ( invoice_id );


CREATE MULTISET TABLE ci_db.customer ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      cust_nbr VARCHAR(100) CHARACTER SET LATIN CASESPECIFIC,
      cust_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( cust_nbr );

Set up the Amazon Redshift database and schema

In Amazon Redshift, a database contains one or more named schemas. Each schema in a database contains tables and other kinds of named objects. By default, a database has a single schema, which is named PUBLIC. In general, we recommend creating each Teradata database as a schema in Amazon Redshift.

To create similar functionality of case insensitivity at the database level in Amazon Redshift, we create a new database ci_database with CASE_INSENSITIVE collation:

create database ci_database WITH COLLATE CASE_INSENSITIVE; 

After the database is created in Amazon Redshift, connect to the database by altering the database name from the default database to ci_database in the connection details. Let’s create the schema ci_db in the current database ci_database.

The following code creates the schema in Amazon Redshift and sets it as the default schema for the session:

create schema ci_db;
set search_path to ci_db;

Set up Amazon Redshift DDL

Based on the Teradata DDL output from earlier, we can observe the following:

  • The ci_db.invoice table has all the fields as case-insensitive.
  • The ci_db.customer table has cust_nbr as case-sensitive and cust_name as case-insensitive.

Because we created the case-insensitive database in Amazon Redshift, we need to mention the collation override at the column level for the case-sensitive columns only.

To attain similar collation logic in Amazon Redshift, create the tables with the following syntax:

create table ci_db.invoice 
(
invoice_id VARCHAR(100) ENCODE ZSTD
,cust_nbr VARCHAR(100) ENCODE ZSTD
)
DISTKEY ( invoice_id );



create table ci_db.customer 
(
cust_nbr VARCHAR(100) collate case_sensitive ENCODE ZSTD
,cust_name VARCHAR(100) ENCODE ZSTD 
)
DISTKEY ( cust_nbr );

Load sample data

In general, we recommend using the AWS SCT agent tool to migrate the data from Teradata to Amazon Redshift. For this post, let’s load the following sample data into the tables invoice and customer on Amazon Redshift. Run the following insert statements:

insert into ci_db.invoice (invoice_id ,cust_nbr) values ('inv1','a1');
insert into ci_db.invoice (invoice_id ,cust_nbr) values ('INV1','A1');
insert into ci_db.invoice (invoice_id ,cust_nbr) values ('inv2','b1');
insert into ci_db.invoice (invoice_id ,cust_nbr) values ('INV2','B1');

insert into ci_db.customer (cust_nbr,cust_name) values ( 'a1','John');
insert into ci_db.customer (cust_nbr,cust_name) values ( 'A1','David');
insert into ci_db.customer (cust_nbr,cust_name) values ( 'b1','Bob');
insert into ci_db.customer (cust_nbr,cust_name) values ( 'B1','Mary');

Test the Reporting queries for collation syntax

Let’s review the data that we loaded into both the tables:

select * from ci_db.invoice;

invoice_id | cust_nbr 
------------+----------
 inv1       | a1
 INV1       | A1
 inv2       | b1
 INV2       | B1
(4 rows)

select * from ci_db.customer;

cust_nbr | cust_name 
----------+-----------
 a1       | John 
 A1       | David
 b1       | Bob
 B1       | Mary
(4 rows)

Default collation column query

Run the following query on both environments and observe that both inv1 and INV1 invoices are returned because the invoice_id column has the default database collation (case-insensitive):

select * from ci_db.invoice where invoice_id ='inv1';

invoice_id | cust_nbr 
------------+----------
 inv1       | a1
 INV1       | A1
(2 rows)

Case-sensitive collation column query

Run the following query on both environments and observe that only the customer a1 is returned; the customer A1 is ignored because the cust_nbr field collation is case-sensitive:

select * from ci_db.customer where cust_nbr ='a1';

cust_nbr | cust_name 
----------+-----------
 a1       | John
(1 row)

Case-insensitive expression override collation query

Run the following query and observe that applying an expression-level case-insensitive override to the case-sensitive column cust_nbr returns both customers a1 and A1.

The following is the Teradata syntax:

select * from ci_db.customer where cust_nbr (not casespecific) ='a1';
cust_nbr | cust_name 
----------+-----------
 a1       | John
 A1       | David
(2 rows)

The following is the Amazon Redshift syntax:

select * from ci_db.customer where collate(cust_nbr ,'case_insensitive') ='a1';

cust_nbr | cust_name 
----------+-----------
 a1       | John
 A1       | David
(2 rows)

Expression-level override in join condition query

When columns with different collations are joined, Teradata performs a case-sensitive join by default. To achieve similar functionality in Amazon Redshift, we have to apply a case-sensitive expression-level override to the case-insensitive column. In the following example, the cust_nbr column is case-insensitive in the invoice table, whereas it’s case-sensitive in the customer table.

The following is the Teradata syntax:

select 
 inv.invoice_id
,inv.cust_nbr as cust_nbr_from_invoice
,cust.cust_nbr as cust_nbr_from_customer
,cust.cust_name  

from ci_db.invoice inv
inner join ci_db.customer cust 
on inv.cust_nbr = cust.cust_nbr
where cust.cust_nbr = 'a1'
order by inv.invoice_id;

invoice_id | cust_nbr_from_invoice | cust_nbr_from_customer | cust_name 
------------+-----------------------+------------------------+-----------
 inv1       | a1                    | a1                     | John
(1 row)

The following is the Amazon Redshift syntax:

select 
 inv.invoice_id
,inv.cust_nbr as cust_nbr_from_invoice
,cust.cust_nbr as cust_nbr_from_customer
,cust.cust_name  

from ci_db.invoice inv
inner join ci_db.customer cust 
on collate (inv.cust_nbr , 'case_sensitive') = cust.cust_nbr
where cust.cust_nbr = 'a1'
order by inv.invoice_id;

invoice_id | cust_nbr_from_invoice | cust_nbr_from_customer | cust_name 
------------+-----------------------+------------------------+-----------
 inv1       | a1                    | a1                     | John
(1 row)

Materialized views with collation overrides

To perform complex queries on large tables in Amazon Redshift, we can create materialized views to reduce the time it takes to compute the results. We can create materialized views on top of Amazon Redshift tables, and the collations defined on the underlying columns are honored.

The following code creates a materialized view in Amazon Redshift:

create materialized view customer_mv AS
select collate(cust_nbr,'case_insensitive') as cust_nbr_ci  ,cust_name from ci_db.customer;

Run the following query and observe that both customers a1 and A1 are returned, because the materialized view applies a case-insensitive override at the view level even though the underlying base table column is case-sensitive:

SELECT * FROM customer_mv WHERE cust_nbr_ci ='a1';

cust_nbr_ci | cust_name 
------------+-----------
 a1         | John
 A1         | David
(2 rows)

Identify the column-level collations in Teradata

For Teradata to Amazon Redshift migrations, it’s important to identify all columns that carry a column-level collation override in Teradata. You can use the following query to identify the collation override for each column:

select databasename,tablename,columnname,uppercaseflag from dbc.columns where databasename ='ci_db';

If UpperCaseFlag shows C, the column is case-sensitive; N means the column is not case-sensitive.
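If you want to narrow this list to just the case-sensitive columns that need a column-level override in Amazon Redshift, you can filter on that flag. The following is a minimal sketch of such a filter:

select databasename, tablename, columnname
from dbc.columns
where databasename = 'ci_db'
and uppercaseflag = 'C';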

Things to consider

  • All string comparison operators, the LIKE operator, ORDER BY and GROUP BY clauses, aggregate functions, window functions, and scalar functions follow the database and column collation (see the sketch after this list).
  • If a function or an operation returns the VARCHAR or CHAR type and takes multiple inputs with different collation types (case-sensitive and case-insensitive), you should use the expression-level override.
  • External queries, including Amazon Redshift Spectrum and Amazon Aurora PostgreSQL federated queries, use database-level collation only.
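As an example, the following is a minimal sketch (using the sample ci_db.customer table from earlier) that applies an expression-level override so that grouping and ordering on the case-sensitive cust_nbr column behave case-insensitively; the alias cust_nbr_ci is only illustrative:

select collate(cust_nbr, 'case_insensitive') as cust_nbr_ci
,count(*) as customer_count
from ci_db.customer
group by collate(cust_nbr, 'case_insensitive')
order by cust_nbr_ci;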

For other details and limitations, see Amazon Redshift Database Collation documentation.

We recommend using the AWS Schema Conversion Tool (AWS SCT) to accelerate your Teradata migration to Amazon Redshift. Refer to the related blog post for more details.

Conclusion

This post demonstrated how to use collation for string processing at the database, column, and expression level in Amazon Redshift. We also walked through migrating existing Teradata database collations to Amazon Redshift syntax.

The Amazon Redshift collation feature for string processing reduces the effort required to migrate from traditional on-premises MPP data warehouses such as Teradata to Amazon Redshift without refactoring your application code. It also helps you meet your performance goals in Amazon Redshift while preserving the case-insensitive behavior that was the default on premises.

We hope you can take advantage of this new Amazon Redshift feature to migrate to the AWS Cloud for database freedom.


About the authors

Mengchu Cai is a principal engineer at AWS. He works on Amazon Redshift query optimization, query performance, and SQL functionality challenges.


Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build data lakes and analytical applications on the AWS Cloud.


Yichao Xue is a Software Engineer with Amazon Redshift Query Processing team. He enjoys solving challenging problems for different components of Redshift, including workload management, Redshift Spectrum, federated queries, and recently case-insensitive collation. Outside work, he likes reading, watching movies, listening to music and traveling around the world.

Data preparation using Amazon Redshift with AWS Glue DataBrew

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/data-preparation-using-amazon-redshift-with-aws-glue-databrew/

With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, Amazon Aurora, and other Amazon Relational Database Service (Amazon RDS) databases. You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this post, we use DataBrew to clean data from an Amazon Redshift table, and transform and use different feature engineering techniques to prepare data to build a machine learning (ML) model. Finally, we store the transformed data in an S3 data lake to build the ML model in Amazon SageMaker.

Use case overview

For our use case, we use mock student datasets that contain student details like school, student ID, name, age, student study time, health, country, and marks. The following screenshot shows an example of our data.

For our use case, the data scientist uses this data to build an ML model that predicts a student’s score in the upcoming annual exam. However, this raw data requires cleaning and transformation. A data engineer must perform the required data transformations so the data scientist can use the transformed data to build the model in SageMaker.

Solution overview

The following diagram illustrates our solution architecture.

The workflow includes the following steps:

  1. Create a JDBC connection for Amazon Redshift and a DataBrew project.
  2. AWS Glue DataBrew queries sample student performance data from Amazon Redshift and performs the transformation and feature engineering needed to prepare the data for building the ML model.
  3. The DataBrew job writes the final output to our S3 output bucket.
  4. The data scientist builds the ML model in SageMaker to predict student marks in an upcoming annual exam.

We cover steps 1–3 in this post.

Prerequisites

To complete this solution, you should have an AWS account.

Prelab setup

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

For our use case, we use a mock dataset. You can download the DDL and data files from GitHub.

  1. Create the Amazon Redshift cluster to capture the student performance data.
  2. Set up a security group for Amazon Redshift.
  3. Create a schema called student_schema and a table called study_details. You can use DDL.sql to create the database objects.
  4. We recommend using the COPY command to load a table in parallel from data files on Amazon S3. However, for this post, you can use study_details.sql to insert the data into the table. A rough sketch of these statements follows this list.
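The following is a rough sketch of what the schema, table, and load statements could look like. The column names and types shown here are assumptions based on the dataset described in this post, so rely on the DDL and data files from GitHub for the exact definitions:

CREATE SCHEMA IF NOT EXISTS student_schema;

CREATE TABLE student_schema.study_details (
  school_name      VARCHAR(100), -- assumed column names and types
  first_name       VARCHAR(50),
  last_name        VARCHAR(50),
  gender           VARCHAR(10),
  age              INT,
  study_time_in_hr DECIMAL(5,2),
  health           VARCHAR(20),
  country          VARCHAR(50),
  marks            INT
);

-- COPY loads the table in parallel from Amazon S3; replace the bucket path and IAM role with your own
COPY student_schema.study_details
FROM 's3://<your-bucket>/study_details/'
IAM_ROLE '<< your IAM role >>'
FORMAT AS CSV
IGNOREHEADER 1;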

Create an Amazon Redshift connection

To create your Amazon Redshift connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.
  3. For Connection name, enter a name (for example, student-db-connection).
  4. For Connection type, select JDBC.
  5. Provide other parameters like the JDBC URL and login credentials.
  6. In the Network options section, choose the VPC, subnet, and security groups of your Amazon Redshift cluster.
  7. Choose Create connection.

Create datasets

To create the datasets, complete the following steps:

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.
  2. For Dataset name, enter a name (for example, student).
  3. For Your JDBC source, choose the connection you created (AwsGlueDatabrew-student-db-connection).
  4. Select the study_details table.
  5. For Enter S3 destination, enter an S3 bucket for Amazon Redshift to store the intermediate result.
  6. Choose Create dataset.

You can also configure a lifecycle rule to automatically clean up old files from the S3 bucket.

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, on the Projects page, choose Create project.
  2. For Project Name, enter student-proj.
  3. For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

  1. For Select a dataset, select My datasets.
  2. Select the student dataset.
  3. For Role name, choose the AWS Identity and Access Management (IAM) role to be used with DataBrew.
  4. Choose Create project.

You can see a success message along with our Amazon Redshift study_details table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Create a profiling job

DataBrew helps you evaluate the quality of your data by profiling it to understand data patterns and detect anomalies.

To create your profiling job, complete the following steps:

  1. On the DataBrew console, choose Jobs in the navigation pane.
  2. On the Profile jobs tab, choose Create job.
  3. For Job name, enter student-profile-job.
  4. Choose the student dataset.
  5. Provide the S3 location for job output.
  6. For Role name, choose the role to be used with DataBrew.
  7. Choose Create and run job.

Wait for the job to complete.

  1. Choose the Columns statistics tab.

You can see that the age column has some missing values.

You can also see that the study_time_in_hr column has two outliers.

Build a transformation recipe

All ML algorithms use input data to generate outputs. Input data comprises features usually in structured columns. To work properly, the features need to have specific characteristics. This is where feature engineering comes in. In this section, we perform some feature engineering techniques to prepare our dataset to build the model in SageMaker.

Let’s drop the unnecessary columns from our dataset that aren’t required for model building.

  1. Choose Column and choose Delete.
  2. For Source columns, choose the columns school_name, first_name, and last_name.
  3. Choose Apply.

We know from the profiling report that the age value is missing in two records. Let’s fill in the missing value with the median age of other records.

  1. Choose Missing and choose Fill with numeric aggregate.
  2. For Source column, choose age.
  3. For Numeric aggregate, choose Median.
  4. For Apply transform to, select All rows.
  5. Choose Apply.

We know from the profiling report that the study_time_in_hr column has two outliers, which we can remove.

  1. Choose Outliers and choose Remove outliers.
  2. For Source column, choose study_time_in_hr.
  3. Select Z-score outliers.
  4. For Standard deviation threshold, choose 3.
  5. Select Remove outliers.
  6. Under Remove outliers, select All outliers.
  7. Under Outlier removal options, select Delete outliers.
  8. Choose Apply.
  9. Choose Delete rows, then choose Apply.

The next step is to convert the categorical value to a numerical value for the gender column.

  1. Choose Mapping and choose Categorical mapping.
  2. For Source column, choose gender.
  3. For Mapping options, select Map top 1 values.
  4. For Map values, select Map values to numeric values.
  5. For M, choose 1.
  6. For Others, choose 2.
  7. For Destination column, enter gender_mapped.
  8. For Apply transform to, select All rows.
  9. Choose Apply.

ML algorithms often can’t work with categorical label data directly; they require the input variables to be numeric. One-hot encoding is one technique that converts categorical data without an ordinal relationship into numeric data.
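Conceptually, one-hot encoding adds one indicator column per category value. The following is a minimal SQL sketch of what the resulting indicator columns look like (DataBrew creates these columns for you through the recipe step); the health values 'good' and 'bad' are assumed purely for illustration:

SELECT age
,study_time_in_hr
,CASE WHEN health = 'good' THEN 1 ELSE 0 END AS health_good -- one indicator column per assumed category value
,CASE WHEN health = 'bad' THEN 1 ELSE 0 END AS health_bad
FROM student_schema.study_details;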

To apply one-hot encoding, complete the following steps:

  1. Choose Encode and choose One-hot encode column.
  2. For Source column, choose health.
  3. For Apply transform to, select All rows.
  4. Choose Apply.

The following screenshot shows the full recipe that we applied to our dataset before we can use it to build our model in SageMaker.

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter student-performance.

We use CSV as the output format.

  1. For File type, choose CSV.
  2. For Role name, choose an existing role or create a new one.
  3. Choose Create and run job.
  4. Navigate to the Jobs page and wait for the student-performance job to complete.
  5. Choose the Destination link to navigate to Amazon S3 to access the job output.

Clean up

Delete the following resources that might accrue cost over time:

  • The Amazon Redshift cluster
  • The recipe job student-performance
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project student-proj and its associated recipe student-proj-recipe
  • The DataBrew datasets

Conclusion

In this post, we saw how to create a JDBC connection for an Amazon Redshift data warehouse. We learned how to use this connection to create a DataBrew dataset for an Amazon Redshift table. We also saw how easily we can bring data from Amazon Redshift into DataBrew, seamlessly apply transformations and feature engineering techniques, and run recipe jobs that refresh the transformed data for ML model building in SageMaker.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.


Work with semistructured data using Amazon Redshift SUPER

Post Syndicated from Satish Sathiya original https://aws.amazon.com/blogs/big-data/work-with-semistructured-data-using-amazon-redshift-super/

With the new SUPER data type and the PartiQL language, Amazon Redshift expands data warehouse capabilities to natively ingest, store, transform, and analyze semi-structured data. Semi-structured data (such as weblogs and sensor data) falls under the category of data that doesn’t conform to the rigid schema expected in relational databases. It often contains complex values such as arrays and nested structures that are associated with serialization formats, such as JSON.

The schema of the JSON can evolve over time according to the business use case. Traditional SQL users who are experienced in handling structured data often find it challenging to deal with semi-structured data sources such as nested JSON documents due to lack of SQL support, the need to learn multiple complex functions, and the need to use third-party tools.

This post is part of a series that talks about ingesting and querying semi-structured data in Amazon Redshift using the SUPER data type.

With the introduction of the SUPER data type, Amazon Redshift provides a rapid and flexible way to ingest JSON data and query it without the need to impose a schema. This means that you don’t need to worry about the schema of the incoming document, and can load it directly into Amazon Redshift without any ETL to flatten the data. The SUPER data type is stored in an efficient binary encoded Amazon Redshift native format.

The SUPER data type can represent the following types of data:

  • An Amazon Redshift scalar value:
    • A null
    • A Boolean
    • Amazon Redshift numbers, such as SMALLINT, INTEGER, BIGINT, DECIMAL, or floating point (such as FLOAT4 or FLOAT8)
    • Amazon Redshift string values, such as VARCHAR and CHAR
  • Complex values:
    • An array of values, including scalar or complex
    • A structure, also known as tuple or object, that is a map of attribute names and values (scalar or complex)

For more information about the SUPER type, see Ingesting and querying semistructured data in Amazon Redshift.

After the semi-structured and nested data is loaded into the SUPER data type, you can run queries on it by using the PartiQL extension of SQL. PartiQL is backward-compatible with SQL. It enables seamless querying of semi-structured and structured data and is used by multiple AWS services, such as Amazon DynamoDB, Amazon Quantum Ledger Database (Amazon QLDB), and AWS Glue Elastic Views. With PartiQL, the query engine can work with schema-less SUPER data that originated in serialization formats, such as JSON. With the use of PartiQL, familiar SQL constructs seamlessly combine access to both the classic, tabular SQL data and the semi-structured data in SUPER. You can perform object and array navigation and also unnesting with simple and intuitive extensions to SQL semantics.

For more information about PartiQL, see Announcing PartiQL: One query language for all your data.

Use cases

The SUPER data type is useful when processing and querying semi-structured or nested data such as web logs, data from industrial Internet of Things (IoT) machines and equipment, sensors, genomics, and so on. To explain the different features and functionalities of SUPER, we use sample industrial IoT data from manufacturing.

The following diagram shows the sequence of events in which the data is generated, collected, and finally stored in Amazon Redshift as the SUPER data type.

Manufacturers are embracing cloud solutions with connected machines and factories to transform and use data to make data-driven decisions, so they can optimize operations, increase productivity, and improve availability while reducing costs.

As an example, the following diagram depicts a common asset hierarchy of a fictional smart manufacturing company.

This data is typically semi-structured and hierarchical in nature. To find insights from this data using traditional methods and tools, you need to extract, preprocess, load, and transform it into the proper structured format to run typical analytical queries using a data warehouse. The time to insight is delayed because of the initial steps required for data cleansing and transformation typically performed using third-party tools or other AWS services.

Amazon Redshift SUPER handles these use cases by helping manufacturers extract, load, and query (without any transformation) a variety of data sources collected from edge computing and industrial IoT devices. Let’s use this sample dataset to drive our examples to explain the capabilities of Amazon Redshift SUPER.

The following is an example subscription dataset for assets (such as crowder, gauge, and pneumatic cylinder) in the workshop, which collects metrics on different properties (such as air pressure, machine state, finished parts count, and failures). Data on these metrics is generated continuously in a time series fashion and pushed to the AWS Cloud using the AWS IoT SiteWise connector. This specific example collects the Machine State property for an asset. The following sample data called subscriptions.json has scalar columns like topicId and qos, and a nested array called messages, which has metrics on the assets.

The following is another dataset called asset_metadata.json, which describes different assets and their properties, more like a dimension table. In this example, the asset_name is IAH10 Line 1, which is a processing plant line for a specific site.

Load data into SUPER columns

Now that we covered the basics behind Amazon Redshift SUPER and the industrial IoT use case, let’s look at different ways to load this dataset.

Copy the JSON document into multiple SUPER data columns

In this use case for handling semi-structured data, the user knows the incoming data’s top-level schema and structure, but some of the columns are semi-structured and nested in structures or arrays. You can choose to shred a JSON document into multiple columns that can be a combination of the SUPER data type and Amazon Redshift scalar types. The following code shows what the table DDL looks like if we translate the subscriptions.json semi-structured schema into a SUPER equivalent:

CREATE TABLE subscription_auto (
topicId INT
,topicFilter VARCHAR
,qos INT
,messages SUPER
);

To load data into this table, specify the auto option along with FORMAT JSON in the COPY command to split the JSON values across multiple columns. The COPY command matches the JSON attributes with column names and allows nested values, such as JSON arrays and objects, to be ingested as SUPER values. Run the following command to ingest data into the subscription_auto table. Replace the AWS Identity and Access Management (IAM) role with your own credentials. The ignorecase option passed along with auto is required only if your JSON attribute names are in CamelCase; in our case, the scalar columns topicId and topicFilter are in CamelCase.

COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 'auto ignorecase';

A select * from subscription_auto command looks like the following code. The messages SUPER column holds the entire array in this case.

-- A sample output
SELECT * FROM subscription_auto;
[ RECORD 1 ]
topicid     | 1001
topicfilter | $aws/sitewise/asset-models/+/assets/+/properties/+
qos         | 0
messages    | [{"format":"json","topic":"$aws\/sitewise\/asset-models\/8926cf44-14ea-4cd8-a7c6-e61af641dbeb\/assets\/0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c\/properties\/3ff67d41-bf69-4d57-b461-6f1513e127a4","timestamp":1616381297183,"payload":{"type":"PropertyValueUpdate","payload":{"assetId":"0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c","propertyId":"3ff67d41-bf69-4d57-b461-6f1513e127a4","values":[{"timestamp":{"timeInSeconds":1616381289,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381292,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381286,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381293,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381287,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":925000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381294,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381285,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381288,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}}]}}}]

Alternatively, you can specify jsonpaths to load the same data, as in the following code. The jsonpaths option is helpful if you want to load only selective columns from your JSON document. For more information, see COPY from JSON format.

COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 's3://redshift-downloads/semistructured/super-blog/jsonpaths/subscription_jsonpaths.json';

subscription_jsonpaths.json looks like the following:

{"jsonpaths": [
    "$.topicId",
    "$.topicFilter",
    "$.qos",
    "$.messages"
   ]
}

While we’re in the section about loading, let’s also create the asset_metadata table and load it with relevant data, which we need in our later examples. The asset_metadata table has more information about industry shop floor assets and their properties like asset_name, property_name, and model_id.

CREATE TABLE asset_metadata (
asset_id VARCHAR
,asset_name VARCHAR
,asset_model_id VARCHAR
,asset_property_id VARCHAR
,asset_property_name VARCHAR
,asset_property_data_type VARCHAR
,asset_property_unit VARCHAR
,asset_property_alias VARCHAR
);

COPY asset_metadata FROM 's3://redshift-downloads/semistructured/super-blog/asset_metadata/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 'auto';

Copy the JSON document into a single SUPER column

You can also load a JSON document into a single SUPER data column. This is typical when the schema of the incoming JSON is unknown and evolving. SUPER’s schema-less ingestion comes to the forefront here, letting you load the data in a flexible fashion.

For this use case, assume that we don’t know the names of the columns in subscriptions.json and want to load it into Amazon Redshift. It’s as simple as the following code:

CREATE TABLE subscription_noshred (s super);

After the table is created, we can use the COPY command to ingest data from Amazon Simple Storage Service (Amazon S3) into the single SUPER column. The noshred option is required along with FORMAT JSON; it tells the COPY parser not to shred the document but to load it into a single column.

COPY subscription_noshred FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 'noshred';

After the COPY has successfully ingested the JSON, the subscription_noshred table has a SUPER column s that contains the data of the entire JSON object. The ingested data maintains all the properties of the JSON nested structure but in a SUPER data type.

The following code shows how a select star (*) from subscription_noshred looks; the entire JSON structure is in the SUPER column s:

--A sample output
SELECT * FROM subscription_noshred;
[ RECORD 1 ]
s | {"topicId":1001,"topicFilter":"$aws\/sitewise\/asset-models\/+\/assets\/+\/properties\/+","qos":0,"messages":[{"format":"json","topic":"$aws\/sitewise\/asset-models\/8926cf44-14ea-4cd8-a7c6-e61af641dbeb\/assets\/0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c\/properties\/3ff67d41-bf69-4d57-b461-6f1513e127a4","timestamp":1616381297183,"payload":{"type":"PropertyValueUpdate","payload":{"assetId":"0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c","propertyId":"3ff67d41-bf69-4d57-b461-6f1513e127a4","values":[{"timestamp":{"timeInSeconds":1616381289,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381292,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381286,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381293,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381287,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":925000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381294,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381285,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381288,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}}]}}}]}

Similar to the noshred option, we can also use jsonpaths to load complete documents. This can be useful in cases where we want to extract additional columns, such as distribution and sort keys, while still loading the complete document as a SUPER column. In the following example, we map our first column to the root JSON object, while also mapping a distribution key to the second column, and a sort key to the third column.

subscription_sorted_jsonpaths.json looks like the following:

{"jsonpaths": [
"$",
"$.topicId",
"$.topicFilter"
]
}

CREATE TABLE subscription_sorted (
s super
,topicId INT
,topicFilter VARCHAR
) DISTKEY(topicFilter) SORTKEY(topicId);

COPY subscription_sorted FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 's3://redshift-downloads/semistructured/superblog/jsonpaths/subscription_sorted_jsonpaths.json';

Copy data from columnar formats like Parquet and ORC

If your semi-structured or nested data is already available in either Apache Parquet or Apache ORC formats, you can use the COPY command with the SERIALIZETOJSON option to ingest data into Amazon Redshift. The Amazon Redshift table structure should match the number of columns and the column data types of the Parquet or ORC files. Amazon Redshift can replace any Parquet or ORC column, including structure and array types, with SUPER data columns. The following are the COPY examples to load from Parquet and ORC format:

--Parquet 
COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions_parquet/' 
IAM_ROLE '<< your IAM role >>' FORMAT PARQUET SERIALIZETOJSON;

--ORC 
COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions_orc/' 
IAM_ROLE '<< your IAM role >>' FORMAT ORC SERIALIZETOJSON;

Apart from COPY, you can load the same data in columnar format in Amazon S3 using Amazon Redshift Spectrum and the INSERT command. The following example assumes the Redshift Spectrum external schema super_workshop and the external table subscription_parquet is already created and available in the database.

We need to set an important session-level configuration parameter for this to work: SET json_serialization_enable TO true. For more information, see Serializing complex nested JSON. This session-level parameter allows you to query top-level nested collection columns as serialized JSON.

/* -- Set the GUC for JSON serialization -- */ 
SET json_serialization_enable TO true;

INSERT INTO subscription_parquet 
SELECT topicId
,topicFilter
,qos
,JSON_PARSE(messages) FROM super_workshop.subscription_parquet;

As of this writing, we can’t use a SUPER column itself as a distribution or sort key. However, if we need to use one of the attributes within a SUPER column as a distribution or sort key while keeping the entire SUPER column intact as a separate column, we can use the following code as a workaround (apart from the jsonpaths example described earlier). For example, let’s assume the format attribute within the messages array needs to be used as a distribution key for the Amazon Redshift table:

SET json_serialization_enable TO true;

DROP TABLE IF EXISTS subscription_messages;

CREATE TABLE subscription_messages (
m_format VARCHAR
,m_messages super
) distkey (m_format);

--'format' scalar column is extracted from 'messages' super column 
INSERT INTO subscription_messages 
SELECT element.format::VARCHAR
,supercol FROM (
SELECT JSON_PARSE(messages) AS supercol FROM super_workshop.subscription_parquet
) AS tbl,tbl.supercol element;

Apart from using the Amazon Redshift COPY command, we can also ingest JSON-formatted data into Amazon Redshift using the traditional SQL INSERT command:

--Insert example 
INSERT INTO subscription_auto VALUES (
0000
,'topicFiltervalue'
,0000
,JSON_PARSE('[{"format":"json"},{"topic": "sample topic"},[1,2,3]]')
); 

The JSON_PARSE() function parses the incoming data in proper JSON format and helps convert it into the SUPER data type. Without the JSON_PARSE() function, Amazon Redshift treats and ingests the value as a single string into SUPER instead of a JSON-formatted value.
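The following is a minimal sketch, using a hypothetical demo table, that contrasts the two behaviors:

CREATE TABLE super_parse_demo (v SUPER);

-- Parsed into a SUPER object; v.format is navigable and returns "json"
INSERT INTO super_parse_demo VALUES (JSON_PARSE('{"format":"json","topic":"sample topic"}'));

-- Ingested as a single SUPER string; v.format returns NULL because there is no object to navigate
INSERT INTO super_parse_demo VALUES ('{"format":"json","topic":"sample topic"}');

SELECT v.format FROM super_parse_demo;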

Query SUPER columns

Amazon Redshift uses the PartiQL language to offer SQL-compatible access to relational, semi-structured, and nested data. PartiQL’s extensions to SQL are straightforward to understand, treat nested data as first-class citizens, and seamlessly integrate with SQL. The PartiQL syntax uses dot notation and array subscript for path navigation when accessing nested data.

The Amazon Redshift implementation of PartiQL supports dynamic typing for querying semi-structured data. This enables automatic filtering, joining, and aggregation on the combination of structured, semi-structured, and nested datasets. PartiQL allows the FROM clause items to iterate over arrays and use unnest operations.

The following sections focus on different query access patterns that involve navigating the SUPER data type with path and array navigation, unnest, or joins.

Navigation

We may want to find an array element by its index, or we may want to find the positions of certain elements in their arrays, such as the first element or the last element. We can reference a specific element simply by using the index of the element within square brackets (the index is 0-based) and within that element we can reference an object or struct by using the dot notation.

Let’s use our subscription_auto table to demonstrate the examples. We want to access the first element of the array, and within that we want to know the value of the attribute format:

SELECT messages[0].format FROM subscription_auto ;
--A sample output
 format
--------
 "json"
 "json"
 "json"
(3 rows)

Amazon Redshift can also use a table alias as a prefix to the notation. The following example is the same query as the previous example:

SELECT sub.messages[0].format FROM subscription_auto AS sub;
--A sample output
 format
--------
 "json"
 "json"
 "json"
(3 rows)

You can use the dot and bracket notations in all types of queries, such as filtering, join, and aggregation. You can use them in a query in which there are normally column references. The following example uses a SELECT statement that filters results:

SELECT COUNT(*) FROM subscription_auto WHERE messages[0].format IS NOT NULL;
-- A sample output
 count
-------
    11
(1 row)

Unnesting and flattening

To unnest or flatten an array, Amazon Redshift uses the PartiQL syntax to iterate over SUPER arrays by using the FROM clause of a query. We continue with the previous example in the following code, which iterates over the values of the array attribute messages:

SELECT c.*, o FROM subscription_auto c, c.messages AS o WHERE o.topic='sample topic';
-- A sample output
                             messages                 |        o
------------------------------------------------------+--------------------------
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | {"topic":"sample topic"}
(1 row)

The preceding query has one extension over standard SQL: the c.messages AS o, where o is the alias for the SUPER array messages and serves to iterate over it. In standard SQL, the FROM clause x (AS) y means “for each tuple y in table x.” Similarly, the FROM clause x (AS) y, if x is a SUPER value, translates to “for each (SUPER) value y in (SUPER) array value x.” The projection list can also use the dot and bracket notation for regular navigation.

When unnesting an array, if we also want the array subscripts (starting from 0), we can specify AT some_index right after the unnest expression. The following example iterates over both the array values and the array subscripts:

SELECT o, index FROM subscription_auto c, c.messages AS o AT index 
WHERE o.topic='sample topic';

-- A sample output 
       o                  | index
--------------------------+-------
{"topic":"sample topic"}  | 1
(1 row)

If we have an array of arrays, we can unnest multiple times to iterate into the inner arrays. The following example shows how. Note that non-array expressions (the objects inside c.messages) are ignored during unnesting; only the arrays are unnested.

SELECT messages, array, inner_array_element FROM subscription_auto c, c.messages AS array, array AS inner_array_element WHERE array = json_parse('[1,2,3]');

-- A sample output 
    messages                                          | array   | inner_array_element
------------------------------------------------------+---------+---------------------
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | [1,2,3] | 1
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | [1,2,3] | 2
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | [1,2,3] | 3
(3 rows)

Dynamic typing

With the schema-less nature of semi-structured data, the same attributes within the JSON might have values of different types. For example, asset_id from our example might have initially started with integers and then, because of a business decision, changed into alphanumeric (string) values before finally settling on an array type. Amazon Redshift SUPER handles this situation by using dynamic typing. Schema-less SUPER data is processed without the need to statically declare the data types before using them in a query. Dynamic typing is most useful in joins and GROUP BY clauses. Although deep comparison of SUPER columns is possible, we recommend restricting joins and aggregations to leaf-level scalar attributes for optimal performance. The following example uses a SELECT statement that requires no explicit casting of the dot and bracket expressions to the usual Amazon Redshift types:

SELECT messages[0].format,
        messages[0].topic
FROM subscription_auto
WHERE messages[0].payload.payload."assetId" > 0;
 
--A sample output
 format | topic
--------+-------
"json"  | "sample topic" 
(1 row)

When your JSON attribute names are in mixed case or CamelCase, the following session parameter is required to query the data. In the preceding query, assetId is in mixed case. For more information, see SUPER configurations.

SET enable_case_sensitive_identifier to TRUE;

The greater than (>) sign in this query evaluates to true when messages[0].payload.payload."assetId" is greater than 0. In all other cases, the comparison evaluates to false, including cases where the arguments are of different types. The messages[0].payload.payload."assetId" attribute can potentially contain a string, integer, array, or structure, and Amazon Redshift only knows that it is a SUPER data type. This is where dynamic typing helps in processing schemaless data by evaluating the types at runtime based on the query. For example, processing the first record of "assetId" may result in an integer, whereas the second record can be a string, and so on. When using a SQL operator or function with dot and bracket expressions that have dynamic types, Amazon Redshift produces results similar to using a standard SQL operator or function with the respective static types. In this example, when the dynamic type of the path expression is an integer, the comparison with the integer 0 is meaningful. Whenever the dynamic type of "assetId" is any data type other than an integer, the comparison returns false.

For queries with joins and aggregations, dynamic typing automatically matches values with different dynamic types without performing a long CASE WHEN analysis to find out what data types may appear. The following code is an example of an aggregation query in which we count the number of topics:

SELECT messages[0].format,
count(messages[0].topic)
FROM subscription_auto WHERE messages[0].payload.payload."assetId" > 'abc' GROUP BY 1;

-- a sample output
 format | count
--------+-------
"json"  |   2
(1 row)

For the next join query example, we unnest and flatten the messages array from subscription_auto and join with the asset_metadata table to get the asset_name and property_name based on the asset_id and property_id, which we use as join keys.

For performance, joins on SUPER columns should preferably be on an extracted path rather than a deep comparison of the entire nested field. In the following examples, the join keys are extracted path keys, not the whole array:

SELECT c.topicId
,c.qos
,a.asset_name
,o.payload.payload."assetId"
,a.asset_property_name
,o.payload.payload."propertyId"
FROM subscription_auto AS c
,c.messages AS o -- o is the alias for messages array
,asset_metadata AS a
WHERE o.payload.payload."assetId" = a.asset_id AND o.payload.payload."propertyId" = a.asset_property_id AND o.payload.type = 'PropertyValueUpdate';
 
--A sample output 
 topicid | qos |  asset_name  |                assetId                 | asset_property_name |               propertyId
---------+-----+--------------+----------------------------------------+---------------------+----------------------------------------
    1001 |   0 | IAH10 Line 1 | "0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c" | stop                | "3ff67d41-bf69-4d57-b461-6f1513e127a4"
(1 row)

The following code is another join query that counts the quality of the metrics collected for a specific asset (in this case, IAH10 Line 1) and its property (Machine State), and categorizes the results based on the quality:

SELECT v.quality::VARCHAR
,count(*)
FROM subscription_auto AS c
,c.messages AS o -- o is the alias for messages array
,o.payload.payload.VALUES AS v -- v is the alias for values array 
,asset_metadata AS a 
WHERE o.payload.payload."assetId" = a.asset_id 
AND o.payload.payload."propertyId" = a.asset_property_id 
AND a.asset_name = 'IAH10 Line 1' 
AND a.asset_property_name = 'Machine State' 
GROUP BY v.quality;

-- A sample output 
quality   | count
----------+-------
CRITICAL  | 152 
GOOD      | 2926 
FAIR      | 722

Summary

This post discussed the benefits of the new SUPER data type and use cases in which nested data types can help improve storage efficiency, performance, or simplify analysis. Amazon Redshift SUPER along with PartiQL enables you to write queries over relational and hierarchical data models with ease. The next post in this series will focus on how to speed up frequent queries on SUPER columns using materialized views. Try it out and share your experience!


About the Authors

Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.


Runyao Chen is a Software Development Engineer at Amazon Redshift. He is passionate about MPP databases and has worked on SQL language features such as querying semistructured data using SUPER. In his spare time, he enjoys reading and exploring new restaurants.


Cody Cunningham is a Software Development Engineer at Amazon Redshift. He works on Redshift’s data ingestion, implementing new features and performance improvements, and recently worked on the ingestion support for SUPER.

Amazon Redshift identity federation with multi-factor authentication

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/amazon-redshift-identity-federation-with-multi-factor-authentication/

Password-based access control alone is not considered secure enough, and many organizations are adopting multi-factor authentication (MFA) and single sign-on (SSO) as a de facto standard to prevent unauthorized access to systems and data. SSO frees up time and resources for both administrators and end users from the painful process of password-based credential management. MFA capability adds an extra layer of protection, improving organizational security and confidence.

Amazon Redshift is a fast, fully managed cloud data warehouse that provides browser-based plugins for its JDBC/ODBC drivers, which help you implement SSO capabilities secured with MFA for your data warehouse, and also help you automate and enforce data access policies across the organization through directory federation.

You can integrate Amazon Redshift with different identity providers (IdPs) like Microsoft Azure Active Directory, Ping, Okta, OneLogin, and more. For more information, see Using a credentials provider plugin. You may already have pre-existing integrations for federating to AWS using industry standard Security Assertion Markup Language (SAML) with these IdPs. In this post, we explain the steps to integrate the Amazon Redshift browser-based SAML plugin to add SSO and MFA capability with your federation IdP.

Solution overview

With this integration, users get authenticated to Amazon Redshift using the SSO and MFA credentials of the IdP application, which uses SAML tokens to map the IdP user identity (like login name or email address) as the database user in Amazon Redshift. It can also map users’ directory group memberships to corresponding database groups in Amazon Redshift, which allows you to control authorization grants for database objects in Amazon Redshift.

The following diagram illustrates our solution architecture.

High-level steps for SSO-MFA integration

The following diagram illustrates the authentication flow with the browser SAML plugin.

We complete the following high-level steps to set up the SSO and MFA authentication capability to an Amazon Redshift data warehouse using a browser-based SAML plugin with our IdP:

  1. Create a custom SAML 2.0 application with the IdP with the following configurations:
    1. A redirect URI (for example, http://localhost:7890/redshift/).
    2. MFA capability enabled.
    3. Relevant SAML claim attributes.
    4. Appropriate directory groups and users with the IdP.
  2. Add appropriate AWS Identity and Access Management (IAM) permissions:
    1. Add an IdP.
    2. Add appropriate IAM roles for the IdP.
    3. Use IAM policies to add appropriate permissions to the roles to access the Amazon Redshift cluster.
  3. Set up Amazon Redshift with group-level access control:
    1. Connect to Amazon Redshift using superuser credentials.
    2. Set up appropriate database groups in Amazon Redshift.
    3. Grant access permissions appropriate to relevant groups.
  4. Connect to Amazon Redshift with your JDBC/ODBC SQL client:
    1. Configure connection attributes for the IdP.
    2. Enable browser-based MFA.
    3. Connect to Amazon Redshift.

Create a custom SAML 2.0 application with the IdP

The first step of this setup is to create a SAML 2.0 application with your IdP. The various directory groups and users that need to log in to the Amazon Redshift cluster should have access to this SAML application. Provide an appropriate redirect_uri (for example, http://localhost:7890/redshift/) in the SAML application configuration, so that the IdP can seamlessly redirect SAML responses over HTTPS to this URI, which then allows the Amazon Redshift JDBC/ODBC driver to authenticate and authorize the user.

The following screenshot shows a SAML application configuration with PingOne as the IdP (for more details on PingOne Amazon Redshift SSO federation, see Federating single sign-on access to your Amazon Redshift cluster with PingIdentity).

You need to download the metadata XML file from the provider (as shown in the preceding screenshot) and use it in a later step to create a SAML IdP in IAM.

Next, you can enable MFA for this application so that users are authorized to access Amazon Redshift only after they pass the two-factor authentication with MFA.

The following screenshot shows the MFA configuration settings with PingOne as the IdP.

As part of the IdP application setup, map the following claim attributes so that Amazon Redshift can access them using the SAML response.

Claim Attribute | Namespace | Description | Example Value
Role | https://aws.amazon.com/SAML/Attributes/Role | aws_iam_role_arn_for_identity_provider, aws_identity_provider_arn | arn:aws:iam::<account>:role/PingOne_Redshift_SSO_Role,arn:aws:iam::<account>:saml-provider/PingOne
RoleSessionName | https://aws.amazon.com/SAML/Attributes/RoleSessionName | Identification for the user session, which in most cases is the email_id of the user. | email
AutoCreate | https://redshift.amazon.com/SAML/Attributes/AutoCreate | If this parameter is set, new users authenticated by the IdP are automatically created in Amazon Redshift. | "true"
DbUser | https://redshift.amazon.com/SAML/Attributes/DbUser | Identification for the user session, which in most cases is the email_id of the user. | email
DbGroups | https://redshift.amazon.com/SAML/Attributes/DbGroups | Amazon Redshift database group names for the user, which in most cases are the same as the directory groups the user belongs to. | data_scientist

The following screenshot is an example of these claim attributes set up for PingOne as IdP.

Apart from setting up the SAML application, you also need to set up appropriate directory groups and users with your IdP, which you will use to grant SSO and MFA access to users for different applications like AWS Single Sign-On and Amazon Redshift application single sign-on.

The following screenshot is an example of this user group set up for PingOne as IdP.

Add appropriate permissions using IAM

After you complete the configuration settings with the IdP, the next step is to configure appropriate permissions in IAM in your AWS account for identity federation using IAM.

The first step is to add an IdP using the SAML metadata XML file downloaded from the IdP’s SAML application you created in the previous step.

After you add the IdP, you need to create an IAM role with that IdP as a trusted entity.

Set the value of the SAML:aud attribute to the same redirect URI defined in your IdP’s SAML application setup (for example, http://localhost:7890/redshift/).

Create a new IAM policy with the necessary permissions needed by the users to access Amazon Redshift, and attach it to the IAM role you created earlier. See the following sample policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser",
                "redshift:JoinGroup",
                "redshift:GetClusterCredentials",
                "redshift:ListSchemas",
                "redshift:ListTables",
                "redshift:ListDatabases",
                "redshift:ExecuteQuery",
                "redshift:FetchResults",
                "redshift:CancelQuery",
                "redshift:DescribeClusters",
                "redshift:DescribeQuery",
                "redshift:DescribeTable"
            ],
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<clusterName>",
                "arn:aws:redshift:<region>:<account>:dbuser:<clusterName>/${redshift:DbUser}",
                "arn:aws:redshift:<region>:<account>:dbname:<clusterName>/${redshift:DbName}",
                "arn:aws:redshift:<region>:<account>:dbgroup:<clusterName>/bi_users_group",
                "arn:aws:redshift:<region>:<account>:dbgroup:<clusterName>/etl_users_group",
                "arn:aws:redshift:<region>:<account>:dbgroup:<clusterName>/analysts_group"
            ]
        }
    ]
}

You can also use an AWS CloudFormation template to automate this IAM setup by uploading the IdP-specific SAML metadata XML file from the SAML application you created.

The following template takes care of creating the IAM resources required for this setup. You need to enter the following parameters to the template:

  • FederationProviderName – Enter a suitable name for the IAM IdP.
  • FederationXmlS3Location – Enter the Amazon Simple Storage Service (Amazon S3) URI where you uploaded the SAML metadata XML file from your IdP’s SAML application.
  • RedshiftClusterEndpoint – Enter the endpoint URI of your Amazon Redshift cluster. You can get this URI via the Amazon Redshift console. If you have multiple Amazon Redshift clusters, you may need to modify this CloudFormation template to add permissions for all the clusters in your account.

Grant group-level access control with Amazon Redshift

If you haven’t set up an Amazon Redshift cluster yet, see Getting started with Amazon Redshift for a step-by-step guide to create a new cluster in your AWS account.

If you already have an Amazon Redshift cluster, note the primary user credentials for that cluster and refer to the following resources to connect to that cluster using a SQL client like SQL Workbench/J and the latest Amazon Redshift JDBC driver with AWS SDK:

When you’re logged in, you need to set up the appropriate groups in Amazon Redshift. The following example code sets up three database groups for business intelligence (BI) users, analysts, and a cross-user group in Amazon Redshift:

CREATE GROUP bi_users_group;
CREATE GROUP analysts_group;
CREATE GROUP cross_user_group;

You can then set up database objects and appropriate access permissions for them. In the following code, we set up two schemas for analysts and BI users and then grant access on them to the relevant groups:

CREATE SCHEMA IF NOT EXISTS bi_schema;
CREATE SCHEMA IF NOT EXISTS analysts_schema;

GRANT SELECT ON ALL TABLES IN SCHEMA bi_schema TO GROUP bi_users_group, GROUP cross_user_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA bi_schema GRANT SELECT ON TABLES TO GROUP bi_users_group, GROUP cross_user_group;

GRANT SELECT ON ALL TABLES IN SCHEMA analysts_schema TO GROUP analysts_group, GROUP cross_user_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA analysts_schema GRANT SELECT ON TABLES TO GROUP analysts_group, GROUP cross_user_group;

These group-level grants allow federated users to access Amazon Redshift objects based on their associated permissions. As explained earlier in this post, you can map your IdP directory groups to their respective database groups in Amazon Redshift, which allows you to control both authentication and authorization to Amazon Redshift based on the IdP credentials.

However, you may choose to control the authorization part within the database itself instead of relying on IdP directory groups. In this case, you use the IdP only to facilitate system authentication to Amazon Redshift, but for data authorization, you map the users and groups manually using alter group statements, as in the following code:

CREATE USER "exampleuser@example.com" PASSWORD DISABLE;
ALTER GROUP bi_users_group ADD USER "exampleuser@example.com";
ALTER GROUP cross_user_group ADD USER "exampleuser@example.com";

In the preceding example, we create a new user, exampleuser, with its password disabled. We can use the IdP credentials for this user to authenticate, so it doesn’t need a password. But to provide authorization, we added this user to the bi_users_group and cross_user_group groups, so that it can inherit the permissions granted to those groups and work seamlessly with SSO and MFA federation.

Configure your JDBC/ODBC SQL client to use the browser-based plugin to connect to Amazon Redshift

In this step, you can test Amazon Redshift connectivity through your IdP using a SQL client like SQL Workbench/J.

You need to provide the following configurations in the SQL client.

Property | Value
Driver | Amazon Redshift
URL | jdbc:redshift:iam://<your-redshift-cluster-endpoint>

Additionally, you need to set up the following extended properties.

Property | Value
login_url | The SAML application’s login page
plugin_name | com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
idp_response_timeout | Number of seconds to allow for SSO authentication to complete before timing out

The following screenshot shows the configurations to connect SQLWorkbench/J client with PingOne as IdP.

The following table summarizes our property values.

Property | Value
login_url | https://sso.connect.pingidentity.com/sso/sp/initsso?saasid=<your_saas_id>&idpid=<your_idp_id> (the SAML application’s SSO URL from your IdP)
plugin_name | com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
idp_response_timeout | 120

When you choose Test or OK, a new web browser window opens that shows the SAML application’s login page.

If this is the first time you’re logging in to PingOne, and haven’t set up MFA before, you can download and pair the PingID mobile app on iOS or Android.

After the PingID app is installed and paired, it pushes a notification to your phone to approve or deny the MFA authorization. When the MFA succeeds, the browser displays a success message on the redirect page.

After the connection is successful, let’s run a SQL query to confirm that the correct user identification was passed and also confirm that the correct database group was mapped for this SQL user session, based on the user’s directory group. In this case, the user manish was mapped to the bi_users_group directory group in PingOne. We should see the SQL session reflect the corresponding database group in Amazon Redshift.
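The exact query isn’t shown here, but the following is a minimal sketch of the kind of check you can run to confirm the session user and its database group membership:

-- Confirm the identity of the federated session user
SELECT current_user;

-- List the database groups the current user belongs to
SELECT u.usename, g.groname
FROM pg_user u, pg_group g
WHERE u.usesysid = ANY(g.grolist)
AND u.usename = current_user;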

We were able to successfully accomplish MFA-based SSO identity federation with PingOne using the browser-based plugin that Amazon Redshift provides.

IdP-specific configurations

As mentioned earlier, the first step of this process is to set up SSO for Amazon Redshift with your IdP. The setup steps for that may vary depending on the provider. For more information, see the following resources:

The following videos also cover these details if you want to view them in action:

Conclusion

Amazon Redshift makes it easy to integrate identity federation with your existing third-party identity providers, allowing you to centralize user and access management in a single corporate directory. In this post, we showed how the Amazon Redshift browser-based plugin works with popular SAML-based IdPs to provide an additional security layer with MFA authentication. You can also use the instructions in this post to set up various SAML-based IdPs (like Ping, Okta, JumpCloud, and OneLogin) to implement SSO and MFA with Amazon Redshift.


About the Authors

Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data-driven solutions in different database and data warehousing technologies for over fifteen years. He loves to learn new technologies and to solve, automate, and simplify customer problems with easy-to-use cloud data solutions on AWS.


Manish Vazirani is an Analytics Specialist Solutions Architect at Amazon Web Services.


Improve query performance using AWS Glue partition indexes

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/

While creating data lakes on the cloud, the data catalog is crucial to centralize metadata and make the data visible, searchable, and queryable for users. With the recent exponential growth of data volume, it becomes much more important to optimize data layout and maintain the metadata on cloud storage to keep the value of data lakes.

Partitioning has emerged as an important technique for optimizing data layout so that the data can be queried efficiently by a variety of analytic engines. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. Over time, hundreds of thousands of partitions get added to a table, resulting in slow queries. To speed up query processing of highly partitioned tables cataloged in AWS Glue Data Catalog, you can take advantage of AWS Glue partition indexes.

Partition indexes are available for queries in Amazon EMR, Amazon Redshift Spectrum, and AWS Glue extract, transform, and load (ETL) jobs (Spark DataFrame). When partition indexes are enabled on heavily partitioned AWS Glue Data Catalog tables, all these query engines are accelerated. You can add partition indexes to both new tables and existing tables. This post demonstrates how to utilize partition indexes, and discusses the benefits you can get with partition indexes when working with highly partitioned data.

Partition indexes

AWS Glue partition indexes are an important configuration to reduce overall data transfers and processing, and reduce query processing time. In the AWS Glue Data Catalog, the GetPartitions API is used to fetch the partitions in the table. The API returns partitions that match the expression provided in the request. If no partition indexes are present on the table, all the partitions of the table are loaded, and then filtered using the query expression provided by the user in the GetPartitions request. The query takes more time to run as the number of partitions increases on a table with no indexes. With an index, the GetPartitions request tries to fetch a subset of the partitions instead of loading all the partitions in the table.

The following are key benefits of partition indexes:

  • Increased query performance
  • Increased concurrency as a result of fewer GetPartitions API calls
  • Cost savings:
    • Analytic engine cost (query performance is related to the charges in Amazon EMR and AWS Glue ETL)
    • AWS Glue Data Catalog API request cost

Setting up resources with AWS CloudFormation

This post provides an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use.

The CloudFormation template generates the following resources:

If you’re using AWS Lake Formation permissions, you need to ensure that the IAM user or role running AWS CloudFormation has the required permissions (to create a database on the Data Catalog).

The tables use sample data located in an Amazon Simple Storage Service (Amazon S3) public bucket. Initially, no partition indexes are configured in these AWS Glue Data Catalog tables.

To create your resources, complete the following steps:

  1. Sign in to the CloudFormation console.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For DatabaseName, leave as the default.
  5. Choose Next.
  6. On the next page, choose Next.
  7. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create.

Stack creation can take up to 5 minutes. When the stack is completed, you have two Data Catalog tables: table_with_index and table_without_index. Both tables point to the same S3 bucket, and the data is highly partitioned based on the year, month, day, and hour columns, spanning 42 years (1980-2021). In total, there are 367,920 partitions, and each partition has one JSON file, data.json. In the following sections, you see how the partition indexes work with these sample tables.

Setting up a partition index on the AWS Glue console

You can create partition indexes at any time. If you want to create a new table with partition indexes, you can make the CreateTable API call with a list of PartitionIndex objects. If you want to add a partition index to an existing table, make the CreatePartitionIndex API call. You can also perform these actions on the AWS Glue console. You can create up to three partition indexes on a table.

Let’s configure a new partition index for the table table_with_index we created with the CloudFormation template.

  1. On the AWS Glue console, choose Tables.
  2. Choose the table table_with_index.
  3. Choose Partitions and indices.
  4. Choose Add new index.
  5. For Index name, enter year-month-day-hour.
  6. For Selected keys from schema, select year, month, day, and hour.
  7. Choose Add index.

The Status column of the newly created partition index shows the status as Creating. Wait for the partition index to be Active. The process takes about 1 hour because index creation time grows with the number of partitions, and this table has 367,920 partitions.

Now the partition index is ready for the table table_with_index. You can use this index from various analytic engines when you query against the table. You see default behavior in the table table_without_index because no partition indexes are configured for this table.

You can follow (or skip) any of the following sections based on your interest.

Making a GetPartitions API call with an expression

Before we use the partition index from various query engines, let’s try making the GetPartitions API call using AWS Command Line Interface (AWS CLI) to see the difference. The AWS CLI get-partitions command makes multiple GetPartitions API calls if needed. In this section, we simply use the time command to compare the duration for each table, and use the debug logging to compare the number of API calls for each table.

  1. Run the get-partitions command against the table table_without_index with the expression year='2021' and month='04' and day='01':
    $ time aws glue get-partitions --database-name partition_index --table-name table_without_index --expression "year='2021' and month='04' and day='01'"
    ...
    real    3m57.438s
    user    0m2.872s
    sys    0m0.248s
    

The command took about 4 minutes. Note that you used only three partition columns out of four.

  2. Run the same command with debug logging to get the number of GetPartitions API calls:
    $ aws glue get-partitions --database-name partition_index --table-name table_without_index --expression "year='2021' and month='04' and day='01'" --debug 2>get-partitions-without-index.log
    $ cat get-partitions-without-index.log | grep x-amz-target:AWSGlue.GetPartitions | wc -l
         737

There were 737 GetPartitions API calls when the partition indexes aren’t used.

  3. Next, run the get-partitions command against table_with_index with the same expression:
    $ time aws glue get-partitions --database-name partition_index --table-name table_with_index --expression "year='2021' and month='04' and day='01'"
    ...
    real    0m2.697s
    user    0m0.442s
    sys    0m0.163s

The command took just 2.7 seconds. You can see how quickly the required partitions were returned.

  4. Run the same command with debug logging to get the number of GetPartitions API calls:
    $ aws glue get-partitions --database-name partition_index --table-name table_with_index --expression "year='2021' and month='04' and day='01'" --debug 2>get-partitions-with-index.log
    $ cat get-partitions-with-index.log | grep x-amz-target:AWSGlue.GetPartitions | wc -l
           4
    

There were only four GetPartitions API calls when the partition indexes are used.

Querying a table using Apache Spark on Amazon EMR

In this section, we explore querying a table using Apache Spark on Amazon EMR.

  1. Launch a new EMR cluster with Apache Spark.

For instructions, see Setting Up Amazon EMR. You need to specify the AWS Glue Data Catalog as the metastore. In this example, we use the default EMR cluster (release: emr-6.2.0, three m5.xlarge nodes).

  1. Connect to the EMR node using SSH.
  2. Run the spark-sql command on the EMR node to start an interactive shell for Spark SQL:
    $ spark-sql

  3. Run the following SQL against partition_index.table_without_index:
    spark-sql> SELECT count(*), sum(value) FROM partition_index.table_without_index WHERE year='2021' AND month='04' AND day='01';
    24    13840.894731640636
    Time taken: 35.518 seconds, Fetched 1 row(s)

The query took 35 seconds. Even though you aggregated records only in the specific partition, the query took so long because there are many partitions and the GetPartitions API call takes time.

Now let’s run the same query against table_with_index to see how much benefit the partition index introduces.

  1. Run the following SQL against partition_index.table_with_index:
    spark-sql> SELECT count(*), sum(value) FROM partition_index.table_with_index WHERE year='2021' AND month='04' AND day='01';
    24    13840.894731640636
    Time taken: 2.247 seconds, Fetched 1 row(s)

The query took just 2 seconds. The difference in query duration comes from the partition index, which reduces the number of GetPartitions calls.

The following chart shows the granular metrics for query planning time without and with the partition index. The query planning time with the index is far less than that without the index.

For more information about comparing metrics in Apache Spark, see Appendix 2 at the end of this post.

Querying a table using Redshift Spectrum

To query with Redshift Spectrum, complete the following steps:

  1. Launch a new Redshift cluster.

You need to configure an IAM role for the cluster to use Redshift Spectrum and the Amazon Redshift query editor. Choose dc2.large with 1 node in this example. You need to launch the cluster in the us-east-1 Region because your cluster must be in the same Region as the S3 bucket that holds the sample data.

  1. Connect with the Redshift query editor. For instructions, see Querying a database using the query editor.
  2. Create an external schema for the partition_index database to use it in Redshift Spectrum (replace <your IAM role ARN> with your IAM role ARN):
    create external schema spectrum from data catalog 
    database 'partition_index' 
    iam_role '<your IAM role ARN>'
    create external database if not exists;

  3. Run the following SQL against spectrum.table_without_index:
    SELECT count(*), sum(value) FROM spectrum.table_without_index WHERE year='2021' AND month='04' AND day='01'

The following screenshot shows our output.

The query took more than 3 minutes.

  1. Run the following SQL against spectrum.table_with_index:
    SELECT count(*), sum(value) FROM spectrum.table_with_index WHERE year='2021' AND month='04' AND day='01'

The following screenshot shows our output.

The query against the table with the index took just 8 seconds, which is much faster than the query against the table without an index.
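
If you want to verify that the index actually reduced the partition work for a Spectrum query, one option (a sketch, assuming the SVL_S3PARTITION system view and its total_partitions and qualified_partitions columns are available in your cluster) is to compare partition pruning for recent queries:

-- Compare total vs. qualified partitions for recent Spectrum queries
SELECT query, segment, total_partitions, qualified_partitions
FROM svl_s3partition
ORDER BY query DESC
LIMIT 10;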

Querying a table using AWS Glue ETL

Let’s launch an AWS Glue development endpoint and an Amazon SageMaker notebook.

  1. On the AWS Glue console, choose Dev endpoints.
  2. Choose Add endpoint.
  3. For Development endpoint name, enter partition-index.
  4. For IAM role, choose your IAM role.

For more information about roles, see Managing Access Permissions for AWS Glue Resources.

  1. For Worker type under Security configuration, script libraries, and job parameters (optional), choose 1X.
  2. For Number of workers, enter 4.
  3. For Dependent jar path, enter s3://crawler-public/json/serde/json-serde.jar.
  4. Select Use Glue data catalog as the Hive metastore under Catalog options (optional).
  5. Choose Next.
  6. For Networking, leave as is (by default, Skip networking configuration is selected), and choose Next.
  7. For Add an SSH public key (Optional), leave it blank, and choose Next.
  8. Choose Finish.
  9. Wait for the development endpoint partition-index to show as READY.

The endpoint may take up to 10 minutes to be ready.

  1. Select the development endpoint partition-index, and choose Create SageMaker notebook on the Actions menu.
  2. For Notebook name, enter partition-index.
  3. Select Create an IAM role.
  4. For IAM role, enter partition-index.
  5. Choose Create notebook.
  6. Wait for the notebook aws-glue-partition-index to show the status as Ready.

The notebook may take up to 3 minutes to be ready.

  1. Select the notebook aws-glue-partition-index, and choose Open notebook.
  2. Choose Sparkmagic (PySpark) on the New menu.
  3. Enter the following code snippet against table_without_index, and run the cell:
    %%time
    %%sql
    SELECT count(*), sum(value) FROM partition_index.table_without_index WHERE year='2021' AND month='04' AND day='01'

The following screenshot shows our output.

The query took 3 minutes.

  1. Enter the following code snippet against partition_index.table_with_index, and run the cell:
    %%time
    %%sql
    SELECT count(*), sum(value) FROM partition_index.table_with_index WHERE year='2021' AND month='04' AND day='01'

The following screenshot shows our output.

The cell took just 7 seconds. The query against the table with the index is faster than the query against the table without an index.

Cleaning up

Now to the final step, cleaning up the resources:

  1. Delete the CloudFormation stack. 
  2. Delete the EMR cluster.
  3. Delete the Amazon Redshift cluster.
  4. Delete the AWS Glue development endpoint and SageMaker notebook.

Conclusion

In this post, we explained how to use partition indexes and how they accelerate queries in various query engines. If you have several million partitions, the performance benefit is even greater. To learn more about partition indexes, see Working with Partition Indexes.


Appendix 1: Setting up a partition index using AWS CLI

If you prefer using the AWS CLI, run the following create-partition-index command to set up a partition index:

$ aws glue create-partition-index --database-name partition_index --table-name table_with_index --partition-index Keys=year,month,day,hour,IndexName=year-month-day-hour

To get the status of the partition index, run the following get-partition-indexes command:

$ aws glue get-partition-indexes --database-name partition_index --table-name table_with_index
{
    "PartitionIndexDescriptorList": [
        {
            "IndexName": "year-month-day-hour",
            "Keys": [
                {
                    "Name": "year",
                    "Type": "string"
                },
                {
                    "Name": "month",
                    "Type": "string"
                },
                {
                    "Name": "day",
                    "Type": "string"
                },
                {
                    "Name": "hour",
                    "Type": "string"
                }
            ],
            "IndexStatus": "CREATING"
        }
    ]
}

Appendix 2: Comparing breakdown metrics in Apache Spark

If you’re interested in comparing the breakdown metrics for query planning time, you can register a SQL listener with the following Scala code snippet:

spark.listenerManager.register(new org.apache.spark.sql.util.QueryExecutionListener {
  override def onSuccess(funcName: String, qe: org.apache.spark.sql.execution.QueryExecution, durationNs: Long): Unit = {
    val metricMap = qe.tracker.phases.mapValues { ps => ps.endTimeMs - ps.startTimeMs }
    println(metricMap.toSeq)
  }
  override def onFailure(funcName: String, qe: org.apache.spark.sql.execution.QueryExecution, exception: Exception): Unit = {}
})

If you use spark-shell, you can register the listener as follows:

$ spark-shell
...
scala> spark.listenerManager.register(new org.apache.spark.sql.util.QueryExecutionListener {
     |   override def onSuccess(funcName: String, qe: org.apache.spark.sql.execution.QueryExecution, durationNs: Long): Unit = {
     |     val metricMap = qe.tracker.phases.mapValues { ps => ps.endTimeMs - ps.startTimeMs }
     |     println(metricMap.toSeq)
     |   }
     |   override def onFailure(funcName: String, qe: org.apache.spark.sql.execution.QueryExecution, exception: Exception): Unit = {}
     | })

Then run the same query without using the index to get the breakdown metrics:

scala> spark.sql("SELECT count(*), sum(value) FROM partition_index.table_without_index WHERE year='2021' AND month='04' AND day='01'").show()
Vector((planning,208), (optimization,29002), (analysis,4))
+--------+------------------+
|count(1)|        sum(value)|
+--------+------------------+
|      24|13840.894731640632|
+--------+------------------+

In this example, we use the same setup for the EMR cluster (release: emr-6.2.0, three m5.xlarge nodes). The console output includes an additional line:

Vector((planning,208), (optimization,29002), (analysis,4)) 

Apache Spark’s query planning mechanism has three phases: analysis, optimization, and physical planning (shown as just planning). This line means that the query planning took 4 milliseconds in analysis, 29,002 milliseconds in optimization, and 208 milliseconds in physical planning.

Let’s try running the same query using the index:

scala> spark.sql("SELECT count(*), sum(value) FROM partition_index.table_with_index WHERE year='2021' AND month='04' AND day='01'").show()
Vector((planning,7), (optimization,608), (analysis,2))                          
+--------+------------------+
|count(1)|        sum(value)|
+--------+------------------+
|      24|13840.894731640634|
+--------+------------------+

The query planning took 2 milliseconds in analysis, 608 milliseconds in optimization, and 7 milliseconds in physical planning.


About the Authors

Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue and AWS Lake Formation. He is passionate about big data technology and open source software, and enjoys building and experimenting in the analytics area.

Sachet Saurabh is a Senior Software Development Engineer at AWS Glue and AWS Lake Formation. He is passionate about building fault tolerant and reliable distributed systems at scale.

Vikas Malik is a Software Development Manager at AWS Glue. He enjoys building solutions that solve business problems at scale. In his free time, he likes playing and gardening with his kids and exploring local areas with family.

Amazon Redshift ML Is Now Generally Available – Use SQL to Create Machine Learning Models and Make Predictions from Your Data

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-redshift-ml-is-now-generally-available-use-sql-to-create-machine-learning-models-and-make-predictions-from-your-data/

With Amazon Redshift, you can use SQL to query and combine exabytes of structured and semi-structured data across your data warehouse, operational databases, and data lake. Now that AQUA (Advanced Query Accelerator) is generally available, you can improve the performance of your queries by up to 10 times with no additional costs and no code changes. In fact, Amazon Redshift provides up to three times better price/performance than other cloud data warehouses.

But what if you want to go a step further and process this data to train machine learning (ML) models and use these models to generate insights from data in your warehouse? For example, to implement use cases such as forecasting revenue, predicting customer churn, and detecting anomalies? In the past, you would need to export the training data from Amazon Redshift to an Amazon Simple Storage Service (Amazon S3) bucket, and then configure and start a machine learning training process (for example, using Amazon SageMaker). This process required many different skills and usually more than one person to complete. Can we make it easier?

Today, Amazon Redshift ML is generally available to help you create, train, and deploy machine learning models directly from your Amazon Redshift cluster. To create a machine learning model, you use a simple SQL query to specify the data you want to use to train your model, and the output value you want to predict. For example, to create a model that predicts the success rate for your marketing activities, you define your inputs by selecting the columns (in one or more tables) that include customer profiles and results from previous marketing campaigns, and the output column you want to predict. In this example, the output column could be one that shows whether a customer has shown interest in a campaign.

After you run the SQL command to create the model, Redshift ML securely exports the specified data from Amazon Redshift to your S3 bucket and calls Amazon SageMaker Autopilot to prepare the data (pre-processing and feature engineering), select the appropriate pre-built algorithm, and apply the algorithm for model training. You can optionally specify the algorithm to use, for example XGBoost.

Architectural diagram.

Redshift ML handles all of the interactions between Amazon Redshift, S3, and SageMaker, including all the steps involved in training and compilation. When the model has been trained, Redshift ML uses Amazon SageMaker Neo to optimize the model for deployment and makes it available as a SQL function. You can use the SQL function to apply the machine learning model to your data in queries, reports, and dashboards.

Redshift ML now includes many new features that were not available during the preview, including Amazon Virtual Private Cloud (VPC) support. For example:

Architectural diagram.

  • You can also create SQL functions that use existing SageMaker endpoints to make predictions (remote inference). In this case, Redshift ML is batching calls to the endpoint to speed up processing.
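
For remote inference, the model object in Amazon Redshift is essentially a typed wrapper around the endpoint. A sketch of what that looks like, assuming a pre-existing SageMaker endpoint named my-churn-endpoint and hypothetical input columns, follows:

CREATE MODEL remote_customer_churn
FUNCTION predict_customer_churn (int, varchar, decimal(10,2))
RETURNS varchar
SAGEMAKER 'my-churn-endpoint'
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML';

-- The function can then be called like any scalar SQL function, for example:
-- SELECT ... WHERE predict_customer_churn(age, plan, monthly_spend) = 'churn';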

Before looking into how to use these new capabilities in practice, let’s see the difference between Redshift ML and similar features in AWS databases and analytics services.

ML Feature | Data | Training from SQL | Predictions using SQL Functions
Amazon Redshift ML | Data warehouse, federated relational databases, S3 data lake (with Redshift Spectrum) | Yes, using Amazon SageMaker Autopilot | Yes, a model can be imported and executed inside the Amazon Redshift cluster, or invoked using a SageMaker endpoint.
Amazon Aurora ML | Relational database (compatible with MySQL or PostgreSQL) | No | Yes, using a SageMaker endpoint. A native integration with Amazon Comprehend for sentiment analysis is also available.
Amazon Athena ML | S3 data lake (other data sources can be used through Athena Federated Query) | No | Yes, using a SageMaker endpoint.

Building a Machine Learning Model with Redshift ML
Let’s build a model that predicts if customers will accept or decline a marketing offer.

To manage the interactions with S3 and SageMaker, Redshift ML needs permissions to access those resources. I create an AWS Identity and Access Management (IAM) role as described in the documentation. I use RedshiftML for the role name. Note that the trust policy of the role allows both Amazon Redshift and SageMaker to assume the role to interact with other AWS services.

From the Amazon Redshift console, I create a cluster. In the cluster permissions, I associate the RedshiftML IAM role. When the cluster is available, I load the same dataset used in this super interesting blog post that my colleague Julien wrote when SageMaker Autopilot was announced.

The file I am using (bank-additional-full.csv) is in CSV format. Each line describes a direct marketing activity with a customer. The last column (y) describes the outcome of the activity (if the customer subscribed to a service that was marketed to them).

Here are the first few lines of the file. The first line contains the headers.

age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no

I store the file in one of my S3 buckets. The S3 bucket is used to unload data and store SageMaker training artifacts.

Then, using the Amazon Redshift query editor in the console, I create a table to load the data.

CREATE TABLE direct_marketing (
	age DECIMAL NOT NULL, 
	job VARCHAR NOT NULL, 
	marital VARCHAR NOT NULL, 
	education VARCHAR NOT NULL, 
	credit_default VARCHAR NOT NULL, 
	housing VARCHAR NOT NULL, 
	loan VARCHAR NOT NULL, 
	contact VARCHAR NOT NULL, 
	month VARCHAR NOT NULL, 
	day_of_week VARCHAR NOT NULL, 
	duration DECIMAL NOT NULL, 
	campaign DECIMAL NOT NULL, 
	pdays DECIMAL NOT NULL, 
	previous DECIMAL NOT NULL, 
	poutcome VARCHAR NOT NULL, 
	emp_var_rate DECIMAL NOT NULL, 
	cons_price_idx DECIMAL NOT NULL, 
	cons_conf_idx DECIMAL NOT NULL, 
	euribor3m DECIMAL NOT NULL, 
	nr_employed DECIMAL NOT NULL, 
	y BOOLEAN NOT NULL
);

I load the data into the table using the COPY command. I can use the same IAM role I created earlier (RedshiftML) because I am using the same S3 bucket to import and export the data.

COPY direct_marketing 
FROM 's3://my-bucket/direct_marketing/bank-additional-full.csv' 
DELIMITER ',' IGNOREHEADER 1
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
REGION 'us-east-1';

Now, I create the model straight from the SQL interface using the new CREATE MODEL statement:

CREATE MODEL direct_marketing
FROM direct_marketing
TARGET y
FUNCTION predict_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'my-bucket'
);

In this SQL command, I specify the parameters required to create the model:

  • FROM – I select all the rows in the direct_marketing table, but I can replace the name of the table with a nested query (see example below).
  • TARGET – This is the column that I want to predict (in this case, y).
  • FUNCTION – The name of the SQL function to make predictions.
  • IAM_ROLE – The IAM role assumed by Amazon Redshift and SageMaker to create, train, and deploy the model.
  • S3_BUCKET – The S3 bucket where the training data is temporarily stored, and where model artifacts are stored if you choose to retain a copy of them.

Here I am using a simple syntax for the CREATE MODEL statement. For more advanced users, other options are available, such as:

  • MODEL_TYPE – To use a specific model type for training, such as XGBoost or multilayer perceptron (MLP). If I don’t specify this parameter, SageMaker Autopilot selects the appropriate model class to use.
  • PROBLEM_TYPE – To define the type of problem to solve: regression, binary classification, or multiclass classification. If I don’t specify this parameter, the problem type is discovered during training, based on my data.
  • OBJECTIVE – The objective metric used to measure the quality of the model. This metric is optimized during training to provide the best estimate from data. If I don’t specify a metric, the default behavior is to use mean squared error (MSE) for regression, the F1 score for binary classification, and accuracy for multiclass classification. Other available options are F1Macro (to apply F1 scoring to multiclass classification) and area under the curve (AUC). More information on objective metrics is available in the SageMaker documentation.
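
Putting those options together, a more explicit variant of the earlier statement could look like the following sketch (the names mirror the example above; the option values shown are only one possible combination):

CREATE MODEL direct_marketing_xgb
FROM direct_marketing
TARGET y
FUNCTION predict_direct_marketing_xgb
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
MODEL_TYPE XGBOOST
PROBLEM_TYPE BINARY_CLASSIFICATION
OBJECTIVE 'f1'
SETTINGS (
  S3_BUCKET 'my-bucket'
);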

Depending on the complexity of the model and the amount of data, it can take some time for the model to be available. I use the SHOW MODEL command to see when it is available:

SHOW MODEL direct_marketing

When I execute this command using the query editor in the console, I get the following output:

Console screenshot.

As expected, the model is currently in the TRAINING state.

When I created this model, I selected all the columns in the table as input parameters. I wonder what happens if I create a model that uses fewer input parameters? I am in the cloud and I am not slowed down by limited resources, so I create another model using a subset of the columns in the table:

CREATE MODEL simple_direct_marketing
FROM (
        SELECT age, job, marital, education, housing, contact, month, day_of_week, y
 	  FROM direct_marketing
)
TARGET y
FUNCTION predict_simple_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'my-bucket'
);

After some time, my first model is ready, and I get this output from SHOW MODEL. The actual console output spans multiple pages; I merged the results here to make it easier to follow:

Console screenshot.

From the output, I see that the model has been correctly recognized as BinaryClassification, and F1 has been selected as the objective. The F1 score is a metric that considers both precision and recall. It returns a value between 0 (lowest possible score) and 1 (perfect precision and recall). The final score for the model (validation:f1) is 0.79. In this table I also find the name of the SQL function (predict_direct_marketing) that has been created for the model, its parameters and their types, and an estimate of the training costs.

When the second model is ready, I compare the F1 scores. The F1 score of the second model is lower (0.66) than the first one. However, with fewer parameters the SQL function is easier to apply to new data. As is often the case with machine learning, I have to find the right balance between complexity and usability.

Using Redshift ML to Make Predictions
Now that the two models are ready, I can make predictions using SQL functions. Using the first model, I check how many false positives (wrong positive predictions) and false negatives (wrong negative predictions) I get when applying the model on the same data used for training:

SELECT predict_direct_marketing, y, COUNT(*)
  FROM (SELECT predict_direct_marketing(
                   age, job, marital, education, credit_default, housing,
                   loan, contact, month, day_of_week, duration, campaign,
                   pdays, previous, poutcome, emp_var_rate, cons_price_idx,
                   cons_conf_idx, euribor3m, nr_employed), y
          FROM direct_marketing)
 GROUP BY predict_direct_marketing, y;

The result of the query shows that the model is better at predicting negative rather than positive outcomes. In fact, even though there are far more true negatives than true positives, there are also many more false positives than false negatives. I added some comments in green and red to the following screenshot to clarify the meaning of the results.

Console screenshot.
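
To turn those counts into the usual summary metrics, a sketch along these lines derives precision and recall from the same self-scored predictions (this assumes the prediction function returns a Boolean, matching the Boolean target y):

WITH scored AS (
  SELECT predict_direct_marketing(
           age, job, marital, education, credit_default, housing,
           loan, contact, month, day_of_week, duration, campaign,
           pdays, previous, poutcome, emp_var_rate, cons_price_idx,
           cons_conf_idx, euribor3m, nr_employed) AS predicted,
         y AS actual
  FROM direct_marketing
)
SELECT
  SUM(CASE WHEN predicted AND actual THEN 1 ELSE 0 END)::float
    / NULLIF(SUM(CASE WHEN predicted THEN 1 ELSE 0 END), 0) AS model_precision,
  SUM(CASE WHEN predicted AND actual THEN 1 ELSE 0 END)::float
    / NULLIF(SUM(CASE WHEN actual THEN 1 ELSE 0 END), 0)    AS model_recall
FROM scored;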

Using the second model, I see how many customers might be interested in a marketing campaign. Ideally, I should run this query on new customer data, not the same data I used for training.

SELECT COUNT(*)
  FROM direct_marketing
 WHERE predict_simple_direct_marketing(
           age, job, marital, education, housing,
           contact, month, day_of_week) = true;

Wow, looking at the results, there are more than 7,000 prospects!

Console screenshot.

Availability and Pricing
Redshift ML is available today in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), US West (N. California), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (Paris), Europe (Stockholm), Asia Pacific (Hong Kong), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (São Paulo). For more information, see the AWS Regional Services list.

With Redshift ML, you pay only for what you use. When training a new model, you pay for the Amazon SageMaker Autopilot and S3 resources used by Redshift ML. When making predictions, there is no additional cost for models imported into your Amazon Redshift cluster, as in the example I used in this post.

Redshift ML also allows you to use existing Amazon SageMaker endpoints for inference. In that case, the usual SageMaker pricing for real-time inference applies. Here you can find a few tips on how to control your costs with Redshift ML.

To learn more, you can see this blog post from when Redshift ML was announced in preview and the documentation.

Start getting better insights from your data with Redshift ML.

Danilo

Implementing multi-tenant patterns in Amazon Redshift using data sharing

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/implementing-multi-tenant-patterns-in-amazon-redshift-using-data-sharing/

Software service providers offer subscription-based analytics capabilities in the cloud with Analytics as a Service (AaaS), and increasingly customers are turning to AaaS for business insights. A multi-tenant storage strategy allows the service providers to build a cost-effective architecture to meet increasing demand.

Multi-tenancy means a single instance of software and its supporting infrastructure is shared to serve multiple customers. For example, a software service provider could generate data that is housed in a single data warehouse cluster, but accessed securely by multiple customers. This storage strategy offers an opportunity to centralize management of data, simplify ETL processes, and optimize costs. However, service providers have to constantly balance between cost and providing a better user experience for their customers.

With the new data sharing feature, you can use Amazon Redshift to scale and meet both objectives of managing costs by simplifying storage and ETL pipelines while still providing consistent performance to customers.  You can ingest data into a cluster designated as a producer cluster, and share this live data with one or more consumer clusters. Clusters accessing this shared data are isolated from each other, therefore performance of a producer cluster isn’t impacted by workloads on consumer clusters. This enables consuming clusters to get consistent performance based on individual compute capacity.

In this post, we focus on various AaaS patterns, and discuss how you can use data sharing in a multi-tenant architecture to scale for virtually unlimited users. We discuss detailed steps to use data sharing with different storage strategies.

Multi-tenant storage patterns

Multi-tenant storage patterns help simplify the architecture and long-term maintenance of the analytics platform. In a multi-tenant strategy, data is stored centrally in a single cluster for all tenants, enabling simplification of the ETL ingestion pipeline and data management. In the previously published whitepaper SaaS Storage Strategies, various models of storage and benefits are covered for a single cluster scenario.

The three strategies you can choose from are:

  • Pool model – Data is stored in a single database schema for all tenants, and a new column (tenant_id) is used to scope and control access to individual tenant data. Access to the multi-tenant data is controlled using views built on the tables.
  • Bridge model – Storage and access to data for each tenant is controlled at individual schema level in the same database.
  • Silo model – Storage and access control to data for each tenant is maintained in separate databases.

The following diagram illustrates the architecture of these multi-tenant storage strategies.

In the following sections, we discuss how these multi-tenant strategies can be implemented using the Amazon Redshift data sharing feature with a multi-cluster architecture.

Scaling your multi-tenant architecture using data sharing

AaaS providers implementing multi-tenant architectures were previously limited to resources of a single cluster to meet the compute and concurrency requirements of users across all the tenants. As the number of tenants increased, you could either turn on concurrency scaling or create additional clusters. However, the addition of new clusters means additional ingestion pipelines and increased operational overhead.

With data sharing in Amazon Redshift, you can easily and securely share data across clusters. Data ingested into the producer cluster is shared with one or more consumer clusters, which allows total separation of ETL and BI workloads. Several consumer clusters can read data from the managed storage of a producer cluster. This enables instant, granular, and high-performance access without data copies and movement. Workloads accessing shared data are isolated from each other and the producer. You can distribute workloads across multiple clusters while simplifying and consolidating the ETL ingestion pipeline into one main producer cluster, providing optimal price for performance.

Consumer clusters can in turn be producers for the data sets they own. Customers can optimize costs even further by collocating multiple tenants on the same consumer cluster. For instance, you can group low-volume tier 3 tenants into a single consumer cluster to provide a lower-cost offering, while high-volume tier 1 tenants get their own isolated compute clusters. Consumer clusters can be created in the same account as the producer or in a different AWS account. This allows separate billing for the consumer clusters: you can charge back to the business group that uses the consumer cluster, or even allow your customers to use their own Amazon Redshift cluster in their account so they pay for usage of the consumer cluster. The following diagram shows the difference in ETL and consumer access patterns in a multi-tenant architecture using data sharing versus a single cluster approach without data sharing.

Multi-tenant architecture with data sharing compared to single cluster approach

Creating a multi-tenant architecture for an AaaS solution

For this post, we use a simple data model with a fact and a dimension table to demonstrate how to leverage data sharing to design a scalable multi-tenant AaaS solution. We cover detailed steps involved for each storage strategy using this data model. The tables are as follows:

  • Customer – dimension table containing customer details
  • Sales – fact table containing sales transactions

We use two Amazon Redshift ra3.4xlarge clusters with two nodes each, and designate one cluster as the producer and the other as the consumer.

The high-level steps involved in enabling data sharing across clusters are as follows:

  1. Create a data share in the producer cluster and assign database objects to the data share.
  2. From the producer cluster, grant usage on the data share to consumer clusters, identified by namespace or AWS account.
  3. From the consumer cluster, create an external database using the data share from the producer.
  4. Query the tables in the data share through the external shared database in the consumer cluster. Grant access to other users to access this shared database and objects.
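
Condensed into SQL, the whole flow looks like the following sketch (the share, schema, and namespace names are the ones used in the detailed sections that follow):

-- On the producer cluster
CREATE DATASHARE salesshare;
ALTER DATASHARE salesshare ADD SCHEMA sales;
ALTER DATASHARE salesshare ADD TABLE sales.v_customersales;
GRANT USAGE ON DATASHARE salesshare TO NAMESPACE '<consumercluster1_namespace>';

-- On the consumer cluster
CREATE DATABASE sales_db FROM DATASHARE salesshare OF NAMESPACE '<producercluster_namespace>';
SELECT * FROM sales_db.sales.v_customersales;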

Creating producer and consumer Amazon Redshift clusters

Let us start by creating two Amazon Redshift ra3.4xlarge clusters with two nodes each, one for the producer and the other for the consumer.

  1. On the Amazon Redshift console, create two clusters of the RA3 instance type, and name them ds-producer and ds-consumer-c1, respectively.
  2. Next, log in to Amazon Redshift using the query editor. You can also use a SQL client tool like DBeaver, SQL Workbench, or Aginity Workbench. For configuration information, see Connecting to an Amazon Redshift cluster using SQL client tools.

Get the cluster namespace of the producer and consumer clusters from the console. We will use the namespaces to create the tenant table and to create and access the data shares. You can also get the cluster namespaces by logging into each of the clusters and executing the SELECT CURRENT_NAMESPACE statement in the query editor.

Be sure to replace the corresponding namespaces in the code sections wherever producercluster_namespace, consumercluster1_namespace, and consumercluster2_namespace are referenced.

The following screenshot shows the namespace on the Amazon Redshift console.

Now that we have the clusters created, we will go through the detailed steps for the three models. First, we will cover the Pool model, followed by Bridge model and finally the Silo model.

Pool model

The pool model represents an all-in, multi-tenant model where all tenants share the same storage constructs and provides the most benefit in simplifying the AaaS solution.

With this model, data storage is centralized in one cluster database, and data is stored for all tenants in the same set of data models. To scope and control access to tenant data, we introduce a column (tenant_id) that serves as a unique identifier for each tenant.

Security management to prevent cross-tenant access is one of the main aspects to address with the pool model. We can implement row-level security and provide secure access to the data by creating database views and set application-level policies by creating groups with specific access and assigning users to the groups. The following diagram illustrates the pool model architecture.

To create a multi-tenant solution using the pool model, you create data shares for the pool model in the producer cluster, and share data with the consumer cluster. We provide more detail on these steps in the following sections.

Creating data shares for the pool model in the producer cluster

To create data shares for the pool model in the producer cluster, complete the following steps:

  1. Log in to the producer cluster as an admin user and run the following script.

Note that we have a tenant table to store unique identifiers for each tenant or consumer (tenant).

We add a column (tenant_id) to the sales and customer tables to uniquely identify tenant data. This tenant_id references the tenant_id in the tenant table to uniquely identify the tenant and consumer records. See the following code:

/**********************************************/
/*       datasharing datasetup pool model     */
/**********************************************/

-- Create Schema
create schema sales;

CREATE TABLE IF NOT EXISTS sales.tenant ( 
t_tenantid int8 not null,
t_name varchar(50) not null,
t_namespace varchar(50),
t_account varchar(16)
)
DISTSTYLE AUTO
SORTKEY AUTO;

-- Create tables for multi-tenant sales schema
drop table sales.customer;
CREATE TABLE IF NOT EXISTS sales.customer(
  c_tenantid int8 not null,
  c_custid int8 not null,
  c_name varchar(25) not null,
  c_region varchar(40) not null,
  Primary Key(c_tenantid, c_custid)
) 
DISTSTYLE AUTO
SORTKEY AUTO;

CREATE TABLE IF NOT EXISTS sales.sales (
  s_tenantid int8 not null,
  s_orderid int8 not null,
  s_custid int8 not null,
  s_totalprice numeric(12,2) not null,
  s_orderdate date not null,
  Primary Key(s_tenantid, s_orderid)
) 
DISTSTYLE AUTO
SORTKEY AUTO;
  1. Set up the tenant table with the details for each consumer cluster, and ingest data into the customer dimension and sales fact tables. Using the COPY command is the recommended way to ingest data into Amazon Redshift, but for illustration purposes, we use INSERT statements:
    -- Ingest data 
    insert into sales.tenant values
    (0, 'primary', '<producercluster_namespace>',''),
    (1, 'tenant1', '<consumercluster1_namespace>',''),
    (2, 'tenant2', '<consumercluster2_namespace>','');
    
    insert into sales.customer values
    (1, 1, 'Customer 1', 'NorthEast'),
    (1, 2, 'Customer 2', 'SouthEast'),
    (2, 1, 'Customer 3', 'NorthWest'),
    (2, 2, 'Customer 4', 'SouthEast');
    
    truncate table sales.sales;
    insert into sales.sales values
    (1, 1, 1, 2434.33, '2020-11-21'),
    (1, 2, 2, 54.90, '2020-5-5'),
    (1, 3, 2, 9678.99, '2020-3-8'),
    (2, 1, 2, 452.24, '2020-1-23'),
    (2, 2, 1, 76523.10, '2020-11-3'),
    (2, 3, 1, 6745.20, '2020-10-01');
    
    select count(*) from sales.tenant;
    select count(*) from sales.customer;
    select count(*) from sales.sales;

Securing data on the producer cluster by restricting access

In the pool model, no external user has direct access to underlying tables. All access is restricted using views.

  1. Create a view for each of the fact and dimension tables, with a condition that filters records based on the consumer tenant's namespace. In our example, we create v_customersales to combine the sales fact and customer dimension tables with a restrictive filter for tenant.namespace = current_namespace. See the following code:
    /**********************************************/
    /* We will create late binding views          */
    /* but materialized views could also be used  */
    /**********************************************/
    
    create or replace view sales.v_customer as
    select * 
    from sales.customer c, sales.tenant t
    where c.c_tenantid = t.t_tenantid 
    and t.t_namespace  = current_namespace;
    
    create or replace view sales.v_sales as
    select * 
    from sales.sales s, sales.tenant t
    where s.s_tenantid = t.t_tenantid 
    and t.t_namespace  = current_namespace;
    
    create or replace view sales.v_customersales as 
    select c_tenantid, c_name, c_region, 
    	date_part(w, to_date(s_orderdate,'YYYY-MM-DD')) as "week", 
    	date_part(mon, to_date(s_orderdate,'YYYY-MM-DD')) as "month", 
    	date_part(dow, to_date(s_orderdate,'YYYY-MM-DD')) as "dow",
    	date_part(yr, to_date(s_orderdate,'YYYY-MM-DD')) as "year",
    	date_part(d, to_date(s_orderdate,'YYYY-MM-DD')) as "dom", 
    	t.t_namespace
    from sales.tenant t, sales.customer c, sales.sales s
    where t.t_tenantid = c.c_tenantid 
    and c.c_tenantid = s.s_tenantid 
    and c.c_custid = s.s_custid 
    and t.t_namespace = current_namespace 
    WITH NO SCHEMA BINDING;
    
    select * from sales.v_customersales;
          
    

Now that we have database objects created in the producer cluster, we can share the data with the consumer clusters.

Sharing data with the consumer cluster

To share data with the consumer cluster, complete the following steps:

  1. Create a data share for the sales data:
    /***************************************************/
    /* Create Datashare and add objects to the share    */
    /****************************************************/
    CREATE DATASHARE salesshare;
    

  1. Enter the following code to alter the data share, add the sales schema to be shared with the consumer clusters, and add all tables in the sales schema to be shared with the consumer cluster:
    /************************************************************/
    /* Add objects at desired granularities: schemas, tables,   */
    /* views include materialized, and SQL UDFs                 */
    /************************************************************/
    ALTER DATASHARE salesshare ADD SCHEMA sales;  -- New addition to create SCHEMA first
    
    /*For pool model, we share only the views and not tables */
    ALTER DATASHARE SalesShare ADD TABLE sales.v_customer;
    ALTER DATASHARE SalesShare ADD TABLE sales.v_customersales;
    

For the pool model, we share only the views with the consumer cluster and not the tables. The ALTER statement ADD TABLE is used to add both views and tables.

  1. Grant usage on the sales data share to the namespace of the BI consumer cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster. See the following code:
    /********************************************************************/
    /* Grant access to consumer clusters                                */
    /* login to Consumer BI Cluster and get the Namespace from          */
    /* the Redshift console or using SELECT CURRENT_NAMESPACE           */
    /********************************************************************/
    SELECT CURRENT_NAMESPACE;
    
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    GRANT USAGE ON DATASHARE salesshare TO NAMESPACE '<consumercluster1_namespace>';
    --Account numbers are 12 digit long
    GRANT USAGE ON DATASHARE salesshare TO ACCOUNT 'Consumer_AWSAccount';
    

  1. View data shares that are shared from the producer cluster:
    SELECT * FROM SVV_DATASHARES;

The following screenshot shows the output.

You can also see the data shares and their detailed objects and consumers using the following commands:

SHOW DATASHARES;
DESC DATASHARE salesshare;
select * from SVV_DATASHARE_OBJECTS;
select * from SVV_DATASHARE_CONSUMERS;

Viewing and querying data shares for the pool model from the consumer cluster

To view and query data shares from the consumer cluster, complete the following steps:

  1. Log in to the consumer cluster as an admin user and view the data share objects:
    /**********************************************************/
    /* Login to Consumer cluster as awsuser:                  */
    /* View datashares and create local database for querying */
    /**********************************************************/
    select * from SVV_DATASHARE_OBJECTS;

The following screenshot shows the results.

  1. Create a new database from the data share of the producer cluster:
    /**********************************************************/
    /* Create a local database and schema reference           */
    /**********************************************************/
    CREATE DATABASE sales_db FROM DATASHARE salesshare
    OF NAMESPACE '<producercluster_namespace>';

  1. Optionally, you can create an external schema in the consumer cluster pointing to the schema in the database of the producer cluster.

Creating a local external schema in the consumer cluster allows schema-level access controls within the consumer cluster, and uses a two-part notation when referencing shared data objects (localschema.table; vs. external_db.producerschema.table). See the following code:

/*********************************************************/
/* Create External Schema - Optional                     */
/* reason for schema: give specific access to schema     */
/* using shared alias get access to a secondary database */
/*********************************************************/
CREATE EXTERNAL SCHEMA sales_schema 
FROM REDSHIFT DATABASE 'sales_db' SCHEMA 'sales';
  1. Now you can query the shared data from the producer cluster by using the syntax tenant.schema.table:
    select * from sales_db.sales.customer;
    select * from sales_db.sales.v_customersales;

  1. From the tenant1 consumer cluster, you can view the databases and the tenants that are accessible to tenant1:
    select * from SVV_REDSHIFT_DATABASES;
    

The following screenshot shows the results.

Creating local consumer users and controlling access

You can control access to users in your consumer cluster by creating users and groups, and assigning access to the data share objects.

  1. Log in as an admin user on consumer cluster 1 and enter the following code to create tenant1_group, grant usage on the local database sales_db and schema sales_schema to the group, and assign the user tenant1_user to the tenant1_group:
    /********************************************************/
    /* Consumers can create own users and assign privileges */
    /* Create tenant1_group and assign privileges to read   */
    /* sales_db and the sales_schema                        */
    /* Create tenant1_user in tenant1_group                 */
    /********************************************************/
    create group tenant1_group;
    create user tenant1_user password 'Redshift#123!' in group tenant1_group;
    
    GRANT USAGE ON DATABASE sales_db TO GROUP tenant1_group;
    GRANT USAGE ON SCHEMA sales_schema TO GROUP tenant1_group;
    

  1. Now, log in as tenant1_user to consumer cluster 1 and select data from the views v_customer and v_customersales:
    /*******************************************************/
    /* select from view returns only sales records related */
    /* to Consumer A namespace                             */
    /*******************************************************/
    select * from sales_db.sales.v_customer;
    select * from sales_db.sales.v_customersales;
    

You should see only the data relevant to tenant 1 and not the data that is associated with tenant 2.
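
A quick way to double-check the isolation (a sketch, reusing the objects created above) is to compare the namespace of the current session with the namespaces present in the rows returned to tenant1_user:

SELECT current_namespace;

-- Every row visible through the shared view should carry tenant1's namespace
SELECT DISTINCT t_namespace FROM sales_db.sales.v_customer;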


Creating materialized views to optimize performance

Consumer clusters can have their own database objects, which are local to the consumer. You can also create materialized views on the data share objects and control when to refresh the dataset for your consumers. This provides another level of isolation from the producer cluster, and ensures that the consumer clusters query their own local dataset.

  1. Log in as an admin user on consumer cluster 1 and enter the following code to create a materialized view for customersales. This will create a local view that can be periodically refreshed from the consumer cluster.

/*******************************************************/
/* Create materialized view in consumer cluster        */
/*******************************************************/
-- Local schema to hold tenant1's materialized views
create schema if not exists tenant1_sales;

create MATERIALIZED view tenant1_sales.mv_customersales as 
select c_tenantid, c_name, c_region, 
	date_part(w, to_date(s_orderdate,'YYYY-MM-DD')) as "week", 
	date_part(mon, to_date(s_orderdate,'YYYY-MM-DD')) as "month", 
	date_part(dow, to_date(s_orderdate,'YYYY-MM-DD')) as "dow",
	date_part(yr, to_date(s_orderdate,'YYYY-MM-DD')) as "year",
	date_part(d, to_date(s_orderdate,'YYYY-MM-DD')) as "dom", 
	t.t_namespace
from sales_db.sales.tenant t, sales_db.sales.customer c, sales_db.sales.sales s
where t.t_tenantid = c.c_tenantid 
and c.c_tenantid = s.s_tenantid 
and c.c_custid = s.s_custid 
and t.t_namespace = current_namespace;

select top 100 * from tenant1_sales.mv_customersales;

REFRESH MATERIALIZED VIEW tenant1_sales.mv_customersales;

With the preceding steps, we have demonstrated how you can control access to the tenant data in the same datastore using views. We also reviewed how data shares help efficiently share data between producer and consumer clusters with transaction consistency. We also saw how a local materialized view can be created to further isolate your BI workloads for your customers and provide a consistent, performant user experience. In the next section we will discuss the Bridge model.

Bridge model

In the bridge model, data for each tenant is stored in its own schema in a database and contains a similar set of tables. Data shares are created for each schema and shared with the corresponding consumer. This is an appealing balance between the silo and pool models, providing both data isolation and ETL consolidation. With Amazon Redshift, you can create up to 9,900 schemas. For more information, see Quotas and limits in Amazon Redshift.

With data sharing, separate consumer clusters can be provisioned to use the same managed storage from the producer cluster. Consumer clusters have all the capabilities of a producer cluster, and can in turn be producer clusters for data objects they own. Consumers can't share data that is already shared with them. Without data sharing, queries from all customers are directed to a single cluster. The following diagram illustrates the bridge model.

To create a multi-tenant architecture using bridge model, complete the steps in the following sections.

Creating database schemas and tables for the bridge model in the producer cluster

As we did in the pool model, the first step is to create the database schema and tables. We log in to the producer cluster as an admin user and create separate schemas for each tenant. For our post, we create two schemas, tenant1 and tenant2, to store data for two tenants.

  1. Log in to the producer cluster as the admin user.
  2. Use the following script to create two schemas, tenant1 and tenant2, and create tables for the customer dimension and sales facts under each of the two schemas:
    /****************************************/
    /* Bridge -  Data Model */
    /****************************************/
    -- Create schemas tenant1 and tenant2
    create schema tenant1;
    create schema tenant2;
    
    -- Create tables for tenant1
    CREATE TABLE IF NOT EXISTS tenant1.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant1.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    
      
    
    -- Create tables for tenant2
    CREATE TABLE IF NOT EXISTS tenant2.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant2.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    

  1. Ingest data into the customer dimension and sales fact tables. Using the COPY command is the recommended way to ingest data into Amazon Redshift (see the sketch after this step), but for illustration purposes, we use the INSERT statement:
    -- ingest data for tenant1
    -- ingest customer data
    insert into tenant1.customer values
    (1, 'Customer 1', 'NorthEast'),
    (2, 'Customer 2', 'SouthEast');
    
    -- ingest sales data
    insert into tenant1.sales values
    (1, 1, 2434.33, '2020-11-21'),
    (2, 2, 54.90, '2020-5-5'),
    (3, 2, 9678.99, '2020-3-8');
    
    select count(*) from tenant1.customer;
    select count(*) from tenant1.sales;
    
    
    -- ingest data for tenant2
    -- ingest customer data
    insert into tenant2.customer values
    (1, 'Customer 3', 'NorthWest'),
    (2, 'Customer 4', 'SouthEast');
    
    -- ingest sales data
    truncate table tenant2.sales;
    insert into tenant2.sales values
    (1, 2, 452.24, '2020-1-23'),
    (2, 1, 76523.10, '2020-11-3'),
    (3, 1, 6745.20, '2020-10-01');
    
    
    select count(*) from tenant2.customer;
    select count(*) from tenant2.sales;
    
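As mentioned in the preceding step, COPY is the preferred way to load data at scale. The following is a minimal sketch of loading the tenant1 customer data from Amazon S3; the S3 path and IAM role are placeholders for your own values:

-- Load tenant1 customer data from Amazon S3 with COPY
-- (the bucket path and IAM role below are placeholders)
COPY tenant1.customer
FROM 's3://<your-bucket>/tenant1/customer/'
IAM_ROLE 'arn:aws:iam::<AWS-Account>:role/<RedshiftCopyRole>'
FORMAT AS CSV;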

Creating data shares and granting usage to the consumer cluster

To create the data shares and grant usage to the consumer clusters, complete the following steps:

  1. Create two data shares, tenant1share and tenant2share, to share the database objects under the two schemas with the respective consumer clusters:
    /******************************************************************/
    /*   Create Datashare and add database objects to the datashare   */
    /******************************************************************/
    CREATE DATASHARE tenant1share;
    CREATE DATASHARE tenant2share;

  1. Alter each data share to add the schema for the respective tenant:
    ALTER DATASHARE tenant1share ADD SCHEMA tenant1;
    ALTER DATASHARE tenant2share ADD SCHEMA tenant2;

  1. Alter each data share to add all tables in the schema to be shared with the consumer cluster (a sketch for automatically including future tables follows this step):
    ALTER DATASHARE tenant1share ADD ALL TABLES IN SCHEMA tenant1;
    ALTER DATASHARE tenant2share ADD ALL TABLES IN SCHEMA tenant2;
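
As referenced in the previous step, the ALTER DATASHARE statements above share only the tables that exist today. If you also want tables created later in a tenant schema to be added to the share automatically, Amazon Redshift provides an INCLUDENEW option; a sketch for tenant1 follows:

-- Automatically add future tables and views in the tenant1 schema to the data share
ALTER DATASHARE tenant1share SET INCLUDENEW = TRUE FOR SCHEMA tenant1;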

Getting the namespace of the first consumer cluster

  1. Log in to the consumer cluster and get the namespace from the console or by running the select current_namespace command:
    /* Grant access to tenant1 schema for Tenant1 BI Cluster */
    /* login to tenant1 BI Cluster and get the Namespace 
     * or get the Namespace from the Redshift console */
    SELECT CURRENT_NAMESPACE;

 

  1. Grant usage on the data share for the first tenant to the namespace of the BI cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster:
    -- Grant usage on the datashare to the first consumer cluster
    -- Namespace refers to the namespace GUID of the consumer cluster 
    GRANT USAGE ON DATASHARE tenant1share TO NAMESPACE '<consumercluster1_namespace>';
    -- Optionally, grant usage to another AWS account (account IDs are 12 digits long)
    --GRANT USAGE ON DATASHARE tenant1share TO ACCOUNT '<Consumer_AWSAccount>';

Getting the namespace of the second consumer cluster

  1. Log in to the second consumer cluster and get the namespace from the console or by running the select current_namespace command:
    /* Grant access to tenant2 schema for Tenant2 BI Cluster */
    /* Log in to the tenant2 BI cluster and get the namespace  */
    SELECT CURRENT_NAMESPACE;
    

  1. Grant usage on the data share for the second tenant to the namespace of the second consumer cluster you just got from the previous step:
    -- Grant usage on the datashare to the second consumer cluster
    GRANT USAGE ON DATASHARE tenant2share TO NAMESPACE '<consumercluster2_namespace>';
    -- Optionally, grant usage to another AWS account (account IDs are 12 digits long)
    --GRANT USAGE ON DATASHARE tenant2share TO ACCOUNT '<Consumer_AWSAccount>';

  1. To view data shares from the producer cluster, enter the following code:
    /******************************************************************/
    /*   View Datashares created, Datashare objects and consumers     */
    /******************************************************************/
    select * from SVV_DATASHARES;
    select * from SVV_DATASHARE_OBJECTS;
    select * from SVV_DATASHARE_CONSUMERS;

The following screenshot shows the commands in the query editor.

The following screenshot shows the query results.
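If the producer cluster has many data shares, you can filter these system views rather than selecting everything. The following sketch assumes the share_name, share_owner, share_type, and createdate columns documented for SVV_DATASHARES; adjust the column list if your version differs:

-- List only the outbound data shares created on this producer cluster
-- (column names are assumptions based on the SVV_DATASHARES documentation)
select share_name, share_owner, createdate
from svv_datashares
where share_type = 'OUTBOUND';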

Accessing data using the consumer cluster from the data share

To access data using the consumer cluster, complete the following steps:

  1. Log in to the first consumer cluster ds-consumer-c1, as an admin user.
  1. View data share objects from the SVV_DATASHARE_OBJECTS system view:
    /*********************************************************/
    /* Consumer cluster as admin user:                        */
    /* list the shares available and review their contents    */
    /*********************************************************/
    -- You can view datashare objects associated with the cluster
    -- using either of the two commands
    SHOW DATASHARES;
    select * from SVV_DATASHARES;

The following screenshot shows the query results.

--View objects shared in inbound share for consumer
select * from SVV_DATASHARE_OBJECTS;

The following screenshot shows the query results.

--View namespaces or clusters granted usage to a datashare
select * from svv_datashare_consumers;

  1. Create a local database in the first consumer cluster from the data share. This lets you provide controlled access to the shared schema for users of the consumer cluster:
    /*******************************************************/
    /* Create a local database and schema reference        */
    /* to the share objects                                */
    /*******************************************************/
    CREATE DATABASE tenant1_db FROM DATASHARE tenant1share
    OF NAMESPACE '<producercluster_namespace>';

  1. Query the shared tables using three-part notation (database.schema.table):
    select * from tenant1_db.tenant1.customer;
    select * from tenant1_db.tenant1.sales;

  1. Optionally, you can create an external schema.

There are two reasons to create an external schema: to enable two-part notation access to the tables from the consumer cluster, or to provide restricted access to specific schemas for selected users when multiple schemas are shared from the producer cluster. See the following code for our external schema:

/* Create External Schema */
CREATE EXTERNAL SCHEMA tenant1_schema FROM REDSHIFT DATABASE 'tenant1_db' SCHEMA 'tenant1';
  1. If you created the external schema, you can use two-part notation to query the database tables:
    select * from tenant1_schema.customer;
    select * from tenant1_schema.sales;

  1. You can view the shared databases by querying the SVV_REDSHIFT_DATABASES table:
    select * from SVV_REDSHIFT_DATABASES;

The following screenshot shows the query results.

The following screenshot shows the query results.

Creating consumer users for managing access

Still logged in as an admin user to the consumer cluster, you can create other users who have access to the database objects.

  1. Create users and groups, and assign users and object privileges to the groups with the following code:
    /*******************************************************/
    /* Consumer can create own users and assign privileges */
    /* Create tenant1_user and assign privileges to        */
    /* read datashare from tenant1 schema                  */
    /*******************************************************/
    create group tenant1_group;
    create user tenant1_user password 'Redshift#123!' in group tenant1_group;
    
    GRANT USAGE ON DATABASE tenant1_db TO tenant1_user;
    GRANT USAGE ON SCHEMA tenant1_schema TO GROUP tenant1_group;
    

Now tenant1_user can log in and query the shared tables from tenant schema.

  1. Log in to the consumer cluster as tenant1_user and query the tables:
    /************************************************************/
    /* Log in to the consumer cluster as tenant1_user and       */
    /* query the shared tables from the tenant1 schema          */
    /************************************************************/
    
    select * from tenant1_db.tenant1.customer;
    select * from tenant1_db.tenant1.sales;
    
    
    /************************************************************/
    /* If you have created an external schema, you can use      */
    /* two-part notation to query the tables                    */
    /************************************************************/
    
    select * from tenant1_schema.customer;
    select * from tenant1_schema.sales;
    

Revoking access to a data share (optional)

  1. At any point, if you want to revoke access to a data share, you can use the REVOKE USAGE command:
    /**************************************************************/
    /* To revoke access at any time, use the REVOKE USAGE command */
    /**************************************************************/
    -- Namespace refers to the namespace GUID of the consumer cluster in the account
    REVOKE USAGE ON DATASHARE tenant1share FROM NAMESPACE '<consumercluster1_namespace>';
    -- If usage was granted to another AWS account (account IDs are 12 digits long)
    REVOKE USAGE ON DATASHARE tenant1share FROM ACCOUNT '<Consumer_AWSAccount>';
    
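If a tenant is offboarded entirely, you can also remove objects from a data share or drop the share altogether. A minimal sketch for the bridge model’s first tenant:

-- Stop sharing the tenant1 schema, then remove the data share itself
ALTER DATASHARE tenant1share REMOVE SCHEMA tenant1;
DROP DATASHARE tenant1share;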

Silo model

The third option is to store data for each tenant in a separate database within the cluster. If you need a tenant’s data isolated from other tenants, you can use the silo model, in which each database can have its own data model, monitoring, management, and security footprint.

Amazon Redshift supports cross-database queries, which allow you to simplify data organization. You can store common or granular datasets used across all tenants in a centralized database, and use the cross-database query capability to join relevant data for each tenant.
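
For example, if reference data used by every tenant lives in a central database on the same cluster, a query can join it with a tenant’s own database using three-part notation. The following is a sketch; common_db and its public.region table are hypothetical, and the tenant tables are the ones created later in this section:

-- Cross-database query joining hypothetical shared reference data
-- (common_db.public.region) with a tenant's own sales data
select r.r_name as region, sum(s.s_totalprice) as total_sales
from common_db.public.region r
join tenant1_silodb.tenant1_siloschema.customer c on c.c_region = r.r_name
join tenant1_silodb.tenant1_siloschema.sales s on s.s_custid = c.c_custid
group by r.r_name;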

The steps to create a data share in the silo model are similar to those in the bridge model; however, unlike the bridge model (where a data share is created for each schema), the silo model creates a data share for each database.

The following diagram illustrates the architecture of the silo model.

Creating data shares for the silo model in the producer cluster

To create data shares for the silo model in the producer cluster, complete the following steps:

  1. Log in to the producer cluster as an admin user and create separate databases for each tenant:
    /*****************************************************/
    /** Silo Model – Create databases for the 2 tenants **/
    /*****************************************************/
    create database tenant1_silodb;
    
    create database tenant2_silodb;
    

  1. Log in again to the producer cluster, connecting to the database that you want to share (tenant1_silodb), and create the schema and tables:
    /***********************************************************/
    /* login to tenant1_db and create schema tenant1 and tables*/
    /***********************************************************/
    create schema tenant1_siloschema;
    
    -- Create tables for tenant1
    CREATE TABLE IF NOT EXISTS tenant1_silodb.tenant1_siloschema.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant1_silodb.tenant1_siloschema.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    
    insert into tenant1_siloschema.customer values
    (1, 'Customer 1', 'NorthEast'),
    (2, 'Customer 2', 'SouthEast');
    
    truncate table tenant1_siloschema.sales;
    insert into tenant1_siloschema.sales values
    (1, 1, 2434.33, '2020-11-21'),
    (2, 2, 54.90, '2020-5-5'),
    (3, 2, 9678.99, '2020-3-8');
    

  1. Create a data share for the first tenant (for example, tenant1_silodbshare):
    /******************************************************************/
    /*   Create datashare and add database objects to the datashare   */
    /******************************************************************/
    CREATE DATASHARE tenant1_silodbshare;
    

  1. Run ALTER DATASHARE commands to add the schema to be shared with the consumer cluster, and to add all tables in that schema:
    ALTER DATASHARE tenant1_silodbshare ADD SCHEMA tenant1_siloschema;
    ALTER DATASHARE tenant1_silodbshare ADD ALL TABLES IN SCHEMA tenant1_siloschema;

  1. Grant usage on the data share for first tenant to the namespace of the BI cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster:
    -- Namespace refers to the namespace GUID of the consumer cluster in the account
    GRANT USAGE ON DATASHARE tenant1_silodbshare TO NAMESPACE '<consumercluster1_namespace>';
    -- Optionally, grant usage to another AWS account (account IDs are 12 digits long)
    --GRANT USAGE ON DATASHARE tenant1_silodbshare TO ACCOUNT '<AWS-Account>';
    

Viewing and querying data shares for the silo model from the consumer cluster

To view and query your data shares, complete the following steps:

  1. Log in to the consumer cluster as an admin user.
  2. Create a new database from the data share of the producer cluster:
    CREATE DATABASE tenant1_silodb FROM DATASHARE tenant1_silodbshare
    OF NAMESPACE '<producercluster_namespace>';

Now you can query the shared data from the producer cluster using three-part notation (database.schema.table). If you created an external schema, you can also use two-part notation to query the tables.

  1. Query the data with the following code:
    select * from tenant1_silodb.tenant1.customer;
    select * from tenant1_silodb.tenant1.sales;
    

  1. Optionally, you can create an external schema pointing to the schema in the database of the producer cluster. This allows you to query the shared tables using two-part notation. See the following code:
    CREATE EXTERNAL SCHEMA tenant1_siloschema FROM REDSHIFT DATABASE 'tenant1_silodb' SCHEMA 'tenant1';
    
    --With this Schema, you can access using two-part notation to select from data share tables
    select * from tenant1_siloschema.customer;
    select * from tenant1_siloschema.sales;
    

  1. You can repeat the same steps for tenant2 to share the tenant2 database with the tenant2 consumer cluster (a brief sketch follows this step). You can also control access for users in your consumer cluster by creating users and groups, and assigning access to the data share objects.
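
As referenced above, the following is a brief sketch of the tenant2 setup. The data share name tenant2_silodbshare, the schema tenant2_siloschema, and the user and group names mirror the tenant1 naming and are illustrative only:

-- On the producer cluster (connected to the tenant2 database): create and grant the data share
CREATE DATASHARE tenant2_silodbshare;
ALTER DATASHARE tenant2_silodbshare ADD SCHEMA tenant2_siloschema;
ALTER DATASHARE tenant2_silodbshare ADD ALL TABLES IN SCHEMA tenant2_siloschema;
GRANT USAGE ON DATASHARE tenant2_silodbshare TO NAMESPACE '<consumercluster2_namespace>';

-- On the tenant2 consumer cluster: create a local database and a restricted user
CREATE DATABASE tenant2_silodb FROM DATASHARE tenant2_silodbshare
OF NAMESPACE '<producercluster_namespace>';

create group tenant2_group;
create user tenant2_user password 'Redshift#456!' in group tenant2_group;
GRANT USAGE ON DATABASE tenant2_silodb TO GROUP tenant2_group;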

System views to view data shares

We have introduced new system tables and views to make it easy to identify the data shares and related objects. You can use three groups of system views to view the data share objects:

  • Views starting with SVV_DATASHARES – contain details of the data shares and the objects in a data share:
    - SVV_DATASHARES – view a list of data shares created on the cluster and data shares shared with the cluster
    - SVV_DATASHARE_OBJECTS – view a list of objects in all data shares created on the cluster or shared with the cluster
    - SVV_DATASHARE_CONSUMERS – view a list of consumers for data shares created on the cluster
  • Views starting with SVV_REDSHIFT – contain details on both local and remote Redshift databases:
    - SVV_REDSHIFT_DATABASES – list of all databases that a user has access to
    - SVV_REDSHIFT_SCHEMAS – list of all schemas that a user has access to
    - SVV_REDSHIFT_TABLES – list of all tables that a user has access to
    - SVV_REDSHIFT_COLUMNS – list of all columns that a user has access to
    - SVV_REDSHIFT_FUNCTIONS – list of all functions that a user has access to
  • Views starting with SVV_ALL – contain local and remote databases, external schemas (including Amazon Redshift Spectrum and federated query), and external schema references to shared data. If you create external schemas in the consumer cluster, use the SVV_ALL views to look at those objects (see the example after this list):
    - SVV_ALL_SCHEMAS – union of the list of all schemas from the SVV_REDSHIFT_SCHEMAS view and the consolidated list of all external tables and schemas that a user has access to
    - SVV_ALL_TABLES – list of all tables that a user has access to
    - SVV_ALL_COLUMNS – list of all columns that a user has access to
    - SVV_ALL_FUNCTIONS – list of all functions that a user has access to
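
For example, to confirm which shared tables are visible through the external schema created earlier on the consumer cluster, you can filter SVV_ALL_TABLES by schema name. The column names below follow the SVV_ALL_TABLES documentation; adjust them if your version differs:

-- List the tables visible through the tenant1_schema external schema
select database_name, schema_name, table_name, table_type
from svv_all_tables
where schema_name = 'tenant1_schema';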

Considerations for choosing a storage strategy

You can adopt a single storage strategy or a hybrid approach based on your business, technical, and operational requirements. Before deciding on a strategy, review the quotas and limits for various objects in Amazon Redshift, such as the number of databases per cluster and the number of schemas per database, to check whether they meet your requirements. The following comparison summarizes these considerations.

  • Separation of tenant data – Pool: views; Bridge: schemas; Silo: databases
  • ETL pipeline complexity – Pool: low; Bridge: low; Silo: medium
  • Limits – Pool: 100,000 tables (RA3 4xlarge and 16xlarge clusters); Bridge: 9,900 schemas per database; Silo: 60 databases per cluster
  • Chargeback to consumer accounts – Pool: yes; Bridge: yes; Silo: yes
  • Scalability – Pool: high; Bridge: high; Silo: high

Conclusion

In this post, we discussed how you can use the new data sharing feature of Amazon Redshift to implement an AaaS solution with a multi-tenant architecture while meeting SLAs for consumers using separate Amazon Redshift clusters. We demonstrated three types of models providing various levels of isolation for the tenant data. We compared and contrasted the models and provided guidance on when to choose an implementation model. We encourage you to try the data sharing feature to build your AaaS or software as a service (SaaS) solutions.


About the Authors

Rajesh Francis is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build scalable Analytic solutions.

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned product management and GTM leader with over 20 years of experience in product vision, strategy, and leadership roles in data products and platforms. She has delivered products in analytics, databases, data integration, application integration, AI/machine learning, and large-scale distributed systems across on-premises and cloud environments, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica, and Expedia.com.

 

Jeetesh Srivastva is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions using Amazon Redshift and other AWS analytics services. He has worked to deliver on-premises and cloud-based analytics solutions for customers in the banking and finance and hospitality industry verticals.