Tag Archives: Amazon Redshift

How Tricentis unlocks insights across the software development lifecycle at speed and scale using Amazon Redshift

Post Syndicated from Parag Doshi original https://aws.amazon.com/blogs/big-data/how-tricentis-unlocks-insights-across-the-software-development-lifecycle-at-speed-and-scale-using-amazon-redshift/

This is a guest post co-written with Parag Doshi, Guru Havanur, and Simon Guindon from Tricentis.

Tricentis is the global leader in continuous testing for DevOps, cloud, and enterprise applications. Since the DORA metrics were published in the 2019 State of DevOps report, it has been well documented that with DevOps, companies can deploy software 208 times more often and 106 times faster, recover from incidents 2,604 times faster, and release 7 times fewer defects. Speed changes everything, and continuous testing across the entire CI/CD lifecycle is the key. However, speed is only realized when you have the confidence to release software on demand. Tricentis instills that confidence by providing software tools that enable Agile Continuous Testing (ACT) at scale. Whether exploratory or automated, functional or performance, API or UI, targeting mainframes, custom applications, packaged applications, or cloud-native applications, Tricentis provides a comprehensive suite of specialized continuous testing tools that help its customers achieve the confidence to release on demand.

The next phase of Tricentis’ journey is to unlock insights across all of its testing tools. Because testing is siloed across many disparate tools, teams often struggle to get a unified view of software quality, which is unacceptable for users who depend on that view. In this post, we share how the AWS Data Lab helped Tricentis improve their software as a service (SaaS) Tricentis Analytics platform with insights powered by Amazon Redshift.

The challenge

Tricentis provides SaaS and on-premises solutions to thousands of customers globally. Every change to software worth testing is tracked in test management tools such as Tricentis qTest, test automation tools such as Tosca or Testim, or performance testing tools such as Neoload. Although Tricentis has amassed such data over a decade, the data remains untapped for valuable insights. Each of these tools has its own reporting capabilities that make it difficult to combine the data for integrated and actionable business insights.

Additionally, the scale is significant because the multi-tenant data sources provide a continuous stream of testing activity, and our users require quick data refreshes as well as historical context for up to a decade due to compliance and regulatory demands.

Finally, data integrity is of paramount importance. Every event in the data source can be relevant, and our customers don’t tolerate data loss, poor data quality, or discrepancies between the source and Tricentis Analytics. Even as transformations aggregate, summarize, and align data to a common information model, they must not affect the integrity of the data relative to its source.

The solution

Tricentis Analytics aims to address the challenges of high volume, near-real-time, and visually appealing reporting and analytics across the entire Tricentis product portfolio.

The initial customer objectives were:

  • Provide export of data securely accessible from the AWS Cloud
  • Provide an initial set of pre-built dashboards that provide immediate business insights
  • Beta test a solution with early adopter customers within 6 weeks

Considering the multi-tenant data source, Tricentis and the AWS Data Lab team engineered for the following constraints:

  • Deliver the end-to-end pipeline to load only the eligible customers into an analytics repository
  • Transform the multi-tenant data into single-tenant data isolated for each customer in strictly segregated environments

Knowing that data will be unified across many sources deployed in any environment, the architecture called for an enterprise-grade analytics platform. The data pipeline consists of multiple layers:

  • Ingesting data from the source either as application events or change data capture (CDC) streams
  • Queuing data so that we can rewind and replay the data back in time without going back to the source
  • Light transformations such as splitting multi-tenant data into single tenant data to isolate customer data
  • Persisting and presenting data in a scalable and reliable lake house (data lake and data warehouse) repository

Some customers will access the repository directly via an API with the proper guardrails for stability to combine their test data with other data sources in their enterprise, while other customers will use dashboards to gain insights on testing. Initially, Tricentis defines these dashboards and charts to enable insight on test runs, test traceability with requirements, and many other pre-defined use cases that can be valuable to customers. In the future, more capabilities will be provided to end-users to come up with their own analytics and insights.

How Tricentis and the AWS Data Lab were able to establish business insights in 6 weeks

Given the challenge of delivering Tricentis Analytics to live customers in 6 weeks, Tricentis partnered with the AWS Data Lab. From detailed design to a beta release, customers expected to consume data from a data lake scoped to only their own data, yet spanning all of the data generated over more than a decade. Customers also required their own repository, an Apache Parquet data lake, which they could combine with other data in their environment to gather even greater insights.

The AWS account team proposed the AWS Data Lab Build Lab session to help Tricentis accelerate the process of designing and building their prototype. The Build Lab is a two-to-five-day intensive build by a team of customer builders with guidance from an AWS Data Lab Solutions Architect. During the Build Lab, the customer will construct a prototype in their environment, using their data, with guidance on real-world architectural patterns and anti-patterns, as well as strategies for building effective solutions, from AWS service experts. Including the pre-lab preparation work, the total engagement duration is 3–6 weeks and in the Tricentis case was 3 weeks: two for the pre-lab preparation work and one for the lab. The weeks that followed the lab included go-to-market activities with specific customers, documentation, hardening, security reviews, performance testing, data integrity testing, and automation activities.

The 2 weeks before the lab were used for the following:

  • Understanding the use case and working backward with an architecture
  • Preparing the Tricentis team for the lab by delivering all the training on the services to be used during the lab

For this solution, Tricentis and AWS built a data pipeline that consumes an existing stream (in place before the lab) of database transactions captured through CDC. In the stream, the data from each table is separated by topic, and data from all customers arrives on the same topic (no isolation). Because of that, the pipeline separates the data by customer so that each customer’s tables land in their own schema at the final destination in Amazon Redshift. The following diagram illustrates the solution architecture.

The main idea of this architecture is to be event-driven with eventual consistency. Any time new test cases or test results are created or modified, events trigger immediate processing, so new snapshot files become available through an API or data is pulled at the refresh frequency of the reporting or business intelligence (BI) tool. Every time the Amazon Simple Storage Service (Amazon S3) sink connector for Apache Kafka delivers a file to Amazon S3, Amazon EventBridge triggers an AWS Lambda function that transforms the multi-tenant file into separate files, one per customer per table, and lands them in customer-specific folders on Amazon S3. As those files are created, another process is triggered to load each customer’s data into their schema or table in Amazon Redshift. In Amazon Redshift, materialized views keep the queries behind the dashboards precomputed and fast to return to Apache Superset. The materialized views are configured to refresh automatically (with the autorefresh option), so Amazon Redshift updates them as soon as possible after the base tables change.
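To make the loading and materialized view steps concrete, the following is a minimal SQL sketch. The schema, table, bucket, and IAM role names are hypothetical, and the file format is assumed to be Parquet; adjust them to match the actual pipeline.

-- Hypothetical example: load one customer's isolated files into that customer's schema.
-- Bucket, schema, table, and IAM role names are placeholders; the Parquet format is an assumption.
COPY customer_a.test_results
FROM 's3://single-tenant-analytics-bucket/customer_a/test_results/'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
FORMAT AS PARQUET;

-- Precompute a dashboard query and let Amazon Redshift refresh it automatically
-- after the base table changes (the autorefresh option).
CREATE MATERIALIZED VIEW customer_a.mv_test_results_by_status
AUTO REFRESH YES
AS
SELECT project_id, status, count(*) AS test_result_count
FROM customer_a.test_results
GROUP BY project_id, status;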

In the following sections, we detail specific implementation challenges and additional features required by customers discovered along the way.

Data export

As stated earlier, some customers want to get an export of their test data and create their own data lake. For these customers, Tricentis provides incremental data as Apache Parquet files, with the ability to filter on specific projects and specific date ranges. To ensure data integrity, Tricentis uses its technology known as Tosca DI (not part of the AWS Data Lab session).

Data security

The solution uses the following data security guardrails:

  • Data isolation guardrails – Tricentis source database systems are shared by all customers, so data from different customers resides in the same database. To isolate customer-specific data, Tricentis uses a unique identifier that discriminates customer-specific data, and every query filters on this discriminator. EventBridge triggers a Lambda function to transform multi-tenant files into single-tenant (customer) files that land in customer-specific S3 folders. Another Lambda function is triggered to load data from the customer-specific folders into the customer’s specific schema in Amazon Redshift. The latter Lambda function is data isolation aware: it triggers an alert and stops further processing for any data that doesn’t belong to a specific customer.
  • Data access guardrails – To ensure access control, Tricentis applied role-based access control principles to users and service accounts for specific work-related resources. Access to Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon S3, Amazon Relational Database Service (Amazon RDS), and Amazon Redshift was controlled by granting privileges at the role level and assigning those roles appropriate resources.

Pay per use and linear cost scalability

Tricentis’ objective is to pay for the compute and storage used and grow the analytics infrastructure with linear cost scalability. To better manage storage costs in the data plane, Tricentis stores all raw and intermediate data in Amazon S3 in a compressed format. Amazon MSK and Amazon Redshift are right-sized for the Tricentis Analytics load and can scale up or down with no downtime based on future business needs. Data in all the stores, including Amazon MSK, Amazon Redshift, and Amazon S3, is subject to tiered storage and retention policies per the customer data retention and archival requirements to reduce costs further and provide linear cost scalability.

In the control plane, Debezium and Kafka Connect resources are turned on and off, so you only pay for what you use. Lambda functions are invoked on an event or a schedule and stop after completing their tasks.

Automated data integrity

High data integrity is a fundamental design principle of Tricentis Analytics. Fortunately, Tricentis has a product called Tosca DI, which is used to automate the measurement of data integrity across many different data sources. The main idea is to use the machine-generated data type and log sequence number (LSN) to reflect the latest snapshot data from the change data capture (CDC) streams. Tricentis reached the data integrity automation milestone outside of the AWS Data Lab window by automatically triggering Tosca DI at various stages of the AWS serverless architecture (illustrated earlier); as a result, Tricentis was able to verify expected record counts at every step, preventing data loss or inadvertent data manipulation. In future versions, Tricentis will go deeper than record counts and incorporate specific fields to ensure data quality (for example, nullness) and semantic or format validation. To date, the combination of CDC and data cleansing has resulted in ultra-high data integrity when comparing source data to the final Parquet file contents.

Performance and data loss prevention

Performance was tuned for maximum throughput at three stages in the pipeline:

  • Data ingestion – Data integrity during ingestion was dramatically improved by using CDC events, which allowed us to rely on the well-respected replication mechanisms in PostgreSQL and Kafka, simplified the system, and eliminated a lot of the past data corrections that were in place. The Amazon S3 sink connector further streams data into Amazon S3 in real time by partitioning data into fixed-size files. Fixed-size data files avoid further latency due to unbounded file sizes. As a result, data quality was higher and data was streamed in real time at a much faster rate.
  • Data transformation – Batch processing is highly cost efficient and compute efficient, and can mitigate various potential performance issues if appropriately implemented. Tricentis uses batch transformation to move data from multi-tenant Amazon S3 to single-tenant Amazon S3, and from single-tenant Amazon S3 to Amazon Redshift via micro-batch loading. The batch processing is staged to work within the Lambda invocation limits and the maximum Amazon Redshift connection limits to keep costs to a minimum. However, the transformation pipeline is configurable to run in real time by processing every incoming S3 file on an EventBridge event.
  • Data queries – Materialized views with appropriate sort keys significantly improve the performance of repeated and predictable dashboard workloads. Tricentis pipelines use dynamic data loading in views and precomputed results in materialized views to seamlessly improve the performance of dashboards, along with appropriate simple and compound sort keys to accelerate performance. Query performance is further accelerated by range-restricted predicates on sort keys, as in the sketch following this list.
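The following is a minimal sketch of this pattern, with hypothetical table and column names. It defines a materialized view with a sort key led by the date column, and a dashboard query whose range-restricted predicate on that sort key lets Amazon Redshift skip blocks outside the requested window.

-- Hypothetical example: materialized view with a sort key for dashboard workloads
CREATE MATERIALIZED VIEW customer_a.mv_test_runs_daily
DISTKEY (project_id)
SORTKEY (run_date, project_id)
AS
SELECT run_date, project_id,
       count(*) AS total_runs,
       sum(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS failed_runs
FROM customer_a.test_runs
GROUP BY run_date, project_id;

-- Range-restricted predicate on the leading sort key column
SELECT run_date, project_id, total_runs, failed_runs
FROM customer_a.mv_test_runs_daily
WHERE run_date BETWEEN '2023-01-01' AND '2023-01-31'
ORDER BY run_date;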

Implementation challenges

Tricentis worked within the default limit of 1,000 concurrent Lambda function runs by keeping track of available functions at any given time and invoking only as many functions as there were available slots. For the 10 GB memory limit per function, Tricentis right-sized the Amazon S3 sink connector generated files and single-tenant S3 files to not exceed 4 GB in size. Lambda function throttling can be prevented by requesting a higher concurrent runs limit if that becomes necessary later.

Tricentis also experienced some Amazon Redshift connection limitations. Amazon Redshift has fixed and adjustable quotas that limit the use of server resources. To effectively manage the Amazon Redshift maximum connections limit, Tricentis used connection pools to ensure optimal consumption and stability.

Results and next steps

The collaborative approach between Tricentis and the AWS Data Lab allowed considerable acceleration and the ability to meet timelines for establishing a big data solution that will benefit Tricentis customers for years. As of this writing, customer onboarding, observability and alerting, and security scanning have been automated as part of a DevSecOps pipeline.

Within 6 weeks, the team was able to beta a data export service for one of Tricentis’ customers.

In the future, Tricentis anticipates adding multiple data sources, unifying toward a common, ubiquitous language for testing data, and delivering richer insights so that its customers can have the correct data in a single view and increase confidence in their delivery of software at scale and speed.

Conclusion

In this post, we walked you through the journey the Tricentis team took with the AWS Data Lab during their participation in a Build Lab session. During the session, the Tricentis team and AWS Data Lab worked together to identify a best-fit architecture for their use cases and implement a prototype for delivering new insights for their customers.

To learn more about how the AWS Data Lab can help you turn your ideas into solutions, visit AWS Data Lab.


About the Authors

  Parag Doshi is Vice President of Engineering at Tricentis, where he continues to lead towards the vision of Innovation at the Speed of Imagination. He brings innovation to market by building world-class quality engineering SaaS such as qTest, the flagship test management product, and a new capability called Tricentis Analytics, which unlocks software development lifecycle insights across all types of testing. Prior to Tricentis, Parag was the founder of Anthem’s Cloud Platform Services, where he drove a hybrid cloud and DevSecOps capability and migrated 100 mission-critical applications. He enabled Anthem to build a new pharmacy benefits management business in AWS, resulting in $800 million in total operating gain for Anthem in 2020 per Forbes and CNBC. He also held posts at Hewlett-Packard, having multiple roles including Chief Technologist and head of architecture for DXC’s Virtual Private Cloud, and CTO for HP’s Application Services in the Americas region.

Guru Havanur serves as a Principal on the Big Data Engineering and Analytics team at Tricentis. Guru is responsible for data, analytics, development, integration with other products, security, and compliance activities. He works with other Tricentis products and customers to improve data sharing, data quality, data integrity, and data compliance through the modern big data platform. With over 20 years of experience in data warehousing, a variety of databases, integration, architecture, and management, he strives for excellence.

Simon Guindon is an Architect at Tricentis. He has expertise in large-scale distributed systems and database consistency models, and works with teams in Tricentis around the world on scalability and high availability. You can follow him on Twitter at @simongui.

Ricardo Serafim is a Senior AWS Data Lab Solutions Architect. With a focus on data pipelines, data lakes, and data warehouses, Ricardo helps customers create an end-to-end architecture and test an MVP as part of their path to production. Outside of work, Ricardo loves to travel with his family and watch soccer games, mainly from the “Timão” Sport Club Corinthians Paulista.

Visualize database privileges on Amazon Redshift using Grafana

Post Syndicated from Yota Hamaoka original https://aws.amazon.com/blogs/big-data/visualize-database-privileges-on-amazon-redshift-using-grafana/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to use SQL for analyzing structured and semi-structured data with best price performance along with secure access to the data.

As more users start querying data in a data warehouse, access control is paramount to protect valuable organizational data. Database administrators want to continuously monitor and manage user privileges to maintain proper data access in the data warehouse. Amazon Redshift provides granular access control on the database, schema, table, column, row, and other database objects by granting privileges to roles, groups, and users from a SQL interface. To monitor privileges configured in Amazon Redshift, you can retrieve them by querying system tables.

Although Amazon Redshift provides broad capabilities for managing access to database objects, we have heard from customers that they want to visualize and monitor privileges without using a SQL interface. In this post, we introduce predefined Grafana dashboards that visualize database privileges without writing SQL. These dashboards help database administrators reduce the time spent on database administration and increase the frequency of monitoring cycles.

Database security in Amazon Redshift

Security is the top priority at AWS. Amazon Redshift provides four levels of control:

  • Cluster management
  • Cluster connectivity
  • Database access
  • Temporary database credentials and single sign-on

This post focuses on database access, which relates to user access control against database objects. For more information, see Managing database security.

Amazon Redshift uses the GRANT command to define permissions in the database. For most database objects, GRANT takes three parameters:

  • Identity – The entity you grant access to. This could be a user, role, or group.
  • Object – The type of database object. This could be a database, schema, table or view, column, row, function, procedure, language, datashare, machine learning (ML) model, and more.
  • Privilege – The type of operation. Examples include CREATE, SELECT, ALTER, DROP, DELETE, and INSERT. The level of privilege depends on the object.

To remove access, use the REVOKE command.
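For illustration, the following statements show the three parameters together. The schema, table, role, group, and user names are placeholders.

-- Grant privileges on a schema and its tables to a role, then grant the role to a user
CREATE ROLE reporting_ro;
GRANT USAGE ON SCHEMA sales TO ROLE reporting_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO ROLE reporting_ro;
GRANT ROLE reporting_ro TO report_user;

-- Grant a privilege on a single table to a group, then remove it
GRANT SELECT ON sales.orders TO GROUP bi_group;
REVOKE SELECT ON sales.orders FROM GROUP bi_group;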

Additionally, Amazon Redshift offers granular access control with the row-level security (RLS) feature. You can attach or detach RLS policies to identities with the ATTACH RLS POLICY and DETACH RLS POLICY commands, respectively. See RLS policy ownership and management for more details.
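A minimal sketch of the RLS commands follows, using hypothetical table, policy, and role names; the predicate is only an example of restricting a role to a single region.

-- Create a policy that only exposes rows for one region (example predicate)
CREATE RLS POLICY emea_only
WITH (region_nm VARCHAR(25))
USING (region_nm = 'EMEA');

-- Attach the policy to a role and turn row-level security on for the table
ATTACH RLS POLICY emea_only ON sales.orders TO ROLE emea_analyst;
ALTER TABLE sales.orders ROW LEVEL SECURITY ON;

-- Detach the policy when it's no longer needed
DETACH RLS POLICY emea_only ON sales.orders FROM ROLE emea_analyst;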

Generally, database administrators monitor and review the identities, objects, and privileges periodically to ensure proper access is configured. They also need to investigate access configurations if database users face permission errors. These tasks require a SQL interface to query multiple system tables, which can be a repetitive and undifferentiated operation. Therefore, database administrators need a single pane of glass to quickly navigate through identities, objects, and privileges without writing SQL.

Solution overview

The following diagram illustrates the solution architecture and its key components:

  • Amazon Redshift contains database privilege information in system tables.
  • Grafana provides a predefined dashboard to visualize database privileges. The dashboard runs queries against the Amazon Redshift system tables via the Amazon Redshift Data API.

Note that the dashboard focuses on visualization. A SQL interface is still required to configure privileges in Amazon Redshift. You can use query editor v2, a web-based SQL interface that enables users to run SQL commands from a browser.

Prerequisites

Before moving to the next section, you should have the following prerequisites:

While Amazon Managed Grafana controls the plugin version and updates it periodically, local Grafana allows users to control the version. Therefore, local Grafana could be an option if you need earlier access to the latest features. Refer to the plugin changelog for released features and versions.

Import the dashboards

After you have finished the prerequisites, you should have access to Grafana configured with Amazon Redshift as a data source. Next, import two dashboards for visualization.

  1. On the Grafana console, go to the created Redshift data source and choose Dashboards.
  2. Import the Amazon Redshift Identities and Objects dashboard.
  3. Go to the data source again and import the Amazon Redshift Privileges dashboard.

Each dashboard will appear once imported.

Amazon Redshift Identities and Objects dashboard

The Amazon Redshift Identities and Objects dashboard shows identities and database objects in Amazon Redshift, as shown in the following screenshot.

The Identities section shows the detail of each user, role, and group in the source database.

One of the key features in this dashboard is the Role assigned to Role, User section, which uses a node graph panel to visualize the hierarchical structure of roles and users from multiple system tables. This visualization helps administrators quickly examine which roles are inherited by which users without querying multiple system tables. For more information about role-based access, refer to Role-based access control (RBAC).
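If you prefer to cross-check the node graph with SQL, queries like the following sketch can list the same relationships. The system views SVV_USER_GRANTS and SVV_ROLE_GRANTS are the ones we expect to hold this information, but the column names shown here are assumptions; verify them against the Amazon Redshift documentation.

-- Roles granted directly to users (assumed columns; verify in the documentation)
SELECT user_name, role_name
FROM svv_user_grants
ORDER BY user_name;

-- Roles granted to other roles, which form the hierarchy shown in the node graph
SELECT role_name, granted_role_name
FROM svv_role_grants
ORDER BY role_name;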

Amazon Redshift Privileges dashboard

The Amazon Redshift Privileges dashboard shows privileges defined in Amazon Redshift.

In the Role and Group assigned to User section, open the Role assigned to User panel to list the roles for a specific user. In this panel, you can list and compare roles assigned to multiple users. Use the User drop-down at the top of the dashboard to select users.

The dashboard refreshes immediately and shows the filtered results for the selected users. The following screenshot shows the filtered result for users hr1, hr2, and it3.

The Object Privileges section shows the privileges granted for each database object and identity. Note that objects with no privileges granted are not listed here. To show the full list of database objects, use the Amazon Redshift Identities and Objects dashboard.

The Object Privileges (RLS) section contains visualizations for row-level security (RLS). The Policy attachments panel enables you to examine the RLS configuration by visualizing the relationships between tables, policies, roles, and users.

Conclusion

In this post, we introduced a visualization for database privileges of Amazon Redshift using predefined Grafana dashboards. Database administrators can use these dashboards to quickly navigate through identities, objects, and privileges without writing SQL. You can also customize the dashboard to meet your business requirements. The JSON definition file of this dashboard is maintained as part of OSS in the Redshift data source for Grafana GitHub repository.

For more information about the topics described in this post, refer to the following:


About the author

Yota Hamaoka is an Analytics Solution Architect at Amazon Web Services. He is focused on driving customers to accelerate their analytics journey with Amazon Redshift.

Simplify Online Analytical Processing (OLAP) queries in Amazon Redshift using new SQL constructs such as ROLLUP, CUBE, and GROUPING SETS

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/simplify-online-analytical-processing-olap-queries-in-amazon-redshift-using-new-sql-constructs-such-as-rollup-cube-and-grouping-sets/

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools.

We are continuously investing to make analytics easy with Amazon Redshift by simplifying SQL constructs and adding new operators. Now we are adding ROLLUP, CUBE, and GROUPING SETS SQL aggregation extensions to perform multiple aggregate operations in a single statement and easily include subtotals, totals, and collections of subtotals in a query.

In this post, we discuss how to use these extensions to simplify your queries in Amazon Redshift.

Solution overview

Online Analytical Processing (OLAP) is an effective tool for today’s data and business analysts. It helps you see your mission-critical metrics at different aggregation levels in a single pane of glass. An analyst can use OLAP aggregations to analyze buying patterns by grouping customers by demographic, geographic, and psychographic data, and then summarizing the data to look for trends. This could include analyzing the frequency of purchases, the time frames between purchases, and the types of items being purchased. Such analysis can provide insight into customer preferences and behavior, which can be used to inform marketing strategies and product development. For example, a data analyst can query the data to display a spreadsheet showing a certain type of the company’s products sold in the US in the month of July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in the US in the same time period.

Traditionally, business analysts and data analysts use a set of SQL UNION queries to achieve the desired level of detail and rollups. However, these queries can be very time consuming and cumbersome to write and maintain. Furthermore, the level of detail and rollups that can be achieved with this approach is limited, because it requires the user to write a separate query for each level of detail and rollup.

Many customers are considering migrating to Amazon Redshift from other data warehouse systems that support OLAP GROUP BY clauses. To make this migration process as seamless as possible, Amazon Redshift now offers support for ROLLUP, CUBE, and GROUPING SETS. This allows for a smoother migration of OLAP workloads with minimal rewrites and, ultimately, a faster and more streamlined transition to Amazon Redshift. Business and data analysts can now write a single SQL statement to do the job of multiple UNION queries.

In the next sections, we use sample supplier balance data from the TPC-H dataset as a running example to demonstrate the use of the ROLLUP, CUBE, and GROUPING SETS extensions. This dataset consists of supplier account balances across different regions and countries. We demonstrate how to find account balance subtotals and grand totals at the nation level, the region level, and a combination of both. All of these analytical questions can be answered by a business user running simple, single SQL statements. Along with aggregations, this post also demonstrates how the results can be traced back to the attributes that participated in generating the subtotals.

Data preparation

To set up the use case, complete the following steps:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor v2.

The query editor v2 opens in a new browser tab.

  2. Create a supplier sample table and insert sample data:
create table supp_sample (supp_id integer, region_nm char(25), nation_nm char(25), acct_balance numeric(12,2));

INSERT INTO public.supp_sample (supp_id,region_nm,nation_nm,acct_balance)
VALUES
(90470,'AFRICA                   ','KENYA                    ',1745.57),
(99910,'AFRICA                   ','ALGERIA                  ',3659.98),
(26398,'AMERICA                  ','UNITED STATES            ',2575.77),
(43908,'AMERICA                  ','CANADA                   ',1428.27),
(3882,'AMERICA                  ','UNITED STATES            ',7932.67),
(42168,'ASIA                     ','JAPAN                    ',343.34),
(68461,'ASIA                     ','CHINA                    ',2216.11),
(89676,'ASIA                     ','INDIA                    ',4160.75),
(52670,'EUROPE                   ','RUSSIA                   ',2469.40),
(32190,'EUROPE                   ','RUSSIA                   ',1119.55),
(19587,'EUROPE                   ','GERMANY                  ',9904.98),
(1134,'MIDDLE EAST              ','EGYPT                    ',7977.48),
(35213,'MIDDLE EAST              ','EGYPT                    ',737.28),
(36132,'MIDDLE EAST              ','JORDAN                   ',5052.87);

We took a sample from the result of the following query run on TPC-H dataset. You can use the following query and take sample records to try the SQL statement described in this post:

select s_suppkey supp_id, r.r_name region_nm,n.n_name nation_nm, s.s_acctbal acct_balance
from supplier s, nation n, region r
where
s.s_nationkey = n.n_nationkey
and n.n_regionkey = r.r_regionkey

Let’s review the sample data before running the SQLs using GROUPING SETS, ROLLUP, and CUBE extensions.

The supp_sample table consists of supplier account balances from various nations and regions across the world. The following are the attribute definitions:

  • supp_id – The unique identifier for each supplier
  • region_nm – The region in which the supplier operates
  • nation_nm – The nation in which the supplier operates
  • acct_balance – The supplier’s outstanding account balance

GROUPING SETS

GROUPING SETS is a SQL aggregation extension to group the query results by one or more columns in a single statement. You can use GROUPING SETS instead of performing multiple SELECT queries with different GROUP BY keys and merge (UNION) their results.

In this section, we show how to find the following:

  • Account balances aggregated for each region
  • Account balances aggregated for each nation
  • Merged results of both aggregations

Run the following SQL statement using GROUPING SETS:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY GROUPING SETS (region_nm, nation_nm);

As shown in the following screenshot, the result set includes aggregated account balances by region_nm, followed by nation_nm, and then both results combined in a single output.

ROLLUP

The ROLLUP function generates aggregated results at multiple levels of grouping, starting from the most detailed level and then aggregating up to the next level. It groups data by the specified columns, adds extra rows that represent subtotals, and assumes a hierarchy among the GROUP BY columns.

In this section, we show how to find the following:

  • Account balances for each combination of region_nm and nation_nm
  • Rolled-up account balances for each region_nm
  • Rolled-up account balances for all regions

Use the following SQL statement using ROLLUP:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY ROLLUP (region_nm, nation_nm)
ORDER BY region_nm,nation_nm;

The following result shows rolled-up values starting from each combination of region_nm and nation_nm and rolling up the hierarchy from nation_nm to region_nm. The rows with a value for region_nm and a NULL value for nation_nm represent the subtotals for the region (marked in green). The rows with NULL values for both region_nm and nation_nm contain the grand total: the rolled-up account balances for all regions (marked in red).


ROLLUP is structurally equivalent to the following GROUPING SETS query:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY GROUPING SETS((region_nm, nation_nm), (region_nm), ())
ORDER BY region_nm,nation_nm;

You can rewrite the preceding ROLLUP query using GROUPING SETS. However, ROLLUP is a much simpler and more readable construct for this use case.

CUBE

CUBE groups data by the provided columns, returning extra subtotal rows representing the totals throughout all levels of grouping columns, in addition to the grouped rows. CUBE returns the same rows as ROLLUP, while adding additional subtotal rows for every combination of grouping column not covered by ROLLUP.

In this section, we show how to find the following:

  • Account balance subtotals for each nation_nm
  • Account balance subtotals for each region_nm
  • Account balance subtotals for each group of region_nm and nation_nm combination
  • Overall total account balance for all regions

Run the following SQL statement using CUBE:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
WHERE region_nm in ('AFRICA','AMERICA','ASIA')
GROUP BY CUBE(region_nm, nation_nm)
ORDER BY region_nm, nation_nm;

In the preceding query, we added a filter to limit results for easy explanation. You can remove this filter in your test to view data for all regions.

In the following result sets, you can see the subtotals at region level (marked in green). These subtotal records are the same records generated by ROLLUP. Additionally, CUBE generated subtotals for each nation_nm (marked in yellow). Finally, you can also see the grand total for all three regions mentioned in the query (marked in red).

CUBE is structurally equivalent to the following GROUPING SETS query:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
WHERE region_nm in ('AFRICA','AMERICA','ASIA') -- added the filter to limit results.  You can remove this filter in your test to view data for all regions
GROUP BY GROUPING SETS((region_nm, nation_nm), (region_nm), (nation_nm), ())
ORDER BY region_nm;

You can rewrite the preceding CUBE query using GROUPING SETS. However, CUBE is a much simpler and more readable construct for this use case.

NULL values

NULL is a valid value in a column that participates in GROUPING SETS, ROLLUP, and CUBE, and it isn’t conflated with the NULL values that these extensions add explicitly to the result set to satisfy the schema of the returned tuples.

Let’s create an example table orders containing details about items ordered, item descriptions, and quantity of the items:

-- Create example orders table and insert sample records
CREATE TABLE orders(item_no int,description varchar,quantity int);
INSERT INTO orders(item_no,description,quantity)
VALUES
(101,'apples',10),
(102,null,15),
(103,'banana',20);

--View the data
SELECT * FROM orders;

We use the following ROLLUP query to aggregate quantities by item_no and description:

SELECT item_no, description, sum(quantity)
FROM orders
GROUP BY ROLLUP(item_no, description)
ORDER BY 1,2;

In the following result, there are two output rows for item_no 102. The row marked in green is the actual data record in the input, and the row marked in red is the subtotal record added by the ROLLUP function.

This demonstrates that NULL values in the input are kept separate from the NULL values added by the SQL aggregation extensions.
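A quick way to tell the two kinds of NULL apart is the GROUPING function described in the next section; a value of 1 marks a row added by the extension, as in the following sketch.

SELECT item_no, description, sum(quantity) AS total_quantity,
GROUPING(description) AS is_rollup_row
FROM orders
GROUP BY ROLLUP(item_no, description)
ORDER BY 1,2;

-- For item_no 102, the input row with a NULL description has is_rollup_row = 0,
-- while the subtotal row added by ROLLUP has is_rollup_row = 1.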

GROUPING and GROUPING_ID functions

GROUPING indicates whether a column in the GROUP BY list is aggregated or not. GROUPING(expr) returns 0 if a tuple is grouped on expr; otherwise it returns 1. GROUPING_ID(expr1, expr2, …, exprN) returns an integer representation of the bitmap that consists of GROUPING(expr1), GROUPING(expr2), …, GROUPING(exprN).

This feature helps us clearly understand the aggregation grain, slice and dice data, and apply filters when business users are performing analysis. It also provides auditability for the generated aggregations.

For example, let’s use the preceding supp_sample table. The following ROLLUP query uses the GROUPING and GROUPING_ID functions:

SELECT region_nm,
nation_nm,
sum(acct_balance) as total_balance,
GROUPING(region_nm) as gr,
GROUPING(nation_nm) as gn,
GROUPING_ID(region_nm, nation_nm) as grn
FROM supp_sample
GROUP BY ROLLUP(region_nm, nation_nm)
ORDER BY region_nm;

In the following result set, the rows where nation_nm is rolled up have a gn value of 1, indicating that total_balance is the aggregated value for all the nation_nm values in that region. The last row has a gr value of 1 as well, indicating that total_balance is aggregated at the region level including all nations; its grn value is the integer representation of the bitmap (binary 11, which is 3 in decimal).

Performance assessment

Performance is often a key factor, and we wanted to make sure we’re offering the most performant SQL features in Amazon Redshift. We performed benchmarking with the 3 TB TPC-H public dataset on Amazon Redshift clusters of different sizes (5-node Ra3-4XL, 2-node Ra3-4XL, and 2-node Ra3-XLPLUS). Additionally, we disabled query caching so that query results aren’t cached. This allows us to measure the performance of the database as opposed to its ability to serve results from cache. The results were consistent across multiple runs.

We loaded the supplier, region, and nation files from the 3 TB public dataset and created a view on top of those three tables, as shown in the following code. This query joins the three tables to create a unified record. The joined dataset is used for performance assessment.

create view v_supplier_balances as
select r.r_name region_nm,n.n_name nation_nm, s.s_acctbal acct_balance
from supplier s, nation n, region r
where
s.s_nationkey = n.n_nationkey
and n.n_regionkey = r.r_regionkey;

We ran the following example SELECT queries using GROUPING SETS, CUBE, and ROLLUP, and captured performance metrics in the following tables.
ROLLUP:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY ROLLUP (region_nm, nation_nm)
ORDER BY region_nm;
Cluster              Run 1 (ms)   Run 2 (ms)   Run 3 (ms)
5-node-Ra3-4XL       120          118          117
2-node-Ra3-4XL       405          389          391
2-node-Ra3-XLPLUS    490          460          461

CUBE:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY CUBE(region_nm, nation_nm)
ORDER BY region_nm;
Cluster              Run 1 (ms)   Run 2 (ms)   Run 3 (ms)
5-node-Ra3-4XL       224          215          214
2-node-Ra3-4XL       412          392          392
2-node-Ra3-XLPLUS    872          798          793

GROUPING SETS:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY GROUPING SETS(region_nm, nation_nm)
ORDER BY region_nm;
Cluster              Run 1 (ms)   Run 2 (ms)   Run 3 (ms)
5-node-Ra3-4XL       210          198          198
2-node-Ra3-4XL       345          328          328
2-node-Ra3-XLPLUS    675          674          674

When we rewrote the same ROLLUP and CUBE queries using UNION ALL and ran them, ROLLUP and CUBE delivered better performance, as the following comparison shows.

Cluster              CUBE (ms)   ROLLUP (ms)   UNION ALL (ms)
5-node-Ra3-4XL       214         117           321
2-node-Ra3-4XL       392         391           543
2-node-Ra3-XLPLUS    793         461           932
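For reference, a UNION ALL formulation equivalent to the preceding ROLLUP query looks roughly like the following sketch; the exact queries used in the benchmark may differ. The NULL casts keep the branches type-compatible.

SELECT region_nm, nation_nm, sum(acct_balance) AS total_balance
FROM v_supplier_balances
GROUP BY region_nm, nation_nm
UNION ALL
SELECT region_nm, CAST(NULL AS char(25)), sum(acct_balance)
FROM v_supplier_balances
GROUP BY region_nm
UNION ALL
SELECT CAST(NULL AS char(25)), CAST(NULL AS char(25)), sum(acct_balance)
FROM v_supplier_balances
ORDER BY 1, 2;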

Clean up

To clean up your resources, drop the tables and views you created while following along with the example in this post.

Conclusion

In this post, we talked about the new aggregation extensions ROLLUP, CUBE, and GROUPING SETS added to Amazon Redshift. We also discussed general use cases, implementation examples, and performance results. You can simplify your existing aggregation queries using these new SQL aggregation extensions and use them in future development for building simpler, more readable queries. If you have any feedback or questions, please leave them in the comments section.


About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager. She has rich experience in how customer-facing Amazon Redshift features are built, from planning to launch, and always treats customers’ requirements as her first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Dinesh Kumar is a Database Engineer with more than a decade of experience working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Patterns for enterprise data sharing at scale

Post Syndicated from Venkata Sistla original https://aws.amazon.com/blogs/big-data/patterns-for-enterprise-data-sharing-at-scale/

Data sharing is becoming an important element of an enterprise data strategy. AWS services like AWS Data Exchange provide an avenue for companies to share or monetize their value-added data with other companies. Some organizations would like to have a data sharing platform where they can establish a collaborative and strategic approach to exchange data with a restricted group of companies in a closed, secure, and exclusive environment. For example, financial services companies and their auditors, or manufacturing companies and their supply chain partners. This fosters development of new products and services and helps improve their operational efficiency.

Data sharing is a team effort. It’s important to note that, in addition to establishing the right infrastructure, successful data sharing also requires organizations to ensure that business owners sponsor data sharing initiatives. They also need to ensure that data is of high quality. Data platform owners and security teams should encourage proper data use and fix any privacy and confidentiality issues.

This blog discusses various data sharing options and common architecture patterns that organizations can adopt to set up their data sharing infrastructure based on AWS service availability and data compliance.

Data sharing options and data classification types

Organizations operate across a spectrum of security compliance constraints. For some organizations, it’s possible to use AWS services like AWS Data Exchange. However, organizations working in heavily regulated industries like federal agencies or financial services might be limited to the allow listed AWS service options. For example, if an organization is required to operate in a FedRAMP Moderate or FedRAMP High environment, their options to share data may be limited to the AWS services that are available and have been allow listed. Service availability is based on platform certification by AWS, and allow listing is based on the organization’s security compliance architecture and guidelines.

The kind of data that the organization wants to share with its partners may also have an impact on the method used for data sharing. Complying with data classification rules may further limit the data sharing options they can choose.

The following are some general data classification types:

  • Public data – Information that is often freely available for people to read, research, review, and store. It typically has the lowest level of data classification and security.
  • Private data – Information you might want to keep private like email inboxes, cell phone content, employee identification numbers, or employee addresses. If private data were shared, destroyed, or altered, it might pose a slight risk to an individual or the organization.
  • Confidential or restricted data – Sensitive information that only a limited group of individuals or parties can access, often requiring special clearance or authorization. Confidential or restricted data access might involve aspects of identity and authorization management. Examples of confidential data include Social Security numbers and vehicle identification numbers.

The following is a sample decision tree that you can refer to when choosing your data sharing option based on service availability, classification type, and data format (structured or unstructured). Other factors like usability, multi-partner accessibility, data size, consumption patterns like bulk load/API access, and more may also affect the choice of data sharing pattern.


In the following sections, we discuss each pattern in more detail.

Pattern 1: Using AWS Data Exchange

AWS Data Exchange makes exchanging data easier, helping organizations lower costs, become more agile, and innovate faster. Organizations can choose to share data privately using AWS Data Exchange with their external partners. AWS Data Exchange offers perimeter controls that are applied at identity and resource levels. These controls decide which external identities have access to specific data resources. AWS Data Exchange provides multiple patterns for external parties to access data, such as the following:

The following diagram illustrates an example architecture.


With AWS Data Exchange, once the dataset to share (or sell) is configured, AWS Data Exchange automatically manages entitlements (and billing) between the producer and the consumer. The producer doesn’t have to manage policies, set up new access points, or create new Amazon Redshift data shares for each consumer, and access is automatically revoked if the subscription ends. This can significantly reduce the operational overhead in sharing data.

Pattern 2: Using AWS Lake Formation for centralized access management

You can use this pattern in cases where both the producer and consumer are on the AWS platform with an AWS account that is enabled to use AWS Lake Formation. This pattern provides a no-code approach to data sharing. The following diagram illustrates an example architecture.


In this pattern, the central governance account has Lake Formation configured for managing access across the producer’s org accounts. Resource links from the production account Amazon Simple Storage Service (Amazon S3) bucket are created in Lake Formation. The producer grants Lake Formation permissions on an AWS Glue Data Catalog resource to an external account, or directly to an AWS Identity and Access Management (IAM) principal in another account. Lake Formation uses AWS Resource Access Manager (AWS RAM) to share the resource. If the grantee account is in the same organization as the grantor account, the shared resource is available immediately to the grantee. If the grantee account is not in the same organization, AWS RAM sends an invitation to the grantee account to accept or reject the resource grant. To make the shared resource available, the consumer administrator in the grantee account must use the AWS RAM console or AWS Command Line Interface (AWS CLI) to accept the invitation.

Authorized principals can share resources explicitly with an IAM principal in an external account. This feature is useful when the producer wants to have control over who in the external account can access the resources. The permissions the IAM principal receives are a union of direct grants and the account-level grants that are cascaded down to the principals. The data lake administrator of the recipient account can view the direct cross-account grants, but can’t revoke permissions.

Pattern 3: Using AWS Lake Formation from the producer external sharing account

The producer may have stringent security requirements where no external consumer should access their production account or their centralized governance account. They may also not have Lake Formation enabled on their production platform. In such cases, as shown in the following diagram, the producer production account (Account A) is dedicated to its internal organization users. The producer creates another account, the producer external sharing account (Account B), which is dedicated for external sharing. This gives the producer more latitude to create specific policies for specific organizations.

The following architecture diagram shows an overview of the pattern.


The producer implements a process to create an asynchronous copy of data in Account B. The bucket can be configured for Same Region Replication (SRR) or Cross Region Replication (CRR) for objects that need to be shared. This facilitates an automated refresh of data to the “External Published Datasets” S3 bucket in the external account without having to write any code.

Creating a copy of the data allows the producer to add another degree of separation between the external consumer and its production data. It also helps meet any compliance or data sovereignty requirements.

Lake Formation is set up on Account B, and the administrator creates resources links for the “External Published Datasets” S3 bucket in its account to grant access. The administrator follows the same process to grant access as described earlier.

Pattern 4: Using Amazon Redshift data sharing

This pattern is ideally suited for a producer who has most of their published data products on Amazon Redshift. This pattern also requires the producer’s external sharing account (Account B) and the consumer account (Account C) to have an encrypted Amazon Redshift cluster or Amazon Redshift Serverless endpoint that meets the prerequisites for Amazon Redshift data sharing.

The following architecture diagram shows an overview of the pattern.


Two options are possible depending on the producer’s compliance constraints:

  • Option A – The producer enables data sharing directly on the production Amazon Redshift cluster.
  • Option B – The producer may have constraints with respect to sharing the production cluster. The producer creates a simple AWS Glue job that copies data from the Amazon Redshift cluster in the production Account A to the Amazon Redshift cluster in the external Account B. This AWS Glue job can be scheduled to refresh data as needed by the consumer. When the data is available in Account B, the producer can create multiple views and multiple data shares as needed.

In both options, the producer maintains complete control over what data is being shared, and the consumer admin maintains full control over who can access the data within their organization.

After both the producer and consumer admins approve the data sharing request, the consumer user can access this data as if it were part of their own account without having to write any additional code.
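The following is a simplified sketch of the data share setup in this pattern. Cluster namespaces, account IDs, and schema names are placeholders, and the cross-account association still has to be authorized by the producer and accepted by the consumer (on the console or through the API) before the consumer can create the database.

-- On the producer cluster (Account B): create and populate the data share
CREATE DATASHARE published_data;
ALTER DATASHARE published_data ADD SCHEMA published;
ALTER DATASHARE published_data ADD ALL TABLES IN SCHEMA published;
GRANT USAGE ON DATASHARE published_data TO ACCOUNT '111122223333';

-- On the consumer cluster (Account C): create a database from the share
CREATE DATABASE shared_published_data
FROM DATASHARE published_data OF ACCOUNT '444455556666' NAMESPACE 'producer-namespace-guid';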

Pattern 5: Sharing data securely and privately using APIs

You can adopt this pattern when the external partner doesn’t have a presence on AWS. You can also use this pattern when published data products are spread across various services like Amazon S3, Amazon Redshift, Amazon DynamoDB, and Amazon OpenSearch Service but the producer would like to maintain a single data sharing interface.

Here’s an example use case: Company A would like to share some of its log data in near-real time with its partner Company B, who uses this data to generate predictive insights for Company A. Company A stores this data in Amazon Redshift. The company wants to share this transactional information with its partner after masking the personally identifiable information (PII) in a cost-effective and secure way to generate insights. Company B doesn’t use the AWS platform.

Company A establishes a microbatch process using an AWS Lambda function or AWS Glue that queries Amazon Redshift to get incremental log data, applies the rules to redact the PII, and loads this data to the “Published Datasets” S3 bucket. This instantiates an SRR/CRR process that refreshes this data in the “External Sharing” S3 bucket.

The following diagram shows how the consumer can then use an API-based approach to access this data.


The workflow contains the following steps:

  1. An HTTPS API request is sent from the API consumer to the API proxy layer.
  2. The HTTPS API request is forwarded from the API proxy to Amazon API Gateway in the external sharing AWS account.
  3. Amazon API Gateway calls the request receiver Lambda function.
  4. The request receiver function writes the status to a DynamoDB control table.
  5. A second Lambda function, the poller, checks the status of the results in the DynamoDB table.
  6. The poller function fetches results from Amazon S3.
  7. The poller function sends a presigned URL to download the file from the S3 bucket to the requestor via Amazon Simple Email Service (Amazon SES).
  8. The requestor downloads the file using the URL.
  9. The network perimeter AWS account only allows egress internet connection.
  10. The API proxy layer enforces both the egress security controls and perimeter firewall before the traffic leaves the producer’s network perimeter.
  11. The AWS Transit Gateway security egress VPC routing table only allows connectivity from the required producer’s subnet, while preventing internet access.

Pattern 6: Using Amazon S3 access points

Data scientists may need to work collaboratively on image, videos, and text documents. Legal and audit groups may want to share reports and statements with the auditing agencies. This pattern discusses an approach to sharing such documents. The pattern assumes that the external partners are also on AWS. Amazon S3 access points allow the producer to share access with their consumer by setting up cross-account access without having to edit bucket policies.

Access points are named network endpoints that are attached to buckets that you can use to perform S3 object operations, such as GetObject and PutObject. Each access point has distinct permissions and network controls that Amazon S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy attached to the underlying bucket.

The following architecture diagram shows an overview of the pattern.


The producer creates an S3 bucket and enables the use of access points. As part of the configuration, the producer specifies the consumer account, IAM role, and privileges for the consumer IAM role.

The consumer users with the IAM role in the consumer account can access the S3 bucket via the internet or restricted to an Amazon VPC via VPC endpoints and AWS PrivateLink.

Conclusion

Each organization has its unique set of constraints and requirements that it needs to fulfill to set up an efficient data sharing solution. In this post, we demonstrated various options and best practices available to organizations. The data platform owner and security team should work together to assess what works best for your specific situation. Your AWS account team is also available to help.

Related resources

For more information on related topics, refer to the following:


About the Authors


Venkata Sistla is a Cloud Architect – Data & Analytics at AWS. He specializes in building data processing capabilities and helping customers remove constraints that prevent them from leveraging their data to develop business insights.

Santosh Chiplunkar is a Principal Resident Architect at AWS. He has over 20 years of experience helping customers solve their data challenges. He helps customers develop their data and analytics strategy and provides them with guidance on how to make it a reality.

A hybrid approach in healthcare data warehousing with Amazon Redshift

Post Syndicated from Bindhu Chinnadurai original https://aws.amazon.com/blogs/big-data/a-hybrid-approach-in-healthcare-data-warehousing-with-amazon-redshift/

Data warehouses play a vital role in healthcare decision-making and serve as a repository of historical data. A healthcare data warehouse can be a single source of truth for clinical quality control systems. Data warehouses are mostly built using the dimensional model approach, which has consistently met business needs.

Loading complex multi-point datasets into a dimensional model, identifying issues, and validating data integrity of the aggregated and merged data points are the biggest challenges that clinical quality management systems face. Additionally, scalability of the dimensional model is complex and poses a high risk of data integrity issues.

The data vault approach solves most of the problems associated with dimensional models, but it brings other challenges in clinical quality control applications and regulatory reports. Because data is closer to the source and stored in raw format, it has to be transformed before it can be used for reporting and other application purposes. This is one of the biggest hurdles with the data vault approach.

In this post, we discuss some of the main challenges enterprise data warehouses face when working with dimensional models and data vaults. We dive deep into a hybrid approach that aims to circumvent the issues posed by these two and also provide recommendations to take advantage of this approach for healthcare data warehouses using Amazon Redshift.

What is a dimensional data model?

Dimensional modeling is a strategy for storing data in a data warehouse using dimensions and facts. It optimizes the database for faster data retrieval. Dimensional models have a distinct structure and organize data to provide reports that increase performance.

In a dimensional model, a transaction record is divided either into facts (often numerical), additive transactional data, or dimensions (referential information that gives context to the facts). This categorization of data into facts and dimensions, as well as the entity-relationship framework of the dimensional model, presents complex business processes in a way that is easy for analysts to understand.

A dimensional model in data warehousing is designed for reading, summarizing, and analyzing numerical information such as patient vital stats, lab reading values, counts, and so on. Regardless of the division or use case it is related to, dimensional data models can be used to store data obtained from tracking various processes like patient encounters, provider practice metrics, aftercare surveys, and more.

The majority of healthcare clinical quality data warehouses are built on top of dimensional modeling techniques. The benefit of using dimensional data modeling is that, when data is stored in a data warehouse, it’s easier to persist and extract it.

Although it’s a competent data structuring technique, the dimensional modeling approach poses challenges in scalability, source tracking, and troubleshooting. Tracking and validating the source of aggregated and computed data points is important in clinical quality regulatory reporting systems, and any mistake in regulatory reports may result in a large penalty from regulatory and compliance agencies. These challenges exist because data points are labeled using meaningless numeric surrogate keys, and any minor error can impair prediction accuracy and consequently affect the quality of judgments. The usual ways to counter these challenges, refactoring and bridging the dimensions, add data noise over time and reduce accuracy.

Let’s look at an example of a typical dimensional data warehouse architecture in healthcare, as shown in the following logical model.

The following diagram illustrates a sample dimensional model entity-relationship diagram.

This data model contains dimensions and fact tables. You can use the following query to retrieve basic provider and patient relationship data from the dimensional model:

SELECT * FROM Fac_PatientEncounter FP
JOIN Dim_PatientEncounter DP ON FP.EncounterKey = DP.EncounterKey
JOIN Dim_Provider PR ON PR.ProviderKey = FP.ProviderKey

Challenges of dimensional modeling

Dimensional modeling requires data preprocessing before generating a star schema, which involves a large amount of data processing. Any change to the dimension definition results in a lengthy and time-consuming reprocessing of the dimension data, which often results in data redundancy.

Another issue is that, when relying merely on dimensional modeling, analysts can’t assure the consistency and accuracy of data sources. This is especially problematic in healthcare, where lineage, compliance, history, and traceability are of prime importance because of the regulations in place.

A data vault seeks to provide an enterprise data warehouse while solving the shortcomings of dimensional modeling approaches. It is a data modeling methodology designed for large-scale data warehouse platforms.

What is a data vault?

The data vault approach is a method and architectural framework for providing a business with data analytics services to support business intelligence, data warehousing, analytics, and data science needs. The data vault is built around business keys (hubs) defined by the company; these are not the same as the keys obtained from the source systems.

Amazon Redshift RA3 instances and Amazon Redshift Serverless are perfect choices for a data vault. And when combined with Amazon Redshift Spectrum, a data vault can deliver more value.

There are three layers to the data vault:

  • Staging
  • Data vault
  • Business vault

Staging involves the creation of a replica of the original data, which is primarily used to aid the process of transporting data from various sources to the data warehouse. There are no restrictions on this layer, and it is typically not persistent. It is 1:1 with the source systems, generally in the same format as that of the sources.

The data vault is based on business keys (hubs), which are defined by the business. All in-scope data is loaded, and auditability is maintained. Integration is at the heart of all data warehousing, and this layer contains integrated data from multiple sources built around the enterprise-wide business keys. Although a data vault resembles a data lake, it provides more of the features of a data warehouse; in effect, it combines the functionalities of both.

The business vault stores the outcome of business rules, including deduplication, conforming results, and even computations. When results are calculated for two or more data marts, this helps eliminate redundant computation and associated inconsistencies.

Because business vaults still don’t satisfy reporting needs, enterprises create a data mart after the business vault to satisfy dashboarding needs.

Data marts are ephemeral views that can be implemented directly on top of the business and raw vaults. This makes it easy to adapt over time and eliminates the danger of inconsistent results. If views don’t give the required level of performance, the results can be stored in a table. This is the presentation layer and is designed to be requirements-driven and scope-specific subsets of the warehouse data. Although dimensional modeling is commonly used to deliver this layer, marts can also be flat files, .xml files, or in other forms.

The following diagram shows the typical data vault model used in clinical quality repositories.

When the dimensional model as shown earlier is converted into a data vault using the same structure, it can be represented as follows.

Advantages of a data vault

Although any data warehouse should be built within the context of an overarching company strategy, data vaults permit incremental delivery. You can start small and gradually add more sources over time, just like Kimball’s dimensional design technique.

With a data vault, you don’t have to redesign the structure when adding new sources, unlike dimensional modeling. Business rules can be easily changed because raw and business-generated data is kept independent of each other in a data vault.

A data vault isolates technical data reorganization from business rules, thereby facilitating the separation of these potentially tricky processes. Similarly, data cleaning can be maintained separately from data import.

A data vault accommodates changes over time. Unlike a pure dimensional design, a data vault separates raw and business-generated data and accepts changes from both sources.

Data vaults make it easy to maintain data lineage because they include metadata identifying the source systems. In contrast to dimensional design, where data is cleansed before loading, data vault updates are always incremental and results are never lost, providing an automatic audit trail.

When raw data is stored in a data vault, historical attributes that weren’t initially available can be added to the presentation area. Data marts can be implemented as views by adding a new column to an existing view.

In data vault 2.0, hash keys eliminate data load dependencies, which allows near-real-time data loading, as well as concurrent data loads of terabytes to petabytes. The process of mastering both entity-relationship modeling and dimensional design takes time and practice, but the process of automating a data vault is easier.
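
As an illustration of how hash keys remove load dependencies, the following is a minimal sketch of a hub load in Redshift SQL. The schema, table, and column names (dv.hub_provider, staging.provider, provider_npi) are assumptions for the example.

-- The hash key is computed deterministically from the business key, so independent
-- sources can load in parallel without coordinating surrogate key generation.
INSERT INTO dv.hub_provider (provider_hash_key, provider_business_key, load_dts, record_source)
SELECT DISTINCT
    MD5(UPPER(TRIM(s.provider_npi))),
    s.provider_npi,
    GETDATE(),
    'claims_system'
FROM staging.provider s
WHERE s.provider_npi IS NOT NULL
  AND NOT EXISTS (
      SELECT 1
      FROM dv.hub_provider h
      WHERE h.provider_hash_key = MD5(UPPER(TRIM(s.provider_npi)))
  );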

Challenges of a data vault

A data vault is not a one-size-fits-all solution for data warehouses, and it does have a few limitations.

To begin with, when feeding the data vault model directly into a report on one subject area, you need to combine multiple types of data. Because reporting technologies generally can’t perform that kind of data processing, this integration can reduce report performance and increase the risk of errors. However, you can improve report performance by incorporating dimensional models or adding reporting layers on top of the data vault, and for data that can be reported directly, a dimensional model can be developed.

Additionally, static data or data that comes from a single source reduces the efficacy of a data vault. Such scenarios negate many of its benefits and require additional business logic that could otherwise be avoided.

The storage requirement for a data vault is also significantly higher. Modeling each subject area as separate hub, link, and satellite tables effectively triples the number of tables, and because loads are insert only, history accumulates continuously. If the data is basic, you can achieve the benefits listed here with a simpler dimensional model rather than deploying a data vault.

The following sample query retrieves provider and patient data from a data vault using the sample model we discussed in this section:

SELECT * FROM Lnk_PatientEncounter LP
JOIN Hub_Provider HP ON LP.ProviderKey = HP.ProviderKey
JOIN Dim_Sat_Provider DSP ON HP.ProviderKey = DSP.ProviderKey AND DSP._Current = 1
JOIN Hub_Patient Pt ON Pt.PatientEncounterKey = LP.PatientEncounterKey
JOIN Dim_Sat_PatientEncounter DPt ON DPt.PatientEncounterKey = Pt.PatientEncounterKey AND DPt._Current = 1

The query involves many joins, which increases the depth and time for the query run, as illustrated in the following chart.

The following table shows that SQL depth and runtime are proportional, where depth is the number of joins. As the number of joins increases, the runtime increases, and therefore so does the cost.

SQL Depth | Runtime in Seconds | Cost per Query in Seconds
14 | 80 | 40,000
12 | 60 | 30,000
5 | 30 | 15,000
3 | 25 | 12,500

The hybrid model addresses major issues raised by the data vault and dimensional model approaches that we’ve discussed in this post, while also allowing improvements in data collection, including IoT data streaming.

What is a hybrid model?

The hybrid model combines the data vault and a portion of the star schema to provide the advantages of both the data vault and dimensional model, and is mainly intended for logical enterprise data warehouses.

The hybrid approach is designed from the bottom up to be gradual and modular, and it can be used for big data, structured, and unstructured datasets. The primary data layer contains the business rules and enterprise-level data standards, as well as the additional metadata needed to transform, validate, and enrich data for dimensional approaches. In this model, data flowing from left to right provides data vault advantages, and data flowing from right to left provides dimensional model advantages. Here, the data vault satellite tables serve as both satellite tables and dimension tables.

After combining the dimensional and the data vault models, the hybrid model can be viewed as follows.

The following is an example entity-relationship diagram of the hybrid model, which consists of a fact table from the dimensional model and all other entities from the data vault. The satellite entity from the data vault plays a dual role: when it’s connected to the data vault, it acts as a satellite table, and when connected to a fact table, it acts as a dimension table. To serve this dual purpose, satellite tables have two keys: a foreign key to connect with the data vault, and a primary key to connect with the fact table.
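
The following is a minimal DDL sketch of such a dual-purpose satellite in Amazon Redshift. The table and column names are illustrative and follow the naming used in the sample queries in this post.

-- Dim_Sat_Provider acts as a satellite (provider_hash_key joins to the hub) and as a
-- dimension (DimProviderID is the surrogate key referenced by Fac_PatientEncounter).
CREATE TABLE dv.Dim_Sat_Provider (
    DimProviderID     BIGINT IDENTITY(1,1),   -- primary key side: joins to the fact table
    provider_hash_key CHAR(32) NOT NULL,      -- foreign key side: joins to Hub_Provider
    provider_name     VARCHAR(256),
    specialty         VARCHAR(128),
    _Current          SMALLINT DEFAULT 1,     -- current-row flag used in the sample queries
    load_dts          TIMESTAMP,
    record_source     VARCHAR(64)
)
DISTKEY (provider_hash_key)
SORTKEY (provider_hash_key, load_dts);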

The following diagram illustrates the physical hybrid data model.

The following diagram illustrates a typical hybrid data warehouse architecture.

The following query retrieves provider and patient data from the hybrid model:

SELECT * FROM Fac_PatientEncounter FP
JOIN Dim_Sat_Provider DSP ON FP.DimProviderID = DSP.DimProviderID
JOIN Dim_Sat_PatientEncounter DPt ON DPt.DimPatientEncounterID = FP.DimPatientEncounterID

The number of joined tables is reduced from five to three by using the hybrid model.

Advantages of using the hybrid model

With this model, structural information is segregated from descriptive information to promote flexibility and avoid re-engineering when changes occur. It also maintains data integrity, helping organizations avoid the hefty fines that can result when data integrity is compromised.

The hybrid paradigm enables non-data professionals to interact with raw data by allowing users to update or create metadata and data enrichment rules. The hybrid approach simplifies the process of gathering and evaluating datasets for business applications. It enables concurrent data loading and eliminates the need for a corporate vault.

The hybrid model also benefits from the absence of dependencies between objects in the data storage layer, which makes a hybrid data warehouse much easier to scale.

You can build the hybrid model on AWS and take advantage of the benefits of Amazon Redshift, which is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, simple, and secure analytics at scale. Amazon Redshift continuously adds features that make it faster, more elastic, and easier to use:

  • Amazon Redshift data sharing enhances the hybrid model by eliminating the need to copy data across departments. It also simplifies the work of keeping a single source of truth, saving storage and limiting redundancy. It enables instant, granular, and fast data access across Amazon Redshift clusters without the need to copy or move data. Data sharing provides live access to data so that users always see the most up-to-date and consistent information as it’s updated in the data warehouse. (A short SQL sketch of data sharing follows this list.)
  • Redshift Spectrum enables you to query open format data directly in the Amazon Simple Storage Service (Amazon S3) data lake without having to load the data or duplicate your infrastructure, and it integrates well with the data lake.
  • With Amazon Redshift concurrency scaling, you can get consistently fast performance for thousands of concurrent queries and users. It instantly adds capacity to support additional users and removes it when the load subsides, with nothing to manage at your end.
  • To realize the benefits of using a hybrid model on AWS, you can get started today without needing to provision and manage data warehouse clusters using Redshift Serverless. All the related services that Amazon Redshift integrates with (such as Amazon Kinesis, AWS Lambda, Amazon QuickSight, Amazon SageMaker, Amazon EMR, AWS Lake Formation, and AWS Glue) are available to work with Redshift Serverless.
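
To give a feel for the data sharing feature mentioned in the preceding list, here is a minimal SQL sketch; the datashare, schema, and namespace values are placeholders.

-- On the producer cluster: expose a curated schema to a consumer namespace (no data is copied).
CREATE DATASHARE clinical_share;
ALTER DATASHARE clinical_share ADD SCHEMA clinical_mart;
ALTER DATASHARE clinical_share ADD ALL TABLES IN SCHEMA clinical_mart;
GRANT USAGE ON DATASHARE clinical_share TO NAMESPACE '<consumer-namespace-id>';

-- On the consumer cluster: surface the share as a local database and query it live.
CREATE DATABASE clinical_share_db FROM DATASHARE clinical_share OF NAMESPACE '<producer-namespace-id>';
SELECT COUNT(*) FROM clinical_share_db.clinical_mart.fac_patientencounter;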

Conclusion

With the hybrid model, data can be transformed and loaded into a target data model efficiently and transparently. With this approach, data partners can research data networks more efficiently and promote comparative effectiveness. And with the several newly introduced features of Amazon Redshift, a lot of heavy lifting is done by AWS to handle your workload demands, and you only pay for what you use.

You can get started with the following steps:

  1. Create an Amazon Redshift RA3 instance for your primary clinical data repository and data marts.
  2. Build a data vault schema for the raw vault and create materialized views for the business vault (see the materialized view sketch after this list).
  3. Enable Amazon Redshift data shares to share data between the producer cluster and consumer cluster.
  4. Load the structured and unstructured data into the producer cluster data vault for business use.
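
For step 2, business vault objects can be expressed as materialized views over the raw vault. The following is a minimal sketch under assumed names (the dv and bv schemas and the provider hub and satellite from the earlier examples).

-- Business rule applied in the business vault: keep only the current satellite row per provider.
CREATE MATERIALIZED VIEW bv.provider_current AS
SELECT
    h.provider_business_key,
    s.provider_name,
    s.specialty,
    s.load_dts
FROM dv.hub_provider h
JOIN dv.Dim_Sat_Provider s ON s.provider_hash_key = h.provider_hash_key
WHERE s._Current = 1;

-- Refresh after each raw vault load so downstream data marts see the latest state.
REFRESH MATERIALIZED VIEW bv.provider_current;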

About the Authors

Bindhu Chinnadurai is a Senior Partner Solutions Architect in AWS based out of London, United Kingdom. She has spent 18+ years working with large-scale enterprise environments. Currently she engages with AWS partners to help customers migrate their workloads to AWS, with a focus on scalability, resiliency, performance, and sustainability. Her expertise is in DevSecOps.

Sarathi Balakrishnan was the Global Partner Solutions Architect, specializing in Data, Analytics and AI/ML at AWS. He worked closely with AWS partners globally to build solutions and platforms on AWS to accelerate customers’ business outcomes with state-of-the-art cloud technologies and achieve more in their cloud explorations. He helped with solution architecture, technical guidance, and best practices to build cloud-native solutions. He joined AWS with over 20 years of large enterprise experience in the agriculture, insurance, healthcare and life sciences, and marketing and advertisement industries, where he developed and implemented data and AI strategies.

Build a data storytelling application with Amazon Redshift Serverless and Toucan

Post Syndicated from Louis Hourcade original https://aws.amazon.com/blogs/big-data/build-a-data-storytelling-application-with-amazon-redshift-serverless-and-toucan/

This post was co-written with Django Bouchez, Solution Engineer at Toucan.

Business intelligence (BI) with dashboards, reports, and analytics remains one of the most popular use cases for data and analytics. It provides business analysts and managers with a visualization of the business’s past and current state, helping leaders make strategic decisions that dictate the future. However, customers continue to ask for better ways to tell stories with their data, and therefore increase the adoption rate of their BI tools.

Most BI tools on the market provide an exhaustive set of customization options to build data visualizations. It might appear to be a good idea, but it ultimately burdens business analysts who need to navigate endless possibilities before building a report. Analysts are not graphic designers, and a poorly designed data visualization can hide the insight it’s intended to convey, or even mislead the viewer. To realize more value from your data, you should focus on building data visualizations that tell stories and are easily understandable by your audience. This is where guided analytics helps. Instead of presenting unlimited options for customization, it intentionally limits choice by enforcing design best practices. The simplicity of a guided experience enables business analysts to spend more time generating actual insights rather than worrying about how to present them.

This post illustrates the concept of guided analytics and shows you how you can build a data storytelling application with Amazon Redshift Serverless and Toucan, an AWS Partner. Toucan natively integrates with Redshift Serverless, which enables you to deploy a scalable data stack in minutes without the need to manage any infrastructure component.

Amazon Redshift is a fully managed cloud data warehouse service that enables you to analyze large amounts of structured and semi-structured data. Amazon Redshift can scale from a few gigabytes to a petabyte-scale data warehouse, and AWS recently announced the global availability of Redshift Serverless, making it one of the best options for storing data and running ad hoc analytics in a scalable and cost-efficient way.

With Redshift Serverless, you can get insights on your data by running standalone SQL queries or by using data visualizations tools such as Amazon QuickSight, Toucan, or other third-party options without having to manage your data warehouse infrastructure.

Toucan is a cloud-based guided analytics platform built with one goal in mind: reduce the complexity of bringing data insights to business users. For this purpose, Toucan provides a no-code and comprehensive user experience at every stage of the data storytelling application, which includes data connection, building the visualization, and distribution on any device.

If you’re in a hurry and want to see what you can do with this integration, check out Shark attacks visualization with AWS & Toucan, where Redshift Serverless and Toucan help in understanding the evolution of shark attacks in the world.

Overview of solution

There are many BI tools in the market, each providing an ever-increasing set of capabilities and customization options to differentiate from the competition. Paradoxically, this doesn’t seem to increase the adoption rate of BI tools in enterprises. With more complex tools, data owners spend time building fancy visuals, and tend to add as much information as possible in their dashboards instead of providing a clear and simple message to business users.

In this post, we illustrate the concept of guided analytics by putting ourselves in the shoes of a data engineer that needs to communicate stories to business users with data visualizations. This fictional data engineer has to create dashboards to understand how shark attacks evolved in the last 120 years. After loading the shark attacks dataset in Redshift Serverless, we guide you in using Toucan to build stories that provide a better understanding of shark attacks through time. With Toucan, you can natively connect to datasets in Redshift Serverless, transform the data with a no-code interface, build storytelling visuals, and publish them for business users. The shark attacks visualization example illustrates what you can achieve by following instructions in this post.

Additionally, we have recorded a video tutorial that explains how to connect Toucan with Redshift Serverless and start building charts.

Solution architecture

The following diagram depicts the architecture of our solution.

Architecture diagram

We use an AWS CloudFormation stack to deploy all the resources you need in your AWS account:

  • Networking components – This includes a VPC, three public subnets, an internet gateway, and a security group to host the Redshift Serverless endpoint. In this post, we use public subnets to facilitate data access from external sources such as Toucan instances. In this case, the data in Redshift Serverless is still protected by the security group that restricts incoming traffic, and by the database credentials. For a production workload, it is recommended to keep traffic in the Amazon network. For that, you can set the Redshift Serverless endpoints in private subnets, and deploy Toucan in your AWS account through the AWS Marketplace.
  • Redshift Serverless components – This includes a Redshift Serverless namespace and workgroup. The Redshift Serverless workgroup is publicly accessible to facilitate the connection from Toucan instances. The database name and the administrator user name are defined as parameters when deploying the CloudFormation stack, and the administrator password is created in AWS Secrets Manager. In this post, we use database credentials to connect to Redshift Serverless, but Toucan also supports connection with AWS credentials and AWS Identity and Access Management (IAM) profiles.
  • Custom resources – The CloudFormation stack includes a custom resource, which is an AWS Lambda function that loads shark attacks data automatically in your Redshift Serverless database when the CloudFormation stack is created.
  • IAM roles and permissions – Finally, the CloudFormation stack includes all IAM roles associated with services previously mentioned to interact with other AWS resources in your account.

In the following sections, we provide all the instructions to connect Toucan with your data in Redshift Serverless, and guide you to build your data storytelling application.

Sample dataset

In this post, we use a custom dataset that lists all known shark attacks in the world, starting from 1900. You don’t have to import the data yourself; we use the Amazon Redshift COPY command to load the data when deploying the CloudFormation stack. The COPY command is one of the fastest and most scalable methods to load data into Amazon Redshift. For more information, refer to Using a COPY command to load data.
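
If you want to load a similar CSV file yourself, a COPY statement of roughly this shape would do it; the bucket, file name, and IAM role below are placeholders, not the values used by the CloudFormation stack.

-- Placeholder S3 location and IAM role; the stack's custom resource performs an equivalent load.
COPY public.shark_attacks
FROM 's3://<your-bucket>/shark-attacks/shark_attacks.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
FORMAT AS CSV
IGNOREHEADER 1
DATEFORMAT 'auto';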

The dataset contains 4,900 records with the following columns:

  • Date
  • Year
  • Decade
  • Century
  • Type
  • Zone_Type
  • Zone
  • Country
  • Activity
  • Sex
  • Age
  • Fatal
  • Time
  • Species
  • href (a PDF link with the description of the context)
  • Case_Number

Prerequisites

For this solution, you should have the following prerequisites:

  • An AWS account. If you don’t have one already, see the instructions in Sign Up for AWS.
  • An IAM user or role with permissions on AWS resources used in this solution.
  • A Toucan free trial to build the data storytelling application.

Set up the AWS resources

You can launch the CloudFormation stack in any Region where Redshift Serverless is available.

  1. Choose Launch Stack to start creating the required AWS resources for this post:

  2. Specify the database name in Redshift Serverless (default is dev).
  3. Specify the administrator user name (default is admin).

You don’t have to specify the database administrator password because it’s created in Secrets Manager by the CloudFormation stack. The secret’s name is AWS-Toucan-Redshift-Password. We use the secret value in subsequent steps.

Test the deployment

The CloudFormation stack takes a few minutes to deploy. When it’s complete, you can confirm the resources were created. To access your data, you need to get the Redshift Serverless database credentials.

  1. On the Outputs tab for the CloudFormation stack, note the name of the Secrets Manager secret.


  2. On the Secrets Manager console, navigate to the Amazon Redshift database secret and choose Retrieve secret value to get the database administrator user name and password.

  3. To make sure your Redshift Serverless database is available and contains the shark attacks dataset, open the Redshift Serverless workgroup on the Amazon Redshift console and choose Query data to access the query editor.
  4. Also note the Redshift Serverless endpoint, which you need to connect with Toucan.

  5. In the Amazon Redshift query editor, run the following SQL query to view the shark attacks data:
SELECT * FROM "dev"."public"."shark_attacks";

Redshift Query Editor v2

Note that you need to change the name of the database in the SQL query if you change the default value when launching the CloudFormation stack.

You have configured Redshift Serverless in your AWS account and uploaded the shark attacks dataset. Now it’s time to use this data by building a storytelling application.

Launch your Toucan free trial

The first step is to access Toucan platform through the Toucan free trial.

Fill out the form and complete the signup steps. You then arrive in the Storytelling Studio, in Staging mode. Feel free to explore what has already been created.

Toucan Home page

Connect Redshift Serverless with Toucan

To connect Redshift Serverless and Toucan, complete the following steps:

  1. Choose Datastore at the bottom of the Toucan Storytelling Studio.
  2. Choose Connectors.

Toucan is natively integrated with Redshift Serverless through AnyConnect.

  3. Search for the Amazon Redshift connector, and complete the form with the following information:
    • Name – The name of the connector in Toucan.
    • Host – Your Redshift Serverless endpoint.
    • Port – The listening port of your Amazon Redshift database (5439).
    • Default Database – The name of the database to connect to (dev by default, unless edited in the CloudFormation stack parameters).
    • Authentication Method – The authentication mechanism to connect to Redshift Serverless. In this case, we use database credentials.
    • User – The user name to use for authentication with Redshift Serverless (admin by default, unless edited in the CloudFormation stack parameters).
    • Password – The password to use for authentication with Redshift Serverless (you should retrieve it from Secrets Manager; the secret’s name is AWS-Toucan-Redshift-Password).

Toucan connection

Create a live query

You are now connected to Redshift Serverless. Complete the following steps to create a query:

  1. On the home page, choose Add tile to create a new visualization.

Toucan new tile

  2. Choose the Live Connections tab, then choose the Amazon Redshift connector you created in the previous step.

Toucan Live Connection

The Toucan trial guides you in building your first live query, in which you can transform your data without writing code using the Toucan YouPrep module.

For instance, as shown in the following screenshot, you can use this no-code interface to compute the sum of fatal shark attacks by activities, get the top five, and calculate the percentage of the total.

Toucan query data
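
For comparison, the same transformation can be expressed in SQL directly against the dataset. This is only an illustration (the exact values in the Fatal column are an assumption); the point of YouPrep is that you don’t have to write it.

-- Assumes Fatal = 'Y' marks a fatal attack; adjust to the actual column values.
WITH fatal_by_activity AS (
    SELECT activity, COUNT(*) AS fatal_attacks
    FROM public.shark_attacks
    WHERE fatal = 'Y'
    GROUP BY activity
)
SELECT
    activity,
    fatal_attacks,
    ROUND(100.0 * fatal_attacks / SUM(fatal_attacks) OVER (), 1) AS pct_of_total
FROM fatal_by_activity
ORDER BY fatal_attacks DESC
LIMIT 5;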

Build your first chart

When your data is ready, choose the Tile tab and complete the form that helps you build charts.

For example, you can configure a leaderboard of the five most dangerous activities, and add a highlight for activities with more than 100 attacks.

Choose Save Changes to save your work and go back to the home page.

Toucan chart builder

Publish and share your work

Until this stage, you have been working in Staging mode. To make your work available to everyone, you need to publish it to Production.

On the bottom right of the home page, choose the eye icon to preview your work by putting yourself in the shoes of your future end-users. You can then choose Publish to make your work available to all.

Toucan publish

Toucan also offers multiple embedding options to make your charts easier for end-users to access on any device, such as mobile or tablet.

Toucan multi devices

Following these steps, you connected to Redshift Serverless, transformed the data with the Toucan no-code interface, and built data visualizations for business end-users. The Toucan trial guides you in every stage of this process to help you get started.

Redshift Serverless and Toucan guided analytics provide an efficient approach to increase the adoption rate of BI tools by decreasing infrastructure work for data engineers, and by simplifying dashboard understanding for business end-users. This post only covered a small part of what Redshift Serverless and Toucan offer, so feel free to explore other functionalities in the Amazon Redshift Serverless documentation and Toucan documentation.

Clean up

Some of the resources deployed in this post through the CloudFormation template incur costs as long as they’re in use. Be sure to remove the resources and clean up your work when you’re finished in order to avoid unnecessary cost.

On the CloudFormation console, choose Delete stack to remove all resources.

Conclusion

This post showed you how to set up an end-to-end architecture for guided analytics with Redshift Serverless and Toucan.

This solution benefits from the scalability of Redshift Serverless, which enables you to store, transform, and expose data in a cost-efficient way, and without any infrastructure to manage. Redshift Serverless natively integrates with Toucan, a guided analytics tool designed to be used by everyone, on any device.

Guided analytics focuses on communicating stories through data reports. By setting intentional constraints on customization options, Toucan makes it easy for data owners to build meaningful dashboards with a clear and concise message for end-users. It works for both your internal and external customers, on an unlimited number of use cases.

Try it now with our CloudFormation template and a free Toucan trial!


About the Authors


Louis Hourcade is a Data Scientist in the AWS Professional Services team. He works with AWS customers across various industries to accelerate their business outcomes with innovative technologies. In his spare time he enjoys running, climbing big rocks, and surfing (not so big) waves.


Benjamin Menuet is a Data Architect with AWS Professional Services. He helps customers develop big data and analytics solutions to accelerate their business outcomes. Outside of work, Benjamin is a trail runner and has finished some mythic races like the UTMB.


Xavier Naunay is a Data Architect with AWS Professional Services. He is part of the AWS ProServe team, helping enterprise customers solve complex problems using AWS services. In his free time, he is either traveling or learning about technology and other cultures.


Django Bouchez is a Solution Engineer at Toucan. He works alongside the Sales team to provide support on technical and functional validation and proof, and is also helping R&D demo new features with Cloud Partners like AWS. Outside of work, Django is a homebrewer and practices scuba diving and sport climbing.

Automate deployment of an Amazon QuickSight analysis connecting to an Amazon Redshift data warehouse with an AWS CloudFormation template

Post Syndicated from Sandeep Bajwa original https://aws.amazon.com/blogs/big-data/automate-deployment-of-an-amazon-quicksight-analysis-connecting-to-an-amazon-redshift-data-warehouse-with-an-aws-cloudformation-template/

Amazon Redshift is the most widely used data warehouse in the cloud, best suited for analyzing exabytes of data and running complex analytical queries. Amazon QuickSight is a fast business analytics service to build visualizations, perform ad hoc analysis, and quickly get business insights from your data. QuickSight provides easy integration with Amazon Redshift, providing native access to all your data and enabling organizations to scale their business analytics capabilities to hundreds of thousands of users. QuickSight delivers fast and responsive query performance by using a robust in-memory engine (SPICE).

As a QuickSight administrator, you can use AWS CloudFormation templates to migrate assets between distinct environments from development, to test, to production. AWS CloudFormation helps you model and set up your AWS resources so you can spend less time managing those resources and more time focusing on your applications that run in AWS. You no longer need to create data sources or analyses manually. You create a template that describes all the AWS resources that you want, and AWS CloudFormation takes care of provisioning and configuring those resources for you. In addition, with versioning, you have your previous assets, which provides the flexibility to roll back deployments if the need arises. For more details, refer to Amazon QuickSight resource type reference.

In this post, we show how to automate the deployment of a QuickSight analysis connecting to an Amazon Redshift data warehouse with a CloudFormation template.

Solution overview

Our solution consists of the following steps:

  1. Create a QuickSight analysis using an Amazon Redshift data source.
  2. Create a QuickSight template for your analysis.
  3. Create a CloudFormation template for your analysis using the AWS Command Line Interface (AWS CLI).
  4. Use the generated CloudFormation template to deploy a QuickSight analysis to a target environment.

The following diagram shows the architecture of how you can have multiple AWS accounts, each with its own QuickSight environment connected to its own Amazon Redshift data source. In this post, we outline the steps involved in migrating QuickSight assets in the dev account to the prod account. For this post, we use Amazon Redshift as the data source and create a QuickSight visualization using the Amazon Redshift sample TICKIT database.

The following diagram illustrates flow of the high-level steps.

Prerequisites

Before setting up the CloudFormation stacks, you must have an AWS account and an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and the services listed in the architecture.

Create a QuickSight analysis in your dev environment

In this section, we walk through the steps to set up your QuickSight analysis using an Amazon Redshift data source.

Create an Amazon Redshift data source

To connect to your Amazon Redshift data warehouse, you need to create a data source in QuickSight. As shown in the following screenshot, you have two options:

  • Auto-discovered
  • Manual connect

QuickSight auto-discovers Amazon Redshift clusters that are associated with your AWS account. These resources must be located in the same Region as your QuickSight account.

For more details, refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters.

You can also manually connect and create a data source.

Create an Amazon Redshift dataset

The next step is to create a QuickSight dataset, which identifies the specific data in a data source you want to use.

For this post, we use the TICKIT database created in an Amazon Redshift data warehouse, which consists of seven tables: two fact tables and five dimensions, as shown in the following figure.

This sample database application helps analysts track sales activity for the fictional TICKIT website, where users buy and sell tickets online for sporting events, shows, and concerts.

  1. On the Datasets page, choose New dataset.
  2. Choose the data source you created in the previous step.
  3. Choose Use custom SQL.
  4. Enter the custom SQL as shown in the following screenshot.
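
The custom SQL we use is a TICKIT top-sellers query along the lines of the one that appears later in the CloudFormation dataset definition:

SELECT sellerid, username, (firstname || ' ' || lastname) AS name, city, SUM(qtysold) AS sales
FROM sales, date, users
WHERE sales.sellerid = users.userid
  AND sales.dateid = date.dateid
  AND year = 2008
GROUP BY sellerid, username, name, city
ORDER BY 5 DESC
LIMIT 10;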

The following screenshot shows our completed data source.

Create a QuickSight analysis

The next step is to create an analysis that utilizes this dataset. In QuickSight, you analyze and visualize your data in analyses. When you’re finished, you can publish your analysis as a dashboard to share with others in your organization.

  1. On the All analyses tab of the QuickSight start page, choose New analysis.

The Datasets page opens.

  2. Choose a dataset, then choose Use in analysis.

  3. Create a visual. For more information about creating visuals, see Adding visuals to Amazon QuickSight analyses.

Create a QuickSight template from your analysis

A QuickSight template is a named object in your AWS account that contains the definition of your analysis and references to the datasets used. You can create a template using the QuickSight API by providing the details of the source analysis via a parameter file. You can use templates to easily create a new analysis.

You can use AWS Cloud9 from the console to run AWS CLI commands.

The following AWS CLI command demonstrates how to create a QuickSight template based on the sales analysis you created (provide your AWS account ID for your dev account):

aws quicksight create-template --aws-account-id <DEVACCOUNT> --template-id QS-RS-SalesAnalysis-Temp --cli-input-json file://parameters.json

The parameters.json file contains the following details (provide your source QuickSight user ARN, analysis ARN, and dataset ARN):

{
  "Name": "QS-RS-SalesAnalysis-Temp",
  "Permissions": [
    {
      "Principal": "<QS-USER-ARN>",
      "Actions": [
        "quicksight:CreateTemplate",
        "quicksight:DescribeTemplate",
        "quicksight:DescribeTemplatePermissions",
        "quicksight:UpdateTemplate"
      ]
    }
  ],
  "SourceEntity": {
    "SourceAnalysis": {
      "Arn": "<QS-ANALYSIS-ARN>",
      "DataSetReferences": [
        {
          "DataSetPlaceholder": "sales",
          "DataSetArn": "<QS-DATASET-ARN>"
        }
      ]
    }
  },
  "VersionDescription": "1"
}

You can use the AWS CLI describe-user, describe_analysis, and describe_dataset commands to get the required ARNs.

To upload the updated parameters.json file to AWS Cloud9, choose File from the toolbar and choose Upload local file.

The QuickSight template is created in the background. QuickSight templates aren’t visible within the QuickSight UI; they’re a developer-managed or admin-managed asset that is only accessible via the AWS CLI or APIs.

To check the status of the template, run the describe-template command:

aws quicksight describe-template --aws-account-id <DEVACCOUNT> --template-id "QS-RS-SalesAnalysis-Temp"

The following code shows command output:

Copy the template ARN; we need it later to create a template in the production account.

The QuickSight template permissions in the dev account need to be updated to give access to the prod account. Run the following command to update the QuickSight template. This provides the describe privilege to the target account to extract details of the template from the source account:

aws quicksight update-template-permissions --aws-account-id <DEVACCOUNT> --template-id "QS-RS-SalesAnalysis-Temp" --grant-permissions file://TemplatePermission.json

The file TemplatePermission.json contains the following details (provide your target AWS account ID):

[
  {
    "Principal": "arn:aws:iam::<TARGET ACCOUNT>",
    "Actions": [
      "quicksight:UpdateTemplatePermissions",
      "quicksight:DescribeTemplate"
    ]
  }
]

To upload the updated TemplatePermission.json file to AWS Cloud9, choose the File menu from the tool bar and choose Upload local file.

Create a CloudFormation template

In this section, we create a CloudFormation template containing our QuickSight assets. In this example, we use a YAML formatted template saved on our local machine. We update the following different sections of the template:

  • AWS::QuickSight::DataSource
  • AWS::QuickSight::DataSet
  • AWS::QuickSight::Template
  • AWS::QuickSight::Analysis

Some of the information required to complete the CloudFormation template can be gathered from the source QuickSight account via the describe AWS CLI commands, and some information needs to be updated for the target account.

Create an Amazon Redshift data source in AWS CloudFormation

In this step, we add the AWS::QuickSight::DataSource section of the CloudFormation template.

Gather the following information on the Amazon Redshift cluster in the target AWS account (production environment):

  • VPC connection ARN
  • Host
  • Port
  • Database
  • User
  • Password
  • Cluster ID

You have the option to create a custom DataSourceID. This ID is unique per Region for each AWS account.

Add the following information to the template:

Resources:
  RedshiftBuildQSDataSource:
    Type: 'AWS::QuickSight::DataSource'
    Properties:  
      DataSourceId: "RS-Sales-DW"      
      AwsAccountId: !Sub '${AWS::AccountId}'
      VpcConnectionProperties:
        VpcConnectionArn: <VPC-CONNECTION-ARN>      
      Type: REDSHIFT   
      DataSourceParameters:
        RedshiftParameters:     
          Host: "<HOST>"
          Port: <PORT>
          ClusterId: "<CLUSTER ID>"
          Database: "<DATABASE>"    
      Name: "RS-Sales-DW"
      Credentials:
        CredentialPair:
          Username: <USER>
          Password: <PASSWORD>
      Permissions:

Create an Amazon Redshift dataset in AWS CloudFormation

In this step, we add the AWS::QuickSight::DataSet section in the CloudFormation template to match the dataset definition from the source account.

Gather the dataset details and run the list-data-sets command to get all datasets from the source account (provide your source dev account ID):

aws quicksight list-data-sets  --aws-account-id <DEVACCOUNT>

The following code is the output:

Run the describe-data-set command, specifying the dataset ID from the previous command’s response:

aws quicksight describe-data-set --aws-account-id <DEVACCOUNT> --data-set-id "<YOUR-DATASET-ID>"

The following code shows partial output:

Based on the dataset description, add the AWS::Quicksight::DataSet resource in the CloudFormation template, as shown in the following code. Note that you can also create a custom DataSetID. This ID is unique per Region for each AWS account.

QSRSBuildQSDataSet:
    Type: 'AWS::QuickSight::DataSet'
    Properties:
      DataSetId: "RS-Sales-DW" 
      Name: "sales" 
      AwsAccountId: !Sub '${AWS::AccountId}'
      PhysicalTableMap:
        PhysicalTable1:          
          CustomSql:
            SqlQuery: "select sellerid, username, (firstname ||' '|| lastname) as name,city, sum(qtysold) as sales
              from sales, date, users
              where sales.sellerid = users.userid and sales.dateid = date.dateid and year = 2008
              group by sellerid, username, name, city
              order by 5 desc
              limit 10"
            DataSourceArn: !GetAtt RedshiftBuildQSDataSource.Arn
            Name"RS-Sales-DW"
            Columns:
            - Type: INTEGER
              Name: sellerid
            - Type: STRING
              Name: username
            - Type: STRING
              Name: name
            - Type: STRING
              Name: city
            - Type: DECIMAL
              Name: sales                                     
      LogicalTableMap:
        LogicalTable1:
          Alias: sales
          Source:
            PhysicalTableId: PhysicalTable1
          DataTransforms:
          - CastColumnTypeOperation:
              ColumnName: sales
              NewColumnType: DECIMAL
      Permissions:
        - Principal: !Join 
            - ''
            - - 'arn:aws:quicksight:'
              - !Ref QuickSightIdentityRegion
              - ':'
              - !Ref 'AWS::AccountId'
              - ':user/default/'
              - !Ref QuickSightUser
          Actions:
            - 'quicksight:UpdateDataSetPermissions'
            - 'quicksight:DescribeDataSet'
            - 'quicksight:DescribeDataSetPermissions'
            - 'quicksight:PassDataSet'
            - 'quicksight:DescribeIngestion'
            - 'quicksight:ListIngestions'
            - 'quicksight:UpdateDataSet'
            - 'quicksight:DeleteDataSet'
            - 'quicksight:CreateIngestion'
            - 'quicksight:CancelIngestion'
      ImportMode: DIRECT_QUERY

You can set ImportMode to either DIRECT_QUERY or SPICE.

Create a QuickSight template in AWS CloudFormation

In this step, we add the AWS::QuickSight::Template section in the CloudFormation template, representing the analysis template.

Use the source template ARN you created earlier and add the AWS::Quicksight::Template resource in the CloudFormation template:

QSTCFBuildQSTemplate:
    Type: 'AWS::QuickSight::Template'
    Properties:
      TemplateId: "QS-RS-SalesAnalysis-Temp"
      Name: "QS-RS-SalesAnalysis-Temp"
      AwsAccountId: !Sub '${AWS::AccountId}'
      SourceEntity:
        SourceTemplate:
          Arn: '<SOURCE-TEMPLATE-ARN>'          
      Permissions:
        - Principal: !Join 
            - ''
            - - 'arn:aws:quicksight:'
              - !Ref QuickSightIdentityRegion
              - ':'
              - !Ref 'AWS::AccountId'
              - ':user/default/'
              - !Ref QuickSightUser
          Actions:
            - 'quicksight:DescribeTemplate'
      VersionDescription: Initial version - Copied over from AWS account.

Create a QuickSight analysis

In this last step, we add the AWS::QuickSight::Analysis section in the CloudFormation template. The analysis is linked to the template created in the target account.

Add the AWS::Quicksight::Analysis resource in the CloudFormation template as shown in the following code:

QSRSBuildQSAnalysis:
    Type: 'AWS::QuickSight::Analysis'
    Properties:
      AnalysisId: 'Sales-Analysis'
      Name: 'Sales-Analysis'
      AwsAccountId: !Sub '${AWS::AccountId}'
      SourceEntity:
        SourceTemplate:
          Arn: !GetAtt  QSTCFBuildQSTemplate.Arn
          DataSetReferences:
            - DataSetPlaceholder: 'sales'
              DataSetArn: !GetAtt QSRSBuildQSDataSet.Arn
      Permissions:
        - Principal: !Join 
            - ''
            - - 'arn:aws:quicksight:'
              - !Ref QuickSightIdentityRegion
              - ':'
              - !Ref 'AWS::AccountId'
              - ':user/default/'
              - !Ref QuickSightUser
          Actions:
            - 'quicksight:RestoreAnalysis'
            - 'quicksight:UpdateAnalysisPermissions'
            - 'quicksight:DeleteAnalysis'
            - 'quicksight:DescribeAnalysisPermissions'
            - 'quicksight:QueryAnalysis'
            - 'quicksight:DescribeAnalysis'
            - 'quicksight:UpdateAnalysis'      

Deploy the CloudFormation template in the production account

To create a new CloudFormation stack that uses the preceding template via the AWS CloudFormation console, complete the following steps:

  1. On the AWS CloudFormation console, choose Create Stack.
  2. On the drop-down menu, choose with new resources (standard).
  3. For Prepare template, select Template is ready.
  4. For Specify template, choose Upload a template file.
  5. Save the provided CloudFormation template in a .yaml file and upload it.
  6. Choose Next.
  7. Enter a name for the stack. For this post, we use QS-RS-CF-Stack.
  8. Choose Next.
  9. Choose Next again.
  10. Choose Create Stack.

The status of the stack changes to CREATE_IN_PROGRESS, then to CREATE_COMPLETE.

Verify the QuickSight objects in the following table have been created in the production environment.

QuickSight Object Type | Object Name (Dev) | Object Name (Prod)
Data Source | RS-Sales-DW | RS-Sales-DW
Dataset | Sales | Sales
Template | QS-RS-Sales-Temp | QS-RS-SalesAnalysis-Temp
Analysis | Sales Analysis | Sales-Analysis

The following example shows that Sales Analysis was created in the target account.

Conclusion

This post demonstrated an approach to migrate a QuickSight analysis with an Amazon Redshift data source from one QuickSight account to another with a CloudFormation template.

For more information about automating dashboard deployment, customizing access to the QuickSight console, configuring for team collaboration, and implementing multi-tenancy and client user segregation, check out the videos Virtual Admin Workshop: Working with Amazon QuickSight APIs and Admin Level-Up Virtual Workshop, V2 on YouTube.


About the author

Sandeep Bajwa is a Sr. Analytics Specialist based out of Northern Virginia, specialized in the design and implementation of analytics and data lake solutions.

How OLX Group migrated to Amazon Redshift RA3 for simpler, faster, and more cost-effective analytics

Post Syndicated from Miguel Chin original https://aws.amazon.com/blogs/big-data/how-olx-group-migrated-to-amazon-redshift-ra3-for-simpler-faster-and-more-cost-effective-analytics/

This is a guest post by Miguel Chin, Data Engineering Manager at OLX Group and David Greenshtein, Specialist Solutions Architect for Analytics, AWS.

OLX Group is one of the world’s fastest-growing networks of online marketplaces, operating in over 30 countries around the world. We help people buy and sell cars, find housing, get jobs, buy and sell household goods, and much more.

We live in a data-producing world, and as companies want to become data driven, there is the need to analyze more and more data. These analyses are often done using data warehouses. However, a common data warehouse issue with ever-growing volumes of data is storage limitations and the degrading performance that comes with it. This scenario is very familiar to us in OLX Group. Our data warehouse is built using Amazon Redshift and is used by multiple internal teams to power their products and data-driven business decisions. As such, it’s crucial to maintain a cluster with high availability and performance while also being storage cost-efficient.

In this post, we share how we modernized our Amazon Redshift data warehouse by migrating to RA3 nodes and how it enabled us to achieve our business expectations. Hopefully you can learn from our experience in case you are considering doing the same.

Status quo before migration

Here at OLX Group, Amazon Redshift has been our choice for data warehouse for over 5 years. We started with a small Amazon Redshift cluster of 7 DC2.8xlarge nodes, and as its popularity and adoption increased inside the OLX Group data community, this cluster naturally grew.

Before migrating to RA3, we were using a cluster of 16 DC2.8xlarge nodes with highly tuned workload management (WLM), and performance wasn’t an issue at all. However, we kept facing challenges with storage demand due to having more users, more data sources, and more prepared data. Almost every day we would get an alert that our disk space was close to 100%, which was about 40 TB worth of data.

Our usual method to solve storage problems used to be to simply increase the number of nodes. Overall, we reached a cluster size of 18 nodes. However, this solution wasn’t cost-efficient enough because we were adding compute capacity to the cluster even though computation power was underutilized. We saw this as a temporary solution, and we mainly did it to buy some time to explore other cost-effective alternatives, such as RA3 nodes.

Amazon Redshift RA3 nodes along with Redshift Managed Storage (RMS) provided separation of storage and compute, enabling us to scale storage and compute separately to better meet our business requirements.

Our data warehouse had the following configuration before the migration:

  • 18 x DC2.8xlarge nodes
  • 250 monthly active users, consistently increasing
  • 10,000 queries per hour, 30 queries in parallel
  • 40 TB of data, consistently increasing
  • 100% disk space utilization

This cluster’s performance was generally good: ETL (extract, transform, and load) and interactive queries barely had any queue time, and 80% of them would finish in under 5 minutes.

Evaluating the performance of Amazon Redshift clusters with RA3 nodes

In this section, we discuss how we conducted a performance evaluation of RA3 nodes with an Amazon Redshift cluster.

Test environment

In order to be confident in the performance of the RA3 nodes, we decided to stress test them in a controlled environment before making the decision to migrate. To assess the nodes and find an optimal RA3 cluster configuration, we collaborated with AllCloud, an AWS Premier Consulting Partner. The following figures illustrate the approach we took to evaluate the performance of RA3.

Test setup

This strategy aims to replicate a realistic workload in different RA3 cluster configurations and compare them with our DC2 configuration. To do this, we required the following:

  • A reference cluster snapshot – This ensures that we can replay any tests starting from the same state.
  • A set of queries from the production cluster – This set can be reconstructed from the Amazon Redshift logs (STL_QUERYTEXT) and enriched by metadata (STL_QUERY); a query sketch for this extraction follows the list. It should be noted that we only took into consideration SELECT and FETCH query types (to simplify this first stage of performance tests). The following chart shows what the profile of our test set looked like.

  • A replay tool to orchestrate all the query operations – AllCloud developed a Python application for us for this purpose.
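
The following is a sketch of how such a query set can be extracted from the system tables (our actual tooling differed in details, and very long statements can exceed the LISTAGG size limit):

-- Reassemble full statements from STL_QUERYTEXT and keep only SELECT and FETCH queries.
-- STL tables retain only a few days of history, so extract shortly after the workload runs.
WITH full_text AS (
    SELECT
        q.query,
        q.starttime,
        q.endtime,
        LISTAGG(t.text, '') WITHIN GROUP (ORDER BY t.sequence) AS query_sql
    FROM stl_query q
    JOIN stl_querytext t ON t.query = q.query
    WHERE q.userid > 1                                -- skip internal system queries
      AND q.starttime >= DATEADD(hour, -24, GETDATE())
    GROUP BY q.query, q.starttime, q.endtime
)
SELECT query, starttime, endtime, query_sql
FROM full_text
WHERE UPPER(LTRIM(query_sql)) LIKE 'SELECT%'
   OR UPPER(LTRIM(query_sql)) LIKE 'FETCH%'
ORDER BY starttime;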

For more details about the approach we used, including the use of the Amazon Redshift Simple Replay utility, refer to Compare different node types for your workload using Amazon Redshift.

Next, we picked which cluster configurations we wanted to test, which RA3 type, and how many nodes. For the specifications of each node type, refer to Amazon Redshift pricing.

First, we decided to test the same DC2 cluster we had in production as a way to validate our test environment, followed by RA3 clusters using RA3.4xlarge nodes with various numbers of nodes. We used RA3.4xlarge because it gives us more flexibility to fine-tune how many nodes we need compared to the RA3.16xlarge instance (1 x RA3.16xlarge node is equivalent to 4 x RA3.4xlarge nodes in terms of CPU and memory). With this in mind, we tested the following cluster configurations and used the replay tool to take measurements of the performance of each cluster.

                    18 x DC2      18 x RA3 (Before    18 x RA3    6 x RA3
                    (Reference)   Classic Resize)
Queries Number      1560          1560                1560        1560
Timeouts            -             25                  66          127
Duration/s  Mean    1.214         1.037               1.167       1.921
            Std.    2.268         2.026               2.525       3.488
            Min.    0.003         0.000               0.002       0.002
            Q 25%   0.005         0.004               0.004       0.004
            Q 50%   0.344         0.163               0.118       0.183
            Q 75%   1.040         0.746               1.076       2.566
            Max.    25.411        15.492              19.770      19.132

These results show how the DC2 cluster compares with the RA3 configurations. For the faster half of queries (up to the 50% quantile), the RA3 clusters ran queries faster than DC2. Regarding the number of RA3 nodes, six nodes were clearly slower, which is particularly noticeable in the 75% quantile of query durations.

We used the following steps to deploy different clusters:

  1. Use 18 x DC2.8xlarge, restored from the original snapshot (18 x DC2.8xlarge).
  2. Take measurements 18 x DC2.
  3. Use 18 x RA3.4xlarge, restored from the original snapshot (18 x DC2.8xlarge).
  4. Take measurements 18 x RA3 (before classic resize).
  5. Use 6 x RA3.4xlarge, classic resize from 18 x RA3.4xlarge.
  6. Take snapshot from 6 x RA3.4xlarge.
  7. Take measurements 6 x RA3.
  8. Use 6 x RA3.4xlarge, restored from 6 x RA3.4xlarge snapshot.
  9. Use 18 x RA3.4xlarge, elastic resize from 6 x RA3.4xlarge.
  10. Take measurements 18 x RA3 (after elastic resize).

Although these are promising results, there were some limitations in the test environment setup. We were concerned that we weren’t stressing the clusters enough: queries ran only in sequence from a single client, and using only SELECT and FETCH query types moved us away from a realistic workload. Therefore, we proceeded to the second stage of our tests.

Concurrency stress test

To stress the clusters, we changed our replay tool to run multiple queries in parallel. Queries extracted from the log files were queued with the same frequency as they were originally run in the reference cluster. Up to 50 clients take queries from the queue and send them to Amazon Redshift. The timing of all queries is recorded for comparison with the reference cluster.

The cluster performance is evaluated by measuring the temporal course of the query concurrency. If a cluster is equally performant as the reference cluster, the concurrency will closely follow the concurrency of the reference cluster. Queries pushed to the query queue are immediately picked up by a client and sent to the cluster. If the cluster isn’t capable of handling the queries as fast as the reference cluster, the number of running concurrent queries will increase when compared to the reference cluster. We also decided to keep concurrency scaling disabled during this test because we wanted to focus on node types instead of cluster features.
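On the measurement side, a point-in-time approximation of concurrency can be taken directly from the cluster. The sketch below is our own suggestion; sampling it at a fixed interval approximates the concurrency curve described above:

-- Number of queries currently running on the cluster.
SELECT COUNT(*) AS running_queries
FROM stv_inflight;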

The following table shows the concurrent queries running on a DC2 and RA3 (both 18 nodes) with two different query test sets (3:00 AM and 1:00 PM). These were selected so we could test both our day and overnight workloads. 3:00 AM is when we have a peak of automated ETL jobs running, and 1:00 PM is when we have high user activity.

The median of running concurrent queries on the RA3 cluster is much higher than the DC2 one. This led us to conclude that a cluster of 18 RA3.4xlarge might not be enough to handle this workload reliably.

Concurrency    18 x DC2.8xlarge      18 x RA3.4xlarge
Starting       3:00 AM    1:00 PM    3:00 AM    1:00 PM
Mean           5          7          10         5
STD            11         13         7          4
25%            1          1          5          2
50%            2          2          8          4
75%            4          4          13         7
Max            50         50         50         27

RA3.16xlarge

Initially, we chose the RA3.4xlarge node type for more granular control in fine-tuning the number of nodes. However, we overlooked one important detail: the same instance type is used for worker and leader nodes. A leader node needs to manage all the parallel processing happening in the cluster, and a single RA3.4xlarge wasn’t enough to do so.

With this in mind, we tested two more cluster configurations, 6 x RA3.16xlarge and 8 x RA3.16xlarge, and once again measured concurrency. This time the results were much better; RA3.16xlarge was able to keep up with the reference concurrency, and the sweet spot seemed to be between 6 and 8 nodes.

Concurrency    18 x DC2.8xlarge      18 x RA3.4xlarge      6 x RA3.16xlarge   8 x RA3.16xlarge
Starting       3:00 AM    1:00 PM    3:00 AM    1:00 PM    3:00 AM            3:00 AM
Mean           5          7          10         5          3                  1
STD            11         13         7          4          4                  1
25%            1          1          5          2          2                  0
50%            2          2          8          4          3                  1
75%            4          4          13         7          4                  2
Max            50         50         50         27         38                 9

Things were looking better and our target configuration was now a 7 x RA3.16xlarge cluster. We were now confident enough to proceed with the migration.

The migration

Regardless of how excited we were to proceed, we still wanted to do a calculated migration. It’s best practice to have a playbook for migrations—a step-by-step guide on what needs to be done, plus a contingency plan that includes a rollback plan. For simplicity, we list only the relevant steps here, in case you’re looking for inspiration.

Migration plan

The migration plan included the following key steps:

  1. Remove the DNS from the current cluster, in our case in Amazon Route 53. No users should be able to query after this.
  2. Check whether any sessions are still running queries, and decide whether to wait for them or stop them. Sessions that are still active at this point strongly indicate that those users connect using the direct cluster URL instead of the DNS name.
    1. To check for running sessions, use SELECT * FROM STV_SESSIONS.
    2. To stop a session, use SELECT PG_TERMINATE_BACKEND(xxxxx);, where xxxxx is the process ID returned by STV_SESSIONS.
  3. Create a snapshot of the DC2 cluster.
  4. Pause the DC2 cluster.
  5. Create an RA3 cluster from the snapshot with the following configuration:
    1. Node type – RA3.16xlarge
    2. Number of nodes – 7
    3. Database name – Same as the DC2
    4. Associated IAM roles – Same as the DC2
    5. VPC – Same as the DC2
    6. VPC security groups – Same as the DC2
    7. Parameter groups – Same as the DC2
  6. Wait for SELECT COUNT(1) FROM STV_UNDERREPPED_BLOCKS to return 0. This is related to the hydration process of the cluster.
  7. Point the DNS to the RA3 cluster.
  8. Users can now query the cluster again.
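The checks referenced in steps 2 and 6 boil down to a few system table queries. The following sketch expands them slightly; the rdsdb filter and the column selection are our own additions:

-- Step 2: sessions still connected to the old cluster
-- (rdsdb is the internal Amazon Redshift maintenance user and can be ignored).
SELECT process, user_name, db_name, starttime
FROM stv_sessions
WHERE user_name <> 'rdsdb';

-- To stop a lingering session, terminate it by its process ID:
-- SELECT PG_TERMINATE_BACKEND(<process>);

-- Step 6: on the new RA3 cluster, repeat until this returns 0.
SELECT COUNT(1) FROM stv_underrepped_blocks;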

Contingency plan

In case the performance of hourly and daily ETL is not acceptable, the contingency plan is triggered:

  1. Add one more node to deal with the unexpected workload.
  2. Increase the limit of concurrency scaling hours.
  3. Reassess the parameter group.

Following this plan, we migrated from DC2 to RA3 nodes in roughly 3.5 hours, from stopping the old cluster to booting the new one and letting our processes fully synchronize. We then proceeded to monitor performance for a couple of hours. Storage capacity was looking great and everything was running smoothly, but we were curious to see how the overnight processes would perform.

The next morning, we woke up to what we dreaded: a slow cluster. We triggered our contingency plan and in the following few days we ended up implementing all three actions we had in the contingency plan.

Adding one extra node didn’t provide much help by itself; however, users did experience good performance during the hours when concurrency scaling was on. The concurrency scaling feature allows Amazon Redshift to temporarily increase cluster capacity whenever the workload requires it. We configured it to allow a maximum of 4 hours per day—1 hour for free and 3 hours paid. We chose this particular value because price-wise it is equivalent to adding one more node (taking us to nine nodes), with the added advantage of only using and paying for it when the workload requires it.

The last action we took was related to the parameter group, in particular, the WLM. As initially stated, we had a manually fine-tuned WLM, but it proved to be inefficient for this new RA3 cluster. Therefore, we decided to try auto WLM with the following configuration.

Manual WLM (before introducing auto WLM)
    Queue 1                      Data Team ETL queue (daily and hourly), admin, monitoring, data quality queries
    Queue 2                      Users queue (for both their ETL and ad hoc queries)

Auto WLM
    Queue 1 (Priority highest)   Daily Data Team ETL queue
    Queue 2 (Priority high)      Admin queries
    Queue 3 (Priority normal)    User queries and hourly Data Team ETL
    Queue 4 (Priority low)       Monitoring, data quality queries

Manual WLM requires you to manually allocate a percentage of resources and define a number of slots per queue. Although this gives you resource segregation, it also means resources are constantly allocated and can go to waste if they’re not used. Auto WLM dynamically sets these variables depending on each queue’s priority and workload. This means that a query in the highest priority queue will get all the resources allocated to it, while lower priority queues will need to wait for available resources. With this in mind, we split our ETL depending on its priority: daily ETL to highest, hourly ETL to normal (to give a fair chance for user queries to compete for resources), and monitoring and data quality to low.
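To confirm that a WLM change of this kind actually helps, queue times can be compared per queue before and after the change. The following sketch is one way to do that (the one-day window is our own choice; with auto WLM, user queues map to service classes 100–107):

-- Average queue and execution time per WLM service class over the last day
-- (times in STL_WLM_QUERY are stored in microseconds).
SELECT
    service_class,
    COUNT(*) AS queries,
    AVG(total_queue_time) / 1000000.0 AS avg_queue_seconds,
    AVG(total_exec_time) / 1000000.0 AS avg_exec_seconds
FROM stl_wlm_query
WHERE queue_start_time > DATEADD(day, -1, GETDATE())
GROUP BY service_class
ORDER BY service_class;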

After applying concurrency scaling and auto WLM, we achieved stable performance for a whole week, and considered the migration a success.

Status quo after migration

Almost a year has passed since we migrated to RA3 nodes, and we couldn’t be more satisfied. Thanks to Redshift Managed Storage (RMS), our disk space issues are a thing of the past, and performance has been generally great compared to our previous DC2 cluster. We are now at 300 monthly active users. Cluster costs did increase due to the new node type and concurrency scaling, but we now feel prepared for the future and don’t expect any cluster resizing anytime soon.

Looking back, we wanted to have a carefully planned and prepared migration, and we were able to learn more about RA3 with our test environment. However, our experience also shows that test environments aren’t always bulletproof, and some details may be overlooked. In the end, these are our main takeaways from the migration to RA3 nodes:

  • Pick the right node type according to your workload. An RA3.16xlarge cluster provides more powerful leader and worker nodes.
  • Use concurrency scaling to provision more resources when the workload demands it. Adding a new node is not always the most cost-efficient solution.
  • Manual WLM requires a lot of adjustments; using auto WLM allows for a better and fairer distribution of cluster resources.

Conclusion

In this post, we covered how OLX Group modernized our Amazon Redshift data warehouse by migrating to RA3 nodes. We detailed how we tested before migration, the migration itself, and the outcome. We are now starting to explore the possibilities provided by the RA3 nodes. In particular, the data sharing capabilities together with Redshift Serverless open the door for exciting architecture setups that we are looking forward to.

If you are going through the same storage issues we used to face with your Amazon Redshift cluster, we highly recommend migrating to RA3 nodes. The RMS feature decouples compute and storage scaling, providing a more cost-efficient solution.

Thanks for reading this post and hopefully you found it useful. If you’re going through the same scenario and have any questions, feel free to reach out.


About the author

Miguel Chin is a Data Engineering Manager at OLX Group, one of the world’s fastest-growing networks of trading platforms. He is responsible for managing a domain-oriented team of data engineers that helps shape the company’s data ecosystem by evangelizing cutting-edge data concepts like data mesh.

David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling business to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.

Synchronize your Salesforce and Snowflake data to speed up your time to insight with Amazon AppFlow

Post Syndicated from Ramesh Ranganathan original https://aws.amazon.com/blogs/big-data/synchronize-your-salesforce-and-snowflake-data-to-speed-up-your-time-to-insight-with-amazon-appflow/

This post was co-written with Amit Shah, Principal Consultant at Atos.

Customers across industries seek meaningful insights from the data captured in their Customer Relationship Management (CRM) systems. To achieve this, they combine their CRM data with a wealth of information already available in their data warehouse, enterprise systems, or other software as a service (SaaS) applications. One widely used approach is getting the CRM data into your data warehouse and keeping it up to date through frequent data synchronization.

Integrating third-party SaaS applications is often complicated and requires significant effort and development. Developers need to understand the application APIs, write implementation and test code, and maintain the code for future API changes. Amazon AppFlow, which is a low-code/no-code AWS service, addresses this challenge.

Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications, like Salesforce, SAP, Zendesk, Slack, and ServiceNow, and AWS services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift in just a few clicks. With Amazon AppFlow, you can run data flows at enterprise scale at the frequency you choose—on a schedule, in response to a business event, or on demand.

In this post, we focus on synchronizing your data from Salesforce to Snowflake (on AWS) without writing code. This post walks you through the steps to set up a data flow to address full and incremental data load using an example use case.

Solution overview

Our use case involves the synchronization of the Account object from Salesforce into Snowflake. In this architecture, you use Amazon AppFlow to filter and transfer the data to your Snowflake data warehouse.

You can configure Amazon AppFlow to run your data ingestion in three different ways:

  • On-demand – You can manually run the flow through the AWS Management Console, API, or SDK call.
  • Event-driven – Amazon AppFlow can subscribe and listen to change data capture (CDC) events from the source SaaS application.
  • Scheduled – Amazon AppFlow can run schedule-triggered flows based on a pre-defined schedule rule. With scheduled flows, you can choose either full or incremental data transfer:
    • With full transfer, Amazon AppFlow transfers a snapshot of all records at the time of the flow run from the source to the destination.
    • With incremental transfer, Amazon AppFlow transfers only the records that have been added or changed since the last successful flow run. To determine the incremental delta of your data, Amazon AppFlow requires you to specify a source timestamp field that it uses to identify new or updated records.

We use the on-demand trigger for the initial load of data from Salesforce to Snowflake, because it pulls all the records, irrespective of when they were created. To then synchronize data periodically with Snowflake, after we run the on-demand trigger, we configure a scheduled trigger with incremental transfer. With this approach, Amazon AppFlow pulls records based on a chosen timestamp field from the Salesforce Account object periodically, at the time interval specified in the flow.

The Account_Staging table is created in Snowflake to act as a temporary storage that can be used to identify the data change events. Then the permanent table (Account) is updated from the staging table by running a SQL stored procedure that contains the incremental update logic. The following figure depicts the various components of the architecture and the data flow from the source to the target.

The data flow contains the following steps:

  1. First, the flow is run with on-demand and full transfer mode to load the full data into Snowflake.
  2. The Amazon AppFlow Salesforce connector pulls the data from Salesforce and stores it in the Account Data S3 bucket in CSV format.
  3. The Amazon AppFlow Snowflake connector loads the data into the Account_Staging table.
  4. A scheduled task, running at regular intervals in Snowflake, triggers a stored procedure.
  5. The stored procedure starts an atomic transaction that loads the data into the Account table and then deletes the data from the Account_Staging table.
  6. After the initial data is loaded, you update the flow to capture incremental updates from Salesforce. The flow trigger configuration is changed to scheduled, to capture data changes in Salesforce. This enables Snowflake to get all updates, deletes, and inserts in Salesforce at configured intervals.
  7. The flow uses the configured LastModifiedDate field to determine incremental changes.
  8. Steps 3, 4, and 5 are run again to load the incremental updates into the Snowflake Accounts table.

Prerequisites

To get started, you need the following prerequisites:

  • A Salesforce user account with sufficient privileges to install connected apps. Amazon AppFlow uses a connected app to communicate with Salesforce APIs. If you don’t have a Salesforce account, you can sign up for a developer account.
  • A Snowflake account with sufficient permissions to create and configure the integration, external stage, table, stored procedures, and tasks.
  • An AWS account with access to AWS Identity and Access Management (IAM), Amazon AppFlow, and Amazon S3.

Set up Snowflake configuration and Amazon S3 data

Complete the following steps to configure Snowflake and set up your data in Amazon S3:

  1. Create two S3 buckets in your AWS account: one for holding the data coming from Salesforce, and another for holding error records.

A best practice when creating your S3 bucket is to make sure you block public access to the bucket to ensure your data is not accessible by unauthorized users.

  2. Create an IAM policy named snowflake-access that allows listing the bucket contents and reading S3 objects inside the bucket.

Follow the instructions for steps 1 and 2 in Configuring a Snowflake Storage Integration to Access Amazon S3 to create an IAM policy and role. Replace the placeholders with your S3 bucket names.

  3. Log in to your Snowflake account and create a new warehouse called SALESFORCE and database called SALESTEST.
  4. Specify the format in which data will be available in Amazon S3 for Snowflake to load (for this post, CSV):
USE DATABASE SALESTEST;
CREATE or REPLACE file format my_csv_format
type = csv
field_delimiter = ','
skip_header = 1
null_if = ('NULL', 'null')
empty_field_as_null = true
compression = gzip;
  5. Amazon AppFlow uses the Snowflake COPY command to move data using an S3 bucket. To configure this integration, follow steps 3–6 in Configuring a Snowflake Storage Integration to Access Amazon S3.

These steps create a storage integration with your S3 bucket, update IAM roles with Snowflake account and user details, and create an external stage.

This completes the storage integration setup in Snowflake. In the next section, you create the required objects in Snowflake.

Create schemas and procedures in Snowflake

In your Snowflake account, complete the following steps to create the tables, stored procedures, and tasks for implementing the use case:

  1. In your Snowflake account, open a worksheet and run the following DDL scripts to create the Account and Account_staging tables:
CREATE or REPLACE TABLE ACCOUNT_STAGING (
ACCOUNT_NUMBER STRING NOT NULL,
ACCOUNT_NAME STRING,
ACCOUNT_TYPE STRING,
ANNUAL_REVENUE NUMBER,
ACTIVE BOOLEAN NOT NULL,
DELETED BOOLEAN,
LAST_MODIFIED_DATE STRING,
primary key (ACCOUNT_NUMBER)
);

CREATE or REPLACE TABLE ACCOUNT (
ACCOUNT_NUMBER STRING NOT NULL,
ACCOUNT_NAME STRING,
ACCOUNT_TYPE STRING,
ANNUAL_REVENUE NUMBER,
ACTIVE BOOLEAN NOT NULL,
LAST_MODIFIED_DATE STRING,
primary key (ACCOUNT_NUMBER)
);
  2. Create a stored procedure in Snowflake to load data from staging to the Account table:
CREATE or REPLACE procedure sp_account_load( )
returns varchar not null
language sql
as
$$
begin
Begin transaction;
merge into ACCOUNT using ACCOUNT_STAGING
on ACCOUNT.ACCOUNT_NUMBER = ACCOUNT_STAGING.ACCOUNT_NUMBER
when matched AND ACCOUNT_STAGING.DELETED=TRUE then delete
when matched then UPDATE SET
ACCOUNT.ACCOUNT_NAME = ACCOUNT_STAGING.ACCOUNT_NAME,
ACCOUNT.ACCOUNT_TYPE = ACCOUNT_STAGING.ACCOUNT_TYPE,
ACCOUNT.ANNUAL_REVENUE = ACCOUNT_STAGING.ANNUAL_REVENUE,
ACCOUNT.ACTIVE = ACCOUNT_STAGING.ACTIVE,
ACCOUNT.LAST_MODIFIED_DATE = ACCOUNT_STAGING.LAST_MODIFIED_DATE
when NOT matched then
INSERT (
ACCOUNT.ACCOUNT_NUMBER,
ACCOUNT.ACCOUNT_NAME,
ACCOUNT.ACCOUNT_TYPE,
ACCOUNT.ANNUAL_REVENUE,
ACCOUNT.ACTIVE,
ACCOUNT.LAST_MODIFIED_DATE
)
values(
ACCOUNT_STAGING.ACCOUNT_NUMBER,
ACCOUNT_STAGING.ACCOUNT_NAME,
ACCOUNT_STAGING.ACCOUNT_TYPE,
ACCOUNT_STAGING.ANNUAL_REVENUE,
ACCOUNT_STAGING.ACTIVE,
ACCOUNT_STAGING.LAST_MODIFIED_DATE
) ;

Delete from ACCOUNT_STAGING;
Commit;
end;
$$
;

This stored procedure determines whether the data contains new records that need to be inserted or existing records that need to be updated or deleted. After a successful run, the stored procedure clears any data from your staging table.

  3. Create a task in Snowflake to trigger the stored procedure. Make sure that the time interval for this task is more than the time interval configured in Amazon AppFlow for pulling the incremental changes from Salesforce. The time interval should be sufficient for data to be processed.
CREATE OR REPLACE TASK TASK_ACCOUNT_LOAD
WAREHOUSE = SALESFORCE
SCHEDULE = 'USING CRON 5 * * * * America/Los_Angeles'
AS
call sp_account_load();
  4. Provide the required permissions to run the task and resume it:
show tasks;
  • A task is created in a suspended state, so you need to resume it manually the first time:
ALTER TASK TASK_ACCOUNT_LOAD RESUME;
  • If the role you’re using doesn’t have sufficient privileges to resume or run the task, grant the EXECUTE TASK privilege to that role:
GRANT EXECUTE TASK, EXECUTE MANAGED TASK ON ACCOUNT TO ROLE SYSADMIN;

This completes the Snowflake part of configuration and setup.
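Optionally, you can sanity-check the objects you just created before moving on. The following statements are our own additions rather than a required part of the setup; they run the merge procedure once (a no-op while the staging table is empty) and confirm the task exists:

-- Run the merge procedure manually once and confirm the task was created and resumed.
CALL sp_account_load();
SHOW TASKS LIKE 'TASK_ACCOUNT_LOAD';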

Create a Salesforce connection

First, let’s create a Salesforce connection that can be used by AppFlow to authenticate and pull records from your Salesforce instance. On the AWS console, make sure you are in the same Region where your Snowflake instance is running.

  1. On the Amazon AppFlow console, choose Connections in the navigation pane.
  2. From the list of connectors, select Salesforce.
  3. Choose Create connection.
  4. For Connection name, enter a name of your choice (for example, Salesforce-blog).
  5. Leave the rest of the fields as default and choose Continue.
  6. You’re redirected to a sign-in page, where you need to log in to your Salesforce instance.
  7. After you allow Amazon AppFlow access to your Salesforce account, your connection is successfully created.

Create a Snowflake connection

Complete the following steps to create your Snowflake connection:

  1. On the Connections menu, choose Snowflake.
  2. Choose Create connection.
  3. Provide information for the Warehouse, Stage name, and Bucket details fields.
  4. Enter your credential details.

  5. For Region, choose the same Region where Snowflake is running.
  6. For Connection name, name your connection Snowflake-blog.
  7. Leave the rest of the fields as default and choose Connect.

Create a flow in Amazon AppFlow

Now you create a flow in Amazon AppFlow to load the data from Salesforce to Snowflake. Complete the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. Choose Create flow.
  3. On the Specify flow details page, enter a name for the flow (for example, AccountData-SalesforceToSnowflake).
  4. Optionally, provide a description for the flow and tags.
  5. Choose Next.

  6. On the Configure flow page, for Source name, choose Salesforce.
  7. Choose the Salesforce connection we created in the previous step (Salesforce-blog).
  8. For Choose Salesforce object, choose Account.
  9. For Destination name, choose Snowflake.
  10. Choose the newly created Snowflake connection.
  11. For Choose Snowflake object, choose the staging table you created earlier (SALESTEST.PUBLIC.ACCOUNT_STAGING).

  12. In the Error handling section, provide your error S3 bucket.
  13. For Choose how to trigger the flow, select Run on demand.
  14. Choose Next.

  15. Select Manually map fields to map the fields between your source and destination.
  16. Choose the fields Account Number, Account Name, Account Type, Annual Revenue, Active, Deleted, and Last Modified Date.

  17. Map each source field to its corresponding destination field.
  18. Under Additional settings, leave Import deleted records unchecked (default setting).

  19. In the Validations section, add validations for the data you’re pulling from Salesforce.

Because the schema for the Account_Staging table in Snowflake database has a NOT NULL constraint for the fields Account_Number and Active, records containing a null value for these fields should be ignored.

  20. Choose Add Validation to configure validations for these fields.
  21. Choose Next.

  22. Leave everything else as default, proceed to the final page, and choose Create Flow.
  23. After the flow is created, choose Run flow.

When the flow run completes successfully, it will bring all records into your Snowflake staging table.

Verify data in Snowflake

The data will be loaded into the Account_staging table. To verify that data is loaded in Snowflake, complete the following steps:

  1. Validate the number of records by querying the ACCOUNT_STAGING table in Snowflake (see the example queries after this list).
  2. Wait for your Snowflake task to run based on the configured schedule.
  3. Verify that all the data is transferred to the ACCOUNT table and the ACCOUNT_STAGING table is truncated.
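The checks in steps 1 and 3 boil down to simple counts, for example (using the table names created earlier):

-- Step 1: records landed by the flow in the staging table.
SELECT COUNT(*) FROM ACCOUNT_STAGING;

-- Step 3: after the task has run, the data should be in ACCOUNT
-- and ACCOUNT_STAGING should be empty.
SELECT COUNT(*) FROM ACCOUNT;
SELECT COUNT(*) FROM ACCOUNT_STAGING;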

Configure an incremental data load from Salesforce

Now let’s configure an incremental data load from Salesforce:

  1. On the Amazon AppFlow console, select your flow, and choose Edit.
  2. Go to the Edit configuration step and change to Run flow on schedule.
  3. Set the flow to run every 5 minutes, and provide a start date of Today, with a start time in the future.
  4. Choose Incremental transfer and choose the LastModifiedDate field.
  5. Choose Next.
  6. In the Additional settings section, select Import deleted records.

This ensures that deleted records from the source are also ingested.

  7. Choose Save and then choose Activate flow.

Now your flow is configured to capture all incremental changes.

Test the solution

Log in to your Salesforce account, and edit any record in the Account object.

Within 5 minutes or less, a scheduled flow will pick up your change and write the changed record into your Snowflake staging table and trigger the synchronization process.
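One quick way to confirm the change made it end to end (our own check, not part of the original walkthrough) is to look at the most recently modified rows in the target table. Note that LAST_MODIFIED_DATE is stored as a string, so this ordering relies on Salesforce returning ISO-8601 timestamps:

-- Most recently modified accounts in the target table.
SELECT ACCOUNT_NUMBER, ACCOUNT_NAME, LAST_MODIFIED_DATE
FROM ACCOUNT
ORDER BY LAST_MODIFIED_DATE DESC
LIMIT 5;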

You can see the details of the run, including number of records transferred, on the Run History tab of your flow.

Clean up

Clean up the resources in your AWS account by completing the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. From the list of flows, select the flow AccountData-SalesforceToSnowflake and delete it.
  3. Enter delete to delete the flow.
  4. Choose Connections in the navigation pane.
  5. Choose Salesforce from the list of connectors, select Salesforce-blog, and delete it.
  6. Enter delete to delete the connector.
  7. On the Connections page, choose Snowflake from the list of connectors, select Snowflake-blog, and delete it.
  8. Enter delete to delete the connector.
  9. On the IAM console, choose Roles in the navigation pane, then select the role you created for Snowflake and delete it.
  10. Choose Policies in the navigation pane, select the policy you created for Snowflake, and delete it.
  11. On the Amazon S3 console, search for the data bucket you created, choose Empty to delete the objects, then delete the bucket.
  12. Search for the error bucket you created, choose Empty to delete the objects, then delete the bucket.
  13. Clean up resources in your Snowflake account:
  • Delete the task TASK_ACCOUNT_LOAD:
ALTER TASK TASK_ACCOUNT_LOAD SUSPEND;
DROP TASK TASK_ACCOUNT_LOAD;
  • Delete the stored procedure sp_account_load:
DROP procedure sp_account_load();
  • Delete the tables ACCOUNT_STAGING and ACCOUNT:
DROP TABLE ACCOUNT_STAGING;
DROP TABLE ACCOUNT;

Conclusion

In this post, we walked you through how to integrate and synchronize your data from Salesforce to Snowflake using Amazon AppFlow. This demonstrates how you can set up your ETL jobs without having to learn new programming languages by using Amazon AppFlow and your familiar SQL language. This is a proof of concept, but you can try to handle edge cases like failure of Snowflake tasks or understand how incremental transfer works by making multiple changes to a Salesforce record within the scheduled time interval.

For more information on Amazon AppFlow, visit Amazon AppFlow.


About the authors

Ramesh Ranganathan is a Senior Partner Solution Architect at AWS. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, application modernization and cloud native development. He is passionate about technology and enjoys experimenting with AWS Serverless services.

Kamen Sharlandjiev is an Analytics Specialist Solutions Architect and Amazon AppFlow expert. He’s on a mission to make life easier for customers who are facing complex data integration challenges. His secret weapon? Fully managed, low-code AWS services that can get the job done with minimal effort and no coding.

Amit Shah is an expert in cloud-based modern data architecture and currently leads the AWS Data Analytics practice at Atos. Based in Pune, India, he has 20+ years of experience in data strategy, architecture, design, and development. He is on a mission to help organizations become data-driven.

­­Use fuzzy string matching to approximate duplicate records in Amazon Redshift

Post Syndicated from Sean Beath original https://aws.amazon.com/blogs/big-data/use-fuzzy-string-matching-to-approximate-duplicate-records-in-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to run complex SQL analytics at scale and performance on terabytes to petabytes of structured and unstructured data, and make the insights widely available through popular business intelligence (BI) and analytics tools.

It’s common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source will have its own processes of creating and maintaining data, which can lead to data quality challenges within and across sources.

One challenge you may face when performing analytics is the presence of imperfect duplicate records within the source data. Answering questions as simple as “How many unique customers do we have?” can be very challenging when the data you have available is like the following table.

Name           Address                         Date of Birth
Cody Johnson   8 Jeffery Brace, St. Lisatown   1/3/1956
Cody Jonson    8 Jeffery Brace, St. Lisatown   1/3/1956

Although humans can identify that Cody Johnson and Cody Jonson are most likely the same person, it can be difficult to distinguish this using analytics tools. This identification of duplicate records also becomes nearly impossible when working on large datasets across multiple sources.

This post presents one possible approach to addressing this challenge in an Amazon Redshift data warehouse. We import an open-source fuzzy matching Python library to Amazon Redshift, create a simple fuzzy matching user-defined function (UDF), and then create a procedure that weights multiple columns in a table to find matches based on user input. This approach allows you to use the created procedure to approximately identify your unique customers, improving the accuracy of your analytics.

This approach doesn’t solve for data quality issues in source systems, and doesn’t remove the need for a holistic data quality strategy. For addressing data quality challenges in Amazon Simple Storage Service (Amazon S3) data lakes and data pipelines, AWS has announced AWS Glue Data Quality (preview). You can also use AWS Glue DataBrew, a visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics.

Prerequisites

To complete the steps in this post, you need the following:

The following AWS CloudFormation stack will deploy a new Redshift Serverless endpoint and an S3 bucket for use in this post.

All SQL commands shown in this post are available in the following notebook, which can be imported into the Amazon Redshift Query Editor V2.

Overview of the dataset being used

The dataset we use mimics a source that holds customer information. This source has a manual process of inserting and updating customer data, which has led to multiple instances of the same customer being represented by duplicate records.

The following examples show some of the data quality issues in the dataset being used.

In this first example, all three customers are the same person but have slight differences in the spelling of their names.

id    name           age   address_line1     city           postcode   state
1     Cody Johnson   80    8 Jeffrey Brace   St. Lisatown   2636       South Australia
101   Cody Jonson    80    8 Jeffrey Brace   St. Lisatown   2636       South Australia
121   Kody Johnson   80    8 Jeffrey Brace   St. Lisatown   2636       South Australia

In this next example, the two customers are the same person with slightly different addresses.

id    name            age   address_line1          city         postcode   state
7     Angela Watson   59    3/752 Bernard Follow   Janiceberg   2995       Australian Capital Territory
107   Angela Watson   59    752 Bernard Follow     Janiceberg   2995       Australian Capital Territory

In this example, the two customers are different people with the same address. This simulates multiple different customers living at the same address who should still be recognized as different people.

id    name           age   address_line1    city               postcode   state
6     Michael Hunt   69    8 Santana Rest   St. Jessicamouth   2964       Queensland
106   Sarah Hunt     69    8 Santana Rest   St. Jessicamouth   2964       Queensland

Load the dataset

First, create a new table in your Redshift Serverless endpoint and copy the test data into it by doing the following:

  1. Open the Query Editor V2 and log in using the admin user name and details defined when the endpoint was created.
  2. Run the following CREATE TABLE statement:
    create table customer (
        id smallint, 
        urid smallint,
        name varchar(100),
        age smallint,
        address_line1 varchar(200),
        city varchar(100),
        postcode smallint,
        state varchar(100)
    )
    ;

    Screenshot of CREATE TABLE statement for customer table being run successfully in Query Editor V2

  3. Run the following COPY command to copy data into the newly created table:
    copy customer (id, name, age, address_line1, city, postcode, state)
    from 's3://redshift-blogs/fuzzy-string-matching/customer_data.csv'
    IAM_ROLE default
    FORMAT csv
    REGION 'us-east-1'
    IGNOREHEADER 1
    ;

  4. Confirm the COPY succeeded and there are 110 records in the table by running the following query:
    select count(*) from customer;

    Screenshot showing the count of records in the customer table is 110. Query is run in Query Editor V2

Fuzzy matching

Fuzzy string matching, more formally known as approximate string matching, is the technique of finding strings that match a pattern approximately rather than exactly. Commonly (and in this solution), the Levenshtein distance is used to measure the distance between two strings, and therefore their similarity. The smaller the Levenshtein distance between two strings, the more similar they are. For example, the Levenshtein distance between Johnson and Jonson is 1, because deleting a single character (the h) turns one string into the other.

In this solution, we exploit this property of the Levenshtein distance to estimate if two customers are the same person based on multiple attributes of the customer, and it can be expanded to suit many different use cases.

This solution uses TheFuzz, an open-source Python library that implements the Levenshtein distance in a few different ways. We use the partial_ratio function to compare two strings and provide a result between 0–100. If one of the strings matches perfectly with a portion of the other, the partial_ratio function returns 100.

Weighted fuzzy matching

By adding a scaling factor to each of our column fuzzy matches, we can create a weighted fuzzy match for a record. This is especially useful in two scenarios:

  • We have more confidence in some columns of our data than others, and therefore want to prioritize their similarity results.
  • One column is much longer than the others. A single character difference in a long string will have much less impact on the Levenshtein distance than a single character difference in a short string. Therefore, we want to prioritize long string matches over short string matches.

The solution in this post applies weighted fuzzy matching based on user input defined in another table.

Create a table for weight information

This reference table holds two columns; the table name and the column mapping with weights. The column mapping is held in a SUPER datatype, which allows JSON semistructured data to be inserted and queried directly in Amazon Redshift. For examples on how to query semistructured data in Amazon Redshift, refer to Querying semistructured data.

In this example, we apply the largest weight to the column address_line1 (0.5) and the smallest weight to the city and postcode columns (0.1).

Using the Query Editor V2, create a new table in your Redshift Serverless endpoint and insert a record by doing the following:

  1. Run the following CREATE TABLE statement:
    CREATE TABLE ref_unique_record_weight_map(table_name varchar(100), column_mapping SUPER);

  2. Run the following INSERT statement:
    INSERT INTO ref_unique_record_weight_map VALUES (
        'customer',
        JSON_PARSE('{
        "colmap":[
        {
            "colname": "name",
            "colweight": 0.3
        },
        {
            "colname": "address_line1",
            "colweight": 0.5
        },
        {
            "colname": "city",
            "colweight": 0.1
        },
        {
            "colname": "postcode",
            "colweight": 0.1
        }
        ]
    }')
    );

  3. Confirm the mapping data has inserted into the table correctly by running the following query:
    select * from ref_unique_record_weight_map;

    Screenshot showing the result of querying the ref_unique_record_weight_map table in Query Editor V2

  4. To check all weights for the customer table add up to 1 (100%), run the following query:
    select  cm.table_name, 
            sum(colmap.colweight) as total_column_weight 
    from    ref_unique_record_weight_map cm, cm.column_mapping.colmap colmap 
    where   cm.table_name = 'customer'
    group by cm.table_name;

    Screenshot showing the total weight applied to the customer table is 1.0

User-defined functions

With Amazon Redshift, you can create custom scalar user-defined functions (UDFs) using a Python program. A Python UDF incorporates a Python program that runs when the function is called and returns a single value. In addition to using the standard Python functionality, you can import your own custom Python modules, such as the module described earlier (TheFuzz).

In this solution, we create a Python UDF to take two input values and compare their similarity.

Import external Python libraries to Amazon Redshift

Run the following code snippet to import the TheFuzz module into Amazon Redshift as a new library. This makes the library available within Python UDFs in the Redshift Serverless endpoint. Make sure to provide the name of the S3 bucket you created earlier.

CREATE OR REPLACE LIBRARY thefuzz LANGUAGE plpythonu 
FROM 's3://<your-bucket>/thefuzz.zip' 
IAM_ROLE default;

Create a Python user-defined function

Run the following code snippet to create a new Python UDF called unique_record. This UDF will do the following:

  1. Take two input values that can be of any data type as long as they are the same data type (such as two integers or two varchars).
  2. Import the newly created thefuzz Python library.
  3. Return an integer value comparing the partial ratio between the two input values.
CREATE OR REPLACE FUNCTION unique_record(value_a ANYELEMENT, value_b ANYELEMENT) 
RETURNS INTEGER IMMUTABLE
AS
$$
    from thefuzz import fuzz

    return fuzz.partial_ratio(value_a, value_b)
$$ LANGUAGE plpythonu;

You can test the function by running the following code snippet:

select unique_record('Cody Johnson'::varchar, 'Cody Jonson'::varchar)

The result shows that these two strings have a similarity value of 91%.

Screenshot showing that using the created function on the Cody Johnson/Cody Jonson name example provides a response of 91

Now that the Python UDF has been created, you can test the response of different input values.
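To see how the weighted matching described earlier comes together, you can also combine the UDF with the example weights by hand. The following query is an illustration of what the stored procedure in the next section builds dynamically; the IDs and the hard-coded weights are taken from the examples above:

-- Weighted similarity between customer 1 (Cody Johnson) and customer 101 (Cody Jonson),
-- using the weights defined in ref_unique_record_weight_map.
SELECT 0.3 * unique_record(t1.name::varchar, t2.name::varchar)
     + 0.5 * unique_record(t1.address_line1::varchar, t2.address_line1::varchar)
     + 0.1 * unique_record(t1.city::varchar, t2.city::varchar)
     + 0.1 * unique_record(t1.postcode::varchar, t2.postcode::varchar)
       AS weighted_similarity
FROM customer t1, customer t2
WHERE t1.id = 1
  AND t2.id = 101;

With the sample data, this evaluates to roughly 97, comfortably above the 90% threshold the stored procedure uses.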

Alternatively, you can follow the amazon-redshift-udfs GitHub repo to install the f_fuzzy_string_match Python UDF.

Stored procedures

Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and business-specific logic. By combining multiple SQL steps into a stored procedure, you can reduce round trips between your applications and the database.

In this solution, we create a stored procedure that applies weighting to multiple columns. Because this logic is common and repeatable regardless of the source table or data, it allows us to create the stored procedure once and use it for multiple purposes.

Create a stored procedure

Run the following code snippet to create a new Amazon Redshift stored procedure called find_unique_id. This procedure will do the following:

  1. Take one input value. This value is the table you would like to create a golden record for (in our case, the customer table).
  2. Declare a set of variables to be used throughout the procedure.
  3. Check whether weight data for the input table exists in the mapping table created in previous steps.
  4. Build a query string for comparing each column and applying weights using the weight data inserted in previous steps.
  5. For each record in the input table that doesn’t have a unique record ID (URID) yet, it will do the following:
    1. Create a temporary table to stage results. This temporary table will have all potential URIDs from the input table.
    2. Allocate a similarity value to each URID. This value specifies how similar this URID is to the record in question, weighted with the inputs defined.
    3. Choose the closest matched URID, but only if there is a >90% match.
    4. If there is no URID match, create a new URID.
    5. Update the source table with the new URID and move to the next record.

This procedure will only ever look for new URIDs for records that don’t already have one allocated. Therefore, rerunning the URID procedure multiple times will have no impact on the results.

CREATE OR REPLACE PROCEDURE find_unique_id(table_name varchar(100)) AS $$
DECLARE
    unique_record RECORD;
    column_info RECORD;

    column_fuzzy_comparison_string varchar(MAX) := '0.0';
    max_simularity_value decimal(5,2) := 0.0;

    table_check varchar(100);
    temp_column_name varchar(100);
    temp_column_weight decimal(5,2);
    unique_record_id smallint := 0;
BEGIN
    /* 
        Check the ref_unique_record_weight_map table to see if there is a mapping record for the provided table.
        If there is no table, raise an exception
    */
    SELECT INTO table_check cm.table_name from ref_unique_record_weight_map cm where cm.table_name = quote_ident(table_name);
    IF NOT FOUND THEN
        RAISE EXCEPTION 'Input table ''%'' not found in mapping object', table_name;
        RETURN;
    END IF;

    /*
        Build query to be used to compare each column using the mapping record in the ref_unique_record_weight_map table.
        For each column specified in the mapping object, append a weighted comparison of the column
    */
    FOR column_info IN (
        select  colmap.colname::varchar(100) column_name, 
                colmap.colweight column_weight 
        from    ref_unique_record_weight_map cm, cm.column_mapping.colmap colmap 
        where   cm.table_name = quote_ident(table_name)
    ) LOOP
        temp_column_name = column_info.column_name;
        temp_column_weight = column_info.column_weight;
        
        column_fuzzy_comparison_string = column_fuzzy_comparison_string || 
            ' + unique_record(t1.' || 
            temp_column_name || 
            '::varchar, t2.' || 
            temp_column_name || 
            '::varchar)*' || 
            temp_column_weight;
    END LOOP;

    /* Drop temporary table if it exists */
    EXECUTE 'DROP TABLE IF EXISTS #unique_record_table';

    /*
        For each record in the source table that does not have a Unique Record ID (URID):
            1. Create a new temporary table holding all possible URIDs for this record (i.e. all URIDs that are already present).
                Note: This temporary table is only present while the similarity check is being calculated
            2. Update each possible URID in the temporary table with its similarity to the record being checked
            3. Find the most similar record with a URID
                3a. If the most similar record is at least 90% similar, take its URID (i.e. this is not a unique record, and it matches another in the table)
                3b. If there is no record that is at least 90% similar, create a new URID (i.e. this is a unique record)
            4. Drop the temporary table in preparation for the next record
    */
    FOR unique_record in EXECUTE 'select * from ' || table_name || ' where urid is null order by id asc' LOOP

        RAISE INFO 'test 1';

        /* Create temporary table */
        EXECUTE '
            CREATE TABLE #unique_record_table AS 
            SELECT id, urid, 0.0::decimal(5,2) as simularity_value 
            FROM ' || table_name || '
            where urid is not null
            ';

        /* Update simularity values in temporary table */
        EXECUTE '
            UPDATE #unique_record_table  
            SET simularity_value = round(calc_simularity_value,2)::decimal(5,2)
            FROM (
                SELECT ' || column_fuzzy_comparison_string || ' as calc_simularity_value,
                        t2.id as upd_id
                FROM ' || table_name || ' t1
                INNER JOIN ' || table_name || ' t2
                ON t1.id <> t2.id
                AND t1.id = ' || quote_literal(unique_record.id) || '
                ) t
            WHERE t.upd_id = id
            ';

        /* Find largest simularity value */
        SELECT INTO max_simularity_value simularity_value FROM (
            SELECT  MAX(simularity_value) as simularity_value 
            FROM    #unique_record_table
        );

        /* If there is a >90% similar match, choose its URID. Otherwise, create a new URID */
        IF max_simularity_value > 90 THEN
            SELECT INTO unique_record_id urid FROM (
                SELECT urid
                FROM #unique_record_table
                WHERE simularity_value = max_simularity_value
            );
        ELSE 
            EXECUTE 'select COALESCE(MAX(urid)+1,1) FROM ' || table_name INTO unique_record_id;
        END IF;
        
        /* Update table with new URID value */
        EXECUTE 'UPDATE ' || table_name || ' SET urid = ' || quote_literal(unique_record_id) || ' WHERE id = ' || quote_literal(unique_record.id);

        /* Drop temporary table and repeat process */
        EXECUTE 'DROP TABLE #unique_record_table';

        max_simularity_value = 0.0;
    END LOOP;

END;
$$ LANGUAGE plpgsql;

Now that the stored procedure has been created, create the unique record IDs for the customer table by running the following in the Query Editor V2. This will update the urid column of the customer table.

CALL find_unique_id('customer'); 
select * from customer;

Screenshot showing the customer table now has values inserted in the URID column

When the procedure has completed its run, you can identify what duplicate customers were given unique IDs by running the following query:

select * 
from customer
where urid in (
    select urid 
    from customer 
    group by urid 
    having count(*) > 1
    )
order by urid asc
;

Screenshot showing the records that have been identified as duplicate records

From this you can see that IDs 1, 101, and 121 have all been given the same URID, as have IDs 7 and 107.

Screenshot showing the result for IDs 1, 101, and 121

Screenshot showing the result for IDs 7, and 107

The procedure has also correctly identified that IDs 6 and 106 are different customers, and they therefore don’t have the same URID.

Screenshot showing the result for IDs 6, and 106
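With URIDs assigned, the question posed at the start of this post (how many unique customers do we have?) reduces to a simple aggregate, for example:

-- Approximate number of unique customers, counting each unique record ID once.
SELECT COUNT(DISTINCT urid) AS unique_customers
FROM customer;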

Clean up

To avoid incurring future recurring charges, delete all files in the S3 bucket you created. After you delete the files, go to the AWS CloudFormation console and delete the stack deployed in this post. This will delete all created resources.

Conclusion

In this post, we showed one approach to identifying imperfect duplicate records by applying a fuzzy matching algorithm in Amazon Redshift. This solution allows you to identify data quality issues and apply more accurate analytics to your dataset residing in Amazon Redshift.

We showed how you can use open-source Python libraries to create Python UDFs, and how to create a generic stored procedure to identify imperfect matches. This solution is extendable to provide any functionality required, including adding it as a regular process in your ELT (extract, load, and transform) workloads.

Test the created procedure on your datasets to investigate the presence of any imperfect duplicates, and use the knowledge learned throughout this post to create stored procedures and UDFs to implement further functionality.

If you’re new to Amazon Redshift, refer to Getting started with Amazon Redshift for more information and tutorials on Amazon Redshift. You can also refer to the video Get started with Amazon Redshift Serverless for information on starting with Redshift Serverless.


About the Author

Sean Beath is an Analytics Solutions Architect at Amazon Web Services. He has experience in the full delivery lifecycle of data platform modernisation using AWS services and works with customers to help drive analytics value on AWS.

Build a serverless analytics application with Amazon Redshift and Amazon API Gateway

Post Syndicated from David Zhang original https://aws.amazon.com/blogs/big-data/build-a-serverless-analytics-application-with-amazon-redshift-and-amazon-api-gateway/

Serverless applications are a modernized way to perform analytics among business departments and engineering teams. Business teams can gain meaningful insights by simplifying their reporting through web applications and distributing it to a broader audience.

Use cases can include the following:

  • Dashboarding – A webpage consisting of tables and charts where each component can offer insights to a specific business department.
  • Reporting and analysis – An application where you can trigger large analytical queries with dynamic inputs and then view or download the results.
  • Management systems – An application that provides a holistic view of the internal company resources and systems.
  • ETL workflows – A webpage where internal company individuals can trigger specific extract, transform, and load (ETL) workloads in a user-friendly environment with dynamic inputs.
  • Data abstraction – Decouple and refactor underlying data structure and infrastructure.
  • Ease of use – An application where you want to give a large set of users controlled access to analytics without having to onboard each user to a technical platform. Query updates can be completed in an organized manner and maintenance has minimal overhead.

In this post, you will learn how to build a serverless analytics application using Amazon Redshift Data API and Amazon API Gateway WebSocket and REST APIs.

Amazon Redshift is fully managed by AWS, so you no longer need to worry about data warehouse management tasks such as hardware provisioning, software patching, setup, configuration, monitoring nodes and drives to recover from failures, or backups. The Data API simplifies access to Amazon Redshift because you don’t need to configure drivers and manage database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later.

API Gateway is a fully managed service that makes it easy for developers to publish, maintain, monitor, and secure APIs at any scale. With API Gateway, you can create RESTful APIs and WebSocket APIs that enable real-time two-way communication applications. API Gateway supports containerized and serverless workloads, as well as web applications. API Gateway acts as a reverse proxy to many of the compute resources that AWS offers.

Event-driven model

Event-driven applications are increasingly popular among customers. Analytical reporting web applications can be implemented through an event-driven model. The applications run in response to events such as user actions and unpredictable query events. Decoupling the producer and consumer processes allows greater flexibility in application design and building decoupled processes. This design can be achieved with the Data API and API Gateway WebSocket and REST APIs.

Both REST API calls and WebSocket establish communication between the client and the backend. Due to the popularity of REST, you may wonder why WebSockets are present and how they contribute to an event-driven design.

What are WebSockets and why do we need them?

Unidirectional communication is customary when building analytical web solutions. In traditional environments, the client initiates a REST API call to run a query on the backend and either synchronously or asynchronously waits for the query to complete. The “wait” aspect is typically implemented through polling. Polling in this context means the client doesn’t know when the backend process will complete, so it repeatedly sends requests to the backend to check.

What is the problem with polling? Main challenges include the following:

  • Increased network traffic – A large number of users performing empty checks puts load on your backend resources and doesn’t scale well.
  • Unnecessary cost – Empty requests don’t deliver any value to the business, but you still pay for the resources they consume.
  • Delayed response – Polling happens at fixed time intervals. If the query completes between two intervals, the user only sees the results after the next check. This delay impacts the user experience and, in some cases, may result in UI deadlocks.

For more information on polling, check out From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets.

WebSockets are an alternative to REST for establishing communication between the front end and the backend. WebSockets enable you to create a full duplex communication channel between the client and the server. In this bidirectional scenario, the client can make a request to the server and is notified when the process is complete. The connection remains open, with minimal network overhead, until the response is received.

You may wonder why REST is still needed, since you can transfer response data with WebSockets. A WebSocket is a lightweight protocol designed for real-time messaging between systems. The protocol is not designed for handling large analytical query data, and in API Gateway, each frame’s payload can only hold up to 32 KB. Therefore, the REST API performs large data retrieval.

By using the Data API and API Gateway, you can build decoupled event-driven web applications for your data analytical needs. You can create WebSocket APIs with API Gateway and establish a connection between the client and your backend services. You can then initiate requests to perform analytical queries with the Data API. Due to the Data API’s asynchronous nature, the query completion generates an event to notify the client through the WebSocket channel. The client can decide to either retrieve the query results through a REST API call or perform other follow-up actions. The event-driven architecture enables bidirectional interoperable messages and data while keeping your system components agnostic.

Solution overview

In this post, we show how to create a serverless event-driven web application by querying with the Data API in the backend, establishing a bidirectional communication channel between the user and the backend with the WebSocket feature in API Gateway, and retrieving the results using its REST API feature. Instead of designing an application with long-running API calls, you can use the Data API. The Data API allows you to run SQL queries asynchronously, removing the need to hold long, persistent database connections.

The web application is protected using Amazon Cognito, which authenticates users before they can use the web app and authorizes the REST API calls made from the application.

Other relevant AWS services in this solution include AWS Lambda and Amazon EventBridge. Lambda is a serverless, event-driven compute service that enables you to run code without provisioning or managing servers. EventBridge is a serverless event bus that allows you to build event-driven applications.

The solution creates a lightweight WebSocket connection between the browser and the backend. When a user submits a request using WebSockets to the backend, a query is submitted to the Data API. When the query is complete, the Data API sends an event notification to EventBridge. EventBridge signals the system that the data is available and notifies the client. Afterwards, a REST API call is performed to retrieve the query results for the client to view.

We have published this solution on the AWS Samples GitHub repository and will be referencing it during the rest of this post.

The following architecture diagram highlights the end-to-end solution, which you can provision automatically with AWS CloudFormation templates that are run from a shell script with a few parameter variables.

The application performs the following steps (note the corresponding numbered steps in the process flow):

  1. A web application is provisioned on AWS Amplify; the user needs to sign up first by providing their email and a password to access the site.
  2. The user verifies their credentials using a pin sent to their email. This step is mandatory for the user to then log in to the application and continue access to the other features of the application.
  3. After the user is signed up and verified, they can sign in to the application and request data through their web or mobile clients with input parameters. This initiates a WebSocket connection in API Gateway. (Flow 1, 2)
  4. The connection request is handled by a Lambda function, OnConnect, which initiates an asynchronous database query in Amazon Redshift using the Data API. The SQL query is taken from a SQL script in Amazon Simple Storage Service (Amazon S3) with dynamic input from the client. (Flow 3, 4, 6, 7)
  5. In addition, the OnConnect Lambda function stores the connection, statement identifier, and topic name in an Amazon DynamoDB table. The topic name is an extra parameter that can be used if users want to implement multiple reports on the same webpage. This allows the front end to map responses to the correct report. (Flow 3, 4, 5)
  6. The Data API runs the query submitted in step 4. When the operation is complete, an event notification is sent to EventBridge. (Flow 8)
  7. EventBridge activates an event rule to redirect that event to another Lambda function, SendMessage. (Flow 9)
  8. The SendMessage function notifies the client that the SQL query is complete via API Gateway. (Flow 10, 11, 12)
  9. After the notification is received, the client performs a REST API call (GET) to fetch the results. (Flow 13, 14, 15, 16)
  10. The GetResult function is triggered, which retrieves the SQL query result and returns it to the client.
  11. The user is now able to view the results on the webpage.
  12. When clients disconnect from their browser, API Gateway invokes the onDisconnect function, which deletes the connection information from the DynamoDB table. (Flow 17, 18, 19)

Prerequisites

Prior to deploying your event-driven web application, ensure you have the following:

  • An Amazon Redshift cluster in your AWS environment – This is your backend data warehousing solution to run your analytical queries. For instructions to create your Amazon Redshift cluster, refer to Getting started with Amazon Redshift.
  • An S3 bucket that you have access to – The S3 bucket will be your object storage solution where you can store your SQL scripts. To create your S3 bucket, refer to Create your first S3 bucket.

Deploy CloudFormation templates

The code associated with the design is available in the following GitHub repository. You can clone the repository inside an AWS Cloud9 environment in your AWS account. The AWS Cloud9 environment comes with the AWS Command Line Interface (AWS CLI) installed, which is used to run the CloudFormation templates to set up the AWS infrastructure. Make sure that the jq utility is installed; the script uses it to parse JSON output during its run.

The complete architecture is set up using three CloudFormation templates:

  • cognito-setup.yaml – Creates the Amazon Cognito user pool and web app client, which are used for authentication and protecting the REST API
  • backend-setup.yaml – Creates all the required Lambda functions and the WebSocket and REST APIs, and configures them on API Gateway
  • webapp-setup.yaml – Creates the web application hosting using Amplify to connect and communicate with the WebSocket and REST APIs

These CloudFormation templates are run using the script.sh shell script, which takes care of all the dependencies as required.

A generic template is provided for you to customize your own DDL SQL scripts as well as your own query SQL scripts. We have created sample scripts for you to follow along.

  1. Download the sample DDL script.
  2. In the script, change the IAM role value to your Amazon Redshift cluster's IAM role, which must have the AmazonS3ReadOnlyAccess permission.

For this post, we copy the New York Taxi Data 2015 dataset from a public S3 bucket.
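
The sample DDL script creates the target table and loads it from the public bucket, typically with a COPY command. If you customize the script, a load of roughly the following shape works; the S3 path, format options, and IAM role ARN shown here are placeholders rather than the script's actual values:

-- Illustrative only; the actual statement lives in the sample DDL script
COPY nyc_yellow_taxi
FROM 's3://<source-bucket>/<prefix>/'                        -- placeholder path to the public dataset
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'    -- role with AmazonS3ReadOnlyAccess
FORMAT AS CSV
IGNOREHEADER 1
REGION '<aws-region>';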

  1. Download the sample query script.
  2. Upload the modified sample DDL script and the sample query script into a preexisting S3 bucket that you own, and note down the S3 URI path.

If you want to run your own customized version, modify the DDL and query script to fit your scenario.

  1. Edit the script.sh file before you run it and set the values for the following parameters:
    • RedshiftClusterEndpoint (aws_redshift_cluster_ep) – Your Amazon Redshift cluster endpoint available on the AWS Management Console
    • DBUsername (aws_dbuser_name) – Your Amazon Redshift database user name
    • DDBTableName (aws_ddbtable_name) – The name of the DynamoDB table that will be created
    • WebsocketEndpointSSMParameterName (aws_wsep_param_name) – The parameter name that stores the WebSocket endpoint in AWS Systems Manager Parameter Store
    • RestApiEndpointSSMParameterName (aws_rapiep_param_name) – The parameter name that stores the REST API endpoint in Parameter Store
    • DDLScriptS3Path (aws_ddl_script_path) – The S3 URI of the DDL script that you uploaded
    • QueryScriptS3Path (aws_query_script_path) – The S3 URI of the query script that you uploaded
    • AWSRegion (aws_region) – The Region where the AWS infrastructure is being set up
    • CognitoPoolName (aws_user_pool_name) – The name you want to give to your Amazon Cognito user pool
    • ClientAppName (aws_client_app_name) – The name of the client app to be configured for the web app to handle user authentication

The default acceptable values are already provided as part of the downloaded code.

  1. Run the script using the following command:
./script.sh

During deployment, AWS CloudFormation creates and triggers the Lambda function SetupRedshiftLambdaFunction, which sets up an Amazon Redshift database table and populates data into the table. The following diagram illustrates this process.

Use the demo app

When the shell script is complete, you can start interacting with the demo web app:

  1. On the Amplify console, under All apps in the navigation pane, choose DemoApp.
  2. Choose Run build.

The DemoApp web application goes through the Provision, Build, and Deploy phases.

  1. When it’s complete, use the URL provided to access the web application.

The following screenshot shows the web application page. It has minimal functionality: you can sign in, sign up, or verify a user.

  1. Choose Sign Up.

  1. For Email ID, enter an email.
  2. For Password, enter a password that is at least eight characters long and includes at least one uppercase letter, one lowercase letter, one number, and one special character.
  3. Choose Let’s Enroll.

The Verify your Login to Demo App page opens.

  1. Enter your email and the verification code sent to the email you specified.
  2. Choose Verify.


You’re redirected to a login page.

  1. Sign in using your credentials.

You’re redirected to the demoPage.html website.

  1. Choose Open Connection.

You now have an active WebSocket connection between your browser and your backend AWS environment.

  1. For Trip Month, specify a month (for this example, December) and choose Submit.

You have now defined the month and year you want to query. After a few seconds, you can see the output delivered over the WebSocket.

You may continue using the active WebSocket connection for additional queries—just choose a different month and choose Submit again.

  1. When you’re done, choose Close Connection to close the WebSocket connection.

For exploratory purposes, while your WebSocket connection is active, you can navigate to your DynamoDB table on the DynamoDB console to view the items that are currently stored. After the WebSocket connection is closed, the items stored in DynamoDB are deleted.

Clean up

To clean up your resources, complete the following steps:

  1. On the Amazon S3 console, navigate to the S3 bucket containing the sample DDL script and query script and delete them from the bucket.
  2. On the Amazon Redshift console, navigate to your Amazon Redshift cluster and delete the data you copied over from the sample DDL script.
    1. Run truncate nyc_yellow_taxi;
    2. Run drop table nyc_yellow_taxi;
  3. On the AWS CloudFormation console, navigate to the CloudFormation stacks and choose Delete. Delete the stacks in the following order:
    1. WebappSetup
    2. BackendSetup
    3. CognitoSetup

All resources created in this solution will be deleted.

Monitoring

You can monitor your event-driven web application events, user activity, and API usage with Amazon CloudWatch and AWS CloudTrail. Most areas of this solution already have logging enabled. To view your API Gateway logs, you can turn on CloudWatch Logs. Lambda comes with default logging and monitoring and can be accessed with CloudWatch.

Security

You can secure access to the application using Amazon Cognito, which is a developer-centric and cost-effective customer authentication, authorization, and user management solution. It provides both identity store and federation options that can scale easily. Amazon Cognito supports logins with social identity providers and SAML or OIDC-based identity providers, and supports various compliance standards. It operates on open identity standards (OAuth2.0, SAML 2.0, and OpenID Connect). You can also integrate it with API Gateway to authenticate and authorize the REST API calls either using the Amazon Cognito client app or a Lambda function.

Considerations

The nature of this application includes a front-end client initiating SQL queries against Amazon Redshift. An important consideration is potentially malicious client activity, such as SQL injection. With the current implementation, that is not possible: the SQL queries preexist in your AWS environment and are DQL statements (they don't alter the data or structure). However, as you develop this application to fit your business, you should evaluate these areas of risk.

AWS offers a variety of security services to help you secure your workloads and applications in the cloud, including AWS Shield, AWS Network Firewall, AWS Web Application Firewall, and more. For more information and a full list, refer to Security, Identity, and Compliance on AWS.

Cost optimization

The AWS services that the CloudFormation templates provision in this solution are all serverless. In terms of cost optimization, you only pay for what you use. This model also allows you to scale without manual intervention. Review the pricing page for each service used in this solution to determine the associated costs.

Conclusion

In this post, we showed you how to create an event-driven application using the Amazon Redshift Data API and API Gateway WebSocket and REST APIs. The solution helps you build data analytical web applications in an event-driven architecture, decouple your application, optimize long-running database queries processes, and avoid unnecessary polling requests between the client and the backend.

You also used serverless technologies: API Gateway, Lambda, DynamoDB, and EventBridge. You didn't have to manage or provision any servers throughout this process.

This event-driven, serverless architecture offers greater extensibility and simplicity, making it easier to maintain and release new features. Adding new components or third-party products is also simplified.

With the instructions in this post and the generic CloudFormation templates we provided, you can customize your own event-driven application tailored to your business. For feedback or contributions, we welcome you to contact us through the AWS Samples GitHub Repository by creating an issue.


About the Authors

David Zhang is an AWS Data Architect in Global Financial Services. He specializes in designing and implementing serverless analytics infrastructure, data management, ETL, and big data systems. He helps customers modernize their data platforms on AWS. David is also an active speaker and contributor to AWS conferences, technical content, and open-source initiatives. During his free time, he enjoys playing volleyball, tennis, and weightlifting. Feel free to connect with him on LinkedIn.

Manash Deb is a Software Development Manager in the AWS Directory Service team. With over 18 years of software development experience, his passion is designing and delivering highly scalable, secure, zero-maintenance applications in the AWS identity and data analytics space. He loves mentoring and coaching others, acting as a catalyst and force multiplier, leading highly motivated engineering teams, and building large-scale distributed systems.

Pavan Kumar Vadupu Lakshman Manikya is an AWS Solutions Architect who helps customers design robust, scalable solutions across multiple industries. With a background in enterprise architecture and software development, Pavan has helped create solutions for API security, API management, microservices, and geospatial information system use cases for his customers. He is passionate about learning new technologies and solving, automating, and simplifying customer problems using these solutions.

What’s new in Amazon Redshift – 2022, a year in review

Post Syndicated from Manan Goel original https://aws.amazon.com/blogs/big-data/whats-new-in-amazon-redshift-2022-a-year-in-review/

In 2021 and 2020, we told you about the new features in Amazon Redshift that make it easier, faster, and more cost-effective to analyze all your data and find rich and powerful insights. In 2022, we are happy to report that the Amazon Redshift team was hard at work. We worked backward from customer requirements and announced multiple new features to make it easier, faster, and more cost-effective to analyze all your data. This post covers some of these new features.

At AWS, for data and analytics, our strategy is to give you a modern data architecture that helps you break free from data silos; have purpose-built data, analytics, machine learning (ML), and artificial intelligence services to use the right tool for the right job; and have open, governed, secure, and fully managed services to make analytics available to everyone. Within AWS’s modern data architecture, Amazon Redshift as the cloud data warehouse remains a key component, enabling you to run complex SQL analytics at scale and performance on terabytes to petabytes of structured and unstructured data, and make the insights widely available through popular business intelligence (BI) and analytics tools. We continue to work backward from customers’ requirements, and in 2022 launched over 40 features in Amazon Redshift to help customers with their top data warehousing use cases, including:

  • Self-service analytics
  • Easy data ingestion
  • Data sharing and collaboration
  • Data science and machine learning
  • Secure and reliable analytics
  • Best price performance analytics

Let’s dive deeper and discuss the new Amazon Redshift features in these areas.

Self-service analytics

Customers continue to tell us that data and analytics is becoming ubiquitous, and everyone in their organization needs analytics. We announced Amazon Redshift Serverless (in preview) in 2021 to make it easy to run and scale analytics in seconds without having to provision and manage data warehouse infrastructure. In July 2022, we announced the general availability of Redshift Serverless, and since then thousands of customers, including Peloton, Broadridge Financials, and NextGen Healthcare, have used it to quickly and easily analyze their data. Amazon Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver high performance for all your analytics, and you only pay for the compute used for the duration of the workloads on a per-second basis. Since GA, we have added features like resource tagging, simplified monitoring, and availability in additional AWS Regions to further simplify billing and expand the reach across more Regions worldwide.

In 2021, we launched Amazon Redshift Query Editor V2, which is a free web-based tool for data analysts, data scientists, and developers to explore, analyze, and collaborate on data in Amazon Redshift data warehouses and data lakes. In 2022, Query Editor V2 got additional enhancements such as notebook support for improved collaboration to author, organize, and annotate queries; user access through identity provider (IdP) credentials for single sign-on; and the ability to run multiple queries concurrently to improve developer productivity.

Autonomics is another area where we are actively working to use ML-based optimizations and give customers a self-learning and self-optimizing data warehouse. In 2022, we announced the general availability of Automated Materialized Views (AutoMVs) to improve the performance of queries (reduce the total runtime) without any user effort by automatically creating and maintaining materialized views. AutoMVs, combined with automatic refresh, incremental refresh, and automatic query rewriting for materialized views, made materialized views maintenance free, giving you faster performance automatically. In addition, the automatic table optimization (ATO) capability for schema optimization and automatic workload management (auto WLM) capability for workload optimization got further improvements for better query performance.

Easy data ingestion

Customers tell us that they have their data distributed over multiple data sources like transactional databases, data warehouses, data lakes, and big data systems. They want the flexibility to integrate this data with no-code/low-code, zero-ETL data pipelines or analyze this data in place without moving it. Customers tell us that their current data pipelines are complex, manual, rigid, and slow, resulting in incomplete, inconsistent, and stale views of data, limiting insights. Customers have asked us for a better way forward, and we are pleased to announce a number of new capabilities to simplify and automate data pipelines.

Amazon Aurora zero-ETL integration with Amazon Redshift (preview) enables you to run near-real-time analytics and ML on petabytes of transactional data. It offers a no-code solution for making transactional data from multiple Amazon Aurora databases available in Amazon Redshift data warehouses within seconds of being written to Aurora, eliminating the need to build and maintain complex data pipelines. With this feature, Aurora customers can also access Amazon Redshift capabilities such as complex SQL analytics, built-in ML, data sharing, and federated access to multiple data stores and data lakes. This feature is now available in preview for Amazon Aurora MySQL-Compatible Edition version 3 (with MySQL 8.0 compatibility), and you can request access to the preview.

Amazon Redshift now supports auto-copy from Amazon S3 (preview) to simplify data loading from Amazon Simple Storage Service (Amazon S3) into Amazon Redshift. You can now set up continuous file ingestion rules (copy jobs) to track your Amazon S3 paths and automatically load new files without the need for additional tools or custom solutions. Copy jobs can be monitored through system tables, and they automatically keep track of previously loaded files and exclude them from the ingestion process to prevent data duplication. This feature is now available in preview; you can try this feature by creating a new cluster using the preview track.

Customers continue to tell us that they need instantaneous, in-the-moment, real-time analytics, and we are pleased to announce the general availability of streaming ingestion support in Amazon Redshift for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). This feature eliminates the need to stage streaming data in Amazon S3 before ingesting it into Amazon Redshift, enabling you to achieve low latency, measured in seconds, while ingesting hundreds of megabytes of streaming data per second into your data warehouses. You can use SQL within Amazon Redshift to connect to and directly ingest data from multiple Kinesis data streams or MSK topics, create auto-refreshing streaming materialized views with transformations on top of streams directly to access streaming data, and combine real-time data with historical data for better insights. For example, Adobe has integrated Amazon Redshift streaming ingestion as part of their Adobe Experience Platform for ingesting and analyzing, in real time, the web and applications clickstream and session data for various applications like CRM and customer support applications.
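
As a minimal sketch of what this looks like in SQL (the schema, stream, and view names are illustrative, not taken from the Adobe example), you map the Kinesis streams in your account to an external schema and define an auto-refreshing streaming materialized view on top of a stream:

-- Map Kinesis data streams in the account to an external schema, using the cluster's default IAM role
CREATE EXTERNAL SCHEMA kinesis_schema FROM KINESIS IAM_ROLE default;

-- Auto-refreshing streaming materialized view over a stream named clickstream_events (illustrative name)
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(from_varbyte(kinesis_data, 'utf-8')) AS payload
FROM kinesis_schema.clickstream_events;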

Customers have told us that they want simple, out-of-the-box integration between Amazon Redshift, BI and ETL (extract, transform, and load) tools, and business applications like Salesforce and Marketo. We are pleased to announce the general availability of Informatica Data Loader for Amazon Redshift, which enables you to use Informatica Data Loader for high-speed and high-volume data loading into Amazon Redshift for free. You can simply select the Informatica Data Loader option on the Amazon Redshift console. Once in Informatica Data Loader, you can connect to sources such as Salesforce or Marketo, choose Amazon Redshift as a target, and begin to load your data.

Data sharing and collaboration

Customers continue to tell us that they want to analyze all their first-party and third-party data and make the rich data-driven insights available to their customers, partners, and suppliers. We launched new features in 2021, such as Data Sharing and AWS Data Exchange integration, to make it easier for you to analyze all of your data and share it within and outside your organizations.

A great example of a customer using data sharing is Orion. Orion provides real-time data as a service (DaaS) solutions for customers in the financial services industry, such as wealth management, asset management, and investment management providers. They have over 2,500 data sources that are primarily SQL Server databases sitting both on premises and in AWS. Data is streamed using Kafka connecters into Amazon Redshift. They have a producer cluster that receives all this data and then uses Data Sharing to share data in real time for collaboration. This is a multi-tenant architecture that serves multiple clients. Given the sensitivity of their data, data sharing is a way to provide workload isolation between clusters and also securely share that data to end-users.
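
At the SQL level, this kind of producer-to-consumer sharing takes only a few statements; the object and namespace names below are illustrative placeholders, not Orion's actual objects:

-- On the producer cluster: create a datashare and add objects to it
CREATE DATASHARE ticket_sales_share;
ALTER DATASHARE ticket_sales_share ADD SCHEMA public;
ALTER DATASHARE ticket_sales_share ADD TABLE public.ticket_sales;
GRANT USAGE ON DATASHARE ticket_sales_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer cluster: create a database from the share and query the live data
CREATE DATABASE shared_sales FROM DATASHARE ticket_sales_share OF NAMESPACE '<producer-namespace-guid>';
SELECT COUNT(*) FROM shared_sales.public.ticket_sales;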

In 2022, we continued to invest in this area to improve the performance, governance, and developer productivity with new features to make it easier, simpler, and faster to share and collaborate on data.

As customers are building large-scale data sharing configurations, they have asked for simplified governance and security for shared data, and we are adding centralized access control with AWS Lake Formation for Amazon Redshift datashares to enable sharing live data across multiple Amazon Redshift data warehouses. With this feature, Amazon Redshift now supports simplified governance of Amazon Redshift datashares by using AWS Lake Formation as a single pane of glass to centrally manage data or permissions on datashares. You can view, modify, and audit permissions, including row-level and column-level security on the tables and views in the Amazon Redshift datashares, using Lake Formation APIs and the AWS Management Console, and allow the Amazon Redshift datashares to be discovered and consumed by other Amazon Redshift data warehouses.

Data science and machine learning

Customers continue to tell us that they want their data and analytics systems to help them answer a wide range of questions, from what is happening in their business (descriptive analytics) to why is it happening (diagnostic analytics) and what will happen in the future (predictive analytics). Amazon Redshift provides features like complex SQL analytics, data lake analytics, and Amazon Redshift ML for customers to analyze their data and discover powerful insights. Redshift ML integrates Amazon Redshift with Amazon SageMaker, a fully managed ML service, enabling you to create, train, and deploy ML models using familiar SQL commands.

Customers have also asked us for better integration between Amazon Redshift and Apache Spark, so we are excited to announce Amazon Redshift integration for Apache Spark to make data warehouses easily accessible for Spark-based applications. Now, developers using AWS analytics and ML services such as Amazon EMR, AWS Glue, and SageMaker can effortlessly build Apache Spark applications that read from and write to their Amazon Redshift data warehouses. Amazon EMR and AWS Glue package the Redshift-Spark connector so you can easily connect to your data warehouse from your Spark-based applications. You can use several pushdown capabilities for operations such as sort, aggregate, limit, join, and scalar functions so that only the relevant data is moved from your Amazon Redshift data warehouse to the consuming Spark application. You can also make your applications more secure by utilizing AWS Identity and Access Management (IAM) credentials to connect to Amazon Redshift.

Secure and reliable analytics

Customers continue to tell us that their data warehouses are mission-critical systems that need high availability, reliability, and security. We launched a number of new features in 2022 in this area.

Amazon Redshift now supports Multi-AZ deployments (in preview) for RA3 instance-based clusters, which enables running your data warehouse in multiple AWS Availability Zones simultaneously and continuous operation in unforeseen Availability Zone-wide failure scenarios. Multi-AZ support is already available for Redshift Serverless. An Amazon Redshift Multi-AZ deployment allows you to recover in case of Availability Zone failures without any user intervention. An Amazon Redshift Multi-AZ data warehouse is accessed as a single data warehouse with one endpoint, and helps you maximize performance by distributing workload processing across multiple Availability Zones automatically. No application changes are needed to maintain business continuity during unforeseen outages.

In 2022, we launched features like role-based access control, row-level security, and data masking (in preview) to make it easier for you to manage access and decide who has access to which data, including obfuscating personally identifiable information (PII) like credit card numbers.

You can use role-based access control (RBAC) to control end-user access to data at a broad or granular level based on an end-user's job role and permissions. With RBAC, you can create a role using SQL, grant a collection of granular permissions to the role, and then assign that role to end-users. Roles can be granted object-level, column-level, and system-level permissions. Additionally, RBAC introduces out-of-the-box system roles for DBAs, operators, and security admins, and supports customized roles.
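
For example, a read-only analyst role might be set up as follows (the role, schema, and user names are illustrative):

-- Create a role and grant it granular permissions
CREATE ROLE sales_analyst;
GRANT USAGE ON SCHEMA sales TO ROLE sales_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO ROLE sales_analyst;

-- Assign the role to an end-user
GRANT ROLE sales_analyst TO analyst_user;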

Row-level security (RLS) simplifies design and implementation of fine-grained access to the rows in tables. With RLS, you can restrict access to a subset of rows within a table based on the users’ job role or permissions with SQL.
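
A minimal sketch of RLS (the table, role, and predicate are illustrative): create a policy, attach it to a table for a role, and turn row-level security on for that table:

-- Return only rows whose sales_rep column matches the logged-in user (illustrative predicate)
CREATE RLS POLICY own_rows_only
WITH (sales_rep VARCHAR(64))
USING (sales_rep = current_user);

ATTACH RLS POLICY own_rows_only ON sales TO ROLE sales_analyst;
ALTER TABLE sales ROW LEVEL SECURITY ON;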

Amazon Redshift support for dynamic data masking (DDM), which is now available in preview, allows you to simplify protecting PII such as Social Security numbers, credit card numbers, and phone numbers in your Amazon Redshift data warehouse. With dynamic data masking, you control access to your data through simple SQL-based masking policies that determine how Amazon Redshift returns sensitive data to the user at query time. You can create masking policies to define consistent, format-preserving, and irreversible masked data values. You can apply a masking policy on a specific column or list of columns in a table. Also, you have the flexibility of choosing how to show the masked data. For example, you can completely hide the data, replace partial real values with wildcard characters, or define your own way to mask the data using SQL expressions, Python, or AWS Lambda user-defined functions. Additionally, you can apply a conditional masking policy based on other columns, which selectively protects the column data in a table based on the values in one or more different columns.
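
As an illustrative sketch of the preview syntax (the table, column, and policy names are placeholders), a policy that keeps only the last four digits of a formatted card number could look like the following:

-- Format-preserving mask that keeps only the last four digits
-- (assumes values formatted as ####-####-####-####)
CREATE MASKING POLICY mask_credit_card
WITH (credit_card VARCHAR(32))
USING ('XXXX-XXXX-XXXX-' || SUBSTRING(credit_card, 16, 4));

-- Attach the policy to a column for all users (PUBLIC)
ATTACH MASKING POLICY mask_credit_card
ON customers(credit_card)
TO PUBLIC;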

We also announced enhancements to audit logging, native integration with Microsoft Azure Active Directory, and support for default IAM roles in additional Regions to further simplify security management.

Best price performance analytics

Customers continue to tell us that they need fast and cost-effective data warehouses that deliver high performance at any scale while keeping costs low. From day 1 since Amazon Redshift’s launch in 2012, we have taken a data-driven approach and used fleet telemetry to build a cloud data warehouse service that gives you the best price performance at any scale. Over the years, we have evolved Amazon Redshift’s architecture and launched features such as Redshift Managed Storage (RMS) for separation of storage and compute, Amazon Redshift Spectrum for data lake queries, automatic table optimization for physical schema optimization, automatic workload management to prioritize workloads and allocate the right compute and memory, cluster resize to scale compute and storage vertically, and concurrency scaling to dynamically scale compute out or in. Our performance benchmarks continue to demonstrate Amazon Redshift’s price performance leadership.

In 2022, we added new features such as the general availability of concurrency scaling for write operations like COPY, INSERT, UPDATE, and DELETE to support virtually unlimited concurrent users and queries. We also introduced performance improvements for string-based data processing through vectorized scans over lightweight, CPU-efficient, dictionary-encoded string columns, which allows the database engine to operate directly over compressed data.

We also added support for SQL operators such as MERGE (a single operator for inserts and updates); CONNECT BY (for hierarchical queries); and GROUPING SETS, ROLLUP, and CUBE (for multi-dimensional reporting); and increased the size of the SUPER data type to 16 MB to make it easier for you to migrate from legacy data warehouses to Amazon Redshift.
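
For example, MERGE collapses the classic upsert pattern (update rows that already exist, insert the ones that don't) into a single statement; the table and column names below are illustrative:

-- Upsert staged ticket sales into the target table in one statement
MERGE INTO ticket_sales
USING ticket_sales_staging s
ON ticket_sales.sale_id = s.sale_id
WHEN MATCHED THEN
    UPDATE SET quantity = s.quantity, amount = s.amount
WHEN NOT MATCHED THEN
    INSERT (sale_id, quantity, amount) VALUES (s.sale_id, s.quantity, s.amount);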

Conclusion

Our customers continue to tell us that data and analytics remains a top priority for them and the need to cost-effectively extract more business value from their data during these times is more pronounced than any other time in the past. Amazon Redshift as your cloud data warehouse enables you to run complex SQL analytics with scale and performance on terabytes to petabytes of structured and unstructured data and make the insights widely available through popular BI and analytics tools.

Although we launched over 40 features in 2022 and the pace of innovation continues to accelerate, it remains day 1 and we look forward to hearing from you on how these features help you unlock more value for your organizations. We invite you to try these new features and get in touch with us through your AWS account team if you have further comments.


About the author

Manan Goel is a Product Go-To-Market Leader for AWS Analytics Services, including Amazon Redshift, at AWS. He has more than 25 years of experience and is well versed in databases, data warehousing, business intelligence, and analytics. Manan holds an MBA from Duke University and a BS in Electronics and Communications Engineering.

Build near real-time logistics dashboards using Amazon Redshift and Amazon Managed Grafana for better operational intelligence

Post Syndicated from Paul Villena original https://aws.amazon.com/blogs/big-data/build-near-real-time-logistics-dashboards-using-amazon-redshift-and-amazon-managed-grafana-for-better-operational-intelligence/

Amazon Redshift is a fully managed data warehousing service that is currently helping tens of thousands of customers manage analytics at scale. It continues to lead price-performance benchmarks, and separates compute and storage so each can be scaled independently and you only pay for what you need. It also eliminates data silos by simplifying access to your operational databases, data warehouse, and data lake with consistent security and governance policies.

With the Amazon Redshift streaming ingestion feature, it’s easier than ever to access and analyze data coming from real-time data sources. It simplifies the streaming architecture by providing native integration between Amazon Redshift and the streaming engines in AWS, which are Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). Streaming data sources like system logs, social media feeds, and IoT streams can continue to push events to the streaming engines, and Amazon Redshift simply becomes just another consumer. Before Amazon Redshift streaming was available, we had to stage the streaming data first in Amazon Simple Storage Service (Amazon S3) and then run the copy command to load it into Amazon Redshift. Eliminating the need to stage data in Amazon S3 results in faster performance and improved latency. With this feature, we can ingest hundreds of megabytes of data per second and have a latency of just a few seconds.

Another common challenge for our customers is the additional skill required when using streaming data. In Amazon Redshift streaming ingestion, only SQL is required. We use SQL to do the following:

  • Define the integration between Amazon Redshift and our streaming engines with the creation of external schema
  • Create the different streaming database objects that are actually materialized views
  • Query and analyze the streaming data
  • Generate new features that are used to predict delays using machine learning (ML)
  • Perform inferencing natively using Amazon Redshift ML

In this post, we build a near real-time logistics dashboard using Amazon Redshift and Amazon Managed Grafana. Our example is an operational intelligence dashboard for a logistics company that provides situational awareness and augmented intelligence for their operations team. From this dashboard, the team can see the current state of their consignments and their logistics fleet based on events that happened only a few seconds ago. It also shows the consignment delay predictions of an Amazon Redshift ML model that helps them proactively respond to disruptions before they even happen.

Solution overview

This solution is composed of the following components, and the provisioning of resources is automated using the AWS Cloud Development Kit (AWS CDK):

  • Multiple streaming data sources are simulated through Python code running in our serverless compute service, AWS Lambda
  • The streaming events are captured by Amazon Kinesis Data Streams, which is a highly scalable serverless streaming data service
  • We use the Amazon Redshift streaming ingestion feature to process and store the streaming data and Amazon Redshift ML to predict the likelihood of a consignment getting delayed
  • We use AWS Step Functions for serverless workflow orchestration
  • The solution includes a consumption layer built on Amazon Managed Grafana where we can visualize the insights and even generate alerts through Amazon Simple Notification Service (Amazon SNS) for our operations team

The following diagram illustrates our solution architecture.

Prerequisites

The project has the following prerequisites:

Sample deployment using the AWS CDK

The AWS CDK is an open-source project that allows you to define your cloud infrastructure using familiar programming languages. It uses high-level constructs to represent AWS components to simplify the build process. In this post, we use Python to define the cloud infrastructure due to its familiarity to many data and analytics professionals.

Clone the GitHub repository and install the Python dependencies:

git clone https://github.com/aws-samples/amazon-redshift-streaming-workshop
cd amazon-redshift-streaming-workshop
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Next, bootstrap the AWS CDK. This sets up the resources required by the AWS CDK to deploy into the AWS account. This step is only required if you haven’t used the AWS CDK in the deployment account and Region.

cdk bootstrap

Deploy all stacks:

cdk deploy IngestionStack 
cdk deploy RedshiftStack 
cdk deploy StepFunctionStack

The entire deployment takes 10–15 minutes.

Access streaming data using Amazon Redshift streaming ingestion

The AWS CDK deployment provisions an Amazon Redshift cluster with the appropriate default IAM role to access the Kinesis data stream. We can create an external schema to establish a connection between the Amazon Redshift cluster and the Kinesis data stream:

CREATE EXTERNAL SCHEMA ext_kinesis FROM KINESIS
IAM_ROLE default;

For instructions on how to connect to the cluster, refer to Connecting to the Redshift Cluster.

We use a materialized view to parse data in the Kinesis data stream. In this case, the whole payload is ingested as is and stored using the SUPER data type in Amazon Redshift. Data stored in streaming engines is usually in semi-structured format, and the SUPER data type provides a fast and efficient way to analyze semi-structured data within Amazon Redshift.


See the following code:

CREATE MATERIALIZED VIEW consignment_stream AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(from_varbyte(kinesis_data, 'utf-8')) AS consignment_data
FROM ext_kinesis.consignment_stream
WHERE is_utf8(kinesis_data)
  AND is_valid_json(from_varbyte(kinesis_data, 'utf-8'));

Refreshing the materialized view invokes Amazon Redshift to read data directly from the Kinesis data stream and load it into the materialized view. This refresh can be done automatically by adding the AUTO REFRESH clause in the materialized view definition. However, in this example, we are orchestrating the end-to-end data pipeline using AWS Step Functions.

REFRESH MATERIALIZED VIEW consignment_stream;

Now we can start running queries against our streaming data and unify it with other datasets like logistics fleet data. If we want to know the distribution of our consignments across different states, we can easily unpack the contents of the JSON payload using the PartiQL syntax.

SELECT cs.consignment_data.origin_state::VARCHAR,
       COUNT(1) AS number_of_consignments,
       AVG(on_the_move) AS running_fleet,
       AVG(scheduled_maintenance + unscheduled_maintenance) AS under_maintenance
FROM consignment_stream cs
INNER JOIN fleet_summary fs
    ON TRIM(cs.consignment_data.origin_state::VARCHAR) = fs.vehicle_location
GROUP BY 1;

Generate features using Amazon Redshift SQL functions

The next step is to transform and enrich the streaming data using Amazon Redshift SQL to generate additional features that will be used by Amazon Redshift ML for its predictions. We use date and time functions to identify the day of the week, and calculate the number of days between the order date and target delivery date.

We also use geospatial functions, specifically ST_DistanceSphere, to calculate the distance between origin and destination locations. The GEOMETRY data type within Amazon Redshift provides a cost-effective way to analyze geospatial data such as longitude and latitudes at scale. In this example, the addresses have already been converted to longitude and latitude. However, if you need to perform geocoding, you can integrate Amazon Location Services with Amazon Redshift using user-defined functions (UDFs). On top of geocoding, Amazon Location Service also allows you to more accurately calculate route distance between origin and destination, and even specify waypoints along the way.

We use another materialized view to persist these transformations. A materialized view provides a simple yet efficient way to create data pipelines using its incremental refresh capability. Amazon Redshift identifies the incremental changes from the last refresh and only updates the target materialized view based on these changes. In this materialized view, all our transformations are deterministic, so we expect our data to be consistent when going through a full refresh or an incremental refresh.


See the following code:

CREATE MATERIALIZED VIEW consignment_transformed AS
SELECT
consignment_data.consignmentid::INT consignment_id,
consignment_data.consignment_date::TIMESTAMP consignment_date,
consignment_data.delivery_date::TIMESTAMP delivery_date,
consignment_data.origin_state::VARCHAR origin_state,
consignment_data.destination_state::VARCHAR destination_state,
consignment_data.revenue::FLOAT revenue,
consignment_data.cost::FLOAT cost,
DATE_PART(dayofweek, consignment_data.consignment_date::TIMESTAMP)::INT day_of_week,
DATE_PART(hour, consignment_data.consignment_date::TIMESTAMP)::INT "hour",
DATEDIFF(days,
consignment_data.consignment_date::TIMESTAMP,
consignment_data.delivery_date::TIMESTAMP
)::INT days_to_deliver,
(ST_DistanceSphere(
ST_Point(consignment_data.origin_lat::FLOAT, consignment_data.origin_long::FLOAT),
ST_Point(consignment_data.destination_lat::FLOAT, consignment_data.destination_long::FLOAT)
) / 1000 --convert to km
) delivery_distance
FROM consignment_stream;

Predict delays using Amazon Redshift ML

We can use this enriched data to make predictions on the delay probability of a consignment. Amazon Redshift ML is a feature of Amazon Redshift that allows you to use the power of Amazon Redshift to build, train, and deploy ML models directly within your data warehouse.

The training of a new Amazon Redshift ML model has been initiated as part of the AWS CDK deployment using the CREATE MODEL statement. The training dataset is defined in the FROM clause, and TARGET defines which column the model is trying to predict. The FUNCTION clause defines the name of the function that is used to make predictions.

CREATE MODEL ml_delay_prediction -- already executed by CDK
FROM (SELECT * FROM ext_s3.consignment_train)
TARGET probability
FUNCTION fnc_delay_probabilty
IAM_ROLE default
SETTINGS (
MAX_RUNTIME 1800, --seconds
S3_BUCKET '<ingestionstack-s3bucketname>' --replace S3 bucket name
)

This simplified model is trained using historical observations, and the training process takes around 30 minutes to complete. You can check the status of the training job by running the SHOW MODEL statement:

SHOW MODEL ml_delay_prediction;

When the model is ready, we can start making predictions on new data that is streamed into Amazon Redshift. Predictions are generated using the Amazon Redshift ML function that was defined during the training process. We pass the calculated features from the transformed materialized view into this function, and the prediction results populate the delay_probability column.

This final output is persisted into the consignment_predictions table, and Step Functions is orchestrating the ongoing incremental data load into this target table. We use a table for the final output, instead of a materialized view, because ML predictions have randomness involved and it may give us non-deterministic results. Using a table gives us more control on how data is loaded.


See the following code:

CREATE TABLE consignment_predictions AS
SELECT *, fnc_delay_probability(
day_of_week, "hour", days_to_deliver, delivery_distance) delay_probability
FROM consignment_transformed;

Create an Amazon Managed Grafana dashboard

We use Amazon Managed Grafana to create a near real-time logistics dashboard. Amazon Managed Grafana is a fully managed service that makes it easy to create, configure, and share interactive dashboards and charts for monitoring your data. We can also use Grafana to set up alerts and notifications based on specific conditions or thresholds, allowing you to quickly identify and respond to issues.

The high-level steps in setting up the dashboard are as follows:

  1. Create a Grafana workspace.
  2. Set up Grafana authentication using AWS IAM Identity Center (successor to AWS Single Sign-On) or using direct SAML integration.
  3. Configure Amazon Redshift as a Grafana data source.
  4. Import the JSON file for the near real-time logistics dashboard.

A more detailed set of instructions is available in the GitHub repository for your reference.

Clean up

To avoid ongoing charges, delete the resources deployed. Access the Amazon Linux 2 environment and run the AWS CDK destroy command. Delete the Grafana objects related to this deployment.

cd amazon-redshift-streaming-workshop
source .venv/bin/activate
cdk destroy --all

Conclusion

In this post, we showed how easy it is to build a near real-time logistics dashboard using Amazon Redshift and Amazon Managed Grafana. We created an end-to-end modern data pipeline using only SQL. This shows how Amazon Redshift is a powerful platform for democratizing your data—it enables a wide range of users, including business analysts, data scientists, and others, to work with and analyze data without requiring specialized technical skills or expertise.

We encourage you to explore what else can be achieved with Amazon Redshift and Amazon Managed Grafana. We also recommend you visit the AWS Big Data Blog for other useful blog posts on Amazon Redshift.


About the Author

Paul Villena is an Analytics Solutions Architect at AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interest are infrastructure as code, serverless technologies, and coding in Python.

How BookMyShow saved 80% in costs by migrating to an AWS modern data architecture

Post Syndicated from Mahesh Vandi Chalil original https://aws.amazon.com/blogs/big-data/how-bookmyshow-saved-80-in-costs-by-migrating-to-an-aws-modern-data-architecture/

This is a guest post co-authored by Mahesh Vandi Chalil, Chief Technology Officer of BookMyShow.

BookMyShow (BMS), a leading entertainment company in India, provides an online ticketing platform for movies, plays, concerts, and sporting events. Selling up to 200 million tickets on an annual run rate basis (pre-COVID) to customers in India, Sri Lanka, Singapore, Indonesia, and the Middle East, BookMyShow also offers an online media streaming service and end-to-end management for virtual and on-ground entertainment experiences across all genres.

The pandemic gave BMS the opportunity to migrate and modernize our 15-year-old analytics solution to a modern data architecture on AWS. The new architecture is modern, secure, governed, and cost-optimized, with the ability to scale to petabytes. BMS migrated and modernized from on-premises and other cloud platforms to AWS in just four months. This project was run in parallel with our application migration project and achieved 90% cost savings in storage and 80% cost savings in analytics spend.

The BMS analytics platform caters to business needs for sales and marketing, finance, and business partners (e.g., cinemas and event owners), and provides application functionality for audience, personalization, pricing, and data science teams. The prior analytics solution had multiple copies of data, for a total of over 40 TB, with approximately 80 TB of data in other cloud storage. Data was stored on‑premises and in the cloud in various data stores. Growing organically, the teams had the freedom to choose their technology stack for individual projects, which led to the proliferation of various tools, technology, and practices. Individual teams for personalization, audience, data engineering, data science, and analytics used a variety of products for ingestion, data processing, and visualization.

This post discusses BMS's migration and modernization journey, and how BMS, AWS, and AWS Partner Minfy Technologies worked together to successfully complete the migration in four months while saving costs. The migration tenets, built around the AWS modern data architecture, made the project a huge success.

Challenges in the prior analytics platform

  • Varied Technology: Multiple teams used various products, languages, and versions of software.
  • Larger Migration Project: Because the analytics modernization was a parallel project with application migration, planning was crucial in order to consider the changes in core applications and project timelines.
  • Resources: Experienced resource churn from the application migration project, and had very little documentation of current systems.
  • Data: Had multiple copies of data and no single source of truth; each data store provided a view for the business unit.
  • Ingestion Pipelines: Complex data pipelines moved data across various data stores at varied frequencies. We had multiple approaches in place to ingest data to Cloudera, via over 100 Kafka consumers from transaction systems and the MQTT (Message Queue Telemetry Transport) messaging protocol for clickstreams, stored procedures, and Spark jobs. We had approximately 100 jobs for data ingestion across Spark, Alteryx, Beam, NiFi, and more.
  • Hadoop Clusters: Large dedicated hardware on which the Hadoop clusters were configured, incurring fixed costs. The on-premises Cloudera setup catered to most of the data engineering, audience, and personalization batch processing workloads. Teams had their own implementations of HBase and Hive for our audience and personalization applications.
  • Data warehouse: The data engineering team used TiDB as their on-premises data warehouse. However, each consumer team had their own perspective of data needed for analysis. As this siloed architecture evolved, it resulted in expensive storage and operational costs to maintain these separate environments.
  • Analytics Database: The analytics team used data sourced from other transactional systems and denormalized data. The team had their own extract, transform, and load (ETL) pipeline, using Alteryx with a visualization tool.

The team followed these migration tenets, which led to project success:

  • Prioritize by business functionality.
  • Apply best practices when building a modern data architecture from Day 1.
  • Move only required data, canonicalize the data, and store it in the most optimal format in the target. Remove data redundancy as much as possible. Mark scope for optimization for the future when changes are intrusive.
  • Build the data architecture while keeping data formats, volumes, governance, and security in mind.
  • Simplify ELT and processing jobs by categorizing the jobs as rehosted, rewritten, and retired. Finalize canonical data format, transformation, enrichment, compression, and storage format as Parquet.
  • Rehost machine learning (ML) jobs that were critical for business.
  • Work backward to achieve our goals, and clear roadblocks and alter decisions to move forward.
  • Use serverless options as a first option and pay per use. Assess the cost and effort for rearchitecting to select the right approach. Execute a proof of concept to validate this for each component and service.

Strategies applied to succeed in this migration:

  • Team – We created a unified team with people from data engineering, analytics, and data science as part of the analytics migration project. Site reliability engineering (SRE) and application teams were involved when critical decisions were needed regarding data or timeline for alignment. The analytics, data engineering, and data science teams spent considerable time planning, understanding the code, and iteratively looking at the existing data sources, data pipelines, and processing jobs. AWS team with partner team from Minfy Technologies helped BMS arrive at a migration plan after a proof of concept for each of the components in data ingestion, data processing, data warehouse, ML, and analytics dashboards.
  • Workshops – The AWS team conducted a series of workshops and immersion days, and coached the BMS team on the technology and best practices to deploy the analytics services. The AWS team helped BMS explore the configuration and benefits of the migration approach for each scenario (data migration, data pipeline, data processing, visualization, and machine learning) via proofs of concept (POCs). The team captured the changes required in the existing code for migration. The BMS team also got acquainted with the AWS services used in the target architecture.
  • Proof of concept – The BMS team, with help from the partner and AWS team, implemented multiple proofs of concept to validate the migration approach:
    • Performed batch processing of Spark jobs in Amazon EMR, in which we checked the runtime, required code changes, and cost.
    • Ran clickstream analysis jobs in Amazon EMR, testing the end-to-end pipeline. The team conducted proofs of concept on AWS IoT Core for the MQTT protocol and streaming to Amazon S3.
    • Migrated ML models to Amazon SageMaker and orchestrated with Amazon MWAA.
    • Created sample QuickSight reports and dashboards, in which features and time to build were assessed.
    • Configured for key scenarios for Amazon Redshift, in which time for loading data, query performance, and cost were assessed.
  • Effort vs. cost analysis – Team performed the following assessments:
    • Compared the ingestion pipelines, the difference in data structure in each store, the basis of the current business need for the data source, the activity for preprocessing the data before migration, data migration to Amazon S3, and change data capture (CDC) from the migrated applications in AWS.
    • Assessed the effort to migrate approximately 200 jobs, determined which jobs were redundant or needed improvement from a functional perspective, and completed a migration list for the target state. Because modernizing the MQTT workflow code to serverless was time-consuming, the team decided to rehost it on Amazon Elastic Compute Cloud (Amazon EC2) and defer modernization to Amazon Kinesis to the next phase.
    • Reviewed over 400 reports and dashboards, prioritized development in phases, and reassessed business user needs.

AWS cloud services chosen for proposed architecture:

  • Data lake – We used Amazon S3 as the data lake to store the single truth of information for all raw and processed data, thereby reducing the copies of data storage and storage costs.
  • Ingestion – Because we had multiple sources of truth in the current architecture, we arrived at a common structure before migration to Amazon S3, and existing pipelines were modified to do preprocessing. These one-time preprocessing jobs were run in Cloudera, because the source data was on-premises, and on Amazon EMR for data in the cloud. We designed new data pipelines for ingestion from transactional systems on the AWS cloud using AWS Glue ETL.
  • Processing – Processing jobs were segregated based on runtime into two categories: batch and near-real time. Batch processes were further divided into transient Amazon EMR clusters with varying runtimes and Hadoop application requirements like HBase. Near-real-time jobs were provisioned in an Amazon EMR permanent cluster for clickstream analytics, and a data pipeline from transactional systems. We adopted a serverless approach using AWS Glue ETL for new data pipelines from transactional systems on the AWS cloud.
  • Data warehouse – We chose Amazon Redshift as our data warehouse, and planned on how the data would be distributed based on query patterns.
  • Visualization – We built the reports in Amazon QuickSight in phases and prioritized them based on business demand. We discussed with business users their current needs and identified the immediate reports required. We defined the phases of report and dashboard creation and built the reports in Amazon QuickSight. We plan to use embedded reports for external users in the future.
  • Machine learning – Custom ML models were deployed on Amazon SageMaker. Existing Airflow DAGs were migrated to Amazon MWAA.
  • Governance, security, and compliance – Governance with AWS Lake Formation was adopted from Day 1. We configured the AWS Glue Data Catalog to reference data used as sources and targets. We had to comply with Payment Card Industry (PCI) guidelines because payment information was in the data lake, so we ensured the necessary security policies were in place.

Solution overview

BMS modern data architecture

The following diagram illustrates our modern data architecture.

The architecture includes the following components:

  1. Source systems – These include the following:
    • Data from transactional systems stored in MariaDB (booking and transactions).
    • User interaction clickstream data via Kafka consumers to DataOps MariaDB.
    • Members and seat allocation information from MongoDB.
    • SQL Server for specific offers and payment information.
  2. Data pipeline – Spark jobs on an Amazon EMR permanent cluster process the clickstream data from Kafka clusters.
  3. Data lake – Data from source systems was stored in their respective Amazon S3 buckets, with prefixes for optimized data querying. For Amazon S3, we followed a hierarchy to store raw, summarized, and team- or service-related data in different parent folders as per the source and type of data. Lifecycle policies were added to the logs and temp folders of different services as per the teams’ requirements.
  4. Data processing – Transient Amazon EMR clusters are used for processing data into a curated format for the audience, personalization, and analytics teams. Small file merger jobs merge the clickstream data to a larger file size, which saved costs for one-time queries.
  5. Governance – AWS Lake Formation enables the usage of AWS Glue crawlers to capture the schema of data stored in the data lake and version changes in the schema. The Data Catalog and security policy in AWS Lake Formation enable access to data for roles and users in Amazon Redshift, Amazon Athena, Amazon QuickSight, and data science jobs. AWS Glue ETL jobs load the processed data to Amazon Redshift at scheduled intervals.
  6. Queries – The analytics team used Amazon Athena to perform one-time queries raised by business teams on the data lake. Because report development is in phases, Amazon Athena was used for exporting data.
  7. Data warehouse – Amazon Redshift was used as the data warehouse, where the reports for the sales teams, management, and third parties (such as theaters and events) are processed and stored for quick retrieval. Views to analyze total sales, movie sales trends, member behavior, and payment modes are configured here. We use materialized views for denormalized tables, and different schemas for metadata and for transactional and behavior data (a sketch of the materialized view pattern follows this list).
  8. Reports – We used Amazon QuickSight reports for various business, marketing, and product use cases.
  9. Machine learning – Some of the models deployed on Amazon SageMaker are as follows:
    • Content popularity – Decides the recommended content for users.
    • Live event popularity – Calculates the popularity of live entertainment events in different regions.
    • Trending searches – Identifies trending searches across regions.
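
To make the materialized view pattern in the data warehouse component above concrete, the following is a minimal sketch in Amazon Redshift SQL. The object names (fact_ticket_sales, dim_movie, mv_daily_movie_sales) are hypothetical placeholders rather than the actual BMS schema.

-- Hypothetical denormalized rollup of ticket sales by movie and region,
-- used as a direct source for Amazon QuickSight dashboards
CREATE MATERIALIZED VIEW mv_daily_movie_sales AS
SELECT d.movie_name,
       f.region,
       f.sale_date,
       SUM(f.ticket_amount) AS total_sales,
       COUNT(*) AS tickets_sold
FROM fact_ticket_sales f
JOIN dim_movie d ON d.movie_id = f.movie_id
GROUP BY d.movie_name, f.region, f.sale_date;

-- Refresh after each scheduled AWS Glue load into the base tables
REFRESH MATERIALIZED VIEW mv_daily_movie_sales;

Pointing dashboards at the materialized view avoids re-aggregating the base tables on every dashboard load.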

Walkthrough

Migration execution steps

We standardized tools, services, and processes for data engineering, analytics, and data science:

  • Data lake
    • Identified the source data to be migrated from Archival DB, BigQuery, TiDB, and the analytics database.
    • Built a canonical data model that catered to multiple business teams and reduced the copies of data, and therefore storage and operational costs. Modified existing jobs to facilitate migration to a canonical format.
    • Identified the source systems, capacity required, anticipated growth, owners, and access requirements.
    • Ran the bulk data migration to Amazon S3 from various sources.
  • Ingestion
    • Transaction systems – Retained the existing Kafka queues and consumers.
    • Clickstream data – Successfully conducted a proof of concept to use AWS IoT Core for MQTT protocol. But because we needed to make changes in the application to publish to AWS IoT Core, we decided to implement it as part of mobile application modernization at a later time. We decided to rehost the MQTT server on Amazon EC2.
  • Processing
    • Listed the data pipelines relevant to the business and migrated them with minimal modification.
    • Categorized workloads into critical jobs, redundant jobs, or jobs that can be optimized:
      • Spark jobs were migrated to Amazon EMR.
      • HBase jobs were migrated to Amazon EMR with HBase.
      • Jobs with metadata stored in Hive were modified to use the AWS Glue Data Catalog.
      • NiFi jobs were simplified and rewritten in Spark to run on Amazon EMR.
    • Amazon EMR clusters were configured with one persistent cluster for streaming the clickstream and personalization workloads, and multiple transient clusters for running all other Spark ETL or processing jobs. We used Spot Instances for task nodes to save costs, and we optimized data storage with specific jobs to merge small files and convert to compressed file formats.
    • AWS Glue crawlers identified new data in Amazon S3, and AWS Glue ETL jobs transformed and uploaded the processed data to the Amazon Redshift data warehouse.
  • Data warehouse
    • Defined the data warehouse schema by categorizing the critical reports required by the business, keeping in mind the workload and reports required in future.
    • Defined the staging area for incremental data loads into Amazon Redshift, the materialized views, and the query tuning approach based on usage. The transaction and primary metadata are stored in Amazon Redshift to cater to all data analysis and reporting requirements. We created materialized views and denormalized tables in Amazon Redshift to use as data sources for Amazon QuickSight dashboards and segmentation jobs, respectively.
    • Optimally used the Amazon Redshift cluster by loading the last two years of data into Amazon Redshift, and used Amazon Redshift Spectrum to query historical data through external tables (see the sketch after this list). This helped balance the usage and cost of the Amazon Redshift cluster.
  • Visualization
    • Amazon QuickSight dashboards were created for the sales and marketing team in Phase 1:
      • Sales summary report – An executive summary dashboard to get an overview of sales across the country by region, city, movie, theatre, genre, and more.
      • Live entertainment – A dedicated report for live entertainment vertical events.
      • Coupons – A report for coupons purchased and redeemed.
      • BookASmile – A dashboard to analyze the data for BookASmile, a charity initiative.
  • Machine learning
    • Listed the ML workloads to be migrated based on current business needs.
    • Priority ML processing jobs were deployed on Amazon EMR. Models were modified to use Amazon S3 as source and target, and new APIs were exposed to use the functionality. ML models were deployed on Amazon SageMaker for movies, live event clickstream analysis, and personalization.
    • Existing artifacts in Airflow orchestration were migrated to Amazon MWAA.
  • Security
    • AWS Lake Formation was the foundation of the data lake, with the AWS Glue Data Catalog as the foundation for the central catalog for the data stored in Amazon S3. This provided access to the data by various functionalities, including the audience, personalization, analytics, and data science teams.
    • Personally identifiable information (PII) and payment data was stored in the data lake and data warehouse, so we had to comply with PCI guidelines. Encryption of data at rest and in transit was considered and configured at each service level (Amazon S3, AWS Glue Data Catalog, Amazon EMR, AWS Glue, Amazon Redshift, and QuickSight). Clear roles, responsibilities, and access permissions for different user groups and privileges were listed and configured in AWS Identity and Access Management (IAM) and the individual services.
    • Existing single sign-on (SSO) integration with Microsoft Active Directory was used for Amazon QuickSight user access.
  • Automation
    • We used AWS CloudFormation for the creation and modification of all the core and analytics services.
    • AWS Step Functions was used to orchestrate Spark jobs on Amazon EMR.
    • Scheduled jobs were configured in AWS Glue for uploading data in Amazon Redshift based on business needs.
    • Monitoring of the analytics services was done using Amazon CloudWatch metrics, and right-sizing of instances and configuration was achieved. Spark job performance on Amazon EMR was analyzed using the native Spark logs and Spark user interface (UI).
    • Lifecycle policies were applied to the data lake to optimize the data storage costs over time.
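
The following is a minimal sketch of the hot/cold split described in the data warehouse step above: recent data lives in local Amazon Redshift tables, while historical data stays in Amazon S3 and is queried through Amazon Redshift Spectrum. The schema, table, and role names here are placeholders, not the actual BMS objects.

-- External schema over an AWS Glue Data Catalog database that holds historical data in Amazon S3
CREATE EXTERNAL SCHEMA spectrum_archive
FROM DATA CATALOG
DATABASE 'archive_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole';

-- Combine the last two years held in Amazon Redshift with older data queried in place
SELECT sale_year, SUM(amount) AS total_sales
FROM (
    SELECT sale_year, amount FROM public.sales                    -- recent data, local
    UNION ALL
    SELECT sale_year, amount FROM spectrum_archive.sales_history  -- historical data, Amazon S3
) AS s
GROUP BY sale_year;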

Benefits of a modern data architecture

A modern data architecture offered us the following benefits:

  • Scalability – We moved from a fixed infrastructure to the minimal infrastructure required, with configuration to scale on demand. Services like Amazon EMR and Amazon Redshift enable us to do this with just a few clicks.
  • Agility – We use purpose-built managed services instead of reinventing the wheel. Automation and monitoring were key considerations, which enable us to make changes quickly.
  • Serverless – Adoption of serverless services like Amazon S3, AWS Glue, Amazon Athena, AWS Step Functions, and AWS Lambda support us when our business has sudden spikes with new movies or events launched.
  • Cost savings – Our storage size was reduced by 90%. Our overall spend on analytics and ML was reduced by 80%.

Conclusion

In this post, we showed you how a modern data architecture on AWS helped BMS to easily share data across organizational boundaries. This allowed BMS to make decisions with speed and agility at scale; ensure compliance via unified data access, security, and governance; and to scale systems at a low cost without compromising performance. Working with the AWS and Minfy Technologies teams helped BMS choose the correct technology services and complete the migration in four months. BMS achieved the scalability and cost-optimization goals with this updated architecture, which has set the stage for innovation using graph databases and enhanced our ML projects to improve customer experience.


About the Authors

Mahesh Vandi Chalil is Chief Technology Officer at BookMyShow, India’s leading entertainment destination. Mahesh has over two decades of global experience and is passionate about building scalable products that delight customers while keeping innovation as the top goal, motivating his team to constantly aspire to this. Mahesh invests his energies in creating and nurturing the next generation of technology leaders and entrepreneurs, both within the organization and outside of it. He is a proud husband and father of two daughters, and plays cricket during his leisure time.

Priya Jathar is a Solutions Architect working in the Digital Native Business segment at AWS. She has more than two decades of IT experience, with expertise in application development, databases, and analytics. She is a builder who enjoys innovating with new technologies to achieve business goals, and currently helps customers migrate, modernize, and innovate in the cloud. In her free time she likes to paint and hone her gardening and cooking skills.

Vatsal Shah is a Senior Solutions Architect at AWS based out of Mumbai, India. He has more than nine years of industry experience, including leadership roles in product engineering, SRE, and cloud architecture. He currently focuses on enabling large startups to streamline their cloud operations and help them scale on the cloud. He also specializes in AI and Machine Learning use cases.

Accelerate orchestration of an ELT process using AWS Step Functions and Amazon Redshift Data API

Post Syndicated from Poulomi Dasgupta original https://aws.amazon.com/blogs/big-data/accelerate-orchestration-of-an-elt-process-using-aws-step-functions-and-amazon-redshift-data-api/

Extract, Load, and Transform (ELT) is a modern design strategy where raw data is first loaded into the data warehouse and then transformed with familiar Structured Query Language (SQL) semantics leveraging the power of massively parallel processing (MPP) architecture of the data warehouse. When you use an ELT pattern, you can also use your existing SQL workload while migrating from your on-premises data warehouse to Amazon Redshift. This eliminates the need to rewrite relational and complex SQL workloads into a new framework. With Amazon Redshift, you can load, transform, and enrich your data efficiently using familiar SQL with advanced and robust SQL support, simplicity, and seamless integration with your existing SQL tools. When you adopt an ELT pattern, a fully automated and highly scalable workflow orchestration mechanism will help to minimize the operational effort that you must invest in managing the pipelines. It also ensures the timely and accurate refresh of your data warehouse.

AWS Step Functions is a low-code, serverless, visual workflow service where you can orchestrate complex business workflows with an event-driven framework and easily develop repeatable and dependent processes. It can ensure that the long-running, multiple ELT jobs run in a specified order and complete successfully instead of manually orchestrating those jobs or maintaining a separate application.

Amazon DynamoDB is a fast, flexible NoSQL database service for single-digit millisecond performance at any scale.

This post explains how to use AWS Step Functions, Amazon DynamoDB, and Amazon Redshift Data API to orchestrate the different steps in your ELT workflow and process data within the Amazon Redshift data warehouse.

Solution overview

In this solution, we will orchestrate an ELT process using AWS Step Functions. As part of the ELT process, we will refresh the dimension and fact tables at regular intervals from staging tables, which ingest data from the source. We will maintain the current state of the ELT process (for example, Running or Ready) in an audit table maintained in Amazon DynamoDB. AWS Step Functions allows you to directly call the Data API from a state machine, reducing the complexity of running the ELT pipeline. For loading the dimension and fact tables, we will use the Amazon Redshift Data API from AWS Lambda. We will use Amazon EventBridge to schedule the state machine to run at a desired interval based on the customer’s SLA.

For a given ELT process, we will set up a JobID in a DynamoDB audit table and set the JobState as “Ready” before the state machine runs for the first time. The state machine performs the following steps:

  1. The first step in the Step Functions workflow is to pass the JobID as input to the process; the JobID is configured as 101 by default in Step Functions and DynamoDB via the CloudFormation template.
  2. The next step is to fetch the current JobState for the given JobID by running a query against the DynamoDB audit table from within a Lambda function.
  3. If JobState is “Running,” then it indicates that the previous iteration is not completed yet, and the process should end.
  4. If the JobState is “Ready,” then it indicates that the previous iteration was completed successfully and the process is ready to start. So, the next step will be to update the DynamoDB audit table to change the JobState to “Running” and JobStart to the current time for the given JobID using DynamoDB Data API within a Lambda function.
  5. The next step will be to start the dimension table load from the staging table data within Amazon Redshift using the Data API from Lambda. To achieve that, we can either call a stored procedure using the Amazon Redshift Data API, or run a series of SQL statements synchronously using the Amazon Redshift Data API within a Lambda function (a sketch of such a stored procedure follows this list).
  6. In a typical data warehouse, multiple dimension tables are loaded in parallel at the same time before the fact table gets loaded. Using Parallel flow in Step Functions, we will load two dimension tables at the same time using Amazon Redshift Data API within a Lambda function.
  7. Once the load is completed for both the dimension tables, we will load the fact table as the next step using Amazon Redshift Data API within a Lambda function.
  8. As the load completes successfully, the last step would be to update the DynamoDB audit table to change the JobState to “Ready” and JobEnd to the current time for the given JobID, using DynamoDB Data API within a Lambda function.
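
To illustrate the stored procedure option in step 5, the following is a minimal sketch of a dimension load in Amazon Redshift SQL. The table and column names (stg_customer, dim_customer) are hypothetical and are not the objects created by the CloudFormation template in this post.

-- Hypothetical upsert of a dimension table from its staging table
CREATE OR REPLACE PROCEDURE sp_load_dim_customer()
AS $$
BEGIN
    -- Remove rows that will be replaced by the incoming staging data
    DELETE FROM dim_customer
    USING stg_customer s
    WHERE dim_customer.customer_id = s.customer_id;

    -- Insert the latest version of each customer
    INSERT INTO dim_customer (customer_id, customer_name, updated_at)
    SELECT customer_id, customer_name, GETDATE()
    FROM stg_customer;

    -- Clear the staging table for the next run
    TRUNCATE stg_customer;
END;
$$ LANGUAGE plpgsql;

The Lambda function can then submit CALL sp_load_dim_customer(); through the Data API ExecuteStatement action and poll for completion.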

Components and dependencies

The following architecture diagram highlights the end-to-end solution using AWS services:

Architecture Diagram

Before diving deeper into the code, let’s look at the components first:

  • AWS Step Functions – You can orchestrate a workflow by creating a State Machine to manage failures, retries, parallelization, and service integrations.
  • Amazon EventBridge – You can run your state machine on a daily schedule by creating a Rule in Amazon EventBridge.
  • AWS Lambda – You can invoke a Lambda function to call the Amazon Redshift Data API or the DynamoDB API.
  • Amazon DynamoDB – Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB is extremely efficient in running updates, which improves the performance of metadata management for customers with strict SLAs.
  • Amazon Redshift – Amazon Redshift is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, easy, and secure analytics at scale.
  • Amazon Redshift Data API – You can access your Amazon Redshift database using the built-in Amazon Redshift Data API. Using this API, you can access Amazon Redshift data with web services–based applications, including AWS Lambda.
  • DynamoDB API – You can access your Amazon DynamoDB tables from a Lambda function by importing boto3.

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  1. An AWS account.
  2. An Amazon Redshift cluster.
  3. An Amazon Redshift customizable IAM service role with the following policies:
    • AmazonS3ReadOnlyAccess
    • AmazonRedshiftFullAccess
  4. The above IAM role associated with the Amazon Redshift cluster.

Deploy the CloudFormation template

To set up the ELT orchestration demo, complete the following steps:

  1. Sign in to the AWS Management Console.
  2. Click on Launch Stack.

  3. Click Next.
  4. Enter a suitable name in Stack name.
  5. Provide the information for the Parameters as detailed in the following table.
CloudFormation template parameter | Allowed values | Description
RedshiftClusterIdentifier | Amazon Redshift cluster identifier | Enter the Amazon Redshift cluster identifier.
DatabaseUserName | Database user name in the Amazon Redshift cluster | Amazon Redshift database user name that has access to run the SQL script.
DatabaseName | Amazon Redshift database name | Name of the Amazon Redshift primary database where the SQL script is run.
RedshiftIAMRoleARN | Valid IAM role ARN attached to the Amazon Redshift cluster | AWS IAM role ARN associated with the Amazon Redshift cluster.


  6. Click Next. On the next page, accept the default values and click Next again. On the last page, check the box to acknowledge that resources might be created and click Create stack.
  7. Monitor the progress of the stack creation and wait until it is complete.
  8. The stack creation should complete within approximately 5 minutes.
  9. Navigate to the Amazon Redshift console.
  10. Launch Amazon Redshift query editor v2 and connect to your cluster.
  11. Browse to the database name provided in the CloudFormation parameters (for example, dev), then the public schema, and expand Tables. You should see the sample tables.
  12. Validate the sample data by running the following SQL query and confirm the row counts.
select 'customer',count(*) from public.customer
union all
select 'fact_yearly_sale',count(*) from public.fact_yearly_sale
union all
select 'lineitem',count(*) from public.lineitem
union all
select 'nation',count(*) from public.nation
union all
select 'orders',count(*) from public.orders
union all
select 'supplier',count(*) from public.supplier

Run the ELT orchestration

  1. After you deploy the CloudFormation template, navigate to the stack detail page. On the Resources tab, choose the link for DynamoDBETLAuditTable to be redirected to the DynamoDB console.
  2. Navigate to Tables and click on the table name beginning with <stackname>-DynamoDBETLAuditTable. In this demo, the stack name is DemoETLOrchestration, so the table name will begin with DemoETLOrchestration-DynamoDBETLAuditTable.
  3. On the table detail page, click Explore table items.
  4. Here you can see the current status of the job, which will be in Ready status.
  5. Navigate again to the stack detail page on the CloudFormation console. On the Resources tab, choose the link for RedshiftETLStepFunction to be redirected to the Step Functions console.
  6. Click Start Execution. When it successfully completes, all steps will be marked as green.
  7. While the job is running, navigate back to DemoETLOrchestration-DynamoDBETLAuditTable in the DynamoDB console screen. You will see JobState as Running with JobStart time.
  8. After the Step Functions execution completes, JobState will be changed to Ready with the JobStart and JobEnd times.

Handling failure

In the real world, the ELT process can sometimes fail due to unexpected data anomalies or object-related issues. In that case, the Step Functions execution will also fail, with the failed step marked in red.

Once you identify and fix the issue, follow these steps to restart the step function:

  1. Navigate to the DynamoDB table beginning with DemoETLOrchestration-DynamoDBETLAuditTable. Click on Explore table items and select the row with the specific JobID for the failed job.
  2. Go to Actions and select Edit item to change the JobState to Ready.
  3. Follow steps 5 and 6 under the “Run the ELT orchestration” section to restart execution of the step function.

Validate the ELT orchestration

The step function loads the dimension tables public.supplier and public.customer and the fact table public.fact_yearly_sale. To validate the orchestration, the process steps are as follows:

  1. Navigate to the Amazon Redshift console.
  2. Launch Amazon Redshift query editor v2 and connect to your cluster.
  3. Browse to the database name provided in the CloudFormation parameters (for example, dev), then the public schema.
  4. Validate the data loaded by Step Functions by running the following SQL query and confirm the row counts:
select 'customer',count(*) from public.customer
union all
select 'fact_yearly_sale',count(*) from public.fact_yearly_sale
union all
select 'supplier',count(*) from public.supplier


Schedule the ELT orchestration

The steps are as follows to schedule the Step Functions:

  1. Navigate to the Amazon EventBridge console and choose Create rule.
  2. Under Name, enter a meaningful name, for example, Trigger-Redshift-ELTStepFunction.
  3. Under Event bus, choose default.
  4. Under Rule Type, select Schedule.
  5. Click on Next.
  6. Under Schedule pattern, select A schedule that runs at a regular rate, such as every 10 minutes.
  7. Under Rate expression, enter Value as 5 and choose Unit as Minutes.
  8. Click on Next.
  9. Under Target types, choose AWS service.
  10. Under Select a Target, choose Step Functions state machine.
  11. Under State machine, choose the step function created by the CloudFormation template.
  12. Under Execution role, select Create a new role for this specific resource.
  13. Click on Next.
  14. Review the rule parameters and click on Create Rule.

After the rule has been created, it will automatically trigger the step function every 5 minutes to perform ELT processing in Amazon Redshift.
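
Equivalently, you can express this schedule with the rate expression rate(5 minutes) when defining the rule.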

Clean up

Please note that deploying a CloudFormation template incurs cost. To avoid incurring future charges, delete the resources you created as part of the CloudFormation stack by navigating to the AWS CloudFormation console, selecting the stack, and choosing Delete.

Conclusion

In this post, we described how to easily implement a modern, serverless, highly scalable, and cost-effective ELT workflow orchestration process in Amazon Redshift using AWS Step Functions, Amazon DynamoDB and Amazon Redshift Data API. As an alternate solution, you can also use Amazon Redshift for metadata management instead of using Amazon DynamoDB. As part of this demo, we show how a single job entry in DynamoDB gets updated for each run, but you can also modify the solution to maintain a separate audit table with the history of each run for each job, which would help with debugging or historical tracking purposes. Step Functions manage failures, retries, parallelization, service integrations, and observability so your developers can focus on higher-value business logic. Step Functions can integrate with Amazon SNS to send notifications in case of failure or success of the workflow. Please follow this AWS Step Functions documentation to implement the notification mechanism.


About the Authors

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Run a popular benchmark on Amazon Redshift Serverless easily with AWS Data Exchange

Post Syndicated from Jon Roberts original https://aws.amazon.com/blogs/big-data/run-a-popular-benchmark-on-amazon-redshift-serverless-easily-with-aws-data-exchange/

Amazon Redshift is a fast, easy, secure, and economical cloud data warehousing service designed for analytics. AWS announced Amazon Redshift Serverless general availability in July 2022, providing an easier experience to operate Amazon Redshift. Amazon Redshift Serverless makes it simple to run and scale analytics without having to manage your data warehouse infrastructure. Amazon Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use.

Amazon Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access external data in open file formats like CSV and Parquet stored in Amazon S3. For more information on RPU pricing, refer to Amazon Redshift pricing.
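
For example, a 128-RPU workgroup that runs queries for 15 minutes in a given hour accrues 128 × 0.25 = 32 RPU-hours, billed at the per-RPU-hour rate for your Region.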

AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. With AWS Data Exchange for Amazon Redshift, customers can start querying, evaluating, analyzing, and joining third-party data with their own first-party data without requiring any extracting, transforming, and loading (ETL). Data providers can list and offer products containing Amazon Redshift datasets in the AWS Data Exchange catalog, granting subscribers direct, read-only access to the data stored in Amazon Redshift. This feature empowers customers to quickly query, analyze, and build applications with these third-party datasets.

TPC-DS is a commonly used benchmark for measuring the query performance of data warehouse solutions such as Amazon Redshift. The benchmark is useful for proving the ability to execute simple to complex queries in a timely manner. It is also used to measure performance across different database configurations and different concurrent workloads, and to compare against other database products.

This blog post walks you through the steps you’ll need to set up Amazon Redshift Serverless and run the SQL queries derived from the TPC-DS benchmark against data from the AWS Data Exchange.

Solution overview

We will get started by creating an Amazon Redshift Serverless workgroup and namespace. A namespace is a collection of database objects and users while a workgroup is a collection of compute resources. To simplify executing the benchmark queries, a Linux EC2 instance will also be deployed.

Next, a GitHub repo containing the TPC-DS derived queries will be used. The TPC-DS benchmark is frequently used for evaluating the performance and functionality of cloud data warehouses. The TPC-DS benchmark includes additional steps and requirements to be considered official, but for this blog post, we are focused on only executing the SQL SELECT queries from this benchmark.

The last component of the solution is data. The TPC-DS benchmark includes binaries for generating data, but this is time-consuming to run. We have avoided this problem by generating the data, and we have made this available freely in the AWS Data Exchange.

Automated setup: CloudFormation


Choose Launch Stack to launch the CloudFormation stack, which will automate the deployment of resources needed for the demo. The template deploys the following resources in your default VPC:

  • Amazon Elastic Compute Cloud (Amazon EC2) instance with the latest version of Amazon Linux
  • Amazon Redshift Serverless workgroup and namespace
  • IAM role with redshift-serverless:GetWorkgroup action granted; this is attached to the EC2 instance so that a command line interface (CLI) command can run to complete the instance configuration
  • Security group with inbound port 22 (ssh) and connectivity between the EC2 instance and the Amazon Redshift Serverless workgroup
  • The GitHub repo is downloaded in the EC2 instance

Template parameters

  • Stack: CloudFormation term used to define all of the resources created by the template.
  • KeyName: This is the name of an existing key pair. If you don’t have one already, create a key pair that is used to connect by SSH to the EC2 instance. For more information, see key pairs in the Amazon EC2 documentation.
  • SSHLocation: The CIDR mask for incoming connections to the EC2 instance. The default is 0.0.0.0/0, which means any IP address is allowed to connect by SSH to the EC2 instance, provided the client has the key pair private key. The best practice for security is to limit this to a smaller range of IP addresses. For example, you can use sites like www.whatismyip.com to get your IP address.
  • RedshiftCapacity: This is the number of RPUs for the Amazon Redshift Serverless workgroup. The default is 128 and is recommended for analyzing the larger TPC-DS datasets. You can update the RPU capacity after deployment if you like or redeploy it with a different capacity value.

Manual setup

If you choose not to use the CloudFormation template, deploy the EC2 instance and Amazon Redshift Serverless with the following instructions. The following steps are only needed if you are manually provisioning the resources rather than using the provided CloudFormation template.

Amazon Redshift setup

Here are the high-level steps to create an Amazon Redshift Serverless workgroup. You can get more detailed information from this News Blog post.

To create your workgroup, complete the following steps:

  1. On the Amazon Redshift console, navigate to the Amazon Redshift Serverless dashboard.
  2. Choose Create workgroup.
  3. For Workgroup name, enter a name.
  4. Choose Next.
  5. For Namespace, enter a unique name.
  6. Choose Next.
  7. Choose Create.

These steps create an Amazon Redshift Serverless workgroup with 128 RPUs. This is the default; you can easily adjust this up or down based on your workload and budget constraints.

Linux EC2 instance setup

  • Deploy a virtual machine in the same AWS Region as your Amazon Redshift database using the Amazon Linux 2 AMI.
  • The Amazon Linux 2 AMI (64-bit x86) with the t2.micro instance type is an inexpensive and tested configuration.
  • Add the security group configured for your Amazon Redshift database to your EC2 instance.
  • Install psql with sudo yum install postgresql.x86_64 -y
  • Download this GitHub repo.
    git clone --depth 1 https://github.com/aws-samples/redshift-benchmarks /home/ec2-user/redshift-benchmarks

  • Set the following environment variables for Amazon Redshift:
    • PGHOST: This is the endpoint for the Amazon Redshift database.
    • PGPORT: This is the port the database listens on. The Amazon Redshift default is 5439.
    • PGUSER: This is your Amazon Redshift database user.
    • PGDATABASE: This is the database name where your external schemas are created. This is NOT the database created for the data share. We suggest using the default “dev” database.

Example:

export PGUSER="awsuser" 
export PGHOST="default.01.us-east-1.redshift-serverless.amazonaws.com" 
export PGPORT="5439" 
export PGDATABASE="dev"

Configure the .pgpass file to store your database credentials. The format for the .pgpass file is: hostname:port:database:user:password

Example:

default.01.us-east-1.redshift-serverless.amazonaws.com:5439:*:user1:user1P@ss
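
Note that psql ignores the .pgpass file unless its permissions restrict access to the file owner, which you can set with chmod 600 ~/.pgpass.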

AWS Data Exchange setup

AWS Data Exchange provides third-party data in multiple data formats, including Amazon Redshift. You can subscribe to catalog listings in multiple storage locations like Amazon S3 and Amazon Redshift data shares. We encourage you to explore the AWS Data Exchange catalog on your own because there are many datasets available that can be used to enhance your data in Amazon Redshift.

First, subscribe to the AWS Marketplace listing for TPC-DS Benchmark Data. Select the Continue to subscribe button from the AWS Data Exchange catalog listing. After you review the offer and Terms and Conditions of the data product, choose Subscribe. Note that you will need the appropriate IAM permissions to subscribe to AWS Data Exchange on Marketplace. More information can be found at AWS managed policies for AWS Data Exchange.

TPC-DS uses 24 tables in a dimensional model that simulates a decision support system. It has store, catalog, and web sales as well as store, catalog, and web returns fact tables. It also has the dimension tables to support these fact tables.

TPC-DS includes a utility to generate data for the benchmark at a given scale factor. The smallest scale factor is 1 GB (uncompressed). Most benchmark tests for cloud warehouses are run with 3–10 TB of data because the dataset is large enough to stress the system but also small enough to complete the entire test in a reasonable amount of time.

There are six database schemas provided in the TPC-DS Benchmark Data subscription with 1; 10; 100; 1,000; 3,000; and 10,000 scale factors. The scale factor refers to the uncompressed data size measured in GB. Each schema refers to a dataset with the corresponding scale factor.

Scale factor (GB) | ADX schema | Amazon Redshift Serverless external schema
1 | tpcds1 | ext_tpcds1
10 | tpcds10 | ext_tpcds10
100 | tpcds100 | ext_tpcds100
1,000 | tpcds1000 | ext_tpcds1000
3,000 | tpcds3000 | ext_tpcds3000
10,000 | tpcds10000 | ext_tpcds10000

The following steps create external schemas in your Amazon Redshift Serverless database that map to the schemas found in the AWS Data Exchange.

  • Log in to the EC2 instance and create a database connection.
    psql

  • Run the following query:
    select share_name, producer_account, producer_namespace from svv_datashares;

  • Use the output of this query to run the next command:
    create database tpcds_db from datashare <share_name> of account '<producer_account>' namespace '<producer_namespace>';

  • Last, you create the external schemas in Amazon Redshift:
    create external schema ext_tpcds1 from redshift database tpcds_db schema tpcds1;
    create external schema ext_tpcds10 from redshift database tpcds_db schema tpcds10;
    create external schema ext_tpcds100 from redshift database tpcds_db schema tpcds100;
    create external schema ext_tpcds1000 from redshift database tpcds_db schema tpcds1000;
    create external schema ext_tpcds3000 from redshift database tpcds_db schema tpcds3000;
    create external schema ext_tpcds10000 from redshift database tpcds_db schema tpcds10000;

  • You can now exit psql with this command:
    \q
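
If you want to confirm that the data share is accessible, you can reconnect with psql and count rows in one of the shared tables. The following is an illustrative check rather than part of the benchmark scripts; store_sales is a standard TPC-DS fact table, so the query should return a non-zero count:

select count(*) from ext_tpcds1.store_sales;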

TPC-DS derived benchmark

The TPC-DS derived benchmark consists of 99 queries in four broad categories:

  • Reporting queries
  • Ad hoc queries
  • Iterative OLAP queries
  • Data mining queries

In addition to running the 99 queries, the benchmark tests concurrency. During the concurrency portion of the test, there are n sessions (default of 5) that run the queries. Each session runs the 99 queries with different parameters and in slightly different order. This concurrency test stresses the resources of the database and generally takes longer to complete than just a single session running the 99 queries.

Some data warehouse products are configured to optimize single-user performance, whereas others may not have the ability to manage the workload effectively. This is a great way to demonstrate the workload management and stability of Amazon Redshift.

Since the data for each scale factor is located in a different schema, running the benchmark against each scale factor requires changing the schema you are referencing. The search_path defines which schemas to search for tables when a query contains objects without a schema included. For example:

ALTER USER <username> SET search_path=ext_tpcds3000,public;

The benchmark scripts set the search_path automatically.

Note: The scripts create a schema called tpcds_reports which will store the detailed output of each step of the benchmark. Each time the scripts are run, this schema will be recreated, and the latest results will be stored. If you happen to already have a schema named tpcds_reports, these scripts will drop the schema.

Running the TPC-DS derived queries

  • Connect by SSH to the EC2 instance with your key pair.
  • Change directory:
    cd ~/redshift-benchmarks/adx-tpc-ds/

  • Optionally configure the variables for the scripts in tpcds_variables.sh

Here are the default values for the variables you can set:

  • EXT_SCHEMA="ext_tpcds3000": This is the name of the external schema created that has the TPC-DS dataset. The “3000” value means the scale factor is 3000 or 3 TB (uncompressed).
  • EXPLAIN="false": If set to false, queries will run. If set to true, queries will generate explain plans rather than actually running. Each query will be logged in the log directory. Default is false.
  • MULTI_USER_COUNT="5": The number of concurrent users (0 to 20) that run the queries. The order of the queries was set with dsqgen. Setting this to 0 skips the multi-user test. Default is 5.
  • Run the benchmark:
    ./rollout.sh > rollout.log 2>&1 &

TPC-DS derived benchmark results

We performed a test with the 3 TB ADX TPC-DS dataset on an Amazon Redshift Serverless workgroup with 128 RPUs. Additionally, we disabled query caching so that query results aren’t cached. This allows us to measure the performance of the database as opposed to its ability to serve results from cache.
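
The benchmark scripts take care of this configuration, but for illustration, result caching can be turned off for a session in Amazon Redshift as follows:

SET enable_result_cache_for_session TO off;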

The test comprises two sections. The first section runs the 99 queries serially using one user, while the second section starts multiple sessions based on the configuration file you set earlier. Each session runs concurrently, and each runs all 99 queries but in a different order.

The total runtime for the single-user queries was 15 minutes and 11 seconds. As shown in the following graph, the longest-running query was query 67, with an elapsed time of only 101 seconds. The average runtime was only 9.2 seconds.

1 Session TPC-DS 3TB 128 RPUs

With five concurrent users, the runtime was 28 minutes and 35 seconds, which demonstrates how Amazon Redshift Serverless performs well for single-user and concurrent-user workloads.

5 Concurrent Sessions TPC-DS 3TB 128 RPUs

As you can see, it was pretty easy to deploy Amazon Redshift Serverless, subscribe to an AWS Data Exchange product listing, and run a fairly complex benchmark in a short amount of time.

Next steps

You can run the benchmark scripts again but with different dataset sizes or a different number of concurrent users by editing the tpcds_variables.sh file. You can also try resizing your Amazon Redshift Serverless workgroup to see the performance difference with more or fewer RPUs. You can also run individual queries to see the results firsthand.

Another thing to try is to subscribe to other AWS Data Exchange products and query this data from Amazon Redshift Serverless. Be curious and explore using Amazon Redshift Serverless and the AWS Data Exchange!

Clean up

If you deployed the resources with the automated solution, you just need to delete the stack created in CloudFormation. All resources created by the stack will be deleted automatically.

If you deployed the resources manually, you need to delete the following:

  • The Amazon Redshift database created earlier.
    • If you deployed Amazon Redshift Serverless, you will need to delete both the workgroup and the namespace.
  • The Amazon EC2 instance.

Optionally, you can unsubscribe from the TPC-DS data by going to your AWS Data Exchange Subscriptions and then turning Renewal to Off.

Conclusion

This blog post covered deploying Amazon Redshift Serverless, subscribing to an AWS Data Exchange product, and running a complex benchmark in a short amount of time. Amazon Redshift Serverless can handle high levels of concurrency with very little effort and excels in price-performance.

If you have any questions or feedback, please leave them in the comments section.


About the author

Jon Roberts is a Sr. Analytics Specialist based out of Nashville, specializing in Amazon Redshift. He has over 27 years of experience working in relational databases. In his spare time, he runs.

Code conversion from Greenplum to Amazon Redshift: Handling arrays, dates, and regular expressions

Post Syndicated from Jagrit Shrestha original https://aws.amazon.com/blogs/big-data/code-conversion-from-greenplum-to-amazon-redshift-handling-arrays-dates-and-regular-expressions/

Amazon Redshift is a fully managed service for data lakes, data analytics, and data warehouses for startups, medium enterprises, and large enterprises. Amazon Redshift is used by tens of thousands of businesses around the globe for modernizing their data analytics platform.

Greenplum is an open-source, massively parallel database used for analytics, mostly for on-premises infrastructure. Greenplum is based on the PostgreSQL database engine.

Many customers have found migration to Amazon Redshift from Greenplum an attractive option instead of managing on-premises Greenplum, for a number of reasons.

Even though both Greenplum and Amazon Redshift use the open-source PostgreSQL database engine, migration still requires a lot of planning and manual intervention. This post covers the key functions and considerations while performing code conversion from Greenplum to Amazon Redshift. It is focused on the migration of procedures, functions, and views.

Solution overview

AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT) can migrate most of the objects in a heterogeneous database migration from Greenplum to Amazon Redshift. But there are some situations where code conversion teams encounter errors and warnings for views, procedures, and functions while creating them in Amazon Redshift. To address this type of situation, manual conversion of the code is required.

This post focuses on how to handle the following while migrating from Greenplum to Amazon Redshift:

  • Arrays
  • Dates and timestamps
  • Regular expressions (regex)

Please note that for this post, we use Greenplum 4.3 and Amazon Redshift PostgreSQL 8.2.

Working with array functions

The AWS SCT doesn’t convert array functions while migrating from Greenplum or PostgreSQL to Amazon Redshift. Developers need to extensively convert those functions manually. This post outlines the most common array functions:

  • ARRAY_UPPER
  • JSON_EXTRACT_ARRAY_ELEMENT_TEXT and JSON_ARRAY_LENGTH
  • UNNEST ()
  • STRING_AGG()
  • ANY ARRAY()

ARRAY_UPPER()

This function returns the upper bound of an array. It can be used to extract the nth element from an array in PostgreSQL or Greenplum.

The Greenplum code is as follows:

With temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
array['"111-222-3333"','"101-201-3001"','"XXX-YYY-ZZZZ"','NULL'] as PhoneNumbers
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
array['222-333-4444','201-301-4001','AAA-BBB-CCCC'] as PhoneNumbers
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
array['333-444-5555','301-401-3001','DDD-EEE-FFFF'] as PhoneNumbers
)
Select FirstName, PhoneNumbers[ARRAY_UPPER(PhoneNumbers,1)] as LastElementFromArray
from temp1

There is no function to extract an element from an array in Amazon Redshift; however, there are two JSON functions that can be used for this purpose:

  • JSON_EXTRACT_ARRAY_ELEMENT_TEXT() – Returns a JSON array element in the outermost array of a JSON string
  • JSON_ARRAY_LENGTH() – Returns the number of elements in the outer array of a JSON string

See the following code:

With temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
array['"111-222-3333"','"101-201-3001"','"XXX-YYY-ZZZZ"'] as PhoneNumbers
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
array['"222-333-4444"','"201-301-4001"','"AAA-BBB-CCCC"'] as PhoneNumbers
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
array['"333-444-5555"','"301-401-3001"','"DDD-EEE-FFFF"'] as PhoneNumbers
)

Select
FirstName
,('['+array_to_string(phoneNumbers,',')+']') as JSONConvertedField
,JSON_EXTRACT_ARRAY_ELEMENT_TEXT
(
'['+array_to_string(phoneNumbers,',')+']'
,JSON_ARRAY_LENGTH('['+array_to_string(phoneNumbers,',')+']')-1
) as LastElementFromArray
from temp1

UNNEST()

UNNEST() is PostgreSQL’s system function for semi-structured data, expanding an array, or a combination of arrays, to a set of rows. It was introduced to improve database performance when inserting, updating, or deleting thousands of records.

You can use UNNEST() with a basic array, multiple arrays, and multiple arrays of different lengths.

Some of Amazon Redshift functions used to unnest arrays are split_part, json_extract_path_text, json_array_length, and json_extract_array_element_text.

In Greenplum, the UNNEST function is used to expand an array to a set of rows:

Select 'A', unnest(array[1,2])

Output
A 1
A 2

with temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
'111-222-3333' as Mobilephone,'101-201-3001' as HomePhone
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
'222-333-4444' as Mobilephone,'201-301-4001' as HomePhone
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
'333-444-5555' as Mobilephone,'301-401-3001' as HomePhone
)

select
FirstName
,LastName
,unnest(array['Mobile'::text,'HomePhone'::text]) as PhoneType
,unnest(array[MobilePhone::text,HomePhone::text]) as PhoneNumber
from
temp1
order by 1,2,3

Amazon Redshift doesn’t support the UNNEST function; you can use the following workaround:

with temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
'111-222-3333' as Mobilephone,'101-201-3001' as HomePhone
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
'222-333-4444' as Mobilephone,'201-301-4001' as HomePhone
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
'333-444-5555' as Mobilephone,'301-401-3001' as HomePhone
),
ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select
FirstName
,LastName
,split_part('Mobile,Home',',',ns.n::int) as PhoneType
,split_part(MobilePhone|| '&&' || HomePhone, '&&', ns.n::int) as PhoneNumber
from
temp1, ns
where
ns.n<=regexp_count('Mobile,Home',',')+1
order by 1,2,3

When the array is available in the form of a JSON array string, use the JSON_EXTRACT_ARRAY_ELEMENT_TEXT() function together with JSON_ARRAY_LENGTH():

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["arrayelement1","arrayelement2"]',ns.n-1)
from ns
where
ns.n<=JSON_ARRAY_LENGTH('["arrayelement1","arrayelement2"]')

STRING_AGG()

The STRING_AGG() function is an aggregate function that concatenates a list of strings and places a separator between them. The function doesn’t add the separator at the end of the string. See the following code:

STRING_AGG ( expression, separator [order_by_clause] )

The Greenplum code is as follows:

with temp1 as
(
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Doe'::text as LastName
union all
Select 'Finance'::text as Dept, 'Mary'::text as FirstName, 'Jane'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Bob'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Steve'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Account'::text as Dept, 'Phil'::text as FirstName, 'Adams'::text as LastName
union all
Select 'Account'::text as Dept, 'Jim'::text as FirstName, 'Smith'::text as LastName
)
Select dept,STRING_AGG(FirstName||' '||LastName,' ; ') as Employees from temp1 group by dept order by 1

The Amazon Redshift equivalent for the STRING_AGG() function is LISTAGG(). This aggregate function orders the rows for that group according to the ORDER BY expression, then concatenates the values into a single string:

LISTAGG(expression, separator [order_by_clause])

See the following code:

Create temporary Table temp1 as
(
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Doe'::text as LastName
union all
Select 'Finance'::text as Dept, 'Mary'::text as FirstName, 'Jane'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Bob'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Steve'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Account'::text as Dept, 'Phil'::text as FirstName, 'Adams'::text as LastName
union all
Select 'Account'::text as Dept, 'Jim'::text as FirstName, 'Smith'::text as LastName
)

Select dept,LISTAGG(FirstName||' '||LastName,' ; ') as Employees from temp1
group by dept
order by 1

ANY ARRAY()

The PostgreSQL ANY ARRAY() construct evaluates the left-hand expression and compares it to each element in the array:

Select * from temp1 where DeptName = ANY (ARRAY['10-F','20-F','30-F'])

In Amazon Redshift, the evaluation can be achieved with an IN operator:

Select * from temp1 where DeptName IN ('10-F','20-F','30-F')

Working with date functions

In this section, we discuss calculating the difference between date_part for Greenplum and datediff for Amazon Redshift.

When the application needs to calculate the difference between the subfields of dates for Greenplum, it uses the function date_part, which allows you to retrieve subfields such as year, month, week, and day. In the following example queries, we calculate the number of completion_days by calculating the difference between originated_date and eco_date.

To calculate the difference between the subfields of the date, Amazon Redshift has the function datediff. The following queries show an example of how to calculate the completion_days as the difference between eco_date and originated_date. DATEDIFF determines the number of date part boundaries that are crossed between the two expressions.

We compare the Greenplum and Amazon Redshift queries as follows:

  • Difference by year

The following Greenplum query returns 1 year between 2008-12-31 and 2009-01-01:

SELECT date_part('year', TIMESTAMP '2009-01-01') - date_part('year', TIMESTAMP '2008-12-31') as year;

The following Amazon Redshift query returns 1 year between 2008-12-31 and 2009-01-01:

SELECT datediff(year, '2008-12-31', '2009-01-01') as year;
  • Difference by month

The following Greenplum query returns 1 month between 2009-01-01 and 2008-12-31:

SELECT (date_part('year', '2009-01-01'::date) - date_part('year', '2008-12-31'::date)) * 12 + (date_part('month', '2009-01-01'::date) - date_part('month', '2008-12-31'::date)) as month;

The following Amazon Redshift query returns 1 month between 2009-01-01 and 2008-12-31:

SELECT datediff(month, '2008-12-31', '2009-01-01') as month;
  • Difference by week

The following Greenplum query returns 0 weeks between 2008-12-31 and 2009-01-01:

SELECT date_part('week', timestamp '2009-01-01') - date_part('week', timestamp '2008-12-31') as week;

The following Amazon Redshift query returns 0 weeks between 2008-12-31 and 2009-01-01:

SELECT datediff(week, '2008-12-31', '2009-01-01') as week;
  • Difference by day

The following Greenplum query returns 1 day:

SELECT date_part('day', '2009-01-01 24:00:00'::timestamp - '2008-12-31 24:00:00'::timestamp) as day;

The following Amazon Redshift query returns 1 day:

SELECT datediff(day, '2008-12-31', '2009-01-01') as day;
  • Difference by hour

The following Greenplum query returns 1 hour:

SELECT date_part('hour', '2009-01-01 22:56:10'::timestamp - '2008-12-31 21:54:55'::timestamp) as hours;

The following Amazon Redshift query returns 1 hour:

SELECT datediff(hour, '2009-01-01 21:56:10', '2009-01-01 22:56:10') as hours;
  • Difference by minute

The following Greenplum query returns 3 minutes:

SELECT date_part('minute', '2009-01-01 22:56:10'::timestamp - '2009-01-01 21:53:10'::timestamp) as minutes;

The following Amazon Redshift query returns 1 minute:

SELECT datediff(minute, '2009-01-01 21:56:10', '2009-01-01 21:57:55') as minute;
  • Difference by second

The following Greenplum query returns 40 seconds:

SELECT date_part('second', '2009-01-01 22:56:50'::timestamp - '2009-01-01 21:53:10'::timestamp) as seconds;

The following Amazon Redshift query returns 45 seconds:

SELECT datediff(second, '2009-01-01 21:56:10', '2009-01-01 21:56:55') as seconds;

Now let’s look at how we use Amazon Redshift to calculate days and weeks in seconds.

The following Amazon Redshift query displays 2 days:

SELECT datediff(second, '2008-12-30 21:56:10', '2009-01-01 21:56:55')/(60*60*24) as days;

The following Amazon Redshift query displays 9 weeks:

SELECT datediff(second, '2008-10-30 21:56:10', '2009-01-01 21:56:55')/(60*60*24*7) as weeks;

For Greenplum, the date subfields need to be in single quotes, whereas for Amazon Redshift we can use date subfields such as year, month, week, day, minute, and second without quotes. For Greenplum, we have to subtract one date_part expression from another, whereas for Amazon Redshift we pass the two dates as separate, comma-separated arguments to datediff.

Extract ISOYEAR from date

The ISO 8601 year (ISOYEAR) is a week-numbering year. It begins with the Monday of the week containing January 4th, so for dates in early January or late December, the ISO year may differ from the Gregorian year. An ISO year has 52 or 53 full weeks (364 or 371 days); the extra week is called a leap week, and a year containing one is called a leap year.

The following Greenplum query displays the ISOYEAR 2020:

SELECT extract(ISOYEAR from '2019-12-30'::date) as ISOYEARS;

The following Amazon Redshift query displays the ISOYEAR 2020:

SELECT to_char('2019-12-30'::date, 'IYYY') as ISOYEARS;

Function to generate_series()

Greenplum has adopted the PostgreSQL function generate_series(). But the generate_series function works differently with Amazon Redshift while retrieving records from the table because it’s a leader node-only function.

To display a series of numbers in Amazon Redshift, run the following query on the leader node. In this example, it displays 10 rows, numbered 1–10:

SELECT generate_series(1,10);

To display a series of days for a given date, use the following query. It extracts the day from the given date and subtracts 1, to display a series of numbers from 0–6:

SELECT generate_series(0, extract(day from date '2009-01-07') :: int - 1);

However, for queries that fetch records from a table, join to another table’s rows, and process data on the compute nodes, generate_series doesn’t work and fails with an Invalid Operation error. The following code is an example of a SQL statement that works for Greenplum but fails for Amazon Redshift:

SELECT column_1
FROM table_1 t1
JOIN table_2 t2
ON t2.code = t1.code
CROSS JOIN generate_series(1,12) gen(fiscal_month)
WHERE condition_1;

For Amazon Redshift, the solution is to create a table to store the series data, and rewrite the code as follows:

SELECT column_1
FROM table_1 t1
JOIN table_2 t2
ON t2.code = t1.code
CROSS JOIN (SELECT "number" AS fiscal_month FROM table_t3 WHERE "number" <= 12) gen
WHERE condition_1;
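
If you don’t already have a number series table, one simple way to create and populate table_t3 (the helper table referenced above) is as follows:

CREATE TABLE table_t3 ("number" int);
INSERT INTO table_t3 ("number")
VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12);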

Working with regular expressions (regex functions)

Amazon Redshift and Greenplum both support three conditions for pattern matching:

  • LIKE
  • SIMILAR TO
  • POSIX operators

In this post, we don’t discuss all of these pattern matching conditions in detail. Instead, we discuss a few regex functions and regex escape characters that aren’t supported by Amazon Redshift.

Regexp_split_to_table function

The regexp_split_to_table function splits a string using a POSIX regular expression pattern as the delimiter.

This function has the following syntax:

regexp_split_to_table(string, pattern [, flags])

For Greenplum, we use the following query:

select regexp_split_to_table('bat,cat,hat', '\,') as regexp_split_table_GP;

For Amazon Redshift, the regexp_split_to_table function has to be converted using the Amazon Redshift split_part function. For example, the following query produces one row per delimited value by joining the string against a small number series:

SELECT split_part('bat,cat,hat', ',', n) as regexp_split_table_RS
FROM (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3) numbers
WHERE split_part('bat,cat,hat', ',', n) <> '';

Another way to convert regexp_split_to_table is to reuse a persisted number series table (such as table_t3 from the earlier generate_series example) and limit it by the number of delimited values:

SELECT split_part('bat,cat,hat', ',', "number") as regexp_split_table_RS
FROM table_t3
WHERE "number" <= regexp_count('bat,cat,hat', ',') + 1;

Substring from regex expressions

The substring(string from pattern) form extracts the substring or value matching the POSIX pattern that is passed in. If there is no match, null is returned. For more information, refer to Pattern Matching.

We use the following code in Greenplum:

create temp table data1 ( col1 varchar );
insert into data1 values ('hellohowareyou 12\687687abcd');
select substring( col1 from '[A-Za-z]+$') from data1;

We can use the regexp_substr function to convert this code to Amazon Redshift. It returns the characters extracted from a string by searching for a regular expression pattern. The syntax is as follows:

REGEXP_SUBSTR( source_string, pattern [, position [, occurrence [, parameters ] ] ] )

select regexp_substr(col1, '[A-Za-z]+$') as substring_from_rs from data1;

Key points while converting regular expression escapes

The Postgres escape character E doesn’t work in Amazon Redshift. Additionally, the following Greenplum regular expression constraint escapes aren’t supported in Amazon Redshift:

  • \m – Matches only at the beginning of a word
  • \y – Matches only at the beginning or end of a word

For Amazon Redshift, use \\< and \\>, or [[:<:]] and [[:>:]] instead.

Use the following code for Greenplum:

select col1,
case
when (col1) ~ E'\\m[0-9]{2}[A-Z]{1}[0-9]{1}' then
regexp_replace(col1, E'([0-9]{2})([A-Z]{1})([0-9]{1})', E'\\2')
else 'nothing'
end as regex_test
from temp1123

Use the following code for Amazon Redshift:

select col1,
case
when (col1) ~ '\\<[0-9]{2}[A-Z]{1}[0-9]{1}\\>' then
regexp_replace(col1,'([0-9]{2})([A-Z]{1})([0-9]{1})','\\2')
else 'nothing'
end as regex_test
from temp1123

OR

select col1,
case
when (col1) ~ '[[:<:]][0-9]{2}[A-Z]{1}[0-9]{1}[[:>:]]' then
regexp_replace(col1,'([0-9]{2})([A-Z]{1})([0-9]{1})','\\2')
else 'nothing'
end as regex_test
from temp1123
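
To try these expressions end to end, you can create the temp1123 table referenced above with a couple of sample rows; the sample values are assumptions for illustration:

create temp table temp1123 (col1 varchar(100));
insert into temp1123 values ('12A3 route code'), ('no match here');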

Conclusion

For heterogeneous database migration from Greenplum to Amazon Redshift, you can use AWS DMS and AWS SCT to migrate most of the database objects, such as tables, views, stored procedures, and functions.

There are some situations in which a function is used in the source environment and the target environment doesn’t support the same function. In this case, manual conversion is required to produce the same result set and complete the database migration.

In some cases, using a window function supported by the target environment proves more efficient for analytical queries that process petabytes of data.

This post covered several situations where manual code conversion is required, and showed how such conversion can also make the converted queries more efficient.

If you have any questions or suggestions, please share your feedback.


About the Authors

Jagrit Shrestha is a Database consultant at Amazon Web Services (AWS). He works as a database specialist helping customers migrate their on-premises database workloads to AWS and provide technical guidance.

Ishwar Adhikary is a Database Consultant at Amazon Web Services (AWS). He works closely with customers to modernize their database and application infrastructures. His focus area is the migration of relational databases from on-premises data centers to the AWS Cloud.

Shrenik Parekh works as a Database Consultant at Amazon Web Services (AWS). He specializes in database migration assessment, database migration, and modernizing database environments with purpose-built databases using AWS Cloud database services. He is also focused on AWS services for data analytics. In his spare time, he loves hiking, yoga, and other outdoor activities.

Santhosh Meenhallimath is a Data Architect at AWS. He works on building analytical solutions, building data lakes, and migrating databases into AWS.

Near-real-time fraud detection using Amazon Redshift Streaming Ingestion with Amazon Kinesis Data Streams and Amazon Redshift ML

Post Syndicated from Praveen Kadipikonda original https://aws.amazon.com/blogs/big-data/near-real-time-fraud-detection-using-amazon-redshift-streaming-ingestion-with-amazon-kinesis-data-streams-and-amazon-redshift-ml/

The importance of data warehouses and analytics performed on data warehouse platforms has been increasing steadily over the years, with many businesses coming to rely on these systems as mission-critical for both short-term operational decision-making and long-term strategic planning. Traditionally, data warehouses are refreshed in batch cycles, for example, monthly, weekly, or daily, so that businesses can derive various insights from them.

Many organizations are realizing that near-real-time data ingestion along with advanced analytics opens up new opportunities. For example, a financial institute can predict if a credit card transaction is fraudulent by running an anomaly detection program in near-real-time mode rather than in batch mode.

In this post, we show how Amazon Redshift can deliver streaming ingestion and machine learning (ML) predictions all in one platform.

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL.

Amazon Redshift ML makes it easy for data analysts and database developers to create, train, and apply ML models using familiar SQL commands in Amazon Redshift data warehouses.

We’re excited to launch Amazon Redshift Streaming Ingestion for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), which enables you to ingest data directly from a Kinesis data stream or Kafka topic without having to stage the data in Amazon Simple Storage Service (Amazon S3). Amazon Redshift streaming ingestion allows you to achieve low latency in the order of seconds while ingesting hundreds of megabytes of data per second into your data warehouse.

This post demonstrates how Amazon Redshift, the cloud data warehouse, allows you to build near-real-time ML predictions by using Amazon Redshift streaming ingestion and Redshift ML features with familiar SQL.

Solution overview

By following the steps outlined in this post, you’ll be able to set up a producer streamer application on an Amazon Elastic Compute Cloud (Amazon EC2) instance that simulates credit card transactions and pushes data to Kinesis Data Streams in real time. You set up an Amazon Redshift Streaming Ingestion materialized view on Amazon Redshift, where streaming data is received. You train and build a Redshift ML model to generate real-time inferences against the streaming data.

The following diagram illustrates the architecture and process flow.

The step-by-step process is as follows:

  1. The EC2 instance simulates a credit card transaction application, which inserts credit card transactions into the Kinesis data stream.
  2. The data stream stores the incoming credit card transaction data.
  3. An Amazon Redshift Streaming Ingestion materialized view is created on top of the data stream, which automatically ingests streaming data into Amazon Redshift.
  4. You build, train, and deploy an ML model using Redshift ML. The Redshift ML model is trained using historical transactional data.
  5. You transform the streaming data and generate ML predictions.
  6. You can alert customers or update the application to mitigate risk.

This walkthrough uses credit card transaction streaming data. The credit card transaction data is fictitious and is based on a simulator. The customer dataset is also fictitious and is generated with some random data functions.

Prerequisites

  1. Create an Amazon Redshift cluster.
  2. Configure the cluster to use Redshift ML.
  3. Create an AWS Identity and Access Management (IAM) user.
  4. Update the IAM role attached to the Redshift cluster to include permissions to access the Kinesis data stream. For more information about the required policy, refer to Getting started with streaming ingestion.
  5. Create an m5.4xlarge EC2 instance. We tested the producer application with an m5.4xlarge instance, but you are free to use another instance type. When creating the instance, use the amzn2-ami-kernel-5.10-hvm-2.0.20220426.0-x86_64-gp2 AMI.
  6. To make sure that Python3 is installed on the EC2 instance, run the following command to verify your Python version (note that the data extraction script only works on Python 3):
python3 --version
  7. Install the following dependent packages to run the simulator program:
sudo yum install python3-pip
pip3 install numpy
pip3 install pandas
pip3 install matplotlib
pip3 install seaborn
pip3 install boto3
  8. Configure Amazon EC2 using the AWS credentials generated for the IAM user created in step 3 above. The following screenshot shows an example using aws configure.

Set up Kinesis Data Streams

Amazon Kinesis Data Streams is a massively scalable and durable real-time data streaming service. It can continuously capture gigabytes of data per second from hundreds of thousands of sources, such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more. We use Kinesis Data Streams because it’s a serverless solution that can scale based on usage.

Create a Kinesis data stream

First, you need to create a Kinesis data stream to receive the streaming data:

  1. On the Amazon Kinesis console, choose Data streams in the navigation pane.
  2. Choose Create data stream.
  3. For Data stream name, enter cust-payment-txn-stream.
  4. For Capacity mode, select On-demand.
  5. For the rest of the options, choose the default options and follow through the prompts to complete the setup.
  6. Capture the ARN for the created data stream to use in the next section when defining your IAM policy.

Streaming ARN Highlight

Set up permissions

For a streaming application to write to Kinesis Data Streams, the application needs to have access to Kinesis. You can use the following policy statement to grant the simulator process that you set up in the next section access to the data stream. Use the ARN of the data stream that you saved in the previous step.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt123",
"Effect": "Allow",
"Action": [
"kinesis:DescribeStream",
"kinesis:PutRecord",
"kinesis:PutRecords",
"kinesis:GetShardIterator",
"kinesis:GetRecords",
"kinesis:ListShards",
"kinesis:DescribeStreamSummary"
],
"Resource": [
"arn:aws:kinesis:us-west-2:xxxxxxxxxxxx:stream/cust-payment-txn-stream"
]
}
]
}

Configure the stream producer

Before we can consume streaming data in Amazon Redshift, we need a streaming data source that writes data to the Kinesis data stream. This post uses a custom-built data generator and the AWS SDK for Python (Boto3) to publish the data to the data stream. For setup instructions, refer to Producer Simulator. This simulator process publishes streaming data to the data stream created in the previous step (cust-payment-txn-stream).

Configure the stream consumer

This section talks about configuring the stream consumer (the Amazon Redshift streaming ingestion view).

Amazon Redshift Streaming Ingestion provides low-latency, high-speed ingestion of streaming data from Kinesis Data Streams into an Amazon Redshift materialized view. You can configure your Amazon Redshift cluster to enable streaming ingestion and create a materialized view with auto refresh, using SQL statements, as described in Creating materialized views in Amazon Redshift. The automatic materialized view refresh process will ingest streaming data at hundreds of megabytes of data per second from Kinesis Data Streams into Amazon Redshift. This results in fast access to external data that is quickly refreshed.

After creating the materialized view, you can access your data from the data stream using SQL and simplify your data pipelines by creating materialized views directly on top of the stream.

Complete the following steps to configure an Amazon Redshift streaming materialized view:

  1. On the IAM console, choose policies in the navigation pane.
  2. Choose Create policy.
  3. Create a new IAM policy called KinesisStreamPolicy.  For the streaming policy definition, see Getting started with streaming ingestion.
  4. In the navigation pane, choose Roles.
  5. Choose Create role.
  6. Select AWS service and choose Redshift and Redshift customizable.
  7. Create a new role called redshift-streaming-role and attach the policy KinesisStreamPolicy.
  8. Create an external schema to map to Kinesis Data Streams:
CREATE EXTERNAL SCHEMA custpaytxn
FROM KINESIS IAM_ROLE 'arn:aws:iam::386xxxxxxxxx:role/redshift-streaming-role';

Now you can create a materialized view to consume the stream data. You can use the SUPER data type to store the payload as is, in JSON format, or use Amazon Redshift JSON functions to parse the JSON data into individual columns. For this post, we use the second method because the schema is well defined.
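
If you prefer the first method, a materialized view that keeps the payload in a SUPER column would look like the following sketch (the view name is illustrative):

CREATE MATERIALIZED VIEW cust_payment_tx_stream_super
AUTO REFRESH YES
AS
SELECT approximate_arrival_timestamp,
partition_key,
shard_id,
sequence_number,
json_parse(from_varbyte(kinesis_data, 'utf-8')) as payload
FROM custpaytxn."cust-payment-txn-stream"
Where is_utf8(kinesis_data) AND can_json_parse(kinesis_data);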

  9. Create the streaming ingestion materialized view cust_payment_tx_stream. By specifying AUTO REFRESH YES in the following code, you can enable automatic refresh of the streaming ingestion view, which saves time by avoiding building data pipelines:
CREATE MATERIALIZED VIEW cust_payment_tx_stream
AUTO REFRESH YES
AS
SELECT approximate_arrival_timestamp ,
partition_key,
shard_id,
sequence_number,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TRANSACTION_ID')::bigint as TRANSACTION_ID,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TX_DATETIME')::character(50) as TX_DATETIME,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'CUSTOMER_ID')::int as CUSTOMER_ID,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TERMINAL_ID')::int as TERMINAL_ID,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TX_AMOUNT')::decimal(18,2) as TX_AMOUNT,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TX_TIME_SECONDS')::int as TX_TIME_SECONDS,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TX_TIME_DAYS')::int as TX_TIME_DAYS
FROM custpaytxn."cust-payment-txn-stream"
Where is_utf8(kinesis_data) AND can_json_parse(kinesis_data);

Note that json_extract_path_text has a length limitation of 64 KB. Also, from_varbyte filters records larger than 65 KB.

  10. Refresh the data.

The Amazon Redshift streaming materialized view is auto refreshed by Amazon Redshift for you. This way, you don’t need to worry about data staleness. With materialized view auto refresh, data is automatically loaded into Amazon Redshift as it becomes available in the stream. If you choose to manually perform this operation, use the following command:

REFRESH MATERIALIZED VIEW cust_payment_tx_stream ;
  11. Now let’s query the streaming materialized view to see sample data:
Select * from cust_payment_tx_stream limit 10;

  12. Let’s check how many records are in the streaming view now:
Select count(*) as stream_rec_count from cust_payment_tx_stream;

Now you have finished setting up the Amazon Redshift streaming ingestion view, which is continuously updated with incoming credit card transaction data. In my setup, I see that around 67,000 records have been pulled into the streaming view at the time when I ran my select count query. This number could be different for you.
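
If you want to check the refresh activity for the streaming materialized view, one option is to query the SVL_MV_REFRESH_STATUS system view; the following is a sketch (available columns can vary by Amazon Redshift version):

SELECT mv_name, status, starttime, endtime
FROM svl_mv_refresh_status
WHERE mv_name = 'cust_payment_tx_stream'
ORDER BY starttime DESC
LIMIT 5;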

Redshift ML

With Redshift ML, you can bring a pre-trained ML model or build one natively. For more information, refer to Using machine learning in Amazon Redshift.

In this post, we train and build an ML model using a historical dataset. The data contains a tx_fraud field that flags a historical transaction as fraudulent or not. We build a supervised ML model using Redshift Auto ML, which learns from this dataset and predicts incoming transactions when those are run through the prediction functions.

In the following sections, we show how to set up the historical dataset and customer data.

Load the historical dataset

The historical table has more fields than the streaming data source. These additional fields capture the customer’s recent spend and the terminal risk score, such as the number of fraudulent transactions, computed by transforming the streaming data. There are also categorical variables such as weekend transactions or nighttime transactions.

To load the historical data, run the commands using the Amazon Redshift query editor.

Create the transaction history table with the following code. The DDL can also be found on GitHub.

CREATE TABLE cust_payment_tx_history
(
TRANSACTION_ID integer,
TX_DATETIME timestamp,
CUSTOMER_ID integer,
TERMINAL_ID integer,
TX_AMOUNT decimal(9,2),
TX_TIME_SECONDS integer,
TX_TIME_DAYS integer,
TX_FRAUD integer,
TX_FRAUD_SCENARIO integer,
TX_DURING_WEEKEND integer,
TX_DURING_NIGHT integer,
CUSTOMER_ID_NB_TX_1DAY_WINDOW decimal(9,2),
CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW decimal(9,2),
CUSTOMER_ID_NB_TX_7DAY_WINDOW decimal(9,2),
CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW decimal(9,2),
CUSTOMER_ID_NB_TX_30DAY_WINDOW decimal(9,2),
CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW decimal(9,2),
TERMINAL_ID_NB_TX_1DAY_WINDOW decimal(9,2),
TERMINAL_ID_RISK_1DAY_WINDOW decimal(9,2),
TERMINAL_ID_NB_TX_7DAY_WINDOW decimal(9,2),
TERMINAL_ID_RISK_7DAY_WINDOW decimal(9,2),
TERMINAL_ID_NB_TX_30DAY_WINDOW decimal(9,2),
TERMINAL_ID_RISK_30DAY_WINDOW decimal(9,2)
);
Copy cust_payment_tx_history
FROM 's3://redshift-demos/redshiftml-reinvent/2022/ant312/credit-card-transactions/credit_card_transactions_transformed_balanced.csv'
iam_role default
ignoreheader 1
csv ;

Let’s check how many transactions are loaded:

select count(1) from cust_payment_tx_history;

Check the monthly fraud and non-fraud transactions trend:

SELECT to_char(tx_datetime, 'YYYYMM') as YearMonth,
sum(case when tx_fraud=1 then 1 else 0 end) as fraud_tx,
sum(case when tx_fraud=0 then 1 else 0 end) as non_fraud_tx,
count(*) as total_tx
FROM cust_payment_tx_history
GROUP BY YearMonth;

Create and load customer data

Now we create the customer table and load data, which contains the email and phone number of the customer. The following code creates the table, loads the data, and samples the table. The table DDL is available on GitHub.

CREATE TABLE public."customer_info"(customer_id bigint NOT NULL encode az64,
job_title character varying(500) encode lzo,
email_address character varying(100) encode lzo,
full_name character varying(200) encode lzo,
phone_number character varying(20) encode lzo,
city varchar(50),
state varchar(50)
);
COPY customer_info
FROM 's3://redshift-demos/redshiftml-reinvent/2022/ant312/customer-data/Customer_Data.csv'
IGNOREHEADER 1
IAM_ROLE default CSV;
Select count(1) from customer_info;

Our test data has about 5,000 customers. The following screenshot shows sample customer data.
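
Because the screenshot isn’t reproduced here, you can inspect a few rows directly with a simple query:

SELECT customer_id, full_name, email_address, phone_number, city, state
FROM customer_info
LIMIT 5;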

Build an ML model

Our historical card transaction table has 6 months of data, which we now use to train and test the ML model.

The model takes the following fields as input:

TX_DURING_WEEKEND ,
TX_AMOUNT,
TX_DURING_NIGHT ,
CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
TERMINAL_ID_NB_TX_1DAY_WINDOW ,
TERMINAL_ID_RISK_1DAY_WINDOW ,
TERMINAL_ID_NB_TX_7DAY_WINDOW ,
TERMINAL_ID_RISK_7DAY_WINDOW ,
TERMINAL_ID_NB_TX_30DAY_WINDOW ,
TERMINAL_ID_RISK_30DAY_WINDOW

We get tx_fraud as output.

We split this data into training and test datasets. Transactions from 2022-04-01 to 2022-07-31 are for the training set. Transactions from 2022-08-01 to 2022-09-30 are used for the test set.

Let’s create the ML model using the familiar SQL CREATE MODEL statement. We use a basic form of the Redshift ML command. The following method uses Amazon SageMaker Autopilot, which performs data preparation, feature engineering, model selection, and training automatically for you. Provide the name of an S3 bucket that Redshift ML can use to store training data and model artifacts.

CREATE MODEL cust_cc_txn_fd
FROM (
SELECT TX_AMOUNT ,
TX_FRAUD ,
TX_DURING_WEEKEND ,
TX_DURING_NIGHT ,
CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
TERMINAL_ID_NB_TX_1DAY_WINDOW ,
TERMINAL_ID_RISK_1DAY_WINDOW ,
TERMINAL_ID_NB_TX_7DAY_WINDOW ,
TERMINAL_ID_RISK_7DAY_WINDOW ,
TERMINAL_ID_NB_TX_30DAY_WINDOW ,
TERMINAL_ID_RISK_30DAY_WINDOW
FROM cust_payment_tx_history
WHERE cast(tx_datetime as date) between '2022-06-01' and '2022-09-30'
) TARGET tx_fraud
FUNCTION fn_customer_cc_fd
IAM_ROLE default
SETTINGS (
S3_BUCKET '<replace this with your s3 bucket name>',
s3_garbage_collect off,
max_runtime 3600
);

I call the ML model cust_cc_txn_fd, and the prediction function fn_customer_cc_fd. The FROM clause shows the input columns from the historical table public.cust_payment_tx_history. The TARGET parameter is set to tx_fraud, which is the target variable that we’re trying to predict. IAM_ROLE is set to default because the cluster is configured with this role; if not, you have to provide your Amazon Redshift cluster IAM role ARN. I set max_runtime to 3,600 seconds, which is the time we give SageMaker to complete the process. Redshift ML deploys the best model that is identified in this time frame.

Depending on the complexity of the model and the amount of data, it can take some time for the model to be available. If you find your model selection is not completing, increase the value for max_runtime. You can set a max value of 9999.

The CREATE MODEL command is run asynchronously, which means it runs in the background. You can use the SHOW MODEL command to see the status of the model. When the status shows as Ready, it means the model is trained and deployed.

show model cust_cc_txn_fd;

The following screenshots show our output.

From the output, I see that the model has been correctly recognized as BinaryClassification, and F1 has been selected as the objective. The F1 score is a metric that considers both precision and recall. It returns a value between 1 (perfect precision and recall) and 0 (lowest possible score). In my case, it’s 0.91. The higher the value, the better the model performance.

Let’s test this model with the test dataset. Run the following command, which retrieves sample predictions:

SELECT
tx_fraud ,
fn_customer_cc_fd(
TX_AMOUNT ,
TX_DURING_WEEKEND ,
TX_DURING_NIGHT ,
CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
TERMINAL_ID_NB_TX_1DAY_WINDOW ,
TERMINAL_ID_RISK_1DAY_WINDOW ,
TERMINAL_ID_NB_TX_7DAY_WINDOW ,
TERMINAL_ID_RISK_7DAY_WINDOW ,
TERMINAL_ID_NB_TX_30DAY_WINDOW ,
TERMINAL_ID_RISK_30DAY_WINDOW )
FROM cust_payment_tx_history
WHERE cast(tx_datetime as date) >= '2022-10-01'
limit 10 ;

We see that some values are matching and some are not. Let’s compare predictions to the ground truth:

SELECT
tx_fraud ,
fn_customer_cc_fd(
TX_AMOUNT ,
TX_DURING_WEEKEND ,
TX_DURING_NIGHT ,
CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
TERMINAL_ID_NB_TX_1DAY_WINDOW ,
TERMINAL_ID_RISK_1DAY_WINDOW ,
TERMINAL_ID_NB_TX_7DAY_WINDOW ,
TERMINAL_ID_RISK_7DAY_WINDOW ,
TERMINAL_ID_NB_TX_30DAY_WINDOW ,
TERMINAL_ID_RISK_30DAY_WINDOW
) as prediction, count(*) as values
FROM public.cust_payment_tx_history
WHERE cast(tx_datetime as date) >= '2022-08-01'
Group by 1,2 ;

We validated that the model is working and the F1 score is good. Let’s move on to generating predictions on streaming data.

Predict fraudulent transactions

Because the Redshift ML model is ready to use, we can use it to run the predictions against streaming data ingestion. The historical dataset has more fields than what we have in the streaming data source, but they’re just recency and frequency metrics around the customer and terminal risk for a fraudulent transaction.

We can apply the transformations on top of the streaming data very easily by embedding the SQL inside the views. Create the first view, which aggregates streaming data at the customer level. Then create the second view, which aggregates streaming data at terminal level, and the third view, which combines incoming transactional data with customer and terminal aggregated data and calls the prediction function all in one place. The code for the third view is as follows:

CREATE VIEW public.cust_payment_tx_fraud_predictions
as
select a.approximate_arrival_timestamp,
d.full_name , d.email_address, d.phone_number,
a.TRANSACTION_ID, a.TX_DATETIME, a.CUSTOMER_ID, a.TERMINAL_ID,
a.TX_AMOUNT ,
a.TX_TIME_SECONDS ,
a.TX_TIME_DAYS ,
public.fn_customer_cc_fd(a.TX_AMOUNT ,
a.TX_DURING_WEEKEND,
a.TX_DURING_NIGHT,
c.CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
c.CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
c.CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
c.CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
c.CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
c.CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
t.TERMINAL_ID_NB_TX_1DAY_WINDOW ,
t.TERMINAL_ID_RISK_1DAY_WINDOW ,
t.TERMINAL_ID_NB_TX_7DAY_WINDOW ,
t.TERMINAL_ID_RISK_7DAY_WINDOW ,
t.TERMINAL_ID_NB_TX_30DAY_WINDOW ,
t.TERMINAL_ID_RISK_30DAY_WINDOW ) Fraud_prediction
From
(select
Approximate_arrival_timestamp,
TRANSACTION_ID, TX_DATETIME, CUSTOMER_ID, TERMINAL_ID,
TX_AMOUNT ,
TX_TIME_SECONDS ,
TX_TIME_DAYS ,
case when extract(dow from cast(TX_DATETIME as timestamp)) in (1,7) then 1 else 0 end as TX_DURING_WEEKEND,
case when extract(hour from cast(TX_DATETIME as timestamp)) between 00 and 06 then 1 else 0 end as TX_DURING_NIGHT
FROM cust_payment_tx_stream) a
join terminal_transformations t
on a.terminal_id = t.terminal_id
join customer_transformations c
on a.customer_id = c.customer_id
join customer_info d
on a.customer_id = d.customer_id
;
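
The first two views aren’t listed in this post. Purely as an illustration of the shape the customer-level aggregation might take, the following is a hypothetical sketch of the customer_transformations view; the actual window logic and formulas used in the solution may differ:

CREATE VIEW public.customer_transformations
as
select customer_id,
count(case when cast(tx_datetime as timestamp) >= dateadd(day, -1, getdate()) then 1 end) :: decimal(9,2) as CUSTOMER_ID_NB_TX_1DAY_WINDOW,
avg(case when cast(tx_datetime as timestamp) >= dateadd(day, -1, getdate()) then tx_amount end) as CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW,
count(case when cast(tx_datetime as timestamp) >= dateadd(day, -7, getdate()) then 1 end) :: decimal(9,2) as CUSTOMER_ID_NB_TX_7DAY_WINDOW,
avg(case when cast(tx_datetime as timestamp) >= dateadd(day, -7, getdate()) then tx_amount end) as CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW,
count(case when cast(tx_datetime as timestamp) >= dateadd(day, -30, getdate()) then 1 end) :: decimal(9,2) as CUSTOMER_ID_NB_TX_30DAY_WINDOW,
avg(case when cast(tx_datetime as timestamp) >= dateadd(day, -30, getdate()) then tx_amount end) as CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW
from cust_payment_tx_stream
group by customer_id;

The terminal_transformations view would aggregate by terminal_id in the same way to derive the TERMINAL_ID_* count and risk columns.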

Run a SELECT statement on the view:

select * from
cust_payment_tx_fraud_predictions
where Fraud_prediction = 1;

As you run the SELECT statement repeatedly, the latest credit card transactions go through transformations and ML predictions in near-real time.

This demonstrates the power of Amazon Redshift—with easy-to-use SQL commands, you can transform streaming data by applying complex window functions and apply an ML model to predict fraudulent transactions all in one step, without building complex data pipelines or building and managing additional infrastructure.

Expand the solution

Because the data streams in and ML predictions are made in near-real time, you can build business processes for alerting your customer using Amazon Simple Notification Service (Amazon SNS), or you can lock the customer’s credit card account in an operational system.
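
One way to feed such a process is to persist flagged transactions into a table that an alerting job polls. The following is a minimal sketch; the fraud_alerts table name is an assumption:

-- One-time setup: create an empty table with the same structure as the predictions view
CREATE TABLE fraud_alerts AS
SELECT * FROM cust_payment_tx_fraud_predictions WHERE 1 = 0;

-- Run periodically: append newly flagged transactions that haven't been captured yet
INSERT INTO fraud_alerts
SELECT p.*
FROM cust_payment_tx_fraud_predictions p
LEFT JOIN fraud_alerts a ON p.transaction_id = a.transaction_id
WHERE p.fraud_prediction = 1
AND a.transaction_id IS NULL;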

This post doesn’t go into the details of these operations, but if you’re interested in learning more about building event-driven solutions using Amazon Redshift, refer to the following GitHub repository.

Clean up

To avoid incurring future charges, delete the resources that were created as part of this post.

Conclusion

In this post, we demonstrated how to set up a Kinesis data stream, configure a producer and publish data to streams, and then create an Amazon Redshift Streaming Ingestion view and query the data in Amazon Redshift. After the data was in the Amazon Redshift cluster, we demonstrated how to train an ML model, build a prediction function, and apply it against the streaming data to generate predictions in near-real time.

If you have any feedback or questions, please leave them in the comments.


About the Authors

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytic solutions. His background is in data warehouses—architecture, development, and administration. He has been in the data and analytics field for over 15 years.

Praveen Kadipikonda is a Senior Analytics Specialist Solutions Architect at AWS based out of Dallas. He helps customers build efficient, performant, and scalable analytic solutions. He has worked with building databases and data warehouse solutions for over 15 years.

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

How to get best price performance from your Amazon Redshift Data Sharing deployment

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/how-to-get-best-price-performance-from-your-amazon-redshift-data-sharing-deployment/

Amazon Redshift is a fast, scalable, secure, and fully-managed data warehouse that enables you to analyze all of your data using standard SQL easily and cost-effectively. Amazon Redshift Data Sharing allows customers to securely share live, transactionally consistent data in one Amazon Redshift cluster with another Amazon Redshift cluster across accounts and regions without needing to copy or move data from one cluster to another.

Amazon Redshift Data Sharing was initially launched in March 2021, and support for cross-account data sharing was added in August 2021. Cross-Region support became generally available in February 2022. This provides full flexibility and agility to share data across Redshift clusters in the same AWS account, different accounts, or different Regions.

Amazon Redshift Data Sharing is used to fundamentally redefine Amazon Redshift deployment architectures into a hub-spoke, data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, easily onboard new use cases, and most importantly do all of this without the complexity of data movement and data copies. Some of the most common questions asked during data sharing deployment are, “How big should my consumer clusters and producer clusters be?”, and “How do I get the best price performance for workload isolation?”. As workload characteristics like data size, ingestion rate, query pattern, and maintenance activities can impact data sharing performance, a continuous strategy to size both consumer and producer clusters to maximize the performance and minimize cost should be implemented. In this post, we provide a step-by-step approach to help you determine your producer and consumer clusters sizes for the best price performance based on your specific workload.

Generic consumer sizing guidance

The following steps show the generic strategy to size your producer and consumer clusters. You can use it as a starting point and modify accordingly to cater your specific use case scenario.

Size your producer cluster

You should always make sure that you properly size your producer cluster to get the performance that you need to meet your SLA. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation for the producer cluster based on the size of your data and query characteristic. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started, and you should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Size and setup initial consumer cluster

You should always size your consumer cluster based on your compute needs. One way to get started is to follow the generic cluster sizing guide similar to the producer cluster above.

Set up Amazon Redshift data sharing

Set up data sharing from producer to consumer once you have both the producer and consumer clusters set up. Refer to this post for guidance on how to set up data sharing.
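
For reference, the core SQL on each side looks like the following sketch; the datashare name, schema, and namespace IDs are placeholders:

-- On the producer cluster
CREATE DATASHARE analytics_share;
ALTER DATASHARE analytics_share ADD SCHEMA public;
ALTER DATASHARE analytics_share ADD ALL TABLES IN SCHEMA public;
GRANT USAGE ON DATASHARE analytics_share TO NAMESPACE '<consumer-namespace-id>';

-- On the consumer cluster
CREATE DATABASE analytics_db FROM DATASHARE analytics_share OF NAMESPACE '<producer-namespace-id>';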

Test consumer only workload on initial consumer cluster

Test consumer only workload on the new initial consumer cluster. This can be done by pointing consumer applications, for example ETL tools, BI applications, and SQL clients, to the new consumer cluster and rerunning the workload to evaluate the performance against your requirements.

Test consumer only workload on different consumer cluster configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either continue to use this cluster configuration or you can test on smaller configurations to see if you can further reduce the cost and still get the performance that you need.

On the other hand, if the initial size consumer cluster fails to meet your workload performance requirements, then you can further test larger configurations to get the configuration that meets your SLA.

As a rule of thumb, size up the consumer cluster by 2x the initial cluster configuration incrementally until it meets your workload requirements.

Once you plan out what configuration you want to test, use elastic resize to resize the initial cluster to the target cluster configuration. After elastic resize is completed, perform the same workload test and evaluate the performance against your SLA. Select the configuration that meets your price performance target.

Test producer only workload on different producer cluster configurations

Once you move your consumer workload to the consumer cluster with the optimum price performance, there might be an opportunity to reduce the compute resource on the producer to save on costs.

To achieve this, you can rerun the producer-only workload on 1/2x of the original producer size and evaluate the workload performance. Resize the cluster up or down accordingly based on the result, and then select the minimum producer configuration that meets your workload performance requirements.

Re-evaluate after a full workload run over time

As Amazon Redshift continues evolving, and there are continuous performance and scalability improvement releases, data sharing performance will continue improving. Furthermore, numerous variables might impact the performance of data sharing queries. The following are just some examples:

  • Ingestion rate and amount of data change
  • Query pattern and characteristic
  • Workload changes
  • Concurrency
  • Maintenance activities, for example vacuum, analyze, and ATO

This is why you must re-evaluate the producer and consumer cluster sizing using the strategy above on occasion, especially after a full workload deployment, to gain the new best price performance from your cluster’s configuration.

Automated sizing solutions

If your environment involves a more complex architecture, for example with multiple tools or applications (BI, ingestion or streaming, ETL, data science), then it might not be feasible to use the manual method from the generic guidance above. Instead, you can leverage the solutions in this section to automatically replay the workload from your production cluster on test consumer and producer clusters to evaluate the performance.

The Simple Replay utility is leveraged as the automated solution to guide you through the process of getting the right producer and consumer cluster sizes for the best price performance.

Simple Replay is a tool for conducting a what-if analysis and evaluating how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. It also includes enhanced support for replaying data ingestion and export pipelines with COPY and UNLOAD statements. To get started and replay your workloads, download the tool from the Amazon Redshift GitHub repository.

Here we walk through the steps to extract your workload logs from the source production cluster and replay them in an isolated environment. This lets you perform a direct comparison between these Amazon Redshift clusters seamlessly and select the clusters configuration that best meet your price performance target.

The following diagram shows the solution architecture.

Architecture for testing Simple Replay

Solution walkthrough

Follow these steps to go through the solution to size your consumer and producer clusters.

Size your production cluster

You should always make sure to properly size your existing production cluster to get the performance that you need to meet your workload requirements. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation on the production cluster based on the size of your data and query characteristic. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started. You should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Identify the workload to be isolated

You might have different workloads running on your original cluster, but the first step is to identify the most critical workload to the business that we want to isolate. This is because we want to make sure that the new architecture can meet your workload requirements. This post is a good reference on a data sharing workload isolation use case that can help you decide which workload can be isolated.

Set up Simple Replay

Once you know your critical workload, you must enable audit logging in the production cluster where the identified critical workload runs, to capture query activities and store them in Amazon Simple Storage Service (Amazon S3). Note that it may take up to three hours for the audit logs to be delivered to Amazon S3. Once the audit log is available, proceed to set up Simple Replay and then extract the critical workload from the audit log. Note that start_time and end_time can be used as parameters to filter the critical workload if those workloads run in certain time periods, for example 9 AM to 11 AM. Otherwise, it will extract all of the logged activities.

Baseline workload

Create a baseline cluster with the same configuration as the producer cluster by restoring from the production snapshot. The purpose of starting with the same configuration is to baseline the performance with an isolated environment.

Once the baseline cluster is available, replay the extracted workload in the baseline cluster. The output from this replay will be the baseline used to compare against subsequent replays on different consumer configurations.

Set up initial producer and consumer test clusters

Create a producer cluster with the same production cluster configuration by restoring from the production snapshot. Create a consumer cluster with the recommended initial consumer size from the previous guidance. Furthermore, set up data sharing between the producer and consumer.

Replay workload on initial producer and consumer

Replay the producer only workload on the initial size producer cluster. This can be achieved using the “Exclude” filter parameter to exclude consumer queries, for example the user that runs consumer queries.

Replay the consumer only workload on the initial size consumer cluster. This can be achieved using the “Include” filter parameter to include only consumer queries, for example by filtering on the user that runs consumer queries.

Evaluate the performance of these replays against the baseline and workload performance requirements.

Replay consumer workload on different configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either use this cluster configuration or you can follow these steps to test on smaller configurations to see if you can further reduce costs and still get the performance that you need.

Compare initial consumer performance results against your workload requirements:

  1. If the result exceeds your workload performance requirements, then you can reduce the size of the consumer cluster incrementally, starting with 1/2x, retry the replay and evaluate the performance, then resize up or down accordingly based on the result until it meets your workload requirements. The purpose is to get a sweet spot where you’re comfortable with the performance requirements and get the lowest price possible.
  2. If the result fails to meet your workload performance requirements, then you can increase the size of the cluster incrementally, starting with 2x the original size, retry the replay and evaluate the performance until it meets your workload performance requirements.

Replay producer workload on different configurations

Once you split your workloads out to consumer clusters, the load on the producer cluster should be reduced and you should evaluate your producer cluster’s workload performance to seek the opportunity to downsize to save on costs.

The steps are similar to consumer replay. Elastic resize the producer cluster incrementally starting with 1/2x the original size, replay the producer only workload and evaluate the performance, and then further resize up or down until it meets your workload performance requirements. The purpose is to get a sweet spot where you’re comfortable with the workload performance requirements and get the lowest price possible. Once you have the desired producer cluster configuration, retry replay consumer workloads on the consumer cluster to make sure that the performance wasn’t impacted by producer cluster configuration changes. Finally, you should replay both producer and consumer workloads concurrently to make sure that the performance is achieved in a full workload scenario.

Re-evaluate after a full workload run over time

Similar to the generic guidance, you should re-evaluate the producer and consumer clusters sizing using the previous strategy on occasion, especially after full workload deployment to gain the new best price performance from your cluster’s configuration.

Clean up

Running these sizing tests in your AWS account may have some cost implications because it provisions new Amazon Redshift clusters, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the Amazon Redshift clusters to save on costs. We also recommend pausing your clusters when they’re not in use.

Applying Amazon Redshift and data sharing best practices

Proper sizing of both your producer and consumer clusters will give you a good start to get the best price performance from your Amazon Redshift deployment. However, sizing isn’t the only factor that can maximize your performance. In this case, understanding and following best practices are equally important.

General Amazon Redshift performance tuning best practices are applicable to data sharing deployment. Make sure that your deployment follows these best practices.

There are numerous data sharing-specific best practices that you should follow to make sure that you maximize performance. Refer to this post for more details.

Summary

There is no one-size-fits-all recommendation on producer and consumer cluster sizes. It varies by workloads and your performance SLA. The purpose of this post is to provide you with guidance for how you can evaluate your specific data sharing workload performance to determine both consumer and producer cluster sizes to get the best price performance. Consider testing your workloads on producer and consumer using simple replay before adopting it in production to get the best price performance.


About the Authors

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for costs, reliability, performance and operational excellence at scale in their cloud journey. He has a keen interest in Data Analytics as well.

Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion tool (SCT)

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/migrate-google-bigquery-to-amazon-redshift-using-aws-schema-conversion-tool-sct/

Amazon Redshift is a fast, fully-managed, petabyte scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. Using Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers. The following are just some of the notable capabilities:

  • Amazon Redshift seamlessly integrates with broader analytics services on AWS. This enables you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. Amazon Redshift lets you build lake house architectures and then perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning (ML), and more.
  • You don’t need to worry about workloads, such as ETL, dashboards, ad-hoc queries, and so on, interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get one hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is easy-to-use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault tolerance is built in. All of the data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Failed hardware is automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, and serverless web services-based or event-driven applications and so on.
  • Redshift ML makes it easy for data scientists to create, train, and deploy ML models using familiar SQL. They can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row and column level security controls on sensitive data, and so on.
  • Amazon Redshift integrates seamlessly with other AWS services and third-party tools. You can move, transform, load, and query large datasets quickly and reliably.

In this post, we provide a walkthrough for migrating a data warehouse from Google BigQuery to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT is a service that makes heterogeneous database migrations predictable by automatically converting the majority of the database code and storage objects to a format that is compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. Furthermore, AWS SCT can scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your BigQuery project. First, we create an Amazon Redshift database into which BigQuery data is migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert BigQuery schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from BigQuery, upload it into the S3 bucket, and then copy to Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

  1. A workstation with AWS SCT, Amazon Corretto 11, and Amazon Redshift drivers.
    1. You can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or your local desktop as a workstation. In this walkthrough, we’re using an Amazon EC2 Windows instance. To create it, use this guide.
    2. To download and install AWS SCT on the EC2 instance that you previously created, use this guide.
    3. Download the Amazon Redshift JDBC driver from this location.
    4. Download and install Amazon Corretto 11.
  2. A GCP service account that AWS SCT can use to connect to your source BigQuery project.
    1. Grant BigQuery Admin and Storage Admin roles to the service account.
    2. Copy the Service account key file, which was created in the Google cloud management console, to the EC2 instance that has AWS SCT.
    3. Create a Cloud Storage bucket in GCP to store your source data during migration.

This walkthrough covers the following steps:

  • Create an Amazon Redshift Serverless Workgroup and Namespace
  • Create the AWS S3 Bucket and Folder
  • Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT
    • Connecting to the Google BigQuery Source
    • Connect to the Amazon Redshift Target
    • Convert BigQuery schema to an Amazon Redshift
    • Analyze the assessment report and address the action items
    • Apply converted schema to target Amazon Redshift
  • Migrate data using AWS SCT data extraction agents
    • Generating Trust and Key Stores (Optional)
    • Install and start data extraction agent
    • Register data extraction agent
    • Add virtual partitions for large tables (Optional)
    • Create a local migration task
    • Start the Local Data Migration Task
  • View Data in Amazon Redshift

Create an Amazon Redshift Serverless Workgroup and Namespace

In this step, we create an Amazon Redshift Serverless workgroup and namespace. A workgroup is a collection of compute resources, and a namespace is a collection of database objects and users. To isolate workloads and manage different resources in Amazon Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create an Amazon Redshift Serverless workgroup and namespace:

  • Navigate to the Amazon Redshift console.
  • In the upper right, choose the AWS Region that you want to use.
  • Expand the Amazon Redshift pane on the left and choose Redshift Serverless.
  • Choose Create Workgroup.
  • For Workgroup name, enter a name that describes the compute resources.
  • Verify that the VPC is the same VPC as the EC2 instance with AWS SCT.
  • Choose Next.

  • For Namespace name, enter a name that describes your dataset.
  • In Database name and password section, select the checkbox Customize admin user credentials.
    • For Admin user name, enter a username of your choice, for example awsuser.
    • For Admin user password: enter a password of your choice, for example MyRedShiftPW2022
  • Choose Next. Note that data in Amazon Redshift Serverless namespace is encrypted by default.
  • In the Review and Create page, choose Create.
  • Create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace, as described in the following. Note that there can only be one default IAM role.
    • Navigate to the Amazon Redshift Serverless Dashboard.
    • Under Namespaces / Workgroups, choose the namespace that you just created.
    • Navigate to Security and encryption.
    • Under Permissions, choose Manage IAM roles.
    • Navigate to Manage IAM roles. Then, choose the Manage IAM roles drop-down and choose Create IAM role.
    • Under Specify an Amazon S3 bucket for the IAM role to access, choose one of the following methods:
      • Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with a name starting with redshift.
      • Choose Any Amazon S3 bucket to allow the created IAM role to access all of the S3 buckets.
      • Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
    • Choose Create IAM role as default. Amazon Redshift automatically creates and sets the IAM role as default.
  • Capture the Endpoint for the Amazon Redshift Serverless workgroup that you just created.

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create the S3 bucket:

  • Navigate to the Amazon S3 console
  • Choose Create bucket. The Create bucket wizard opens.
  • For Bucket name, enter a unique DNS-compliant name for your bucket (e.g., uniquename-bq-rs). See rules for bucket naming when choosing a name.
  • For AWS Region, choose the region in which you created the Amazon Redshift Serverless workgroup.
  • Select Create Bucket.
  • In the Amazon S3 console, navigate to the S3 bucket that you just created (e.g., uniquename-bq-rs).
  • Choose “Create folder” to create a new folder.
  • For Folder name, enter incoming and choose Create Folder.

Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT

To convert BigQuery schema to the Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that we created previously, and then launch AWS SCT.

Follow these steps using AWS SCT:

Connect to the BigQuery Source

  • From the File Menu choose Create New Project.
  • Choose a location to store your project files and data.
  • Provide a meaningful but memorable name for your project, such as BigQuery to Amazon Redshift.
  • To connect to the BigQuery source data warehouse, choose Add source from the main menu.
  • Choose BigQuery and choose Next. The Add source dialog box appears.
  • For Connection name, enter a name to describe the BigQuery connection. AWS SCT displays this name in the tree in the left panel.
  • For Key path, provide the path of the service account key file that you previously created in the Google Cloud Management Console.
  • Choose Test Connection to verify that AWS SCT can connect to your source BigQuery project.
  • Once the connection is successfully validated, choose Connect.

Connect to the Amazon Redshift Target

Follow these steps to connect to Amazon Redshift:

  • In AWS SCT, choose Add Target from the main menu.
  • Choose Amazon Redshift, then choose Next. The Add Target dialog box appears.
  • For Connection name, enter a name to describe the Amazon Redshift connection. AWS SCT displays this name in the tree in the right panel.
  • For Server name, enter the Amazon Redshift Serverless workgroup endpoint captured previously.
  • For Server port, enter 5439.
  • For Database, enter dev.
  • For User name, enter the username chosen when creating the Amazon Redshift Serverless workgroup.
  • For Password, enter the password chosen when creating the Amazon Redshift Serverless workgroup.
  • Uncheck the Use AWS Glue box.
  • Choose Test Connection to verify that AWS SCT can connect to your target Amazon Redshift workgroup.
  • Choose Connect to connect to the Amazon Redshift target.

Alternatively, you can use connection values that are stored in AWS Secrets Manager.
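
For example, such a connection secret could be created with boto3 as sketched below. This is a minimal sketch: the secret name and the key names inside the JSON are illustrative assumptions, so check the AWS SCT documentation for the exact secret format it expects.

import json
import boto3

secrets = boto3.client("secretsmanager")

# Illustrative connection values; the key names are assumptions, not a confirmed AWS SCT schema.
secrets.create_secret(
    Name="sct/redshift-serverless-target",      # placeholder secret name
    SecretString=json.dumps({
        "username": "awsuser",
        "password": "MyRedShiftPW2022",
        "host": "<workgroup-endpoint>",         # the endpoint captured earlier
        "port": 5439,
        "engine": "redshift",
    }),
)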

Convert the BigQuery schema to the Amazon Redshift format

After the source and target connections are successfully made, you see the source BigQuery object tree on the left pane and target Amazon Redshift object tree on the right pane.

Follow these steps to convert BigQuery schema to the Amazon Redshift format:

  • On the left pane, right-click on the schema that you want to convert.
  • Choose Convert Schema.
  • A dialog box appears asking whether to replace objects that might already exist in the target database. Choose Yes.

Once the conversion is complete, you see a new schema created on the Amazon Redshift pane (right pane) with the same name as your BigQuery schema.

The sample schema that we used has 16 tables, 3 views, and 3 procedures. You can see these objects in the Amazon Redshift format in the right pane. AWS SCT converts all of the BigQuery code and data objects to the Amazon Redshift format. Furthermore, you can use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects. However, some of the objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from the Main view to the Assessment Report view.

The Summary tab shows objects that were converted automatically and objects that weren't converted automatically. Green represents objects that were converted automatically or that have simple action items. Blue represents medium and complex action items that require manual intervention.

The Action Items tab shows the recommended actions for each conversion issue. If you select an action item from the list, AWS SCT highlights the object to which the action item applies.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database/schema show you the effort required to design and implement the recommendations for converting action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. AWS SCT takes some actions automatically while converting the schema to Amazon Redshift; objects with these actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it from the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the RECORD and JSON datatype columns in BigQuery table ncaaf_referee_data to the SUPER datatype in Amazon Redshift. The partition key in the ncaaf_referee_data table is converted to the distribution key and sort key in Amazon Redshift.
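
To illustrate the shape of such a conversion, the following sketch applies a simplified, hypothetical DDL through the Redshift Data API. The table layout, column names, and keys are assumptions for illustration only, not the exact DDL that AWS SCT emits; they just show a nested RECORD/JSON column landing as SUPER, and the former partition key becoming the distribution and sort key.

import boto3

rsd = boto3.client("redshift-data")

# Simplified, hypothetical DDL in the spirit of the converted ncaaf_referee_data table.
ddl = """
CREATE TABLE ncaaf_referee_data (
    game_date DATE,
    referee   VARCHAR(256),
    penalties SUPER            -- was a nested RECORD/JSON column in BigQuery
)
DISTKEY (game_date)
SORTKEY (game_date);
"""

rsd.execute_statement(
    WorkgroupName="bq-migration-workgroup",   # placeholder workgroup name
    Database="dev",
    Sql=ddl,
)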

Apply converted schema to target Amazon Redshift

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and then choose Apply to database.

Migrate data from BigQuery to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this walkthrough, we show how to configure AWS SCT extraction agents to extract data from BigQuery and migrate to Amazon Redshift.

First, install the AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For big datasets, you can use several data extraction agents to increase the data migration speed.

Generating trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all of the data passed between the applications remains private and integral. To use SSL communication, you must generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, navigate to Settings → Global Settings → Security.
  2. Choose Generate trust and key store.
  3. Enter the name and password for trust and key stores and choose a location where you would like to store them.
  4. Choose Generate.

Install and configure Data Extraction Agent

In the installation package for AWS SCT, you can find the agents sub-folder (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Listening port, enter the port number on which the agent listens. It is 8192 by default.
  2. For Add a source vendor, enter no, as you don’t need drivers to connect to BigQuery.
  3. For Add the Amazon Redshift driver, enter YES.
  4. For Enter Redshift JDBC driver file or files, enter the location where you downloaded Amazon Redshift JDBC drivers.
  5. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data. The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.
  6. For Enable SSL communication, enter yes (or enter no if you don't want to use SSL).
  7. For Key store, enter the storage location chosen when creating the trust and key store.
  8. For Key store password, enter the password for the key store.
  9. For Enable client SSL authentication, enter yes.
  10. For Trust store, enter the storage location chosen when creating the trust and key store.
  11. For Trust store password, enter the password for the trust store.

The following example console output shows these configuration prompts:

*************************************************
*                                               *
*     AWS SCT Data Extractor Configuration      *
*              Version 2.0.1.666                *
*                                               *
*************************************************
User name: Administrator
User home: C:\Windows\system32\config\systemprofile
*************************************************
Listening port [8192]: 8192
Add a source vendor [YES/no]: no
No one source data warehouse vendors configured. AWS SCT Data Extractor cannot process data extraction requests.
Add the Amazon Redshift driver [YES/no]: YES
Enter Redshift JDBC driver file or files: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\redshift-jdbc42-2.1.0.9.jar
Working folder [C:\Windows\system32\config\systemprofile]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject
Enable SSL communication [YES/no]: YES
Setting up a secure environment at "C:\Windows\system32\config\systemprofile". This process will take a few seconds...
Key store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftKeyStore
Key store password:
Re-enter the key store password:
Enable client SSL authentication [YES/no]: YES
Trust store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftTrustStore
Trust store password:
Re-enter the trust store password:

Starting Data Extraction Agent(s)

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, double-click C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

  • On the computer that has the extraction agent installed, from a command prompt or terminal window, run the start command for your operating system (for example, StartAgent.bat on Windows).
  • To check the status of the agent, run the same command but replace start with status.
  • To stop an agent, run the same command but replace start with stop.
  • To restart an agent, run the RestartAgent.bat file.

Register the Data Extraction Agent

Follow these steps to register the Data Extraction Agent:

  1. In AWS SCT, change the view to Data Migration view (other) and choose + Register.
  2. In the connection tab:
    1. For Description, enter a name to identify the Data Extraction Agent.
    2. For Host name, if you installed the Data Extraction Agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT Data Extraction Agent is installed. It’s recommended to install the Data Extraction Agents on Linux for better performance.
    3. For Port, enter the number entered for the Listening Port when installing the AWS SCT Data Extraction Agent.
    4. Select the checkbox to use SSL (if using SSL) to encrypt the AWS SCT connection to the Data Extraction Agent.
  3. If you’re using SSL, then in the SSL Tab:
    1. For Trust store, choose the trust store name created when generating Trust and Key Stores (optionally, you can skip this if SSL connectivity isn’t needed).
    2. For Key Store, choose the key store name created when generating Trust and Key Stores (optionally, you can skip this if SSL connectivity isn’t needed).
  4. Choose Test Connection.
  5. Once the connection is validated successfully, choose Register.

Add virtual partitions for large tables (optional)

You can use AWS SCT to create virtual partitions to optimize migration performance. When virtual partitions are created, AWS SCT extracts the data in parallel for partitions. We recommend creating virtual partitions for large tables.

Follow these steps to create virtual partitions:

  1. Deselect all objects on the source database view in AWS SCT.
  2. Choose the table for which you would like to add virtual partitioning.
  3. Right-click on the table, and choose Add Virtual Partitioning.
  4. You can use List, Range, or Auto Split partitions. To learn more about virtual partitioning, refer to Use virtual partitioning in AWS SCT. In this example, we use Auto split partitioning, which generates range partitions automatically: you specify the start value, the end value, and the partition size, and AWS SCT determines the partition boundaries. For a demonstration, on the Lineorder table:
    1. For Start Value, enter 1000000.
    2. For End Value, enter 3000000.
    3. For Interval, enter 1000000 to indicate partition size.
    4. Choose Ok.

You can see the partitions automatically generated under the Virtual Partitions tab. In this example, AWS SCT automatically created the following five partitions for the field (the sketch after the list shows how auto-split derives these boundaries):

    1. <1000000
    2. >=1000000 and <=2000000
    3. >2000000 and <=3000000
    4. >3000000
    5. IS NULL
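
The boundary logic is straightforward; the following sketch (purely illustrative and not part of AWS SCT, with lo_orderkey as a hypothetical partitioning field) reproduces how a start value, end value, and interval expand into the predicates above, including the catch-all and NULL partitions.

# Illustrative only: mimics how auto-split range partitioning derives predicates
# from a start value, an end value, and an interval (partition size).
def auto_split_partitions(column, start, end, interval):
    predicates = [f"{column} < {start}"]                 # everything below the start value
    lower, first = start, True
    while lower < end:
        upper = min(lower + interval, end)
        low_op = ">=" if first else ">"                  # only the first range includes its lower bound
        predicates.append(f"{column} {low_op} {lower} AND {column} <= {upper}")
        lower, first = upper, False
    predicates.append(f"{column} > {end}")               # everything above the end value
    predicates.append(f"{column} IS NULL")               # rows with no value at all
    return predicates

# The Lineorder example: start 1000000, end 3000000, interval 1000000 yields the five partitions above.
for predicate in auto_split_partitions("lo_orderkey", 1000000, 3000000, 1000000):
    print(predicate)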

Create a local migration task

To migrate data from BigQuery to Amazon Redshift, create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local migration task:

  1. In AWS SCT, under the schema name in the left pane, right-click on Standard tables.
  2. Choose Create Local task.
  3. There are three migration modes from which you can choose:
    1. Extract the source data and store it on a local PC or virtual machine (VM) where the agent runs.
    2. Extract the data and upload it to an S3 bucket.
    3. Extract the data, upload it to an S3 bucket, and then copy it into Amazon Redshift. For this walkthrough, choose this mode (Extract upload and copy).
  4. In the Advanced tab, for Google CS bucket folder, enter the Google Cloud Storage bucket/folder that you created earlier in the GCP Management Console. AWS SCT stores the extracted data in this location.
  5. In the Amazon S3 Settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket that you created earlier. The AWS SCT data extraction agent uploads the data into the S3 bucket/folder before copying to Amazon Redshift.
  6. Choose Test Task.
  7. Once the task is successfully validated, choose Create.

Start the Local Data Migration Task

To start the task, choose the Start button in the Tasks tab.

  • First, the Data Extraction Agent extracts data from BigQuery into the GCP storage bucket.
  • Then, the agent uploads data to Amazon S3 and launches a copy command to move the data to Amazon Redshift.
  • At this point, AWS SCT has successfully migrated data from the source BigQuery table to the Amazon Redshift table.

View data in Amazon Redshift

After the data migration task executes successfully, you can connect to Amazon Redshift and validate the data.

Follow these steps to validate the data in Amazon Redshift:

  1. Navigate to the Amazon Redshift Query Editor V2.
  2. Double-click on the Amazon Redshift Serverless workgroup name that you created.
  3. Choose the Federated User option under Authentication.
  4. Choose Create Connection.
  5. Create a new editor by choosing the + icon.
  6. In the editor, write a query that selects from the schema and the table or view that you want to verify. Explore the data, run ad hoc queries, and build visualizations, charts, and views. (A programmatic validation sketch follows this list.)
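
If you also want to validate the migration programmatically, a simple row-count check can be run through the Redshift Data API. This is a minimal sketch; the workgroup, schema, and table names are placeholders, and the count should match the corresponding BigQuery table.

import time
import boto3

rsd = boto3.client("redshift-data")

# Run a row-count query against a migrated table (names are placeholders).
resp = rsd.execute_statement(
    WorkgroupName="bq-migration-workgroup",
    Database="dev",
    Sql="SELECT COUNT(*) FROM my_bigquery_schema.lineorder;",
)

# The Data API is asynchronous, so poll until the statement finishes.
while True:
    desc = rsd.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(2)

if desc["Status"] == "FINISHED":
    result = rsd.get_statement_result(Id=resp["Id"])
    print("Row count:", result["Records"][0][0]["longValue"])
else:
    print("Query did not finish:", desc.get("Error"))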

The following is a side-by-side comparison between source BigQuery and target Amazon Redshift for the sports dataset that we used in this walkthrough.

Clean up the AWS resources that you created for this exercise

Follow these steps to terminate the EC2 instance:

  1. Navigate to the Amazon EC2 console.
  2. In the navigation pane, choose Instances.
  3. Select the checkbox for the EC2 instance that you created.
  4. Choose Instance state, and then Terminate instance.
  5. Choose Terminate when prompted for confirmation.

Follow these steps to delete the Amazon Redshift Serverless workgroup and namespace:

  1. Navigate to the Amazon Redshift Serverless Dashboard.
  2. Under Namespaces / Workgroups, choose the workgroup that you created.
  3. Under Actions, choose Delete workgroup.
  4. Select the checkbox Delete the associated namespace.
  5. Uncheck Create final snapshot.
  6. Enter delete in the delete confirmation text box and choose Delete.

Follow these steps to delete the S3 bucket (a scripted cleanup sketch covering all of these resources follows the list):

  1. Navigate to the Amazon S3 console.
  2. Choose the bucket that you created.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket in the text input field.
  5. Choose Delete bucket.
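
If you prefer to clean up with a script instead of the console, the following sketch removes the same resources. It is a minimal sketch; the instance ID and resource names are the placeholders used earlier, so adjust them to match what you actually created.

import boto3

# Terminate the EC2 instance that hosted AWS SCT (placeholder instance ID).
boto3.client("ec2").terminate_instances(InstanceIds=["i-0123456789abcdef0"])

# Delete the Redshift Serverless workgroup, then the namespace (no final snapshot is taken
# because finalSnapshotName is omitted). Workgroup deletion must finish before the
# namespace can be deleted, so you may need to wait or retry.
rs = boto3.client("redshift-serverless")
rs.delete_workgroup(workgroupName="bq-migration-workgroup")
rs.delete_namespace(namespaceName="bq-migration-namespace")

# Empty and then delete the staging bucket.
bucket = boto3.resource("s3").Bucket("uniquename-bq-rs")
bucket.objects.all().delete()
bucket.delete()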

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. Following this walkthrough, you can understand how a data migration task extracts, downloads, and then migrates data from BigQuery to Amazon Redshift. The solution that we presented in this post performs a one-time migration of database objects and data. Data changes made in BigQuery while the migration is in progress won't be reflected in Amazon Redshift, so put your BigQuery ETL jobs on hold during the migration, or replay them against Amazon Redshift after the migration completes. Also consider the best practices for AWS SCT.

AWS SCT has some limitations when using BigQuery as a source. For example, AWS SCT can't convert subqueries in analytic functions, geography functions, statistical aggregate functions, and so on. Find the full list of limitations in the AWS SCT user guide. We plan to address these limitations in future releases. Despite these limitations, you can use AWS SCT to automatically convert most of your BigQuery code and storage objects.

Download and install AWS SCT, sign in to the AWS Console, check out Amazon Redshift Serverless, and start migrating!


About the authors

Cedrick Hoodye is a Solutions Architect with a focus on database migrations using the AWS Database Migration Service (DMS) and the AWS Schema Conversion Tool (SCT) at AWS. He works on database migration-related challenges and works closely with customers in the EdTech, Energy, and ISV business sectors to help them realize the true potential of the DMS service. He has helped migrate hundreds of databases into the AWS Cloud using DMS and SCT.

Amit Arora is a Solutions Architect with a focus on Database and Analytics at AWS. He works with our Financial Technology and Global Energy customers and AWS certified partners to provide technical assistance and design customer solutions on cloud migration projects, helping customers migrate and modernize their existing databases to the AWS Cloud.

Jagadish Kumar is an Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and about enabling customers to achieve success with their large-scale data projects.