
How AWS Payments migrated from Redash to Amazon Redshift Query Editor v2

Post Syndicated from Erol Murtezaoglu original https://aws.amazon.com/blogs/big-data/how-aws-payments-migrated-from-redash-to-amazon-redshift-query-editor-v2/

AWS Payments is part of the AWS Commerce Platform (CP) organization that owns the customer experience of paying AWS invoices. It helps AWS customers manage their payment methods and payment preferences, and helps customers make self-service payments to AWS.

The Machine Learning, Data and Analytics (MLDA) team at AWS Payments enables data-driven decision-making across payments processes and partners by delivering data, business insights, and causal and ML inferences through a scalable set of data, insights, and ML inference services.

In this post, we discuss how to democratize data access to Amazon Redshift using Amazon Redshift Query Editor V2.


At AWS Payments, we had been using Redash to allow our users to author and run SQL queries against our Amazon Redshift data warehouse. Redash is a web-based SQL client application that can be used to author and run queries, visualize query results with charts, and collaborate with teams.

Over time, we began to notice incompatibilities between Redash’s operations and the needs of our workload.

We had the following requirements in mind when looking for an alternative tool:

  • Authentication and authorization
    • Provide data access without creating a database user and password
    • Allow list users using permission groups (POSIX/LDAP) for accessing the tool
    • Limit user access to database objects
  • User experience
    • Run SQL queries on the selected database
    • Save a query and rerun it later
    • Write a dynamic SQL query and run the query based on input parameters
    • Export a query result to CSV
    • Search saved queries
    • Share a query with other users as a URL

After evaluating alternative services, we chose Amazon Redshift Query Editor V2.

Amazon Redshift Query Editor V2

The Amazon Redshift Query Editor V2 has the following benefits:

  • It makes data more accessible to data analysts and data scientists with a unified web-based workbench to explore, share, and collaborate on data through a SQL interface
  • It provides a managed service that allows you to focus on exploring your data without managing your infrastructure
  • Users can log in to the Query Editor using single sign-on (SSO)
  • Users can connect to Amazon Redshift using federated access without providing a user name and password
  • It enables you to collaborate with team members by providing the ability to share saved queries securely
  • You can benefit from new features as soon as they get released by the Amazon Redshift Query Editor team
  • You can keep track of changes made to saved queries using the Query History feature
  • You can write parameterized SQL queries, which allows you to reuse a query with different values
  • You can turn on the Chart feature to display a graphic visualization of the current page of results
  • You can use notebooks to organize, annotate, and share multiple SQL queries in a single document
  • You can run multiple queries in parallel by running each query in a separate tab

However, it presented the following challenges:

  • To restrict user access to other AWS services within our AWS account, we attached the AWS Identity and Access Management (IAM) policies (see the appendix at the end of this post) to the SAML IAM role. The policies enforce the following:
    • The user can only access the Query Editor V2 service.
    • The federated user gets assigned to a database group with limited access.
  • The Query Editor V2 currently doesn’t support cross-account Amazon Redshift connections. However, we set up Amazon Redshift data sharing to access the Amazon Redshift cluster from other AWS accounts. For more details, refer to Sharing data across clusters in Amazon Redshift.

Architecture overview

The following diagram illustrates our architecture.
In the following sections, we walk you through the steps to set up the query editor and migrate Redash queries.


Prerequisites

To implement this solution, you must set up federated access to Amazon Redshift Query Editor V2 using your identity provider (IdP) services.

You can find more information about setting up federated access in related AWS Big Data blog posts.

Set up Amazon Redshift Query Editor V2

To set up the query editor, complete the following steps:

  1. Create an Amazon Redshift database group with read-only access.
  2. Create an IAM role for accessing Query Editor V2 in an AWS account and attach the required IAM policies based on your use case. For more information, refer to Configuring your AWS account.
  3. Create a trust relationship between your IdP and AWS.
  4. Add the principal tag sqlworkbench-team to the IAM role to share queries. For more information, refer to Sharing a query.
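
For illustration, the trust relationship for a SAML-federated role typically looks like the following policy. The IdP provider name is a placeholder for your own setup:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": "arn:aws:iam::<account>:saml-provider/<your-idp>"
                },
                "Action": "sts:AssumeRoleWithSAML",
                "Condition": {
                    "StringEquals": {
                        "SAML:aud": "https://signin.aws.amazon.com/saml"
                    }
                }
            }
        ]
    }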

Migrate Redash queries to Amazon Redshift Query Editor V2

In this section, we walk you through different ways to migrate your Redash queries to the Amazon Redshift Query Editor V2.

Query without parameters

Querying without parameters is straightforward: copy your query from Redash and enter it in the query editor.

  1. In Redash, navigate to the saved query and choose Edit Source.
  2. Copy the source query.
  3. In Amazon Redshift Query Editor V2, enter the query into the editor, choose the Save icon, and give your query a title.

Query with parameters

In Redash, a string between {{ }} is treated as a parameter, but Amazon Redshift Query Editor V2 uses ${ } to identify a parameter. To migrate queries with parameters, follow the same steps, but replace {{ with ${ and }} with }.
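
For example, a hypothetical query filtering on a region parameter would change as follows (the table and parameter names are illustrative):

    -- Redash version (parameters use double curly braces):
    -- SELECT * FROM payments WHERE region = '{{region}}';

    -- Query Editor V2 version (parameters use ${ }):
    SELECT *
    FROM payments
    WHERE region = '${region}';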

The following screenshot shows an example query in Redash.

The following screenshot shows the same query in Amazon Redshift Query Editor V2.


Multi-part query to a Query Editor V2 notebook

For a multi-part query, copy the query from each section of a Redash dashboard and add it to a notebook. The notebook in Amazon Redshift Query Editor V2 runs queries successively. You can also add a description for your query.

The following screenshot shows an example query on the Redash dashboard.
The following screenshot shows the query in an Amazon Redshift Query Editor V2 notebook.


Conclusion

In this post, we demonstrated how we set up Amazon Redshift Query Editor V2 with SSO and Amazon Redshift federated access, and migrated our customers from Redash to Amazon Redshift Query Editor V2. This solution reduced our operational cost of maintaining a third-party application and its infrastructure.

If you have similar use cases and need to provide a web-based tool to your customers to explore data on your Amazon Redshift cluster, consider using Amazon Redshift Query Editor V2.

Appendix: Customer IAM policies

In this section, we provide the code for the IAM policies we attached to the SAML IAM role to restrict user access to other AWS services within our AWS account:

  • query-editor-credentials-policy – In the following code, provide your Region, account, and cluster parameters to grant access to Amazon Redshift to get cluster credentials, create users, and allow users to join groups. The dbname and dbuser resource ARNs shown for GetClusterCredentials follow the standard pattern and should be adjusted to your cluster:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": "redshift:GetClusterCredentials",
                "Resource": [
                    "arn:aws:redshift:<region>:<account>:dbname:<cluster>/*",
                    "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${redshift:DbUser}"
                ],
                "Effect": "Allow"
            },
            {
                "Action": "redshift:JoinGroup",
                "Resource": "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/payments_ro_users",
                "Effect": "Allow"
            },
            {
                "Action": "redshift:DescribeClusters",
                "Resource": "arn:aws:redshift:<region>:<account>:cluster:<cluster>",
                "Effect": "Allow"
            },
            {
                "Action": "redshift:CreateClusterUser",
                "Resource": "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${redshift:DbUser}",
                "Effect": "Allow"
            }
        ]
    }
  • query-editor-access-policy – See the following code:
    "Version": "2012-10-17",
    "Statement": [
            "Action": "redshift:DescribeClusters",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "RedshiftPermissions"
            "Condition": {
                "StringEquals": {
                    "secretsmanager:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
            "Action": [
            "Resource": "arn:aws:secretsmanager:::sqlworkbench!",
            "Effect": "Allow",
            "Sid": "SecretsManagerPermissions"
            "Condition": {
                "StringEquals": {
                    "aws:CalledViaLast": "sqlworkbench.amazonaws.com"
            "Action": "tag:GetResources",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "ResourceGroupsTaggingPermissions"
            "Action": [
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2NonResourceLevelPermissions"
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/sqlworkbench-resource-owner": "${aws:userid}"
            "Action": [
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2CreateOwnedResourcePermissions"
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
            "Action": [
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2OwnerSpecificPermissions"
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}",
                    "aws:RequestTag/sqlworkbench-resource-owner": "${aws:userid}"
                "ForAllValues:StringEquals": {
                    "aws:TagKeys": "sqlworkbench-resource-owner"
            "Action": "sqlworkbench:TagResource",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2TagOnlyUserIdPermissions"
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-team": "${aws:PrincipalTag/sqlworkbench-team}"
            "Action": [
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2TeamReadAccessPermissions"
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}",
                    "aws:RequestTag/sqlworkbench-team": "${aws:PrincipalTag/sqlworkbench-team}"
            "Action": "sqlworkbench:TagResource",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2TagOnlyTeamPermissions"
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
                "ForAllValues:StringEquals": {
                    "aws:TagKeys": "sqlworkbench-team"
            "Action": "sqlworkbench:UntagResource",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2UntagOnlyTeamPermissions"
  • query-editor-notebook-policy – See the following code:
    "Version": "2012-10-17",
    "Statement": [
            "Action": [
            "Resource": "*",
            "Effect": "Allow"
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/sqlworkbench-resource-owner": "${aws:userid}"
            "Action": [
            "Resource": "*",
            "Effect": "Allow"
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
            "Action": [
            "Resource": "*",
            "Effect": "Allow"
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-team": "${aws:PrincipalTag/sqlworkbench-team}"
            "Action": [
            "Resource": "*",
            "Effect": "Allow"

About the Authors

Mohammad Nejad

Mohammad Nejad leads the AWS Payments Data Platform team. He has experience leading teams, architecting designs, implementing solutions, and launching products. Currently, his team focuses on building a modern data platform on AWS to provide a complete solution for processing, analyzing, and presenting data.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems in order to deliver solutions that exceed expectations.

Mohamed Shaaban is a Senior Software Engineer in Amazon Redshift and is based in Berlin, Germany. He has over 12 years of experience in software engineering. He is passionate about cloud services and building solutions that delight customers. Outside of work, he is an amateur photographer who loves to explore and capture unique moments.

How SafetyCulture scales unpredictable dbt Cloud workloads in a cost-effective manner with Amazon Redshift

Post Syndicated from Anish Moorjani original https://aws.amazon.com/blogs/big-data/how-safetyculture-scales-unpredictable-dbt-cloud-workloads-in-a-cost-effective-manner-with-amazon-redshift/

This post is co-written by Anish Moorjani, Data Engineer at SafetyCulture.

SafetyCulture is a global technology company that puts the power of continuous improvement into everyone’s hands. Its operations platform unlocks the power of observation at scale, giving leaders visibility and workers a voice in driving quality, efficiency, and safety improvements.

Amazon Redshift is a fully managed data warehouse service that tens of thousands of customers use to manage analytics at scale. Together with price-performance, Amazon Redshift enables you to use your data to acquire new insights for your business and customers while keeping costs low.

In this post, we share the solution SafetyCulture used to scale unpredictable dbt Cloud workloads in a cost-effective manner with Amazon Redshift.

Use case

SafetyCulture runs an Amazon Redshift provisioned cluster to support unpredictable and predictable workloads. A source of unpredictable workloads is dbt Cloud, which SafetyCulture uses to manage data transformations in the form of models. Whenever models are created or modified, a dbt Cloud CI job is triggered to test the models by materializing the models in Amazon Redshift. To balance the needs of unpredictable and predictable workloads, SafetyCulture used Amazon Redshift workload management (WLM) to flexibly manage workload priorities.

With plans for further growth in dbt Cloud workloads, SafetyCulture needed a solution that does the following:

  • Caters for unpredictable workloads in a cost-effective manner
  • Separates unpredictable workloads from predictable workloads to scale compute resources independently
  • Continues to allow models to be created and modified based on production data

Solution overview

The solution SafetyCulture used comprises Amazon Redshift Serverless and Amazon Redshift Data Sharing, along with the existing Amazon Redshift provisioned cluster.

Amazon Redshift Serverless caters to unpredictable workloads in a cost-effective manner because compute cost is not incurred when there is no workload. You pay only for what you use. In addition, moving unpredictable workloads into a separate Amazon Redshift data warehouse allows each Amazon Redshift data warehouse to scale resources independently.

Amazon Redshift Data Sharing enables data access across Amazon Redshift data warehouses without having to copy or move data. Therefore, when a workload is moved from one Amazon Redshift data warehouse to another, the workload can continue to access data in the initial Amazon Redshift data warehouse.

The following figure shows the solution and workflow steps:

  1. We create a serverless instance to cater for unpredictable workloads. Refer to Managing Amazon Redshift Serverless using the console for setup steps.
  2. We create a datashare called prod_datashare to allow the serverless instance access to data in the provisioned cluster. Refer to Getting started data sharing using the console for setup steps. Database names are identical to allow queries with full path notation database_name.schema_name.object_name to run seamlessly in both data warehouses.
  3. dbt Cloud connects to the serverless instance, and models that are created or modified are tested by materializing them in the default database dev, in either each user's personal schema or a pull request-related schema. Instead of dev, you can use a different database designated for testing. Refer to Connect dbt Cloud to Redshift for setup steps.
  4. You can query materialized models in the serverless instance with materialized models in the provisioned cluster to validate changes. After you validate the changes, you can implement models in the serverless instance in the provisioned cluster.
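
As a sketch of Steps 1 and 2, the datashare setup on the provisioned cluster and the serverless instance looks like the following SQL. The schema name raw_data and the namespace IDs are placeholders; prod_datashare matches the name above:

    -- On the provisioned (producer) cluster:
    CREATE DATASHARE prod_datashare;
    ALTER DATASHARE prod_datashare ADD SCHEMA raw_data;
    ALTER DATASHARE prod_datashare ADD ALL TABLES IN SCHEMA raw_data;
    GRANT USAGE ON DATASHARE prod_datashare TO NAMESPACE '<serverless-namespace-id>';

    -- On the serverless (consumer) instance, using an identical database name:
    CREATE DATABASE prod_db FROM DATASHARE prod_datashare OF NAMESPACE '<provisioned-namespace-id>';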


Solution results

SafetyCulture carried out the steps to create the serverless instance and datashare, with integration to dbt Cloud, with ease. SafetyCulture also successfully ran its dbt project with all seeds, models, and snapshots materialized into the serverless instance via run commands from the dbt Cloud IDE and dbt Cloud CI jobs.

Regarding performance, SafetyCulture observed dbt Cloud workloads completing on average 60% faster in the serverless instance. Better performance could be attributed to two areas:

  • Amazon Redshift Serverless measures compute capacity using Redshift Processing Units (RPUs). Because it costs the same to run 64 RPUs in 10 minutes and 128 RPUs in 5 minutes, having a higher number of RPUs to complete a workload sooner was preferred.
  • With dbt Cloud workloads isolated on the serverless instance, dbt Cloud was configured with more threads to allow materialization of more models at once.

To estimate cost, you can perform a comparison. 128 RPUs provide approximately the same amount of memory as an ra3.4xlarge 21-node provisioned cluster. In US East (N. Virginia), the cost of running a serverless instance with 128 RPUs is $48 hourly ($0.375 per RPU hour * 128 RPUs). In the same Region, the cost of running an ra3.4xlarge 21-node provisioned cluster on demand is $68.46 hourly ($3.26 per node hour * 21 nodes). Therefore, an accumulated hour of unpredictable workloads on a serverless instance is 29% more cost-effective than an on-demand provisioned cluster. Rerun these calculations for future cost estimations, because prices may change over time.


Key learnings

SafetyCulture had two key learnings to better integrate dbt with Amazon Redshift, which can be helpful for similar implementations.

First, when integrating dbt with an Amazon Redshift datashare, configure INCLUDENEW=True to ease management of database objects in a schema:
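
For example, assuming the datashare and schema names used earlier, INCLUDENEW can be turned on per schema:

    ALTER DATASHARE prod_datashare SET INCLUDENEW = TRUE FOR SCHEMA raw_data;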


For example, assume the model customers.sql is materialized by dbt as the view customers. Next, customers is added to a datashare. When customers.sql is modified and rematerialized by dbt, dbt creates a new view with a temporary name, drops customers, and renames the new view to customers. Although the new view carries the same name, it’s a new database object that wasn’t added to the datashare. Therefore, customers is no longer found in the datashare.

Configuring INCLUDENEW=True allows new database objects to be automatically added to the datashare. An alternative to configuring INCLUDENEW=True and providing more granular control is the use of dbt post-hook.

Second, when integrating dbt with more than one Amazon Redshift data warehouse, define sources with database to aid dbt in evaluating the right database.

For example, assume a dbt project is used across two dbt Cloud environments to isolate production and test workloads. The dbt Cloud environment for production workloads is configured with the default database prod_db and connects to a provisioned cluster. The dbt Cloud environment for test workloads is configured with the default database dev and connects to a serverless instance. In addition, the provisioned cluster contains the table prod_db.raw_data.sales, which is made available to the serverless instance via a datashare as prod_db.raw_data.sales.

When dbt compiles a model containing the source {{ source('raw_data', 'sales') }}, the source is evaluated as database.raw_data.sales. If database is not defined for sources, dbt sets the database to the configured environment’s default database. Therefore, the dbt Cloud environment connecting to the provisioned cluster evaluates the source as prod_db.raw_data.sales, while the dbt Cloud environment connecting to the serverless instance evaluates the source as dev.raw_data.sales, which is incorrect.

Defining database for sources allows dbt to consistently evaluate the right database across different dbt Cloud environments, because it removes ambiguity.
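
As a sketch, a dbt source definition that pins the database explicitly might look like the following. The file layout is illustrative; the raw_data schema and sales table match the example above:

    # models/sources.yml
    version: 2

    sources:
      - name: raw_data
        database: prod_db   # pin the database so every environment resolves the same object
        schema: raw_data
        tables:
          - name: sales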


Conclusion

After testing Amazon Redshift Serverless and Data Sharing, SafetyCulture is satisfied with the result and has started productionalizing the solution.

“The PoC showed the vast potential of Redshift Serverless in our infrastructure,” says Thiago Baldim, Data Engineer Team Lead at SafetyCulture. “We could migrate our pipelines to support Redshift Serverless with simple changes to the standards we were using in our dbt. The outcome provided a clear picture of the potential implementations we could do, decoupling the workload entirely by teams and users and providing the right level of computation power that is fast and reliable.”

Although this post specifically targets unpredictable workloads from dbt Cloud, the solution is also relevant for other unpredictable workloads, including ad hoc queries from dashboards. Start exploring Amazon Redshift Serverless for your unpredictable workloads today.

About the authors

Anish Moorjani is a Data Engineer in the Data and Analytics team at SafetyCulture. He helps SafetyCulture’s analytics infrastructure scale with the exponential increase in the volume and variety of data.

Randy Chng is an Analytics Solutions Architect at Amazon Web Services. He works with customers to accelerate the solution of their key business problems.

Simplify data loading into Type 2 slowly changing dimensions in Amazon Redshift

Post Syndicated from Vaidy Kalpathy original https://aws.amazon.com/blogs/big-data/simplify-data-loading-into-type-2-slowly-changing-dimensions-in-amazon-redshift/

Thousands of customers rely on Amazon Redshift to build data warehouses to accelerate time to insights with fast, simple, and secure analytics at scale and analyze data from terabytes to petabytes by running complex analytical queries. Organizations create data marts, which are subsets of the data warehouse and usually oriented for gaining analytical insights specific to a business unit or team. The star schema is a popular data model for building data marts.

In this post, we show how to simplify data loading into a Type 2 slowly changing dimension in Amazon Redshift.

Star schema and slowly changing dimension overview

A star schema is the simplest type of dimensional model, in which the center of the star can have one fact table and a number of associated dimension tables. A dimension is a structure that captures reference data along with associated hierarchies, while a fact table captures different values and metrics that can be aggregated by dimensions. Dimensions provide answers to exploratory business questions by allowing end-users to slice and dice data in a variety of ways using familiar SQL commands.

Whereas operational source systems contain only the latest version of master data, the star schema enables time travel queries to reproduce dimension attribute values on past dates when the fact transaction or event actually happened. The star schema data model allows analytical users to query historical data tying metrics to corresponding dimensional attribute values over time. Time travel is possible because dimension tables contain the exact version of the associated attributes for different time ranges. Relative to the metrics data that changes on a daily or even hourly basis, the dimension attributes change less frequently. Therefore, dimensions in a star schema that keep track of changes over time are referred to as slowly changing dimensions (SCDs).

Data loading is one of the key aspects of maintaining a data warehouse. In a star schema data model, the central fact table is dependent on the surrounding dimension tables. This is captured in the form of primary key-foreign key relationships, where the dimension table primary keys are referred by foreign keys in the fact table. In the case of Amazon Redshift, uniqueness, primary key, and foreign key constraints are not enforced. However, declaring them will help the optimizer arrive at optimal query plans, provided that the data loading processes enforce their integrity. As part of data loading, the dimension tables, including SCD tables, get loaded first, followed by the fact tables.

SCD population challenge

Populating an SCD dimension table involves merging data from multiple source tables, which are usually normalized. SCD tables contain a pair of date columns (effective and expiry dates) that represent the record's validity date range. Changes are inserted as new active records effective from the date of data loading, while simultaneously expiring the current active record on a previous day. During each data load, incoming change records are matched against existing active records by comparing each attribute value to determine whether existing records have changed, were deleted, or are new records coming in.

In this post, we demonstrate how to simplify data loading into a dimension table with the following methods:

  • Using Amazon Simple Storage Service (Amazon S3) to host the initial and incremental data files from source system tables
  • Accessing S3 objects using Amazon Redshift Spectrum to carry out data processing to load native tables within Amazon Redshift
  • Creating views with window functions to replicate the source system version of each table within Amazon Redshift
  • Joining source table views to project attributes matching with dimension table schema
  • Applying incremental data to the dimension table, bringing it up to date with source-side changes

Solution overview

In a real-world scenario, records from source system tables are ingested on a periodic basis to an Amazon S3 location before being loaded into star schema tables in Amazon Redshift.

For this demonstration, data from two source tables, customer_master and customer_address, are combined to populate the target dimension table dim_customer, which is the customer dimension table.

The source tables customer_master and customer_address share the same primary key, customer_id, and are joined on it to fetch one record per customer_id along with attributes from both tables. row_audit_ts contains the latest timestamp at which the particular source record was inserted or last updated. This column helps identify the change records since the last data extraction.

rec_source_status is an optional column that indicates if the corresponding source record was inserted, updated, or deleted. This is applicable in cases where the source system itself provides the changes and populates rec_source_status appropriately.

The following figure provides the schema of the source and target tables.

Let’s look closer at the schema of the target table, dim_customer. It contains different categories of columns:

  • Keys – It contains two types of keys:
    • customer_sk is the primary key of this table. It is also called the surrogate key and has a unique value that is monotonically increasing.
    • customer_id is the source primary key and provides a reference back to the source system record.
  • SCD2 metadata – rec_eff_dt and rec_exp_dt indicate the state of the record. These two columns together define the validity of the record. The value in rec_exp_dt will be set as ‘9999-12-31’ for presently active records.
  • Attributes – Includes first_name, last_name, employer_name, email_id, city, and country.

Data loading into an SCD table involves a first-time bulk data load, referred to as the initial data load. This is followed by continuous or regular data loading, referred to as an incremental data load, to keep the records up to date with changes in the source tables.

To demonstrate the solution, we walk through the following steps for initial data load (1–7) and incremental data load (8–12):

  1. Land the source data files in an Amazon S3 location, using one subfolder per source table.
  2. Use an AWS Glue crawler to parse the data files and register tables in the AWS Glue Data Catalog.
  3. Create an external schema in Amazon Redshift to point to the AWS Glue database containing these tables.
  4. In Amazon Redshift, create one view per source table to fetch the latest version of the record for each primary key (customer_id) value.
  5. Create the dim_customer table in Amazon Redshift, which contains attributes from all relevant source tables.
  6. Create a view in Amazon Redshift joining the source table views from Step 4 to project the attributes modeled in the dimension table.
  7. Populate the initial data from the view created in Step 6 into the dim_customer table, generating customer_sk.
  8. Land the incremental data files for each source table in their respective Amazon S3 location.
  9. In Amazon Redshift, create a temporary table to accommodate the change-only records.
  10. Join the view from Step 6 and dim_customer and identify change records comparing the combined hash value of attributes. Populate the change records into the temporary table with an I, U, or D indicator.
  11. Update rec_exp_dt in dim_customer for all U and D records from the temporary table.
  12. Insert records into dim_customer, querying all I and U records from the temporary table.
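
As an illustrative outline of the incremental load (Steps 9–12), the following SQL uses a hypothetical combined source view name (vw_dim_customer_src) and abbreviated column lists, and uses FNV_HASH as one way to compare attribute values. Treat this as a sketch of the approach rather than the exact implementation:

    -- Steps 9 and 10: capture change-only records with an I/U/D indicator
    CREATE TEMP TABLE tmp_customer_changes AS
    SELECT src.customer_id,
           src.first_name,
           src.email_id,
           CASE
               WHEN tgt.customer_id IS NULL THEN 'I'      -- new customer
               WHEN src.rec_source_status = 'D' THEN 'D'  -- deleted at source
               ELSE 'U'                                   -- changed attributes
           END AS iud_ind
    FROM rs_dim_blog.vw_dim_customer_src src
    LEFT JOIN rs_dim_blog.dim_customer tgt
           ON tgt.customer_id = src.customer_id
          AND tgt.rec_exp_dt = '9999-12-31'
    WHERE tgt.customer_id IS NULL
       OR src.rec_source_status = 'D'
       OR FNV_HASH(src.first_name || '|' || src.email_id)
          <> FNV_HASH(tgt.first_name || '|' || tgt.email_id);

    -- Step 11: expire the current version of updated or deleted records
    UPDATE rs_dim_blog.dim_customer
    SET rec_exp_dt = DATEADD(day, -1, CURRENT_DATE)
    FROM tmp_customer_changes chg
    WHERE dim_customer.customer_id = chg.customer_id
      AND dim_customer.rec_exp_dt = '9999-12-31'
      AND chg.iud_ind IN ('U', 'D');

    -- Step 12: insert new active versions for inserted and updated records
    INSERT INTO rs_dim_blog.dim_customer
    SELECT ROW_NUMBER() OVER (ORDER BY customer_id)
           + (SELECT COALESCE(MAX(customer_sk), 0) FROM rs_dim_blog.dim_customer),
           customer_id,
           first_name,
           email_id,
           CURRENT_DATE AS rec_eff_dt,
           '9999-12-31'::date AS rec_exp_dt
    FROM tmp_customer_changes
    WHERE iud_ind IN ('I', 'U');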


Prerequisites

Before you get started, make sure you meet the necessary prerequisites.

Land data from source tables

Create separate subfolders for each source table in an S3 bucket and place the initial data files within the respective subfolder. In the following image, the initial data files for customer_master and customer_address are made available within two different subfolders. To try out the solution, you can use customer_master_with_ts.csv and customer_address_with_ts.csv as initial data files.

It’s important to include an audit timestamp (row_audit_ts) column that indicates when each record was inserted or last updated. As part of incremental data loading, rows with the same primary key value (customer_id) can arrive more than once. The row_audit_ts column helps identify the latest version of such records for a given customer_id to be used for further processing.

Register source tables in the AWS Glue Data Catalog

We use an AWS Glue crawler to infer metadata from delimited data files like the CSV files used in this post. For instructions on getting started with an AWS Glue crawler, refer to Tutorial: Adding an AWS Glue crawler.

Create an AWS Glue crawler and point it to the Amazon S3 location that contains the source table subfolders, within which the associated data files are placed. When you’re creating the AWS Glue crawler, create a new database named rs-dimension-blog. The following screenshots show the AWS Glue crawler configuration chosen for our data files.

Note that for the Set output and scheduling section, the advanced options are left unchanged.

Running this crawler should create the following tables within the rs-dimension-blog database:

  • customer_address
  • customer_master

Create schemas in Amazon Redshift

First, create an AWS Identity and Access Management (IAM) role named rs-dim-blog-spectrum-role. For instructions, refer to Create an IAM role for Amazon Redshift.

The IAM role has Amazon Redshift as the trusted entity, and the permissions policy includes AmazonS3ReadOnlyAccess and AWSGlueConsoleFullAccess, because we’re using the AWS Glue Data Catalog. Then associate the IAM role with the Amazon Redshift cluster or endpoint.

Alternatively, you can set the IAM role as the default for your Amazon Redshift cluster or endpoint. If you do so, in the following create external schema command, pass the iam_role parameter as iam_role default.

Now, open Amazon Redshift Query Editor V2 and create an external schema passing the newly created IAM role and specifying the database as rs-dimension-blog. The database name rs-dimension-blog is the one created in the Data Catalog as part of configuring the crawler in the preceding section. See the following code:

create external schema spectrum_dim_blog 
from data catalog 
database 'rs-dimension-blog' 
iam_role 'arn:aws:iam::<accountid>:role/rs-dim-blog-spectrum-role';

Check if the tables registered in the Data Catalog in the preceding section are visible from within Amazon Redshift:

select * 
from spectrum_dim_blog.customer_master 
limit 10;

select * 
from spectrum_dim_blog.customer_address 
limit 10;

Each of these queries will return 10 rows from the respective Data Catalog tables.

Create another schema in Amazon Redshift to host the table, dim_customer:

create schema rs_dim_blog;

Create views to fetch the latest records from each source table

Create a view for the customer_master table, naming it vw_cust_mstr_latest:

create view rs_dim_blog.vw_cust_mstr_latest as with rows_numbered as (
  select *, row_number() over(
      partition by customer_id 
      order by row_audit_ts desc
    ) as rnum 
  from spectrum_dim_blog.customer_master
) 
select * from rows_numbered 
where rnum = 1 with no schema binding;

The preceding query uses row_number, which is a window function provided by Amazon Redshift. Using window functions enables you to create analytic business queries more efficiently. Window functions operate on a partition of a result set, and return a value for every row in that window. The row_number window function determines the ordinal number of the current row within a group of rows, counting from 1, based on the ORDER BY expression in the OVER clause. By including the PARTITION BY clause as customer_id, groups are created for each value of customer_id and ordinal numbers are reset for each group.
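To see the same pattern outside Redshift, here is a small sketch using Python's built-in sqlite3 module (SQLite 3.25 and later also supports the row_number window function; the table and data below are made up for illustration):

```python
import sqlite3

# In-memory database; table and column names mirror those used in the post.
con = sqlite3.connect(":memory:")
con.execute(
    "create table customer_master (customer_id int, first_name text, row_audit_ts text)"
)
con.executemany(
    "insert into customer_master values (?, ?, ?)",
    [
        (100, "John",   "2022-07-01 10:00:00"),
        (100, "Johnny", "2022-07-03 08:00:00"),  # later version wins
        (101, "Jane",   "2022-07-02 12:00:00"),
    ],
)

# row_number resets per customer_id; rnum = 1 picks the latest audit timestamp.
latest = con.execute("""
    with rows_numbered as (
        select *, row_number() over (
            partition by customer_id order by row_audit_ts desc
        ) as rnum
        from customer_master
    )
    select customer_id, first_name from rows_numbered where rnum = 1
    order by customer_id
""").fetchall()
print(latest)  # [(100, 'Johnny'), (101, 'Jane')]
```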

Create a view for the customer_address table, naming it vw_cust_addr_latest:

create view rs_dim_blog.vw_cust_addr_latest as with rows_numbered as (
  select *, row_number() over(
      partition by customer_id 
      order by row_audit_ts desc
    ) as rnum 
  from spectrum_dim_blog.customer_address
) 
select * from rows_numbered 
where rnum = 1 with no schema binding;

Both view definitions use the row_number window function of Amazon Redshift, ordering the records by descending order of the row_audit_ts column (the audit timestamp column). The condition rnum=1 fetches the latest record for each customer_id value.

Create the dim_customer table in Amazon Redshift

Create dim_customer as an internal table in Amazon Redshift within the rs_dim_blog schema. The dimension table includes the column customer_sk, which acts as the surrogate key and enables us to capture a time-sensitive version of each customer record. The validity period for each record is defined by the columns rec_eff_dt and rec_exp_dt, representing record effective date and record expiry date, respectively. See the following code:

create table rs_dim_blog.dim_customer (
  customer_sk bigint, 
  customer_id bigint, 
  first_name varchar(100), 
  last_name varchar(100), 
  employer_name varchar(100), 
  email_id varchar(100), 
  city varchar(100), 
  country varchar(100), 
  rec_eff_dt date, 
  rec_exp_dt date
) diststyle auto;

Create a view to consolidate the latest version of source records

Create the view vw_dim_customer_src, which consolidates the latest records from both source tables using left outer join, keeping them ready to be populated into the Amazon Redshift dimension table. This view fetches data from the latest views defined in the section “Create views to fetch the latest records from each source table”:

create view rs_dim_blog.vw_dim_customer_src as 
select 
  m.customer_id, m.first_name, m.last_name, m.employer_name, 
  a.email_id, a.city, a.country 
from 
  rs_dim_blog.vw_cust_mstr_latest as m 
  left join rs_dim_blog.vw_cust_addr_latest as a on m.customer_id = a.customer_id 
order by 
  m.customer_id with no schema binding;

At this point, this view fetches the initial data for loading into the dim_customer table that we are about to create. In your use-case, use a similar approach to create and join the required source table views to populate your target dimension table.

Populate initial data into dim_customer

Populate the initial data into the dim_customer table by querying the view vw_dim_customer_src. Because this is the initial data load, running row numbers generated by the row_number window function will suffice to populate a unique value in the customer_sk column starting from 1:

insert into rs_dim_blog.dim_customer 
select row_number() over() as customer_sk, 
  customer_id, first_name, last_name, employer_name, email_id, city, country, 
  cast('2022-07-01' as date) rec_eff_dt, 
  cast('9999-12-31' as date) rec_exp_dt 
from rs_dim_blog.vw_dim_customer_src;

In this query, we have specified ’2022-07-01’ as the value in rec_eff_dt for all initial data records. For your use-case, you can modify this date value as appropriate to your situation.
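The initial-load logic above (sequential surrogate keys plus an open-ended validity window) can be sketched in plain Python on hypothetical rows:

```python
from datetime import date

# Consolidated latest source records, as vw_dim_customer_src would return them.
src = [
    {"customer_id": 100, "first_name": "John"},
    {"customer_id": 101, "first_name": "Jane"},
]

# Assign surrogate keys 1..n and an open-ended validity period,
# just as the initial load does with row_number() and the two cast dates.
dim_customer = [
    {**row,
     "customer_sk": sk,
     "rec_eff_dt": date(2022, 7, 1),
     "rec_exp_dt": date(9999, 12, 31)}
    for sk, row in enumerate(src, start=1)
]
print([r["customer_sk"] for r in dim_customer])  # [1, 2]
```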

The preceding steps complete the initial data loading into the dim_customer table. In the next steps, we proceed with populating incremental data.

Land ongoing change data files in Amazon S3

After the initial load, the source systems provide data files on an ongoing basis, either containing only new and change records or a full extract containing all records for a particular table.

You can use the sample files customer_master_with_ts_incr.csv and customer_address_with_ts_incr.csv, which contain changed as well as new records. These incremental files need to be placed in the same Amazon S3 location as the initial data files (see the section "Land data from source tables"), which causes the corresponding Redshift Spectrum tables to automatically read the additional rows.

If you used the sample file for customer_master, after adding the incremental files, the following query shows the initial as well as incremental records:

select * 
from spectrum_dim_blog.customer_master 
order by customer_id, row_audit_ts;

In case of full extracts, we can identify deletes occurring in the source system tables by comparing the previous and current versions and looking for missing records. In case of change-only extracts where the rec_source_status column is present, its value will help us identify deleted records. In either case, land the ongoing change data files in the respective Amazon S3 locations.
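For the full-extract case, identifying deletes boils down to a set difference on the primary key values; a minimal sketch with made-up IDs:

```python
# With full extracts, deleted keys are those present in the previous extract
# but missing from the current one. The customer_id values are illustrative.
previous_ids = {100, 101, 102}
current_ids  = {100, 102, 103}

deleted_ids = previous_ids - current_ids   # in previous extract, gone now
new_ids     = current_ids - previous_ids   # first-time arrivals
print(sorted(deleted_ids), sorted(new_ids))  # [101] [103]
```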

For this example, we have uploaded the incremental data for the customer_master and customer_address source tables with a few customer_id records receiving updates and a few new records being added.

Create a temporary table to capture change records

Create the temporary table temp_dim_customer to store all changes that need to be applied to the target dim_customer table:

create temp table temp_dim_customer (
  customer_sk bigint, 
  customer_id bigint, 
  first_name varchar(100), 
  last_name varchar(100), 
  employer_name varchar(100), 
  email_id varchar(100), 
  city varchar(100), 
  country varchar(100), 
  rec_eff_dt date, 
  rec_exp_dt date, 
  iud_operation character(1)
);

Populate the temporary table with new and changed records

This is a multi-step process that can be combined into a single, complex SQL statement. Complete the following steps:

  1. Fetch the latest version of all customer attributes by querying the view vw_dim_customer_src:
select customer_id, first_name, last_name, employer_name, email_id, city, country, 
  sha2(
    coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
  ) as hash_value, 
  current_date rec_eff_dt, 
  cast('9999-12-31' as date) rec_exp_dt 
from rs_dim_blog.vw_dim_customer_src;

Amazon Redshift offers hashing functions such as sha2, which converts a variable length string input into a fixed length character output. The output string is a text representation of the hexadecimal value of the checksum with the specified number of bits. In this case, we pass a concatenated set of customer attributes whose change we want to track, specifying the number of bits as 512. We’ll use the output of the hash function to determine if any of the attributes have undergone a change. This dataset will be called newver (new version).
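A rough Python equivalent of this change-detection hash (using hashlib.sha512 on the same coalesce-and-concatenate pattern; attribute names and sample values are illustrative):

```python
import hashlib

def attr_hash(row, attrs):
    """Concatenate coalesced attribute values (None -> '') and hash them,
    mirroring the sha2(coalesce(...) || ..., 512) expression in the post."""
    concat = "".join(row.get(a) or "" for a in attrs)
    return hashlib.sha512(concat.encode("utf-8")).hexdigest()

ATTRS = ["first_name", "last_name", "employer_name", "email_id", "city", "country"]

old = {"first_name": "John", "city": "Seattle"}
new = {"first_name": "John", "city": "Portland"}

# Differing hashes signal that at least one tracked attribute changed.
print(attr_hash(old, ATTRS) != attr_hash(new, ATTRS))  # True
```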

Because we landed the ongoing change data in the same location as the initial data files, the records retrieved from the preceding query (in newver) include all records, even the unchanged ones. But because of the definition of the view vw_dim_customer_src, we get only one record per customer_id, which is its latest version based on row_audit_ts.

  2. In a similar manner, retrieve the latest version of all customer records from dim_customer, which are identified by rec_exp_dt=‘9999-12-31’. While doing so, also retrieve the sha2 value of all customer attributes available in dim_customer:
select customer_id, 
  sha2(
    coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
  ) as hash_value 
from rs_dim_blog.dim_customer 
where rec_exp_dt = '9999-12-31';

This dataset will be called oldver (old or existing version).

  3. Identify the current maximum surrogate key value from the dim_customer table:
select max(customer_sk) as maxval 
from rs_dim_blog.dim_customer;

This value (maxval) will be added to the row_number before being used as the customer_sk value for the change records that need to be inserted.

  4. Perform a full outer join of the old version of records (oldver) and the new version (newver) of records on the customer_id column. Then compare the old and new hash values generated by the sha2 function to determine if the change record is an insert, update, or delete:
case when oldver.customer_id is null then 'I'
when newver.customer_id is null then 'D'
when oldver.hash_value != newver.hash_value then 'U'
else 'N' end as iud_op

We tag the records as follows:

  • If the customer_id is non-existent in the oldver dataset (oldver.customer_id is null), it's tagged as an insert ('I').
  • Otherwise, if the customer_id is non-existent in the newver dataset (newver.customer_id is null), it's tagged as a delete ('D').
  • Otherwise, if the old hash_value and new hash_value are different, these records represent an update ('U').
  • Otherwise, the record has not undergone any change and can be ignored or marked as not-to-be-processed ('N').

Make sure to modify the preceding logic if the source extract contains rec_source_status to identify deleted records.

Although sha2 maps a possibly infinite set of input strings to a finite set of output strings, the chance of a hash collision between the original and changed row values is extremely small. Instead of individually comparing each column value before and after, we compare the hash values generated by sha2 to conclude whether any of the attributes of the customer record have changed. For your use case, we recommend you choose a hash function that works for your data conditions after adequate testing. Alternatively, you can compare individual column values if no hash function satisfactorily meets your expectations.
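The full outer join plus case expression in step 4 can be mimicked in Python over two {customer_id: hash_value} maps (the IDs and hash strings below are illustrative only):

```python
def classify(oldver, newver):
    """Full-outer-join two {customer_id: hash_value} maps and tag each id
    as Insert, Update, Delete, or No-change, as in the SQL case expression."""
    ops = {}
    for cid in set(oldver) | set(newver):
        if cid not in oldver:
            ops[cid] = "I"           # only in the new version: insert
        elif cid not in newver:
            ops[cid] = "D"           # only in the old version: delete
        elif oldver[cid] != newver[cid]:
            ops[cid] = "U"           # hashes differ: update
        else:
            ops[cid] = "N"           # unchanged: not to be processed
    return ops

oldver = {100: "h1", 101: "h2", 102: "h3"}
newver = {100: "h1", 101: "hX", 103: "h4"}
print(sorted(classify(oldver, newver).items()))
# [(100, 'N'), (101, 'U'), (102, 'D'), (103, 'I')]
```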

  5. Combining the outputs from the preceding steps, let’s create the INSERT statement that captures only change records to populate the temporary table:
insert into temp_dim_customer (
  customer_sk, customer_id, first_name, 
  last_name, employer_name, email_id, 
  city, country, rec_eff_dt, rec_exp_dt, 
  iud_operation
) with newver as (
  select customer_id, first_name, last_name, employer_name, email_id, city, country, 
    sha2(
      coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
    ) as hash_value, 
    current_date rec_eff_dt, 
    cast('9999-12-31' as date) rec_exp_dt 
  from rs_dim_blog.vw_dim_customer_src
), 
oldver as (
  select customer_id, 
    sha2(
      coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
    ) as hash_value 
  from rs_dim_blog.dim_customer 
  where rec_exp_dt = '9999-12-31'
), 
maxsk as (
  select max(customer_sk) as maxval from rs_dim_blog.dim_customer
), 
allrecs as (
  select 
    coalesce(oldver.customer_id, newver.customer_id) as customer_id, 
    newver.first_name, newver.last_name, newver.employer_name, 
    newver.email_id, newver.city, newver.country, 
    newver.rec_eff_dt, newver.rec_exp_dt, 
    case when oldver.customer_id is null then 'I' when newver.customer_id is null then 'D' when oldver.hash_value != newver.hash_value then 'U' else 'N' end as iud_op 
  from 
    oldver full 
    outer join newver on oldver.customer_id = newver.customer_id
) 
select 
  (maxval + (row_number() over())) as customer_sk, 
  customer_id, first_name, last_name, employer_name, 
  email_id, city, country, rec_eff_dt, rec_exp_dt, iud_op 
from 
  allrecs, maxsk 
where 
  iud_op != 'N';

Expire updated customer records

With the temp_dim_customer table now containing only the change records (either ‘I’, ‘U’, or ‘D’), these changes can be applied to the target dim_customer table.

Let’s first fetch all records with values ‘U’ or ‘D’ in the iud_operation column. These are records that have either been deleted or updated in the source system. Because dim_customer is a slowly changing dimension, it needs to reflect the validity period of each customer record. In this case, we expire the presently active records that have been updated or deleted. We expire these records as of yesterday (by setting rec_exp_dt = current_date - 1), matching on the customer_id column:

update rs_dim_blog.dim_customer 
set rec_exp_dt = current_date - 1 
where customer_id in (
    select customer_id 
    from temp_dim_customer as t 
    where iud_operation in ('U', 'D')
  ) 
  and rec_exp_dt = '9999-12-31';

Insert new and changed records

As the last step, we need to insert the newer version of updated records along with all first-time inserts. These are indicated by ‘U’ and ‘I’, respectively, in the iud_operation column in the temp_dim_customer table:

insert into rs_dim_blog.dim_customer (
  customer_sk, customer_id, first_name, 
  last_name, employer_name, email_id, 
  city, country, rec_eff_dt, rec_exp_dt
) 
select customer_sk, customer_id, first_name, last_name, employer_name, 
  email_id, city, country, rec_eff_dt, rec_exp_dt 
from temp_dim_customer 
where iud_operation in ('I', 'U');

Depending on your SQL client settings, you might need to run a commit transaction; command to ensure that the preceding changes persist in Amazon Redshift.
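The whole expire-then-insert sequence can be sketched end-to-end in plain Python on hypothetical rows; this mirrors the logic of the two SQL statements, not the actual code:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)
today = date(2022, 8, 1)

# Current dimension: one active version per customer (rec_exp_dt = high date).
dim = [
    {"customer_sk": 1, "customer_id": 100, "city": "Seattle", "rec_eff_dt": date(2022, 7, 1), "rec_exp_dt": HIGH_DATE},
    {"customer_sk": 2, "customer_id": 101, "city": "Austin",  "rec_eff_dt": date(2022, 7, 1), "rec_exp_dt": HIGH_DATE},
]

# Change records staged in the temp table: 100 updated, 101 deleted, 102 inserted.
temp = [
    {"customer_sk": 3, "customer_id": 100, "city": "Portland", "rec_eff_dt": today, "rec_exp_dt": HIGH_DATE, "iud": "U"},
    {"customer_sk": 4, "customer_id": 101, "city": "Austin",   "rec_eff_dt": today, "rec_exp_dt": HIGH_DATE, "iud": "D"},
    {"customer_sk": 5, "customer_id": 102, "city": "Denver",   "rec_eff_dt": today, "rec_exp_dt": HIGH_DATE, "iud": "I"},
]

# Step 1: expire active rows for updated/deleted customers as of yesterday.
expired_ids = {t["customer_id"] for t in temp if t["iud"] in ("U", "D")}
for row in dim:
    if row["customer_id"] in expired_ids and row["rec_exp_dt"] == HIGH_DATE:
        row["rec_exp_dt"] = date(2022, 7, 31)  # current_date - 1

# Step 2: insert the new version of updated rows plus first-time inserts.
dim += [{k: v for k, v in t.items() if k != "iud"} for t in temp if t["iud"] in ("I", "U")]

active = sorted(r["customer_id"] for r in dim if r["rec_exp_dt"] == HIGH_DATE)
print(active)  # [100, 102] -- 101 was deleted, so it has no active version
```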

Check the final output

You can run the following query and see that the dim_customer table now contains both the initial data records plus the incremental data records, capturing multiple versions for those customer_id values that got changed as part of incremental data loading. The output also indicates that each record has been populated with appropriate values in rec_eff_dt and rec_exp_dt corresponding to the record validity period.

select * from rs_dim_blog.dim_customer 
order by customer_id, rec_eff_dt;

If you’re using the sample data files provided in this post, the preceding query returns the following records. Note that the values in customer_sk may not match exactly what is shown in the following table.

In this post, we only show the important SQL statements; the complete SQL code is available in load_scd2_sample_dim_customer.sql.

Clean up

If you no longer need the resources you created, you can delete them to prevent incurring additional charges.


Conclusion

In this post, you learned how to simplify data loading into Type-2 SCD tables in Amazon Redshift, covering both initial and incremental data loading. The approach handles multiple source tables populating a target dimension table, capturing the latest version of the source records as of each run.

Refer to Amazon Redshift data loading best practices for further materials and additional best practices, and see Updating and inserting new data for instructions to implement updates and inserts.

About the Author

Vaidy Kalpathy is a Senior Data Lab Solution Architect at AWS, where he helps customers modernize their data platforms and define end-to-end data strategies covering data ingestion, transformation, security, and visualization. He is passionate about working backward from business use cases, creating scalable and custom-fit architectures to help customers innovate using data analytics services on AWS.

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/how-gaming-companies-can-use-amazon-redshift-serverless-to-build-scalable-analytical-applications-faster-and-easier/

This post provides guidance on how to build scalable analytical solutions for gaming industry use cases using Amazon Redshift Serverless. It covers how to use a conceptual, logical architecture for some of the most popular gaming industry use cases like event analysis, in-game purchase recommendations, measuring player satisfaction, telemetry data analysis, and more. This post also discusses the art of the possible with newer innovations in AWS services around streaming, machine learning (ML), data sharing, and serverless capabilities.

Our gaming customers tell us that their key business objectives include the following:

  • Increased revenue from in-app purchases
  • High average revenue per user and lifetime value
  • Improved stickiness with better gaming experience
  • Improved event productivity and high ROI

Our gaming customers also tell us that while building analytics solutions, they want the following:

  • Low-code or no-code model – Out-of-the-box solutions are preferred to building customized solutions.
  • Decoupled and scalable – Serverless, auto scaled, and fully managed services are preferred over manually managed services. Each service should be easily replaceable, enhanced with little or no dependency. Solutions should be flexible to scale up and down.
  • Portability to multiple channels – Solutions should be compatible with most endpoint channels, such as PC, mobile, and gaming platforms.
  • Flexible and easy to use – The solutions should provide less restrictive, easy-to-access, and ready-to-use data. They should also provide optimal performance with low or no tuning.

Analytics reference architecture for gaming organizations

In this section, we discuss how gaming organizations can use a data hub architecture to address the analytical needs of an enterprise, which requires the same data at multiple levels of granularity and different formats, and is standardized for faster consumption. A data hub is a center of data exchange that constitutes a hub of data repositories and is supported by data engineering, data governance, security, and monitoring services.

A data hub contains data at multiple levels of granularity and is often not integrated. It differs from a data lake by offering data that is pre-validated and standardized, allowing for simpler consumption by users. Data hubs and data lakes can coexist in an organization, complementing each other. Data hubs are more focused around enabling businesses to consume standardized data quickly and easily. Data lakes are more focused around storing and maintaining all the data in an organization in one place. And unlike data warehouses, which are primarily analytical stores, a data hub is a combination of all types of repositories—analytical, transactional, operational, reference, and data I/O services, along with governance processes. A data warehouse is one of the components in a data hub.

The following diagram is a conceptual analytics data hub reference architecture. This architecture resembles a hub-and-spoke approach. Data repositories represent the hub. External processes are the spokes feeding data to and from the hub. This reference architecture partly combines a data hub and data lake to enable comprehensive analytics services.

Let’s look at the components of the architecture in more detail.


Data sources

Data can be loaded from multiple sources, such as systems of record, data generated from applications, operational data stores, enterprise-wide reference data and metadata, data from vendors and partners, machine-generated data, social sources, and web sources. The source data is usually in either structured or semi-structured formats, which are highly and loosely formatted, respectively.

Data inbound

This section consists of components to process and load the data from multiple sources into data repositories. It can be in batch mode, continuous, pub/sub, or any other custom integration. ETL (extract, transform, and load) technologies, streaming services, APIs, and data exchange interfaces are the core components of this pillar. Unlike ingestion processes, data can be transformed according to business rules before loading. You can apply technical or business data quality rules and load raw data as well. Essentially, this pillar provides the flexibility to get data into repositories in its most usable form.

Data repositories

This section consists of a group of data stores, which includes data warehouses, transactional or operational data stores, reference data stores, domain data stores housing purpose-built business views, and enterprise datasets (file storage). The file storage component is usually a common component between a data hub and a data lake to avoid data duplication and provide comprehensiveness. Data can also be shared among all these repositories without physically moving with features, such as data sharing and federated queries. However, data copy and duplication are allowed considering various consumption needs in terms of formats and latency.

Data outbound

Data is often consumed using structured queries for analytical needs. Also, datasets are accessed for ML, data exporting, and publishing needs. This section consists of components to query the data, export, exchange, and APIs. In terms of implementation, the same technologies may be used for both inbound and outbound, but the functions are different. However, it’s not mandatory to use the same technologies. These processes aren’t transformation heavy because the data is already standardized and almost ready to consume. The focus is on the ease of consumption and integration with consuming services.


Data consumption

This pillar consists of various consumption channels for enterprise analytical needs. It includes business intelligence (BI) users, canned and interactive reports, dashboards, data science workloads, Internet of Things (IoT), web apps, and third-party data consumers. Popular consumption entities in many organizations are queries, reports, and data science workloads. Because there are multiple data stores maintaining data at different granularity and formats to service consumer needs, these consumption components depend on data catalogs for finding the right source.

Data governance

Data governance is key to the success of a data hub reference architecture. It constitutes components like metadata management, data quality, lineage, masking, and stewardship, which are required for organized maintenance of the data hub. Metadata management helps organize the technical and business metadata catalog, and consumers can reference this catalog to know what data is available in which repository and at what granularity, format, owners, refresh frequency, and so on. Along with metadata management, data quality is important to increase confidence for consumers. This includes data cleansing, validation, conformance, and data controls.

Security and monitoring

Users and application access should be controlled at multiple levels. It starts with authentication, then authorizing who and what should be accessed, policy management, encryption, and applying data compliance rules. It also includes monitoring components to log the activity for auditing and analysis.

Analytics data hub solution architecture on AWS

The following reference architecture provides an AWS stack for the solution components.

Let’s look at each component again and the relevant AWS services.

Data inbound services

AWS Glue and Amazon EMR services are ideal for batch processing. They scale automatically and are able to process most of the industry standard data formats. Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and Amazon Managed Streaming for Apache Kafka (Amazon MSK) enable you to build streaming process applications. These streaming services integrate well with the Amazon Redshift streaming feature. This helps you process real-time sources, IoT data, and data from online channels. You can also ingest data with third-party tools like Informatica, dbt, and Matillion.

You can build RESTful APIs and WebSocket APIs using Amazon API Gateway and AWS Lambda, which will enable real-time two-way communication with web sources, social, and IoT sources. AWS Data Exchange helps with subscribing to third-party data in AWS Marketplace. Data subscription and access is fully managed with this service. Refer to the respective service documentation for further details.

Data repository services

Amazon Redshift is the recommended data storage service for OLAP (Online Analytical Processing) workloads such as cloud data warehouses, data marts, and other analytical data stores. This service is the core of this reference architecture on AWS and can address most analytical needs out of the box. You can use simple SQL to analyze structured and semi-structured data across data warehouses, data marts, operational databases, and data lakes to deliver the best price performance at any scale. The Amazon Redshift data sharing feature provides instant, granular, and high-performance access without data copies and data movement across multiple Amazon Redshift data warehouses in the same or different AWS accounts, and across Regions.

For ease of use, Amazon Redshift offers a serverless option. Amazon Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. Just load your data and start querying right away in Amazon Redshift Query Editor or in your favorite BI tool and continue to enjoy the best price performance and familiar SQL features in an easy-to-use, zero administration environment.

Amazon Relational Database Service (Amazon RDS) is a fully managed service for building transactional and operational data stores. You can choose from many popular engines such as MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server. With the Amazon Redshift federated query feature, you can query transactional and operational data in place without moving the data. The federated query feature currently supports Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon Aurora MySQL-Compatible Edition.

Amazon Simple Storage Service (Amazon S3) is the recommended service for multi-format storage layers in the architecture. It offers industry-leading scalability, data availability, security, and performance. Organizations typically store data in Amazon S3 using open file formats. Open file formats enable analysis of the same Amazon S3 data using multiple processing and consumption layer components. Data in Amazon S3 can be easily queried in place using SQL with Amazon Redshift Spectrum. It helps you query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data. Multiple Amazon Redshift data warehouses can concurrently query the same datasets in Amazon S3 without the need to make copies of the data for each data warehouse.

Data outbound services

Amazon Redshift comes with the web-based analytics workbench Query Editor V2.0, which helps you run queries, explore data, create SQL notebooks, and collaborate on data with your teams in SQL through a common interface. AWS Transfer Family helps securely transfer files using SFTP, FTPS, FTP, and AS2 protocols. It supports thousands of concurrent users and is a fully managed, low-code service. Similar to inbound processes, you can utilize Amazon API Gateway and AWS Lambda for data pull using the Amazon Redshift Data API. And AWS Data Exchange helps publish your data to third parties for consumption through AWS Marketplace.

Consumption services

Amazon QuickSight is the recommended service for creating reports and dashboards. It enables you to create interactive dashboards, visualizations, and advanced analytics with ML insights. Amazon SageMaker is the ML platform for all your data science workload needs. It helps you build, train, and deploy models consuming the data from repositories in the data hub. You can use Amazon front-end web and mobile services and AWS IoT services to build web, mobile, and IoT endpoint applications to consume data out of the data hub.

Data governance services

The AWS Glue Data Catalog and AWS Lake Formation are the core data governance services AWS currently offers. These services help manage metadata centrally for all the data repositories and manage access controls. They also help with data classification and can automatically handle schema changes. You can use Amazon DataZone to discover and share data at scale across organizational boundaries with built-in governance and access controls. AWS is investing in this space to provide a more unified experience across AWS services. There are many partner products such as Collibra, Alation, Amorphic, Informatica, and more, which you can use as well for data governance functions with AWS services.

Security and monitoring services

AWS Identity and Access Management (AWS IAM) manages identities for AWS services and resources. You can define users, groups, roles, and policies for fine-grained access management of your workforce and workloads. AWS Key Management Service (AWS KMS) manages AWS keys or customer managed keys for your applications. Amazon CloudWatch and AWS CloudTrail help provide monitoring and auditing capabilities. You can collect metrics and events and analyze them for operational efficiency.

In this post, we’ve discussed the most common AWS services for the respective solution components. However, you aren’t limited to only these services. There are many other AWS services for specific use cases that may be more appropriate for your needs than what we discussed here. You can reach out to AWS Analytics Solutions Architects for appropriate guidance.

Example architectures for gaming use cases

In this section, we discuss example architectures for two gaming use cases.

Game event analysis

In-game events (also called timed or live events) encourage player engagement through excitement and anticipation. Events entice players to interact with the game, increasing player satisfaction and revenue with in-game purchases. Events have become increasingly important as games shift from static pieces of entertainment, played as is, to dynamic and changing content, driven by services that use player information to make decisions about game play as the game is being played. This enables games to evolve as players play, reveals what works and what doesn’t, and gives any game a potentially infinite lifespan.

This capability of in-game events to offer fresh content and activities within a familiar framework is how you keep players engaged and playing for months to years. Players can enjoy new experiences and challenges within the familiar framework or world that they have grown to love.

The following example shows how such an architecture might appear, including changes to support various sections of the process like breaking the data into separate containers to accommodate scalability, charge-back, and ownership.

Fully understanding how players perceive events, and making decisions about future events, requires information on how the latest event actually performed. This means gathering a lot of data as players play to build key performance indicators (KPIs) that measure the effectiveness of, and player satisfaction with, each event. This requires analytics that specifically measure each event and capture, analyze, report on, and measure player experience for each event. These KPIs include the following:

  • Initial user flow interactions – What actions users are taking after they first receive or download an event update in a game. Are there any clear drop-off points or bottlenecks that are turning people off the event?
  • Monetization – When, where, and on what users spend money in the event, whether it’s buying in-game currencies, responding to ads, specials, and so on.
  • Game economy – How users can earn and spend virtual currencies or goods during an event, using in-game money, trades, or barter.
  • In-game activity – Player wins, losses, leveling up, competition wins, or player achievements within the event.
  • User-to-user interactions – Invitations, gifting, chats (private and group), challenges, and so on during an event.

These are just some of the KPIs and metrics that are key for predictive modeling of events as the game acquires new players while keeping existing users involved, engaged, and playing.

In-game activity analysis

In-game activity analysis looks at any meaningful, purposeful activity the player might show, with the goal of understanding what actions are taken, their timing, and their outcomes. This includes situational information about the players, such as where they are playing (both geographically and culturally), how often, for how long, what they undertake on each login, and other activities.

The following example shows how such an architecture might appear, including changes to support various sections of the process like breaking the data into separate warehouses. The multi-cluster warehouse approach helps scale the workload independently, provides flexibility to the implemented charge-back model, and supports decentralized data ownership.

The solution essentially logs information to help understand the behavior of your players, which can lead to insights that increase retention of existing players, and acquisition of new ones. This can provide the ability to do the following:

  • Provide in-game purchase recommendations
  • Measure player trends in the short term and over time
  • Plan events the players will engage in
  • Understand what parts of your game are most successful and which are less so

You can use this understanding to make decisions about future game updates, make in-game purchase recommendations, determine when and how your game economy may need to be balanced, and even allow players to change their character or play as the game progresses by injecting this information and accompanying decisions back into the game.


This reference architecture, while showing examples of only a few analysis types, provides a faster technology path for enabling game analytics applications. The decoupled, hub/spoke approach brings the agility and flexibility to implement different approaches to analytics and to understanding the performance of game applications. The purpose-built AWS services described in this architecture provide comprehensive capabilities to easily collect, store, measure, analyze, and report game and event metrics. This helps you efficiently perform in-game analytics and event analysis, measure player satisfaction, provide tailor-made recommendations to game players, organize events, and increase retention rates.

Thanks for reading the post. If you have any feedback or questions, please leave them in the comments.

About the authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Tanya Rhodes is a Senior Solutions Architect based out of San Francisco, focused on games customers with emphasis on analytics, scaling, and performance enhancement of games and supporting systems. She has over 25 years of experience in enterprise and solutions architecture specializing in very large business organizations across multiple lines of business including games, banking, healthcare, higher education, and state governments.

How Tricentis unlocks insights across the software development lifecycle at speed and scale using Amazon Redshift

Post Syndicated from Parag Doshi original https://aws.amazon.com/blogs/big-data/how-tricentis-unlocks-insights-across-the-software-development-lifecycle-at-speed-and-scale-using-amazon-redshift/

This is a guest post co-written with Parag Doshi, Guru Havanur, and Simon Guindon from Tricentis.

Tricentis is the global leader in continuous testing for DevOps, cloud, and enterprise applications. As the DORA metrics in the 2019 State of DevOps report made widely known, with DevOps, companies can deploy software 208 times more often and 106 times faster, recover from incidents 2,604 times faster, and release 7 times fewer defects. Speed changes everything, and continuous testing across the entire CI/CD lifecycle is the key. However, speed is only realized when you have the confidence to release software on demand. Tricentis instills that confidence by providing software tools that enable Agile Continuous Testing (ACT) at scale. Whether exploratory or automated, functional or performance, API or UI, targeting mainframes, custom applications, packaged applications, or cloud-native applications, Tricentis provides a comprehensive suite of specialized continuous testing tools that help its customers achieve the confidence to release on demand.

The next phase of Tricentis’ journey is to unlock insights across all testing tools. Teams may struggle to have a unified view of software quality due to siloed testing across many disparate tools. For users that require a unified view of software quality, this is unacceptable. In this post, we share how the AWS Data Lab helped Tricentis to improve their software as a service (SaaS) Tricentis Analytics platform with insights powered by Amazon Redshift.

The challenge

Tricentis provides SaaS and on-premises solutions to thousands of customers globally. Every change to software worth testing is tracked in test management tools such as Tricentis qTest, test automation tools such as Tosca or Testim, or performance testing tools such as Neoload. Although Tricentis has amassed such data over a decade, the data remains untapped for valuable insights. Each of these tools has its own reporting capabilities that make it difficult to combine the data for integrated and actionable business insights.

Additionally, the scale is significant because the multi-tenant data sources provide a continuous stream of testing activity, and our users require quick data refreshes as well as historical context for up to a decade due to compliance and regulatory demands.

Finally, data integrity is of paramount importance. Every event in the data source can be relevant, and our customers don’t tolerate data loss, poor data quality, or discrepancies between the source and Tricentis Analytics. While aggregating, summarizing, and aligning to a common information model, all transformations must not affect the integrity of data from its source.

The solution

Tricentis Analytics aims to address the challenges of high volume, near-real-time, and visually appealing reporting and analytics across the entire Tricentis product portfolio.

The initial customer objectives were:

  • Provide export of data securely accessible from the AWS Cloud
  • Provide an initial set of pre-built dashboards that provide immediate business insights
  • Beta test a solution with early adopter customers within 6 weeks

Considering the multi-tenant data source, Tricentis and the AWS Data Lab team engineered for the following constraints:

  • Deliver the end-to-end pipeline to load only the eligible customers into an analytics repository
  • Transform the multi-tenant data into single-tenant data isolated for each customer in strictly segregated environments

Knowing that data will be unified across many sources deployed in any environment, the architecture called for an enterprise-grade analytics platform. The data pipeline consists of multiple layers:

  • Ingesting data from the source either as application events or change data capture (CDC) streams
  • Queuing data so that we can rewind and replay the data back in time without going back to the source
  • Light transformations such as splitting multi-tenant data into single tenant data to isolate customer data
  • Persisting and presenting data in a scalable and reliable lake house (data lake and data warehouse) repository

Some customers will access the repository directly via an API with the proper guardrails for stability to combine their test data with other data sources in their enterprise, while other customers will use dashboards to gain insights on testing. Initially, Tricentis defines these dashboards and charts to enable insight on test runs, test traceability with requirements, and many other pre-defined use cases that can be valuable to customers. In the future, more capabilities will be provided to end-users to come up with their own analytics and insights.

How Tricentis and the AWS Data Lab were able to establish business insights in 6 weeks

Given the challenge of Tricentis Analytics with live customers in 6 weeks, Tricentis partnered with the AWS Data Lab. From detailed design to a beta release, Tricentis had customers expecting to consume data from a data lake specific to only their data, and all of the data that had been generated for over a decade. Customers also required their own repository, an Apache Parquet data lake, which would combine with other data in the customer environment to gather even greater insights.

The AWS account team proposed the AWS Data Lab Build Lab session to help Tricentis accelerate the process of designing and building their prototype. The Build Lab is a two-to-five-day intensive build by a team of customer builders with guidance from an AWS Data Lab Solutions Architect. During the Build Lab, the customer will construct a prototype in their environment, using their data, with guidance on real-world architectural patterns and anti-patterns, as well as strategies for building effective solutions, from AWS service experts. Including the pre-lab preparation work, the total engagement duration is 3–6 weeks and in the Tricentis case was 3 weeks: two for the pre-lab preparation work and one for the lab. The weeks that followed the lab included go-to-market activities with specific customers, documentation, hardening, security reviews, performance testing, data integrity testing, and automation activities.

The 2 weeks before the lab were used for the following:

  • Understanding the use case and working backward with an architecture
  • Preparing the Tricentis team for the lab by delivering all the training on the services to be used during the lab

For this solution, Tricentis and AWS built a data pipeline that consumes a data stream, in place before the lab, that carries the database transactions captured through CDC. In the stream, the data from each table is separated by topic, and data from all customers arrives on the same topic (no isolation). Because of that, a pipeline was created to separate customers so that each customer’s tables land in an isolated schema at the final destination in Amazon Redshift. The following diagram illustrates the solution architecture.

The main idea of this architecture is to be event-driven with eventual consistency. Any time new test cases or test results are created or modified, events trigger such that processing is immediate and new snapshot files are available via an API or data is pulled at the refresh frequency of the reporting or business intelligence (BI) tool. Every time the Amazon Simple Storage Service (Amazon S3) sink connector from Apache Kafka delivers a file on Amazon S3, Amazon EventBridge triggers an AWS Lambda function to transform the multi-tenant file into separated files, one per customer per table, and land it on specific folders on Amazon S3. As the files are created, another process is triggered to load the data from each customer on their schema or table on Amazon Redshift. On Amazon Redshift, materialized views were used to get the queries for the dashboards ready and easier to return to Apache Superset. The materialized views were also configured to refresh automatically (with the autorefresh option), so Amazon Redshift updates the data in the materialized views as soon as possible after the base tables change.
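An autorefreshing materialized view of that kind can be declared as in the following sketch (table and column names are hypothetical illustrations, not Tricentis’s actual schema):

```sql
-- AUTO REFRESH YES lets Amazon Redshift refresh the view automatically
-- soon after the base table changes
CREATE MATERIALIZED VIEW mv_test_run_summary
AUTO REFRESH YES
AS
SELECT project_id,
       run_date,
       COUNT(*) AS total_runs,
       SUM(CASE WHEN status = 'PASS' THEN 1 ELSE 0 END) AS passed_runs
FROM test_runs
GROUP BY project_id, run_date;
```

The BI tool then queries `mv_test_run_summary` directly instead of re-aggregating the base table on every dashboard refresh.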

In the following sections, we detail specific implementation challenges and additional features required by customers discovered along the way.

Data export

As stated earlier, some customers want to get an export of their test data and create their data lake. For these customers, Tricentis provides incremental data as Apache Parquet files and will have the ability to filter on specific projects and specific date ranges. To ensure data integrity, Tricentis uses its technology known as Tosca DI (not part of the AWS Data Lab session).

Data security

The solution uses the following data security guardrails:

  • Data isolation guardrails – Tricentis source databases systems are used by all customers, and therefore, data from different customers is in the same database. To isolate customer-specific data, Tricentis has a unique identifier that discriminates customer-specific data. All the queries filter data based on the discriminator to get customer-specific data. EventBridge triggers a Lambda function to transform multi-tenant files to single-tenant (customer) files to land in customer-specific S3 folders. Another Lambda function is triggered to load data from customer-specific folders to their specific schema in Amazon Redshift. The latter Lambda function is data isolation aware and triggers an alert and stops processing further for any data that doesn’t belong to a specific customer.
  • Data access guardrails – To ensure access control, Tricentis applied role-based access control principles to users and service accounts for specific work-related resources. Access to Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon S3, Amazon Relational Database Service (Amazon RDS), and Amazon Redshift was controlled by granting privileges at the role level and assigning those roles appropriate resources.

Pay per use and linear cost scalability

Tricentis’s objective is to pay for the compute and storage used and grow analytics infrastructure with linear cost scalability. To better manage storage costs in the data plane, Tricentis stores all raw and intermediate data in Amazon S3 in a compressed format. Amazon MSK and Amazon Redshift are right-sized for the Tricentis Analytics load and can scale up or down with no downtime based on future business needs. Data on all the stores, including Amazon MSK, Amazon Redshift, and Amazon S3, is subject to tiered storage and retention policies per the customer data retention and archival requirements, to reduce the cost further and provide linear cost scalability.

In the control plane, Debezium and Kafka Connect resources are turned on and off, so you only pay for what you use. Lambda functions are invoked on an event or a schedule and stop after completing their tasks.

Automated data integrity

High data integrity is a fundamental design principle of Tricentis Analytics. Fortunately, Tricentis has a product called Tosca DI, which is used to automate the measurement of data integrity across many different data sources. The main idea is to use the machine-generated data type and log sequence number (LSN) to reflect the latest snapshot data from the change data capture (CDC) streams. Tricentis reached the data integrity automation milestone outside of the AWS Data Lab window by automatically triggering Tosca DI at various stages of the AWS serverless architecture (illustrated earlier), and because of that, Tricentis was able to ensure expected record counts at every step, preventing data loss or inadvertent data manipulation. In future versions, Tricentis will perform much deeper data integrity verification beyond record counts and incorporate specific fields to ensure data quality (for example, nullness) and semantic or format validation. To date, the combination of CDC and data cleansing has resulted in ultra-high data integrity when comparing source data to the final Parquet file contents.

Performance and data loss prevention

Performance was tuned for maximum throughput at three stages in the pipeline:

  • Data ingestion – Data integrity during ingestion was dramatically improved using CDC events and allowed us to rely on the well-respected replication mechanisms in PostgreSQL and Kafka, which simplified the system and eliminated a lot of the past data corrections that were in place. The Amazon S3 sink connector further streams data into Amazon S3 in real time by partitioning data into fixed-sized files. Fixed-size data files avoid further latency due to unbound file sizes. As a result, data was higher quality and was streamed in real time at a much faster rate.
  • Data transformation – Batch processing is highly cost efficient and compute efficient, and can mitigate various potential performance issues if appropriately implemented. Tricentis uses batch transformation to move data from multi-tenant Amazon S3 to single-tenant Amazon S3, and from single-tenant Amazon S3 to Amazon Redshift by micro-batch loading. The batch processing is staged to work within the Lambda invocation limits and the maximum Amazon Redshift connection limits to keep costs to a minimum. However, the transformation pipeline is configurable to run in real time by processing every incoming S3 file on an EventBridge event.
  • Data queries – Materialized views with appropriate sort keys significantly improve the performance of repeated and predictable dashboard workloads. Tricentis pipelines use dynamic data loading in views and precomputed results in materialized views to seamlessly improve the performance of dashboards, along with setting up appropriate simple and compound sort keys to accelerate performance. Tricentis query performance is further accelerated by range-restricted predicates in sort keys.
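The sort-key pattern from the last bullet can be sketched as follows (hypothetical names; the idea is that the compound sort key leads with the columns the dashboards filter on, so range-restricted predicates can skip data blocks):

```sql
-- Materialized view with a compound sort key matching dashboard filters
CREATE MATERIALIZED VIEW mv_daily_results
SORTKEY (tenant_id, event_date)
AS
SELECT tenant_id, event_date, COUNT(*) AS events
FROM test_events
GROUP BY tenant_id, event_date;

-- Range-restricted predicate on the leading sort key columns
SELECT event_date, events
FROM mv_daily_results
WHERE tenant_id = 42
  AND event_date BETWEEN '2023-01-01' AND '2023-01-31';
```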

Implementation challenges

Tricentis worked within the default limit of 1,000 concurrent Lambda function runs by keeping track of available functions at any given time and invoking only as many functions as there were available slots. For the 10 GB memory limit per function, Tricentis right-sized the Amazon S3 sink connector generated files and single-tenant S3 files to not exceed 4 GB in size. Lambda function throttling can be prevented by requesting a higher limit of concurrent runs if that becomes necessary later.

Tricentis also experienced some Amazon Redshift connection limitations. Amazon Redshift has quotas and adjustable quotas that limit the use of server resources. To effectively manage Amazon Redshift limits of maximum connections, Tricentis used connection pools to ensure optimal consumption and stability.

Results and next steps

The collaborative approach between Tricentis and the AWS Data Lab allowed considerable acceleration and the ability to meet timelines for establishing a big data solution that will benefit Tricentis customers for years. Since this writing, customer onboarding, observability and alerting, and security scanning were automated as part of a DevSecOps pipeline.

Within 6 weeks, the team was able to beta a data export service for one of Tricentis’ customers.

In the future, Tricentis anticipates adding multiple data sources, unifying toward a common, ubiquitous language for testing data, and delivering richer insights so that our customers can have the correct data in a single view and increase confidence in their delivery of software at scale and speed.


In this post, we walked you through the journey the Tricentis team took with the AWS Data Lab during their participation in a Build Lab session. During the session, the Tricentis team and AWS Data Lab worked together to identify a best-fit architecture for their use cases and implement a prototype for delivering new insights for their customers.

To learn more about how the AWS Data Lab can help you turn your ideas into solutions, visit AWS Data Lab.

About the Authors

  Parag Doshi is Vice President of Engineering at Tricentis, where he continues to lead towards the vision of Innovation at the Speed of Imagination. He brings innovation to market by building world-class quality engineering SaaS such as qTest, the flagship test management product, and a new capability called Tricentis Analytics, which unlocks software development lifecycle insights across all types of testing. Prior to Tricentis, Parag was the founder of Anthem’s Cloud Platform Services, where he drove a hybrid cloud and DevSecOps capability and migrated 100 mission-critical applications. He enabled Anthem to build a new pharmacy benefits management business in AWS, resulting in $800 million in total operating gain for Anthem in 2020 per Forbes and CNBC. He also held posts at Hewlett-Packard, having multiple roles including Chief Technologist and head of architecture for DXC’s Virtual Private Cloud, and CTO for HP’s Application Services in the Americas region.

Guru Havanur serves as a Principal, Big Data Engineering and Analytics team in Tricentis. Guru is responsible for data, analytics, development, integration with other products, security, and compliance activities. He strives to work with other Tricentis products and customers to improve data sharing, data quality, data integrity, and data compliance through the modern big data platform. With over 20 years of experience in data warehousing, a variety of databases, integration, architecture, and management, he thrives for excellence.

Simon Guindon is an Architect at Tricentis. He has expertise in large-scale distributed systems and database consistency models, and works with teams in Tricentis around the world on scalability and high availability. You can follow his Twitter @simongui.

Ricardo Serafim is a Senior AWS Data Lab Solutions Architect. With a focus on data pipelines, data lakes, and data warehouses, Ricardo helps customers create an end-to-end architecture and test an MVP as part of their path to production. Outside of work, Ricardo loves to travel with his family and watch soccer games, mainly from the “Timão” Sport Club Corinthians Paulista.

Visualize database privileges on Amazon Redshift using Grafana

Post Syndicated from Yota Hamaoka original https://aws.amazon.com/blogs/big-data/visualize-database-privileges-on-amazon-redshift-using-grafana/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to use SQL for analyzing structured and semi-structured data with best price performance along with secure access to the data.

As more users start querying data in a data warehouse, access control is paramount to protect valuable organizational data. Database administrators want to continuously monitor and manage user privileges to maintain proper data access in the data warehouse. Amazon Redshift provides granular access control on the database, schema, table, column, row, and other database objects by granting privileges to roles, groups, and users from a SQL interface. To monitor privileges configured in Amazon Redshift, you can retrieve them by querying system tables.
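For example, table-level grants can be listed with a query like the following (a sketch; `SVV_RELATION_PRIVILEGES` is one of several `SVV_*_PRIVILEGES` system views, and the exact columns may vary by Amazon Redshift version):

```sql
-- List which identity holds which privilege on which table
SELECT namespace_name,   -- schema
       relation_name,    -- table or view
       identity_name,    -- user, role, or group name
       identity_type,
       privilege_type    -- SELECT, INSERT, UPDATE, ...
FROM svv_relation_privileges
ORDER BY namespace_name, relation_name, identity_name;
```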

Although Amazon Redshift provides a broad capability of managing access to database objects, we have heard from customers that they want to visualize and monitor privileges without using a SQL interface. In this post, we introduce predefined Grafana dashboards that visualize database privileges without writing SQL. These dashboards help database administrators reduce the time spent on database administration and increase the frequency of monitoring cycles.

Database security in Amazon Redshift

Security is the top priority at AWS. Amazon Redshift provides four levels of control:

  • Cluster management
  • Cluster connectivity
  • Database access
  • Temporary database credentials and single sign-on

This post focuses on database access, which relates to user access control against database objects. For more information, see Managing database security.

Amazon Redshift uses the GRANT command to define permissions in the database. For most database objects, GRANT takes three parameters:

  • Identity – The entity you grant access to. This could be a user, role, or group.
  • Object – The type of database object. This could be a database, schema, table or view, column, row, function, procedure, language, datashare, machine learning (ML) model, and more.
  • Privilege – The type of operation. Examples include CREATE, SELECT, ALTER, DROP, DELETE, and INSERT. The level of privilege depends on the object.

To remove access, use the REVOKE command.
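The three parameters map directly onto the statement. The following sketch uses hypothetical object and role names:

```sql
-- privilege = SELECT, object = table sales.orders, identity = role analyst_role
GRANT SELECT ON TABLE sales.orders TO ROLE analyst_role;

-- The same shape applies to other object types and identity types
GRANT USAGE ON SCHEMA sales TO GROUP reporting_group;

-- REVOKE mirrors the GRANT syntax
REVOKE SELECT ON TABLE sales.orders FROM ROLE analyst_role;
```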

Additionally, Amazon Redshift offers granular access control with the Row-level security (RLS) feature. You can attach or detach RLS policies to identities with the ATTACH RLS POLICY and DETACH RLS POLICY commands, respectively. See RLS policy ownership and management for more details.
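As a sketch of the RLS lifecycle (table, policy, and role names are hypothetical; refer to the Amazon Redshift RLS documentation for the full syntax):

```sql
-- Define a policy that exposes only rows owned by the current user
CREATE RLS POLICY owner_only
WITH (owner VARCHAR(64))
USING (owner = current_user);

-- Attach it to a table for a role, then switch RLS on for that table
ATTACH RLS POLICY owner_only ON tickets TO ROLE analyst_role;
ALTER TABLE tickets ROW LEVEL SECURITY ON;

-- Detach when the restriction is no longer needed
DETACH RLS POLICY owner_only ON tickets FROM ROLE analyst_role;
```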

Generally, database administrators monitor and review the identities, objects, and privileges periodically to ensure proper access is configured. They also need to investigate access configurations when database users encounter permission errors. These tasks require a SQL interface to query multiple system tables, which can be a repetitive and undifferentiated operation. Therefore, database administrators need a single pane of glass to quickly navigate through identities, objects, and privileges without writing SQL.

Solution overview

The following diagram illustrates the solution architecture and its key components:

  • Amazon Redshift contains database privilege information in system tables.
  • Grafana provides a predefined dashboard to visualize database privileges. The dashboard runs queries against the Amazon Redshift system table via the Amazon Redshift Data API.

Note that the dashboard focuses on visualization. SQL interface is required to configure privileges in Amazon Redshift. You can use query editor v2, a web-based SQL interface which enables users to run SQL commands from a browser.


Before moving to the next section, you should have the following prerequisites:

While Amazon Managed Grafana controls the plugin version and updates it periodically, local Grafana allows users to control the version. Therefore, local Grafana could be an option if you need early access to the latest features. Refer to the plugin changelog for released features and versions.

Import the dashboards

After you have finished the prerequisites, you should have access to Grafana configured with Amazon Redshift as a data source. Next, import two dashboards for visualization.

  1. On the Grafana console, go to the created Amazon Redshift data source and choose Dashboards.
  2. Import the Amazon Redshift Identities and Objects dashboard.
  3. Go to the data source again and import the Amazon Redshift Privileges dashboard.

Each dashboard will appear once imported.

Amazon Redshift Identities and Objects dashboard

The Amazon Redshift Identities and Objects dashboard shows identities and database objects in Amazon Redshift, as shown in the following screenshot.

The Identities section shows the detail of each user, role, and group in the source database.

One of the key features in this dashboard is the Role assigned to Role, User section, which uses a node graph panel to visualize the hierarchical structure of roles and users from multiple system tables. This visualization can help administrators quickly examine which roles are inherited by users instead of querying multiple system tables. For more information about role-based access, refer to Role-based access control (RBAC).

Amazon Redshift Privileges dashboard

The Amazon Redshift Privileges dashboard shows privileges defined in Amazon Redshift.

In the Role and Group assigned to User section, open the Role assigned to User panel to list the roles for a specific user. In this panel, you can list and compare roles assigned to multiple users. Use the User drop-down at the top of the dashboard to select users.

The dashboard refreshes immediately and shows the filtered result for the selected users. The following screenshot shows the filtered result for users hr1, hr2, and it3.

The Object Privileges section shows the privileges granted for each database object and identity. Note that objects with no privileges granted are not listed here. To show the full list of database objects, use the Amazon Redshift Identities and Objects dashboard.

The Object Privileges (RLS) section contains visualizations for row-level security (RLS). The Policy attachments panel enables you to examine the RLS configuration by visualizing the relationships between tables, policies, roles, and users.


In this post, we introduced a visualization for database privileges of Amazon Redshift using predefined Grafana dashboards. Database administrators can use these dashboards to quickly navigate through identities, objects, and privileges without writing SQL. You can also customize the dashboard to meet your business requirements. The JSON definition file of this dashboard is maintained as part of OSS in the Redshift data source for Grafana GitHub repository.

For more information about the topics described in this post, refer to the following:

About the author

Yota Hamaoka is an Analytics Solution Architect at Amazon Web Services. He is focused on driving customers to accelerate their analytics journey with Amazon Redshift.

Simplify Online Analytical Processing (OLAP) queries in Amazon Redshift using new SQL constructs such as ROLLUP, CUBE, and GROUPING SETS

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/simplify-online-analytical-processing-olap-queries-in-amazon-redshift-using-new-sql-constructs-such-as-rollup-cube-and-grouping-sets/

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools.

We are continuously investing to make analytics easy with Amazon Redshift by simplifying SQL constructs and adding new operators. Now we are adding ROLLUP, CUBE, and GROUPING SETS SQL aggregation extensions to perform multiple aggregate operations in a single statement and easily include subtotals, totals, and collections of subtotals in a query.

In this post, we discuss how to use these extensions to simplify your queries in Amazon Redshift.

Solution overview

Online Analytical Processing (OLAP) is an effective tool for today’s data and business analysts. It helps you see your mission-critical metrics at different aggregation levels in a single pane of glass. An analyst can use OLAP aggregations to analyze buying patterns by grouping customers by demographic, geographic, and psychographic data, and then summarizing the data to look for trends. This could include analyzing the frequency of purchases, the time frames between purchases, and the types of items being purchased. Such analysis can provide insight into customer preferences and behavior, which can be used to inform marketing strategies and product development. For example, a data analyst can query the data to display a spreadsheet showing a certain type of a company’s products sold in the US in the month of July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in the US in the same time period.

Traditionally, business analysts and data analysts use a set of SQL UNION queries to achieve the desired level of detail and rollups. However, it can be very time consuming and cumbersome to write and maintain. Furthermore, the level of detail and rollups that can be achieved with this approach is limited, because it requires the user to write multiple queries for each different level of detail and rollup.

Many customers are considering migrating to Amazon Redshift from other data warehouse systems that support OLAP GROUP BY clauses. To make this migration process as seamless as possible, Amazon Redshift now offers support for ROLLUP, CUBE, and GROUPING SETS. This allows for a smoother migration of OLAP workloads, with minimal rewrites, and ultimately results in a faster and more streamlined transition to Amazon Redshift. Business and data analysts can now write a single SQL statement to do the job of multiple UNION queries.

In the next sections, we use sample supplier balances data from the TPC-H dataset as a running example to demonstrate the use of the ROLLUP, CUBE, and GROUPING SETS extensions. This dataset consists of supplier account balances across different regions and countries. We demonstrate how to find account balance subtotals and grand totals at each nation level, region level, and a combination of both. All of these analytical questions can be answered by a business user running simple single-line SQL statements. Along with aggregations, this post also demonstrates how the results can be traced back to the attributes that participated in generating the subtotals.

Data preparation

To set up the use case, complete the following steps:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor v2.

The query editor v2 opens in a new browser tab.

  2. Create a supplier sample table and insert sample data:
create table supp_sample (supp_id integer, region_nm char(25), nation_nm char(25), acct_balance numeric(12,2));

INSERT INTO public.supp_sample (supp_id,region_nm,nation_nm,acct_balance)
VALUES
(90470,'AFRICA                   ','KENYA                    ',1745.57),
(99910,'AFRICA                   ','ALGERIA                  ',3659.98),
(26398,'AMERICA                  ','UNITED STATES            ',2575.77),
(43908,'AMERICA                  ','CANADA                   ',1428.27),
(3882,'AMERICA                  ','UNITED STATES            ',7932.67),
(42168,'ASIA                     ','JAPAN                    ',343.34),
(68461,'ASIA                     ','CHINA                    ',2216.11),
(89676,'ASIA                     ','INDIA                    ',4160.75),
(52670,'EUROPE                   ','RUSSIA                   ',2469.40),
(32190,'EUROPE                   ','RUSSIA                   ',1119.55),
(19587,'EUROPE                   ','GERMANY                  ',9904.98),
(1134,'MIDDLE EAST              ','EGYPT                    ',7977.48),
(35213,'MIDDLE EAST              ','EGYPT                    ',737.28),
(36132,'MIDDLE EAST              ','JORDAN                   ',5052.87);

We took a sample from the result of the following query run on the TPC-H dataset. You can use the following query and take sample records to try the SQL statements described in this post:

select s_suppkey supp_id, r.r_name region_nm, n.n_name nation_nm, s.s_acctbal acct_balance
from supplier s, nation n, region r
where s.s_nationkey = n.n_nationkey
and n.n_regionkey = r.r_regionkey;

Let’s review the sample data before running the SQL statements that use the GROUPING SETS, ROLLUP, and CUBE extensions.

The supp_sample table consists of supplier account balances from various nations and regions across the world. The following are the attribute definitions:

  • supp_id – The unique identifier for each supplier
  • region_nm – The region in which the supplier operates
  • nation_nm – The nation in which the supplier operates
  • acct_balance – The supplier’s outstanding account balance


GROUPING SETS

GROUPING SETS is a SQL aggregation extension that groups the query results by one or more columns in a single statement. You can use GROUPING SETS instead of performing multiple SELECT queries with different GROUP BY keys and merging (UNION) their results.

In this section, we show how to find the following:

  • Account balances aggregated for each region
  • Account balances aggregated for each nation
  • Merged results of both aggregations

Run the following SQL statement using GROUPING SETS:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY GROUPING SETS (region_nm, nation_nm);

As shown in the following screenshot, the result set includes aggregated account balances by region_nm, followed by nation_nm, and then both results combined in a single output.
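To make the semantics concrete, here is a small Python sketch (illustrative only, not how Amazon Redshift executes the query) that emulates GROUPING SETS (region_nm, nation_nm) over a few of the sample rows: one aggregation per grouping set, with the results concatenated (a UNION ALL) and None standing in for the NULL that fills the column not in the current grouping set.

```python
from collections import defaultdict

# A few rows mirroring the supp_sample table: (region_nm, nation_nm, acct_balance)
rows = [
    ("AFRICA", "KENYA", 1745.57),
    ("AFRICA", "ALGERIA", 3659.98),
    ("AMERICA", "UNITED STATES", 2575.77),
    ("AMERICA", "UNITED STATES", 7932.67),
]

def grouping_sets(rows, key_indexes):
    """Emulate GROUP BY GROUPING SETS (region_nm, nation_nm):
    one aggregation per grouping set, results concatenated."""
    out = []
    for key_idx in key_indexes:  # each grouping set is a single column here
        totals = defaultdict(float)
        for row in rows:
            totals[row[key_idx]] += row[2]
        for key, total in sorted(totals.items()):
            # None (NULL) fills whichever column is not in this grouping set
            if key_idx == 0:
                out.append((key, None, round(total, 2)))
            else:
                out.append((None, key, round(total, 2)))
    return out

for result_row in grouping_sets(rows, [0, 1]):
    print(result_row)
```

Running this prints the region subtotals (for example, ('AFRICA', None, 5405.55)) followed by the nation subtotals, mirroring the combined result set that Amazon Redshift returns in a single output.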


ROLLUP

The ROLLUP function generates aggregated results at multiple levels of grouping, starting from the most detailed level and aggregating up to the next level. It groups data by the given columns and returns extra rows representing the subtotals, assuming a hierarchy among the GROUP BY columns.

In this section, we show how to find the following:

  • Account balances for each combination of region_nm and nation_nm
  • Rolled-up account balances for each region_nm
  • Rolled-up account balances for all regions

Use the following SQL statement using ROLLUP:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY ROLLUP (region_nm, nation_nm)
ORDER BY region_nm,nation_nm;

The following result shows rolled-up values starting from each combination of region_nm and nation_nm and rolls up in the hierarchy from nation_nm to region_nm. The rows with a value for region_nm and NULL value for nation_nm represent the subtotals for the region (marked in green). The rows with NULL value for both region_nm and nation_nm has the grand total—the rolled-up account balances for all regions (marked in red).

ROLLUP is structurally equivalent to the following GROUPING SETS query:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY GROUPING SETS((region_nm, nation_nm), (region_nm), ())
ORDER BY region_nm,nation_nm;

You can rewrite the preceding ROLLUP query using GROUPING SETS. However, ROLLUP is a simpler and more readable construct for this use case.


CUBE

CUBE groups data by the provided columns, returning extra subtotal rows representing the totals throughout all levels of grouping columns, in addition to the grouped rows. CUBE returns the same rows as ROLLUP, while adding subtotal rows for every combination of grouping columns not covered by ROLLUP.

In this section, we show how to find the following:

  • Account balance subtotals for each nation_nm
  • Account balance subtotals for each region_nm
  • Account balance subtotals for each group of region_nm and nation_nm combination
  • Overall total account balance for all regions

Run the following SQL statement using CUBE:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
WHERE region_nm in ('AFRICA','AMERICA','ASIA')
GROUP BY CUBE(region_nm, nation_nm)
ORDER BY region_nm, nation_nm;

In the preceding query, we added a filter to limit results for easy explanation. You can remove this filter in your test to view data for all regions.

In the following result sets, you can see the subtotals at region level (marked in green). These subtotal records are the same records generated by ROLLUP. Additionally, CUBE generated subtotals for each nation_nm (marked in yellow). Finally, you can also see the grand total for all three regions mentioned in the query (marked in red).

CUBE is structurally equivalent to the following GROUPING SETS query:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
WHERE region_nm in ('AFRICA','AMERICA','ASIA') -- added the filter to limit results.  You can remove this filter in your test to view data for all regions
GROUP BY GROUPING SETS((region_nm, nation_nm), (region_nm), (nation_nm), ())
ORDER BY region_nm;

You can rewrite the preceding CUBE query using GROUPING SETS. However, CUBE is a simpler and more readable construct for this use case.
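The two GROUPING SETS equivalences above follow a simple pattern: ROLLUP(a, b, ...) expands to the prefixes of the column list (down to the empty set, the grand total), while CUBE(a, b, ...) expands to every subset. A short Python sketch (illustrative, not Redshift internals) makes the expansion explicit:

```python
from itertools import combinations

def rollup_sets(cols):
    """ROLLUP(a, b, ...) expands to the prefixes of the column list,
    from the full list down to the empty set (the grand total)."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    """CUBE(a, b, ...) expands to every subset of the column list."""
    return [s for r in range(len(cols), -1, -1) for s in combinations(cols, r)]

print(rollup_sets(["region_nm", "nation_nm"]))
print(cube_sets(["region_nm", "nation_nm"]))
```

For the two-column case, ROLLUP yields (region_nm, nation_nm), (region_nm), and (), matching the first GROUPING SETS rewrite; CUBE additionally yields (nation_nm), matching the second.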

NULL values

NULL is a valid value in a column that participates in GROUPING SETS, ROLLUP, and CUBE. NULL values from the input are grouped as their own group, and are distinct from the NULL placeholders that these extensions add to the result set to satisfy the schema of the returned tuples.

Let’s create an example table orders containing details about items ordered, item descriptions, and quantity of the items:

-- Create example orders table and insert sample records
CREATE TABLE orders(item_no int,description varchar,quantity int);
-- Sample values are illustrative; note the NULL description for item 102
INSERT INTO orders(item_no,description,quantity)
VALUES (101,'apples',10),
(102,NULL,15),
(103,'oranges',20);

--View the data
SELECT * FROM orders;

We use the following ROLLUP query to aggregate quantities by item_no and description:

SELECT item_no, description, sum(quantity)
FROM orders
GROUP BY ROLLUP(item_no, description);

In the following result, there are two output rows for item_no 102. The row marked in green is the actual data record in the input, and the row marked in red is the subtotal record added by the ROLLUP function.

This demonstrates that NULL values in input are separate from the NULL values added by SQL aggregate extensions.

GROUPING and GROUPING_ID functions

GROUPING indicates whether a column in the GROUP BY list is aggregated or not. GROUPING(expr) returns 0 if a tuple is grouped on expr; otherwise it returns 1. GROUPING_ID(expr1, expr2, …, exprN) returns an integer representation of the bitmap that consists of GROUPING(expr1), GROUPING(expr2), …, GROUPING(exprN).

This feature helps us clearly understand the aggregation grain, slice and dice data, and apply filters when business users perform analysis. It also provides auditability for the generated aggregations.

For example, let’s use the preceding supp_sample table. The following ROLLUP query utilizes the GROUPING and GROUPING_ID functions:

SELECT region_nm, nation_nm,
sum(acct_balance) as total_balance,
GROUPING(region_nm) as gr,
GROUPING(nation_nm) as gn,
GROUPING_ID(region_nm, nation_nm) as grn
FROM supp_sample
GROUP BY ROLLUP(region_nm, nation_nm)
ORDER BY region_nm;

In the following result set, the rows rolled up at nation_nm have the value 1 for gn. This indicates that total_balance is the aggregated value for all the nation_nm values in the region. The last row has a gr value of 1, indicating that total_balance is an aggregated value at the region level including all nations. The grn column is the integer representation of the GROUPING bitmap (binary 11, which is 3 in decimal).
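The bitmap arithmetic can be sketched in a few lines of Python (illustrative only): each GROUPING bit is 0 when the result row is grouped on that column and 1 when the column has been rolled up, and GROUPING_ID packs the bits with the leftmost argument as the most significant bit.

```python
def grouping(col, grouped_cols):
    """GROUPING(col): 0 if the result row is grouped on col, else 1."""
    return 0 if col in grouped_cols else 1

def grouping_id(cols, grouped_cols):
    """GROUPING_ID(c1, ..., cN): pack the GROUPING bits into one integer,
    with the leftmost argument as the most significant bit."""
    gid = 0
    for col in cols:
        gid = (gid << 1) | grouping(col, grouped_cols)
    return gid

cols = ["region_nm", "nation_nm"]
# A row aggregated at the region level: grouped on region_nm only.
print(grouping_id(cols, {"region_nm"}))   # binary 01 -> 1
# The grand-total row: grouped on neither column.
print(grouping_id(cols, set()))           # binary 11 -> 3
```

This reproduces the grn values in the result set: 1 for region-level subtotals and 3 for the grand total.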

Performance assessment

Performance is often a key factor, and we wanted to make sure we’re offering the most performant SQL features in Amazon Redshift. We performed benchmarking with the 3 TB TPC-H public dataset on Amazon Redshift clusters of different sizes (5-node Ra3-4XL, 2-node Ra3-4XL, 2-node Ra3-XLPLUS). Additionally, we disabled query caching so that query results aren’t cached. This allows us to measure the performance of the database as opposed to its ability to serve results from the cache. The results were consistent across multiple runs.

We loaded the supplier, region, and nation files from the 3 TB public dataset and created a view on top of those three tables, as shown in the following code. This query joins the three tables to create a unified record. The joined dataset is used for performance assessment.

create view v_supplier_balances as
select r.r_name region_nm,n.n_name nation_nm, s.s_acctbal acct_balance
from supplier s, nation n, region r
where s.s_nationkey = n.n_nationkey
and n.n_regionkey = r.r_regionkey;

We ran the following example SELECT queries using GROUPING SETS, CUBE, and ROLLUP, and captured performance metrics in the following tables.

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY ROLLUP (region_nm, nation_nm)
ORDER BY region_nm;
Cluster Run 1 in ms Run 2 in ms Run 3 in ms
5-node-Ra3-4XL 120 118 117
2-node-Ra3-4XL 405 389 391
2-node-Ra3-XLPLUS 490 460 461


SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY CUBE(region_nm, nation_nm)
ORDER BY region_nm;
Cluster Run 1 in ms Run 2 in ms Run 3 in ms
5-node-Ra3-4XL 224 215 214
2-node-Ra3-4XL 412 392 392
2-node-Ra3-XLPLUS 872 798 793


SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY GROUPING SETS(region_nm, nation_nm)
ORDER BY region_nm;
Cluster Run 1 in ms Run 2 in ms Run 3 in ms
5-node-Ra3-4XL 210 198 198
2-node-Ra3-4XL 345 328 328
2-node-Ra3-XLPLUS 675 674 674

When we rewrote the same set of ROLLUP and CUBE queries using UNION ALL and ran them, we saw better performance with the ROLLUP and CUBE functionality.

Cluster CUBE (run in ms) ROLLUP (run in ms) UNION ALL (run in ms)
5-node-Ra3-4XL 214 117 321
2-node-Ra3-4XL 392 391 543
2-node-Ra3-XLPLUS 793 461 932

Clean up

To clean up your resources, drop the tables and views you created while following along with the example in this post.


Conclusion

In this post, we talked about the new aggregation extensions ROLLUP, CUBE, and GROUPING SETS added to Amazon Redshift. We also discussed general use cases, implementation examples, and performance results. You can simplify your existing aggregation queries using these new SQL aggregation extensions and use them in future development to build simpler, more readable queries. If you have any feedback or questions, please leave them in the comments section.

About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager. She has a rich experience of how the customer-facing Amazon Redshift features are built from planning to launching, and always treats customers’ requirements as first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Dinesh Kumar is a Database Engineer with more than a decade of experience working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Patterns for enterprise data sharing at scale

Post Syndicated from Venkata Sistla original https://aws.amazon.com/blogs/big-data/patterns-for-enterprise-data-sharing-at-scale/

Data sharing is becoming an important element of an enterprise data strategy. AWS services like AWS Data Exchange provide an avenue for companies to share or monetize their value-added data with other companies. Some organizations would like to have a data sharing platform where they can establish a collaborative and strategic approach to exchange data with a restricted group of companies in a closed, secure, and exclusive environment. For example, financial services companies and their auditors, or manufacturing companies and their supply chain partners. This fosters development of new products and services and helps improve their operational efficiency.

Data sharing is a team effort. In addition to establishing the right infrastructure, successful data sharing requires organizations to ensure that business owners sponsor data sharing initiatives and that the data is of high quality. Data platform owners and security teams should encourage proper data use and fix any privacy and confidentiality issues.

This post discusses various data sharing options and common architecture patterns that organizations can adopt to set up their data sharing infrastructure based on AWS service availability and data compliance.

Data sharing options and data classification types

Organizations operate across a spectrum of security compliance constraints. For some organizations, it’s possible to use AWS services like AWS Data Exchange. However, organizations working in heavily regulated industries like federal agencies or financial services might be limited to allow listed AWS service options. For example, if an organization is required to operate in a FedRAMP Moderate or FedRAMP High environment, their options to share data may be limited to the AWS services that are available and have been allow listed. Service availability is based on platform certification by AWS, and allow listing is based on the organizations defining their security compliance architecture and guidelines.

The kind of data that the organization wants to share with its partners may also have an impact on the method used for data sharing. Complying with data classification rules may further limit their choice of data sharing options they may choose.

The following are some general data classification types:

  • Public data – Important information, though often freely available for people to read, research, review and store. It typically has the lowest level of data classification and security.
  • Private data – Information you might want to keep private like email inboxes, cell phone content, employee identification numbers, or employee addresses. If private data were shared, destroyed, or altered, it might pose a slight risk to an individual or the organization.
  • Confidential or restricted data – A limited group of individuals or parties can access sensitive information often requiring special clearance or special authorization. Confidential or restricted data access might involve aspects of identity and authorization management. Examples of confidential data include Social Security numbers and vehicle identification numbers.

The following is a sample decision tree that you can refer to when choosing your data sharing option based on service availability, classification type, and data format (structured or unstructured). Other factors like usability, multi-partner accessibility, data size, consumption patterns like bulk load/API access, and more may also affect the choice of data sharing pattern.


In the following sections, we discuss each pattern in more detail.

Pattern 1: Using AWS Data Exchange

AWS Data Exchange makes exchanging data easier, helping organizations lower costs, become more agile, and innovate faster. Organizations can choose to share data privately using AWS Data Exchange with their external partners. AWS Data Exchange offers perimeter controls that are applied at the identity and resource levels. These controls decide which external identities have access to specific data resources. AWS Data Exchange provides multiple patterns for external parties to access data.

The following diagram illustrates an example architecture.


With AWS Data Exchange, once the dataset to share (or sell) is configured, AWS Data Exchange automatically manages entitlements (and billing) between the producer and the consumer. The producer doesn’t have to manage policies, set up new access points, or create new Amazon Redshift data shares for each consumer, and access is automatically revoked if the subscription ends. This can significantly reduce the operational overhead in sharing data.

Pattern 2: Using AWS Lake Formation for centralized access management

You can use this pattern in cases where both the producer and consumer are on the AWS platform with an AWS account that is enabled to use AWS Lake Formation. This pattern provides a no-code approach to data sharing. The following diagram illustrates an example architecture.


In this pattern, the central governance account has Lake Formation configured for managing access across the producer’s org accounts. Resource links from the production account Amazon Simple Storage Service (Amazon S3) bucket are created in Lake Formation. The producer grants Lake Formation permissions on an AWS Glue Data Catalog resource to an external account, or directly to an AWS Identity and Access Management (IAM) principal in another account. Lake Formation uses AWS Resource Access Manager (AWS RAM) to share the resource. If the grantee account is in the same organization as the grantor account, the shared resource is available immediately to the grantee. If the grantee account is not in the same organization, AWS RAM sends an invitation to the grantee account to accept or reject the resource grant. To make the shared resource available, the consumer administrator in the grantee account must use the AWS RAM console or AWS Command Line Interface (AWS CLI) to accept the invitation.

Authorized principals can share resources explicitly with an IAM principal in an external account. This feature is useful when the producer wants to have control over who in the external account can access the resources. The permissions the IAM principal receives are a union of direct grants and the account-level grants that are cascaded down to the principals. The data lake administrator of the recipient account can view the direct cross-account grants, but can’t revoke permissions.

Pattern 3: Using AWS Lake Formation from the producer external sharing account

The producer may have stringent security requirements where no external consumer should access their production account or their centralized governance account. They may also not have Lake Formation enabled on their production platform. In such cases, as shown in the following diagram, the producer production account (Account A) is dedicated to its internal organization users. The producer creates another account, the producer external sharing account (Account B), which is dedicated for external sharing. This gives the producer more latitude to create specific policies for specific organizations.

The following architecture diagram shows an overview of the pattern.


The producer implements a process to create an asynchronous copy of the data in Account B. The bucket can be configured for Same-Region Replication (SRR) or Cross-Region Replication (CRR) for objects that need to be shared. This facilitates automated refresh of data to the “External Published Datasets” S3 bucket in the external account without having to write any code.

Creating a copy of the data allows the producer to add another degree of separation between the external consumer and its production data. It also helps meet any compliance or data sovereignty requirements.

Lake Formation is set up on Account B, and the administrator creates resources links for the “External Published Datasets” S3 bucket in its account to grant access. The administrator follows the same process to grant access as described earlier.

Pattern 4: Using Amazon Redshift data sharing

This pattern is ideally suited for a producer who has most of their published data products on Amazon Redshift. This pattern also requires the producer’s external sharing account (Account B) and the consumer account (Account C) to have an encrypted Amazon Redshift cluster or Amazon Redshift Serverless endpoint that meets the prerequisites for Amazon Redshift data sharing.

The following architecture diagram shows an overview of the pattern.


Two options are possible depending on the producer’s compliance constraints:

  • Option A – The producer enables data sharing directly on the production Amazon Redshift cluster.
  • Option B – The producer may have constraints with respect to sharing the production cluster. The producer creates a simple AWS Glue job that copies data from the Amazon Redshift cluster in the production Account A to the Amazon Redshift cluster in the external Account B. This AWS Glue job can be scheduled to refresh data as needed by the consumer. When the data is available in Account B, the producer can create multiple views and multiple data shares as needed.

In both options, the producer maintains complete control over what data is being shared, and the consumer admin maintains full control over who can access the data within their organization.

After both the producer and consumer admins approve the data sharing request, the consumer user can access this data as if it were part of their own account without having to write any additional code.

Pattern 5: Sharing data securely and privately using APIs

You can adopt this pattern when the external partner doesn’t have a presence on AWS. You can also use this pattern when published data products are spread across various services like Amazon S3, Amazon Redshift, Amazon DynamoDB, and Amazon OpenSearch Service but the producer would like to maintain a single data sharing interface.

Here’s an example use case: Company A would like to share some of its log data in near-real time with its partner Company B, who uses this data to generate predictive insights for Company A. Company A stores this data in Amazon Redshift. The company wants to share this transactional information with its partner after masking the personally identifiable information (PII) in a cost-effective and secure way to generate insights. Company B doesn’t use the AWS platform.

Company A establishes a microbatch process using an AWS Lambda function or AWS Glue that queries Amazon Redshift to get incremental log data, applies the rules to redact the PII, and loads this data to the “Published Datasets” S3 bucket. This instantiates an SRR/CRR process that refreshes this data in the “External Sharing” S3 bucket.
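As a rough illustration of the redaction step, the following Python sketch masks SSN-like and email-like strings in a log record. The patterns and field names are hypothetical, not Company A's actual rules; a production pipeline would use rules tailored to its own schema, or a managed capability such as dynamic data masking in Amazon Redshift.

```python
import re

# Hypothetical redaction rules; tailor these to your own schema in practice.
PII_RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "***-**-****"),             # SSN-like
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<redacted-email>"),  # email-like
]

def redact(record: dict) -> dict:
    """Apply every redaction rule to every string field of a log record."""
    clean = {}
    for field, value in record.items():
        if isinstance(value, str):
            for pattern, replacement in PII_RULES:
                value = pattern.sub(replacement, value)
        clean[field] = value
    return clean

print(redact({"user": "jdoe@example.com", "ssn": "123-45-6789", "amount": 42}))
```

A Lambda function or AWS Glue job would apply a function like this to each microbatch before writing the cleaned records to the “Published Datasets” S3 bucket.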

The following diagram shows how the consumer can then use an API-based approach to access this data.


The workflow contains the following steps:

  1. An HTTPS API request is sent from the API consumer to the API proxy layer.
  2. The HTTPS API request is forwarded from the API proxy to Amazon API Gateway in the external sharing AWS account.
  3. Amazon API Gateway calls the request receiver Lambda function.
  4. The request receiver function writes the status to a DynamoDB control table.
  5. A second Lambda function, the poller, checks the status of the results in the DynamoDB table.
  6. The poller function fetches results from Amazon S3.
  7. The poller function sends a presigned URL to download the file from the S3 bucket to the requestor via Amazon Simple Email Service (Amazon SES).
  8. The requestor downloads the file using the URL.
  9. The network perimeter AWS account only allows egress internet connection.
  10. The API proxy layer enforces both the egress security controls and perimeter firewall before the traffic leaves the producer’s network perimeter.
  11. The AWS Transit Gateway security egress VPC routing table only allows connectivity from the required producer’s subnet, while preventing internet access.
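The control-table portion of steps 4–6 can be sketched with an in-memory dictionary standing in for the DynamoDB table. The function names and the S3 key are hypothetical; a real implementation would use boto3 against DynamoDB and Amazon S3, and step 7 would email a presigned URL via Amazon SES rather than return the key directly.

```python
import uuid

# In-memory stand-in for the DynamoDB control table; illustrative only.
control_table = {}

def receive_request(dataset: str) -> str:
    """Request receiver: record a new request as PENDING (workflow step 4)."""
    request_id = str(uuid.uuid4())
    control_table[request_id] = {"dataset": dataset, "status": "PENDING"}
    return request_id

def export_ready(request_id: str, s3_key: str) -> None:
    """Called when the extract lands in S3: mark the request READY."""
    control_table[request_id].update(status="READY", s3_key=s3_key)

def poll(request_id: str):
    """Poller: check the status (step 5) and, when READY, hand back the
    result location (steps 6-7 in the real flow)."""
    item = control_table[request_id]
    return item["s3_key"] if item["status"] == "READY" else None

rid = receive_request("daily_logs")
print(poll(rid))  # still pending, so no result location yet
export_ready(rid, "exports/daily_logs/part-0001.csv")
print(poll(rid))
```

Decoupling the receiver and the poller through the control table is what lets the API return immediately while the potentially slow extract runs asynchronously.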

Pattern 6: Using Amazon S3 access points

Data scientists may need to work collaboratively on images, videos, and text documents. Legal and audit groups may want to share reports and statements with auditing agencies. This pattern discusses an approach to sharing such documents. The pattern assumes that the external partners are also on AWS. Amazon S3 access points allow the producer to share access with consumers by setting up cross-account access without having to edit bucket policies.

Access points are named network endpoints that are attached to buckets that you can use to perform S3 object operations, such as GetObject and PutObject. Each access point has distinct permissions and network controls that Amazon S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy attached to the underlying bucket.

The following architecture diagram shows an overview of the pattern.


The producer creates an S3 bucket and enables the use of access points. As part of the configuration, the producer specifies the consumer account, IAM role, and privileges for the consumer IAM role.

The consumer users with the IAM role in the consumer account can access the S3 bucket via the internet or restricted to an Amazon VPC via VPC endpoints and AWS PrivateLink.
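For reference, an access point policy granting a consumer role read access might look like the following sketch, which builds the policy document in Python. The account IDs, role name, and access point name are placeholders; in practice you would attach this JSON via the S3 console, the AWS CLI, or infrastructure as code. Note that object operations through an access point use the /object/ prefix in the resource ARN.

```python
import json

# Illustrative access point policy; all identifiers below are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/ConsumerAnalystRole"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            # The access point itself (for bucket-level operations)
            "arn:aws:s3:us-east-1:444455556666:accesspoint/shared-docs-ap",
            # Objects reached through the access point
            "arn:aws:s3:us-east-1:444455556666:accesspoint/shared-docs-ap/object/*",
        ],
    }],
}
print(json.dumps(policy, indent=2))
```

This access point policy works in conjunction with the underlying bucket policy, so the producer can keep the bucket policy simple and express per-consumer grants at the access point level.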


Conclusion

Each organization has its own unique set of constraints and requirements that it needs to fulfill to set up an efficient data sharing solution. In this post, we demonstrated various options and best practices available to organizations. The data platform owner and security team should work together to assess what works best for your specific situation. Your AWS account team is also available to help.

About the Authors

Venkata Sistla is a Cloud Architect – Data & Analytics at AWS. He specializes in building data processing capabilities and helping customers remove constraints that prevent them from leveraging their data to develop business insights.

Santosh Chiplunkar is a Principal Resident Architect at AWS. He has over 20 years of experience helping customers solve their data challenges. He helps customers develop their data and analytics strategy and provides them with guidance on how to make it a reality.

A hybrid approach in healthcare data warehousing with Amazon Redshift

Post Syndicated from Bindhu Chinnadurai original https://aws.amazon.com/blogs/big-data/a-hybrid-approach-in-healthcare-data-warehousing-with-amazon-redshift/

Data warehouses play a vital role in healthcare decision-making and serve as a repository of historical data. A healthcare data warehouse can be a single source of truth for clinical quality control systems. Data warehouses are mostly built using the dimensional model approach, which has consistently met business needs.

Loading complex multi-point datasets into a dimensional model, identifying issues, and validating data integrity of the aggregated and merged data points are the biggest challenges that clinical quality management systems face. Additionally, scalability of the dimensional model is complex and poses a high risk of data integrity issues.

The data vault approach solves most of the problems associated with dimensional models, but it brings other challenges in clinical quality control applications and regulatory reports. Because data is closer to the source and stored in raw format, it has to be transformed before it can be used for reporting and other application purposes. This is one of the biggest hurdles with the data vault approach.

In this post, we discuss some of the main challenges enterprise data warehouses face when working with dimensional models and data vaults. We dive deep into a hybrid approach that aims to circumvent the issues posed by these two and also provide recommendations to take advantage of this approach for healthcare data warehouses using Amazon Redshift.

What is a dimensional data model?

Dimensional modeling is a strategy for storing data in a data warehouse using dimensions and facts. It optimizes the database for faster data retrieval. Dimensional models have a distinct structure and organize data to provide reports that increase performance.

In a dimensional model, a transaction record is divided either into facts (often numerical), additive transactional data, or dimensions (referential information that gives context to the facts). This categorization of data into facts and dimensions, as well as the entity-relationship framework of the dimensional model, presents complex business processes in a way that is easy for analysts to understand.

A dimensional model in data warehousing is designed for reading, summarizing, and analyzing numerical information such as patient vital stats, lab reading values, counts, and so on. Regardless of the division or use case it is related to, dimensional data models can be used to store data obtained from tracking various processes like patient encounters, provider practice metrics, aftercare surveys, and more.

The majority of healthcare clinical quality data warehouses are built on top of dimensional modeling techniques. The benefit of using dimensional data modeling is that, when data is stored in a data warehouse, it’s easier to persist and extract it.

Although it’s a competent data structure technique, the dimensional modeling approach poses challenges in scalability, source tracking, and troubleshooting. Tracking and validating the source of aggregated and computed data points is important in clinical quality regulatory reporting systems. Any mistake in regulatory reports may result in a large penalty from regulatory and compliance agencies. These challenges exist because the data points are labeled using meaningless numeric surrogate keys, and any minor error can impair prediction accuracy and consequently affect the quality of judgments. The ways to counter these challenges are refactoring and bridging the dimensions, but that adds data noise over time and reduces accuracy.

Let’s look at an example of a typical dimensional data warehouse architecture in healthcare, as shown in the following logical model.

The following diagram illustrates a sample dimensional model entity-relationship diagram.

This data model contains dimensions and fact tables. You can use the following query to retrieve basic provider and patient relationship data from the dimensional model:

```sql
SELECT * FROM Fac_PatientEncounter FP
JOIN Dim_PatientEncounter DP ON FP.EncounterKey = DP.EncounterKey
JOIN Dim_Provider PR ON PR.ProviderKey = FP.ProviderKey
```
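The query above assumes a star schema along the following lines. This is an illustrative sketch only; the table names echo the entities in the query, but the columns and data types are assumptions:

```sql
-- Illustrative star schema (columns and types are assumptions, not the actual model)
CREATE TABLE Dim_Provider (
    ProviderKey   INTEGER PRIMARY KEY,   -- numeric surrogate key
    ProviderNPI   VARCHAR(10),           -- natural (business) identifier
    ProviderName  VARCHAR(100),
    Specialty     VARCHAR(50)
);

CREATE TABLE Dim_PatientEncounter (
    EncounterKey  INTEGER PRIMARY KEY,   -- numeric surrogate key
    EncounterType VARCHAR(30),
    AdmitDate     DATE,
    DischargeDate DATE
);

-- Fact table: surrogate keys plus numeric, additive measures
CREATE TABLE Fac_PatientEncounter (
    EncounterKey  INTEGER REFERENCES Dim_PatientEncounter (EncounterKey),
    ProviderKey   INTEGER REFERENCES Dim_Provider (ProviderKey),
    LabValue      DECIMAL(10,2),
    LengthOfStay  INTEGER
);
```

Note that the fact table references its dimensions only through numeric surrogate keys, which is the source of the traceability challenges discussed earlier.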

Challenges of dimensional modeling

Dimensional modeling requires data preprocessing before generating a star schema, which involves a large amount of data processing. Any change to the dimension definition results in a lengthy and time-consuming reprocessing of the dimension data, which often results in data redundancy.

Another issue is that, when relying solely on dimensional modeling, analysts can’t assure the consistency and accuracy of data sources. This is especially problematic in healthcare, where lineage, compliance, history, and traceability are of prime importance because of the regulations in place.

A data vault seeks to provide an enterprise data warehouse while solving the shortcomings of dimensional modeling approaches. It is a data modeling methodology designed for large-scale data warehouse platforms.

What is a data vault?

The data vault approach is a method and architectural framework for providing a business with data analytics services to support business intelligence, data warehousing, analytics, and data science needs. The data vault is built around business keys (hubs) defined by the company, which are distinct from the technical keys obtained from the source systems.

Amazon Redshift RA3 instances and Amazon Redshift Serverless are perfect choices for a data vault. And when combined with Amazon Redshift Spectrum, a data vault can deliver more value.

There are three layers to the data vault:

  • Staging
  • Data vault
  • Business vault

Staging involves the creation of a replica of the original data, which is primarily used to aid the process of transporting data from various sources to the data warehouse. There are no restrictions on this layer, and it is typically not persistent. It is 1:1 with the source systems, generally in the same format as that of the sources.

The data vault is based on business keys (hubs), which are defined by the business. All in-scope data is loaded, and auditability is maintained. At the heart of all data warehousing is integration, and this layer contains integrated data from multiple sources built around the enterprise-wide business keys. A data vault resembles a data lake in that it keeps raw, source-aligned data, but it also provides data warehouse features such as integration and auditability, effectively combining the functionalities of both.
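As a sketch of this layer (entity names follow the sample model used later in this post; columns and types are assumptions), a hub holds only the business key, a link relates hubs, and a satellite holds descriptive attributes, each carrying load metadata for auditability:

```sql
-- Hub: one row per enterprise-wide business key
CREATE TABLE Hub_Provider (
    ProviderKey   VARCHAR(32) PRIMARY KEY,  -- key derived from the business key
    ProviderNPI   VARCHAR(10) NOT NULL,     -- the business key itself
    LoadDate      TIMESTAMP   NOT NULL,
    RecordSource  VARCHAR(50) NOT NULL      -- metadata identifying the source system
);

-- Link: relates business keys across hubs
CREATE TABLE Lnk_PatientEncounter (
    PatientEncounterLinkKey VARCHAR(32) PRIMARY KEY,
    ProviderKey             VARCHAR(32) REFERENCES Hub_Provider (ProviderKey),
    PatientEncounterKey     VARCHAR(32),
    LoadDate                TIMESTAMP   NOT NULL,
    RecordSource            VARCHAR(50) NOT NULL
);

-- Satellite: descriptive context, insert-only and historized by load date
CREATE TABLE Sat_Provider (
    ProviderKey   VARCHAR(32) REFERENCES Hub_Provider (ProviderKey),
    LoadDate      TIMESTAMP   NOT NULL,
    ProviderName  VARCHAR(100),
    Specialty     VARCHAR(50),
    PRIMARY KEY (ProviderKey, LoadDate)
);
```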

The business vault stores the outcome of business rules, including deduplication, conforming results, and even computations. When results are calculated for two or more data marts, this helps eliminate redundant computation and associated inconsistencies.

Because the business vault alone still doesn’t satisfy reporting needs, enterprises create a data mart on top of it for dashboarding.

Data marts are ephemeral views that can be implemented directly on top of the business and raw vaults. This makes it easy to adapt over time and eliminates the danger of inconsistent results. If views don’t give the required level of performance, the results can be stored in a table. This is the presentation layer and is designed to be requirements-driven and scope-specific subsets of the warehouse data. Although dimensional modeling is commonly used to deliver this layer, marts can also be flat files, .xml files, or in other forms.

The following diagram shows the typical data vault model used in clinical quality repositories.

When the dimensional model as shown earlier is converted into a data vault using the same structure, it can be represented as follows.

Advantages of a data vault

Although any data warehouse should be built within the context of an overarching company strategy, data vaults permit incremental delivery. You can start small and gradually add more sources over time, just like Kimball’s dimensional design technique.

With a data vault, you don’t have to redesign the structure when adding new sources, unlike dimensional modeling. Business rules can be easily changed because raw and business-generated data is kept independent of each other in a data vault.

A data vault isolates technical data reorganization from business rules, thereby facilitating the separation of these potentially tricky processes. Similarly, data cleaning can be maintained separately from data import.

A data vault accommodates changes over time. Unlike a pure dimensional design, a data vault separates raw and business-generated data and accepts changes from both sources.

Data vaults make it easy to maintain data lineage because they include metadata identifying the source systems. In contrast to dimensional design, where data is cleansed before loading, data vault updates are always gradual, and results are never lost, providing an automatic audit trail.

When raw data is stored in a data vault, historical attributes that weren’t initially available can be added to the presentation area. Because data marts can be implemented as views, this can be as simple as adding a new column to an existing view.

In data vault 2.0, hash keys eliminate data load dependencies, which allows near-real-time data loading, as well as concurrent data loads of terabytes to petabytes. The process of mastering both entity-relationship modeling and dimensional design takes time and practice, but the process of automating a data vault is easier.
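For example, because a hash key can be derived deterministically from the business key alone, each hub, link, and satellite can be loaded in parallel without waiting for sequence-generated keys. A minimal sketch, assuming a hypothetical staging table stg_provider and using the MD5 function available in Amazon Redshift:

```sql
-- Derive hash keys from business keys at load time (illustrative)
SELECT
    MD5(UPPER(TRIM(provider_npi))) AS provider_hash_key,  -- deterministic hub key
    provider_npi                   AS provider_business_key,
    GETDATE()                      AS load_date,
    'SOURCE_EHR'                   AS record_source       -- hypothetical source name
FROM stg_provider;
```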

Challenges of a data vault

A data vault is not a one-size-fits-all solution for data warehouses, and it does have a few limitations.

To begin with, when feeding the data vault model directly into a report on one subject area, you need to combine multiple types of data. Because reporting technologies can’t perform that kind of data processing, this integration can reduce report performance and increase the risk of errors. However, you can improve report performance by incorporating dimensional models or adding reporting layers on top of the data vault, and for subject areas that can be reported on directly, you can develop a dimensional model.

Additionally, data vaults are less effective when the data is static or comes from a single source. Such sources negate many of the benefits of a data vault and require additional business logic that could otherwise be avoided.

The storage requirement for a data vault is also significantly higher. Modeling a single subject area as separate hub, link, and satellite tables effectively triples the number of tables, and because loads are insert-only, historical rows accumulate as well. If the data is basic, you can achieve the benefits listed here with a simpler dimensional model rather than deploying a data vault.

The following sample query retrieves provider and patient data from a data vault using the sample model we discussed in this section:

```sql
SELECT * FROM Lnk_PatientEncounter LP
JOIN Hub_Provider HP ON LP.ProviderKey = HP.ProviderKey
JOIN Dim_Sat_Provider DSP ON HP.ProviderKey = DSP.ProviderKey AND DSP._Current = 1
JOIN Hub_Patient Pt ON Pt.PatientEncounterKey = LP.PatientEncounterKey
JOIN Dim_Sat_PatientEncounter DPt ON DPt.PatientEncounterKey = Pt.PatientEncounterKey AND DPt._Current = 1
```

The query involves many joins, which increases the depth and time for the query run, as illustrated in the following chart.

The following table shows that SQL depth and runtime are proportional, where depth is the number of joins. As the number of joins increases, the runtime increases, and therefore so does the cost.

| SQL Depth | Runtime in Seconds | Cost per Query in Seconds |
| --- | --- | --- |
| 14 | 80 | 40,000 |
| 12 | 60 | 30,000 |
| 5 | 30 | 15,000 |
| 3 | 25 | 12,500 |

The hybrid model addresses major issues raised by the data vault and dimensional model approaches that we’ve discussed in this post, while also allowing improvements in data collection, including IoT data streaming.

What is a hybrid model?

The hybrid model combines the data vault and a portion of the star schema to provide the advantages of both the data vault and dimensional model, and is mainly intended for logical enterprise data warehouses.

The hybrid approach is designed from the bottom up to be gradual and modular, and it can be used for big data, structured, and unstructured datasets. The primary data contains the business rules and enterprise-level data standards norms, as well as additional metadata needed to transform, validate, and enrich data for dimensional approaches. In this model, data processes from left to right provide data vault advantages, and data processes from right to left provide dimensional model advantages. Here, the data vault satellite tables serve as both satellite tables and dimensional tables.

After combining the dimensional and the data vault models, the hybrid model can be viewed as follows.

The following is an example entity-relation diagram of the hybrid model, which consists of a fact table from the dimensional model and all other entities from the data vault. The satellite entity from the data vault plays the dual role. When it’s connected to a data vault, it acts as a sat table, and when connected to a fact table, it acts as a dimension table. To serve this dual purpose, sat tables have two keys: a foreign key to connect with the data vault, and a primary key to connect with the fact table.
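A satellite playing this dual role can be sketched as follows. The column names are assumptions based on the sample queries in this section: DimProviderID is the primary key that fact tables join on, and ProviderKey is the foreign key back to the hub:

```sql
-- Satellite that doubles as a dimension (illustrative)
CREATE TABLE Dim_Sat_Provider (
    DimProviderID BIGINT IDENTITY(1,1),  -- primary key, joined by fact tables
    ProviderKey   VARCHAR(32),           -- foreign key back to Hub_Provider
    ProviderName  VARCHAR(100),
    Specialty     VARCHAR(50),
    _Current      SMALLINT,              -- 1 marks the latest version of the row
    LoadDate      TIMESTAMP
);
```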

The following diagram illustrates the physical hybrid data model.

The following diagram illustrates a typical hybrid data warehouse architecture.

The following query retrieves provider and patient data from the hybrid model:

```sql
SELECT * FROM Fac_PatientEncounter FP
JOIN Dim_Sat_Provider DSP ON FP.DimProviderID = DSP.DimProviderID
JOIN Dim_Sat_PatientEncounter DPt ON DPt.DimPatientEncounterID = FP.DimPatientEncounterID
```

By using the hybrid model, the number of tables in the query is reduced from five to three.

Advantages of using the hybrid model

With this model, structural information is segregated from descriptive information to promote flexibility and avoid re-engineering in the event of a change. It also maintains data integrity, helping organizations avoid the hefty fines that can follow when integrity is compromised.

The hybrid paradigm enables non-data professionals to interact with raw data by allowing users to update or create metadata and data enrichment rules. The hybrid approach simplifies the process of gathering and evaluating datasets for business applications. It enables concurrent data loading and eliminates the need for a corporate vault.

The hybrid model also benefits from the fact that there is no dependency between objects in the data storage. With hybrid data warehousing, scalability is multiplied.

You can build the hybrid model on AWS and take advantage of the benefits of Amazon Redshift, which is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, simple, and secure analytics at scale. Amazon Redshift continuously adds features that make it faster, more elastic, and easier to use:

  • Amazon Redshift data sharing enhances the hybrid model by eliminating the need for copying data across departments. It also simplifies the work of keeping the single source of truth, saving memory and limiting redundancy. It enables instant, granular, and fast data access across Amazon Redshift clusters without the need to copy or move it. Data sharing provides live access to data so that users always see the most up-to-date and consistent information as it’s updated in the data warehouse.
  • Redshift Spectrum enables you to query open format data directly in the Amazon Simple Storage Service (Amazon S3) data lake without having to load the data or duplicate your infrastructure, and it integrates well with the data lake.
  • With Amazon Redshift concurrency scaling, you can get consistently fast performance for thousands of concurrent queries and users. It instantly adds capacity to support additional users and removes it when the load subsides, with nothing to manage at your end.
  • To realize the benefits of using a hybrid model on AWS, you can get started today without needing to provision and manage data warehouse clusters using Redshift Serverless. All the related services that Amazon Redshift integrates with (such as Amazon Kinesis, AWS Lambda, Amazon QuickSight, Amazon SageMaker, Amazon EMR, AWS Lake Formation, and AWS Glue) are available to work with Redshift Serverless.
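For instance, sharing a business vault schema from a producer to a consumer (such as a departmental data mart) can be sketched as follows; the datashare, schema, and namespace values are placeholders:

```sql
-- On the producer: expose the business vault schema
CREATE DATASHARE business_vault_share;
ALTER DATASHARE business_vault_share ADD SCHEMA business_vault;
ALTER DATASHARE business_vault_share ADD ALL TABLES IN SCHEMA business_vault;
GRANT USAGE ON DATASHARE business_vault_share TO NAMESPACE '<consumer-namespace-id>';

-- On the consumer: query the shared data live, without copying it
CREATE DATABASE business_vault_db
    FROM DATASHARE business_vault_share OF NAMESPACE '<producer-namespace-id>';
```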


Conclusion

With the hybrid model, data can be transformed and loaded into a target data model efficiently and transparently. With this approach, data partners can research data networks more efficiently and promote comparative effectiveness. And with the several newly introduced features of Amazon Redshift, a lot of heavy lifting is done by AWS to handle your workload demands, and you only pay for what you use.

You can get started with the following steps:

  1. Create an Amazon Redshift RA3 instance for your primary clinical data repository and data marts.
  2. Build a data vault schema for the raw vault and create materialized views for the business vault.
  3. Enable Amazon Redshift data shares to share data between the producer cluster and consumer cluster.
  4. Load the structured and unstructured data into the producer cluster data vault for business use.
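As a hedged sketch of step 2, the business vault can be materialized over the raw vault with a view like the following (the hub and satellite names are illustrative):

```sql
-- Business vault object computed over the raw vault (illustrative)
CREATE MATERIALIZED VIEW bv_provider AS
SELECT h.ProviderNPI,
       s.ProviderName,
       s.Specialty,
       s.LoadDate
FROM Hub_Provider h
JOIN Sat_Provider s ON s.ProviderKey = h.ProviderKey;
```

Amazon Redshift can refresh such a view, so downstream data marts read precomputed results instead of re-joining the raw vault on every query.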

About the Authors

Bindhu Chinnadurai is a Senior Partner Solutions Architect at AWS based in London, United Kingdom. She has spent 18+ years working in large-scale enterprise environments. Currently she engages with AWS partners to help customers migrate their workloads to AWS with a focus on scalability, resiliency, performance, and sustainability. Her expertise is DevSecOps.

Sarathi Balakrishnan was the Global Partner Solutions Architect, specializing in Data, Analytics and AI/ML at AWS. He worked closely with AWS partners globally to build solutions and platforms on AWS to accelerate customers’ business outcomes with state-of-the-art cloud technologies and achieve more in their cloud explorations. He helped with solution architecture, technical guidance, and best practices to build cloud-native solutions. He joined AWS with over 20 years of large enterprise experience in the agriculture, insurance, healthcare and life sciences, marketing, and advertising industries to develop and implement data and AI strategies.

Build a data storytelling application with Amazon Redshift Serverless and Toucan

Post Syndicated from Louis Hourcade original https://aws.amazon.com/blogs/big-data/build-a-data-storytelling-application-with-amazon-redshift-serverless-and-toucan/

This post was co-written with Django Bouchez, Solution Engineer at Toucan.

Business intelligence (BI) with dashboards, reports, and analytics remains one of the most popular use cases for data and analytics. It provides business analysts and managers with a visualization of the business’s past and current state, helping leaders make strategic decisions that dictate the future. However, customers continue to ask for better ways to tell stories with their data, and therefore increase the adoption rate of their BI tools.

Most BI tools on the market provide an exhaustive set of customization options to build data visualizations. It might appear as a good idea, but ultimately burdens business analysts that need to navigate through endless possibilities before building a report. Analysts are not graphic designers, and a poorly designed data visualization can hide the insight it’s intended to convey, or even mislead the viewer. To realize more value from your data, you should focus on building data visualizations that tell stories, and are easily understandable by your audience. This is where guided analytics helps. Instead of presenting unlimited options for customization, it intentionally limits choice by enforcing design best practices. The simplicity of a guided experience enables business analysts to spend more time generating actual insight rather than worrying about how to present them.

This post illustrates the concept of guided analytics and shows you how you can build a data storytelling application with Amazon Redshift Serverless and Toucan, an AWS Partner. Toucan natively integrates with Redshift Serverless, which enables you to deploy a scalable data stack in minutes without the need to manage any infrastructure component.

Amazon Redshift is a fully managed cloud data warehouse service that enables you to analyze large amounts of structured and semi-structured data. Amazon Redshift can scale from a few gigabytes to a petabyte-scale data warehouse, and AWS recently announced the global availability of Redshift Serverless, making it one of the best options for storing data and running ad hoc analytics in a scalable and cost-efficient way.

With Redshift Serverless, you can get insights on your data by running standalone SQL queries or by using data visualizations tools such as Amazon QuickSight, Toucan, or other third-party options without having to manage your data warehouse infrastructure.

Toucan is a cloud-based guided analytics platform built with one goal in mind: reduce the complexity of bringing data insights to business users. For this purpose, Toucan provides a no-code and comprehensive user experience at every stage of the data storytelling application, which includes data connection, building the visualization, and distribution on any device.

If you’re in a hurry and want to see what you can do with this integration, check out Shark attacks visualization with AWS & Toucan, where Redshift Serverless and Toucan help in understanding the evolution of shark attacks in the world.

Overview of solution

There are many BI tools in the market, each providing an ever-increasing set of capabilities and customization options to differentiate from the competition. Paradoxically, this doesn’t seem to increase the adoption rate of BI tools in enterprises. With more complex tools, data owners spend time building fancy visuals, and tend to add as much information as possible in their dashboards instead of providing a clear and simple message to business users.

In this post, we illustrate the concept of guided analytics by putting ourselves in the shoes of a data engineer that needs to communicate stories to business users with data visualizations. This fictional data engineer has to create dashboards to understand how shark attacks evolved in the last 120 years. After loading the shark attacks dataset in Redshift Serverless, we guide you in using Toucan to build stories that provide a better understanding of shark attacks through time. With Toucan, you can natively connect to datasets in Redshift Serverless, transform the data with a no-code interface, build storytelling visuals, and publish them for business users. The shark attacks visualization example illustrates what you can achieve by following instructions in this post.

Additionally, we have recorded a video tutorial that explains how to connect Toucan with Redshift Serverless and start building charts.

Solution architecture

The following diagram depicts the architecture of our solution.

Architecture diagram

We use an AWS CloudFormation stack to deploy all the resources you need in your AWS account:

  • Networking components – This includes a VPC, three public subnets, an internet gateway, and a security group to host the Redshift Serverless endpoint. In this post, we use public subnets to facilitate data access from external sources such as Toucan instances. In this case, the data in Redshift Serverless is still protected by the security group that restricts incoming traffic, and by the database credentials. For a production workload, it is recommended to keep traffic in the Amazon network. For that, you can set the Redshift Serverless endpoints in private subnets, and deploy Toucan in your AWS account through the AWS Marketplace.
  • Redshift Serverless components – This includes a Redshift Serverless namespace and workgroup. The Redshift Serverless workgroup is publicly accessible to facilitate the connection from Toucan instances. The database name and the administrator user name are defined as parameters when deploying the CloudFormation stack, and the administrator password is created in AWS Secrets Manager. In this post, we use database credentials to connect to Redshift Serverless, but Toucan also supports connection with AWS credentials and AWS Identity and Access Management (IAM) profiles.
  • Custom resources – The CloudFormation stack includes a custom resource, which is an AWS Lambda function that loads shark attacks data automatically in your Redshift Serverless database when the CloudFormation stack is created.
  • IAM roles and permissions – Finally, the CloudFormation stack includes all IAM roles associated with services previously mentioned to interact with other AWS resources in your account.

In the following sections, we provide all the instructions to connect Toucan with your data in Redshift Serverless, and guide you to build your data storytelling application.

Sample dataset

In this post, we use a custom dataset that lists all known shark attacks in the world, starting from 1900. You don’t have to import the data yourself; we use the Amazon Redshift COPY command to load the data when deploying the CloudFormation stack. The COPY command is one of the fastest and most scalable methods to load data into Amazon Redshift. For more information, refer to Using a COPY command to load data.
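The load the stack performs is conceptually similar to the following statement; the bucket, file name, and IAM role here are placeholders, not the values the CloudFormation stack actually uses:

```sql
COPY shark_attacks
FROM 's3://<your-bucket>/shark_attacks.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
CSV
IGNOREHEADER 1
REGION 'us-east-1';
```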

The dataset contains 4,900 records with the following columns:

  • Date
  • Year
  • Decade
  • Century
  • Type
  • Zone_Type
  • Zone
  • Country
  • Activity
  • Sex
  • Age
  • Fatal
  • Time
  • Species
  • href (a PDF link with the description of the context)
  • Case_Number


Prerequisites

For this solution, you should have the following prerequisites:

  • An AWS account. If you don’t have one already, see the instructions in Sign Up for AWS.
  • An IAM user or role with permissions on AWS resources used in this solution.
  • A Toucan free trial to build the data storytelling application.

Set up the AWS resources

You can launch the CloudFormation stack in any Region where Redshift Serverless is available.

  1. Choose Launch Stack to start creating the required AWS resources for this post.
  2. Specify the database name in Redshift Serverless (default is dev).
  3. Specify the administrator user name (default is admin).

You don’t have to specify the database administrator password because it’s created in Secrets Manager by the CloudFormation stack. The secret’s name is AWS-Toucan-Redshift-Password. We use the secret value in subsequent steps.

Test the deployment

The CloudFormation stack takes a few minutes to deploy. When it’s complete, you can confirm the resources were created. To access your data, you need to get the Redshift Serverless database credentials.

  1. On the Outputs tab for the CloudFormation stack, note the name of the Secrets Manager secret.
  2. On the Secrets Manager console, navigate to the Amazon Redshift database secret and choose Retrieve secret value to get the database administrator user name and password.
  3. To make sure your Redshift Serverless database is available and contains the shark attacks dataset, open the Redshift Serverless workgroup on the Amazon Redshift console and choose Query data to access the query editor.
  4. Also note the Redshift Serverless endpoint, which you need to connect with Toucan.
  5. In the Amazon Redshift query editor, run the following SQL query to view the shark attacks data:
SELECT * FROM "dev"."public"."shark_attacks";

Redshift Query Editor v2

Note that you need to change the name of the database in the SQL query if you change the default value when launching the CloudFormation stack.
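Beyond the SELECT * check, a quick aggregation confirms the load; the dataset should contain about 4,900 rows:

```sql
-- Row count and attacks per decade (column names from the dataset description)
SELECT COUNT(*) FROM "dev"."public"."shark_attacks";

SELECT Decade, COUNT(*) AS attacks
FROM "dev"."public"."shark_attacks"
GROUP BY Decade
ORDER BY Decade;
```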

You have configured Redshift Serverless in your AWS account and uploaded the shark attacks dataset. Now it’s time to use this data by building a storytelling application.

Launch your Toucan free trial

The first step is to access Toucan platform through the Toucan free trial.

Fill the form and complete the signup steps. You then arrive in the Storytelling Studio, in Staging mode. Feel free to explore what has been already created.

Toucan Home page

Connect Redshift Serverless with Toucan

To connect Redshift Serverless and Toucan, complete the following steps:

  1. Choose Datastore at the bottom of the Toucan Storytelling Studio.
  2. Choose Connectors.

Toucan is natively integrated with Redshift Serverless with AnyConnect.

  3. Search for the Amazon Redshift connector, and complete the form with the following information:
    • Name – The name of the connector in Toucan.
    • Host – Your Redshift Serverless endpoint.
    • Port – The listening port of your Amazon Redshift database (5439).
    • Default Database – The name of the database to connect to (dev by default, unless edited in the CloudFormation stack parameters).
    • Authentication Method – The authentication mechanism to connect to Redshift Serverless. In this case, we use database credentials.
    • User – The user name to use for authentication with Redshift Serverless (admin by default, unless edited in the CloudFormation stack parameters).
    • Password – The password to use for authentication with Redshift Serverless (you should retrieve it from Secrets Manager; the secret’s name is AWS-Toucan-Redshift-Password).

Toucan connection

Create a live query

You are now connected to Redshift Serverless. Complete the following steps to create a query:

  1. On the home page, choose Add tile to create a new visualization.

Toucan new tile

  2. Choose the Live Connections tab, then choose the Amazon Redshift connector you created in the previous step.

Toucan Live Connection

The Toucan trial guides you in building your first live query, in which you can transform your data without writing code using the Toucan YouPrep module.

For instance, as shown in the following screenshot, you can use this no-code interface to compute the sum of fatal shark attacks by activities, get the top five, and calculate the percentage of the total.

Toucan query data
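For reference, the same transformation expressed in SQL would look roughly like the following; the column names come from the dataset description earlier in this post, and the 'Y' encoding of the Fatal flag is an assumption:

```sql
-- Top 5 activities by fatal attacks, with share of the total (illustrative)
WITH fatal_by_activity AS (
    SELECT Activity, COUNT(*) AS fatal_attacks
    FROM shark_attacks
    WHERE Fatal = 'Y'   -- assumed encoding of the fatal flag
    GROUP BY Activity
)
SELECT Activity,
       fatal_attacks,
       ROUND(100.0 * fatal_attacks / SUM(fatal_attacks) OVER (), 1) AS pct_of_total
FROM fatal_by_activity
ORDER BY fatal_attacks DESC
LIMIT 5;
```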

Build your first chart

When your data is ready, choose the Tile tab and complete the form that helps you build charts.

For example, you can configure a leaderboard of the five most dangerous activities, and add a highlight for activities with more than 100 attacks.

Choose Save Changes to save your work and go back to the home page.

Toucan chart builder

Publish and share your work

Until this stage, you have been working in Staging mode. To make your work available to everyone, you need to publish it into Production.

On the bottom right of the home page, choose the eye icon to preview your work by putting yourself in the shoes of your future end-users. You can then choose Publish to make your work available to all.

Toucan publish

Toucan also offers multiple embedding options to make your charts easier for end-users to access, such as mobile and tablet.

Toucan multi devices

Following these steps, you connected to Redshift Serverless, transformed the data with the Toucan no-code interface, and built data visualizations for business end-users. The Toucan trial guides you in every stage of this process to help you get started.

Redshift Serverless and Toucan guided analytics provide an efficient approach to increase the adoption rate of BI tools by decreasing infrastructure work for data engineers, and by simplifying dashboard understanding for business end-users. This post only covered a small part of what Redshift Serverless and Toucan offer, so feel free to explore other functionalities in the Amazon Redshift Serverless documentation and Toucan documentation.

Clean up

Some of the resources deployed in this post through the CloudFormation template incur costs as long as they’re in use. Be sure to remove the resources and clean up your work when you’re finished in order to avoid unnecessary cost.

On the CloudFormation console, choose Delete stack to remove all resources.


Conclusion

This post showed you how to set up an end-to-end architecture for guided analytics with Redshift Serverless and Toucan.

This solution benefits from the scalability of Redshift Serverless, which enables you to store, transform, and expose data in a cost-efficient way, and without any infrastructure to manage. Redshift Serverless natively integrates with Toucan, a guided analytics tool designed to be used by everyone, on any device.

Guided analytics focuses on communicating stories through data reports. By setting intentional constraints on customization options, Toucan makes it easy for data owners to build meaningful dashboards with a clear and concise message for end-users. It works for both your internal and external customers, on an unlimited number of use cases.

Try it now with our CloudFormation template and a free Toucan trial!

About the Authors

Louis Hourcade
is a Data Scientist in the AWS Professional Services team. He works with AWS customer across various industries to accelerate their business outcomes with innovative technologies. In his spare time he enjoys running, climbing big rocks, and surfing (not so big) waves.

Benjamin Menuet
is a Data Architect with AWS Professional Services. He helps customers develop big data and analytics solutions to accelerate their business outcomes. Outside of work, Benjamin is a trail runner and has finished some mythic races like the UTMB.

Xavier Naunay
is a Data Architect with AWS Professional Services. He is part of the AWS ProServe team, helping enterprise customers solve complex problems using AWS services. In his free time, he is either traveling or learning about technology and other cultures.

Django Bouchez
is a Solution Engineer at Toucan. He works alongside the Sales team to provide support on technical and functional validation and proof, and is also helping R&D demo new features with Cloud Partners like AWS. Outside of work, Django is a homebrewer and practices scuba diving and sport climbing.

Automate deployment of an Amazon QuickSight analysis connecting to an Amazon Redshift data warehouse with an AWS CloudFormation template

Post Syndicated from Sandeep Bajwa original https://aws.amazon.com/blogs/big-data/automate-deployment-of-an-amazon-quicksight-analysis-connecting-to-an-amazon-redshift-data-warehouse-with-an-aws-cloudformation-template/

Amazon Redshift is the most widely used data warehouse in the cloud, best suited for analyzing exabytes of data and running complex analytical queries. Amazon QuickSight is a fast business analytics service to build visualizations, perform ad hoc analysis, and quickly get business insights from your data. QuickSight provides easy integration with Amazon Redshift, providing native access to all your data and enabling organizations to scale their business analytics capabilities to hundreds of thousands of users. QuickSight delivers fast and responsive query performance by using a robust in-memory engine (SPICE).

As a QuickSight administrator, you can use AWS CloudFormation templates to migrate assets between distinct environments from development, to test, to production. AWS CloudFormation helps you model and set up your AWS resources so you can spend less time managing those resources and more time focusing on your applications that run in AWS. You no longer need to create data sources or analyses manually. You create a template that describes all the AWS resources that you want, and AWS CloudFormation takes care of provisioning and configuring those resources for you. In addition, with versioning, you have your previous assets, which provides the flexibility to roll back deployments if the need arises. For more details, refer to Amazon QuickSight resource type reference.

In this post, we show how to automate the deployment of a QuickSight analysis connecting to an Amazon Redshift data warehouse with a CloudFormation template.

Solution overview

Our solution consists of the following steps:

  1. Create a QuickSight analysis using an Amazon Redshift data source.
  2. Create a QuickSight template for your analysis.
  3. Create a CloudFormation template for your analysis using the AWS Command Line Interface (AWS CLI).
  4. Use the generated CloudFormation template to deploy a QuickSight analysis to a target environment.

The following diagram shows the architecture of how you can have multiple AWS accounts, each with its own QuickSight environment connected to its own Amazon Redshift data source. In this post, we outline the steps involved in migrating QuickSight assets in the dev account to the prod account. For this post, we use Amazon Redshift as the data source and create a QuickSight visualization using the Amazon Redshift sample TICKIT database.

The following diagram illustrates flow of the high-level steps.


Before setting up the CloudFormation stacks, you must have an AWS account and an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and the services listed in the architecture.

The migration requires the following prerequisites:

Create a QuickSight analysis in your dev environment

In this section, we walk through the steps to set up your QuickSight analysis using an Amazon Redshift data source.

Create an Amazon Redshift data source

To connect to your Amazon Redshift data warehouse, you need to create a data source in QuickSight. As shown in the following screenshot, you have two options:

  • Auto-discovered
  • Manual connect

QuickSight auto-discovers Amazon Redshift clusters that are associated with your AWS account. These resources must be located in the same Region as your QuickSight account.

For more details, refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters.

You can also manually connect and create a data source.

Create an Amazon Redshift dataset

The next step is to create a QuickSight dataset, which identifies the specific data in a data source you want to use.

For this post, we use the TICKIT database created in an Amazon Redshift data warehouse, which consists of seven tables: two fact tables and five dimensions, as shown in the following figure.

This sample database application helps analysts track sales activity for the fictional TICKIT website, where users buy and sell tickets online for sporting events, shows, and concerts.

  1. On the Datasets page, choose New dataset.
  2. Choose the data source you created in the previous step.
  3. Choose Use custom SQL.
  4. Enter the custom SQL as shown in the following screenshot.

The following screenshot shows our completed data source.

Create a QuickSight analysis

The next step is to create an analysis that utilizes this dataset. In QuickSight, you analyze and visualize your data in analyses. When you’re finished, you can publish your analysis as a dashboard to share with others in your organization.

  1. On the All analyses tab of the QuickSight start page, choose New analysis.

The Datasets page opens.

  2. Choose a dataset, then choose Use in analysis.

  3. Create a visual. For more information about creating visuals, see Adding visuals to Amazon QuickSight analyses.

Create a QuickSight template from your analysis

A QuickSight template is a named object in your AWS account that contains the definition of your analysis and references to the datasets used. You can create a template using the QuickSight API by providing the details of the source analysis via a parameter file. You can use templates to easily create a new analysis.

You can use AWS Cloud9 from the console to run AWS CLI commands.

The following AWS CLI command demonstrates how to create a QuickSight template based on the sales analysis you created (provide your AWS account ID for your dev account):

aws quicksight create-template --aws-account-id <DEVACCOUNT> --template-id QS-RS-SalesAnalysis-Temp --cli-input-json file://parameters.json

The parameter.json file contains the following details (provide your source QuickSight user ARN, analysis ARN, and dataset ARN):

{
    "Name": "QS-RS-SalesAnalysis-Temp",
    "Permissions": [
        {
            "Principal": "<QS-USER-ARN>",
            "Actions": [
                "quicksight:CreateTemplate"
            ]
        }
    ],
    "SourceEntity": {
        "SourceAnalysis": {
            "Arn": "<QS-ANALYSIS-ARN>",
            "DataSetReferences": [
                {
                    "DataSetPlaceholder": "sales",
                    "DataSetArn": "<QS-DATASET-ARN>"
                }
            ]
        }
    },
    "VersionDescription": "1"
}

You can use the AWS CLI describe-user, describe_analysis, and describe_dataset commands to get the required ARNs.
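If you prefer to script this step, the following minimal Python sketch assembles the same parameters.json payload from the three ARNs. The function name and placeholder ARNs are illustrative, not part of the original walkthrough:

```python
import json

# Hypothetical helper: builds the create-template input JSON from the
# three ARNs returned by describe-user, describe-analysis, and
# describe-dataset.
def build_template_params(qs_user_arn, analysis_arn, dataset_arn,
                          name="QS-RS-SalesAnalysis-Temp"):
    return {
        "Name": name,
        "Permissions": [
            {"Principal": qs_user_arn,
             "Actions": ["quicksight:CreateTemplate"]},
        ],
        "SourceEntity": {
            "SourceAnalysis": {
                "Arn": analysis_arn,
                "DataSetReferences": [
                    {"DataSetPlaceholder": "sales",
                     "DataSetArn": dataset_arn},
                ],
            },
        },
        "VersionDescription": "1",
    }

# Serialize to the shape expected by --cli-input-json.
params_json = json.dumps(
    build_template_params("<QS-USER-ARN>", "<QS-ANALYSIS-ARN>", "<QS-DATASET-ARN>"),
    indent=4)
```

Writing `params_json` to parameters.json produces the file passed to `--cli-input-json` in the command above.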

To upload the updated parameter.json file to AWS Cloud9, choose File from the tool bar and choose Upload local file.

The QuickSight template is created in the background. QuickSight templates aren’t visible within the QuickSight UI; they’re a developer-managed or admin-managed asset that is only accessible via the AWS CLI or APIs.

To check the status of the template, run the describe-template command:

aws quicksight describe-template --aws-account-id <DEVACCOUNT> --template-id "QS-RS-SalesAnalysis-Temp"

The following code shows command output:

Copy the template ARN; we need it later to create a template in the production account.

The QuickSight template permissions in the dev account need to be updated to give access to the prod account. Run the following command to update the QuickSight template. This provides the describe privilege to the target account to extract details of the template from the source account:

aws quicksight update-template-permissions --aws-account-id <DEVACCOUNT> --template-id "QS-RS-SalesAnalysis-Temp" --grant-permissions file://TemplatePermission.json

The file TemplatePermission.json contains the following details (provide your target AWS account ID):

[
    {
        "Principal": "arn:aws:iam::<TARGET ACCOUNT>",
        "Actions": [
            "quicksight:DescribeTemplate"
        ]
    }
]

To upload the updated TemplatePermission.json file to AWS Cloud9, choose the File menu from the tool bar and choose Upload local file.

Create a CloudFormation template

In this section, we create a CloudFormation template containing our QuickSight assets. In this example, we use a YAML formatted template saved on our local machine. We update the following different sections of the template:

  • AWS::QuickSight::DataSource
  • AWS::QuickSight::DataSet
  • AWS::QuickSight::Template
  • AWS::QuickSight::Analysis

Some of the information required to complete the CloudFormation template can be gathered from the source QuickSight account via the describe AWS CLI commands, and some information needs to be updated for the target account.

Create an Amazon Redshift data source in AWS CloudFormation

In this step, we add the AWS::QuickSight::DataSource section of the CloudFormation template.

Gather the following information on the Amazon Redshift cluster in the target AWS account (production environment):

  • VPC connection ARN
  • Host
  • Port
  • Database
  • User
  • Password
  • Cluster ID

You have the option to create a custom DataSourceID. This ID is unique per Region for each AWS account.

Add the following information to the template:

    Type: 'AWS::QuickSight::DataSource'
    Properties:
      DataSourceId: "RS-Sales-DW"
      AwsAccountId: !Sub '${AWS::AccountId}'
      Name: "RS-Sales-DW"
      Type: REDSHIFT
      VpcConnectionProperties:
        VpcConnectionArn: <VPC-CONNECTION-ARN>
      DataSourceParameters:
        RedshiftParameters:
          Host: "<HOST>"
          Port: <PORT>
          ClusterId: "<CLUSTER ID>"
          Database: "<DATABASE>"
      Credentials:
        CredentialPair:
          Username: <USER>
          Password: <PASSWORD>

Create an Amazon Redshift dataset in AWS CloudFormation

In this step, we add the AWS::QuickSight::DataSet section in the CloudFormation template to match the dataset definition from the source account.

Gather the dataset details and run the list-data-sets command to get all datasets from the source account (provide your source dev account ID):

aws quicksight list-data-sets  --aws-account-id <DEVACCOUNT>

The following code is the output:

Run the describe-data-set command, specifying the dataset ID from the previous command’s response:

aws quicksight describe-data-set --aws-account-id <DEVACCOUNT> --data-set-id "<YOUR-DATASET-ID>"

The following code shows partial output:

Based on the dataset description, add the AWS::Quicksight::DataSet resource in the CloudFormation template, as shown in the following code. Note that you can also create a custom DataSetID. This ID is unique per Region for each AWS account.

    Type: 'AWS::QuickSight::DataSet'
    Properties:
      DataSetId: "RS-Sales-DW"
      Name: "sales"
      AwsAccountId: !Sub '${AWS::AccountId}'
      PhysicalTableMap:
        PhysicalTable1:
          CustomSql:
            Name: "sales"
            SqlQuery: "select sellerid, username, (firstname ||' '|| lastname) as name, city, sum(qtysold) as sales
              from sales, date, users
              where sales.sellerid = users.userid and sales.dateid = date.dateid and year = 2008
              group by sellerid, username, name, city
              order by 5 desc
              limit 10"
            DataSourceArn: !GetAtt RedshiftBuildQSDataSource.Arn
            Columns:
              - Type: INTEGER
                Name: sellerid
              - Type: STRING
                Name: username
              - Type: STRING
                Name: name
              - Type: STRING
                Name: city
              - Type: DECIMAL
                Name: sales
      LogicalTableMap:
        LogicalTable1:
          Alias: sales
          Source:
            PhysicalTableId: PhysicalTable1
          DataTransforms:
            - CastColumnTypeOperation:
                ColumnName: sales
                NewColumnType: DECIMAL
      Permissions:
        - Principal: !Join
            - ''
            - - 'arn:aws:quicksight:'
              - !Ref QuickSightIdentityRegion
              - ':'
              - !Ref 'AWS::AccountId'
              - ':user/default/'
              - !Ref QuickSightUser
          Actions:
            - 'quicksight:UpdateDataSetPermissions'
            - 'quicksight:DescribeDataSet'
            - 'quicksight:DescribeDataSetPermissions'
            - 'quicksight:PassDataSet'
            - 'quicksight:DescribeIngestion'
            - 'quicksight:ListIngestions'
            - 'quicksight:UpdateDataSet'
            - 'quicksight:DeleteDataSet'
            - 'quicksight:CreateIngestion'
            - 'quicksight:CancelIngestion'
      ImportMode: DIRECT_QUERY

You can set ImportMode to either DIRECT_QUERY or SPICE.

Create a QuickSight template in AWS CloudFormation

In this step, we add the AWS::QuickSight::Template section in the CloudFormation template, representing the analysis template.

Use the source template ARN you created earlier and add the AWS::Quicksight::Template resource in the CloudFormation template:

    Type: 'AWS::QuickSight::Template'
    Properties:
      TemplateId: "QS-RS-SalesAnalysis-Temp"
      Name: "QS-RS-SalesAnalysis-Temp"
      AwsAccountId: !Sub '${AWS::AccountId}'
      SourceEntity:
        SourceTemplate:
          Arn: '<SOURCE-TEMPLATE-ARN>'
      Permissions:
        - Principal: !Join
            - ''
            - - 'arn:aws:quicksight:'
              - !Ref QuickSightIdentityRegion
              - ':'
              - !Ref 'AWS::AccountId'
              - ':user/default/'
              - !Ref QuickSightUser
          Actions:
            - 'quicksight:DescribeTemplate'
      VersionDescription: Initial version - Copied over from AWS account.

Create a QuickSight analysis

In this last step, we add the AWS::QuickSight::Analysis section in the CloudFormation template. The analysis is linked to the template created in the target account.

Add the AWS::Quicksight::Analysis resource in the CloudFormation template as shown in the following code:

    Type: 'AWS::QuickSight::Analysis'
    Properties:
      AnalysisId: 'Sales-Analysis'
      Name: 'Sales-Analysis'
      AwsAccountId: !Sub '${AWS::AccountId}'
      SourceEntity:
        SourceTemplate:
          Arn: !GetAtt QSTCFBuildQSTemplate.Arn
          DataSetReferences:
            - DataSetPlaceholder: 'sales'
              DataSetArn: !GetAtt QSRSBuildQSDataSet.Arn
      Permissions:
        - Principal: !Join
            - ''
            - - 'arn:aws:quicksight:'
              - !Ref QuickSightIdentityRegion
              - ':'
              - !Ref 'AWS::AccountId'
              - ':user/default/'
              - !Ref QuickSightUser
          Actions:
            - 'quicksight:RestoreAnalysis'
            - 'quicksight:UpdateAnalysisPermissions'
            - 'quicksight:DeleteAnalysis'
            - 'quicksight:DescribeAnalysisPermissions'
            - 'quicksight:QueryAnalysis'
            - 'quicksight:DescribeAnalysis'
            - 'quicksight:UpdateAnalysis'

Deploy the CloudFormation template in the production account

To create a new CloudFormation stack that uses the preceding template via the AWS CloudFormation console, complete the following steps:

  1. On the AWS CloudFormation console, choose Create Stack.
  2. On the drop-down menu, choose With new resources (standard).
  3. For Prepare template, select Template is ready.
  4. For Specify template, choose Upload a template file.
  5. Save the provided CloudFormation template in a .yaml file and upload it.
  6. Choose Next.
  7. Enter a name for the stack. For this post, we use QS-RS-CF-Stack.
  8. Choose Next.
  9. Choose Next again.
  10. Choose Create Stack.

The status of the stack changes to CREATE_IN_PROGRESS, then to CREATE_COMPLETE.

Verify the QuickSight objects in the following table have been created in the production environment.

QuickSight Object Type Object Name (Dev) Object Name (Prod)
Data Source RS-Sales-DW RS-Sales-DW
Dataset Sales Sales
Template QS-RS-Sales-Temp QS-RS-SalesAnalysis-Temp
Analysis Sales Analysis Sales-Analysis

The following example shows that Sales Analysis was created in the target account.


This post demonstrated an approach to migrate a QuickSight analysis with an Amazon Redshift data source from one QuickSight account to another with a CloudFormation template.

For more information about automating dashboard deployment, customizing access to the QuickSight console, configuring for team collaboration, and implementing multi-tenancy and client user segregation, check out the videos Virtual Admin Workshop: Working with Amazon QuickSight APIs and Admin Level-Up Virtual Workshop, V2 on YouTube.

About the author

Sandeep Bajwa is a Sr. Analytics Specialist based out of Northern Virginia, specialized in the design and implementation of analytics and data lake solutions.

How OLX Group migrated to Amazon Redshift RA3 for simpler, faster, and more cost-effective analytics

Post Syndicated from Miguel Chin original https://aws.amazon.com/blogs/big-data/how-olx-group-migrated-to-amazon-redshift-ra3-for-simpler-faster-and-more-cost-effective-analytics/

This is a guest post by Miguel Chin, Data Engineering Manager at OLX Group and David Greenshtein, Specialist Solutions Architect for Analytics, AWS.

OLX Group is one of the world’s fastest-growing networks of online marketplaces, operating in over 30 countries around the world. We help people buy and sell cars, find housing, get jobs, buy and sell household goods, and much more.

We live in a data-producing world, and as companies want to become data driven, there is the need to analyze more and more data. These analyses are often done using data warehouses. However, a common data warehouse issue with ever-growing volumes of data is storage limitations and the degrading performance that comes with it. This scenario is very familiar to us in OLX Group. Our data warehouse is built using Amazon Redshift and is used by multiple internal teams to power their products and data-driven business decisions. As such, it’s crucial to maintain a cluster with high availability and performance while also being storage cost-efficient.

In this post, we share how we modernized our Amazon Redshift data warehouse by migrating to RA3 nodes and how it enabled us to achieve our business expectations. Hopefully you can learn from our experience in case you are considering doing the same.

Status quo before migration

Here at OLX Group, Amazon Redshift has been our choice for data warehouse for over 5 years. We started with a small Amazon Redshift cluster of 7 DC2.8xlarge nodes, and as its popularity and adoption increased inside the OLX Group data community, this cluster naturally grew.

Before migrating to RA3, we were using a 16 DC2.8xlarge nodes cluster with a highly tuned workload management (WLM), and performance wasn’t an issue at all. However, we kept facing challenges with storage demand due to having more users, more data sources, and more prepared data. Almost every day we would get an alert that our disk space was close to 100%, which was about 40 TB worth of data.

Our usual method to solve storage problems used to be to simply increase the number of nodes. Overall, we reached a cluster size of 18 nodes. However, this solution wasn’t cost-efficient enough because we were adding compute capacity to the cluster even though computation power was underutilized. We saw this as a temporary solution, and we mainly did it to buy some time to explore other cost-effective alternatives, such as RA3 nodes.

Amazon Redshift RA3 nodes along with Redshift Managed Storage (RMS) provided separation of storage and compute, enabling us to scale storage and compute separately to better meet our business requirements.

Our data warehouse had the following configuration before the migration:

  • 18 x DC2.8xlarge nodes
  • 250 monthly active users, consistently increasing
  • 10,000 queries per hour, 30 queries in parallel
  • 40 TB of data, consistently increasing
  • 100% disk space utilization

This cluster’s performance was generally good: ETL (extract, transform, and load) and interactive queries barely had any queue time, and 80% of them finished in under 5 minutes.

Evaluating the performance of Amazon Redshift clusters with RA3 nodes

In this section, we discuss how we conducted a performance evaluation of RA3 nodes with an Amazon Redshift cluster.

Test environment

In order to be confident with the performance of the RA3 nodes, we decided to stress test them in a controlled environment before making the decision to migrate. To assess the nodes and find an optimal RA3 cluster configuration, we collaborated with AllCloud, the AWS premier consulting partner. The following figures illustrate the approach we took to evaluate the performance of RA3.

Test setup

This strategy aims to replicate a realistic workload in different RA3 cluster configurations and compare them with our DC2 configuration. To do this, we required the following:

  • A reference cluster snapshot – This ensures that we can replay any tests starting from the same state.
  • A set of queries from the production cluster – This set can be reconstructed from the Amazon Redshift logs (STL_QUERYTEXT) and enriched by metadata (STL_QUERY). It should be noted that we only took into consideration SELECT and FETCH query types (to simplify this first stage of performance tests). The following chart shows what the profile of our test set looked like.

  • A replay tool to orchestrate all the query operations – AllCloud developed a Python application for us for this purpose.

For more details about the approach we used, including how to use the Amazon Redshift Simple Replay utility, refer to Compare different node types for your workload using Amazon Redshift.

Next, we picked which cluster configurations we wanted to test, which RA3 type, and how many nodes. For the specifications of each node type, refer to Amazon Redshift pricing.

First, we decided to test the same DC2 cluster we had in production as a way to validate our test environment, followed by RA3 clusters using RA3.4xlarge nodes with various numbers of nodes. We used RA3.4xlarge because it gives us more flexibility to fine-tune how many nodes we need compared to the RA3.16xlarge instance (1 x RA3.16xlarge node is equivalent to 4 x RA3.4xlarge nodes in terms of CPU and memory). With this in mind, we tested the following cluster configurations and used the replay tool to take measurements of the performance of each cluster.

                 18 x DC2   18 x RA3 (before classic resize)   18 x RA3   6 x RA3
Queries Number   1560       1560                               1560       1560
Timeouts         -          25                                 66         127
Duration/s
  Mean           1.214      1.037                              1.167      1.921
  Std.           2.268      2.026                              2.525      3.488
  Min.           0.003      0.000                              0.002      0.002
  Q 25%          0.005      0.004                              0.004      0.004
  Q 50%          0.344      0.163                              0.118      0.183
  Q 75%          1.040      0.746                              1.076      2.566
  Max.           25.411     15.492                             19.770     19.132

These results show how the DC2 cluster compares with the RA3 configurations. The faster half of queries (50% quantile) ran faster on RA3 than on DC2. Regarding the number of RA3 nodes, six nodes were clearly slower, which is particularly noticeable at the 75% quantile of query durations.

We used the following steps to deploy different clusters:

  1. Use 18 x DC2.8xlarge, restored from the original snapshot (18 x DC2.8xlarge).
  2. Take measurements 18 x DC2.
  3. Use 18 x RA3.4xlarge, restored from the original snapshot (18 x DC2.8xlarge).
  4. Take measurements 18 x RA3 (before classic resize).
  5. Use 6 x RA3.4xlarge, classic resize from 18 x RA3.4xlarge.
  6. Take snapshot from 6 x RA3.4xlarge.
  7. Take measurements 6 x RA3.
  8. Use 6 x RA3.4xlarge, restored from 6 x RA3.4xlarge snapshot.
  9. Use 18 x RA3.4xlarge, elastic resize from 6 x RA3.4xlarge.
  10. Take measurements 18x RA3.

Although these are promising results, there were some limitations in the test environment setup. We were concerned that we weren’t stressing the clusters enough, queries were only running in sequence using a single client, and the fact that we were using only SELECT and FETCH query types moved us away from a realistic workload. Therefore, we proceeded to the second stage of our tests.

Concurrency stress test

To stress the clusters, we changed our replay tool to run multiple queries in parallel. Queries extracted from the log files were queued with the same frequency as they were originally run in the reference cluster. Up to 50 clients take queries from the queue and send them to Amazon Redshift. The timing of all queries is recorded for comparison with the reference cluster.

The cluster performance is evaluated by measuring the temporal course of the query concurrency. If a cluster is equally performant as the reference cluster, the concurrency will closely follow the concurrency of the reference cluster. Queries pushed to the query queue are immediately picked up by a client and sent to the cluster. If the cluster isn’t capable of handling the queries as fast as the reference cluster, the number of running concurrent queries will increase when compared to the reference cluster. We also decided to keep concurrency scaling disabled during this test because we wanted to focus on node types instead of cluster features.
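The concurrency measurement above can be sketched as a sweep-line count over the replayed query intervals. This is an illustrative reconstruction, not AllCloud's actual replay tool:

```python
# Sketch of the concurrency measurement: count how many queries are
# running at any moment, reconstructed from per-query (start, end) times.
def concurrency_series(intervals):
    """intervals: list of (start, end) times for each replayed query.
    Returns [(time, running_count), ...] at every start/end boundary."""
    events = []
    for start, end in intervals:
        events.append((start, 1))    # a query begins
        events.append((end, -1))     # a query finishes
    events.sort()                    # at equal times, ends sort before starts
    running, series = 0, []
    for t, delta in events:
        running += delta
        series.append((t, running))
    return series

def peak_concurrency(intervals):
    """Maximum number of queries running at the same time."""
    return max(count for _, count in concurrency_series(intervals))
```

Comparing this series against the reference cluster's series is exactly the test described above: a cluster that can't keep up shows a growing running count relative to the reference.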

The following table shows the concurrent queries running on a DC2 and RA3 (both 18 nodes) with two different query test sets (3:00 AM and 1:00 PM). These were selected so we could test both our day and overnight workloads. 3:00 AM is when we have a peak of automated ETL jobs running, and 1:00 PM is when we have high user activity.

The median of running concurrent queries on the RA3 cluster is much higher than the DC2 one. This led us to conclude that a cluster of 18 RA3.4xlarge might not be enough to handle this workload reliably.

Concurrency 18 x DC2.8xlarge 18 x RA3.4xlarge
Starting 3:00 AM 1:00 PM 3:00 AM 1:00 PM
Mean 5 7 10 5
STD 11 13 7 4
25% 1 1 5 2
50% 2 2 8 4
75% 4 4 13 7
Max 50 50 50 27


Initially, we chose the RA3.4xlarge node type for more granular control in fine-tuning the number of nodes. However, we overlooked one important detail: the same instance type is used for worker and leader nodes. A leader node needs to manage all the parallel processing happening in the cluster, and a single RA3.4xlarge wasn’t enough to do so.

With this in mind, we tested two more cluster configurations: 6 x RA3.16xlarge and 8 x RA3.16xlarge, and once again measured concurrency. This time the results were much better; RA3.16xlarge was able to keep up with the reference concurrency, and the sweet spot seemed to be between 6–8 nodes.

Concurrency 18 x DC2.8xlarge 18 x RA3.4xlarge 6 x RA3.16xlarge 8 x RA3.16xlarge
Starting 3:00 AM 1:00 PM 3:00 AM 1:00 PM 3:00 AM 3:00 AM
Mean 5 7 10 5 3 1
STD 11 13 7 4 4 1
25% 1 1 5 2 2 0
50% 2 2 8 4 3 1
75% 4 4 13 7 4 2
Max 50 50 50 27 38 9

Things were looking better and our target configuration was now a 7 x RA3.16xlarge cluster. We were now confident enough to proceed with the migration.

The migration

Regardless of how excited we were to proceed, we still wanted to do a calculated migration. It’s best practice to have a playbook for migrations—a step-by-step guide on what needs to be done and also a contingency plan that includes a rollback plan. For simplicity reasons, we list here only the relevant steps in case you are looking for inspiration.

Migration plan

The migration plan included the following key steps:

  1. Remove the DNS from the current cluster, in our case in Amazon Route 53. No users should be able to query after this.
  2. Check whether any sessions are still running queries, and decide to wait or to terminate them. Any session still active at this point strongly indicates that the user is connecting via the direct cluster URL.
    1. To check running sessions, use SELECT * FROM STV_SESSIONS;.
    2. To terminate a session, use SELECT PG_TERMINATE_BACKEND(xxxxx);.
  3. Create a snapshot of the DC2 cluster.
  4. Pause the DC2 cluster.
  5. Create an RA3 cluster from the snapshot with the following configuration:
    1. Node type – RA3.16xlarge
    2. Number of nodes – 7
    3. Database name – Same as the DC2
    4. Associated IAM roles – Same as the DC2
    5. VPC – Same as the DC2
    6. VPC security groups – Same as the DC2
    7. Parameter groups – Same as the DC2
  6. Wait for SELECT COUNT(1) FROM STV_UNDERREPPED_BLOCKS to return 0. This is related to the hydration process of the cluster.
  7. Point the DNS to the RA3 cluster.
  8. Users can now query the cluster again.
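Step 6 of the plan lends itself to a small polling loop. The sketch below is an assumption about how you might automate it; it takes a callable that executes the count query through your SQL driver of choice:

```python
import time

# Sketch of step 6: poll the new RA3 cluster until RMS hydration settles.
# `count_underrepped` is any callable that runs
# SELECT COUNT(1) FROM STV_UNDERREPPED_BLOCKS via your SQL driver.
def wait_for_hydration(count_underrepped, poll_seconds=60, timeout_seconds=4 * 3600):
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if count_underrepped() == 0:
            return True               # hydration finished; safe to switch DNS
        time.sleep(poll_seconds)
    return False                      # timed out; investigate before cutover
```

Only after this returns True would the DNS be pointed at the RA3 cluster (step 7).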

Contingency plan

In case the performance of hourly and daily ETL is not acceptable, the contingency plan is triggered:

  1. Add one more node to deal with the unexpected workload.
  2. Increase the limit of concurrency scaling hours.
  3. Reassess the parameter group.

Following this plan, we migrated from DC2 to RA3 nodes in roughly 3.5 hours, from stopping the old cluster to booting the new one and letting our processes fully synchronize. We then proceeded to monitor performance for a couple of hours. Storage capacity was looking great and everything was running smoothly, but we were curious to see how the overnight processes would perform.

The next morning, we woke up to what we dreaded: a slow cluster. We triggered our contingency plan and in the following few days we ended up implementing all three actions we had in the contingency plan.

Adding one extra node by itself didn’t provide much help; however, users did experience good performance during the hours when concurrency scaling was on. The concurrency scaling feature allows Amazon Redshift to temporarily increase cluster capacity whenever the workload requires it. We configured it to allow a maximum of 4 hours per day: 1 hour for free and 3 hours paid. We chose this particular value because, price-wise, it is equivalent to adding one more node (taking us to nine nodes), with the added advantage of only using and paying for it when the workload requires it.

The last action we took was related to the parameter group, in particular, the WLM. As initially stated, we had a manually fine-tuned WLM, but it proved to be inefficient for this new RA3 cluster. Therefore, we decided to try auto WLM with the following configuration.

Manual WLM (before introducing auto WLM):
  Queue 1: Data Team ETL queue (daily and hourly), admin, monitoring, data quality queries
  Queue 2: Users queue (for both their ETL and ad hoc queries)

Auto WLM:
  Queue 1 (priority highest): Daily Data Team ETL queue
  Queue 2 (priority high): Admin queries
  Queue 3 (priority normal): User queries and hourly Data Team ETL
  Queue 4 (priority low): Monitoring, data quality queries

Manual WLM requires you to manually allocate a percentage of resources and define a number of slots per queue. Although this gives you resource segregation, it also means resources are constantly allocated and can go to waste if they’re not used. Auto WLM dynamically sets these variables depending on each queue’s priority and workload. This means that a query in the highest priority queue will get all the resources allocated to it, while lower priority queues will need to wait for available resources. With this in mind, we split our ETL depending on its priority: daily ETL to highest, hourly ETL to normal (to give a fair chance for user queries to compete for resources), and monitoring and data quality to low.
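For reference, a queue layout along these lines might be expressed roughly as follows in the parameter group's wlm_json_configuration. The query group and user group names here are hypothetical, and the exact schema should be checked against the Amazon Redshift WLM documentation:

```json
[
  {"auto_wlm": true, "queue_type": "auto", "priority": "highest",
   "query_group": ["daily_etl"]},
  {"queue_type": "auto", "priority": "high", "query_group": ["admin"]},
  {"queue_type": "auto", "priority": "normal", "query_group": ["hourly_etl"],
   "user_group": ["analysts"]},
  {"queue_type": "auto", "priority": "low",
   "query_group": ["monitoring", "data_quality"]}
]
```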

After applying concurrency scaling and auto WLM, we achieved stable performance for a whole week, and considered the migration a success.

Status quo after migration

Almost a year has passed since we migrated to RA3 nodes, and we couldn’t be more satisfied. Thanks to Redshift Managed Storage (RMS), our disk space issues are a thing of the past, and performance has been generally great compared to our previous DC2 cluster. We are now at 300 monthly active users. Cluster costs did increase due to the new node type and concurrency scaling, but we now feel prepared for the future and don’t expect any cluster resizing anytime soon.

Looking back, we wanted to have a carefully planned and prepared migration, and we were able to learn more about RA3 with our test environment. However, our experience also shows that test environments aren’t always bulletproof, and some details may be overlooked. In the end, these are our main takeaways from the migration to RA3 nodes:

  • Pick the right node type according to your workload. An RA3.16xlarge cluster provides more powerful leader and worker nodes.
  • Use concurrency scaling to provision more resources when the workload demands it. Adding a new node is not always the most cost-efficient solution.
  • Manual WLM requires a lot of adjustments; using auto WLM allows for a better and fairer distribution of cluster resources.


In this post, we covered how OLX Group modernized our Amazon Redshift data warehouse by migrating to RA3 nodes. We detailed how we tested before migration, the migration itself, and the outcome. We are now starting to explore the possibilities provided by the RA3 nodes. In particular, the data sharing capabilities together with Redshift Serverless open the door for exciting architecture setups that we are looking forward to.

If you are going through the same storage issues we used to face with your Amazon Redshift cluster, we highly recommend migrating to RA3 nodes. Its RMS feature decouples the scalability of compute and storage power, providing a more cost-efficient solution.

Thanks for reading this post and hopefully you found it useful. If you’re going through the same scenario and have any questions, feel free to reach out.

About the author

Miguel Chin is a Data Engineering Manager at OLX Group, one of the world’s fastest-growing networks of trading platforms. He is responsible for managing a domain-oriented team of data engineers that helps shape the company’s data ecosystem by evangelizing cutting-edge data concepts like data mesh.

David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling business to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.

Synchronize your Salesforce and Snowflake data to speed up your time to insight with Amazon AppFlow

Post Syndicated from Ramesh Ranganathan original https://aws.amazon.com/blogs/big-data/synchronize-your-salesforce-and-snowflake-data-to-speed-up-your-time-to-insight-with-amazon-appflow/

This post was co-written with Amit Shah, Principal Consultant at Atos.

Customers across industries seek meaningful insights from the data captured in their Customer Relationship Management (CRM) systems. To achieve this, they combine their CRM data with a wealth of information already available in their data warehouse, enterprise systems, or other software as a service (SaaS) applications. One widely used approach is getting the CRM data into your data warehouse and keeping it up to date through frequent data synchronization.

Integrating third-party SaaS applications is often complicated and requires significant effort and development. Developers need to understand the application APIs, write implementation and test code, and maintain the code for future API changes. Amazon AppFlow, which is a low-code/no-code AWS service, addresses this challenge.

Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications, like Salesforce, SAP, Zendesk, Slack, and ServiceNow, and AWS services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift in just a few clicks. With Amazon AppFlow, you can run data flows at enterprise scale at the frequency you choose—on a schedule, in response to a business event, or on demand.

In this post, we focus on synchronizing your data from Salesforce to Snowflake (on AWS) without writing code. This post walks you through the steps to set up a data flow to address full and incremental data load using an example use case.

Solution overview

Our use case involves the synchronization of the Account object from Salesforce into Snowflake. In this architecture, you use Amazon AppFlow to filter and transfer the data to your Snowflake data warehouse.

You can configure Amazon AppFlow to run your data ingestion in three different ways:

  • On-demand – You can manually run the flow through the AWS Management Console, API, or SDK call.
  • Event-driven – Amazon AppFlow can subscribe and listen to change data capture (CDC) events from the source SaaS application.
  • Scheduled – Amazon AppFlow can run schedule-triggered flows based on a pre-defined schedule rule. With scheduled flows, you can choose either full or incremental data transfer:
    • With full transfer, Amazon AppFlow transfers a snapshot of all records at the time of the flow run from the source to the destination.
    • With incremental transfer, Amazon AppFlow transfers only the records that have been added or changed since the last successful flow run. To determine the incremental delta of your data, Amazon AppFlow requires you to specify a source timestamp field that it uses to identify new or updated records.

We use the on-demand trigger for the initial load of data from Salesforce to Snowflake, because it helps you pull all the records, irrespective of when they were created. To then synchronize data periodically with Snowflake, after we run the on-demand trigger, we configure a scheduled trigger with incremental transfer. With this approach, Amazon AppFlow pulls records based on a chosen timestamp field from the Salesforce Account object at the time interval specified in the flow.
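Conceptually, the incremental selection works like the following Python sketch (the record layout and field name are assumptions for illustration; Amazon AppFlow performs this filtering for you):

```python
from datetime import datetime, timezone

def incremental_delta(records, last_successful_run):
    """Keep only the records added or changed since the last successful
    flow run, using a source timestamp field as the marker."""
    return [r for r in records if r["LastModifiedDate"] > last_successful_run]

records = [
    {"Id": "001", "LastModifiedDate": datetime(2023, 1, 15, tzinfo=timezone.utc)},
    {"Id": "002", "LastModifiedDate": datetime(2023, 3, 10, tzinfo=timezone.utc)},
]
last_run = datetime(2023, 2, 1, tzinfo=timezone.utc)
delta = incremental_delta(records, last_run)  # only the record changed after last_run
```

This is why choosing a reliable timestamp field matters: any record whose timestamp is not updated on change would be silently skipped.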

The Account_Staging table is created in Snowflake to act as a temporary storage that can be used to identify the data change events. Then the permanent table (Account) is updated from the staging table by running a SQL stored procedure that contains the incremental update logic. The following figure depicts the various components of the architecture and the data flow from the source to the target.

The data flow contains the following steps:

  1. First, the flow is run with on-demand and full transfer mode to load the full data into Snowflake.
  2. The Amazon AppFlow Salesforce connector pulls the data from Salesforce and stores it in the Account Data S3 bucket in CSV format.
  3. The Amazon AppFlow Snowflake connector loads the data into the Account_Staging table.
  4. A scheduled task, running at regular intervals in Snowflake, triggers a stored procedure.
  5. The stored procedure starts an atomic transaction that loads the data into the Account table and then deletes the data from the Account_Staging table.
  6. After the initial data is loaded, you update the flow to capture incremental updates from Salesforce. The flow trigger configuration is changed to scheduled, to capture data changes in Salesforce. This enables Snowflake to get all updates, deletes, and inserts in Salesforce at configured intervals.
  7. The flow uses the configured LastModifiedDate field to determine incremental changes.
  8. Steps 3, 4, and 5 are run again to load the incremental updates into the Snowflake Accounts table.


To get started, you need the following prerequisites:

  • A Salesforce user account with sufficient privileges to install connected apps. Amazon AppFlow uses a connected app to communicate with Salesforce APIs. If you don’t have a Salesforce account, you can sign up for a developer account.
  • A Snowflake account with sufficient permissions to create and configure the integration, external stage, table, stored procedures, and tasks.
  • An AWS account with access to AWS Identity and Access Management (IAM), Amazon AppFlow, and Amazon S3.

Set up Snowflake configuration and Amazon S3 data

Complete the following steps to configure Snowflake and set up your data in Amazon S3:

  1. Create two S3 buckets in your AWS account: one for holding the data coming from Salesforce, and another for holding error records.

A best practice when creating your S3 buckets is to block public access so that your data is not accessible to unauthorized users.

  2. Create an IAM policy named snowflake-access that allows listing the bucket contents and reading S3 objects inside the bucket.

Follow the instructions for steps 1 and 2 in Configuring a Snowflake Storage Integration to Access Amazon S3 to create an IAM policy and role. Replace the placeholders with your S3 bucket names.

  3. Log in to your Snowflake account and create a new warehouse called SALESFORCE and a database called SALESTEST.
  4. Specify the format in which data will be available in Amazon S3 for Snowflake to load (for this post, CSV):
CREATE or REPLACE file format my_csv_format
type = csv
field_delimiter = ','
skip_header = 1
null_if = ('NULL', 'null')
empty_field_as_null = true
compression = gzip;
  5. Amazon AppFlow uses the Snowflake COPY command to move data using an S3 bucket. To configure this integration, follow steps 3–6 in Configuring a Snowflake Storage Integration to Access Amazon S3.

These steps create a storage integration with your S3 bucket, update IAM roles with Snowflake account and user details, and create an external stage.

This completes the storage integration setup. In the next section, you create the required objects in Snowflake.

Create schemas and procedures in Snowflake

In your Snowflake account, complete the following steps to create the tables, stored procedures, and tasks for implementing the use case:

  1. In your Snowflake account, open a worksheet and run the following DDL scripts to create the Account and Account_staging tables:
primary key (ACCOUNT_NUMBER)

primary key (ACCOUNT_NUMBER)
  2. Create a stored procedure in Snowflake to load data from staging to the Account table:
CREATE or REPLACE procedure sp_account_load( )
returns varchar not null
language sql
Begin transaction;
when matched AND ACCOUNT_STAGING.DELETED=TRUE then delete
when matched then UPDATE SET
when NOT matched then
) ;


This stored procedure determines whether the data contains new records that need to be inserted or existing records that need to be updated or deleted. After a successful run, the stored procedure clears any data from your staging table.
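As an illustration of this staging-to-permanent pattern (not the Snowflake procedure itself), the same delete/update/insert-then-truncate logic can be sketched with Python's built-in sqlite3; the table and column names here are assumptions for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_number TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE account_staging (account_number TEXT PRIMARY KEY, name TEXT, deleted INTEGER);
    INSERT INTO account VALUES ('A1', 'Old Name'), ('A2', 'To Be Deleted');
    INSERT INTO account_staging VALUES ('A1', 'New Name', 0), ('A2', 'x', 1), ('A3', 'Brand New', 0);
""")
with conn:  # atomic transaction, like the Begin transaction in the procedure
    # Matched rows flagged as deleted in the source are removed
    conn.execute("""DELETE FROM account WHERE account_number IN
                    (SELECT account_number FROM account_staging WHERE deleted = 1)""")
    # Matched rows are updated with the latest staged values
    conn.execute("""UPDATE account SET name = (SELECT s.name FROM account_staging s
                    WHERE s.account_number = account.account_number AND s.deleted = 0)
                    WHERE account_number IN (SELECT account_number FROM account_staging WHERE deleted = 0)""")
    # New rows are inserted
    conn.execute("""INSERT INTO account SELECT account_number, name FROM account_staging s
                    WHERE deleted = 0 AND account_number NOT IN (SELECT account_number FROM account)""")
    # Staging is cleared after a successful run
    conn.execute("DELETE FROM account_staging")

rows = sorted(conn.execute("SELECT account_number, name FROM account"))
# rows -> [('A1', 'New Name'), ('A3', 'Brand New')]
```

The single transaction is the important design choice: either all three changes and the staging truncation apply, or none do.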

  3. Create a task in Snowflake to trigger the stored procedure. Make sure that the time interval for this task is more than the time interval configured in Amazon AppFlow for pulling the incremental changes from Salesforce. The time interval should be sufficient for data to be processed.
CREATE TASK TASK_ACCOUNT_LOAD WAREHOUSE = SALESFORCE
SCHEDULE = 'USING CRON 5 * * * * America/Los_Angeles'
AS call sp_account_load();
  4. Provide the required permissions to run the task and resume the task:
show tasks;
  • A newly created task is in a suspended state, so you need to resume it manually the first time (for example, ALTER TASK TASK_ACCOUNT_LOAD RESUME;).
  • If the role assigned to you doesn't have sufficient privileges to resume or run the task, grant the EXECUTE TASK privilege to that role.

This completes the Snowflake part of configuration and setup.

Create a Salesforce connection

First, let’s create a Salesforce connection that can be used by AppFlow to authenticate and pull records from your Salesforce instance. On the AWS console, make sure you are in the same Region where your Snowflake instance is running.

  1. On the Amazon AppFlow console, choose Connections in the navigation pane.
  2. From the list of connectors, select Salesforce.
  3. Choose Create connection.
  4. For Connection name, enter a name of your choice (for example, Salesforce-blog).
  5. Leave the rest of the fields as default and choose Continue.
  6. You’re redirected to a sign-in page, where you need to log in to your Salesforce instance.
  7. After you allow Amazon AppFlow access to your Salesforce account, your connection is successfully created.

 Create a Snowflake connection

Complete the following steps to create your Snowflake connection:

  1. On the Connections menu, choose Snowflake.
  2. Choose Create connection.
  3. Provide information for the Warehouse, Stage name, and Bucket details fields.
  4. Enter your credential details.

  5. For Region, choose the same Region where Snowflake is running.
  6. For Connection name, name your connection Snowflake-blog.
  7. Leave the rest of the fields as default and choose Connect.

Create a flow in Amazon AppFlow

Now you create a flow in Amazon AppFlow to load the data from Salesforce to Snowflake. Complete the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. Choose Create flow.
  3. On the Specify flow details page, enter a name for the flow (for example, AccountData-SalesforceToSnowflake).
  4. Optionally, provide a description for the flow and tags.
  5. Choose Next.

  6. On the Configure flow page, for Source name, choose Salesforce.
  7. Choose the Salesforce connection you created earlier (Salesforce-blog).
  8. For Choose Salesforce object, choose Account.
  9. For Destination name, choose Snowflake.
  10. Choose the newly created Snowflake connection.
  11. For Choose Snowflake object, choose the staging table you created earlier (SALESTEST.PUBLIC.ACCOUNT_STAGING).

  12. In the Error handling section, provide your error S3 bucket.
  13. For Choose how to trigger the flow, select Run on demand.
  14. Choose Next.

  15. Select Manually map fields to map the fields between your source and destination.
  16. Choose the fields Account Number, Account Name, Account Type, Annual Revenue, Active, Deleted, and Last Modified Date.

  17. Map each source field to its corresponding destination field.
  18. Under Additional settings, leave Import deleted records unchecked (the default setting).

  19. In the Validations section, add validations for the data you're pulling from Salesforce.

Because the schema for the Account_Staging table in the Snowflake database has a NOT NULL constraint on the Account_Number and Active fields, records containing a null value for these fields should be ignored.

  20. Choose Add Validation to configure validations for these fields.
  21. Choose Next.

  22. Leave everything else as default, proceed to the final page, and choose Create Flow.
  23. After the flow is created, choose Run flow.

When the flow run completes successfully, it will bring all records into your Snowflake staging table.

Verify data in Snowflake

The data will be loaded into the Account_staging table. To verify that data is loaded in Snowflake, complete the following steps:

  1. Validate the number of records by querying the ACCOUNT_STAGING table in Snowflake.
  2. Wait for your Snowflake task to run based on the configured schedule.
  3. Verify that all the data is transferred to the ACCOUNT table and the ACCOUNT_STAGING table is truncated.

Configure an incremental data load from Salesforce

Now let’s configure an incremental data load from Salesforce:

  1. On the Amazon AppFlow console, select your flow, and choose Edit.
  2. Go to the Edit configuration step and change to Run flow on schedule.
  3. Set the flow to run every 5 minutes, and provide a start date of Today, with a start time in the future.
  4. Choose Incremental transfer and choose the LastModifiedDate field.
  5. Choose Next.
  6. In the Additional settings section, select Import deleted records.

This ensures that deleted records from the source are also ingested.

  7. Choose Save and then choose Activate flow.

Now your flow is configured to capture all incremental changes.

Test the solution

Log in to your Salesforce account, and edit any record in the Account object.

Within 5 minutes or less, a scheduled flow will pick up your change and write the changed record into your Snowflake staging table and trigger the synchronization process.

You can see the details of the run, including number of records transferred, on the Run History tab of your flow.

Clean up

Clean up the resources in your AWS account by completing the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. From the list of flows, select the flow AccountData-SalesforceToSnowflake and delete it.
  3. Enter delete to delete the flow.
  4. Choose Connections in the navigation pane.
  5. Choose Salesforce from the list of connectors, select Salesforce-blog, and delete it.
  6. Enter delete to delete the connector.
  7. On the Connections page, choose Snowflake from the list of connectors, select Snowflake-blog, and delete it.
  8. Enter delete to delete the connector.
  9. On the IAM console, choose Roles in the navigation pane, then select the role you created for Snowflake and delete it.
  10. Choose Policies in the navigation pane, select the policy you created for Snowflake, and delete it.
  11. On the Amazon S3 console, search for the data bucket you created, choose Empty to delete the objects, then delete the bucket.
  12. Search for the error bucket you created, choose Empty to delete the objects, then delete the bucket.
  13. Clean up resources in your Snowflake account:
  • Delete the task TASK_ACCOUNT_LOAD:
DROP TASK TASK_ACCOUNT_LOAD;
  • Delete the stored procedure sp_account_load:
DROP procedure sp_account_load();
  • Delete the tables ACCOUNT_STAGING and ACCOUNT:
DROP TABLE ACCOUNT_STAGING;
DROP TABLE ACCOUNT;


In this post, we walked you through how to integrate and synchronize your data from Salesforce to Snowflake using Amazon AppFlow. This demonstrates how you can set up your ETL jobs without having to learn new programming languages by using Amazon AppFlow and your familiar SQL language. This is a proof of concept, but you can try to handle edge cases like failure of Snowflake tasks or understand how incremental transfer works by making multiple changes to a Salesforce record within the scheduled time interval.

For more information on Amazon AppFlow, visit Amazon AppFlow.

About the authors

Ramesh Ranganathan is a Senior Partner Solution Architect at AWS. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, application modernization and cloud native development. He is passionate about technology and enjoys experimenting with AWS Serverless services.

Kamen Sharlandjiev is an Analytics Specialist Solutions Architect and Amazon AppFlow expert. He’s on a mission to make life easier for customers who are facing complex data integration challenges. His secret weapon? Fully managed, low-code AWS services that can get the job done with minimal effort and no coding.

Amit Shah is an expert in cloud-based modern data architecture and currently leads the AWS Data Analytics practice at Atos. Based in Pune, India, he has 20+ years of experience in data strategy, architecture, design, and development. He is on a mission to help organizations become data-driven.

Use fuzzy string matching to approximate duplicate records in Amazon Redshift

Post Syndicated from Sean Beath original https://aws.amazon.com/blogs/big-data/use-fuzzy-string-matching-to-approximate-duplicate-records-in-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to run complex SQL analytics at scale and performance on terabytes to petabytes of structured and unstructured data, and make the insights widely available through popular business intelligence (BI) and analytics tools.

It’s common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source will have its own processes of creating and maintaining data, which can lead to data quality challenges within and across sources.

One challenge you may face when performing analytics is the presence of imperfect duplicate records within the source data. Answering questions as simple as “How many unique customers do we have?” can be very challenging when the data you have available is like the following table.

Name Address Date of Birth
Cody Johnson 8 Jeffery Brace, St. Lisatown 1/3/1956
Cody Jonson 8 Jeffery Brace, St. Lisatown 1/3/1956

Although humans can identify that Cody Johnson and Cody Jonson are most likely the same person, it can be difficult to distinguish this using analytics tools. This identification of duplicate records also becomes nearly impossible when working on large datasets across multiple sources.

This post presents one possible approach to addressing this challenge in an Amazon Redshift data warehouse. We import an open-source fuzzy matching Python library to Amazon Redshift, create a simple fuzzy matching user-defined function (UDF), and then create a procedure that weights multiple columns in a table to find matches based on user input. This approach allows you to use the created procedure to approximately identify your unique customers, improving the accuracy of your analytics.

This approach doesn’t solve for data quality issues in source systems, and doesn’t remove the need to have a holistic data quality strategy. For addressing data quality challenges in Amazon Simple Storage Service (Amazon S3) data lakes and data pipelines, AWS has announced AWS Glue Data Quality (preview). You can also use AWS Glue DataBrew, a visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics.


To complete the steps in this post, you need the following:

The following AWS CloudFormation stack will deploy a new Redshift Serverless endpoint and an S3 bucket for use in this post.


All SQL commands shown in this post are available in the following notebook, which can be imported into the Amazon Redshift Query Editor V2.

Overview of the dataset being used

The dataset we use mimics a source that holds customer information. This source has a manual process of inserting and updating customer data, which has led to multiple instances of non-unique customers being represented with duplicate records.

The following examples show some of the data quality issues in the dataset being used.

In this first example, all three customers are the same person but have slight differences in the spelling of their names.

id name age address_line1 city postcode state
1 Cody Johnson 80 8 Jeffrey Brace St. Lisatown 2636 South Australia
101 Cody Jonson 80 8 Jeffrey Brace St. Lisatown 2636 South Australia
121 Kody Johnson 80 8 Jeffrey Brace St. Lisatown 2636 South Australia

In this next example, the two customers are the same person with slightly different addresses.

id name age address_line1 city postcode state
7 Angela Watson 59 3/752 Bernard Follow Janiceberg 2995 Australian Capital Territory
107 Angela Watson 59 752 Bernard Follow Janiceberg 2995 Australian Capital Territory

In this example, the two customers are different people with the same address. This simulates multiple different customers living at the same address who should still be recognized as different people.

id name age address_line1 city postcode state
6 Michael Hunt 69 8 Santana Rest St. Jessicamouth 2964 Queensland
106 Sarah Hunt 69 8 Santana Rest St. Jessicamouth 2964 Queensland

Load the dataset

First, create a new table in your Redshift Serverless endpoint and copy the test data into it by doing the following:

  1. Open the Query Editor V2 and log in using the admin user name and details defined when the endpoint was created.
  2. Run the following CREATE TABLE statement:
    create table customer (
        id smallint, 
        urid smallint,
        name varchar(100),
        age smallint,
        address_line1 varchar(200),
        city varchar(100),
        postcode smallint,
        state varchar(100)
    );

    Screenshot of CREATE TABLE statement for customer table being run successfully in Query Editor V2

  3. Run the following COPY command to copy data into the newly created table:
    copy customer (id, name, age, address_line1, city, postcode, state)
    from 's3://redshift-blogs/fuzzy-string-matching/customer_data.csv'
    IAM_ROLE default
    FORMAT csv
    REGION 'us-east-1';

  4. Confirm the COPY succeeded and there are 110 records in the table by running the following query:
    select count(*) from customer;

    Screenshot showing the count of records in the customer table is 110. Query is run in Query Editor V2

Fuzzy matching

Fuzzy string matching, more formally known as approximate string matching, is the technique of finding strings that match a pattern approximately rather than exactly. Commonly (and in this solution), the Levenshtein distance is used to measure the distance between two strings, and therefore their similarity. The smaller the Levenshtein distance between two strings, the more similar they are.
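For intuition, the Levenshtein distance can be computed with a short dynamic-programming routine. This is a teaching sketch; the solution itself relies on a library rather than hand-rolled code:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

distance = levenshtein("Cody Johnson", "Cody Jonson")  # 1: a single deleted 'h'
```

A distance of 1 across a 12-character name is strong evidence of a near-duplicate, which is exactly the signal the rest of the solution exploits.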

In this solution, we exploit this property of the Levenshtein distance to estimate if two customers are the same person based on multiple attributes of the customer, and it can be expanded to suit many different use cases.

This solution uses TheFuzz, an open-source Python library that implements the Levenshtein distance in a few different ways. We use the partial_ratio function to compare two strings and produce a score between 0–100. If one string matches perfectly with a portion of the other, the partial_ratio function returns 100.
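TheFuzz's internals differ in detail, but the idea behind partial_ratio can be approximated by scoring the shorter string against every same-length window of the longer one. The following stdlib-based function is a simplified stand-in, not the library's actual implementation:

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Score the shorter string against every same-length window of the
    longer string and return the best similarity, scaled to 0-100."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + len(shorter)]).ratio()
        for i in range(len(longer) - len(shorter) + 1)
    )
    return round(100 * best)

partial_ratio_sketch("Jeffrey Brace", "8 Jeffrey Brace")  # 100: exact substring
```

This windowing behavior is why partial matching is useful for addresses, where one record may carry an extra prefix such as a unit number.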

Weighted fuzzy matching

By adding a scaling factor to each of our column fuzzy matches, we can create a weighted fuzzy match for a record. This is especially useful in two scenarios:

  • We have more confidence in some columns of our data than others, and therefore want to prioritize their similarity results.
  • One column is much longer than the others. A single character difference in a long string will have much less impact on the Levenshtein distance than a single character difference in a short string. Therefore, we want to prioritize long string matches over short string matches.

The solution in this post applies weighted fuzzy matching based on user input defined in another table.
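To make the weighting concrete, here is a small Python sketch; difflib.SequenceMatcher stands in for the fuzzy matching UDF, and the weights mirror the example mapping used in this post:

```python
from difflib import SequenceMatcher

# Weights mirroring the example mapping in this post
WEIGHTS = {"name": 0.3, "address_line1": 0.5, "city": 0.1, "postcode": 0.1}

def similarity(a: str, b: str) -> float:
    """Similarity score between 0-100; SequenceMatcher is a stdlib
    stand-in for the fuzzy matching UDF used in the solution."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def weighted_similarity(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Sum of per-column similarities, each scaled by its column weight."""
    return sum(
        w * similarity(str(rec_a[col]), str(rec_b[col]))
        for col, w in weights.items()
    )

a = {"name": "Cody Johnson", "address_line1": "8 Jeffrey Brace",
     "city": "St. Lisatown", "postcode": 2636}
b = {"name": "Cody Jonson", "address_line1": "8 Jeffrey Brace",
     "city": "St. Lisatown", "postcode": 2636}
score = weighted_similarity(a, b, WEIGHTS)  # above 90, so likely the same person
```

Because the weights sum to 1, the result stays on the same 0–100 scale as the individual column scores, which keeps the >90% match threshold easy to reason about.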

Create a table for weight information

This reference table holds two columns: the table name and the column mapping with weights. The column mapping is held in a SUPER datatype, which allows JSON semistructured data to be inserted and queried directly in Amazon Redshift. For examples of how to query semistructured data in Amazon Redshift, refer to Querying semistructured data.

In this example, we apply the largest weight to the column address_line1 (0.5) and the smallest weight to the city and postcode columns (0.1).

Using the Query Editor V2, create a new table in your Redshift Serverless endpoint and insert a record by doing the following:

  1. Run the following CREATE TABLE statement:
    CREATE TABLE ref_unique_record_weight_map(table_name varchar(100), column_mapping SUPER);

  2. Run the following INSERT statement:
    INSERT INTO ref_unique_record_weight_map VALUES (
        'customer',
        JSON_PARSE('{"colmap": [
            {"colname": "name", "colweight": 0.3},
            {"colname": "address_line1", "colweight": 0.5},
            {"colname": "city", "colweight": 0.1},
            {"colname": "postcode", "colweight": 0.1}
        ]}')
    );

  3. Confirm the mapping data has inserted into the table correctly by running the following query:
    select * from ref_unique_record_weight_map;

    Screenshot showing the result of querying the ref_unique_record_weight_map table in Query Editor V2

  4. To check all weights for the customer table add up to 1 (100%), run the following query:
    select  cm.table_name, 
            sum(colmap.colweight) as total_column_weight 
    from    ref_unique_record_weight_map cm, cm.column_mapping.colmap colmap 
    where   cm.table_name = 'customer'
    group by cm.table_name;

    Screenshot showing the total weight applied to the customer table is 1.0

User-defined functions

With Amazon Redshift, you can create custom scalar user-defined functions (UDFs) using a Python program. A Python UDF incorporates a Python program that runs when the function is called and returns a single value. In addition to using the standard Python functionality, you can import your own custom Python modules, such as the module described earlier (TheFuzz).

In this solution, we create a Python UDF to take two input values and compare their similarity.

Import external Python libraries to Amazon Redshift

Run the following code snippet to import the TheFuzz module into Amazon Redshift as a new library. This makes the library available within Python UDFs in the Redshift Serverless endpoint. Make sure to provide the name of the S3 bucket you created earlier.

CREATE OR REPLACE LIBRARY thefuzz LANGUAGE plpythonu
FROM 's3://<your-bucket>/thefuzz.zip'
IAM_ROLE default;

Create a Python user-defined function

Run the following code snippet to create a new Python UDF called unique_record. This UDF will do the following:

  1. Take two input values that can be of any data type as long as they are the same data type (such as two integers or two varchars).
  2. Import the newly created thefuzz Python library.
  3. Return an integer value comparing the partial ratio between the two input values.
    CREATE OR REPLACE FUNCTION unique_record (value_a ANYELEMENT, value_b ANYELEMENT)
    RETURNS int STABLE
    AS $$
    from thefuzz import fuzz

    return fuzz.partial_ratio(value_a, value_b)
    $$ LANGUAGE plpythonu;

You can test the function by running the following code snippet:

select unique_record('Cody Johnson'::varchar, 'Cody Jonson'::varchar);

The result shows that these two strings have a similarity value of 91%.

Screenshot showing that using the created function on the Cody Johnson/Cody Jonson name example provides a response of 91

Now that the Python UDF has been created, you can test the response of different input values.

Alternatively, you can follow the amazon-redshift-udfs GitHub repo to install the f_fuzzy_string_match Python UDF.

Stored procedures

Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and business-specific logic. By combining multiple SQL steps into a stored procedure, you can reduce round trips between your applications and the database.

In this solution, we create a stored procedure that applies weighting to multiple columns. Because this logic is common and repeatable regardless of the source table or data, it allows us to create the stored procedure once and use it for multiple purposes.

Create a stored procedure

Run the following code snippet to create a new Amazon Redshift stored procedure called find_unique_id. This procedure will do the following:

  1. Take one input value. This value is the table you would like to create a golden record for (in our case, the customer table).
  2. Declare a set of variables to be used throughout the procedure.
  3. Check to see if weight data is in the staging table created in previous steps.
  4. Build a query string for comparing each column and applying weights using the weight data inserted in previous steps.
  5. For each record in the input table that doesn’t have a unique record ID (URID) yet, it will do the following:
    1. Create a temporary table to stage results. This temporary table will have all potential URIDs from the input table.
    2. Allocate a similarity value to each URID. This value specifies how similar this URID is to the record in question, weighted with the inputs defined.
    3. Choose the closest matched URID, but only if there is a >90% match.
    4. If there is no URID match, create a new URID.
    5. Update the source table with the new URID and move to the next record.

This procedure will only ever look for new URIDs for records that don’t already have one allocated. Therefore, rerunning the URID procedure multiple times will have no impact on the results.
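The record-matching loop the procedure implements can be summarized in plain Python. This is a conceptual sketch with a pluggable similarity function (the dictionary shape and toy similarity below are assumptions for illustration), not a translation of the procedure:

```python
def assign_urids(records, similarity, threshold=90):
    """Give every record without a URID either the URID of its closest match
    (when similarity exceeds the threshold) or a newly minted URID.
    Records that already have a URID are never touched, so re-running
    the function leaves prior results intact."""
    next_urid = max((r["urid"] for r in records if r["urid"] is not None), default=0) + 1
    for rec in records:
        if rec["urid"] is not None:
            continue
        labelled = [r for r in records if r["urid"] is not None]
        best = max(labelled, key=lambda r: similarity(rec, r), default=None)
        if best is not None and similarity(rec, best) > threshold:
            rec["urid"] = best["urid"]  # close enough: reuse the existing URID
        else:
            rec["urid"] = next_urid     # genuinely new: mint a fresh URID
            next_urid += 1
    return records

# Toy similarity for the sketch: exact name match scores 100, anything else 0
records = assign_urids(
    [{"id": 1, "name": "Cody Johnson", "urid": None},
     {"id": 101, "name": "Cody Johnson", "urid": None},
     {"id": 6, "name": "Michael Hunt", "urid": None}],
    lambda a, b: 100 if a["name"] == b["name"] else 0,
)
```

The skip-if-labelled check at the top of the loop is what makes repeated runs idempotent, matching the behavior described above.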

CREATE OR REPLACE PROCEDURE find_unique_id(table_name varchar(100)) AS $$
    unique_record RECORD;
    column_info RECORD;

    column_fuzzy_comparison_string varchar(MAX) := '0.0';
    max_simularity_value decimal(5,2) := 0.0;

    table_check varchar(100);
    temp_column_name varchar(100);
    temp_column_weight decimal(5,2);
    unique_record_id smallint := 0;
        Check the ref_unique_record_weight_map table to see if there is a mapping record for the provided table.
        If there is no table, raise an exception
    SELECT INTO table_check cm.table_name from ref_unique_record_weight_map cm where cm.table_name = quote_ident(table_name);
        RAISE EXCEPTION 'Input table ''%'' not found in mapping object', table_name;
    END IF;

        Build query to be used to compare each column using the mapping record in the ref_unique_record_weight_map table.
        For each column specified in the mapping object, append a weighted comparison of the column
    FOR column_info IN (
        select  colmap.colname::varchar(100) column_name, 
                colmap.colweight column_weight 
        from    ref_unique_record_weight_map cm, cm.column_mapping.colmap colmap 
        where   cm.table_name = quote_ident(table_name)
    ) LOOP
        temp_column_name = column_info.column_name;
        temp_column_weight = column_info.column_weight;
        column_fuzzy_comparison_string = column_fuzzy_comparison_string || 
            ' + unique_record(t1.' || 
            temp_column_name || 
            '::varchar, t2.' || 
            temp_column_name || 
            '::varchar)*' || 

    /* Drop temporary table if it exists */
    EXECUTE 'DROP TABLE IF EXISTS #unique_record_table';

        For each record in the source table that does not have a Unique Record ID (URID):
            1. Create a new temporary table holding all possible URIDs for this record (i.e. all URIDs that have are present). 
                Note: This temporary table will only be present while the simularity check is being calculated
            2. Update each possible URID in the temporary table with it's simularity to the record being checked
            3. Find the most simular record with a URID
                3a. If the most simular record is at least 90% simular, take it's URID (i.e. this is not a unique record, and matches another in the table)
                3b. If there is no record that is 90% simular, create a new URID (i.e. this is a unique record)
            4. Drop the temporary table in preparation for the next record
    FOR unique_record in EXECUTE 'select * from ' || table_name || ' where urid is null order by id asc' LOOP

        RAISE INFO 'test 1';

        /* Create temporary table */
        EXECUTE '
            CREATE TABLE #unique_record_table AS 
            SELECT id, urid, 0.0::decimal(5,2) as simularity_value 
            FROM ' || table_name || '
            where urid is not null

        /* Update similarity values in temporary table */
        EXECUTE '
            UPDATE #unique_record_table  
            SET simularity_value = round(calc_simularity_value,2)::decimal(5,2)
            FROM (
                SELECT ' || column_fuzzy_comparison_string || ' as calc_simularity_value,
                        t2.id as upd_id
                FROM ' || table_name || ' t1
                INNER JOIN ' || table_name || ' t2
                ON t1.id <> t2.id
                AND t1.id = ' || quote_literal(unique_record.id) || '
                ) t
            WHERE t.upd_id = id';

        /* Find the largest similarity value */
        SELECT INTO max_simularity_value simularity_value FROM (
            SELECT  MAX(simularity_value) as simularity_value 
            FROM    #unique_record_table
        ) t;

        /* If there is a >90% similar match, take its URID. Otherwise, create a new URID */
        IF max_simularity_value > 90 THEN
            SELECT INTO unique_record_id urid FROM (
                SELECT urid
                FROM #unique_record_table
                WHERE simularity_value = max_simularity_value
                LIMIT 1
            ) t;
        ELSE
            EXECUTE 'select COALESCE(MAX(urid)+1,1) FROM ' || table_name INTO unique_record_id;
        END IF;
        /* Update table with new URID value */
        EXECUTE 'UPDATE ' || table_name || ' SET urid = ' || quote_literal(unique_record_id) || ' WHERE id = ' || quote_literal(unique_record.id);

        /* Drop temporary table and repeat process */
        EXECUTE 'DROP TABLE #unique_record_table';

        max_simularity_value = 0.0;

    END LOOP;

END;
$$ LANGUAGE plpgsql;
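To make the procedure's control flow easier to follow, here is a minimal Python sketch of the same URID-assignment loop. It is purely illustrative: `similarity` uses the standard library's `difflib` as a stand-in for the fuzzy-matching Python UDF described in this post, and the function and variable names are hypothetical, not part of the solution's code.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Percentage similarity between two strings; a stand-in for the
    # fuzzy-matching Python UDF used by the stored procedure.
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def assign_urids(records, threshold=90.0):
    # records: list of (id, name) tuples, processed in id order.
    # Mirrors the procedure: reuse the best match's URID if it is more
    # than `threshold` percent similar, otherwise mint a new URID.
    names = {}      # id -> name, for records already processed
    assigned = {}   # id -> urid
    next_urid = 1
    for rec_id, name in records:
        best_score, best_urid = 0.0, None
        for other_id, urid in assigned.items():
            score = similarity(name, names[other_id])
            if score > best_score:
                best_score, best_urid = score, urid
        if best_score > threshold:
            assigned[rec_id] = best_urid      # duplicate of an existing record
        else:
            assigned[rec_id] = next_urid      # unique record: new URID
            next_urid += 1
        names[rec_id] = name
    return assigned
```

For example, given the records `(1, "John Smith")`, `(2, "Jon Smith")`, and `(3, "Alice Jones")`, IDs 1 and 2 receive the same URID and ID 3 receives a new one.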

Now that the stored procedure has been created, create the unique record IDs for the customer table by running the following in the Query Editor V2. This will update the urid column of the customer table.

CALL find_unique_id('customer'); 
select * from customer;

Screenshot showing the customer table now has values inserted in the URID column

When the procedure has completed its run, you can identify what duplicate customers were given unique IDs by running the following query:

select * 
from customer
where urid in (
    select urid 
    from customer 
    group by urid 
    having count(*) > 1
)
order by urid asc;

Screenshot showing the records that have been identified as duplicate records

From this you can see that IDs 1, 101, and 121 have all been given the same URID, as have IDs 7 and 107.

Screenshot showing the result for IDs 1, 101, and 121

Screenshot showing the result for IDs 7, and 107

The procedure has also correctly identified that IDs 6 and 106 are different customers, and they therefore don’t have the same URID.

Screenshot showing the result for IDs 6, and 106

Clean up

To avoid incurring future recurring charges, delete all files in the S3 bucket you created. After you delete the files, go to the AWS CloudFormation console and delete the stack deployed in this post. This deletes all the resources that were created.

Conclusion

In this post, we showed one approach to identifying imperfect duplicate records by applying a fuzzy matching algorithm in Amazon Redshift. This solution allows you to identify data quality issues and apply more accurate analytics to your dataset residing in Amazon Redshift.

We showed how you can use open-source Python libraries to create Python UDFs, and how to create a generic stored procedure to identify imperfect matches. This solution is extendable to provide any functionality required, including adding it as a regular process in your ELT (extract, load, and transform) workloads.

Test the created procedure on your datasets to investigate the presence of any imperfect duplicates, and use the knowledge learned throughout this post to create stored procedures and UDFs to implement further functionality.

If you’re new to Amazon Redshift, refer to Getting started with Amazon Redshift for more information and tutorials on Amazon Redshift. You can also refer to the video Get started with Amazon Redshift Serverless for information on starting with Redshift Serverless.

About the Author

Sean Beath is an Analytics Solutions Architect at Amazon Web Services. He has experience in the full delivery lifecycle of data platform modernisation using AWS services and works with customers to help drive analytics value on AWS.

Build a serverless analytics application with Amazon Redshift and Amazon API Gateway

Post Syndicated from David Zhang original https://aws.amazon.com/blogs/big-data/build-a-serverless-analytics-application-with-amazon-redshift-and-amazon-api-gateway/

Serverless applications are a modernized way to perform analytics among business departments and engineering teams. Business teams can gain meaningful insights by simplifying their reporting through web applications and distributing it to a broader audience.

Use cases can include the following:

  • Dashboarding – A webpage consisting of tables and charts where each component can offer insights to a specific business department.
  • Reporting and analysis – An application where you can trigger large analytical queries with dynamic inputs and then view or download the results.
  • Management systems – An application that provides a holistic view of the internal company resources and systems.
  • ETL workflows – A webpage where internal company individuals can trigger specific extract, transform, and load (ETL) workloads in a user-friendly environment with dynamic inputs.
  • Data abstraction – Decouple and refactor underlying data structure and infrastructure.
  • Ease of use – An application where you want to give a large set of users controlled access to analytics without having to onboard each user onto a technical platform. Query updates can be completed in an organized manner, and maintenance has minimal overhead.

In this post, you will learn how to build a serverless analytics application using Amazon Redshift Data API and Amazon API Gateway WebSocket and REST APIs.

Amazon Redshift is fully managed by AWS, so you no longer need to worry about data warehouse management tasks such as hardware provisioning, software patching, setup, configuration, monitoring nodes and drives to recover from failures, or backups. The Data API simplifies access to Amazon Redshift because you don’t need to configure drivers and manage database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later.

API Gateway is a fully managed service that makes it easy for developers to publish, maintain, monitor, and secure APIs at any scale. With API Gateway, you can create RESTful APIs and WebSocket APIs that enable real-time two-way communication applications. API Gateway supports containerized and serverless workloads, as well as web applications. API Gateway acts as a reverse proxy to many of the compute resources that AWS offers.

Event-driven model

Event-driven applications are increasingly popular among customers. Analytical reporting web applications can be implemented through an event-driven model. The applications run in response to events such as user actions and unpredictable query events. Decoupling the producer and consumer processes allows greater flexibility in application design and building decoupled processes. This design can be achieved with the Data API and API Gateway WebSocket and REST APIs.

Both REST API calls and WebSockets establish communication between the client and the backend. Given the popularity of REST, you may wonder why WebSockets are needed and how they contribute to an event-driven design.

What are WebSockets and why do we need them?

Unidirectional communication is customary when building analytical web solutions. In traditional environments, the client initiates a REST API call to run a query on the backend and waits, either synchronously or asynchronously, for the query to complete. The "wait" is typically implemented through polling: the client doesn't know when the backend process will complete, so it repeatedly sends requests to the backend to check.
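In code, client-side polling boils down to a loop like the following (an illustrative sketch; `check_status` stands in for whatever status endpoint the backend exposes):

```python
import time

def poll_until_done(check_status, interval_s=5, max_polls=100):
    # Classic client-side polling: repeatedly ask the backend whether
    # the long-running query has finished, sleeping between checks.
    for _ in range(max_polls):
        if check_status() == "FINISHED":
            return True
        time.sleep(interval_s)
    return False  # gave up before the backend finished
```

Every iteration that returns a non-finished status is an "empty check": a request that consumes resources without delivering results.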

What is the problem with polling? Main challenges include the following:

  • Increased network traffic – A large number of users performing empty checks puts unnecessary load on your backend resources and doesn't scale well.
  • Unnecessary cost – Empty requests deliver no value to the business, yet you pay for the resources they consume.
  • Delayed response – Polling is scheduled at fixed intervals. If the query completes between these intervals, the user only sees the results after the next check. This delay impacts the user experience and, in some cases, may result in UI deadlocks.
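A quick back-of-the-envelope calculation (illustrative only, not part of the solution) shows how the polling interval bounds both the number of empty requests and the added latency:

```python
import math

def polling_overhead(completion_time_s, interval_s):
    # Returns (number of poll requests issued, extra seconds of latency
    # the polling interval adds on top of the query's completion time).
    polls = math.ceil(completion_time_s / interval_s)
    observed_at = polls * interval_s
    return polls, observed_at - completion_time_s
```

A query that completes after 11 seconds, polled every 5 seconds, costs three requests and is observed 4 seconds late. A push model over WebSockets replaces all of that with a single notification when the result is ready.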

For more information on polling, check out From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets.

WebSockets is another approach compared to REST when establishing communication between the front end and backend. WebSockets enable you to create a full duplex communication channel between the client and the server. In this bidirectional scenario, the client can make a request to the server and is notified when the process is complete. The connection remains open, with minimal network overhead, until the response is received.

You may wonder why REST is present at all, since you can transfer response data with WebSockets. A WebSocket is a lightweight protocol designed for real-time messaging between systems. The protocol is not designed for handling large analytical query results, and in API Gateway, each frame's payload can only hold up to 32 KB. Therefore, the REST API performs large data retrieval.
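To see why the frame limit matters, consider the chunking that would be required to push a large result over the socket (a hypothetical helper, not part of the solution code):

```python
FRAME_LIMIT = 32 * 1024  # API Gateway WebSocket frame payload limit, in bytes

def split_into_frames(payload: bytes, limit: int = FRAME_LIMIT):
    # Split a payload into chunks that each fit within one frame.
    return [payload[i:i + limit] for i in range(0, len(payload), limit)]
```

Even a modest 70 KB query result needs three frames, and real analytical result sets grow far larger, so a single REST GET is the simpler and more robust retrieval path.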

By using the Data API and API Gateway, you can build decoupled event-driven web applications for your data analytical needs. You can create WebSocket APIs with API Gateway and establish a connection between the client and your backend services. You can then initiate requests to perform analytical queries with the Data API. Due to the Data API’s asynchronous nature, the query completion generates an event to notify the client through the WebSocket channel. The client can decide to either retrieve the query results through a REST API call or perform other follow-up actions. The event-driven architecture enables bidirectional interoperable messages and data while keeping your system components agnostic.

Solution overview

In this post, we show how to create a serverless event-driven web application by querying with the Data API in the backend, establishing a bidirectional communication channel between the user and the backend with the WebSocket feature in API Gateway, and retrieving the results using its REST API feature. Instead of designing an application with long-running API calls, you can use the Data API. The Data API allows you to run SQL queries asynchronously, removing the need to hold long, persistent database connections.

The web application is protected using Amazon Cognito, which authenticates users before they can use the web app and authorizes the REST API calls made from the application.

Other relevant AWS services in this solution include AWS Lambda and Amazon EventBridge. Lambda is a serverless, event-driven compute resource that enables you to run code without provisioning or managing servers. EventBridge is a serverless event bus allowing you to build event-driven applications.

The solution creates a lightweight WebSocket connection between the browser and the backend. When a user submits a request using WebSockets to the backend, a query is submitted to the Data API. When the query is complete, the Data API sends an event notification to EventBridge. EventBridge signals the system that the data is available and notifies the client. Afterwards, a REST API call is performed to retrieve the query results for the client to view.
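The decoupled flow can be sketched with plain Python objects standing in for the AWS pieces. This is a simulation only: the class and method names below mirror the Data API and EventBridge concepts but are illustrative, not actual AWS SDK calls.

```python
class EventBus:
    # Stand-in for EventBridge: routes every published event to its rules.
    def __init__(self):
        self.rules = []

    def publish(self, event):
        for rule in self.rules:
            rule(event)

class DataAPI:
    # Stand-in for the Redshift Data API: runs a statement "asynchronously"
    # and emits a completion event instead of returning results inline.
    def __init__(self, bus):
        self.bus = bus
        self.results = {}

    def execute_statement(self, sql):
        statement_id = f"stmt-{len(self.results) + 1}"
        self.results[statement_id] = f"rows for: {sql}"  # pretend the query ran
        self.bus.publish({"statementId": statement_id, "state": "FINISHED"})
        return statement_id

    def get_statement_result(self, statement_id):
        return self.results[statement_id]

# Wire it up: the client submits a query over the "WebSocket", is notified
# through an event rule (the SendMessage role), then fetches over "REST".
bus = EventBus()
data_api = DataAPI(bus)
notifications = []
bus.rules.append(lambda event: notifications.append(event["statementId"]))

stmt = data_api.execute_statement("SELECT count(*) FROM nyc_yellow_taxi")
result = data_api.get_statement_result(notifications[-1])
```

The client never blocks on the query: it reacts to the completion event, and the result fetch is a separate, stateless call, mirroring the notify-over-WebSocket, fetch-over-REST pattern of the solution.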

We have published this solution on the AWS Samples GitHub repository and will be referencing it during the rest of this post.

The following architecture diagram highlights the end-to-end solution, which you can provision automatically with AWS CloudFormation templates run as part of the shell script with some parameter variables.

The application performs the following steps (note the corresponding numbered steps in the process flow):

  1. A web application is provisioned on AWS Amplify; the user needs to sign up first by providing their email and a password to access the site.
  2. The user verifies their credentials using a pin sent to their email. This step is mandatory for the user to then log in to the application and continue access to the other features of the application.
  3. After the user is signed up and verified, they can sign in to the application and request data through their web or mobile clients with input parameters. This initiates a WebSocket connection in API Gateway. (Flow 1, 2)
  4. The connection request is handled by a Lambda function, OnConnect, which initiates an asynchronous database query in Amazon Redshift using the Data API. The SQL query is taken from a SQL script in Amazon Simple Storage Service (Amazon S3) with dynamic input from the client. (Flow 3, 4, 6, 7)
  5. In addition, the OnConnect Lambda function stores the connection, statement identifier, and topic name in an Amazon DynamoDB database. The topic name is an extra parameter that can be used if users want to implement multiple reports on the same webpage. This allows the front end to map responses to the correct report. (Flow 3, 4, 5)
  6. The Data API runs the query initiated in step 4. When the operation is complete, an event notification is sent to EventBridge. (Flow 8)
  7. EventBridge activates an event rule to redirect that event to another Lambda function, SendMessage. (Flow 9)
  8. The SendMessage function notifies the client that the SQL query is complete via API Gateway. (Flow 10, 11, 12)
  9. After the notification is received, the client performs a REST API call (GET) to fetch the results. (Flow 13, 14, 15, 16)
  10. The GetResult function is triggered, which retrieves the SQL query result and returns it to the client.
  11. The user is now able to view the results on the webpage.
  12. When clients disconnect from their browser, API Gateway invokes the onDisconnect Lambda function, which deletes the connection information from the DynamoDB table. (Flow 17, 18, 19)

Prerequisites

Prior to deploying your event-driven web application, ensure you have the following:

  • An Amazon Redshift cluster in your AWS environment – This is your backend data warehousing solution to run your analytical queries. For instructions to create your Amazon Redshift cluster, refer to Getting started with Amazon Redshift.
  • An S3 bucket that you have access to – The S3 bucket will be your object storage solution where you can store your SQL scripts. To create your S3 bucket, refer to Create your first S3 bucket.

Deploy CloudFormation templates

The code associated with the design is available in the following GitHub repository. You can clone the repository inside an AWS Cloud9 environment in your AWS account. The AWS Cloud9 environment comes with the AWS Command Line Interface (AWS CLI) installed, which is used to run the CloudFormation templates that set up the AWS infrastructure. Make sure that the jq command line JSON processor is installed; the script uses it to parse JSON output.

The complete architecture is set up using three CloudFormation templates:

  • cognito-setup.yaml – Creates the Amazon Cognito user pool and web app client, which are used for authentication and for protecting the REST API
  • backend-setup.yaml – Creates all the required Lambda functions and the WebSocket and REST APIs, and configures them on API Gateway
  • webapp-setup.yaml – Creates the web application hosting using Amplify to connect and communicate with the WebSocket and REST APIs

These CloudFormation templates are run using the script.sh shell script, which takes care of all the dependencies as required.

A generic template is provided for you to customize your own DDL SQL scripts as well as your own query SQL scripts. We have created sample scripts for you to follow along.

  1. Download the sample DDL script and upload it to an existing S3 bucket.
  2. Change the IAM role value to your Amazon Redshift cluster’s IAM role with permissions to AmazonS3ReadOnlyAccess.

For this post, we copy the New York Taxi Data 2015 dataset from a public S3 bucket.

  1. Download the sample query script and upload it to an existing S3 bucket.
  2. Upload the modified sample DDL script and the sample query script into a preexisting S3 bucket that you own, and note down the S3 URI path.

If you want to run your own customized version, modify the DDL and query script to fit your scenario.

  1. Edit the script.sh file before you run it and set the values for the following parameters:
    • RedshiftClusterEndpoint (aws_redshift_cluster_ep) – Your Amazon Redshift cluster endpoint available on the AWS Management Console
    • DBUsername (aws_dbuser_name) – Your Amazon Redshift database user name
    • DDBTableName (aws_ddbtable_name) – The name of the DynamoDB table that will be created
    • WebsocketEndpointSSMParameterName (aws_wsep_param_name) – The parameter name that stores the WebSocket endpoint in AWS Systems Manager Parameter Store
    • RestApiEndpointSSMParameterName (aws_rapiep_param_name) – The parameter name that stores the REST API endpoint in Parameter Store
    • DDLScriptS3Path (aws_ddl_script_path) – The S3 URI to the DDL script that you uploaded
    • QueryScriptS3Path (aws_query_script_path) – The S3 URI to the query script that you uploaded
    • AWSRegion (aws_region) – The Region where the AWS infrastructure is being set up
    • CognitoPoolName (aws_user_pool_name) – The name you want to give to your Amazon Cognito user pool
    • ClientAppName (aws_client_app_name) – The name of the client app to be configured for the web app to handle the user authentication for the users

The default acceptable values are already provided as part of the downloaded code.

  1. Run the script.sh shell script from your AWS Cloud9 terminal.

During deployment, AWS CloudFormation creates and triggers the Lambda function SetupRedshiftLambdaFunction, which sets up an Amazon Redshift database table and populates data into the table. The following diagram illustrates this process.

Use the demo app

When the shell script is complete, you can start interacting with the demo web app:

  1. On the Amplify console, under All apps in the navigation pane, choose DemoApp.
  2. Choose Run build.

The DemoApp web application goes through the Provision, Build, and Deploy phases.

  1. When it’s complete, use the URL provided to access the web application.

The following screenshot shows the web application page. It has minimal functionality: you can sign in, sign up, or verify a user.

  1. Choose Sign Up.

  1. For Email ID, enter an email.
  2. For Password, enter a password that is at least eight characters long, has at least one uppercase and lowercase letter, at least one number, and at least one special character.
  3. Choose Let’s Enroll.

The Verify your Login to Demo App page opens.

  1. Enter your email and the verification code sent to the email you specified.
  2. Choose Verify.

You’re redirected to a login page.

  1. Sign in using your credentials.

You’re redirected to the demoPage.html website.

  1. Choose Open Connection.

You now have an active WebSocket connection between your browser and your backend AWS environment.

  1. For Trip Month, specify a month (for this example, December) and choose Submit.

You have now defined the month and year you want to query your data upon. After a few seconds, you can see the output delivered from the WebSocket.

You may continue using the active WebSocket connection for additional queries—just choose a different month and choose Submit again.

  1. When you’re done, choose Close Connection to close the WebSocket connection.

For exploratory purposes, while your WebSocket connection is active, you can navigate to your DynamoDB table on the DynamoDB console to view the items that are currently stored. After the WebSocket connection is closed, the items stored in DynamoDB are deleted.

Clean up

To clean up your resources, complete the following steps:

  1. On the Amazon S3 console, navigate to the S3 bucket containing the sample DDL script and query script and delete them from the bucket.
  2. On the Amazon Redshift console, navigate to your Amazon Redshift cluster and delete the data you copied over from the sample DDL script.
    1. Run truncate nyc_yellow_taxi;
    2. Run drop table nyc_yellow_taxi;
  3. On the AWS CloudFormation console, navigate to the CloudFormation stacks and choose Delete. Delete the stacks in the following order:
    1. WebappSetup
    2. BackendSetup
    3. CognitoSetup

All resources created in this solution will be deleted.

Monitoring

You can monitor your event-driven web application events, user activity, and API usage with Amazon CloudWatch and AWS CloudTrail. Most areas of this solution already have logging enabled. To view your API Gateway logs, you can turn on CloudWatch Logs. Lambda comes with default logging and monitoring and can be accessed with CloudWatch.

Security

You can secure access to the application using Amazon Cognito, which is a developer-centric and cost-effective customer authentication, authorization, and user management solution. It provides both identity store and federation options that can scale easily. Amazon Cognito supports logins with social identity providers and SAML or OIDC-based identity providers, and supports various compliance standards. It operates on open identity standards (OAuth2.0, SAML 2.0, and OpenID Connect). You can also integrate it with API Gateway to authenticate and authorize the REST API calls either using the Amazon Cognito client app or a Lambda function.


The nature of this application includes a front-end client initiating SQL queries against Amazon Redshift. An important consideration is the potential for malicious client activity, such as SQL injection. With the current implementation, that is not possible: the SQL queries preexist in your AWS environment and are DQL statements (they don't alter the data or structure). However, as you develop this application to fit your business, you should evaluate these areas of risk.

AWS offers a variety of security services to help you secure your workloads and applications in the cloud, including AWS Shield, AWS Network Firewall, AWS Web Application Firewall, and more. For more information and a full list, refer to Security, Identity, and Compliance on AWS.

Cost optimization

The AWS services that the CloudFormation templates provision in this solution are all serverless. In terms of cost optimization, you only pay for what you use. This model also allows you to scale without manual intervention. Review each service's pricing page to determine the associated costs.

Conclusion

In this post, we showed you how to create an event-driven application using the Amazon Redshift Data API and API Gateway WebSocket and REST APIs. The solution helps you build data analytical web applications in an event-driven architecture, decouple your application, optimize long-running database queries processes, and avoid unnecessary polling requests between the client and the backend.

You also used serverless technologies: API Gateway, Lambda, DynamoDB, and EventBridge. You didn't have to manage or provision any servers throughout this process.

This event-driven, serverless architecture offers greater extensibility and simplicity, making it easier to maintain and release new features. Adding new components or third-party products is also simplified.

With the instructions in this post and the generic CloudFormation templates we provided, you can customize your own event-driven application tailored to your business. For feedback or contributions, we welcome you to contact us through the AWS Samples GitHub Repository by creating an issue.

About the Authors

David Zhang is an AWS Data Architect in Global Financial Services. He specializes in designing and implementing serverless analytics infrastructure, data management, ETL, and big data systems. He helps customers modernize their data platforms on AWS. David is also an active speaker and contributor to AWS conferences, technical content, and open-source initiatives. During his free time, he enjoys playing volleyball, tennis, and weightlifting. Feel free to connect with him on LinkedIn.

Manash Deb is a Software Development Manager in the AWS Directory Service team. With over 18 years of software development experience, his passion is designing and delivering highly scalable, secure, zero-maintenance applications in the AWS identity and data analytics space. He loves mentoring and coaching others, acting as a catalyst and force multiplier, leading highly motivated engineering teams, and building large-scale distributed systems.

Pavan Kumar Vadupu Lakshman Manikya is an AWS Solutions Architect who helps customers design robust, scalable solutions across multiple industries. With a background in enterprise architecture and software development, Pavan has helped create solutions for API security, API management, microservices, and geospatial information system use cases for his customers. He is passionate about learning new technologies and solving, automating, and simplifying customer problems with them.

What’s new in Amazon Redshift – 2022, a year in review

Post Syndicated from Manan Goel original https://aws.amazon.com/blogs/big-data/whats-new-in-amazon-redshift-2022-a-year-in-review/

In 2021 and 2020, we told you about the new features in Amazon Redshift that make it easier, faster, and more cost-effective to analyze all your data and find rich and powerful insights. In 2022, we are happy to report that the Amazon Redshift team was hard at work. We worked backward from customer requirements and announced multiple new features to make it easier, faster, and more cost-effective to analyze all your data. This post covers some of these new features.

At AWS, for data and analytics, our strategy is to give you a modern data architecture that helps you break free from data silos; have purpose-built data, analytics, machine learning (ML), and artificial intelligence services to use the right tool for the right job; and have open, governed, secure, and fully managed services to make analytics available to everyone. Within AWS’s modern data architecture, Amazon Redshift as the cloud data warehouse remains a key component, enabling you to run complex SQL analytics at scale and performance on terabytes to petabytes of structured and unstructured data, and make the insights widely available through popular business intelligence (BI) and analytics tools. We continue to work backward from customers’ requirements, and in 2022 launched over 40 features in Amazon Redshift to help customers with their top data warehousing use cases, including:

  • Self-service analytics
  • Easy data ingestion
  • Data sharing and collaboration
  • Data science and machine learning
  • Secure and reliable analytics
  • Best price performance analytics

Let’s dive deeper and discuss the new Amazon Redshift features in these areas.

Self-service analytics

Customers continue to tell us that data and analytics is becoming ubiquitous, and everyone in their organization needs analytics. We announced Amazon Redshift Serverless (in preview) in 2021 to make it easy to run and scale analytics in seconds without having to provision and manage data warehouse infrastructure. In July 2022, we announced the general availability of Redshift Serverless, and since then thousands of customers, including Peloton, Broadridge Financials, and NextGen Healthcare, have used it to quickly and easily analyze their data. Amazon Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver high performance for all your analytics, and you only pay for the compute used for the duration of the workloads on a per-second basis. Since GA, we have added features like resource tagging, simplified monitoring, and availability in additional AWS Regions to further simplify billing and expand the reach across more Regions worldwide.

In 2021, we launched Amazon Redshift Query Editor V2, which is a free web-based tool for data analysts, data scientists, and developers to explore, analyze, and collaborate on data in Amazon Redshift data warehouses and data lakes. In 2022, Query Editor V2 got additional enhancements such as notebook support for improved collaboration to author, organize, and annotate queries; user access through identity provider (IdP) credentials for single sign-on; and the ability to run multiple queries concurrently to improve developer productivity.

Autonomics is another area where we are actively working to use ML-based optimizations and give customers a self-learning and self-optimizing data warehouse. In 2022, we announced the general availability of Automated Materialized Views (AutoMVs) to improve query performance (reduce the total runtime) without any user effort by automatically creating and maintaining materialized views. AutoMVs, combined with automatic refresh, incremental refresh, and automatic query rewriting for materialized views, made materialized views maintenance-free, giving you faster performance automatically. In addition, the automatic table optimization (ATO) capability for schema optimization and the automatic workload management (auto WLM) capability for workload optimization received further improvements for better query performance.

Easy data ingestion

Customers tell us that they have their data distributed over multiple data sources like transactional databases, data warehouses, data lakes, and big data systems. They want the flexibility to integrate this data with no-code/low-code, zero-ETL data pipelines or analyze this data in place without moving it. Customers tell us that their current data pipelines are complex, manual, rigid, and slow, resulting in incomplete, inconsistent, and stale views of data, limiting insights. Customers have asked us for a better way forward, and we are pleased to announce a number of new capabilities to simplify and automate data pipelines.

Amazon Aurora zero-ETL integration with Amazon Redshift (preview) enables you to run near-real-time analytics and ML on petabytes of transactional data. It offers a no-code solution for making transactional data from multiple Amazon Aurora databases available in Amazon Redshift data warehouses within seconds of being written to Aurora, eliminating the need to build and maintain complex data pipelines. With this feature, Aurora customers can also access Amazon Redshift capabilities such as complex SQL analytics, built-in ML, data sharing, and federated access to multiple data stores and data lakes. This feature is now available in preview for Amazon Aurora MySQL-Compatible Edition version 3 (with MySQL 8.0 compatibility), and you can request access to the preview.

Amazon Redshift now supports auto-copy from Amazon S3 (preview) to simplify data loading from Amazon Simple Storage Service (Amazon S3) into Amazon Redshift. You can now set up continuous file ingestion rules (copy jobs) to track your Amazon S3 paths and automatically load new files without the need for additional tools or custom solutions. Copy jobs can be monitored through system tables, and they automatically keep track of previously loaded files and exclude them from the ingestion process to prevent data duplication. This feature is now available in preview; you can try this feature by creating a new cluster using the preview track.

Customers continue to tell us that they need instantaneous, in-the-moment, real-time analytics, and we are pleased to announce the general availability of streaming ingestion support in Amazon Redshift for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). This feature eliminates the need to stage streaming data in Amazon S3 before ingesting it into Amazon Redshift, enabling you to achieve low latency, measured in seconds, while ingesting hundreds of megabytes of streaming data per second into your data warehouses. You can use SQL within Amazon Redshift to connect to and directly ingest data from multiple Kinesis data streams or MSK topics, create auto-refreshing streaming materialized views with transformations on top of streams directly to access streaming data, and combine real-time data with historical data for better insights. For example, Adobe has integrated Amazon Redshift streaming ingestion as part of their Adobe Experience Platform for ingesting and analyzing, in real time, the web and applications clickstream and session data for various applications like CRM and customer support applications.

Customers have told us that they want simple, out-of-the-box integration between Amazon Redshift, BI and ETL (extract, transform, and load) tools, and business applications like Salesforce and Marketo. We are pleased to announce the general availability of Informatica Data Loader for Amazon Redshift, which enables you to use Informatica Data Loader for high-speed and high-volume data loading into Amazon Redshift for free. You can simply select the Informatica Data Loader option on the Amazon Redshift console. Once in Informatica Data Loader, you can connect to sources such as Salesforce or Marketo, choose Amazon Redshift as a target, and begin to load your data.

Data sharing and collaboration

Customers continue to tell us that they want to analyze all their first-party and third-party data and make the rich data-driven insights available to their customers, partners, and suppliers. We launched new features in 2021, such as Data Sharing and AWS Data Exchange integration, to make it easier for you to analyze all of your data and share it within and outside your organizations.

A great example of a customer using data sharing is Orion. Orion provides real-time data as a service (DaaS) solutions for customers in the financial services industry, such as wealth management, asset management, and investment management providers. They have over 2,500 data sources that are primarily SQL Server databases sitting both on premises and in AWS. Data is streamed using Kafka connectors into Amazon Redshift. They have a producer cluster that receives all this data and then uses Data Sharing to share data in real time for collaboration. This is a multi-tenant architecture that serves multiple clients. Given the sensitivity of their data, data sharing is a way to provide workload isolation between clusters and also securely share that data with end-users.

In 2022, we continued to invest in this area to improve the performance, governance, and developer productivity with new features to make it easier, simpler, and faster to share and collaborate on data.

As customers are building large-scale data sharing configurations, they have asked for simplified governance and security for shared data, and we are adding centralized access control with AWS Lake Formation for Amazon Redshift datashares to enable sharing live data across multiple Amazon Redshift data warehouses. With this feature, Amazon Redshift now supports simplified governance of Amazon Redshift datashares by using AWS Lake Formation as a single pane of glass to centrally manage data or permissions on datashares. You can view, modify, and audit permissions, including row-level and column-level security on the tables and views in the Amazon Redshift datashares, using Lake Formation APIs and the AWS Management Console, and allow the Amazon Redshift datashares to be discovered and consumed by other Amazon Redshift data warehouses.

Data science and machine learning

Customers continue to tell us that they want their data and analytics systems to help them answer a wide range of questions, from what is happening in their business (descriptive analytics) to why is it happening (diagnostic analytics) and what will happen in the future (predictive analytics). Amazon Redshift provides features like complex SQL analytics, data lake analytics, and Amazon Redshift ML for customers to analyze their data and discover powerful insights. Redshift ML integrates Amazon Redshift with Amazon SageMaker, a fully managed ML service, enabling you to create, train, and deploy ML models using familiar SQL commands.

Customers have also asked us for better integration between Amazon Redshift and Apache Spark, so we are excited to announce Amazon Redshift integration for Apache Spark to make data warehouses easily accessible for Spark-based applications. Now, developers using AWS analytics and ML services such as Amazon EMR, AWS Glue, and SageMaker can effortlessly build Apache Spark applications that read from and write to their Amazon Redshift data warehouses. Amazon EMR and AWS Glue package the Redshift-Spark connector so you can easily connect to your data warehouse from your Spark-based applications. You can use several pushdown capabilities for operations such as sort, aggregate, limit, join, and scalar functions so that only the relevant data is moved from your Amazon Redshift data warehouse to the consuming Spark application. You can also make your applications more secure by utilizing AWS Identity and Access Management (IAM) credentials to connect to Amazon Redshift.

Secure and reliable analytics

Customers continue to tell us that their data warehouses are mission-critical systems that need high availability, reliability, and security. We launched a number of new features in 2022 in this area.

Amazon Redshift now supports Multi-AZ deployments (in preview) for RA3 instance-based clusters, enabling your data warehouse to run in multiple AWS Availability Zones simultaneously and continue operating in unforeseen Availability Zone-wide failure scenarios. Multi-AZ support is already available for Redshift Serverless. An Amazon Redshift Multi-AZ deployment allows you to recover from Availability Zone failures without any user intervention. An Amazon Redshift Multi-AZ data warehouse is accessed as a single data warehouse with one endpoint, and helps you maximize performance by automatically distributing workload processing across multiple Availability Zones. No application changes are needed to maintain business continuity during unforeseen outages.

In 2022, we launched features like role-based access control, row-level security, and data masking (in preview) to make it easier for you to manage access and decide who has access to which data, including obfuscating personally identifiable information (PII) like credit card numbers.

You can use role-based access control (RBAC) to control end-user access to data at a broad or granular level based on an end-user's job role and permissions. With RBAC, you can create a role using SQL, grant a collection of granular permissions to the role, and then assign that role to end-users. Roles can be granted object-level, column-level, and system-level permissions. Additionally, RBAC introduces out-of-the-box system roles for DBAs, operators, and security admins, and supports customized roles.
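As a sketch of the RBAC flow described above (the role, table, and user names are hypothetical):

```sql
-- Create a role, grant it granular permissions, and assign it to a user.
CREATE ROLE sales_analyst;
GRANT SELECT ON TABLE public.sales TO ROLE sales_analyst;
GRANT ROLE sales_analyst TO alice;

-- System roles ship out of the box, for example:
GRANT ROLE sys:operator TO bob;
```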

Row-level security (RLS) simplifies design and implementation of fine-grained access to the rows in tables. With RLS, you can restrict access to a subset of rows within a table based on the users’ job role or permissions with SQL.
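A minimal RLS sketch, with hypothetical table, policy, and role names:

```sql
-- Analysts in the APAC role see only APAC rows.
CREATE RLS POLICY policy_apac_only
WITH (region VARCHAR(16))
USING (region = 'APAC');

ATTACH RLS POLICY policy_apac_only ON orders TO ROLE apac_analyst;
ALTER TABLE orders ROW LEVEL SECURITY ON;
```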

Amazon Redshift support for dynamic data masking (DDM), which is now available in preview, allows you to simplify protecting PII such as Social Security numbers, credit card numbers, and phone numbers in your Amazon Redshift data warehouse. With dynamic data masking, you control access to your data through simple SQL-based masking policies that determine how Amazon Redshift returns sensitive data to the user at query time. You can create masking policies to define consistent, format-preserving, and irreversible masked data values. You can apply a masking policy on a specific column or list of columns in a table. Also, you have the flexibility of choosing how to show the masked data. For example, you can completely hide the data, replace partial real values with wildcard characters, or define your own way to mask the data using SQL expressions, Python, or AWS Lambda user-defined functions. Additionally, you can apply a conditional masking policy based on other columns, which selectively protects the column data in a table based on the values in one or more different columns.
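For example, a masking policy that preserves only the last four digits of a credit card number might be sketched as follows (the table and policy names are hypothetical, and the SUBSTRING offsets assume a dddd-dddd-dddd-dddd format):

```sql
-- Replace all but the last 4 digits with a fixed pattern.
CREATE MASKING POLICY mask_credit_card
WITH (credit_card VARCHAR(32))
USING ('XXXX-XXXX-XXXX-' || SUBSTRING(credit_card, 16, 4));

-- Everyone without a more specific policy sees the masked value.
ATTACH MASKING POLICY mask_credit_card
ON customers(credit_card)
TO PUBLIC;
```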

We also announced enhancements to audit logging, native integration with Microsoft Azure Active Directory, and support for default IAM roles in additional Regions to further simplify security management.

Best price performance analytics

Customers continue to tell us that they need fast and cost-effective data warehouses that deliver high performance at any scale while keeping costs low. Since day 1 of Amazon Redshift's launch in 2012, we have taken a data-driven approach and used fleet telemetry to build a cloud data warehouse service that gives you the best price performance at any scale. Over the years, we have evolved Amazon Redshift's architecture and launched features such as Redshift Managed Storage (RMS) for separation of storage and compute, Amazon Redshift Spectrum for data lake queries, automatic table optimization for physical schema optimization, automatic workload management to prioritize workloads and allocate the right compute and memory, cluster resize to scale compute and storage vertically, and concurrency scaling to dynamically scale compute out or in. Our performance benchmarks continue to demonstrate Amazon Redshift's price performance leadership.

In 2022, we added new features such as the general availability of concurrency scaling for write operations like COPY, INSERT, UPDATE, and DELETE to support virtually unlimited concurrent users and queries. We also introduced performance improvements for string-based data processing through vectorized scans over lightweight, CPU-efficient, dictionary-encoded string columns, which allows the database engine to operate directly over compressed data.

We also added support for SQL operators such as MERGE (a single operator for inserts or updates), CONNECT BY (for hierarchical queries), and GROUPING SETS, ROLLUP, and CUBE (for multi-dimensional reporting), and we increased the size of the SUPER data type to 16 MB to make it easier for you to migrate from legacy data warehouses to Amazon Redshift.
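For instance, a staging-to-target upsert that previously required separate UPDATE and INSERT statements can be sketched as a single MERGE (table and column names are hypothetical):

```sql
MERGE INTO customer_target
USING customer_staging
ON customer_target.customer_id = customer_staging.customer_id
WHEN MATCHED THEN
    UPDATE SET email = customer_staging.email
WHEN NOT MATCHED THEN
    INSERT (customer_id, email)
    VALUES (customer_staging.customer_id, customer_staging.email);
```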


Our customers continue to tell us that data and analytics remains a top priority for them and the need to cost-effectively extract more business value from their data during these times is more pronounced than any other time in the past. Amazon Redshift as your cloud data warehouse enables you to run complex SQL analytics with scale and performance on terabytes to petabytes of structured and unstructured data and make the insights widely available through popular BI and analytics tools.

Although we launched over 40 features in 2022 and the pace of innovation continues to accelerate, it remains day 1 and we look forward to hearing from you on how these features help you unlock more value for your organizations. We invite you to try these new features and get in touch with us through your AWS account team if you have further comments.

About the author

Manan Goel is a Product Go-To-Market Leader for AWS Analytics Services, including Amazon Redshift, at AWS. He has more than 25 years of experience and is well versed in databases, data warehousing, business intelligence, and analytics. Manan holds an MBA from Duke University and a BS in Electronics and Communications Engineering.

Build near real-time logistics dashboards using Amazon Redshift and Amazon Managed Grafana for better operational intelligence

Post Syndicated from Paul Villena original https://aws.amazon.com/blogs/big-data/build-near-real-time-logistics-dashboards-using-amazon-redshift-and-amazon-managed-grafana-for-better-operational-intelligence/

Amazon Redshift is a fully managed data warehousing service that is currently helping tens of thousands of customers manage analytics at scale. It continues to lead price-performance benchmarks, and separates compute and storage so each can be scaled independently and you only pay for what you need. It also eliminates data silos by simplifying access to your operational databases, data warehouse, and data lake with consistent security and governance policies.

With the Amazon Redshift streaming ingestion feature, it's easier than ever to access and analyze data coming from real-time data sources. It simplifies the streaming architecture by providing native integration between Amazon Redshift and the streaming engines in AWS, which are Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). Streaming data sources like system logs, social media feeds, and IoT streams can continue to push events to the streaming engines, and Amazon Redshift becomes just another consumer. Before Amazon Redshift streaming was available, we had to stage the streaming data first in Amazon Simple Storage Service (Amazon S3) and then run the COPY command to load it into Amazon Redshift. Eliminating the need to stage data in Amazon S3 results in faster performance and improved latency. With this feature, we can ingest hundreds of megabytes of data per second with a latency of just a few seconds.

Another common challenge for our customers is the additional skill required when using streaming data. In Amazon Redshift streaming ingestion, only SQL is required. We use SQL to do the following:

  • Define the integration between Amazon Redshift and our streaming engines with the creation of external schema
  • Create the different streaming database objects that are actually materialized views
  • Query and analyze the streaming data
  • Generate new features that are used to predict delays using machine learning (ML)
  • Perform inferencing natively using Amazon Redshift ML

In this post, we build a near real-time logistics dashboard using Amazon Redshift and Amazon Managed Grafana. Our example is an operational intelligence dashboard for a logistics company that provides situational awareness and augmented intelligence for their operations team. From this dashboard, the team can see the current state of their consignments and their logistics fleet based on events that happened only a few seconds ago. It also shows the consignment delay predictions of an Amazon Redshift ML model that helps them proactively respond to disruptions before they even happen.

Solution overview

This solution is composed of the following components, and the provisioning of resources is automated using the AWS Cloud Development Kit (AWS CDK):

  • Multiple streaming data sources are simulated through Python code running in our serverless compute service, AWS Lambda
  • The streaming events are captured by Amazon Kinesis Data Streams, which is a highly scalable serverless streaming data service
  • We use the Amazon Redshift streaming ingestion feature to process and store the streaming data and Amazon Redshift ML to predict the likelihood of a consignment getting delayed
  • We use AWS Step Functions for serverless workflow orchestration
  • The solution includes a consumption layer built on Amazon Managed Grafana where we can visualize the insights and even generate alerts through Amazon Simple Notification Service (Amazon SNS) for our operations team

The following diagram illustrates our solution architecture.


The project has the following prerequisites:

Sample deployment using the AWS CDK

The AWS CDK is an open-source project that allows you to define your cloud infrastructure using familiar programming languages. It uses high-level constructs to represent AWS components to simplify the build process. In this post, we use Python to define the cloud infrastructure because it is familiar to many data and analytics professionals.

Clone the GitHub repository and install the Python dependencies:

git clone https://github.com/aws-samples/amazon-redshift-streaming-workshop
cd amazon-redshift-streaming-workshop
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Next, bootstrap the AWS CDK. This sets up the resources required by the AWS CDK to deploy into the AWS account. This step is only required if you haven’t used the AWS CDK in the deployment account and Region.

cdk bootstrap

Deploy all stacks:

cdk deploy IngestionStack 
cdk deploy RedshiftStack 
cdk deploy StepFunctionStack

The entire deployment time takes 10–15 minutes.

Access streaming data using Amazon Redshift streaming ingestion

The AWS CDK deployment provisions an Amazon Redshift cluster with the appropriate default IAM role to access the Kinesis data stream. We can create an external schema to establish a connection between the Amazon Redshift cluster and the Kinesis data stream:

CREATE EXTERNAL SCHEMA ext_kinesis
FROM KINESIS
IAM_ROLE default;

For instructions on how to connect to the cluster, refer to Connecting to the Redshift Cluster.

We use a materialized view to parse data in the Kinesis data stream. In this case, the whole payload is ingested as is and stored using the SUPER data type in Amazon Redshift. Data stored in streaming engines is usually in semi-structured format, and the SUPER data type provides a fast and efficient way to analyze semi-structured data within Amazon Redshift.

See the following code:

CREATE MATERIALIZED VIEW consignment_stream AS
SELECT approximate_arrival_timestamp,
    JSON_PARSE(from_varbyte(kinesis_data, 'utf-8')) AS consignment_data
FROM ext_kinesis.consignment_stream
WHERE is_utf8(kinesis_data)
    AND is_valid_json(from_varbyte(kinesis_data, 'utf-8'));

Refreshing the materialized view invokes Amazon Redshift to read data directly from the Kinesis data stream and load it into the materialized view. This refresh can be done automatically by adding the AUTO REFRESH clause in the materialized view definition. However, in this example, we are orchestrating the end-to-end data pipeline using AWS Step Functions.
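As a sketch of the two refresh styles (the view and stream names are hypothetical):

```sql
-- Auto refresh: Amazon Redshift keeps the view current without an orchestrator.
CREATE MATERIALIZED VIEW events_mv
AUTO REFRESH YES
AS
SELECT approximate_arrival_timestamp, kinesis_data
FROM ext_kinesis.events_stream;

-- Manual refresh: invoked explicitly, for example from a Step Functions task.
REFRESH MATERIALIZED VIEW events_mv;
```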


Now we can start running queries against our streaming data and unify it with other datasets like logistics fleet data. If we want to know the distribution of our consignments across different states, we can easily unpack the contents of the JSON payload using the PartiQL syntax.

SELECT cs.consignment_data.origin_state::VARCHAR,
    COUNT(1) number_of_consignments,
    AVG(on_the_move) running_fleet,
    AVG(scheduled_maintenance + unscheduled_maintenance) under_maintenance
FROM consignment_stream cs
INNER JOIN fleet_summary fs
    ON TRIM(cs.consignment_data.origin_state::VARCHAR) = fs.vehicle_location
GROUP BY 1;

Generate features using Amazon Redshift SQL functions

The next step is to transform and enrich the streaming data using Amazon Redshift SQL to generate additional features that will be used by Amazon Redshift ML for its predictions. We use date and time functions to identify the day of the week, and calculate the number of days between the order date and target delivery date.

We also use geospatial functions, specifically ST_DistanceSphere, to calculate the distance between origin and destination locations. The GEOMETRY data type within Amazon Redshift provides a cost-effective way to analyze geospatial data such as longitude and latitudes at scale. In this example, the addresses have already been converted to longitude and latitude. However, if you need to perform geocoding, you can integrate Amazon Location Services with Amazon Redshift using user-defined functions (UDFs). On top of geocoding, Amazon Location Service also allows you to more accurately calculate route distance between origin and destination, and even specify waypoints along the way.
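For intuition, the great-circle distance that ST_DistanceSphere computes can be approximated outside the warehouse with the haversine formula. The following is a self-contained sketch (the coordinates and Earth radius are illustrative, and this is not Amazon Redshift's exact implementation):

```python
import math

def distance_sphere_m(lat1, lon1, lat2, lon2, radius_m=6371008.8):
    """Great-circle (haversine) distance in meters between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))

# Sydney to Melbourne is roughly 713 km by great circle.
print(f"{distance_sphere_m(-33.8688, 151.2093, -37.8136, 144.9631) / 1000:.1f} km")
```

Dividing by 1,000, as the materialized view in this post does, converts the result from meters to kilometers.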

We use another materialized view to persist these transformations. A materialized view provides a simple yet efficient way to create data pipelines using its incremental refresh capability. Amazon Redshift identifies the incremental changes from the last refresh and only updates the target materialized view based on these changes. In this materialized view, all our transformations are deterministic, so we expect our data to be consistent when going through a full refresh or an incremental refresh.

See the following code:

CREATE MATERIALIZED VIEW consignment_transformed AS
SELECT
    consignment_data.consignmentid::INT consignment_id,
    consignment_data.consignment_date::TIMESTAMP consignment_date,
    consignment_data.delivery_date::TIMESTAMP delivery_date,
    consignment_data.origin_state::VARCHAR origin_state,
    consignment_data.destination_state::VARCHAR destination_state,
    consignment_data.revenue::FLOAT revenue,
    consignment_data.cost::FLOAT cost,
    DATE_PART(dayofweek, consignment_data.consignment_date::TIMESTAMP)::INT day_of_week,
    DATE_PART(hour, consignment_data.consignment_date::TIMESTAMP)::INT "hour",
    DATEDIFF(days,
        consignment_data.consignment_date::TIMESTAMP,
        consignment_data.delivery_date::TIMESTAMP)::INT days_to_deliver,
    (ST_DistanceSphere(
        ST_Point(consignment_data.origin_lat::FLOAT, consignment_data.origin_long::FLOAT),
        ST_Point(consignment_data.destination_lat::FLOAT, consignment_data.destination_long::FLOAT)
    ) / 1000 --convert to km
    ) delivery_distance
FROM consignment_stream;

Predict delays using Amazon Redshift ML

We can use this enriched data to make predictions on the delay probability of a consignment. Amazon Redshift ML is a feature of Amazon Redshift that allows you to use the power of Amazon Redshift to build, train, and deploy ML models directly within your data warehouse.

The training of a new Amazon Redshift ML model has been initiated as part of the AWS CDK deployment using the CREATE MODEL statement. The training dataset is defined in the FROM clause, and TARGET defines which column the model is trying to predict. The FUNCTION clause defines the name of the function that is used to make predictions.

CREATE MODEL ml_delay_prediction -- already executed by CDK
FROM (SELECT * FROM ext_s3.consignment_train)
TARGET probability
FUNCTION fnc_delay_probability
IAM_ROLE default
SETTINGS (
    MAX_RUNTIME 1800, --seconds
    S3_BUCKET '<ingestionstack-s3bucketname>' --replace S3 bucket name
);

This simplified model is trained using historical observations, and the training process takes around 30 minutes to complete. You can check the status of the training job by running the SHOW MODEL statement:

SHOW MODEL ml_delay_prediction;

When the model is ready, we can start making predictions on new data that is streamed into Amazon Redshift. Predictions are generated using the Amazon Redshift ML function that was defined during the training process. We pass the calculated features from the transformed materialized view into this function, and the prediction results populate the delay_probability column.

This final output is persisted into the consignment_predictions table, and Step Functions orchestrates the ongoing incremental data load into this target table. We use a table for the final output, instead of a materialized view, because ML predictions involve randomness and may produce non-deterministic results. Using a table gives us more control over how data is loaded.

See the following code:

CREATE TABLE consignment_predictions AS
SELECT *, fnc_delay_probability(
day_of_week, "hour", days_to_deliver, delivery_distance) delay_probability
FROM consignment_transformed;

Create an Amazon Managed Grafana dashboard

We use Amazon Managed Grafana to create a near real-time logistics dashboard. Amazon Managed Grafana is a fully managed service that makes it easy to create, configure, and share interactive dashboards and charts for monitoring your data. We can also use Grafana to set up alerts and notifications based on specific conditions or thresholds, allowing you to quickly identify and respond to issues.

The high-level steps in setting up the dashboard are as follows:

  1. Create a Grafana workspace.
  2. Set up Grafana authentication using AWS IAM Identity Center (successor to AWS Single Sign-On) or using direct SAML integration.
  3. Configure Amazon Redshift as a Grafana data source.
  4. Import the JSON file for the near real-time logistics dashboard.

A more detailed set of instructions is available in the GitHub repository for your reference.

Clean up

To avoid ongoing charges, delete the resources deployed. Access the Amazon Linux 2 environment and run the AWS CDK destroy command. Delete the Grafana objects related to this deployment.

cd amazon-redshift-streaming-workshop
source .venv/bin/activate
cdk destroy --all


In this post, we showed how easy it is to build a near real-time logistics dashboard using Amazon Redshift and Amazon Managed Grafana. We created an end-to-end modern data pipeline using only SQL. This shows how Amazon Redshift is a powerful platform for democratizing your data—it enables a wide range of users, including business analysts, data scientists, and others, to work with and analyze data without requiring specialized technical skills or expertise.

We encourage you to explore what else can be achieved with Amazon Redshift and Amazon Managed Grafana. We also recommend you visit the AWS Big Data Blog for other useful blog posts on Amazon Redshift.

About the Author

Paul Villena is an Analytics Solutions Architect at AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interest are infrastructure as code, serverless technologies, and coding in Python.

How BookMyShow saved 80% in costs by migrating to an AWS modern data architecture

Post Syndicated from Mahesh Vandi Chalil original https://aws.amazon.com/blogs/big-data/how-bookmyshow-saved-80-in-costs-by-migrating-to-an-aws-modern-data-architecture/

This is a guest post co-authored by Mahesh Vandi Chalil, Chief Technology Officer of BookMyShow.

BookMyShow (BMS), a leading entertainment company in India, provides an online ticketing platform for movies, plays, concerts, and sporting events. Selling up to 200 million tickets on an annual run rate basis (pre-COVID) to customers in India, Sri Lanka, Singapore, Indonesia, and the Middle East, BookMyShow also offers an online media streaming service and end-to-end management for virtual and on-ground entertainment experiences across all genres.

The pandemic gave BMS the opportunity to migrate and modernize our 15-year-old analytics solution to a modern data architecture on AWS. The architecture is modern, secure, governed, and cost-optimized, with the ability to scale to petabytes. BMS migrated and modernized from on-premises and other cloud platforms to AWS in just four months. This project was run in parallel with our application migration project and achieved 90% cost savings in storage and 80% cost savings in analytics spend.

The BMS analytics platform caters to business needs for sales and marketing, finance, and business partners (e.g., cinemas and event owners), and provides application functionality for audience, personalization, pricing, and data science teams. The prior analytics solution had multiple copies of data, for a total of over 40 TB, with approximately 80 TB of data in other cloud storage. Data was stored on‑premises and in the cloud in various data stores. Growing organically, the teams had the freedom to choose their technology stack for individual projects, which led to the proliferation of various tools, technology, and practices. Individual teams for personalization, audience, data engineering, data science, and analytics used a variety of products for ingestion, data processing, and visualization.

This post discusses BMS's migration and modernization journey, and how BMS, AWS, and the AWS Partner Minfy Technologies team worked together to successfully complete the migration in four months while saving costs. The migration tenets using the AWS modern data architecture made the project a huge success.

Challenges in the prior analytics platform

  • Varied Technology: Multiple teams used various products, languages, and versions of software.
  • Larger Migration Project: Because the analytics modernization was a parallel project with application migration, planning was crucial in order to consider the changes in core applications and project timelines.
  • Resources: Experienced resource churn from the application migration project, and had very little documentation of current systems.
  • Data: Had multiple copies of data and no single source of truth; each data store provided a view for the business unit.
  • Ingestion Pipelines: Complex data pipelines moved data across various data stores at varied frequencies. We had multiple approaches in place to ingest data to Cloudera, via over 100 Kafka consumers from transaction systems and MQTT (Message Queuing Telemetry Transport) for clickstreams, stored procedures, and Spark jobs. We had approximately 100 jobs for data ingestion across Spark, Alteryx, Beam, NiFi, and more.
  • Hadoop Clusters: The Hadoop clusters were configured on large, dedicated hardware, incurring fixed costs. The on-premises Cloudera setup catered to most of the data engineering, audience, and personalization batch processing workloads. Teams had their own implementations of HBase and Hive for our audience and personalization applications.
  • Data warehouse: The data engineering team used TiDB as their on-premises data warehouse. However, each consumer team had their own perspective of data needed for analysis. As this siloed architecture evolved, it resulted in expensive storage and operational costs to maintain these separate environments.
  • Analytics Database: The analytics team used data sourced from other transactional systems and denormalized data. The team had their own extract, transform, and load (ETL) pipeline, using Alteryx with a visualization tool.

Migration tenets followed which led to project success:

  • Prioritize by business functionality.
  • Apply best practices when building a modern data architecture from Day 1.
  • Move only required data, canonicalize the data, and store it in the most optimal format in the target. Remove data redundancy as much as possible. Mark areas for future optimization where changes would be intrusive.
  • Build the data architecture while keeping data formats, volumes, governance, and security in mind.
  • Simplify ELT and processing jobs by categorizing the jobs as rehosted, rewritten, and retired. Finalize canonical data format, transformation, enrichment, compression, and storage format as Parquet.
  • Rehost machine learning (ML) jobs that were critical for business.
  • Work backward to achieve our goals, and clear roadblocks and alter decisions to move forward.
  • Use serverless options as a first option and pay per use. Assess the cost and effort for rearchitecting to select the right approach. Execute a proof of concept to validate this for each component and service.

Strategies applied to succeed in this migration:

  • Team – We created a unified team with people from data engineering, analytics, and data science as part of the analytics migration project. Site reliability engineering (SRE) and application teams were involved when critical decisions were needed regarding data or timelines. The analytics, data engineering, and data science teams spent considerable time planning, understanding the code, and iteratively looking at the existing data sources, data pipelines, and processing jobs. The AWS team, with the partner team from Minfy Technologies, helped BMS arrive at a migration plan after a proof of concept for each of the components in data ingestion, data processing, data warehouse, ML, and analytics dashboards.
  • Workshops – The AWS team conducted a series of workshops and immersion days, and coached the BMS team on the technology and best practices to deploy the analytics services. The AWS team helped BMS explore the configuration and benefits of the migration approach for each scenario (data migration, data pipeline, data processing, visualization, and machine learning) via proofs of concept (POCs). The team captured the changes required in the existing code for migration. The BMS team also got acquainted with the following AWS services:
  • Proof of concept – The BMS team, with help from the partner and AWS team, implemented multiple proofs of concept to validate the migration approach:
    • Performed batch processing of Spark jobs in Amazon EMR, in which we checked the runtime, required code changes, and cost.
    • Ran clickstream analysis jobs in Amazon EMR, testing the end-to-end pipeline. The team conducted proofs of concept on AWS IoT Core for the MQTT protocol and streaming to Amazon S3.
    • Migrated ML models to Amazon SageMaker and orchestrated with Amazon MWAA.
    • Created sample QuickSight reports and dashboards, in which features and time to build were assessed.
    • Configured for key scenarios for Amazon Redshift, in which time for loading data, query performance, and cost were assessed.
  • Effort vs. cost analysis – The team performed the following assessments:
    • Compared the ingestion pipelines and the differences in data structure in each store, assessed each data source against current business needs, and estimated the effort for preprocessing the data before migration, migrating the data to Amazon S3, and setting up change data capture (CDC) from the migrated applications on AWS.
    • Assessed the effort to migrate approximately 200 jobs, determined which jobs were redundant or needed improvement from a functional perspective, and completed a migration list for the target state. Because modernizing the MQTT workflow code to serverless would have been time-consuming, we decided to rehost it on Amazon Elastic Compute Cloud (Amazon EC2) and defer the modernization to Amazon Kinesis to the next phase.
    • Reviewed over 400 reports and dashboards, prioritized development in phases, and reassessed business user needs.

AWS cloud services chosen for proposed architecture:

  • Data lake – We used Amazon S3 as the data lake to store a single source of truth for all raw and processed data, thereby reducing duplicate copies of data and storage costs.
  • Ingestion – Because we had multiple sources of truth in the current architecture, we arrived at a common structure before migration to Amazon S3, and existing pipelines were modified to do preprocessing. These one-time preprocessing jobs were run in Cloudera, because the source data was on-premises, and on Amazon EMR for data in the cloud. We designed new data pipelines for ingestion from transactional systems on the AWS cloud using AWS Glue ETL.
  • Processing – Processing jobs were segregated based on runtime into two categories: batch and near-real time. Batch processes were further divided into transient Amazon EMR clusters with varying runtimes and Hadoop application requirements like HBase. Near-real-time jobs were provisioned in an Amazon EMR permanent cluster for clickstream analytics, and a data pipeline from transactional systems. We adopted a serverless approach using AWS Glue ETL for new data pipelines from transactional systems on the AWS cloud.
  • Data warehouse – We chose Amazon Redshift as our data warehouse, and planned on how the data would be distributed based on query patterns.
  • Visualization – We built the reports in Amazon QuickSight in phases and prioritized them based on business demand. We discussed with business users their current needs and identified the immediate reports required. We defined the phases of report and dashboard creation and built the reports in Amazon QuickSight. We plan to use embedded reports for external users in the future.
  • Machine learning – Custom ML models were deployed on Amazon SageMaker. Existing Airflow DAGs were migrated to Amazon MWAA.
  • Governance, security, and compliance – Governance with AWS Lake Formation was adopted from Day 1. We configured the AWS Glue Data Catalog to reference data used as sources and targets. We had to comply with Payment Card Industry (PCI) guidelines because payment information was in the data lake, so we put the necessary security policies in place.

Solution overview

BMS modern data architecture

The following diagram illustrates our modern data architecture.

The architecture includes the following components:

  1. Source systems – These include the following:
    • Data from transactional systems stored in MariaDB (booking and transactions).
    • User interaction clickstream data via Kafka consumers to DataOps MariaDB.
    • Members and seat allocation information from MongoDB.
    • SQL Server for specific offers and payment information.
  2. Data pipeline – Spark jobs on an Amazon EMR permanent cluster process the clickstream data from Kafka clusters.
  3. Data lake – Data from source systems was stored in their respective Amazon S3 buckets, with prefixes for optimized data querying. For Amazon S3, we followed a hierarchy to store raw, summarized, and team- or service-related data in different parent folders as per the source and type of data. Lifecycle policies were added to the logs and temp folders of different services as per the teams' requirements.
  4. Data processing – Transient Amazon EMR clusters are used for processing data into a curated format for the audience, personalization, and analytics teams. Small file merger jobs merge the clickstream data to a larger file size, which saved costs for one-time queries.
  5. Governance – AWS Lake Formation enables the usage of AWS Glue crawlers to capture the schema of data stored in the data lake and version changes in the schema. The Data Catalog and security policy in AWS Lake Formation enable access to data for roles and users in Amazon Redshift, Amazon Athena, Amazon QuickSight, and data science jobs. AWS Glue ETL jobs load the processed data to Amazon Redshift at scheduled intervals.
  6. Queries – The analytics team used Amazon Athena to run one-time queries from business teams against the data lake. Because report development is in phases, Amazon Athena was also used for exporting data.
  7. Data warehouse – Amazon Redshift was used as the data warehouse, where the reports for the sales teams, management, and third parties (i.e., theaters and events) are processed and stored for quick retrieval. Views to analyze the total sales, movie sale trends, member behavior, and payment modes are configured here. We use materialized views for denormalized tables, different schemas for metadata, and transactional and behavior data.
  8. Reports – We used Amazon QuickSight reports for various business, marketing, and product use cases.
  9. Machine learning – Some of the models deployed on Amazon SageMaker are as follows:
    • Content popularity – Decides the recommended content for users.
    • Live event popularity – Calculates the popularity of live entertainment events in different regions.
    • Trending searches – Identifies trending searches across regions.
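As a sketch of the data lake conventions in step 3, the following shows a hypothetical partitioned prefix hierarchy and an S3 lifecycle rule in the shape accepted by `put_bucket_lifecycle_configuration`. The layer names, prefixes, and expiration window are illustrative assumptions, not the actual BMS configuration.

```python
from datetime import date

# Hypothetical sketch of the raw/summarized/team prefix hierarchy and of an
# S3 lifecycle rule that expires temporary and log data. Layer names,
# prefixes, and day counts are illustrative assumptions only.

def object_key(layer: str, source: str, dataset: str, dt: date) -> str:
    """Build a partitioned S3 key, e.g. 'raw/bookings/transactions/year=2022/month=07/day=15/'."""
    return f"{layer}/{source}/{dataset}/year={dt:%Y}/month={dt:%m}/day={dt:%d}/"

def expire_rule(prefix: str, days: int) -> dict:
    """One rule in the shape accepted by S3 put_bucket_lifecycle_configuration."""
    return {
        "ID": f"expire-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }
```

A rule like `expire_rule("logs/emr/", 30)` would be placed in the `Rules` list of the bucket's lifecycle configuration to expire EMR logs after 30 days.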


Migration execution steps

We standardized tools, services, and processes for data engineering, analytics, and data science:

  • Data lake
    • Identified the source data to be migrated from Archival DB, BigQuery, TiDB, and the analytics database.
    • Built a canonical data model that catered to multiple business teams and reduced the copies of data, and therefore storage and operational costs. Modified existing jobs to facilitate migration to a canonical format.
    • Identified the source systems, capacity required, anticipated growth, owners, and access requirements.
    • Ran the bulk data migration to Amazon S3 from various sources.
  • Ingestion
    • Transaction systems – Retained the existing Kafka queues and consumers.
    • Clickstream data – Successfully conducted a proof of concept to use AWS IoT Core for MQTT protocol. But because we needed to make changes in the application to publish to AWS IoT Core, we decided to implement it as part of mobile application modernization at a later time. We decided to rehost the MQTT server on Amazon EC2.
  • Processing
    • Listed the data pipelines relevant to the business and migrated them with minimal modification.
    • Categorized workloads into critical jobs, redundant jobs, or jobs that could be optimized:
      • Spark jobs were migrated to Amazon EMR.
      • HBase jobs were migrated to Amazon EMR with HBase.
      • Hive-based jobs were modified to store their metadata in the AWS Glue Data Catalog.
      • NiFi jobs were simplified and rewritten in Spark to run on Amazon EMR.
    • Configured one persistent Amazon EMR cluster for streaming the clickstream and personalization workloads, and multiple transient clusters for running all other Spark ETL or processing jobs. We used Spot Instances for task nodes to save costs, and optimized data storage with specific jobs to merge small files and convert to compressed file formats.
    • AWS Glue crawlers identified new data in Amazon S3, and AWS Glue ETL jobs transformed and uploaded the processed data to the Amazon Redshift data warehouse.
  • Data warehouse
    • Defined the data warehouse schema by categorizing the critical reports required by the business, keeping in mind the current workload and the reports required in the future.
    • Defined the staging area for data loaded incrementally into Amazon Redshift, created materialized views, and tuned queries based on usage. The transaction and primary metadata are stored in Amazon Redshift to cater to all data analysis and reporting requirements. We created materialized views and denormalized tables in Amazon Redshift to use as data sources for Amazon QuickSight dashboards and segmentation jobs, respectively.
    • Used the Amazon Redshift cluster optimally by loading the last two years of data into Amazon Redshift, and used Amazon Redshift Spectrum to query historical data through external tables. This helped balance the usage and cost of the Amazon Redshift cluster.
  • Visualization
    • Amazon QuickSight dashboards were created for the sales and marketing team in Phase 1:
      • Sales summary report – An executive summary dashboard to get an overview of sales across the country by region, city, movie, theatre, genre, and more.
      • Live entertainment – A dedicated report for live entertainment vertical events.
      • Coupons – A report for coupons purchased and redeemed.
      • BookASmile – A dashboard to analyze the data for BookASmile, a charity initiative.
  • Machine learning
    • Listed the ML workloads to be migrated based on current business needs.
    • Priority ML processing jobs were deployed on Amazon EMR. Models were modified to use Amazon S3 as source and target, and new APIs were exposed to use the functionality. ML models were deployed on Amazon SageMaker for movies, live event clickstream analysis, and personalization.
    • Existing artifacts in Airflow orchestration were migrated to Amazon MWAA.
  • Security
    • AWS Lake Formation was the foundation of the data lake, with the AWS Glue Data Catalog as the foundation for the central catalog for the data stored in Amazon S3. This provided access to the data by various functionalities, including the audience, personalization, analytics, and data science teams.
    • Personally identifiable information (PII) and payment data was stored in the data lake and data warehouse, so we had to comply with PCI guidelines. Encryption of data at rest and in transit was configured at each service level (Amazon S3, AWS Glue Data Catalog, Amazon EMR, AWS Glue, Amazon Redshift, and QuickSight). Clear roles, responsibilities, and access permissions for different user groups and privileges were listed and configured in AWS Identity and Access Management (IAM) and in the individual services.
    • Existing single sign-on (SSO) integration with Microsoft Active Directory was used for Amazon QuickSight user access.
  • Automation
    • We used AWS CloudFormation for the creation and modification of all the core and analytics services.
    • AWS Step Functions was used to orchestrate Spark jobs on Amazon EMR.
    • Scheduled jobs were configured in AWS Glue for uploading data in Amazon Redshift based on business needs.
    • Monitoring of the analytics services was done using Amazon CloudWatch metrics, and right-sizing of instances and configuration was achieved. Spark job performance on Amazon EMR was analyzed using the native Spark logs and Spark user interface (UI).
    • Lifecycle policies were applied to the data lake to optimize the data storage costs over time.
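As an illustration of the small-file merger jobs mentioned under Processing, the batching logic amounts to packing consecutive files into groups close to a target output size. The real jobs ran as Spark on Amazon EMR; this pure-Python sketch only shows the planning step and is not BMS's actual implementation.

```python
# Minimal sketch of small-file merge planning: greedily group consecutive
# clickstream files into batches of at most target_bytes. A single file
# larger than the target gets its own batch. Illustrative only; the real
# merge jobs ran as Spark on Amazon EMR.

def plan_merges(file_sizes, target_bytes):
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        # Close the current batch if adding this file would exceed the target
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Fewer, larger objects reduce per-request overhead when the merged data is later scanned by one-time queries, which is where the cost savings came from.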

Benefits of a modern data architecture

A modern data architecture offered us the following benefits:

  • Scalability – We moved from a fixed infrastructure to the minimal infrastructure required, with configuration to scale on demand. Services like Amazon EMR and Amazon Redshift enable us to do this with just a few clicks.
  • Agility – We use purpose-built managed services instead of reinventing the wheel. Automation and monitoring were key considerations, which enable us to make changes quickly.
  • Serverless – Adoption of serverless services like Amazon S3, AWS Glue, Amazon Athena, AWS Step Functions, and AWS Lambda supports us when our business has sudden spikes from new movies or events being launched.
  • Cost savings – Our storage size was reduced by 90%. Our overall spend on analytics and ML was reduced by 80%.


In this post, we showed you how a modern data architecture on AWS helped BMS to easily share data across organizational boundaries. This allowed BMS to make decisions with speed and agility at scale; ensure compliance via unified data access, security, and governance; and to scale systems at a low cost without compromising performance. Working with the AWS and Minfy Technologies teams helped BMS choose the correct technology services and complete the migration in four months. BMS achieved the scalability and cost-optimization goals with this updated architecture, which has set the stage for innovation using graph databases and enhanced our ML projects to improve customer experience.

About the Authors

Mahesh Vandi Chalil is Chief Technology Officer at BookMyShow, India's leading entertainment destination. Mahesh has over two decades of global experience and is passionate about building scalable products that delight customers, motivating his team to keep innovation as its top goal. He invests his energies in creating and nurturing the next generation of technology leaders and entrepreneurs, both within the organization and outside of it. He is a proud husband and father of two daughters, and plays cricket during his leisure time.

Priya Jathar is a Solutions Architect working in the Digital Native Business segment at AWS. She has more than two decades of IT experience, with expertise in application development, databases, and analytics. She is a builder who enjoys innovating with new technologies to achieve business goals, and currently helps customers migrate, modernize, and innovate in the cloud. In her free time, she likes to paint and hone her gardening and cooking skills.

Vatsal Shah is a Senior Solutions Architect at AWS based out of Mumbai, India. He has more than nine years of industry experience, including leadership roles in product engineering, SRE, and cloud architecture. He currently focuses on enabling large startups to streamline their cloud operations and help them scale on the cloud. He also specializes in AI and Machine Learning use cases.

Accelerate orchestration of an ELT process using AWS Step Functions and Amazon Redshift Data API

Post Syndicated from Poulomi Dasgupta original https://aws.amazon.com/blogs/big-data/accelerate-orchestration-of-an-elt-process-using-aws-step-functions-and-amazon-redshift-data-api/

Extract, Load, and Transform (ELT) is a modern design strategy where raw data is first loaded into the data warehouse and then transformed with familiar Structured Query Language (SQL) semantics leveraging the power of massively parallel processing (MPP) architecture of the data warehouse. When you use an ELT pattern, you can also use your existing SQL workload while migrating from your on-premises data warehouse to Amazon Redshift. This eliminates the need to rewrite relational and complex SQL workloads into a new framework. With Amazon Redshift, you can load, transform, and enrich your data efficiently using familiar SQL with advanced and robust SQL support, simplicity, and seamless integration with your existing SQL tools. When you adopt an ELT pattern, a fully automated and highly scalable workflow orchestration mechanism will help to minimize the operational effort that you must invest in managing the pipelines. It also ensures the timely and accurate refresh of your data warehouse.

AWS Step Functions is a low-code, serverless, visual workflow service where you can orchestrate complex business workflows with an event-driven framework and easily develop repeatable and dependent processes. It can ensure that the long-running, multiple ELT jobs run in a specified order and complete successfully instead of manually orchestrating those jobs or maintaining a separate application.

Amazon DynamoDB is a fast, flexible NoSQL database service for single-digit millisecond performance at any scale.

This post explains how to use AWS Step Functions, Amazon DynamoDB, and Amazon Redshift Data API to orchestrate the different steps in your ELT workflow and process data within the Amazon Redshift data warehouse.

Solution overview

In this solution, we will orchestrate an ELT process using AWS Step Functions. As part of the ELT process, we will refresh the dimension and fact tables at regular intervals from staging tables, which ingest data from the source. We will maintain the current state of the ELT process (e.g., Running or Ready) in an audit table maintained in Amazon DynamoDB. AWS Step Functions allows you to directly call the Data API from a state machine, reducing the complexity of running the ELT pipeline. For loading the dimension and fact tables, we will use the Amazon Redshift Data API from AWS Lambda. We will use Amazon EventBridge to schedule the state machine to run at a desired interval based on the customer's SLA.

For a given ELT process, we will set up a JobID in a DynamoDB audit table and set the JobState as “Ready” before the state machine runs for the first time. The state machine performs the following steps:

  1. The first step in the Step Functions workflow is to pass the JobID as input to the process; the CloudFormation template configures this as JobID 101 in both Step Functions and DynamoDB by default.
  2. The next step is to fetch the current JobState for the given JobID by querying the DynamoDB audit table from within a Lambda function.
  3. If JobState is "Running," then it indicates that the previous iteration is not completed yet, and the process should end.
  4. If the JobState is "Ready," then it indicates that the previous iteration completed successfully and the process is ready to start. So, the next step will be to update the DynamoDB audit table to change the JobState to "Running" and JobStart to the current time for the given JobID, using the DynamoDB API within a Lambda function.
  5. The next step will be to start the dimension table load from the staging table data within Amazon Redshift, using the Amazon Redshift Data API within a Lambda function. In order to achieve that, we can either call a stored procedure using the Amazon Redshift Data API, or we can run a series of SQL statements synchronously using the Amazon Redshift Data API within a Lambda function.
  6. In a typical data warehouse, multiple dimension tables are loaded in parallel at the same time before the fact table gets loaded. Using Parallel flow in Step Functions, we will load two dimension tables at the same time using Amazon Redshift Data API within a Lambda function.
  7. Once the load is completed for both the dimension tables, we will load the fact table as the next step using Amazon Redshift Data API within a Lambda function.
  8. As the load completes successfully, the last step would be to update the DynamoDB audit table to change the JobState to “Ready” and JobEnd to the current time for the given JobID, using DynamoDB Data API within a Lambda function.
    Solution Overview
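The Ready/Running handshake described in steps 2-4 and 8 can be sketched in a few lines. A plain Python dict stands in for the DynamoDB audit table here, so this is an illustrative simulation only; the real solution reads and updates the item from Lambda functions.

```python
from datetime import datetime, timezone

# Simulation of the DynamoDB audit-table handshake (steps 2-4 and 8).
# A dict stands in for the audit table; illustrative only.

audit = {101: {"JobState": "Ready", "JobStart": None, "JobEnd": None}}

def try_start(job_id, now=None):
    """Steps 2-4: start only if the previous iteration finished ('Ready')."""
    item = audit[job_id]
    if item["JobState"] == "Running":
        return False  # previous run still in flight: exit without doing work
    item["JobState"] = "Running"
    item["JobStart"] = now or datetime.now(timezone.utc)
    return True

def finish(job_id, now=None):
    """Step 8: mark the iteration complete so the next run can start."""
    item = audit[job_id]
    item["JobState"] = "Ready"
    item["JobEnd"] = now or datetime.now(timezone.utc)
```

This guard is what prevents overlapping runs when the EventBridge schedule fires before a long iteration has finished.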

Components and dependencies

The following architecture diagram highlights the end-to-end solution using AWS services:

Architecture Diagram

Before diving deeper into the code, let’s look at the components first:

  • AWS Step Functions – You can orchestrate a workflow by creating a State Machine to manage failures, retries, parallelization, and service integrations.
  • Amazon EventBridge – You can run your state machine on a daily schedule by creating a Rule in Amazon EventBridge.
  • AWS Lambda – You can use a Lambda function to call the Amazon Redshift Data API or the DynamoDB API.
  • Amazon DynamoDB – Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB is extremely efficient in running updates, which improves the performance of metadata management for customers with strict SLAs.
  • Amazon Redshift – Amazon Redshift is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, easy, and secure analytics at scale.
  • Amazon Redshift Data API – You can access your Amazon Redshift database using the built-in Amazon Redshift Data API. Using this API, you can access Amazon Redshift data with web services–based applications, including AWS Lambda.
  • DynamoDB API – You can access your Amazon DynamoDB tables from a Lambda function by importing boto3.
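The Lambda functions in this solution follow a common Redshift Data API pattern: submit a statement with `execute_statement`, then poll `describe_statement` until it completes. The sketch below takes the client as a parameter so it can be exercised without AWS; in a Lambda function you would pass `boto3.client("redshift-data")`, and the cluster, database, and user values shown in the usage note are placeholders.

```python
import time

# Typical Redshift Data API usage from Lambda: submit a SQL statement and
# poll until it reaches a terminal status. The client is injected so the
# sketch can run without AWS credentials; parameter values are illustrative.

def run_sql(client, sql, cluster, database, db_user, poll_seconds=1):
    stmt = client.execute_statement(
        ClusterIdentifier=cluster, Database=database, DbUser=db_user, Sql=sql
    )
    while True:
        desc = client.describe_statement(Id=stmt["Id"])
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            if desc["Status"] != "FINISHED":
                raise RuntimeError(desc.get("Error", desc["Status"]))
            return desc
        time.sleep(poll_seconds)
```

In Lambda this would be called as, for example, `run_sql(boto3.client("redshift-data"), "CALL load_dim_customer()", "demo-cluster", "dev", "awsuser")`.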


Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  1. An AWS account.
  2. An Amazon Redshift cluster.
  3. An Amazon Redshift customizable IAM service role with the following policies:
    • AmazonS3ReadOnlyAccess
    • AmazonRedshiftFullAccess
  4. The above IAM role associated with the Amazon Redshift cluster.

Deploy the CloudFormation template

To set up the ELT orchestration demo, the steps are as follows:

  1. Sign in to the AWS Management Console.
  2. Click on Launch Stack.

  3. Click Next.
  4. Enter a suitable name in Stack name.
  5. Provide the information for the Parameters as detailed in the following table.
CloudFormation template parameter Allowed values Description
RedshiftClusterIdentifier Amazon Redshift cluster identifier Enter the Amazon Redshift cluster identifier
DatabaseUserName Database user name in Amazon Redshift cluster Amazon Redshift database user name which has access to run SQL Script
DatabaseName Amazon Redshift database name Name of the Amazon Redshift primary database where SQL script would be run
RedshiftIAMRoleARN Valid IAM role ARN attached to Amazon Redshift cluster AWS IAM role ARN associated with the Amazon Redshift cluster

Create Stack 2

  6. Click Next. On the following page, accept the default values and click Next. On the last page, check the box to acknowledge that resources might be created, and click Create stack.
    Create Stack 3
  7. Monitor the progress of the stack creation and wait until it is complete.
  8. The stack creation should complete within approximately 5 minutes.
  9. Navigate to the Amazon Redshift console.
  10. Launch Amazon Redshift Query Editor V2 and connect to your cluster.
  11. Browse to the database name provided in the CloudFormation template parameters (for example, dev), choose the public schema, and expand Tables. You should see the tables as shown below.
    Redshift Query Editor v2 1
  12. Validate the sample data by running the following SQL query and confirm the row counts match the screenshot above.
select 'customer',count(*) from public.customer
union all
select 'fact_yearly_sale',count(*) from public.fact_yearly_sale
union all
select 'lineitem',count(*) from public.lineitem
union all
select 'nation',count(*) from public.nation
union all
select 'orders',count(*) from public.orders
union all
select 'supplier',count(*) from public.supplier

Run the ELT orchestration

  1. After you deploy the CloudFormation template, navigate to the stack detail page. On the Resources tab, choose the link for DynamoDBETLAuditTable to be redirected to the DynamoDB console.
  2. Navigate to Tables and click on table name beginning with <stackname>-DynamoDBETLAuditTable. In this demo, the stack name is DemoETLOrchestration, so the table name will begin with DemoETLOrchestration-DynamoDBETLAuditTable.
  3. The table details page opens. Click on Explore table items.
  4. Here you can see the current status of the job, which will be in Ready status.
    DynamoDB 1
  5. Navigate again to stack detail page on the CloudFormation console. On the Resources tab, choose the link for RedshiftETLStepFunction to be redirected to the Step Functions console.
    CFN Stack Resources
  6. Click Start Execution. When it successfully completes, all steps will be marked as green.
    Step Function Running
  7. While the job is running, navigate back to DemoETLOrchestration-DynamoDBETLAuditTable in the DynamoDB console screen. You will see JobState as Running with JobStart time.
    DynamoDB 2
  8. After Step Functions completes, JobState will be changed to Ready with JobStart and JobEnd times.
    DynamoDB 3

Handling failure

In the real world, the ELT process can sometimes fail due to unexpected data anomalies or object-related issues. In that case, the Step Functions execution will also fail, with the failed step marked in red as shown in the screenshot below:
Step Function 2

Once you identify and fix the issue, follow these steps to restart the state machine:

  1. Navigate to the DynamoDB table beginning with DemoETLOrchestration-DynamoDBETLAuditTable. Click on Explore table items and select the row with the specific JobID for the failed job.
  2. Go to Action and select Edit item to modify the JobState to Ready as shown below:
    DynamoDB 4
  3. Follow steps 5 and 6 under the “Run the ELT orchestration” section to restart execution of the step function.

Validate the ELT orchestration

The step function loads the dimension tables public.supplier and public.customer and the fact table public.fact_yearly_sale. To validate the orchestration, the process steps are as follows:

  1. Navigate to the Amazon Redshift console.
  2. Launch Amazon Redshift query editor v2 and connect to your cluster.
  3. Browse to the database name provided in the CloudFormation template parameters (for example, dev), and choose the public schema.
  4. Validate the data loaded by Step Functions by running the following SQL query and confirm the row counts match as follows:
select 'customer',count(*) from public.customer
union all
select 'fact_yearly_sale',count(*) from public.fact_yearly_sale
union all
select 'supplier',count(*) from public.supplier

Redshift Query Editor v2 2

Schedule the ELT orchestration

The steps are as follows to schedule the Step Functions:

  1. Navigate to the Amazon EventBridge console and choose Create rule.
    Event Bridge 1
  2. Under Name, enter a meaningful name, for example, Trigger-Redshift-ELTStepFunction.
  3. Under Event bus, choose default.
  4. Under Rule Type, select Schedule.
  5. Click on Next.
    Event Bridge 2
  6. Under Schedule pattern, select A schedule that runs at a regular rate, such as every 10 minutes.
  7. Under Rate expression, enter Value as 5 and choose Unit as Minutes.
  8. Click on Next.
    Event Bridge 3
  9. Under Target types, choose AWS service.
  10. Under Select a Target, choose Step Functions state machine.
  11. Under State machine, choose the step function created by the CloudFormation template.
  12. Under Execution role, select Create a new role for this specific resource.
  13. Click on Next.
    Event Bridge 4
  14. Review the rule parameters and click on Create rule.

After the rule has been created, it will automatically trigger the step function every 5 minutes to perform ELT processing in Amazon Redshift.

Clean up

Please note that deploying a CloudFormation template incurs cost. To avoid incurring future charges, delete the resources you created as part of the CloudFormation stack by navigating to the AWS CloudFormation console, selecting the stack, and choosing Delete.


In this post, we described how to easily implement a modern, serverless, highly scalable, and cost-effective ELT workflow orchestration process in Amazon Redshift using AWS Step Functions, Amazon DynamoDB, and the Amazon Redshift Data API. As an alternate solution, you can also use Amazon Redshift for metadata management instead of Amazon DynamoDB. As part of this demo, we show how a single job entry in DynamoDB gets updated for each run, but you can also modify the solution to maintain a separate audit table with the history of each run for each job, which would help with debugging or historical tracking purposes. Step Functions manages failures, retries, parallelization, service integrations, and observability so your developers can focus on higher-value business logic. Step Functions can integrate with Amazon SNS to send notifications in case of failure or success of the workflow. Please follow this AWS Step Functions documentation to implement the notification mechanism.
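As a sketch of the notification mechanism mentioned above, a Catch clause can route any task failure to an SNS publish task. The Amazon States Language fragment below is expressed as a Python dict so it can be checked programmatically; the topic ARN and state names are placeholder assumptions, not part of the demo's CloudFormation template.

```python
# Sketch of wiring an SNS failure notification into the state machine:
# a Catch on a task routes errors to a task that calls the Step Functions
# SNS publish integration. Topic ARN and state names are placeholders.

FAILURE_NOTIFY = {
    "NotifyFailure": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {
            "TopicArn": "arn:aws:sns:us-east-1:123456789012:elt-alerts",
            "Message.$": "$.error",
        },
        "End": True,
    }
}

def with_failure_catch(task_state):
    """Return a copy of a task state that routes any error to NotifyFailure."""
    state = dict(task_state)
    state["Catch"] = [
        {"ErrorEquals": ["States.ALL"],
         "ResultPath": "$.error",
         "Next": "NotifyFailure"}
    ]
    return state
```

Applying `with_failure_catch` to each load task, and adding the `NotifyFailure` state to the machine's `States` map, gives a failure alert without changing the happy path.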

About the Authors

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked on building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Run a popular benchmark on Amazon Redshift Serverless easily with AWS Data Exchange

Post Syndicated from Jon Roberts original https://aws.amazon.com/blogs/big-data/run-a-popular-benchmark-on-amazon-redshift-serverless-easily-with-aws-data-exchange/

Amazon Redshift is a fast, easy, secure, and economical cloud data warehousing service designed for analytics. AWS announced Amazon Redshift Serverless general availability in July 2022, providing an easier experience to operate Amazon Redshift. Amazon Redshift Serverless makes it simple to run and scale analytics without having to manage your data warehouse infrastructure. Amazon Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use.

Amazon Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge), including queries that access external data in open file formats like CSV and Parquet stored in Amazon S3. For more information on RPU pricing, refer to Amazon Redshift pricing.
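The billing model described above can be made concrete with a small calculation. The price per RPU-hour varies by Region, so the figure used in the example is an illustrative assumption only; refer to the pricing page for current rates.

```python
# Worked example of Redshift Serverless billing: charged per second in
# RPU-hours, with a 60-second minimum per billed period. The price per
# RPU-hour below is an illustrative assumption; it varies by Region.

def rpu_cost(base_rpus, seconds, price_per_rpu_hour):
    billed_seconds = max(seconds, 60)          # 60-second minimum charge
    rpu_hours = base_rpus * billed_seconds / 3600
    return rpu_hours * price_per_rpu_hour

# For example, 128 RPUs running for 90 seconds at an assumed $0.36 per
# RPU-hour: 128 * 90 / 3600 = 3.2 RPU-hours -> 3.2 * 0.36 = $1.152
```

Note that a 10-second query is billed the same as a 60-second one because of the minimum charge.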

AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. With AWS Data Exchange for Amazon Redshift, customers can start querying, evaluating, analyzing, and joining third-party data with their own first-party data without requiring any extracting, transforming, and loading (ETL). Data providers can list and offer products containing Amazon Redshift datasets in the AWS Data Exchange catalog, granting subscribers direct, read-only access to the data stored in Amazon Redshift. This feature empowers customers to quickly query, analyze, and build applications with these third-party datasets.

TPC-DS is a commonly used benchmark for measuring the query performance of data warehouse solutions such as Amazon Redshift. The benchmark is useful in proving the query capabilities of executing simple to complex queries in a timely manner. It is also used to measure the performance of different database configurations, different concurrent workloads, and also against other database products.

This blog post walks you through the steps you’ll need to set up Amazon Redshift Serverless and run the SQL queries derived from the TPC-DS benchmark against data from the AWS Data Exchange.

Solution overview

We will get started by creating an Amazon Redshift Serverless workgroup and namespace. A namespace is a collection of database objects and users while a workgroup is a collection of compute resources. To simplify executing the benchmark queries, a Linux EC2 instance will also be deployed.

Next, we will use a GitHub repo containing the TPC-DS-derived queries. The TPC-DS benchmark is frequently used for evaluating the performance and functionality of cloud data warehouses. The TPC-DS benchmark includes additional steps and requirements for a run to be considered official, but for this blog post, we are focused only on executing the SQL SELECT queries from the benchmark.

The last component of the solution is data. The TPC-DS benchmark includes binaries for generating data, but this is time-consuming to run. We have avoided this step by generating the data ahead of time and making it freely available in AWS Data Exchange.

Automated setup: CloudFormation


Click the Launch Stack link above to launch the CloudFormation stack, which will automate the deployment of resources needed for the demo. The template deploys the following resources in your default VPC:

  • Amazon Elastic Compute Cloud (Amazon EC2) instance with the latest version of Amazon Linux
  • Amazon Redshift Serverless workgroup and namespace
  • IAM role with redshift-serverless:GetWorkgroup action granted; this is attached to the EC2 instance so that a command line interface (CLI) command can run to complete the instance configuration
  • Security group with inbound port 22 (ssh) and connectivity between the EC2 instance and the Amazon Redshift Serverless workgroup
  • The GitHub repo is downloaded in the EC2 instance

Template parameters

  • Stack: CloudFormation term used to define all of the resources created by the template.
  • KeyName: The name of an existing key pair, used to connect by SSH to the EC2 instance. If you don’t have one already, create a key pair. For more information, see Amazon EC2 key pairs.
  • SSHLocation: The CIDR range allowed to make incoming connections to the EC2 instance. The default is 0.0.0.0/0, which means any IP address is allowed to connect by SSH to the EC2 instance, provided the client has the key pair’s private key. As a security best practice, limit this to a smaller range of IP addresses. For example, you can use a site like www.whatismyip.com to find your IP address.
  • RedshiftCapacity: This is the number of RPUs for the Amazon Redshift Serverless workgroup. The default is 128 and is recommended for analyzing the larger TPC-DS datasets. You can update the RPU capacity after deployment if you like or redeploy it with a different capacity value.
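If you want to restrict SSHLocation to just your own address, here is a minimal sketch for turning an IP address into the /32 CIDR the parameter expects. The IP below is a documentation-range placeholder; substitute your own public IP (for example, from checkip.amazonaws.com).

```shell
# Sketch: derive a /32 CIDR for the SSHLocation parameter.
# 203.0.113.25 is a placeholder from the documentation IP range.
MY_IP="203.0.113.25"

# Basic sanity check that the value looks like an IPv4 address.
if echo "$MY_IP" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
  echo "${MY_IP}/32"
else
  echo "not a valid IPv4 address: $MY_IP" >&2
  exit 1
fi
```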

Manual setup

If you choose not to use the CloudFormation template, provision the EC2 instance and Amazon Redshift Serverless manually with the following instructions.

Amazon Redshift setup

Here are the high-level steps to create an Amazon Redshift Serverless workgroup. You can get more detailed information from this News Blog post.

To create your workgroup, complete the following steps:

  1. On the Amazon Redshift console, navigate to the Amazon Redshift Serverless dashboard.
  2. Choose Create workgroup.
  3. For Workgroup name, enter a name.
  4. Choose Next.
  5. For Namespace, enter a unique name.
  6. Choose Next.
  7. Choose Create.

These steps create an Amazon Redshift Serverless workgroup with 128 RPUs. This is the default; you can easily adjust this up or down based on your workload and budget constraints.
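For reference, the console steps above roughly map to two AWS CLI calls. The sketch below only prints the commands rather than executing them (remove the echo to run them for real); the namespace and workgroup names are examples, and running them requires AWS credentials and a configured region.

```shell
# Sketch: the console workgroup/namespace creation expressed as CLI calls.
# Printed, not executed; names below are examples.
NS="tpcds-namespace"
WG="tpcds-workgroup"

echo aws redshift-serverless create-namespace --namespace-name "$NS"
echo aws redshift-serverless create-workgroup \
  --workgroup-name "$WG" --namespace-name "$NS" --base-capacity 128
```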

Linux EC2 instance setup

  • Deploy a virtual machine in the same AWS Region as your Amazon Redshift database using the Amazon Linux 2 AMI.
  • The Amazon Linux 2 AMI (64-bit x86) with the t2.micro instance type is an inexpensive and tested configuration.
  • Add the security group configured for your Amazon Redshift database to your EC2 instance.
  • Install psql with sudo yum install postgresql.x86_64 -y
  • Download this GitHub repo.
    git clone --depth 1 https://github.com/aws-samples/redshift-benchmarks /home/ec2-user/redshift-benchmarks

  • Set the following environment variables for Amazon Redshift:
    • PGHOST: This is the endpoint for the Amazon Redshift database.
    • PGPORT: This is the port the database listens on. The Amazon Redshift default is 5439.
    • PGUSER: This is your Amazon Redshift database user.
    • PGDATABASE: This is the database name where your external schemas are created. This is NOT the database created for the data share. We suggest using the default “dev” database.


export PGUSER="awsuser" 
export PGHOST="default.01.us-east-1.redshift-serverless.amazonaws.com" 
export PGPORT="5439" 
export PGDATABASE="dev"

Configure the .pgpass file to store your database credentials. The format for the .pgpass file is: hostname:port:database:user:password


default.01.us-east-1.redshift-serverless.amazonaws.com:5439:*:user1:<password>
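Note that psql ignores the password file unless its permissions are 0600. The sketch below writes a demo file via the PGPASSFILE environment variable so your real ~/.pgpass is untouched; the hostname, user, and password are placeholders.

```shell
# Sketch: create a psql credentials file and lock down its permissions.
# libpq silently skips the file if it is group- or world-readable.
PGPASSFILE="./pgpass_demo"
export PGPASSFILE

# Placeholder entry: hostname:port:database:user:password
echo "default.01.us-east-1.redshift-serverless.amazonaws.com:5439:*:user1:examplePassword" > "$PGPASSFILE"
chmod 600 "$PGPASSFILE"

ls -l "$PGPASSFILE"
```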

AWS Data Exchange setup

AWS Data Exchange provides third-party data in multiple data formats, including Amazon Redshift. You can subscribe to catalog listings in multiple storage locations like Amazon S3 and Amazon Redshift data shares. We encourage you to explore the AWS Data Exchange catalog on your own because there are many datasets available that can be used to enhance your data in Amazon Redshift.

First, subscribe to the AWS Marketplace listing for TPC-DS Benchmark Data. Select the Continue to subscribe button from the AWS Data Exchange catalog listing. After you review the offer and Terms and Conditions of the data product, choose Subscribe. Note that you will need the appropriate IAM permissions to subscribe to AWS Data Exchange on Marketplace. More information can be found at AWS managed policies for AWS Data Exchange.

TPC-DS uses 24 tables in a dimensional model that simulates a decision support system. It has store, catalog, and web sales as well as store, catalog, and web returns fact tables. It also has the dimension tables to support these fact tables.

TPC-DS includes a utility to generate data for the benchmark at a given scale factor. The smallest scale factor is 1 GB (uncompressed). Most benchmark tests for cloud warehouses are run with 3–10 TB of data because the dataset is large enough to stress the system but also small enough to complete the entire test in a reasonable amount of time.

There are six database schemas provided in the TPC-DS Benchmark Data subscription with 1; 10; 100; 1,000; 3,000; and 10,000 scale factors. The scale factor refers to the uncompressed data size measured in GB. Each schema refers to a dataset with the corresponding scale factor.

Scale factor (GB)   ADX schema    Amazon Redshift Serverless external schema
1                   tpcds1        ext_tpcds1
10                  tpcds10       ext_tpcds10
100                 tpcds100      ext_tpcds100
1,000               tpcds1000     ext_tpcds1000
3,000               tpcds3000     ext_tpcds3000
10,000              tpcds10000    ext_tpcds10000
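Because the mapping above is mechanical, the six external-schema DDL statements used later can be generated in a loop. The sketch below assumes the data share database is named tpcds_db, as in the setup steps that follow.

```shell
# Sketch: generate the external-schema DDL for each ADX scale factor.
ddl=""
for sf in 1 10 100 1000 3000 10000; do
  ddl="${ddl}create external schema ext_tpcds${sf} from redshift database tpcds_db schema tpcds${sf};
"
done
printf '%s' "$ddl"
```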

The following steps create external schemas in your Amazon Redshift Serverless database that map to the schemas found in AWS Data Exchange.

  • Log in to the EC2 instance and create a database connection.

  • Run the following query:
    select share_name, producer_account, producer_namespace from svv_datashares;

  • Use the output of this query to run the next command:
    create database tpcds_db from datashare <share_name> of account '<producer_account>' namespace '<producer_namespace>';

  • Last, you create the external schemas in Amazon Redshift:
    create external schema ext_tpcds1 from redshift database tpcds_db schema tpcds1;
    create external schema ext_tpcds10 from redshift database tpcds_db schema tpcds10;
    create external schema ext_tpcds100 from redshift database tpcds_db schema tpcds100;
    create external schema ext_tpcds1000 from redshift database tpcds_db schema tpcds1000;
    create external schema ext_tpcds3000 from redshift database tpcds_db schema tpcds3000;
    create external schema ext_tpcds10000 from redshift database tpcds_db schema tpcds10000;

  • You can now exit psql with the \q command.

TPC-DS derived benchmark

The TPC-DS derived benchmark consists of 99 queries in four broad categories:

  • Reporting queries
  • Ad hoc queries
  • Iterative OLAP queries
  • Data mining queries

In addition to running the 99 queries, the benchmark tests concurrency. During the concurrency portion of the test, there are n sessions (default of 5) that run the queries. Each session runs the 99 queries with different parameters and in slightly different order. This concurrency test stresses the resources of the database and generally takes longer to complete than just a single session running the 99 queries.

Some data warehouse products are configured to optimize single-user performance, whereas others may not have the ability to manage the workload effectively. This is a great way to demonstrate the workload management and stability of Amazon Redshift.

Since the data for each scale factor is located in a different schema, running the benchmark against each scale factor requires changing the schema you are referencing. The search_path defines which schemas to search for tables when a query contains objects without a schema included. For example:

ALTER USER <username> SET search_path=ext_tpcds3000,public;

The benchmark scripts set the search_path automatically.
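The benchmark scripts parameterize this statement by scale factor. Here is a minimal sketch of how such a statement can be assembled; the username and scale factor are placeholders, and the resulting SQL would be run through psql.

```shell
# Sketch: build the search_path statement for a given scale factor.
# Placeholders: awsuser and 3000 stand in for your own values.
user="awsuser"
sf="3000"

stmt="ALTER USER ${user} SET search_path=ext_tpcds${sf},public;"
echo "$stmt"
```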

Note: The scripts create a schema called tpcds_reports which will store the detailed output of each step of the benchmark. Each time the scripts are run, this schema will be recreated, and the latest results will be stored. If you happen to already have a schema named tpcds_reports, these scripts will drop the schema.

Running the TPC-DS derived queries

  • Connect by SSH to the EC2 instance with your key pair.
  • Change directory:
    cd ~/redshift-benchmarks/adx-tpc-ds/

  • Optionally configure the variables for the scripts in tpcds_variables.sh

Here are the default values for the variables you can set:

  • EXT_SCHEMA="ext_tpcds3000": This is the name of the external schema created that has the TPC-DS dataset. The “3000” value means the scale factor is 3000 or 3 TB (uncompressed).
  • EXPLAIN="false": If set to false, queries will run. If set to true, queries will generate explain plans rather than actually running. Each query will be logged in the log directory. Default is false.
  • MULTI_USER_COUNT="5": The number of concurrent users (0 to 20) that run the queries. The order of the queries for each user was set with dsqgen. Setting it to 0 skips the multi-user test. Default is 5.
  • Run the benchmark:
    ./rollout.sh > rollout.log 2>&1 &

TPC-DS derived benchmark results

We performed a test with the 3 TB ADX TPC-DS dataset on an Amazon Redshift Serverless workgroup with 128 RPUs. Additionally, we disabled the query result cache so that every query actually executes. This allows us to measure the performance of the database rather than its ability to serve results from cache.

The test comprises two sections. The first will run the 99 queries serially using one user while the second section will start multiple sessions based on the configuration file you set earlier. Each session will run concurrently, and each will run all 99 queries but in a different order.

The total runtime for the single-user queries was 15 minutes and 11 seconds. As shown in the following graph, the longest-running query was query 67, with an elapsed time of only 101 seconds. The average runtime was just 9.2 seconds.

1 Session TPC-DS 3TB 128 RPUs
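As a quick sanity check on those numbers, 15 minutes and 11 seconds spread across the 99 queries does work out to the quoted average:

```shell
# Sketch: verify the average query runtime for the single-user test.
total_seconds=$((15 * 60 + 11))   # 15m11s total for 99 queries

avg=$(awk -v t="$total_seconds" 'BEGIN { printf "%.1f", t / 99 }')
echo "total=${total_seconds}s avg=${avg}s"
```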

With five concurrent users, the runtime was 28 minutes and 35 seconds, which demonstrates how Amazon Redshift Serverless performs well for single-user and concurrent-user workloads.

5 Concurrent Sessions TPC-DS 3TB 128 RPUs

As you can see, it was pretty easy to deploy Amazon Redshift Serverless, subscribe to an AWS Data Exchange product listing, and run a fairly complex benchmark in a short amount of time.

Next steps

You can run the benchmark scripts again but with different dataset sizes or a different number of concurrent users by editing the tpcds_variables.sh file. You can also try resizing your Amazon Redshift Serverless workgroup to see the performance difference with more or fewer RPUs. You can also run individual queries to see the results firsthand.
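For example, to rerun against the 1 TB schema with 10 concurrent users, the two variables can be changed with sed. The sketch below operates on a demo copy of the file so nothing real is modified; the variable names are the ones documented above, but the file contents here are a stand-in.

```shell
# Sketch: flip the benchmark to 1 TB / 10 users in a demo copy
# of tpcds_variables.sh (contents below are a stand-in).
cfg="tpcds_variables_demo.sh"
printf 'EXT_SCHEMA="ext_tpcds3000"\nEXPLAIN="false"\nMULTI_USER_COUNT="5"\n' > "$cfg"

# -i.bak keeps a backup and works with both GNU and BSD sed.
sed -i.bak \
  -e 's/^EXT_SCHEMA=.*/EXT_SCHEMA="ext_tpcds1000"/' \
  -e 's/^MULTI_USER_COUNT=.*/MULTI_USER_COUNT="10"/' \
  "$cfg"

cat "$cfg"
```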

Another thing to try is to subscribe to other AWS Data Exchange products and query this data from Amazon Redshift Serverless. Be curious and explore using Amazon Redshift Serverless and the AWS Data Exchange!

Clean up

If you deployed the resources with the automated solution, you just need to delete the stack created in CloudFormation. All resources created by the stack will be deleted automatically.

If you deployed the resources manually, you need to delete the following:

  • The Amazon Redshift database created earlier.
    • If you deployed Amazon Redshift Serverless, you will need to delete both the workgroup and the namespace.
  • The Amazon EC2 instance.

Optionally, you can unsubscribe from the TPC-DS data by going to your AWS Data Exchange Subscriptions and then turning Renewal to Off.


Conclusion
This blog post covered deploying Amazon Redshift Serverless, subscribing to an AWS Data Exchange product, and running a complex benchmark in a short amount of time. Amazon Redshift Serverless can handle high levels of concurrency with very little effort and excels in price-performance.

If you have any questions or feedback, please leave them in the comments section.

About the author

Jon Roberts is a Sr. Analytics Specialist based out of Nashville, specializing in Amazon Redshift. He has over 27 years of experience working in relational databases. In his spare time, he runs.