
Power highly resilient use cases with Amazon Redshift

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/power-highly-resilient-use-cases-with-amazon-redshift/

Amazon Redshift is the most popular and fastest cloud data warehouse, offering seamless integration with your data lake and other data sources, up to three times faster performance than any other cloud data warehouse, automated maintenance, separation of storage and compute, and up to 75% lower cost than any other cloud data warehouse. This post explores different architectures and use cases that focus on maximizing data availability, using Amazon Redshift as the core data warehouse platform.

In the modern data-driven organization, many data analytics use cases using Amazon Redshift have increasingly evolved to assume a critical business profile. Those use cases are now required to be highly resilient with little to no downtime. For example, analytical use cases that once relied solely on historical data and produced static forecasts are now expected to continuously weave real-time streaming and operational data into their ever-updating analytical forecasts. Machine learning (ML) use cases that relied on overnight batch jobs to extract customer churn predictions from extremely large datasets are now expected to perform those same customer churn predictions on demand using both historical and intraday datasets.

This post is part one of a series discussing high resiliency and availability with Amazon Redshift. In this post, we discuss a diverse set of popular analytical use cases that have traditionally, or perhaps more recently, assumed a critical business profile. The goal of this post is to show the art of the possible with high resiliency use cases. For each use case, we provide a brief description, explore the reasons for its critical business profile, and provide a reference architecture for implementing the use case following best practices. In the following sections, we also briefly mention some of the complementary high resiliency features in Amazon Redshift as they apply to each use case.

In the final section of this post, we expand the scope to discuss high resiliency in a data ecosystem that uses Amazon Redshift. In particular, we discuss the Lake House Architecture in the high resiliency context.

Part two of this series (coming soon) provides a deeper look into the individual high resiliency and availability features of Amazon Redshift.

Now let’s explore some of the most popular use cases that have traditionally required high resiliency or have come to require high resiliency in the modern data-driven organization.

Data analytics as a service

Many analytical use cases focus on extracting value from data collected and produced by an organization to serve the organization’s internal business and operational goals. In many cases, however, the data collected and produced by an organization can itself be packaged and offered as a product to other organizations. More specifically, access to the data collected and produced along with analytical capabilities is typically offered as a paid service to other organizations. This is referred to as data analytics as a service (DaaS).

For example, consider a marketing agency that has amassed demographic information for a geographic location, such as population by age, income, and family structure. Such demographic information often serves as a vital input for many organizations as they identify ideal locations for expansion, match products with likely buyers, tailor product offerings, and address many other business needs. The marketing agency can offer access to this demographic information as a paid service to a multitude of retailers, healthcare providers, resorts, and more.

Some of the most critical aspects of DaaS offerings are ease of management, security, cost-efficiency, workload isolation, and high resiliency and availability. For example, the marketing agency offering the DaaS product needs the ability to easily refresh the demographic data on a regular cadence (ease of management), ensure paying customers are able to access only authorized data (security), minimize data duplication to avoid runaway costs and keep the DaaS competitively priced (cost-efficiency), ensure a consistent performance profile for paying customers (workload isolation), and ensure uninterrupted access to the paid service (high availability).

By housing the data in one or more Amazon Redshift clusters, organizations can use the service’s data sharing capabilities to make such DaaS patterns possible in an easily manageable, secure, cost-efficient, and workload-isolated manner. Paying customers are then able to access the data using the powerful search and aggregation capabilities of Amazon Redshift. The following architecture diagram illustrates a commonly used reference architecture for this scenario.

The following diagram illustrates another reference architecture that provides high resiliency and availability for internal and external consumers of the data.
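
To make this concrete, the following is a minimal sketch of how a DaaS provider might publish a demographics schema to a paying customer's consumer cluster using data sharing. The cluster identifiers, secrets, schema, table, and namespace GUIDs are placeholders, and the SQL is submitted through the Amazon Redshift Data API.

import time
import boto3

# The Redshift Data API runs SQL without requiring drivers or persistent connections.
client = boto3.client("redshift-data")

# Hypothetical clusters and secrets -- replace with your own.
PRODUCER = dict(ClusterIdentifier="daas-producer", Database="dev",
                SecretArn="arn:aws:secretsmanager:us-east-1:111111111111:secret:producer-admin")
CONSUMER = dict(ClusterIdentifier="daas-consumer", Database="dev",
                SecretArn="arn:aws:secretsmanager:us-east-1:222222222222:secret:consumer-admin")

def run(target: dict, sql: str) -> None:
    """Submit a statement and wait for it to finish (the Data API is asynchronous)."""
    statement_id = client.execute_statement(Sql=sql, **target)["Id"]
    while client.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

# On the producer cluster: expose only the demographics schema and grant usage
# to the paying customer's cluster namespace.
for sql in [
    "CREATE DATASHARE demographics_share;",
    "ALTER DATASHARE demographics_share ADD SCHEMA demographics;",
    "ALTER DATASHARE demographics_share ADD ALL TABLES IN SCHEMA demographics;",
    "GRANT USAGE ON DATASHARE demographics_share TO NAMESPACE '<consumer-namespace-guid>';",
]:
    run(PRODUCER, sql)

# On the consumer cluster: surface the share as a local database and query it
# with three-part notation (database.schema.table).
for sql in [
    "CREATE DATABASE demographics FROM DATASHARE demographics_share "
    "OF NAMESPACE '<producer-namespace-guid>';",
    "SELECT age_band, SUM(population) AS population "
    "FROM demographics.demographics.population_by_age GROUP BY age_band;",
]:
    run(CONSUMER, sql)

Because consumers read the shared data in place, no data is duplicated and the producer's compute capacity is not consumed by consumer queries.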

While an in-depth discussion of the data sharing capabilities in Amazon Redshift is beyond the scope of this post, refer to the Amazon Redshift data sharing documentation for more information.

Fresh forecasts

As the power of the modern data ecosystem is unleashed, analytical workloads that traditionally yielded point-in-time reports based solely on historical datasets are evolving to incorporate data in real-time and produce on-demand analysis.

For example, event coordinators who once relied solely on historical datasets to create analytical sales forecasts in business intelligence (BI) dashboards for upcoming events can now use Amazon Redshift federated queries to incorporate live ticket sales stored in operational data stores such as Amazon Aurora or Amazon Relational Database Service (Amazon RDS). With federated queries, the analytical workloads running on Amazon Redshift can query that operational data on demand, so BI dashboards always reflect the most up-to-date ticket sales.

Setting up federated queries involves creating an external schema in Amazon Redshift that references the database of interest on an RDS or Aurora instance. The following reference architecture illustrates one straightforward way to achieve federated queries using two different Aurora engines.
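
As a sketch of that setup, the following statements (submitted here through the Amazon Redshift Data API, with placeholder endpoints, secrets, IAM roles, and table names) create an external schema that maps to a ticketing schema in Aurora PostgreSQL; an analogous FROM MYSQL form exists for Aurora MySQL.

import time
import boto3

client = boto3.client("redshift-data")
WAREHOUSE = dict(ClusterIdentifier="events-warehouse", Database="dev",
                 SecretArn="arn:aws:secretsmanager:us-east-1:111111111111:secret:redshift-admin")

def run(sql: str) -> None:
    """Submit a statement through the Data API and wait for it to finish."""
    statement_id = client.execute_statement(Sql=sql, **WAREHOUSE)["Id"]
    while client.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

# Map a schema in Aurora PostgreSQL into Amazon Redshift as an external schema.
run("""
CREATE EXTERNAL SCHEMA IF NOT EXISTS live_ticket_sales
FROM POSTGRES
DATABASE 'ticketing' SCHEMA 'public'
URI 'aurora-pg-cluster.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::111111111111:role/RedshiftFederatedQueryRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:111111111111:secret:aurora-pg-credentials';
""")

# A BI dashboard query can then blend historical warehouse data with live operational data.
run("""
SELECT f.event_id, f.forecasted_sales, COUNT(l.ticket_id) AS tickets_sold_so_far
FROM analytics.sales_forecast f
LEFT JOIN live_ticket_sales.ticket_sales l ON l.event_id = f.event_id
GROUP BY f.event_id, f.forecasted_sales;
""")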

While an in-depth discussion of federated query capabilities in Amazon Redshift is beyond the scope of this post, refer to the Amazon Redshift federated query documentation for more information.

ML-based predictions

The multitude of ML-based predictive use cases and the extensive analytical capabilities offered within the AWS ecosystem have placed ML in an ever more prominent and critical role within data-driven organizations. Examples include retailers looking to predict customer churn, healthcare insurers looking to predict the number of claims in the next 30 days, and financial services organizations working to detect fraud or manage their market risk and exposure.

Amazon Redshift ML provides seamless integration with Amazon SageMaker for training ML models as often as necessary using data stored in Amazon Redshift. Redshift ML also provides the ability to weave on-demand, ML-based predictions directly into Amazon Redshift analytical workloads. The ease with which ML predictions can now be used in Amazon Redshift has paved the path to analytical workloads and BI dashboards that either use or center around ML-based predictions, and that are relied on heavily by operations teams, business teams, and many other users.

For example, retailers may have traditionally relied on ML models that were trained in a periodic cadence, perhaps weekly or some other lengthy interval, to predict customer churn. A lot can change, however, during those training intervals, rendering the retailer’s ability to predict customer churn less effective. With Redshift ML, retailers are now able to train their ML models using their most recent data within Amazon Redshift and incorporate ML predictions directly in the Amazon Redshift analytical workloads used to power BI dashboards.

The following reference architecture demonstrates the use of Redshift ML functions in various analytical workloads. With ANSI SQL commands, you can use Amazon Redshift data to create and train an ML model (Amazon Redshift uses SageMaker) that is then made accessible through an Amazon Redshift function. That function can then be used in various analytical workloads.
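
The following is a minimal sketch of that flow, submitted here through the Amazon Redshift Data API; the cluster, table, column, IAM role, and bucket names are placeholders.

import boto3

client = boto3.client("redshift-data")
WAREHOUSE = dict(ClusterIdentifier="retail-warehouse", Database="dev",
                 SecretArn="arn:aws:secretsmanager:us-east-1:111111111111:secret:redshift-admin")

# CREATE MODEL exports the training data to Amazon S3, trains a model in SageMaker in the
# background, and registers a SQL function (predict_customer_churn) for in-database inference.
client.execute_statement(Sql="""
CREATE MODEL customer_churn
FROM (SELECT age, tenure_months, monthly_spend, support_calls, churned
      FROM analytics.customer_activity
      WHERE snapshot_date < '2021-12-01')
TARGET churned
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::111111111111:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-artifacts');
""", **WAREHOUSE)

# Training runs asynchronously. Once SHOW MODEL customer_churn reports READY, the function
# can be woven directly into analytical queries and BI dashboard workloads, for example:
scoring_sql = """
SELECT customer_id,
       predict_customer_churn(age, tenure_months, monthly_spend, support_calls) AS churn_prediction
FROM analytics.customer_activity
WHERE snapshot_date = '2021-12-01';
"""
# client.execute_statement(Sql=scoring_sql, **WAREHOUSE)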

While an in-depth discussion of Redshift ML is beyond the scope of this post, refer to the Amazon Redshift ML documentation for more information.

Production data for dev environments

Having access to high-quality test data is one of the most common challenges encountered in the development process. To maintain access to high-quality test data, developers must often overcome hurdles such as high administrative overhead for replicating data, increased costs from data duplication, prolonged downtime, and risk of losing development artifacts when refreshing test environments.

The data sharing feature enables Amazon Redshift development clusters to access high-quality production data directly from an Amazon Redshift production or pre-production cluster in a straightforward, secure, and cost-efficient manner that supports a highly resilient posture.

For example, you can establish a data share on the Amazon Redshift production cluster that securely exposes only the schemas, tables, or views appropriate for development environments. The Amazon Redshift development cluster can then use that data share to query the high-quality production data directly where it is persisted on Amazon Simple Storage Service (Amazon S3), without impacting the Amazon Redshift production cluster’s compute capacity. Because the development cluster uses its own compute capacity, the production cluster’s high resiliency and availability posture is insulated from long-running experimental or development workloads. Likewise, development workloads are insulated from competing for compute resources on the production cluster.

In addition, querying the high-quality production data via the production cluster’s data share avoids unnecessary data duplication that can lead to higher storage costs. As the production data changes, the development cluster automatically gains access to the latest high-quality production data.

Finally, for development features that require schema changes, developers are free to create custom schemas on the development cluster that are based on the high-quality production data. Because the production data is decoupled from the development cluster, the custom schemas are located only on the development cluster, and the production data is not impacted in any way.

Let’s explore two example reference architectures that you can use for this use case.

Production data for dev environments using current-generation Amazon Redshift instance types

With the native Amazon Redshift data sharing available with the current generation of Amazon Redshift instance types (RA3), we can use a relatively straightforward architecture to enable dev environments with the freshest high-quality production data.

In the following architecture diagram, the production cluster takes on the role of a producer cluster, because it's the cluster producing the data of interest. The development clusters take on the role of consumer clusters, because they're the clusters interested in accessing the produced data. Note that the producer and consumer roles are merely labels to clarify the different role of each cluster, and not a formal designation within Amazon Redshift.

Production data for dev environments using previous-generation Amazon Redshift instance types

So far in this use case, we have relied entirely on the native data sharing capability in Amazon Redshift. However, if you're using the previous-generation Amazon Redshift instance types of dense compute (DC) and dense storage (DS) nodes in your production environments, you need to employ a slightly different implementation, because native Amazon Redshift data sharing is available only for the current generation of Amazon Redshift instance types (RA3).

First, we use a snapshot of the dense compute or dense storage production cluster to restore the production environment to a new RA3 cluster that has the latest production data. Let's call this cluster the dev-read cluster to emphasize that it is used for read-only purposes and doesn't receive any data modifications. In addition, we stand up a second RA3 cluster that simply serves as a sandbox for developers, with data shares established to the dev-read cluster. Let's call this cluster the dev-write cluster, because its main purpose is to serve as a read/write sandbox for developers and broader development work.

The following diagram illustrates this setup.

One of the key benefits of having a separate dev-read and dev-write cluster is that the dev-read cluster can be swapped out with a new RA3 cluster containing fresher production data, without wiping out all of the potential development artifacts created by developers (stored procedures for debugging, modified schemas, elevated privileges, and so on). This resiliency is a crucial benefit for many development teams that might otherwise significantly delay refreshing their development data simply because they don’t want to lose their testing and debugging artifacts or broader development settings.

For example, if the development team wants to refresh the production data in the dev-read cluster on the first of every month, then every month you could rename the current dev-read cluster to dev-read-old, and use the latest production snapshot to create a new dev-read RA3 cluster. You also have to reestablish the data share setup between the dev-write and dev-read clusters along with the dev-read cluster swap, but this task can be automated fairly easily and quickly using a number of approaches.
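
For example, one lightweight approach is a script driven by the Amazon Redshift Data API: after the new dev-read cluster is restored from the latest production snapshot, the script recreates the data share on it and repoints the dev-write cluster to the new namespace. All cluster names, secrets, schemas, and namespace GUIDs below are placeholders.

import time
import boto3

client = boto3.client("redshift-data")

NEW_DEV_READ = dict(ClusterIdentifier="dev-read", Database="prod",
                    SecretArn="arn:aws:secretsmanager:us-east-1:111111111111:secret:dev-read-admin")
DEV_WRITE = dict(ClusterIdentifier="dev-write", Database="dev",
                 SecretArn="arn:aws:secretsmanager:us-east-1:111111111111:secret:dev-write-admin")

def run(target: dict, sql: str) -> None:
    """Submit a statement and wait for it to finish."""
    statement_id = client.execute_statement(Sql=sql, **target)["Id"]
    while client.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

# On the freshly restored dev-read cluster: recreate the data share exposed to developers.
for sql in [
    "CREATE DATASHARE dev_share;",
    "ALTER DATASHARE dev_share ADD SCHEMA sales;",
    "ALTER DATASHARE dev_share ADD ALL TABLES IN SCHEMA sales;",
    "GRANT USAGE ON DATASHARE dev_share TO NAMESPACE '<dev-write-namespace-guid>';",
]:
    run(NEW_DEV_READ, sql)

# On the dev-write cluster: drop the database that pointed at the old dev-read cluster and
# recreate it from the new share. Local development artifacts (stored procedures, custom
# schemas, privileges) are untouched because they live on the dev-write cluster.
for sql in [
    "DROP DATABASE prod_data;",
    "CREATE DATABASE prod_data FROM DATASHARE dev_share OF NAMESPACE '<new-dev-read-namespace-guid>';",
]:
    run(DEV_WRITE, sql)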

Another key benefit is that the dev-read cluster doesn't carry any load beyond the initial snapshot restoration, so it can be a simple two-node ra3.xlplus cluster to minimize cost, while the dev-write cluster can be sized more appropriately for development workloads. In other words, there is minimal additional cost with this setup compared to using a single development cluster.

While an in-depth discussion of the data sharing capabilities in Amazon Redshift is beyond the scope of this post, refer to the Amazon Redshift data sharing documentation for more information.

Streaming data analytics

With integration between the Amazon Kinesis family of services and Amazon Redshift, you have an easy and reliable way to load streaming data into data lakes as well as analytics services. Amazon Kinesis Data Firehose micro-batches real-time streaming messages and loads those micro-batches into the designated table within Amazon Redshift. With a few clicks on the Kinesis Data Firehose console, you can create a delivery stream that can ingest streaming data from hundreds of sources to multiple destinations, including Amazon Redshift. Should there be any interruptions in publishing streaming messages to Amazon Redshift, Kinesis Data Firehose automatically attempts multiple retries, and you can configure and customize that retry behavior.

You can also configure Kinesis Data Firehose to convert the incoming data to open formats like Apache Parquet and ORC before the data is delivered, for optimal query performance. You can even dynamically partition your streaming data using well-defined keys like customer_id or transaction_id. Kinesis Data Firehose groups data by these keys and delivers it into key-unique S3 prefixes, making it easier for you to perform high-performance, cost-efficient analytics in Amazon S3 using Amazon Redshift and other AWS services.

The following reference architecture shows one of the straightforward approaches to integrating Kinesis Data Firehose and Amazon Redshift.
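
The delivery stream itself takes only one API call to create. The following is a minimal sketch using the AWS SDK for Python; the role ARNs, bucket, cluster endpoint, table, and credentials are placeholders, and in practice you would also configure S3 backup and logging options appropriate to your workload.

import boto3

firehose = boto3.client("firehose")

# Placeholder ARNs, endpoint, and credentials -- replace with your own.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-redshift",
    DeliveryStreamType="DirectPut",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::111111111111:role/FirehoseDeliveryRole",
        "ClusterJDBCURL": "jdbc:redshift://analytics-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
        "CopyCommand": {
            "DataTableName": "public.clickstream_events",
            "CopyOptions": "JSON 'auto' GZIP TIMEFORMAT 'auto'",
        },
        "Username": "firehose_user",
        "Password": "replace-me",
        "RetryOptions": {"DurationInSeconds": 3600},
        # Firehose stages micro-batches in S3 and then issues a COPY into Amazon Redshift.
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::111111111111:role/FirehoseDeliveryRole",
            "BucketARN": "arn:aws:s3:::my-firehose-staging-bucket",
            "Prefix": "clickstream/",
            "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
            "CompressionFormat": "GZIP",
        },
    },
)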

While an in-depth discussion of Kinesis Data Firehose and its integration with Amazon Redshift is beyond the scope of this post, refer to the Kinesis Data Firehose documentation for more information.

Change data capture

While Amazon Redshift federated query enables Amazon Redshift to directly query data stored in an operational data store such as Aurora, there are also times when it helps for some of that operational data to be entirely replicated to Amazon Redshift for a multitude of other analytical use cases, such as data refinement.

After an initial replication from the operational data store to Amazon Redshift, ongoing change data capture (CDC) replications are required to keep Amazon Redshift updated with subsequent changes that occurred on the operational data store.

With AWS Database Migration Service (AWS DMS), you can automatically replicate changes in an operational data store such as Aurora to Amazon Redshift in a straightforward, cost-efficient, secure, and highly resilient and available approach. As data changes on the operational data store, AWS DMS automatically replicates those changes to the designated table on Amazon Redshift.

The following reference architecture illustrates the straightforward use of AWS DMS to replicate changes in an operational data store, such as Amazon Aurora, Oracle, or SQL Server, to Amazon Redshift and other destinations such as Amazon S3.
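
The following is a minimal sketch of creating such an ongoing CDC task with the AWS SDK for Python; it assumes the replication instance and the source (Aurora) and target (Amazon Redshift) endpoints already exist, and all ARNs, schema, and table names are placeholders.

import json
import boto3

dms = boto3.client("dms")

# Include every table in the "sales" schema of the source database.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sales-schema",
        "object-locator": {"schema-name": "sales", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="aurora-to-redshift-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:111111111111:endpoint:AURORASOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111111111111:endpoint:REDSHIFTTARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111111111111:rep:REPLICATIONINSTANCE",
    MigrationType="full-load-and-cdc",  # initial copy followed by ongoing change replication
    TableMappings=json.dumps(table_mappings),
)

# Start the task once it reaches the "ready" state.
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)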

While an in-depth discussion of AWS DMS is beyond the scope of this post, refer to the AWS DMS documentation for more information.

Workload isolation

Sharing data can improve the agility of your organization by encouraging more connections and fostering collaboration, which allows teams to build upon the work of others rather than repeat already existing processes. Amazon Redshift does this by giving you instant, granular, and high-performance access to data across Amazon Redshift clusters without needing you to manually copy or move your data. You have live access to data so your users can see the most up-to-date and consistent information as it’s updated in Amazon Redshift clusters.

Amazon Redshift parallelizes queries across the different nodes of a cluster, but there may be circumstances when you want to allow more concurrent queries than one cluster can provide or provide workload separation. You can use data sharing to isolate your workloads, thereby minimizing the possibility that a deadlock situation in one workload impacts other workloads running on the same cluster.

The traditional approach to high resiliency and availability is to deploy two or more identical, independent, and parallel Amazon Redshift clusters. However, this design requires that all database updates be performed on all Amazon Redshift clusters. This introduces complexity in your overall architecture. In this section, we demonstrate how to use data sharing to design a highly resilient and available architecture with workload isolation.

The following diagram illustrates the high-level architecture for data sharing in Amazon Redshift.

This architecture supports different kinds of business-critical workloads, such as using a central extract, transform, and load (ETL) cluster that shares data with multiple analytic or BI clusters. This approach provides BI workload isolation, so individual BI workloads don’t impact the performance of the ETL workloads and vice-versa. You can scale the individual Amazon Redshift cluster compute resources according to the workload-specific requirements of price and performance.

Amazon Redshift Spectrum is a feature of Amazon Redshift that enables you to run queries against exabytes of data in Amazon S3, with no loading or ETL required. You can use your producer cluster to process the Amazon S3 data and unload the resulting dataset back to Amazon S3, and then set up as many Amazon Redshift consumer clusters as you need to query your Amazon S3 data lake, thereby providing high resiliency and availability, and virtually limitless concurrency.
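
As a sketch of that pattern (with placeholder names throughout, submitted through the Amazon Redshift Data API), the producer cluster unloads a curated result set back to Amazon S3, and each consumer cluster queries it in place through an external schema backed by the AWS Glue Data Catalog.

import time
import boto3

client = boto3.client("redshift-data")

PRODUCER = dict(ClusterIdentifier="etl-producer", Database="dev",
                SecretArn="arn:aws:secretsmanager:us-east-1:111111111111:secret:producer-admin")
CONSUMER = dict(ClusterIdentifier="bi-consumer-1", Database="dev",
                SecretArn="arn:aws:secretsmanager:us-east-1:111111111111:secret:consumer-admin")

def run(target: dict, sql: str) -> None:
    """Submit a statement and wait for it to finish."""
    statement_id = client.execute_statement(Sql=sql, **target)["Id"]
    while client.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

# On the producer cluster: write the curated result set back to the data lake in Parquet.
run(PRODUCER, """
UNLOAD ('SELECT region, product, SUM(amount) AS revenue FROM sales GROUP BY region, product')
TO 's3://my-data-lake/curated/revenue/'
IAM_ROLE 'arn:aws:iam::111111111111:role/RedshiftSpectrumRole'
FORMAT AS PARQUET;
""")

# On each consumer cluster: map the Glue database once, then query the S3 data in place.
# This assumes the unloaded files have been registered as the 'revenue' table in the
# 'curated' Glue database (for example, by a Glue crawler or CREATE EXTERNAL TABLE).
run(CONSUMER, """
CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
FROM DATA CATALOG DATABASE 'curated'
IAM_ROLE 'arn:aws:iam::111111111111:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
""")
run(CONSUMER, "SELECT region, SUM(revenue) AS revenue FROM lake.revenue GROUP BY region;")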

Highly available data ecosystem using Amazon Redshift

In this section, we delve a little deeper into the Lake House Architecture, which achieves a wide range of best practices while providing several high resiliency and availability benefits that complement Amazon Redshift.

In the modern data ecosystem, many data-driven organizations have achieved tremendous success employing a Lake House Architecture to process the ever-growing volume, velocity, and variety of data. In addition, the Lake House Architecture has helped those data-driven organizations achieve greater resiliency.

As the following diagram shows, the Lake House Architecture consists of a data lake serving as the single source of truth with different compute layers such as Amazon Redshift sitting atop the data lake (in effect building a house on the lake, hence the term “lake house”).

Organizations can use a data lake to maximize data availability by centrally storing the data in the durable Amazon S3 layer while accessing it from multiple AWS services. Separation of compute and storage offers several resiliency and availability advantages, and a data lake provides these same advantages to a heterogeneous set of services that can all access a common data layer. Using Amazon Redshift with a Lake House Architecture therefore reinforces the lake house's high resiliency and availability. Furthermore, with the seamless integration of Amazon Redshift with the S3 data lake, you can use Redshift Spectrum to run ANSI SQL queries within Amazon Redshift that directly reference external tables in the S3 data lake, as is often done with cold data (data that is infrequently accessed).

In addition, there are a multitude of straightforward services such as AWS Glue, AWS DMS, and AWS Lambda that you can use to load warm data (data that is frequently accessed) from an S3 data lake to Amazon Redshift for greater performance.

Conclusion

In this post, we explored several analytical use cases that require high resiliency and availability and provided an overview of the Amazon Redshift features that help fulfill those requirements. We also presented a few example reference architectures for those use cases as well as a data ecosystem reference architecture that provides a wide range of benefits and reinforces high resiliency and availability postures.

For further information on high resiliency and availability within Amazon Redshift or implementing the aforementioned use cases, we encourage you to reach out to your AWS Solutions Architect—we look forward to helping.


About the Authors

Asser Moustafa is an Analytics Specialist Solutions Architect at AWS based out of Dallas, TX, USA. He advises customers in the Americas on their Amazon Redshift and data lake architectures and migrations, starting from the POC stage to actual production deployment and maintenance.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift. He is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Build a modern data architecture on AWS with Amazon AppFlow, AWS Lake Formation, and Amazon Redshift

Post Syndicated from Dr. Yannick Misteli original https://aws.amazon.com/blogs/big-data/build-a-modern-data-architecture-on-aws-with-amazon-appflow-aws-lake-formation-and-amazon-redshift/

This is a guest post written by Dr. Yannick Misteli, lead cloud platform and ML engineering in global product strategy (GPS) at Roche.

Roche recently launched the Roche Data Insights (RDI) initiative to achieve its vision of building shared, interoperable data and insights with federated governance through new ways of working and collaborating. Furthermore, a simplified and integrated data landscape is being established to empower insights communities. One of the first domains to engage in this program is the Go-to-Market (GTM) area, which comprises sales, marketing, medical access, and market affairs at Roche. The GTM domain enables Roche to understand customers and ultimately create and deliver valuable services that meet their needs. GTM as a domain extends beyond healthcare professionals (HCPs) to a larger healthcare ecosystem consisting of patients, communities, health authorities, payers, providers, academia, competitors, and so on. Therefore, data and analytics are key in supporting internal and external stakeholders in their decision-making processes through actionable insights.

Roche GTM built a modern data and machine learning (ML) platform on AWS while utilizing DevOps best practices. The mantra of everything as code (EaC) was key in building a fully automated, scalable data lake and data warehouse on AWS.

In this post, you learn how Roche used AWS products and services such as Amazon AppFlow, AWS Lake Formation, and Amazon Redshift to provision and populate their data lake; how they sourced, transformed, and loaded data into the data warehouse; and how they realized best practices in security and access control.

In the following sections, we dive deep into the scalable, secure, and automated modern data platform that Roche has built. We demonstrate how to automate data ingestion, apply security standards, and utilize DevOps best practices to ease management of your modern data platform on AWS.

Data platform architecture

The following diagram illustrates the data platform architecture.

The architecture contains the components described in the following sections.

Lake Formation security

We use Lake Formation to secure all data as it lands in the data lake. Separating each data lake layer into distinct S3 buckets and prefixes enables fine-grained access control policies that Lake Formation implements. This concept also extends to locking down access to specific rows and columns and applying policies to specific IAM roles and users. Governance and access to data lake resources is difficult to manage, but Lake Formation simplifies this process for administrators.

To secure access to the data lake using Lake Formation, the following steps are automated using the AWS CDK with customized constructs (a simplified sketch of one such construct follows the list):

  1. Register the S3 data buckets and prefixes, and corresponding AWS Glue databases with Lake Formation.
  2. Add data lake administrators (GitLab runner IAM deployment role and administrator IAM role).
  3. Grant the AWS Glue job IAM roles access to the specific AWS Glue databases.
  4. Grant the AWS Lambda IAM role access to the Amazon AppFlow databases.
  5. Grant the listed IAM roles access to the corresponding tables in the AWS Glue databases.
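
Roche's customized constructs aren't reproduced here, but the following simplified AWS CDK (Python) sketch shows the general shape of steps 1, 3, and 5; the construct name, parameters, and granted permissions are illustrative assumptions.

from constructs import Construct
from aws_cdk import aws_iam as iam, aws_lakeformation as lakeformation


class LakeFormationGrants(Construct):
    """Illustrative construct: registers an S3 location with Lake Formation and
    grants an IAM role access to an AWS Glue database."""

    def __init__(self, scope: Construct, construct_id: str, *,
                 bucket_arn: str, glue_database_name: str, grantee_role: iam.IRole) -> None:
        super().__init__(scope, construct_id)

        # Register the S3 bucket/prefix as a Lake Formation data lake location.
        lakeformation.CfnResource(
            self, "DataLakeLocation",
            resource_arn=bucket_arn,
            use_service_linked_role=True,
        )

        # Grant the role access to the corresponding AWS Glue database.
        lakeformation.CfnPermissions(
            self, "DatabaseGrant",
            data_lake_principal=lakeformation.CfnPermissions.DataLakePrincipalProperty(
                data_lake_principal_identifier=grantee_role.role_arn
            ),
            resource=lakeformation.CfnPermissions.ResourceProperty(
                database_resource=lakeformation.CfnPermissions.DatabaseResourceProperty(
                    name=glue_database_name
                )
            ),
            permissions=["DESCRIBE", "CREATE_TABLE"],
        )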

AWS Glue Data Catalog

The AWS Glue Data Catalog is the centralized registration and access point for all databases and tables that are created in both the data lake and in Amazon Redshift. This provides centralized transparency to all resources along with their schemas and the location of all data that is referenced. This is a critical aspect for any data operations performed within the lake house platform.

Data sourcing and ingestion

Data is sourced and loaded into the data lake through the use of AWS Glue jobs and Amazon AppFlow. The ingested data is made available in the Amazon Redshift data warehouse through Amazon Redshift Spectrum using external schemas and tables. The process of creating the external schemas and linking them to the Data Catalog is outlined later in this post.

Amazon AppFlow Salesforce ingestion

Amazon AppFlow is a fully-managed integration service that allows you to pull data from sources such as Salesforce, SAP, and Zendesk. Roche integrates with Salesforce to load Salesforce objects securely into their data lake without needing to write any custom code. Roche also pushes ML results back to Salesforce using Amazon AppFlow to facilitate the process.

Salesforce objects are first fully loaded into Amazon S3 and then are flipped to a daily incremental load to capture deltas. The data lands in the raw zone bucket in Parquet format using the date as a partition. The Amazon AppFlow flows are created through the use of a YAML configuration file (see the following code). This configuration is consumed by the AWS CDK deployment to create the corresponding flows.

appflow:
  flow_classes:
    salesforce:
      source: salesforce
      destination: s3
      incremental_load: 1
      schedule_expression: "rate(1 day)"
      s3_prefix: na
      connector_profile: roche-salesforce-connector-profile1,roche-salesforce-connector-profile2
      description: appflow flow from Salesforce
      environment: all
  - name: Account
    incremental_load: 1
    bookmark_col: appflow_date_str
  - name: CustomSalesforceObject
    pii: 0
    bookmark_col: appflow_date_str
    upsert_field_list: upsertField
    s3_prefix: prefix
    source: s3
    destination: salesforce
    schedule_expression: na
    connector_profile: roche-salesforce-connector-profile

The YAML configuration makes it easy to select whether data should be loaded from an S3 bucket back to Salesforce or from Salesforce to an S3 bucket. This configuration is subsequently read by the AWS CDK app and corresponding stacks to translate into Amazon AppFlow flows.

The following options are specified in the preceding YAML configuration file:

  • source – The location to pull data from (Amazon S3, Salesforce)
  • destination – The destination to put data to (Amazon S3, Salesforce)
  • object_name – The name of the Salesforce object to interact with
  • incremental_load – A Boolean specifying if the load should be incremental or full (0 means full, 1 means incremental)
  • schedule_expression – The cron or rate expression to run the flow (na makes it on demand)
  • s3_prefix – The prefix to push or pull the data from in the S3 bucket
  • connector_profile – The Amazon AppFlow connector profile name to use when connecting to Salesforce (can be a CSV list)
  • environment – The environment to deploy this Amazon AppFlow flow to (all means deploy to dev and prod, dev means development environment, prod means production environment)
  • upsert_field_list – The set of Salesforce object fields (can be a CSV list) to use when performing an upsert operation back to Salesforce (only applicable when loading data from an S3 bucket back to Salesforce)
  • bookmark_col – The name of the column to use in the Data Catalog for registering the daily load date string partition

Register Salesforce objects to the Data Catalog

Complete the following steps to register data loaded into the data lake with the Data Catalog and link it to Amazon Redshift:

  1. Gather Salesforce object fields and corresponding data types.
  2. Create a corresponding AWS Glue database in the Data Catalog.
  3. Run a query against Amazon Redshift to create an external schema that links to the AWS Glue database.
  4. Create tables and partitions in the AWS Glue database and tables.

Data is accessible via the Data Catalog and the Amazon Redshift cluster.

Amazon AppFlow dynamic field gathering

To construct the schema of the loaded Salesforce object in the data lake, you invoke the following Python function. The code utilizes an Amazon AppFlow client from Boto3 to dynamically gather the Salesforce object fields to construct the Salesforce object’s schema.

import boto3

client = boto3.client('appflow')

def get_salesforce_object_fields(object_name: str, connector_profile: str):
    """
    Gathers the Salesforce object and its corresponding fields.

    Parameters:
        object_name (str) = the name of the Salesforce object to consume.
        connector_profile (str) = the name of the Amazon AppFlow connector profile.

    Returns:
        object_schema_list (list) =  a list of the object's fields and datatype (a list of dictionaries).
    """
    print("Gathering Object Fields")

    object_fields = []

    response = client.describe_connector_entity(
        connectorProfileName=connector_profile,
        connectorEntityName=object_name,
        connectorType='Salesforce'
    )

    for obj in response['connectorEntityFields']:
        object_fields.append(
            {'field': obj['identifier'], 'data_type': obj['supportedFieldTypeDetails']['v1']['fieldType']})

    return object_fields

We use the function for both the creation of the Amazon AppFlow flow via the AWS CDK deployment and for creating the corresponding table in the Data Catalog in the appropriate AWS Glue database.

Create an Amazon CloudWatch Events rule, AWS Glue table, and partition

To add new tables (one per Salesforce object loaded into Amazon S3) and partitions to the Data Catalog automatically, you create an Amazon CloudWatch Events rule. This automation enables you to query the data in both AWS Glue and Amazon Redshift.

After an Amazon AppFlow flow run completes, it emits an event that matches a CloudWatch Events rule, which invokes a corresponding Lambda function to either create a new table in AWS Glue or add a new partition with the corresponding date string for the current day. The CloudWatch Events rule looks like the following screenshot.
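
A rule like this can also be created programmatically. The following boto3 sketch is illustrative only; the event pattern (in particular the detail-type emitted by Amazon AppFlow) and the Lambda function ARN are assumptions to verify against your own flows.

import json
import boto3

events = boto3.client("events")

# Assumed event pattern: Amazon AppFlow publishes flow run reports to the default event bus.
RULE_NAME = "appflow-flow-run-complete"
events.put_rule(
    Name=RULE_NAME,
    EventPattern=json.dumps({
        "source": ["aws.appflow"],
        "detail-type": ["AppFlow End Flow Run Report"],
    }),
    State="ENABLED",
)

# Route matching events to the Lambda function that creates the Glue table or partition.
# (The Lambda function also needs a resource-based policy allowing events.amazonaws.com
# to invoke it, added with the Lambda AddPermission API.)
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{
        "Id": "glue-catalog-updater",
        "Arn": "arn:aws:lambda:eu-west-1:111111111111:function:appflow-catalog-updater",
    }],
)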

The invoked Lambda function uses the AWS Data Wrangler (awswrangler) Python package to interact with the Data Catalog. Using the preceding function definition, the object fields and their data types are accessible to pass to the following function call:

import awswrangler as wr

def create_external_parquet_table(
    database_name: str, 
    table_name: str, 
    s3_path: str, 
    columns_map: dict, 
    partition_map: dict
):
    """
    Creates a new external table in Parquet format.

    Parameters:
        database_name (str) = the name of the database to create the table in.
        table_name (str) = the name of the table to create.
        s3_path (str) = the S3 path to the data set.
        columns_map (dict) = a dictionary object containing the details of the columns and their data types from appflow_utility.get_salesforce_object_fields
        partition_map (dict) = a map of the partitions for the parquet table as {'column_name': 'column_type'}
    
    Returns:
        table_metadata (dict) = metadata about the table that was created.
    """

    column_type_map = {}

    # get_salesforce_object_fields returns entries shaped like
    # {'field': <column name>, 'data_type': <column type>}.
    for field in columns_map:
        column_type_map[field['field']] = field['data_type']

    return wr.catalog.create_parquet_table(
        database=database_name,
        table=table_name,
        path=s3_path,
        columns_types=column_type_map,
        partitions_types=partition_map,
        description=f"AppFlow ingestion table for {table_name} object"
    )

If the table already exists, the Lambda function creates a new partition to account for the date in which the flow completed (if it doesn’t already exist):

import awswrangler as wr
from botocore.exceptions import ClientError

def create_parquet_table_date_partition(
    database_name: str, 
    table_name: str, 
    s3_path: str, 
    year: str, 
    month: str, 
    day: str
):
    """
    Creates a new partition by the date (YYYYMMDD) on an existing parquet table.

    Parameters:
        database_name (str) = the name of the database to create the table in.
        table_name (str) = the name of the table to create.
        s3_path (str) = the S3 path to the data set.
        year(str) = the current year for the partition (YYYY format).
        month (str) = the current month for the partition (MM format).
        day (str) = the current day for the partition (DD format).
    
    Returns:
        table_metadata (dict) = metadata about the table that has a new partition
    """

    date_str = f"{year}{month}{day}"
    
    return wr.catalog.add_parquet_partitions(
        database=database_name,
        table=table_name,
        partitions_values={
            f"{s3_path}/{year}/{month}/{day}": [date_str]
        }
    )
    
def table_exists(
    database_name: str, 
    table_name: str
):
    """
    Checks if a table exists in the Glue catalog.

    Parameters:
        database_name (str) = the name of the Glue Database where the table should be.
        table_name (str) = the name of the table.
    
    Returns:
        exists (bool) = returns True if the table exists and False if it does not exist.
    """

    try:
        wr.catalog.table(database=database_name, table=table_name)
        return True
    except ClientError as e:
        return False

Amazon Redshift external schema query

An AWS Glue database is created for each Amazon AppFlow connector profile that is present in the preceding configuration. The objects that are loaded from Salesforce into Amazon S3 are registered as tables in the Data Catalog under the corresponding database. To link the database in the Data Catalog with an external Amazon Redshift schema, run the following query:

CREATE EXTERNAL SCHEMA ${connector_profile_name}_ext from data catalog
database '${appflow_connector_profile_name}'
iam_role 'arn:aws:iam::${AWS_ACCOUNT_ID}:role/RedshiftSpectrumRole'
region 'eu-west-1';

The specified iam_role value must be an IAM role created ahead of time and must have the appropriate access policies specified to query the Amazon S3 location.

Now, all the tables available in the Data Catalog can be queried from Amazon Redshift with standard SQL through Amazon Redshift Spectrum.

Amazon AppFlow Salesforce destination

Roche trains and invokes ML models using data found in the Amazon Redshift data warehouse. After the ML models are complete, the results are pushed back into Salesforce. Through the use of Amazon AppFlow, we can achieve the data transfer without writing any custom code. The schema of the results must match the schema of the corresponding Salesforce object, and the format of the results must be written in either JSON lines or CSV format in order to be written back into Salesforce.

AWS Glue Jobs

To source on-premises data feeds into the data lake, Roche has built a set of AWS Glue jobs in Python. There are various external sources including databases and APIs that are directly loaded into the raw zone S3 bucket. The AWS Glue jobs are run on a daily basis to load new data. The data that is loaded follows the partitioning scheme of YYYYMMDD format in order to more efficiently store and query datasets. The loaded data is then converted into Parquet format for more efficient querying and storage purposes.

Amazon EKS and Kubeflow

To deploy ML models on Amazon EKS, Roche uses Kubeflow on Amazon EKS. The use of Amazon EKS as the backbone infrastructure makes it easy to build, train, test, and deploy ML models and interact with Amazon Redshift as a data source.

Firewall Manager

As an added layer of security, Roche takes extra precautions through the use of Firewall Manager. This allows Roche to explicitly deny or allow inbound and outbound traffic through the use of stateful and stateless rule sets. This also enables Roche to allow certain outbound access to external websites and deny websites that they don’t want resources inside of their Amazon VPC to have access to. This is critical especially when dealing with any sensitive datasets to ensure that data is secured and has no chance of being moved externally.

CI/CD

All the infrastructure outlined in the architecture diagram was automated and deployed to multiple AWS Regions using a continuous integration and continuous delivery (CI/CD) pipeline with GitLab Runners as the orchestrator. The GitFlow model was used for branching and invoking automated deployments to the Roche AWS accounts.

Infrastructure as code and AWS CDK

Infrastructure as code (IaC) best practices were used to facilitate the creation of all infrastructure. The Roche team uses the Python AWS CDK to deploy, version, and maintain any changes that occur to the infrastructure in their AWS account.

AWS CDK project structure

The top level of the project structure in GitLab is organized into folders that keep infrastructure and application code in one location.

To facilitate the various resources that are created in the Roche account, the deployment was broken into the following AWS CDK apps, which encompass multiple stacks:

  • core
  • data_lake
  • data_warehouse

The core app contains all the stacks related to account setup and account bootstrapping, such as:

  • VPC creation
  • Initial IAM roles and policies
  • Security guardrails

The data_lake app contains all the stacks related to creating the AWS data lake, such as:

  • Lake Formation setup and registration
  • AWS Glue database creation
  • S3 bucket creation
  • Amazon AppFlow flow creation
  • AWS Glue job setup

The data_warehouse app contains all the stacks related to setting up the data warehouse infrastructure, such as:

  • Amazon Redshift cluster
  • Load balancer to Amazon Redshift cluster
  • Logging

The AWS CDK project structure described was chosen to keep the deployment flexible and to logically group together stacks that relied on each other. This flexibility allows for deployments to be broken out by function and deployed only when truly required and needed. This decoupling of different parts of the provisioning maintains flexibility when deploying.

AWS CDK project configuration

Project configurations are flexible and abstracted into YAML configuration files. For example, Roche has simplified the process of creating a new Amazon AppFlow flow and can add or remove flows as needed simply by adding a new entry to the YAML configuration. The next time the GitLab runner deployment occurs, it picks up the changes during AWS CDK synthesis and generates a new change set with the new set of resources. This configuration and setup keeps things dynamic and flexible while decoupling configuration from code.

Network architecture

The following diagram illustrates the network architecture.

We can break down the architecture into the following:

  • All AWS services are deployed in two Availability Zones (except Amazon Redshift)
  • Only private subnets have access to the on-premises Roche environment
  • Services are deployed in backend subnets
  • Perimeter protection using AWS Network Firewall
  • A network load balancer publishes services to the on-premises environment

Network security configurations

Infrastructure, configuration, and security are defined as code in AWS CDK, and Roche uses a CI/CD pipeline to manage and deploy them. Roche has an AWS CDK application to deploy the core services of the project: VPC, VPN connectivity, and AWS security services (AWS Config, Amazon GuardDuty, and AWS Security Hub). The VPC contains four network layers deployed in two Availability Zones, and they have VPC endpoints to access AWS services like Amazon S3, Amazon DynamoDB, and Amazon Simple Queue Service (Amazon SQS). They limit internet access using AWS Network Firewall.

The infrastructure is defined as code and the configuration is segregated from it. Roche performed the VPC setup by running the CI/CD pipeline to deploy their infrastructure. The configuration lives in a dedicated external file; if Roche wants to change any value of the VPC, they simply modify this file and run the pipeline again, without writing any new lines of code. This makes changes simple to roll out to their environment, more transparent, and easier to configure, and the traceability of the configuration through Git simplifies change approval.

The following code is an example of the VPC configuration:

"test": {
        "vpc": {
            "name": "",
            "cidr_range": "192.168.40.0/21",
            "internet_gateway": True,
            "flow_log_bucket": shared_resources.BUCKET_LOGGING,
            "flow_log_prefix": "vpc-flow-logs/",
        },
        "subnets": {
            "private_subnets": {
                "private": ["192.168.41.0/25", "192.168.41.128/25"],
                "backend": ["192.168.42.0/23", "192.168.44.0/23"],
            },
            "public_subnets": {
                "public": {
                    "nat_gateway": True,
                    "publics_ip": True,
                    "cidr_range": ["192.168.47.64/26", "192.168.47.128/26"],
                }
            },
            "firewall_subnets": {"firewall": ["192.168.47.0/28", "192.168.47.17/28"]},
        },
        ...
         "vpc_endpoints": {
            "subnet_group": "backend",
            "services": [
                "ec2",
                "ssm",
                "ssmmessages",
                "sns",
                "ec2messages",
                "glue",
                "athena",
                "secretsmanager",
                "ecr.dkr",
                "redshift-data",
                "logs",
                "sts",
            ],
            "gateways": ["dynamodb", "s3"],
            "subnet_groups_allowed": ["backend", "private"],
        },
        "route_53_resolvers": {
            "subnet": "private",
        ...

The advantages of this approach are as follows:

  • No need to modify the AWS CDK constructor and build new code to change VPC configuration
  • Central point to manage VPC configuration
  • Traceability of changes and history of the configuration through Git
  • Redeploy all the infrastructure in a matter of minutes in other Regions or accounts

Operations and alerting

Roche has developed an automated alerting system that notifies the appropriate teams if any part of the end-to-end architecture encounters issues, with a focus on failures when loading data from AWS Glue or Amazon AppFlow. All logging is published to CloudWatch by default for debugging purposes.

The operational alerts have been built for the following workflow (a minimal sketch of the Lambda subscriber in steps 4 and 5 follows the list):

  1. AWS Glue jobs and Amazon AppFlow flows ingest data.
  2. If a job fails, it emits an event to a CloudWatch Events rule.
  3. The rule is triggered and invokes a Lambda function to send failure details to an Amazon Simple Notification Service (Amazon SNS) topic.
  4. The SNS topic has a Lambda subscriber that gets invoked:
    1. The Lambda function reads out specific webhook URLs from AWS Secrets Manager.
    2. The function fires off an alert to the specific external systems.
  5. The external systems receive the message and the appropriate parties are notified of the issue with details.
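
The following is a minimal sketch of that Lambda subscriber, assuming a hypothetical Secrets Manager secret ("alerting/webhooks") that stores a JSON map of webhook names to URLs; adapt the payload format to whatever your external systems expect.

import json
import boto3
import urllib3

http = urllib3.PoolManager()
secrets = boto3.client("secretsmanager")

# Hypothetical secret holding {"system_name": "https://webhook.example.com/...", ...}
WEBHOOK_SECRET_ID = "alerting/webhooks"


def handler(event, context):
    """SNS-subscribed Lambda function: forwards failure details to external webhooks."""
    # Step 4a: read the webhook URLs from AWS Secrets Manager.
    webhooks = json.loads(
        secrets.get_secret_value(SecretId=WEBHOOK_SECRET_ID)["SecretString"]
    )

    for record in event["Records"]:
        message = record["Sns"]["Message"]  # failure details published by the upstream Lambda

        # Step 4b: fire off the alert to each configured external system.
        for name, url in webhooks.items():
            http.request(
                "POST",
                url,
                body=json.dumps({"text": f"Data ingestion failure: {message}"}).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )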

The following architecture outlines the alerting mechanisms built for the lake house platform.

Conclusion

The GTM (go-to-market) domain has been successful in enabling its business stakeholders, data engineers, and data scientists by providing a platform that is extendable to the many use cases that Roche faces. It is a key enabler and an accelerator for the GTM organization in Roche. Through a modern data platform, Roche is now able to better understand customers and ultimately create and deliver valuable services that meet their needs. It extends beyond healthcare professionals (HCPs) to a larger healthcare ecosystem. The platform and infrastructure described in this post help to support and accelerate both internal and external stakeholders in their decision-making processes through actionable insights.

The steps in this post can help you plan to build a similar modern data strategy using AWS managed services to ingest data from sources like Salesforce, automatically create metadata catalogs and share data seamlessly between the data lake and data warehouse, and create alerts in the event of an orchestrated data workflow failure. In part 2 of this post, you learn about how the data warehouse was built using an agile data modeling pattern and how ELT jobs were quickly developed, orchestrated, and configured to perform automated data quality testing.

Special thanks go to the Roche team: Joao Antunes, Krzysztof Slowinski, Krzysztof Romanowski, Bartlomiej Zalewski, Wojciech Kostka, Patryk Szczesnowicz, Igor Tkaczyk, Kamil Piotrowski, Michalina Mastalerz, Jakub Lanski, Chun Wei Chan, Andrzej Dziabowski for their project delivery and support with this post.


About The Authors

Dr. Yannick Misteli, Roche – Dr. Yannick Misteli is leading cloud platform and ML engineering teams in global product strategy (GPS) at Roche. He is passionate about infrastructure and operationalizing data-driven solutions, and he has broad experience in driving business value creation through data analytics.

Simon Dimaline, AWS – Simon Dimaline has specialised in data warehousing and data modelling for more than 20 years. He currently works for the Data & Analytics team within AWS Professional Services, accelerating customers’ adoption of AWS analytics services.

Matt Noyce, AWS – Matt Noyce is a Senior Cloud Application Architect in Professional Services at Amazon Web Services. He works with customers to architect, design, automate, and build solutions on AWS for their business needs.

Chema Artal Banon, AWS – Chema Artal Banon is a Security Consultant at AWS Professional Services and he works with AWS’s customers to design, build, and optimize their security to drive business. He specializes in helping companies accelerate their journey to the AWS Cloud in the most secure manner possible by helping customers build the confidence and technical capability.

A special Thank You goes out to the following people whose expertise made this post possible from AWS:

  • Thiyagarajan Arumugam – Principal Analytics Specialist Solutions Architect
  • Taz Sayed – Analytics Tech Leader
  • Glenith Paletta – Enterprise Service Manager
  • Mike Murphy – Global Account Manager
  • Natacha Maheshe – Senior Product Marketing Manager
  • Derek Young – Senior Product Manager
  • Jamie Campbell – Amazon AppFlow Product Manager
  • Kamen Sharlandjiev – Senior Solutions Architect – Amazon AppFlow
  • Sunil Jethwani – Principal Customer Delivery Architect
  • Vinay Shukla – Amazon Redshift Principal Product Manager
  • Nausheen Sayed – Program Manager

What’s new in Amazon Redshift – 2021, a year in review

Post Syndicated from Manan Goel original https://aws.amazon.com/blogs/big-data/whats-new-in-amazon-redshift-2021-a-year-in-review/

Amazon Redshift is the cloud data warehouse of choice for tens of thousands of customers who use it to analyze exabytes of data to gain business insights. Customers have asked for more capabilities in Redshift to make it easier, faster, and more secure to store, process, and analyze all of their data. We announced Redshift in 2012 as the first cloud data warehouse to remove the complexity around provisioning, managing, and scaling data warehouses. Since then, we have launched capabilities such as Concurrency Scaling, Redshift Spectrum, and RA3 nodes to help customers analyze all of their data and support growing analytics demands across all users in the organization. We continue to innovate with Redshift on our customers' behalf and launched more than 50 significant features in 2021. This post covers some of those features, including use cases and benefits.

Working backwards from customer requirements, we are investing in Redshift to bring out new capabilities in three main areas:

  1. Easy analytics for everyone
  2. Analyze all of your data
  3. Performance at any scale

Customers told us that the data warehouse users in their organizations are expanding from administrators, developers, analysts, and data scientists to the Line of Business (LoB) users, so we continue to invest to make Redshift easier to use for everyone. Customers also told us that they want to break free from data silos and access data across their data lakes, databases, and data warehouses and analyze that data with SQL and machine learning (ML). So we continue to invest in letting customers analyze all of their data. And finally, customers told us that they want the best price performance for analytics at any scale from Terabytes to Petabytes of data. So we continue to bring out new capabilities for performance at any scale. Let’s dive into each of these pillars and cover the key capabilities that we launched in 2021.

Amazon Redshift key innovations

Redshift delivers easy analytics for everyone

Easy analytics for everyone requires a simpler getting-started experience, automated manageability, and visual user interfaces that make it easier, simpler, and faster for both technical and non-technical users to quickly get started, operate, and analyze data in a data warehouse. We launched new features such as Redshift Serverless (in preview), Query Editor V2, and automated materialized views (in preview), and enhanced the Data API in 2021 to make it easier for customers to run their data warehouses.

Redshift Serverless (in preview) makes it easy to run and scale analytics in seconds without having to provision and manage data warehouse clusters. The serverless option lets all users, including data analysts, developers, business users, and data scientists use Redshift to get insights from data in seconds by simply loading and querying data into the data warehouse. Customers can launch a data warehouse and start analyzing the data with the Redshift Serverless option through just a few clicks in the AWS Management Console. There is no need to choose node types, node count, or other configurations. Customers can take advantage of pre-loaded sample data sets along with sample queries to kick start analytics immediately. They can create databases, schemas, tables, and load their own data from their desktop, Amazon Simple Storage Service (S3), via Amazon Redshift data shares, or restore an existing Amazon Redshift provisioned cluster snapshot. They can also directly query data in open formats, such as Parquet or ORC, in their Amazon S3 data lakes, as well as data in their operational databases, such as Amazon Aurora and Amazon RDS. Customers pay only for what they use, and they can manage their costs with granular cost controls.

Redshift Query Editor V2 is a web-based tool for data analysts, data scientists, and database developers to explore, analyze, and collaborate on data in Redshift data warehouses and the data lake. Customers can use Query Editor V2's visual interface to create and browse schemas and tables, load data, author SQL queries and stored procedures, and visualize query results with charts. They can share and collaborate on queries and analyses, as well as track changes with built-in version control. Query Editor V2 also supports SQL Notebooks (in preview), which provides a new notebook interface that lets users such as data analysts and data scientists author queries, organize multiple SQL queries and annotations on a single document, and collaborate with their team members by sharing notebooks.

Amazon Redshift Query Editor V2

Customers have long used Amazon Redshift materialized views (MV) for precomputed result sets, based on an SQL query over one or more base tables to improve query performance, particularly for frequently used queries such as those in dashboards and reports. In 2021, we launched Automated Materialized View (AutoMV) in preview to improve the performance of queries (reduce the total execution time) without any user effort by automatically creating and maintaining materialized views. Customers told us that while MVs offer significant performance benefits, analyzing the schema, data, and workload to determine which queries might benefit from having an MV or which MVs are no longer beneficial and should be dropped requires knowledge, time, and effort. AutoMV lets Redshift continually monitor the cluster to identify candidate MVs and evaluates the benefits vs costs. It creates MVs that have high benefit-to-cost ratios, while ensuring existing workloads are not negatively impacted by this process. AutoMV continually monitors the system and will drop MVs that are no longer beneficial. All of these are transparent to users and applications. Applications such as dashboards benefit without any code change thanks to automatic query re-write, which lets existing queries benefit from MVs even when not explicitly referenced. Customers can also set the MVs to autorefresh so that MVs always have up-to-date data for added convenience.
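
For materialized views that customers create themselves, auto refresh is a single clause at creation time. The following is a small illustrative example submitted through the Amazon Redshift Data API; the cluster, secret, table, and view names are placeholders.

import boto3

client = boto3.client("redshift-data")

# AUTO REFRESH YES tells Amazon Redshift to keep the precomputed results up to date
# as the base table changes, with no ongoing user effort.
create_mv = """
CREATE MATERIALIZED VIEW daily_sales_mv
AUTO REFRESH YES
AS
SELECT sale_date, region, SUM(amount) AS total_sales
FROM sales
GROUP BY sale_date, region;
"""

client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:111111111111:secret:redshift-admin",
    Sql=create_mv,
)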

Customers have also asked us to simplify and automate data warehouse maintenance tasks, such as schema or table design, so that they can get optimal performance out of their clusters. Over the past few years, we have invested heavily to automate these maintenance tasks. For example, Automatic Table Optimization (ATO) selects the best sort and distribution keys to determine the optimal physical layout of data to maximize performance. We’ve extended ATO to modify column compression encodings to achieve high performance and reduce storage utilization. We have also introduced various features, such as auto vacuum delete and auto analyze, over the past few years to make sure that customer data warehouses continue to operate at peak performance.

The Data API, which launched in 2020, has also seen major enhancements, such as multi-statement query execution, support for parameters to develop reusable code, and availability in more Regions in 2021, making it easier for customers to programmatically access data in Redshift. The Data API lets customers painlessly access Redshift data from all types of traditional, cloud-native, containerized, serverless web service-based, and event-driven applications. It simplifies data access, ingest, and egress from programming languages and platforms supported by the AWS SDK, such as Python, Go, Java, Node.js, PHP, Ruby, and C++. The Data API eliminates the need to configure drivers and manage database connections. Instead, customers can run SQL commands against an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so results can be retrieved later and are stored for 24 hours.

Finally in our easy analytics for everyone pillar, in 2021 we launched the Grafana Redshift Plugin to help customers gain a deeper understanding of their cluster’s performance. Grafana is a popular open-source tool for running analytics and monitoring systems online. The Grafana Redshift Plugin lets customers query system tables and views for the most complete set of operational metrics on their Redshift cluster. The Plugin is available in the Open Source Grafana repository, as well as in our Amazon Managed Grafana service. We also published a default in-depth operational dashboard to take advantage of this feature.

Redshift makes it possible for customers to analyze all of their data

Redshift gives customers the best of both data lakes and purpose-built data stores, such as databases and data warehouses. It enables customers to store any amount of data, at low cost, in open, standards-based data formats such as Parquet and JSON in data lakes, and run SQL queries against it without loading or transformation. Furthermore, it lets customers run complex analytic queries with high performance against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution. Redshift lets customers access live data from transactional databases as part of their business intelligence (BI) and reporting applications to enable operational analytics. Customers can break down data silos by seamlessly querying data in the data lakes, data warehouses, and databases; empower their teams to run analytics and ML using their preferred tool or technique; and manage who has access to data with the proper security and data governance controls. We launched new features in 2021, such as Data Sharing, AWS Data Exchange integration, and Redshift ML, to make it easier for customers to analyze all of their data.

Amazon Redshift data sharing lets customers extend the ease of use, performance, and cost benefits that Amazon Redshift offers in a single cluster to multi-cluster deployments while being able to share data. It enables instant, granular, and fast data access across Amazon Redshift clusters without the need to copy or move data around. Data sharing provides live access to data, so your users always see the most up-to-date and consistent information as it's updated in the data warehouse. Customers can securely share live data with Amazon Redshift clusters in the same or different AWS accounts, within the same Region or across Regions. Data sharing includes several performance enhancements, such as result caching and concurrency scaling, which allow customers to support a broader set of analytics applications and meet critical performance SLAs when querying shared data. Customers can use data sharing for use cases such as workload isolation and chargeback, as well as secure and governed collaboration within and across teams and with external parties.
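
The following SQL is a minimal sketch of how a datashare is typically set up; the datashare name, table, and namespace GUIDs are hypothetical placeholders.

-- On the producer cluster: create a datashare and add objects to it.
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.web_sales;

-- Grant the datashare to the consumer cluster's namespace (GUID is hypothetical).
GRANT USAGE ON DATASHARE sales_share
TO NAMESPACE 'a1b2c3d4-5678-90ab-cdef-EXAMPLE11111';

-- On the consumer cluster: create a database from the datashare and query it live.
CREATE DATABASE sales_db FROM DATASHARE sales_share
OF NAMESPACE 'f0e1d2c3-4567-89ab-cdef-EXAMPLE22222';

SELECT COUNT(*) FROM sales_db.public.web_sales;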

Customers also asked us to help them with internal or external data marketplaces so that they can enable use cases such as data as a service and onboard third-party data. We launched the public preview of AWS Data Exchange for Amazon Redshift, a new feature that enables customers to find and subscribe to third-party data in AWS Data Exchange that they can query in an Amazon Redshift data warehouse in minutes. Data providers can list and offer products containing Amazon Redshift datasets in the AWS Data Exchange catalog, granting subscribers direct, read-only access to the data stored in Amazon Redshift. This feature empowers customers to quickly query, analyze, and build applications with these third-party datasets. AWS Data Exchange for Amazon Redshift lets customers combine third-party data found on AWS Data Exchange with their own first-party data in their Amazon Redshift cloud data warehouse, with no ETL required. Because customers query provider data warehouses directly, they can be certain that they are using the latest data being offered. Additionally, entitlement, billing, and payment management are all automated: access to Amazon Redshift data is granted when a data subscription starts and is removed when it ends, invoices are automatically generated, and payments are automatically collected and disbursed through AWS Marketplace.

Customers also asked for our help to make it easy to train and deploy ML models such as prediction, natural language processing, object detection, and image classification directly on top of the data in purpose-built data stores without having to perform complex data movement or learn new tools. We launched Redshift ML earlier this year to enable customers to create, train, and deploy ML models using familiar SQL commands. Amazon Redshift ML lets customers leverage Amazon SageMaker, a fully managed ML service, without moving their data or learning new skills. Furthermore, Amazon Redshift ML powered by Amazon SageMaker lets customers use SQL statements to create and train ML models from their data in Amazon Redshift, and then use these models for use cases such as churn prediction and fraud risk scoring directly in their queries and reports. Amazon Redshift ML automatically discovers the best model and tunes it based on training data using Amazon SageMaker Autopilot. SageMaker Autopilot chooses between regression, binary, or multi-class classification models. Alternatively, customers can choose a specific model type such as extreme gradient boosting (XGBoost) or multilayer perceptron (MLP), a problem type like regression or classification, and preprocessors or hyperparameters. Amazon Redshift ML uses customer parameters to build, train, and deploy the model in the Amazon Redshift data warehouse. Customers can obtain predictions from these trained models using SQL queries as if they were invoking a user-defined function (UDF), and leverage all of the benefits of Amazon Redshift, including massively parallel processing capabilities. Customers can also import their pre-trained SageMaker Autopilot, XGBoost, or MLP models into their Amazon Redshift cluster for local inference. Redshift ML supports both supervised and unsupervised ML for advanced analytics use cases ranging from forecasting to personalization.
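
As a hedged sketch of that workflow, the following SQL trains a churn model and then calls the resulting prediction function in a query; the customer_activity table, its columns, and the S3 bucket are hypothetical.

-- Train a model on labeled data; Redshift ML and SageMaker Autopilot pick the algorithm.
CREATE MODEL customer_churn_model
FROM (SELECT age, monthly_charges, tenure_months, churned
      FROM customer_activity)
TARGET churned
FUNCTION predict_churn
IAM_ROLE default
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');

-- Use the deployed model like a UDF directly in SQL.
SELECT customer_id
FROM customer_activity_recent
WHERE predict_churn(age, monthly_charges, tenure_months) = 1;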

Customers want to combine live data from operational databases with the data in their Amazon Redshift data warehouse and their Amazon S3 data lake environment to get unified analytics views across all of the data in the enterprise. We launched Amazon Redshift federated query to let customers incorporate live data from transactional databases as part of their BI and reporting applications to enable operational analytics. The intelligent optimizer in Amazon Redshift pushes down and distributes a portion of the computation directly into the remote operational databases to help speed up performance by reducing the data moved over the network. Amazon Redshift complements subsequent execution of the query by leveraging its massively parallel processing capabilities for further speedup. Federated query also makes it easier to ingest data into Amazon Redshift by letting customers query operational databases directly, apply transformations on the fly, and load data into the target tables without requiring complex ETL pipelines. In 2021, we added support for Amazon Aurora MySQL and Amazon RDS for MySQL databases, in addition to the existing Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL support, to enable customers to access more data sources for richer analytics.
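
The following SQL sketches a federated query setup against a MySQL source; the endpoint, secret ARN, and table names are hypothetical.

-- Map a hypothetical operational MySQL database into Amazon Redshift.
CREATE EXTERNAL SCHEMA mysql_ops
FROM MYSQL
DATABASE 'ordersdb'
URI 'orders-instance.abc123xyz.us-east-1.rds.amazonaws.com'
IAM_ROLE default
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:orders-db-secret';

-- Join live operational rows with a warehouse dimension table.
SELECT o.order_id, o.status, d.customer_segment
FROM mysql_ops.orders o
JOIN dim_customer d ON o.customer_id = d.customer_id;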

Finally, in our analyze all your data pillar, in 2021 we added data types such as SUPER, GEOGRAPHY, and VARBYTE to enable customers to store semi-structured data natively in the Redshift data warehouse so that they can analyze all of their data at scale and with performance. The SUPER data type lets customers ingest and store JSON and semi-structured data in their Amazon Redshift data warehouses. Amazon Redshift also includes support for PartiQL for SQL-compatible access to relational, semi-structured, and nested data. Using the SUPER data type and PartiQL in Amazon Redshift, customers can perform advanced analytics that combine classic structured SQL data (such as string, numeric, and timestamp) with semi-structured SUPER data (such as JSON) with superior performance, flexibility, and ease of use. The GEOGRAPHY data type builds on Redshift's support for spatial analytics, opening up support for many more third-party spatial and GIS applications. It complements the GEOMETRY data type and the more than 70 spatial functions that are already available in Redshift. The GEOGRAPHY data type is used in queries requiring higher-precision results for spatial data with geographic features that can be represented with a spheroid model of the Earth and referenced using latitude and longitude as a spatial coordinate system. VARBYTE is a variable-size data type for storing and representing variable-length binary strings.
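
The following SQL is a small sketch of the SUPER data type and PartiQL navigation; the clickstream table and the JSON shape are hypothetical.

-- A hypothetical table with a SUPER column for semi-structured payloads.
CREATE TABLE clickstream (
    event_id BIGINT,
    event_ts TIMESTAMP,
    payload  SUPER
);

-- Ingest a JSON document into the SUPER column.
INSERT INTO clickstream
SELECT 1, '2021-12-01 10:15:00',
       JSON_PARSE('{"customer": {"id": 42, "plan": "pro"}, "event_type": "click"}');

-- PartiQL dot navigation over the semi-structured payload.
SELECT event_id,
       payload.customer.id AS customer_id,
       payload.event_type  AS event_type
FROM clickstream
WHERE payload.customer.plan = 'pro';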

Redshift delivers performance at any scale

Since we announced Amazon Redshift in 2012, performance at any scale has been a foundational tenet for us to deliver value to the tens of thousands of customers who trust us every day to gain business insights from their data. Our customers span all industries and sizes, from startups to Fortune 500 companies, and we work to deliver the best price performance for any use case. Over the years, we have launched features such as dynamically adding cluster capacity when you need it with concurrency scaling, making sure that you use cluster resources efficiently with automatic workload management (WLM), and automatically adjusting data layout, distribution keys, and query plans to provide optimal performance for a given workload. In 2021, we launched capabilities such as AQUA, concurrency scaling for writes, and further enhancements to RA3 nodes to continue to improve Redshift's price performance.

We introduced the RA3 node types in 2019 as a technology that allows the independent scaling of compute and storage. We also described how customers, including Codecademy, OpenVault, Yelp, and Nielsen, have taken advantage of Amazon Redshift RA3 nodes with managed storage to scale their cloud data warehouses and reduce costs. RA3 leverages Redshift Managed Storage (RMS) as its durable storage layer, which allows near-unlimited storage capacity because data is committed back to Amazon S3. This enabled new capabilities, such as Data Sharing and AQUA, where RMS is used as shared storage across multiple clusters. RA3 nodes are available in three sizes (16XL, 4XL, and XLPlus) to balance price and performance. In 2021, we launched single-node RA3 XLPlus clusters to help customers cost-effectively migrate their smaller data warehouse workloads to RA3s and take advantage of better price performance. We also introduced a self-service DS2 to RA3 Reserved Instance (RI) migration capability that lets RIs be converted at a flat cost between equivalent node types.

AQUA (Advanced Query Accelerator) for Amazon Redshift is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run an order of magnitude faster than other enterprise cloud data warehouses by automatically boosting certain query types. AQUA uses AWS-designed processors with AWS Nitro chips adapted to speed up data encryption and compression, and custom analytics processors, implemented in FPGAs, to accelerate operations such as scans, filtering, and aggregation. AQUA is available with the RA3.16xlarge, RA3.4xlarge, or RA3.xlplus nodes at no additional charge and requires no code changes.

Concurrency Scaling was launched in 2019 to handle spiky and unpredictable read workloads without having to pre-provision any capacity. Redshift offers one hour of free Concurrency Scaling for every 24 hours of usage that your main cluster is running. It also offers cost controls to monitor and limit your usage and associated costs for Concurrency Scaling. In addition to read queries, supporting write queries for ETL workloads has been a big customer ask. In 2021, we launched Concurrency Scaling support for write queries in preview, covering common operations such as INSERT, DELETE, UPDATE, and COPY, to handle unpredictable spikes in ETL workloads. If you are currently using Concurrency Scaling, this new capability is automatically enabled in your cluster. You can monitor your Concurrency Scaling usage using the Amazon Redshift console and get alerts on any usage exceeding your defined limits. You can also create, modify, and delete usage limits programmatically by using the AWS Command Line Interface (AWS CLI) and AWS API.

Finally, we continue to ensure that AWS has comprehensive security capabilities to satisfy the most demanding requirements, and Amazon Redshift continues to provide data security out of the box at no extra cost. We introduced new security features in 2021, such as cross-VPC support and default IAM roles, to continue to make Redshift more secure for customer workloads.

Summary

When it comes to making it easier, simpler, and faster for customers to analyze all of their data, velocity matters, and we are innovating at a rapid pace to bring new capabilities to Redshift. We continue to make Redshift features available in more AWS Regions worldwide to make sure that all customers have access to all capabilities. We have covered the key features above, and the complete list is available here. We look forward to seeing how you use some of these capabilities to continue innovating with data and analytics.


About the Author

Manan Goel is a Product Go-To-Market Leader for AWS Analytics Services, including Amazon Redshift and AQUA, at AWS. He has more than 25 years of experience and is well versed in databases, data warehousing, business intelligence, and analytics. Manan holds an MBA from Duke University and a BS in Electronics and Communications Engineering.

Introducing new features for Amazon Redshift COPY: Part 1

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/part-1-introducing-new-features-for-amazon-redshift-copy/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as high-performance business intelligence (BI) reporting, dashboarding applications, data exploration, and real-time analytics.

Loading data is a key process for any analytical system, including Amazon Redshift. Loading very large datasets can take a long time and consume a lot of computing resources. How your data is loaded can also affect query performance. You can use many different methods to load data into Amazon Redshift. One of the fastest and most scalable methods is to use the COPY command. This post dives into some of the recent enhancements made to the COPY command and how to use them effectively.

Overview of the COPY command

A best practice for loading data into Amazon Redshift is to use the COPY command. The COPY command loads data in parallel from Amazon Simple Storage Service (Amazon S3), Amazon EMR, Amazon DynamoDB, or multiple data sources on any remote hosts accessible through a Secure Shell (SSH) connection.

The COPY command reads and loads data in parallel from a file or multiple files in an S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files, in cases where the files are compressed. The COPY command appends the new input data to any existing rows in the target table. The COPY command can load data from Amazon S3 for the file formats AVRO, CSV, JSON, and TXT, and for columnar format files such as ORC and Parquet.
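
As a minimal sketch, the following COPY loads a hypothetical table from multiple gzip-compressed CSV parts under a common Amazon S3 prefix; the table, bucket, prefix, and role ARN are placeholders.

-- Parallel load from all objects under the prefix (hypothetical names).
COPY sales_fact
FROM 's3://my-analytics-bucket/sales/2021/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
CSV
GZIP;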

Use COPY with FILLRECORD

In situations where contiguous fields are missing at the end of some of the records in the data files being loaded, COPY reports an error indicating that there is a mismatch between the number of fields in the file being loaded and the number of columns in the target table. In some situations, columnar files (such as Parquet) that are produced by applications and ingested into Amazon Redshift via COPY may have additional fields added to the files (and new columns added to the target Amazon Redshift table) over time. In such cases, these files may have values absent for certain newly added fields. To load these files, you previously had to either preprocess the files to fill in values for the missing fields before loading them using the COPY command, or use Amazon Redshift Spectrum to read the files from Amazon S3 and then use INSERT INTO to load the data into the Amazon Redshift table.

With the FILLRECORD parameter, you can now load data files with a varying number of fields successfully in the same COPY command, as long as the target table has all columns defined. The FILLRECORD parameter addresses ease of use because you can now directly use the COPY command to load columnar files with varying fields into Amazon Redshift instead of achieving the same result with multiple steps.

With the FILLRECORD parameter, missing columns are loaded as NULLs. For text and CSV formats, if the missing column is a VARCHAR column, zero-length strings are loaded instead of NULLs. To load NULLs to VARCHAR columns from text and CSV, specify the EMPTYASNULL keyword. NULL substitution only works if the column definition allows NULLs.
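
The following COPY is a hedged sketch of loading a CSV file whose trailing columns are missing on some rows; the table, bucket, and file name are hypothetical. FILLRECORD fills the missing columns, and EMPTYASNULL makes sure VARCHAR columns receive NULLs rather than zero-length strings.

COPY customer_staging
FROM 's3://my-analytics-bucket/incoming/customers.csv'
IAM_ROLE default
CSV
FILLRECORD
EMPTYASNULL;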

Use FILLRECORD while loading Parquet data from Amazon S3

In this section, we demonstrate the utility of FILLRECORD by using a Parquet file that has a smaller number of fields populated than the number of columns in the target Amazon Redshift table. First we try to load the file into the table without the FILLRECORD parameter in the COPY command, then we use the FILLRECORD parameter in the COPY command.

For the purpose of this demonstration, we have created the following components:

  • An Amazon Redshift cluster with a database, public schema, awsuser as admin user, and an AWS Identity and Access Management (IAM) role, used to perform the COPY command to load the file from Amazon S3, attached to the Amazon Redshift cluster. For details on authorizing Amazon Redshift to access other AWS services, refer to Authorizing Amazon Redshift to access other AWS services on your behalf.
  • An Amazon Redshift table named call_center_parquet.
  • A Parquet file already uploaded to an S3 bucket from where the file is copied into the Amazon Redshift cluster.

The following code is the definition of the call_center_parquet table:

DROP TABLE IF EXISTS public.call_center_parquet;
CREATE TABLE IF NOT EXISTS public.call_center_parquet
(
	cc_call_center_sk INTEGER NOT NULL ENCODE az64
	,cc_call_center_id varchar(100) NOT NULL  ENCODE lzo
	,cc_rec_start_date VARCHAR(50)   ENCODE lzo
	,cc_rec_end_date VARCHAR(50)   ENCODE lzo
	,cc_closed_date_sk varchar (100)   ENCODE lzo
	,cc_open_date_sk INTEGER   ENCODE az64
	,cc_name VARCHAR(50)   ENCODE lzo
	,cc_class VARCHAR(50)   ENCODE lzo
	,cc_employees INTEGER   ENCODE az64
	,cc_sq_ft INTEGER   ENCODE az64
	,cc_hours VARCHAR(20)   ENCODE lzo
	,cc_manager VARCHAR(40)   ENCODE lzo
	,cc_mkt_id INTEGER   ENCODE az64
	,cc_mkt_class VARCHAR(50)   ENCODE lzo
	,cc_mkt_desc VARCHAR(100)   ENCODE lzo
	,cc_market_manager VARCHAR(40)   ENCODE lzo
	,cc_division INTEGER   ENCODE az64
	,cc_division_name VARCHAR(50)   ENCODE lzo
	,cc_company INTEGER   ENCODE az64
	,cc_company_name VARCHAR(50)   ENCODE lzo
	,cc_street_number INTEGER   ENCODE az64
	,cc_street_name VARCHAR(60)   ENCODE lzo
	,cc_street_type VARCHAR(15)   ENCODE lzo
	,cc_suite_number VARCHAR(10)   ENCODE lzo
	,cc_city VARCHAR(60)   ENCODE lzo
	,cc_county VARCHAR(30)   ENCODE lzo
	,cc_state CHAR(2)   ENCODE lzo
	,cc_zip INTEGER   ENCODE az64
	,cc_country VARCHAR(20)   ENCODE lzo
	,cc_gmt_offset NUMERIC(5,2)   ENCODE az64
	,cc_tax_percentage NUMERIC(5,2)   ENCODE az64
)
DISTSTYLE ALL
;
ALTER TABLE public.call_center_parquet OWNER TO awsuser;

The table has 31 columns.

The Parquet file doesn’t contain any value for the cc_gmt_offset and cc_tax_percentage fields. It has 29 columns. The following screenshot shows the schema definition for the Parquet file located in Amazon S3, which we load into Amazon Redshift.

We ran the COPY command two different ways: first without the FILLRECORD parameter, then with it.

We first tried to load the Parquet file into the call_center_parquet table without the FILLRECORD parameter:

COPY call_center_parquet
FROM 's3://*****************/parquet/part-00000-d9a3ab22-9d7d-439a-b607-2ddc2d39c5b0-c000.snappy.parquet'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole'
FORMAT PARQUET;

It generated an error while performing the copy.

Next, we tried to load the Parquet file into the call_center_parquet table and used the FILLRECORD parameter:

COPY call_center_parquet
FROM 's3://*****************/parquet/part-00000-d9a3ab22-9d7d-439a-b607-2ddc2d39c5b0-c000.snappy.parquet'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole'
FORMAT PARQUET FILLRECORD;

The Parquet data was loaded successfully in the call_center_parquet table, and NULL was entered into the cc_gmt_offset and cc_tax_percentage columns.
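
A quick check such as the following (a sketch using the table defined earlier) confirms that the two columns absent from the Parquet file were populated with NULLs:

SELECT cc_call_center_sk, cc_gmt_offset, cc_tax_percentage
FROM public.call_center_parquet
LIMIT 10;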

Split large text files while copying

The second new feature we discuss in this post is automatically splitting large files to take advantage of the massive parallelism of the Amazon Redshift cluster. A best practice when using the COPY command in Amazon Redshift is to load data using a single COPY command from multiple data files. This loads data in parallel by dividing the workload among the nodes and slices in the Amazon Redshift cluster. When all the data comes from a single file or a small number of large files, Amazon Redshift is forced to perform a much slower serialized load, because the COPY command can't utilize the parallelism of the cluster. Previously, you had to write additional preprocessing steps to split the large files into smaller files so that the COPY command could load data in parallel into the Amazon Redshift cluster.

The COPY command now supports automatically splitting a single file into multiple smaller scan ranges. This feature is currently supported only for large uncompressed delimited text files. More file formats and options, such as COPY with the CSV keyword, will be added in the near future.

This helps improve performance for COPY queries when loading a small number of large uncompressed delimited text files into your Amazon Redshift cluster. Scan ranges are implemented by splitting the files into 64 MB chunks, which get assigned to each Amazon Redshift slice. This change addresses ease of use because you don’t need to split large uncompressed text files as an additional preprocessing step.

With Amazon Redshift's ability to split large uncompressed text files, you can see performance improvements for the COPY command with a single large file or a few files with significantly varying relative sizes (for example, one file 5 GB in size and 20 files of a few KBs each). Performance improvements for the COPY command are more significant as the file size increases, even with the same Amazon Redshift cluster configuration. Based on our tests, we observed a more than 1,500% performance improvement for the COPY command when loading a 6 GB uncompressed text file with the auto split feature.

There are no changes in the COPY query or keywords to enable this change, and splitting of files is automatically applied for the eligible COPY commands. Splitting isn’t applicable for the COPY query with the keywords CSV, REMOVEQUOTES, ESCAPE, and FIXEDWIDTH.

For the test, we used a single 6 GB uncompressed text file and the following COPY command:

COPY store_sales
FROM 's3://*****************/text/store_sales.txt'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole';

The Amazon Redshift cluster without the auto split option took 102 seconds to copy the file from Amazon S3 to the Amazon Redshift store_sales table. When the auto split option was enabled in the Amazon Redshift cluster (without any other configuration changes), the same 6 GB uncompressed text file took just 6.19 seconds to copy the file from Amazon S3 to the store_sales table.

Summary

In this post, we showed two enhancements to the Amazon Redshift COPY command. First, we showed how you can add the FILLRECORD parameter to the COPY command in order to successfully load data files even when contiguous fields are missing at the end of some of the records, as long as the target table has all the columns. Second, we described how Amazon Redshift automatically splits large uncompressed text files into 64 MB chunks before copying the files into the Amazon Redshift cluster to enhance COPY performance. This automatic splitting of large files lets you use the COPY command on large uncompressed text files without adding a preprocessing step to split the files yourself. Try these features to make your data loading to Amazon Redshift much simpler by removing custom preprocessing steps.

In Part 2 of this series, we will discuss additional new features of the Amazon Redshift COPY command and demonstrate how you can take advantage of them to optimize your data loading process.


About the Authors

Dipankar Kushari is a Senior Analytics Solutions Architect with AWS.

Cody Cunningham is a Software Development Engineer with AWS, working on data ingestion for Amazon Redshift.

Joe Yong is a Senior Technical Product Manager on the Amazon Redshift team and a relentless pursuer of making complex database technologies easy and intuitive for the masses. He has worked on database and data management systems for SMP, MPP, and distributed systems. Joe has shipped dozens of features for on-premises and cloud-native databases that serve IoT devices through petabyte-sized cloud data warehouses. Off keyboard, Joe tries to onsight 5.11s, hunt for good eats, and seek a cure for his Australian Labradoodle’s obsession with squeaky tennis balls.

Anshul Purohit is a Software Development Engineer with AWS, working on data ingestion and query processing for Amazon Redshift.

Use the default IAM role in Amazon Redshift to simplify accessing other AWS services

Post Syndicated from Nita Shah original https://aws.amazon.com/blogs/big-data/use-the-default-iam-role-in-amazon-redshift-to-simplify-accessing-other-aws-services/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse, and can expand to petabyte scale. Today, tens of thousands of AWS customers use Amazon Redshift to run mission-critical business intelligence dashboards, analyze real-time streaming data, and run predictive analytics jobs.

Many features in Amazon Redshift access other services, for example, when loading data from Amazon Simple Storage Service (Amazon S3). This requires you to create an AWS Identity and Access Management (IAM) role and grant that role to the Amazon Redshift cluster. Historically, this has required some degree of expertise to set up access configuration with other AWS services. For details about IAM roles and how to use them, see Create an IAM role for Amazon Redshift.

This post discusses the introduction of the default IAM role, which simplifies the use of other services such as Amazon S3, Amazon SageMaker, AWS Lambda, Amazon Aurora, and AWS Glue by allowing you to create an IAM role from the Amazon Redshift console and assign it as the default IAM role to new or existing Amazon Redshift cluster. The default IAM role simplifies SQL operations that access other AWS services (such as COPY, UNLOAD, CREATE EXTERNAL FUNCTION, CREATE EXTERNAL SCHEMA, CREATE MODEL, or CREATE LIBRARY) by eliminating the need to specify the Amazon Resource Name (ARN) for the IAM role.

Overview of solution

The Amazon Redshift SQL commands for COPY, UNLOAD, CREATE EXTERNAL FUNCTION, CREATE EXTERNAL TABLE, CREATE EXTERNAL SCHEMA, CREATE MODEL, and CREATE LIBRARY have historically required the role ARN to be passed as an argument. Usually, these roles and accesses are set up by admin users. Most data analysts and data engineers using these commands aren't authorized to view cluster authentication details. To eliminate the need to specify the ARN for the IAM role, Amazon Redshift now provides a new managed IAM policy, AmazonRedshiftAllCommandsFullAccess, which has the required privileges to use other related services such as Amazon S3, SageMaker, Lambda, Aurora, and AWS Glue. This policy is used for creating the default IAM role via the Amazon Redshift console. End users can use the default IAM role by specifying IAM_ROLE with the DEFAULT keyword. When you use the Amazon Redshift console to create IAM roles, Amazon Redshift keeps track of all IAM roles created and preselects the most recent default role for all new cluster creations and restores from snapshots.

The Amazon Redshift default IAM role simplifies authentication and authorization with the following benefits:

  • It allows users to run SQL commands without providing the IAM role’s ARN
  • It avoids the need to use multiple AWS Management Console pages to create the Amazon Redshift cluster and IAM role
  • You don't need to reconfigure default IAM roles every time Amazon Redshift introduces a new feature that requires additional permissions, because Amazon Redshift can modify or extend the AWS managed policy attached to the default IAM role as required

To demonstrate this, first we create an IAM role through the Amazon Redshift console that has a policy with permissions to run SQL commands such as COPY, UNLOAD, CREATE EXTERNAL FUNCTION, CREATE EXTERNAL TABLE, CREATE EXTERNAL SCHEMA, CREATE MODEL, or CREATE LIBRARY. We also demonstrate how to make an existing IAM role the default role, and remove a role as default. Then we show you how to use the default role with various SQL commands, and how to restrict access to the role.

Create a new cluster and set up the IAM default role

The default IAM role is supported in both Amazon Redshift clusters and Amazon Redshift Serverless (preview). To create a new cluster and configure our IAM role as the default role, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.

This page lists the clusters in your account in the current Region. A subset of properties of each cluster is also displayed.

  2. Choose Create cluster.
  3. Follow the instructions to enter the properties for cluster configuration.
  4. If you know the required size of your cluster (that is, the node type and number of nodes), choose I’ll choose.
  5. Choose the node type and number of nodes.

If you don’t know how large to size your cluster, choose Help me choose. Doing this starts a sizing calculator that asks you questions about the size and query characteristics of the data that you plan to store in your data warehouse.

  6. Follow the instructions to enter properties for database configurations.
  7. Under Associated IAM roles, on the Manage IAM roles menu, choose Create IAM role.
  8. To specify an S3 bucket for the IAM role to access, choose one of the following methods:
    a. Choose No additional S3 bucket to create the IAM role without specifying specific S3 buckets.
    b. Choose Any S3 bucket to allow users that have access to your Amazon Redshift cluster to also access any S3 bucket and its contents in your AWS account.
    c. Choose Specific S3 buckets to specify one or more S3 buckets that the IAM role being created has permission to access. Then choose one or more S3 buckets from the table.
  9. Choose Create IAM role as default.

Amazon Redshift automatically creates and sets the IAM role as the default for your cluster.

  10. Choose Create cluster to create the cluster.

The cluster might take several minutes to be ready to use. You can verify the new default IAM role under Cluster permissions.

You can only have one IAM role set as the default for the cluster. If you attempt to create another IAM role as the default for the cluster when an existing IAM role is currently assigned as the default, the new IAM role replaces the other IAM role as default.

Make an existing IAM role the default for your new or existing cluster

You can also attach an existing role to the cluster and make it the default IAM role for more granular control of permissions with customized managed policies.

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose the cluster you want to associate IAM roles with.
  3. Under Associated IAM roles, on the Manage IAM roles menu, choose Associated IAM roles.
  4. Select an IAM role that you want to make the default for the cluster.
  5. Choose Associate IAM roles.
  6. Under Associated IAM roles, on the Set default menu, choose Make default.
  7. When prompted, choose Set default to confirm making the specified IAM role the default.
  8. Choose Confirm.

Your IAM role is now listed as default.

Make an IAM role no longer default for your cluster

You can make an IAM role no longer the default role by changing the cluster permissions.

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose the cluster that you want to associate IAM roles with.
  3. Under Associated IAM roles, select the default IAM role.
  4. On the Set default menu, choose Clear default.
  5. When prompted, choose Clear default to confirm.

Use the default IAM role to run SQL commands

Now we demonstrate how to use the default IAM role in SQL commands like COPY, UNLOAD, CREATE EXTERNAL FUNCTION, CREATE EXTERNAL TABLE, CREATE EXTERNAL SCHEMA, and CREATE MODEL using Amazon Redshift ML.

To run SQL commands, we use Amazon Redshift Query Editor V2, a web-based tool that you can use to explore, analyze, share, and collaborate on data stored on Amazon Redshift. It supports data warehouses on Amazon Redshift and data lakes through Amazon Redshift Spectrum. However, you can use the default IAM role with any tools of your choice.

For additional information, see Introducing Amazon Redshift Query Editor V2, a Free Web-based Query Authoring Tool for Data Analysts.

First verify the cluster is using the default IAM role, as shown in the following screenshot.

Load data from Amazon S3

The SQL in the following screenshot describes how to load data from Amazon S3 using the default IAM role.
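
Because the screenshot isn't reproduced here, the following is a minimal sketch of such a load; the table, S3 path, and file format are hypothetical. The key point is that IAM_ROLE default replaces the role ARN.

COPY public.orders
FROM 's3://my-analytics-bucket/orders/'
IAM_ROLE default
FORMAT AS CSV
IGNOREHEADER 1;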

Unload data to Amazon S3

With an Amazon Redshift lake house architecture, you can query data in your data lake and write data back to your data lake in open formats using the UNLOAD command. After the data files are in Amazon S3, you can share the data with other services for further processing.

The SQL in the following screenshot describes how to unload data to Amazon S3 using the default IAM role.
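
Since the screenshot isn't shown here, the following is a sketch of such an unload using the default IAM role; the query, S3 path, and format are hypothetical.

UNLOAD ('SELECT order_id, order_date, total_amount FROM public.orders')
TO 's3://my-analytics-bucket/unload/orders_'
IAM_ROLE default
FORMAT AS PARQUET;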

Create an ML model

Redshift ML enables SQL users to create, train, and deploy machine learning (ML) models using familiar SQL commands. The SQL in the following screenshot describes how to build an ML model using the default IAM role. We use the Iris dataset from the UCI Machine Learning Repository.
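
The following CREATE MODEL statement is a sketch of that workflow with the default IAM role; the iris_data table, its columns, and the S3 bucket are hypothetical stand-ins for what the screenshot showed.

CREATE MODEL iris_species_model
FROM (SELECT sepal_length, sepal_width, petal_length, petal_width, species
      FROM public.iris_data)
TARGET species
FUNCTION predict_iris_species
IAM_ROLE default
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');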

Create an external schema and external table

Redshift Spectrum is a feature of Amazon Redshift that allows you to perform SQL queries on data stored in S3 buckets using external schema and external tables. This eliminates the need to move data from a storage service to a database, and instead directly queries data inside an S3 bucket. Redshift Spectrum also expands the scope of a given query because it extends beyond a user’s existing Amazon Redshift data warehouse nodes and into large volumes of unstructured S3 data lakes.

The following SQL describes how to use the default IAM role in the CREATE EXTERNAL SCHEMA command. For more information, see Querying external data using Amazon Redshift Spectrum

For security reasons, the default IAM role requires redshift as part of the catalog database name, or resources tagged with the Amazon Redshift service tag. You can customize the policy attached to the default role per your security requirements. In the following example, we use the AWS Glue Data Catalog name redshift_data.
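
The following SQL is a sketch of that example; the external table definition and S3 location are hypothetical, while the Data Catalog database name redshift_data matches the note above.

-- Create an external schema with the default IAM role.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'redshift_data'
IAM_ROLE default
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- A hypothetical external table over Parquet files in Amazon S3.
CREATE EXTERNAL TABLE spectrum_schema.sales_events (
    event_id   BIGINT,
    event_date DATE,
    amount     DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-analytics-bucket/spectrum/sales_events/';

SELECT event_date, SUM(amount) AS total_amount
FROM spectrum_schema.sales_events
GROUP BY event_date;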

Restrict access to the default IAM role

To control access privileges of the IAM role created and set it as default for your Amazon Redshift cluster, use the ASSUMEROLE privilege. This access control applies to database users and groups when they run commands such as COPY and UNLOAD. After you grant the ASSUMEROLE privilege to a user or group for the IAM role, the user or group can assume that role when running these commands. With the ASSUMEROLE privilege, you can grant access to the appropriate commands as required.
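
The following statements sketch this pattern; the role ARN and user name are hypothetical, and you should confirm the exact ASSUMEROLE syntax against the GRANT documentation for your cluster version.

-- One-time step: remove the implicit grant from PUBLIC before granting selectively.
REVOKE ASSUMEROLE ON ALL FROM PUBLIC FOR ALL;

-- Allow a specific user to assume the role only for COPY and UNLOAD.
GRANT ASSUMEROLE
ON 'arn:aws:iam::123456789012:role/RedshiftDefaultRole'
TO etl_user
FOR COPY, UNLOAD;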

Best practices

Amazon Redshift uses the AWS security frameworks to implement industry-leading security in the areas of authentication, access control, auditing, logging, compliance, data protection, and network security. For more information, refer to Security in Amazon Redshift and Security best practices in IAM.

Conclusion

This post showed you how the default IAM role simplifies SQL operations that access other AWS services by eliminating the need to specify the ARN for the IAM role. This new functionality helps make Amazon Redshift easier than ever to use, and reduces reliance on an administrator to wrangle these permissions.

As an administrator, you can start using the default IAM role to grant IAM permissions to your Redshift cluster and allow your end users, such as data analysts and developers, to use the default IAM role with their SQL commands without having to provide the ARN for the IAM role.


About the Authors

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Evgenii Rublev is a Software Development Engineer on the AWS Redshift team. He has worked on building end-to-end applications for over 10 years. He is passionate about innovations in building high-availability and high-performance applications to drive a better customer experience. Outside of work, Evgenii enjoys spending time with his family, traveling, and reading books.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Use unsupervised training with K-means clustering in Amazon Redshift ML

Post Syndicated from Phil Bates original https://aws.amazon.com/blogs/big-data/use-unsupervised-training-with-k-means-clustering-in-amazon-redshift-ml/

Amazon Redshift is the fastest, most widely used, fully managed, and petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then be used to generate insights for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies.

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. In previous posts, we covered how Amazon Redshift supports supervised learning that includes regression, binary classification, and multiclass classification, as well as training models using XGBoost and providing advanced options such as preprocessors, problem type, and hyperparameters.

In this post, we use Redshift ML to perform unsupervised learning on unlabeled training data using the K-means algorithm. This algorithm solves clustering problems where you want to discover groupings in the data. Unlabeled data is grouped and partitioned based on its similarities and differences. The K-means algorithm iteratively determines the best centroids and assigns each member to the closest centroid. Data points nearest the same centroid belong to the same group. Members of a group are as similar as possible to other members in the same group, and as different as possible from members of other groups. To learn more about K-means clustering, see K-means clustering with Amazon SageMaker.

Solution overview

The following are some use cases for K-means:

  • Ecommerce and retail – Segment your customers by purchase history, stores they visited, or clickstream activity.
  • Healthcare – Group similar images for image detection. For example, you can detect patterns for diseases or successful treatment scenarios.
  • Finance – Detect fraud by detecting anomalies in the dataset. For example, you can detect credit card fraud by abnormal purchase patterns.
  • Technology – Build a network intrusion detection system that aims to identify attacks or malicious activity.
  • Meteorology – Detect anomalies in sensor data collection such as storm forecasting.

In our example, we use K-means on the Global Database of Events, Language, and Tone (GDELT) dataset, which monitors news coverage around the world, with data recorded for every second of every day. This information is freely available as part of the Registry of Open Data on AWS.

The data is stored as multiple files on Amazon Simple Storage Service (Amazon S3), with two different formats: historical, which covers the years 1979–2013, and daily updates, which cover the years 2013 and later. For this example, we use the historical format and bring in 1979 data.

For our use case, we use a subset of the data’s attributes:

  • EventCode – The raw CAMEO action code describing the action that Actor1 performed upon Actor2.
  • NumArticles – The total number of source documents containing one or more mentions of this event. You can use this to assess the importance of an event. The more discussion of that event, the more likely it is to be significant.
  • AvgTone – The average tone of all documents containing one or more mentions of this event. The score ranges from -100 (extremely negative) to +100 (extremely positive). Common values range between -10 and +10, with 0 indicating neutral.
  • Actor1Geo_Lat – The centroid latitude of the Actor1 landmark for mapping.
  • Actor1Geo_Long – The centroid longitude of the Actor1 landmark for mapping.
  • Actor2Geo_Lat – The centroid latitude of the Actor2 landmark for mapping.
  • Actor2Geo_Long – The centroid longitude of the Actor2 landmark for mapping.

Each row corresponds to an event at a specific location. For example, rows 53-57 in the file 1979.csv, which we use below, all seem to refer to interactions between FRA and AFR, dealing with consultation and diplomatic relations with a mostly positive tone. It is hard, if not impossible, for us to make sense of such data at scale. Clusters of events, either with a similar tone, occurring in similar locations, or between similar actors, are useful in visualizing and interpreting the data. Clustering can also reveal non-obvious structures, such as potential common causes for different events, the propagation of a root event across the globe, or the change in tone toward a common event over time. However, we do not know what makes two events similar: is it the location, the two actors, the tone, the time, or some combination of these? Clustering algorithms can learn from data and determine (1) what makes different data points similar, (2) which data points are related to which other data points, and (3) what the common characteristics of these related data points are.

Prerequisites

To get started, we need an Amazon Redshift cluster with version 1.0.33433 or higher and an AWS Identity and Access Management (IAM) role attached that provides access to Amazon SageMaker and permissions to an S3 bucket.

For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

To create a simple cluster, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose Create cluster.
  3. Provide the configuration parameters such as cluster name, user name, and password.
  4. For Associated IAM roles, on the Manage IAM roles menu, choose Create IAM role.

If you have an existing role with the required parameters, you can choose Associate IAM roles.

  5. Select Specific S3 buckets and choose a bucket for storing the artifacts generated by Redshift ML.
  6. Choose Create IAM role as default.

A default IAM role is created for you and automatically associated with the cluster.

  7. Choose Create cluster.

Prepare the data

Load the GDELT data into Amazon Redshift using the following SQL. You can use the Amazon Redshift Query Editor v2 or your favorite SQL tool to run these commands.

To create the table, use the following commands:

DROP TABLE IF EXISTS gdelt_data CASCADE;

CREATE TABLE gdelt_data (
GlobalEventId   bigint,
SqlDate  bigint,
MonthYear bigint,
Year   bigint,
FractionDate double precision,
Actor1Code varchar(256),
Actor1Name varchar(256),
Actor1CountryCode varchar(256),
Actor1KnownGroupCode varchar(256),
Actor1EthnicCode varchar(256),
Actor1Religion1Code varchar(256),
Actor1Religion2Code varchar(256),
Actor1Type1Code varchar(256),
Actor1Type2Code varchar(256),
Actor1Type3Code varchar(256),
Actor2Code varchar(256),
Actor2Name varchar(256),
Actor2CountryCode varchar(256),
Actor2KnownGroupCode varchar(256),
Actor2EthnicCode varchar(256),
Actor2Religion1Code  varchar(256),
Actor2Religion2Code varchar(256),
Actor2Type1Code varchar(256),
Actor2Type2Code varchar(256),
Actor2Type3Code varchar(256),
IsRootEvent bigint,
EventCode bigint,
EventBaseCode bigint,
EventRootCode bigint,
QuadClass bigint,
GoldsteinScale double precision,
NumMentions bigint,
NumSources bigint,
NumArticles bigint,
AvgTone double precision,
Actor1Geo_Type bigint,
Actor1Geo_FullName varchar(256),
Actor1Geo_CountryCode varchar(256),
Actor1Geo_ADM1Code varchar(256),
Actor1Geo_Lat double precision,
Actor1Geo_Long double precision,
Actor1Geo_FeatureID bigint,
Actor2Geo_Type bigint,
Actor2Geo_FullName varchar(256),
Actor2Geo_CountryCode varchar(256),
Actor2Geo_ADM1Code varchar(256),
Actor2Geo_Lat double precision,
Actor2Geo_Long double precision,
Actor2Geo_FeatureID bigint,
ActionGeo_Type bigint,
ActionGeo_FullName varchar(256),
ActionGeo_CountryCode varchar(256),
ActionGeo_ADM1Code varchar(256),
ActionGeo_Lat double precision,
ActionGeo_Long double precision,
ActionGeo_FeatureID bigint,
DATEADDED bigint 
) ;

To load data into the table, use the following command:

COPY gdelt_data FROM 's3://gdelt-open-data/events/1979.csv'
region 'us-east-1' iam_role default csv delimiter '\t';  

Create a model in Redshift ML

When using the K-means algorithm, you must specify an input K that specifies the number of clusters to find in the data. The output of this algorithm is a set of K centroids, one for each cluster. Each data point belongs to one of the K clusters that is closest to it. Each cluster is described by its centroid, which can be thought of as a multi-dimensional representation of the cluster. The K-means algorithm compares the distances between centroids and data points to learn how different the clusters are from each other. A larger distance generally indicates a greater difference between the clusters.

Before we create the model, let’s examine the training data by running the following SQL code in Amazon Redshift Query Editor v2:

select AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long
from gdelt_data

The following screenshot shows our results.

We create a model with seven clusters from this data (see the following code). You can experiment by changing the K value and creating different models. The SageMaker K-means algorithm can obtain a good clustering with only a single pass over the data, so training runs are very fast.

CREATE MODEL news_data_clusters
FROM (select AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long
   from gdelt_data)
FUNCTION  news_monitoring_cluster
IAM_ROLE default
AUTO OFF
MODEL_TYPE KMEANS
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT (K '7')
SETTINGS (S3_BUCKET '<<your-amazon-s3-bucket-name>>');

For more information about model training, see Machine learning overview. For a list of other hyperparameters that K-means supports, see K-means Hyperparameters. For the full syntax of CREATE MODEL, see our documentation.

You can use the SHOW MODEL command to view the status of the model:

SHOW MODEL NEWS_DATA_CLUSTERS;

The results show that our model is in the READY state.

We can now run the query to identify the clusters. The following query shows the cluster associated with each GlobalEventId:

select globaleventid, news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as cluster 
from gdelt_data;

We get the following results.

Now let’s run a query to check the distribution of data across our clusters to see if seven is the appropriate cluster size for this dataset:

select events_cluster , count(*) as nbr_events  from   
(select globaleventid, news_monitoring_cluster( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster
from gdelt_data)
group by 1;

The results show that very few events are assigned to clusters 1 and 3.

Let’s try running the above query again after re-creating the model with nine clusters by changing the K value to 9.
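
To do that, we drop the existing model and re-create it with K set to 9; the statement below mirrors the earlier CREATE MODEL.

DROP MODEL news_data_clusters;

CREATE MODEL news_data_clusters
FROM (select AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long
   from gdelt_data)
FUNCTION news_monitoring_cluster
IAM_ROLE default
AUTO OFF
MODEL_TYPE KMEANS
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT (K '9')
SETTINGS (S3_BUCKET '<<your-amazon-s3-bucket-name>>');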

Using nine clusters helps smooth out the cluster sizes. The smallest is now approximately 11,000 and the largest is approximately 117,000, compared to 188,000 when using seven clusters.

Now, let’s run the following query to determine the centers of the clusters based on number of articles by event code:

select news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster, eventcode, sum(numArticles) as numArticles
from gdelt_data
group by 1,2;


Let’s run the following query to get more insights into the datapoints assigned to one of the clusters:

select news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster, eventcode, actor1name, actor2name, sum(numarticles) as totalarticles
from gdelt_data
where events_cluster = 5
and actor1name <> ' ' and actor2name <> ' '
group by 1,2,3,4
order by 5 desc

Observing the data points assigned to the clusters, we see clusters of events corresponding to interactions between the US and China (probably due to the establishment of diplomatic relations), between the US and Russia (probably corresponding to the SALT II Treaty), and those involving Iran (probably corresponding to the Iranian Revolution). Thus, clustering can help us make sense of the data and show us the way as we continue to explore and use it.

Conclusion

Redshift ML makes it easy for users of all skill levels to use ML technology. With no prior ML knowledge, you can use Redshift ML to gain business insights for your data. You can take advantage of ML approaches such as supervised and unsupervised learning to classify your labeled and unlabeled data, respectively. In this post, we walked you through how to perform unsupervised learning with Redshift ML by creating an ML model that uses the K-means algorithm to discover grouping in your data.

For more information about building different models, see Amazon Redshift ML.


About the Authors

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Akash Gheewala is a Solutions Architect at AWS. He helps global enterprises across the high tech industry in their journey to the cloud. He does this through his passion for accelerating digital transformation for customers and building highly scalable and cost-effective solutions in the cloud. Akash also enjoys mental models, creating content and vagabonding about the world.

Murali Narayanaswamy is a principal machine learning scientist in AWS. He received his PhD from Carnegie Mellon University and works at the intersection of ML, AI, optimization, learning and inference to combat uncertainty in real-world applications including personalization, forecasting, supply chains and large scale systems.

Integrate Etleap with Amazon Redshift Streaming Ingestion (preview) to make data available in seconds

Post Syndicated from Caius Brindescu original https://aws.amazon.com/blogs/big-data/integrate-etleap-with-amazon-redshift-streaming-ingestion-preview-to-make-data-available-in-seconds/

Amazon Redshift is a fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using SQL and your extract, transform, and load (ETL), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads.

Etleap is an AWS Advanced Technology Partner with the AWS Data & Analytics Competency and Amazon Redshift Service Ready designation. Etleap ETL removes the headaches experienced building data pipelines. A cloud-native platform that seamlessly integrates with AWS infrastructure, Etleap ETL consolidates data without the need for coding. Automated issue detection pinpoints problems so data teams can stay focused on business initiatives, not data pipelines.

In this post, we show how Etleap customers are integrating with the new streaming ingestion feature in Amazon Redshift (currently in limited preview) to load data directly from Amazon Kinesis Data Streams. This reduces load times from minutes to seconds and helps you gain faster data insights.

Amazon Redshift streaming ingestion with Kinesis Data Streams

Traditionally, you had to use Amazon Kinesis Data Firehose to land your stream into Amazon Simple Storage Service (Amazon S3) files and then employ a COPY command to move the data into Amazon Redshift. This method incurs latencies in the order of minutes.

Now, the native streaming ingestion feature in Amazon Redshift lets you ingest data directly from Kinesis Data Streams. The new feature enables you to ingest hundreds of megabytes of data per second and query it at exceptionally low latency—in many cases only 10 seconds after entering the data stream.

Configure Amazon Redshift streaming ingestion with SQL queries

Amazon Redshift streaming ingestion uses SQL to connect with one or more Kinesis data streams simultaneously. In this section, we walk through the steps to configure streaming ingestion.

Create an external schema

We begin by creating an external schema referencing Kinesis using syntax adapted from Redshift’s support for Federated Queries:

CREATE EXTERNAL SCHEMA MySchema
FROM Kinesis
IAM_ROLE { default | 'iam-role-arn' };

This external schema command creates an object inside Amazon Redshift that acts as a proxy to Kinesis Data Streams, specifically to the collection of data streams that are accessible via the AWS Identity and Access Management (IAM) role. You can use either the default Amazon Redshift cluster IAM role or a specified IAM role that has already been attached to the cluster.

Create a materialized view

You can use Amazon Redshift materialized views to materialize a point-in-time view of a Kinesis data stream, as accumulated up to the time it is queried. The following command creates a materialized view over a stream from the previously defined schema:

CREATE MATERIALIZED VIEW MyView AS
SELECT *
FROM MySchema.MyStream;

Note the use of the dot syntax to pick out the particular stream desired. The attributes of the stream include a timestamp field, partition key, sequence number, and a VARBYTE data payload.
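For example, after the view has been refreshed (covered in the following sections), a query along these lines can be used to inspect those attributes. This is an illustrative sketch; the attribute names shown are the ones used elsewhere in this post and may differ slightly in the preview.

SELECT ApproximateArrivalTimestamp,
       PartitionKey,
       SequenceNumber,
       from_varbyte(Data, 'utf-8') AS payload
FROM MyView
LIMIT 10;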

Although the previous materialized view definition simply performs a SELECT *, more sophisticated processing is possible, for instance, applying filtering conditions or shredding JSON data into columns. To demonstrate, consider the following Kinesis data stream with JSON payloads:

{
 "player" : "alice 127",
 "region" : "us-west-1",
 "action" : "entered shop"
}

The following materialized view shreds this JSON into columns, keeping only the entered shop action:

CREATE MATERIALIZED VIEW ShopEntrances AS
SELECT ApproximateArrivalTimestamp, SequenceNumber,
   json_extract_path_text(from_varbyte(Data, 'utf-8'), 'player') as Player,
   json_extract_path_text(from_varbyte(Data, 'utf-8'), 'region') as Region
FROM MySchema.Actions
WHERE json_extract_path_text(from_varbyte(Data, 'utf-8'), 'action') = 'entered shop';

On the Amazon Redshift leader node, the view definition is parsed and analyzed. On success, it is added to the system catalogs. No further communication with Kinesis Data Streams occurs until the initial refresh.

Refresh the materialized view

The following command pulls data from Kinesis Data Streams into Amazon Redshift:

REFRESH MATERIALIZED VIEW MyView;

You can initiate the refresh manually with the preceding SQL command or automatically via a scheduled query. In either case, it uses the IAM role associated with the stream. Each refresh is incremental and massively parallel, and it records its progress for each Kinesis shard in the system catalogs so that it's ready for the next round of refresh.

With this process, you can now query near-real-time data from your Kinesis data stream through Amazon Redshift.
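For example, with the ShopEntrances view defined earlier, a query like the following sketch (the five-minute window is an arbitrary illustration) returns the most recent shop entrances observed on the stream:

SELECT Player, Region, ApproximateArrivalTimestamp
FROM ShopEntrances
WHERE ApproximateArrivalTimestamp > DATEADD(minute, -5, GETDATE())
ORDER BY ApproximateArrivalTimestamp DESC;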

Use Amazon Redshift streaming ingestion with Etleap

Etleap pulls data from databases, applications, file stores, and event streams, and transforms it before loading it into an AWS data repository. Data ingestion pipelines typically process batches every 5–60 minutes, so when you query your data in Amazon Redshift, it’s at least 5 minutes out of date. For many use cases, such as ad hoc queries and BI reporting, this latency time is acceptable.

But what about when your team demands more up-to-date data? An example is operational dashboards, where you need to track KPIs in near-real time. Amazon Redshift load times are bottlenecked by COPY commands that move data from Amazon S3 into Amazon Redshift, as mentioned earlier.

This is where streaming ingestion comes in: by staging the data in Kinesis Data Streams rather than Amazon S3, Etleap can reduce data latency in Amazon Redshift to less than 10 seconds. To preview this feature, we ingest data from SQL databases such as MySQL and Postgres that support change data capture (CDC). The data flow is shown in the following diagram.

Etleap manages the end-to-end data flow through AWS Database Migration Service (AWS DMS) and Kinesis Data Streams, and creates and schedules Amazon Redshift queries, providing up-to-date data.

AWS DMS consumes the replication logs from the source, and produces insert, update, and delete events. These events are written to a Kinesis data stream that has multiple shards in order to handle the event load. Etleap transforms these events according to user-specified rules, and writes them to another data stream. Finally, a sequence of Amazon Redshift commands load data from the stream into a destination table. This procedure takes less than 10 seconds in real-world scenarios.

Configure Amazon Redshift streaming ingestion with Etleap

Previously, we explored how data in Kinesis Data Streams can be accessed in Amazon Redshift using SQL queries. In this section, we see how Etleap uses the streaming ingestion feature to mirror a table from MySQL into Amazon Redshift, and the end-to-end latency we can achieve.

Etleap customers that are part of the Streaming Ingestion Preview Program can ingest data into Amazon Redshift directly from an Etleap-managed Kinesis data stream. All pipelines from a CDC-enabled source automatically use this feature.

The destination table in Amazon Redshift is Type 1, a mirror of the table in the source database.

For example, say you want to mirror a MySQL table in Amazon Redshift. The table represents the online shopping carts that users have open. In this case, low latency is critical so that the platform marketing strategists can instantly identify abandoned carts and high demand items.

The cart table has the following structure:

CREATE TABLE cart (
  id INT PRIMARY KEY AUTO_INCREMENT,
  user_id INT,
  current_price DECIMAL(6,2),
  no_items INT,
  checked_out TINYINT(1),
  update_date TIMESTAMP
);

Changes from the source table are captured using AWS DMS and then sent to Etleap via a Kinesis data stream. Etleap transforms these records and writes them to another data stream using the following structure:

{
            "id": 8322,
            "user_id": 443,
            "current_price": 22.98,
            "no_items": 3,
            "checked_out": 0,
            "update_date": "2021-11-05 23:11",
            "op": "U"
}

The structure encodes the row that was modified or inserted, as well as the operation type (represented by the op column), which can have three values: I (insert), U (update) or D (delete).

This information is then materialized in Amazon Redshift from the data stream:

CREATE EXTERNAL SCHEMA etleap_stream
FROM KINESIS
IAM_ROLE '<redacted>';

CREATE MATERIALIZED VIEW cart_staging
DISTSTYLE KEY
	DISTKEY(id)
	SORTKEY(etleap_sequence_no)
AS SELECT
	CAST(PartitionKey as bigint) AS etleap_sequence_no,
	CAST(JSON_EXTRACT_PATH_TEXT(from_varbyte(Data, 'utf-8'), 'id') as bigint) AS id,
	JSON_PARSE(FROM_VARBYTE(Data, 'utf-8')) AS Data
FROM etleap_stream."cart";

In the materialized view, we expose the following columns:

  • PartitionKey represents an Etleap sequence number, to ensure that updates are processed in the correct order.
  • We shred the primary keys of the table (id in the preceding example) from the payload, using them as a distribution key to improve the update performance.
  • The Data column is parsed out into a SUPER type from the JSON object in the stream. This is shredded into the corresponding columns in the cart table when the data is inserted.

With this staging materialized view, Etleap then updates the destination table (cart) that has the following schema:

CREATE TABLE cart ( 
id BIGINT PRIMARY KEY,
user_id BIGINT,
current_price DECIMAL(6,2),
no_items INT,
checked_out BOOLEAN,
update_date VARCHAR(64)
)
DISTSTYLE key
distkey(id);

To update the table, Etleap runs the following queries, selecting only the changed rows from the staging materialized view, and applies them to the cart table:

BEGIN;

REFRESH MATERIALIZED VIEW cart_staging;

UPDATE _etleap_si SET end_sequence_no = (
	SELECT COALESCE(MIN(etleap_sequence_no), (SELECT MAX(etleap_sequence_no) FROM cart_staging)) FROM 
	(
		SELECT 
			etleap_sequence_no, 
			LEAD(etleap_sequence_no, 1) OVER (ORDER BY etleap_sequence_no) - etleap_sequence_no AS diff
		FROM cart_staging 
		WHERE etleap_sequence_no > (SELECT start_sequence_no FROM _etleap_si WHERE table_name = 'cart')
	)
	WHERE diff > 1
) WHERE table_name = 'cart';



DELETE FROM cart
WHERE id IN (
	SELECT id
	FROM cart_staging
	WHERE etleap_sequence_no > (SELECT start_sequence_no FROM _etleap_si WHERE table_name = 'cart') 
	AND etleap_sequence_no <= (SELECT end_sequence_no FROM _etleap_si WHERE table_name = 'cart')
);

INSERT INTO cart
SELECT
	DISTINCT(id),
	CAST(Data.user_id as bigint),
	CAST(Data.current_price as decimal(6,2)),
	CAST(Data.no_items as int),
	CAST(Data.checked_out as boolean),
	CAST(Data.update_date as varchar(64))
FROM
  	(SELECT id,
  	JSON_PARSE(LAST_VALUE(JSON_SERIALIZE(Data)) OVER (PARTITION BY id ORDER BY etleap_sequence_no ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)) AS Data
   	FROM cart_staging
	WHERE etleap_sequence_no > (SELECT start_sequence_no FROM _etleap_si WHERE table_name = 'cart')
	AND etleap_sequence_no <= (SELECT end_sequence_no FROM _etleap_si WHERE table_name = 'cart')
	AND Data.op != 'D'
);


UPDATE _etleap_si SET start_sequence_no = end_sequence_no WHERE table_name = 'cart';

COMMIT;

We run the following sequence of queries:

  1. Refresh the cart_staging materialized view to get new records from the cart stream.
  2. Delete all records from the cart table that were updated or deleted since the last time we ran the update sequence.
  3. Insert all the updated and newly inserted records from the cart_staging materialized view into the cart table.
  4. Update the _etleap_si bookkeeping table with the current position. Etleap uses this table to optimize the query in the staging materialized view.

This update sequence runs continuously to minimize end-to-end latency. To measure performance, we simulated the change stream from a database table that has up to 100,000 inserts, updates, and deletes. We tested target table sizes of up to 1.28 billion rows. Testing was done on a 2-node ra3.xlplus Amazon Redshift cluster and a Kinesis data stream with 32 shards.

The following figure shows how long the update sequence takes on average over 5 runs in different scenarios. Even in the busiest scenario (100,000 changes to a 1.28 billion row table), the sequence takes just over 10 seconds to run. In our experiment, the refresh time was independent of the delta size, and took 3.7 seconds with a standard deviation of 0.4 seconds.

This shows that the update process can keep up with source database tables that have 1 billion rows and 10,000 inserts, updates, and deletes per second.

Summary

In this post, you learned about the native streaming ingestion feature in Amazon Redshift and how it achieves latency in seconds, while ingesting data from Kinesis Data Streams into Amazon Redshift. You also learned about the architecture of Amazon Redshift with the streaming ingestion feature enabled, how to configure it using SQL commands, and use the capability in Etleap.

To learn more about Etleap, take a look at the Etleap ETL on AWS Quick Start, or visit their listing on AWS Marketplace.


About the Authors

Caius Brindescu is an engineer at Etleap with over 3 years of experience in developing ETL software. In addition to development work, he helps customers make the most out of Etleap and Amazon Redshift. He holds a PhD from Oregon State University and one AWS certification (Big Data – Specialty).

Todd J. Green is a Principal Engineer with AWS Redshift. Before joining Amazon, TJ worked at innovative database startups including LogicBlox and RelationalAI, and was an Assistant Professor of Computer Science at UC Davis. He received his PhD in Computer Science from UPenn. In his career as a researcher, TJ won a number of awards, including the 2017 ACM PODS Test-of-Time Award.

Maneesh Sharma is a Senior Database Engineer with Amazon Redshift. He works and collaborates with various Amazon Redshift Partners to drive better integration. In his spare time, he likes running, playing ping pong, and exploring new travel destinations.

Jobin George is a Big Data Solutions Architect with more than a decade of experience designing and implementing large-scale big data and analytics solutions. He provides technical guidance, design advice, and thought leadership to some of the key AWS customers and big data partners.

Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-redshift-serverless-run-analytics-at-any-scale-without-having-to-manage-infrastructure/

We’re seeing the use of data analytics expanding among new audiences within organizations, for example with users like developers and line of business analysts who don’t have the expertise or the time to manage a traditional data warehouse. Also, some customers have variable workloads with unpredictable spikes, and it can be very difficult for them to constantly manage capacity.

With Amazon Redshift, you use SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Today, I am happy to introduce the public preview of Amazon Redshift Serverless, a new capability that makes it super easy to run analytics in the cloud with high performance at any scale. Just load your data and start querying. There is no need to set up and manage clusters. You pay for the duration in seconds when your data warehouse is in use, for example, while you are querying or loading data. There is no charge when your data warehouse is idle.

Amazon Redshift Serverless automatically provisions the right compute resources for you to get started. As your demand evolves with more concurrent users and new workloads, your data warehouse scales seamlessly and automatically to adapt to the changes. You can optionally specify the base data warehouse size to have additional control on cost and application-specific SLAs.

With the new serverless option, you can continue to query data in other AWS data stores, such as Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Aurora and Amazon Relational Database Service (RDS) databases.

Amazon Redshift Serverless is ideal when it is difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. This approach is also a good fit for ad-hoc analytics needs that need to get started quickly and for test and development environments.

Let’s see how this works in practice.

Using Amazon Redshift Serverless
I go to the Amazon Redshift console and choose the new serverless option. The first time, I set up the serverless endpoint and configure networking and security.

I confirm the default settings that use all subnets in my default Amazon Virtual Private Cloud (VPC) and its default security group. Data is always encrypted, and I use the default AWS-owned key. Optionally, I can customize all settings. I can associate the AWS Identity and Access Management (IAM) roles now or later to give permissions to access other AWS resources, for example, to load data from an S3 bucket. The configuration of the serverless endpoint is shared by all my serverless data warehouses in the same AWS account and Region.

Console screenshot.

To query data, I use Amazon Redshift Query Editor V2, a new free web-based tool that we made available a few months back. The query editor provides quick access to a few sample datasets to make it easy to learn Amazon Redshift’s SQL capabilities: TPC-H, TPC-DS, and tickit, a dataset containing information on ticket sales for events.

For a quick test, I use the tickit sample dataset so I don't need to load any data. I prepare a query to get the list of tickets sold per date, sorted to see the dates with the most sales first:

SELECT caldate, sum(qtysold) as sumsold
FROM   tickit.sales, tickit.date
WHERE  sales.dateid = date.dateid 
GROUP BY caldate
ORDER BY sumsold DESC;

By using the web-based query editor, I don’t need to configure a SQL client or set up the network permissions to reach the serverless endpoint. Instead, I just write my SQL query and run it.

Console screenshot.

I am a visual person. I enable the Chart option on the right of the result table and select a bar chart.

Console screenshot.

Satisfied with the clarity of the chart, I export it as an image file. In this way, I can quickly share it or include it in a report.

Bar chart

Amazon Redshift Serverless supports all rich SQL functionality of Amazon Redshift such as semi-structured data support. I can use any JDBC/ODBC-compliant tool or the Amazon Redshift Data API to query my data. To migrate data, I can take a snapshot of an Amazon Redshift provisioned cluster and restore it as serverless. Then, I just need to update my SQL applications to use the new serverless endpoint.

Availability and Pricing
Amazon Redshift Serverless is available in public preview in the following AWS Regions: US East (N. Virginia), US West (N. California, Oregon), Europe (Frankfurt, Ireland), Asia Pacific (Tokyo).

With Amazon Redshift Serverless, you pay separately for the compute and storage you use. Compute capacity is measured in Redshift Processing Units (RPUs), and you pay for the workloads in RPU-hours with per-second billing. For storage, you pay for data stored in Amazon Redshift-managed storage and storage used for snapshots, similar to what you’d pay with a provisioned cluster using RA3 instances.

To control your costs, you can specify usage limits and define actions that Amazon Redshift automatically takes if those limits are reached. You can specify usage limits in RPU-hours and associate them with a daily, weekly, or monthly duration. Setting higher usage limits can improve the overall throughput of the system, especially for workloads that need to handle high concurrency while maintaining consistently high performance.

Compute resources automatically shut down behind the scenes when there is no activity and resume when you are loading data or queries are coming in. When accessing your S3 data lake via the new serverless endpoint, you do not pay for Amazon Redshift Spectrum separately. You have a unified serverless experience and pay for data lake queries also in RPU-seconds. For more information, see the Amazon Redshift pricing page.

The serverless endpoint is configured at the AWS account level. If you have multiple teams or projects and want to manage costs separately, you can use separate AWS accounts. You can share data between your provisioned clusters and serverless endpoint and between serverless endpoints across accounts.

To help you get hands-on practice, we provide $500 in AWS credits upfront to try the Amazon Redshift Serverless public preview. You get the credits when you first create a database with Amazon Redshift Serverless. These credits cover your costs for compute, storage, and snapshot usage of Amazon Redshift Serverless only.

Start using Amazon Redshift Serverless today to run and scale analytics without having to provision and manage data warehouse clusters.

Danilo

Security considerations for Amazon Redshift cross-account data sharing

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/security-considerations-for-amazon-redshift-cross-account-data-sharing/

Data-driven organizations recognize the intrinsic value of data and realize that monetizing data is not just about selling data to subscribers. They understand the indirect economic impact of data and the value that good data brings to the organization. They must democratize data and make it available for business decision makers to realize its benefits. Today, this often means replicating data across multiple disparate databases, which requires moving the data across various platforms.

Amazon Redshift data sharing lets you securely and easily share live data across Amazon Redshift clusters or AWS accounts for read purposes. Data sharing can improve the agility of your organization by giving you instant, granular, and high-performance access to data across Amazon Redshift clusters without manually copying or moving it. Data sharing provides you with live access to data so that your users can see the most up-to-date and consistent information as it’s updated in Amazon Redshift clusters.

Cross-account data sharing lets you share data across multiple accounts. The accounts can be within the same organization or across different organizations. We have built in additional authorization steps for security control, because sharing data across accounts could also mean sharing data across different organizations. Review the AWS documentation on cross-account data sharing and the blog post from our colleague for detailed steps. We also have a YouTube video on setting up cross-account data sharing for a business use case that you can refer to as well.

Cross-account data sharing scenario

For this post, we use the following use case to demonstrate how you can set up cross-account data sharing with the option to control data sharing to specific consumer accounts from the producer account. The producer organization has one AWS account and one Redshift cluster. The consumer organization has two AWS accounts and three Redshift clusters across the two accounts. The producer organization wants to share data from the producer cluster to one of the consumer accounts, "ConsumerAWSAccount1", and the consumer organization wants to restrict access to the data share to a specific Redshift cluster, "ConsumerCluster1". Sharing to the second consumer account, "ConsumerAWSAccount2", should be disallowed. Similarly, access to the data share should be restricted to the first consumer cluster, "ConsumerCluster1".

Walkthrough

You can set up this behavior using the following steps:

Setup on the producer account:

  • Create a data share in the Producer cluster and add schema and tables.
  • Set up an IAM policy to control which consumer accounts can be authorized for the data share.
  • Grant data share usage to a consumer AWS account.

Setup on the consumer account:

  • Set up an IAM policy to control which of the consumer Redshift clusters can be associated with the producer data share.
  • Associate consumer cluster to the data share created on the producer cluster.
  • Create database referencing the associated data share.

Prerequisites

To set up cross-account data sharing, you should have the following prerequisites:

  • Three AWS accounts: one for the producer <ProducerAWSAccount1>, and two consumer accounts – <ConsumerAWSAccount1> and <ConsumerAWSAccount2>.
  • AWS permissions to provision Amazon Redshift and create an IAM role and policy.

We assume you have provisioned the required Redshift clusters: one for the producer in the producer AWS account, two Redshift clusters in <ConsumerAWSAccount1>, and optionally one Redshift cluster in <ConsumerAWSAccount2>.

  • Two users in the producer account, and two users in consumer account 1:
    • ProducerClusterAdmin
    • ProducerCloudAdmin
    • Consumer1ClusterAdmin
    • Consumer1CloudAdmin

Security controls from producer and consumer

Approved list of consumer accounts from the producer account

When you share data across accounts, the producer admin can grant usage of the data share to a specific account. For additional security to allow the separation of duty between the database admin and the cloud security administrator, organizations might want to have an approved list of AWS accounts that can be granted access. You can achieve this by creating an IAM policy listing all of the approved accounts, and then add this policy to the role attached to the producer cluster.

Creating the IAM Policy for the approved list of consumer accounts

  1. On the AWS IAM Console, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the following policy:
    This is the producer-side policy. Note: you should replace the following text with the specific details for your cluster and account.
    • "Resource": "*" – Replace "*" with the ARN of the specific data share.
    • <AWSAccountID> – Add one or more consumer account numbers based on the requirement.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow",
            "Effect": "Allow",
            "Action": [
                "redshift:AuthorizeDataShare",
                "redshift:DeauthorizeDataShare"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "redshift:ConsumerIdentifier": [
                        "<AWSAccountID>"
                    ]
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeDataSharesForConsumer",
                "redshift:DescribeDataSharesForProducer",
                "redshift:DescribeClusters",
                "redshift:DescribeDataShares"
            ],
            "Resource": "*"
        }
    ]
}
  4. From the Amazon Redshift console in the producer AWS account, choose Query Editor V2 and connect to the producer cluster using temporary credentials.
  5. After connecting to the producer cluster, create the data share and add the schema and tables to the data share. Then, grant usage to the consumer accounts <ConsumerAWSAccount1> and <ConsumerAWSAccount2>:
CREATE DATASHARE ds;

ALTER DATASHARE ds ADD SCHEMA PUBLIC;
ALTER DATASHARE ds ADD TABLE table1;
ALTER DATASHARE ds ADD ALL TABLES IN SCHEMA sf_schema;

GRANT USAGE ON DATASHARE ds TO ACCOUNT '<ConsumerAWSAccount1>';
GRANT USAGE ON DATASHARE ds TO ACCOUNT '<ConsumerAWSAccount2>';

Note: the GRANT will be successful even if the account is not listed in the IAM policy. However, the authorize step validates the request against the list of approved accounts in the IAM policy and fails if the account is not in the approved list.

  6. Now the producer admin can authorize the data share by using the AWS CLI or the console. When you authorize the data share to <ConsumerAWSAccount1>, the authorization is successful.
aws redshift authorize-data-share --data-share-arn <DATASHARE ARN> --consumer-identifier <ConsumerAWSAccount1>

  7. When you authorize the data share to <ConsumerAWSAccount2>, the authorization fails because the IAM policy we set up in the earlier step does not allow sharing data with <ConsumerAWSAccount2>.
aws redshift authorize-data-share --data-share-arn <DATASHARE ARN> --consumer-identifier <ConsumerAWSAccount2>

We have demonstrated how you can restrict access to the data share created on the producer cluster to specific consumer accounts by using a conditional construct with an approved account list in the IAM policy.

Approved list of Redshift clusters on consumer account

When you grant access to a data share to a consumer account, the consumer admin can determine which Redshift clusters can read the data share by associating it with the appropriate cluster. If the organization wants to control which of the Redshift clusters the admin can associate with the data share, then you can specify the approved list of Redshift clusters by using the cluster ARN in an IAM policy.

  1. On the AWS IAM Console, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the following policy:
    This is the consumer-side policy. Note: you should replace the following text with the specific details for your cluster and account.
    • "Resource": "*" – Replace "*" with the ARN of the specific data share.
    • Replace "<ProducerDataShareARN>" with the ARN of the data share created on the producer cluster in the producer AWS account.
    • Replace "<ConsumerRedshiftCluster1ARN>" with the ARN of the first Redshift cluster in AWS consumer account 1.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "redshift:AssociateDataShareConsumer",
                "redshift:DisassociateDataShareConsumer"
            ],
            "Resource": "<ProducerDataShareARN>",
            "Condition": {
                "StringEquals": {
                    "redshift:ConsumerArn": [ "<ConsumerRedshiftCluster1ARN>" ]
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeDataSharesForConsumer",
                "redshift:DescribeDataSharesForProducer",
                "redshift:DescribeClusters",
                "redshift:DescribeDataShares"
            ],
            "Resource": "*"
        }
    ]
}
  4. Now the consumer admin can associate the data share by using the AWS CLI or the console. When you associate Redshift cluster 1 <ConsumerRedshiftCluster1ARN>, the association is successful.
aws redshift associate-data-share-consumer --no-associate-entire-account --data-share-arn <ProducerDataShareARN> --consumer-arn <ConsumerRedshiftCluster1ARN>

  5. When you associate Redshift cluster 2 <ConsumerRedshiftCluster2ARN>, the association fails because the IAM policy set up in the earlier step only allows association with Redshift cluster 1.
aws redshift associate-data-share-consumer --no-associate-entire-account --data-share-arn <ProducerDataShareARN> --consumer-arn <ConsumerRedshiftCluster2ARN>

  6. After associating consumer Redshift cluster 1 with the producer data share, from the Amazon Redshift console in the consumer AWS account, choose Query Editor V2 and connect to the consumer cluster using temporary credentials.
  7. After connecting to the consumer cluster, you can create a database referencing the data share on the producer cluster, and then start querying the data:
CREATE DATABASE ds_db FROM DATASHARE ds OF ACCOUNT '<producer-account-id>' NAMESPACE '<producer-cluster-namespace>';
 
Optional:
CREATE EXTERNAL SCHEMA Schema_from_datashare FROM REDSHIFT DATABASE 'ds_db' SCHEMA 'public';

GRANT USAGE ON DATABASE ds_db TO user/group;

GRANT USAGE ON SCHEMA Schema_from_datashare TO GROUP Analyst_group;

SELECT  * FROM ds_db.public.producer_t1;

You can use the query editor or the new Amazon Redshift Query Editor V2 to run the statements above to read the shared data from the producer by creating an external database reference from the consumer cluster.

Conclusion

We have demonstrated how you can restrict access to the data share created on the producer cluster to specific consumer accounts by listing approved accounts in the IAM policy.

On the consumer side, we have also demonstrated how you can restrict access to a particular Redshift cluster on the consumer account for the data share created on the producer cluster by listing approved Redshift cluster(s) in the IAM policy. Enterprises and businesses can use this approach to control the boundaries of Redshift data sharing at account and cluster granularity.

We encourage you to try cross-account data sharing with these additional security controls to securely share data across Amazon Redshift clusters both within your organizations and with your customers or partners.


About the Authors

Rajesh Francis is a Senior Analytics Customer Experience Specialist at AWS. He specializes in Amazon Redshift and focuses on helping to drive AWS market and technical strategy for data warehousing and analytics. Rajesh works closely with large strategic customers to help them adopt our new services and features, develop long-term partnerships, and feed customer requirements back to our product development teams to guide the direction of our product offerings.

Kiran Sharma is a Senior Big Data Consultant for AWS Professional Services. She works with our customers to architect and implement Big Data Solutions on variety of projects on AWS.

Eric Hotinger is a Software Engineer at AWS. He enjoys solving seemingly impossible problems in the areas of analytics, streaming, containers, and serverless.

Provide data reliability in Amazon Redshift at scale using Great Expectations library

Post Syndicated from Faizan Ahmed original https://aws.amazon.com/blogs/big-data/provide-data-reliability-in-amazon-redshift-at-scale-using-great-expectations-library/

Ensuring data reliability is one of the key objectives of maintaining data integrity and is crucial for building data trust across an organization. Data reliability means that the data is complete and accurate. It’s the catalyst for delivering trusted data analytics and insights. Incomplete or inaccurate data leads business leaders and data analysts to make poor decisions, which can lead to negative downstream impacts and subsequently may result in teams spending valuable time and money correcting the data later on. Therefore, it’s always a best practice to run data reliability checks before loading the data into any targets like Amazon Redshift, Amazon DynamoDB, or Amazon Timestream databases.

This post discusses a solution for running data reliability checks before loading the data into a target table in Amazon Redshift using the open-source library Great Expectations. You can automate the process for data checks via the extensive built-in Great Expectations glossary of rules using PySpark, and it’s flexible for adding or creating new customized rules for your use case.

Amazon Redshift is a cloud data warehouse solution and delivers up to three times better price-performance than other cloud data warehouses. With Amazon Redshift, you can query and combine exabytes of structured and semi-structured data across your data warehouse, operational database, and data lake using standard SQL. Amazon Redshift lets you save the results of your queries back to your Amazon Simple Storage Service (Amazon S3) data lake using open formats like Apache Parquet, so that you can perform additional analytics from other analytics services like Amazon EMR, Amazon Athena, and Amazon SageMaker.

Great Expectations (GE) is an open-source library and is available in GitHub for public use. It helps data teams eliminate pipeline debt through data testing, documentation, and profiling. Great Expectations helps build trust, confidence, and integrity of data across data engineering and data science teams in your organization. GE offers a variety of expectations developers can configure. The tool defines expectations as statements describing verifiable properties of a dataset. Not only does it offer a glossary of more than 50 built-in expectations, it also allows data engineers and scientists to write custom expectation functions.

Use case overview

Before performing analytics or building machine learning (ML) models, cleaning data can take up a lot of time in the project cycle. Without automated and systematic data quality checks, we may spend most of our time cleaning data and hand-coding one-off quality checks. As most data engineers and scientists know, this process can be both tedious and error-prone.

Having an automated quality check system is critical to project efficiency and data integrity. Such systems help us understand data quality expectations and the business rules behind them, know what to expect in our data analysis, and make communicating the data’s intricacies much easier. For example, in a raw dataset of customer profiles of a business, if there’s a column for date of birth in format YYYY-mm-dd, values like 1000-09-01 would be correctly parsed as a date type. However, logically this value would be incorrect in 2021, because the age of the person would be 1021 years, which is impossible.

Another use case could be to use GE for streaming analytics, where you can use AWS Database Migration Service (AWS DMS) to migrate a relational database management system. AWS DMS can export change data capture (CDC) files in Parquet format to Amazon S3, where these files can then be cleansed by an AWS Glue job using GE and written to either a destination bucket for Athena consumption or the rows can be streamed in AVRO format to Amazon Kinesis or Kafka.

Additionally, automated data quality checks can be versioned and also bring benefit in the form of optimal data monitoring and reduced human intervention. Data lineage in an automated data quality system can also indicate at which stage in the data pipeline the errors were introduced, which can help inform improvements in upstream systems.

Solution architecture

This post comes with a ready-to-use blueprint that automatically provisions the necessary infrastructure and spins up a SageMaker notebook that walks you step by step through the solution. Additionally, it enforces the best practices in data DevOps and infrastructure as code. The following diagram illustrates the solution architecture.

The architecture contains the following components:

  1. Data lake – When we run the AWS CloudFormation stack, an open-source sample dataset in CSV format is copied to an S3 bucket in your account. As an output of the solution, the data destination is an S3 bucket. This destination consists of two separate prefixes, each of which contains files in Parquet format, to distinguish between accepted and rejected data.
  2. DynamoDB – The CloudFormation stack persists data quality expectations in a DynamoDB table. Four predefined column expectations are populated by the stack in a table called redshift-ge-dq-dynamo-blog-rules. Apart from the pre-populated rules, you can add any rule from the Great Expectations glossary according to the data model showcased later in the post.
  3. Data quality processing – The solution utilizes a SageMaker notebook instance powered by Amazon EMR to process the sample dataset using PySpark (v3.1.1) and Great Expectations (v0.13.4). The notebook is automatically populated with the S3 bucket location and Amazon Redshift cluster identifier via the SageMaker lifecycle config provisioned by AWS CloudFormation.
  4. Amazon Redshift – We create internal and external tables in Amazon Redshift for the accepted and rejected datasets produced from processing the sample dataset. The external dq_rejected.monster_com_rejected table, for rejected data, uses Amazon Redshift Spectrum and creates an external database in the AWS Glue Data Catalog to reference the table. The dq_accepted.monster_com table is created as a regular Amazon Redshift table by using the COPY command.

Sample dataset

As part of this post, we have performed tests on the Monster.com job applicants sample dataset to demonstrate the data reliability checks using the Great Expectations library and loading data into an Amazon Redshift table.

The dataset contains nearly 22,000 different sample records with the following columns:

  • country
  • country_code
  • date_added
  • has_expired
  • job_board
  • job_description
  • job_title
  • job_type
  • location
  • organization
  • page_url
  • salary
  • sector
  • uniq_id

For this post, we have selected four columns with inconsistent or dirty data, namely organization, job_type, uniq_id, and location, whose inconsistencies are flagged according to the rules we define from the GE glossary as described later in the post.

Prerequisites

For this solution, you should have the following prerequisites:

  • An AWS account if you don’t have one already. For instructions, see Sign Up for AWS.
  • For this post, you can launch the CloudFormation stack in the following Regions:
    • us-east-1
    • us-east-2
    • us-west-1
    • us-west-2
  • An AWS Identity and Access Management (IAM) user. For instructions, see Create an IAM User.
  • The user should have create, write, and read access for the AWS services deployed by the CloudFormation stack.
  • Familiarity with Great Expectations and PySpark.

Set up the environment

Choose Launch Stack to start creating the required AWS resources for the notebook walkthrough:

For more information about Amazon Redshift cluster node types, see Overview of Amazon Redshift clusters. For the type of workflow described in this post, we recommend using the RA3 Instance Type family.

Run the notebooks

When the CloudFormation stack is complete, complete the following steps to run the notebooks:

  1. On the SageMaker console, choose Notebook instances in the navigation pane.

This opens the notebook instances in your Region. You should see a notebook titled redshift-ge-dq-EMR-blog-notebook.

  2. Choose Open Jupyter next to this notebook to open the Jupyter notebook interface.

You should see the Jupyter notebook file titled ge-redshift.ipynb.

  3. Choose the file to open the notebook and follow the steps to run the solution.

Run configurations to create a PySpark context

When the notebook is open, make sure the kernel is set to Sparkmagic (PySpark). Run the following block to set up Spark configs for a Spark context.
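The configuration cell itself isn't reproduced here; as a rough illustration, a Sparkmagic configuration cell generally looks like the following sketch (the memory settings and Spark property shown are placeholders, not the values shipped with the provisioned notebook):

%%configure -f
{
  "driverMemory": "4G",
  "executorMemory": "4G",
  "conf": {
    "spark.sql.execution.arrow.pyspark.enabled": "true"
  }
}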

Create a Great Expectations context

In Great Expectations, your data context manages your project configuration. We create a data context for our solution by passing our S3 bucket location. The S3 bucket’s name, created by the stack, should already be populated within the cell block. Run the following block to create a context:

from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig, S3StoreBackendDefaults
from great_expectations.data_context import BaseDataContext

bucket_prefix = "ge-redshift-data-quality-blog"
bucket_name = "ge-redshift-data-quality-blog-region-account_id"
region_name = '-'.join(bucket_name.replace(bucket_prefix, '').split('-')[1:4])
dataset_path = f"s3://{bucket_name}/monster_com-job_sample.csv"
project_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_spark_datasource": {
            "data_asset_type": {
                "class_name": "SparkDFDataset",  # Setting the dataset type to Spark
                "module_name": "great_expectations.dataset",
            },
            "spark_config": dict(spark.sparkContext.getConf().getAll()),  # Passing the Spark session configs
            "class_name": "SparkDFDatasource",
            "module_name": "great_expectations.datasource"
        }
    },
    store_backend_defaults=S3StoreBackendDefaults(default_bucket_name=bucket_name)
)
context = BaseDataContext(project_config=project_config)

For more details on creating a GE context, see Getting started with Great Expectations.

Get GE validation rules from DynamoDB

Our CloudFormation stack created a DynamoDB table with prepopulated rows of expectations. The data model in DynamoDB describes the properties related to each dataset and its columns and the number of expectations you want to configure for each column. The following code describes an example of the data model for the column organization:

{
 "id": "job_reqs-organization",
 "dataset_name": "job_reqs",
 "rules": [ //list of expectations to apply to this column
  {
   "kwargs": {
    "result_format": "SUMMARY|COMPLETE|BASIC|BOOLEAN_ONLY" //The level of detail of the result
   },
   "name": "expect_column_values_to_not_be_null", //name of GE expectation
   "reject_msg": "REJECT:null_values_found_in_organization"
  }
 ],
 "column_name": "organization"
}

The code contains the following parameters:

  • id – Unique ID of the document
  • dataset_name – Name of the dataset, for example monster_com
  • rules – List of GE expectations to apply:
    • kwargs – Parameters to pass to an individual expectation
    • name – Name of the expectation from the GE glossary
    • reject_msg – String to flag for any row that doesn’t pass this expectation
  • column_name – Name of dataset column to run the expectations on

Each column can have one or more expectations associated that it needs to pass. You can also add expectations for more columns or to existing columns by following the data model shown earlier. With this technique, you can automate verification of any number of data quality rules for your datasets without performing any code change. Apart from its flexibility, what makes GE powerful is the ability to create custom expectations if the GE glossary doesn’t cover your use case. For more details on creating custom expectations, see How to create custom Expectations.

Now run the cell block to fetch the GE rules from the DynamoDB client:
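The cell isn't reproduced here; fetching the rules essentially amounts to scanning the DynamoDB table created by the stack (redshift-ge-dq-dynamo-blog-rules) and indexing the rule documents by column name. The following is a minimal sketch; the variable names are assumptions rather than the notebook's exact code:

import boto3

dynamodb = boto3.resource("dynamodb", region_name=region_name)
rules_table = dynamodb.Table("redshift-ge-dq-dynamo-blog-rules")

# The rules table is small, so a single scan is sufficient here
rules_by_column = {}
for item in rules_table.scan()["Items"]:
    rules_by_column[item["column_name"]] = item["rules"]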

  1. Read the monster.com sample dataset and pass through validation rules.

After we have the expectations fetched from DynamoDB, we can read the raw CSV dataset. This dataset should already be copied to your S3 bucket location by the CloudFormation stack. You should see the following output after reading the CSV as a Spark DataFrame.
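As a sketch of this step (the CSV options are assumptions to handle the free-text job_description column, not necessarily the notebook's exact settings):

raw_df = (spark.read
    .option("header", "true")
    .option("multiLine", "true")  # job descriptions can contain embedded newlines
    .option("escape", '"')
    .csv(dataset_path))
raw_df.printSchema()
print(raw_df.count())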

To evaluate whether a row passes each column’s expectations, we need to pass the necessary columns to a Spark user-defined function. This UDF evaluates each row in the DataFrame and appends the results of each expectation to a comments column.

Rows that pass all column expectations have a null value in the comments column.

A row that fails at least one column expectation is flagged with the string format REJECT:reject_msg_from_dynamo. For example, if a row has a null value in the organization column, then according to the rules defined in DynamoDB, the comments column is populated by the UDF as REJECT:null_values_found_in_organization.

The technique with which the UDF function recognizes a potentially erroneous column is done by evaluating the result dictionary generated by the Great Expectations library. The generation and structure of this dictionary is dependent upon the keyword argument of result_format. In short, if the count of unexpected column values of any column is greater than zero, we flag that as a rejected row.
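The notebook's UDF isn't reproduced here; the following sketch shows one way such a UDF could be shaped, assuming the rules_by_column dictionary and raw_df DataFrame from the earlier sketches. It wraps each row as a one-row Pandas dataset so the expectation named in each rule can be invoked by name, and it flags the row whenever an expectation reports unexpected values:

import pandas as pd
from great_expectations.dataset import PandasDataset
from pyspark.sql.functions import struct, udf
from pyspark.sql.types import StringType

def validate_row(row):
    # Wrap the single row so classic (v0.13) GE expectations can run against it
    ge_row = PandasDataset(pd.DataFrame([row.asDict()]))
    reject_msgs = []
    for column_name, rules in rules_by_column.items():
        for rule in rules:
            expectation = getattr(ge_row, rule["name"])  # e.g. expect_column_values_to_not_be_null
            result = expectation(column_name, **rule.get("kwargs", {}))
            # Any unexpected values for this expectation flag the row as rejected
            if result.result.get("unexpected_count", 0) > 0:
                reject_msgs.append(rule["reject_msg"])
    return ",".join(reject_msgs) if reject_msgs else None

validate_udf = udf(validate_row, StringType())
validated_df = raw_df.withColumn("comments", validate_udf(struct(*raw_df.columns)))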

  2. Split the resulting dataset into accepted and rejected DataFrames.

Now that we have all the rejected rows flagged in the source DataFrame within the comments column, we can use this property to split the original dataset into accepted and rejected DataFrames. In the previous step, we mentioned that we append an action message in the comments column for each failed expectation in a row. With this fact, we can select rejected rows that start with the string REJECT (alternatively, you can also filter by non-null values in the comments column to get the accepted rows). When we have the set of rejected rows, we can get the accepted rows as a separate DataFrame by using the following PySpark except function.
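A minimal sketch of that split, assuming the comments column produced by the validation UDF above:

# Rows flagged by at least one failed expectation carry a REJECT message
rejected_df = validated_df.filter(validated_df.comments.startswith("REJECT"))

# Everything else passed all expectations
accepted_df = validated_df.exceptAll(rejected_df)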

Write the DataFrames to Amazon S3.

Now that we have the original DataFrame divided, we can write them both to Amazon S3 in Parquet format. We need to write the accepted DataFrame without the comments column because it’s only added to flag rejected rows. Run the cell blocks to write the Parquet files under appropriate prefixes as shown in the following screenshot.
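The notebook cells aren't reproduced here; the writes are roughly equivalent to the following sketch (the accepted/ and rejected/ prefix names are assumptions):

# The accepted data is written without the helper comments column
accepted_df.drop("comments").write.mode("overwrite").parquet(f"s3://{bucket_name}/accepted/")

# The rejected data keeps the comments column for later analysis through Redshift Spectrum
rejected_df.write.mode("overwrite").parquet(f"s3://{bucket_name}/rejected/")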

Copy the accepted dataset to an Amazon Redshift table

Now that we have written the accepted dataset, we can use the Amazon Redshift COPY command to load this dataset into an Amazon Redshift table. The notebook outlines the steps required to create a table for the accepted dataset in Amazon Redshift using the Amazon Redshift Data API. After the table is created successfully, we can run the COPY command.
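The Data API calls themselves aren't reproduced here; once the target table exists, the COPY is along these lines (the bucket prefix and IAM role are placeholders):

COPY dq_accepted.monster_com
FROM 's3://<your-bucket>/accepted/'
IAM_ROLE '<your-redshift-iam-role-arn>'
FORMAT AS PARQUET;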

Another advantage of the data quality approach described in this post is that the Amazon Redshift COPY command doesn't fail due to schema or data type errors for columns that have clear expectations defined to match the schema. Similarly, you can define expectations for every column in the table so that a row satisfies the schema constraints and can be considered a dq_accepted.monster_com row.

Create an external table in Amazon Redshift for rejected data

We need to have the rejected rows available to us in Amazon Redshift for comparative analysis. These comparative analyses can help inform upstream systems about the quality of data being collected and how it can be corrected to improve the overall quality of data. However, it isn't wise to store the rejected data on the Amazon Redshift cluster, particularly for large tables, because it occupies extra disk space and increases cost. Instead, we use Redshift Spectrum to register an external table in an external schema in Amazon Redshift. The external schema lives in an external database in the AWS Glue Data Catalog and is referenced by Amazon Redshift. The following outlines the steps to create an external table.
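As a rough sketch (the database name, IAM role, column list, and S3 prefix are placeholders, with the column list abbreviated for brevity):

CREATE EXTERNAL SCHEMA dq_rejected
FROM DATA CATALOG
DATABASE 'dq_rejected_db'
IAM_ROLE '<your-redshift-iam-role-arn>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE dq_rejected.monster_com_rejected (
  organization VARCHAR(512),
  job_type VARCHAR(256),
  uniq_id VARCHAR(256),
  location VARCHAR(512),
  comments VARCHAR(1024)
)
STORED AS PARQUET
LOCATION 's3://<your-bucket>/rejected/';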

Verify and compare the datasets in Amazon Redshift.

12,160 records got processed successfully out of a total of 22,000 from the input dataset, and were loaded to the monster_com table under the dq_accepted schema. These records successfully passed all the validation rules configured in DynamoDB.

A total 9,840 records got rejected due to breaking of one or more rules configured in DynamoDB and loaded to the monster_com_rejected table in the dq_rejected schema. In this section, we describe the behavior of each expectation on the dataset.

  • Expect column values to not be null in organization – This rule is configured to reject a row if the organization is null. The following query returns the sample of rows, from the dq_rejected.monster_com_rejected table, that are null in the organization column, with their reject message.
  • Expect column values to match the regex list in job_type – This rule expects the column entries to be strings that can be matched to either any of or all of a list of regular expressions. In our use case, we have only allowed values that match a pattern within [".*Full.*Time", ".*Part.*Time", ".*Contract.*"]. The following query shows rows that are rejected due to an invalid job type.

Most of the records were rejected with multiple reasons, and all those mismatches are captured under the comments column.

  • Expect column values to not match regex for uniq_id – Similar to the previous rule, this rule aims to reject any row whose value matches a certain pattern. In our case, that pattern is having an empty space (\s++) in the primary column uniq_id. This means we consider a value to be invalid if it has empty spaces in the string. The following query returned an invalid format for uniq_id.
  • Expect column entries to be strings with a length between a minimum value and a maximum value (inclusive) – A length check rule is defined in the DynamoDB table for the location column. This rule rejects values or rows if the length of the value violates the specified constraints. The following query returns the records that are rejected due to a rule violation in the location column.

You can continue to analyze the other columns’ predefined rules from DynamoDB or pick any rule from the GE glossary and add it to an existing column. Rerun the notebook to see the result of your data quality rules in Amazon Redshift. As mentioned earlier, you can also try creating custom expectations for other columns.

Benefits and limitations

The efficiency and efficacy of this approach stem from the fact that GE enables automation and configurability to an extensive degree compared with other approaches. A brute-force alternative would be writing stored procedures in Amazon Redshift that perform data quality checks on staging tables before data is loaded into the main tables. However, this approach might not be scalable because you can't persist repeatable rules for different columns in stored procedures the way DynamoDB persists them here (or call DynamoDB APIs), and you would have to write and store a rule for each column of every table. Furthermore, accepting or rejecting a row based on a single rule requires complex SQL statements that may result in longer durations for data quality checks or require more compute power, which can also incur extra costs. With GE, a data quality rule is generic, repeatable, and scalable across different datasets.

Another benefit of this approach, related to using GE, is that it supports multiple Python-based backends, including Spark, Pandas, and Dask. This provides flexibility across an organization where teams might have skills in different frameworks. If a data scientist prefers using Pandas to write their ML pipeline feature quality test, then a data engineer using PySpark can use the same code base to extend those tests due to the consistency of GE across backends.

Furthermore, GE is written natively in Python, which means it’s a good option for engineers and scientists who are more used to running their extract, transform, and load (ETL) workloads in PySpark in comparison to frameworks like Deequ, which is natively written in Scala over Apache Spark and fits better for Scala use cases (the Python interface, PyDeequ, is also available). Another benefit of using GE is the ability to run multi-column unit tests on data, whereas Deequ doesn’t support that (as of this writing).

However, the approach described in this post might not be the most performant in some cases for full table load batch reads for very large tables. This is due to the serde (serialization/deserialization) cost of using UDFs. Because the GE functions are embedded in PySpark UDFs, the performance of these functions is slower than native Spark functions. Therefore, this approach gives the best performance when integrated with incremental data processing workflows, for example using AWS DMS to write CDC files from a source database to Amazon S3.

Clean up

Some of the resources deployed in this post, including those deployed using the provided CloudFormation template, incur costs as long as they’re in use. Be sure to remove the resources and clean up your work when you’re finished in order to avoid unnecessary cost.

Go to the AWS CloudFormation console and choose Delete stack to remove all resources.

The resources in the CloudFormation template are not production ready. If you would like to use this solution in production, enable logging for all S3 buckets and ensure the solution adheres to your organization’s encryption policies through EMR Security Best Practices.

Conclusion

In this post, we demonstrated how you can automate data reliability checks using the Great Expectations library before loading data into an Amazon Redshift table. We also showed how you can use Redshift Spectrum to create external tables. If dirty data were to make its way into the accepted table, all downstream consumers such as business intelligence reporting, advanced analytics, and ML pipelines can get affected and produce inaccurate reports and results. The trends of such data can generate wrong leads for business leaders while making business decisions. Furthermore, flagging dirty data as rejected before loading into Amazon Redshift also helps reduce the time and effort a data engineer might have to spend in order to investigate and correct the data.

We are interested to hear how you would like to apply this solution for your use case. Please share your thoughts and questions in the comments section.


About the Authors

Faizan Ahmed is a Data Architect at AWS Professional Services. He loves to build data lakes and self-service analytics platforms for his customers. He also enjoys learning new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS. In his free time, Faizan enjoys traveling, sports, and reading.

Bharath Kumar Boggarapu is a Data Architect at AWS Professional Services with expertise in big data technologies. He is passionate about helping customers build performant and robust data-driven solutions and realize their data and analytics potential. His areas of interests are open-source frameworks, automation, and data architecting. In his free time, he loves to spend time with family, play tennis, and travel.

Design and build a Data Vault model in Amazon Redshift from a transactional database

Post Syndicated from George Komninos original https://aws.amazon.com/blogs/big-data/design-and-build-a-data-vault-model-in-amazon-redshift-from-a-transactional-database/

Building a highly performant data model for an enterprise data warehouse (EDW) has historically involved significant design, development, administration, and operational effort. Furthermore, the data model must be agile and adaptable to change while handling the largest volumes of data efficiently.

Data Vault is a methodology for delivering project design and implementation to accelerate the build of data warehouse projects. Within the overall methodology, the Data Vault 2.0 data modeling standards are popular and widely used within the industry because they emphasize the business keys and their associations within the delivery of business processes. Data Vault facilitates the rapid build of data models via the following:

  • Pattern-based entities each with a well-defined purpose
  • Data silos are removed because data is represented in source system independent structures
  • Data can be loaded in parallel with minimum dependencies
  • Historized data is stored at its lowest level of granularity
  • Flexible business rules can be applied independently of the loading of the data
  • New data sources can be added with no impact on the existing model.

We always recommend working backwards from the business requirements to choose the most suitable data modeling pattern to use; there are times when Data Vault will not be the best choice for your enterprise data warehouse and another modeling pattern will be more suitable.

In this post, we demonstrate how to implement a Data Vault model in Amazon Redshift and query it efficiently by using the latest Amazon Redshift features, such as separation of compute from storage, seamless data sharing, automatic table optimizations, and materialized views.

Data Vault data modeling overview

A data warehouse platform built using Data Vault typically has the following architecture:

The architecture consists of four layers:

  • Staging – Contains a copy of the latest changes to data from the source systems. This layer doesn't hold history, and during its population you can apply several transformations to the staged data, including data type changes or resizing, character set conversion, and the addition of metadata columns to support later processing.
  • Raw Data Vault – Holds the historized copy of all of the data from multiple source systems. No filters or business transformations have occurred at this point except for storing the data in source-system independent targets.
  • Business Data Vault – An optional layer, but one that is very often built. It contains business calculations and de-normalizations with the sole purpose of improving the speed and simplicity of access within the consumption layer, which is called the Information Mart layer.
  • Information Mart Layer – Where data is most commonly accessed by consumers, for example reporting dashboards or extracts. You can build multiple marts from a single Data Vault integration layer, and the most common data modeling choice for these marts is Star/Kimball schemas.

Convert a Third Normal Form transactional schema to a Data Vault schema

The following entity relationship diagram is a standard implementation of a transactional model that a sports ticket selling service could use:

The main entities within the schema are sporting events, customers, and tickets. A customer is a person, and a person can purchase one or multiple tickets for a sporting event. This business event is captured by the Ticket Purchase History intersection entity above. Finally, a sporting event has many tickets available to purchase and is staged within a single city.

To convert this source model to a Data Vault model, we start by identifying the business keys, their descriptive attributes, and the business transactions. The three main entity types in the Raw Data Vault model are as follows:

  • Hubs – A collection of business keys discovered for each business entity.
  • Links – Business transactions within the process being modeled. A link is always between two or more business keys (hubs) and is recorded at a point in time.
  • Satellites – Historized reference data about either the business key (hub) or business transaction (link).

The following example solution represents some of the sporting event entities when converted into the preceding Raw Data Vault objects.

Hub entities

The hub is the definitive list of business keys loaded into the Raw Data Vault layer from all of the source systems. A business key is used to uniquely identify a business entity and is never duplicated. In our example, the source system has assigned a surrogate key field called Id to represent the business key, so this is stored in a column on the hub called sport_event_id. Some common additional columns on hubs include the Load DateTimeStamp, which records the date and time the business key was first discovered, and the Record Source, which records the name of the source system where this business key was first loaded. Although you don't have to create a surrogate key (hash or sequence) for the primary key column, it is very common in Data Vault to hash the business key, so our example does this. Amazon Redshift supports multiple cryptographic hash functions like MD5, FNV, SHA1, and SHA2, which you can choose to generate your primary key column. See the following code:

create table raw_data_vault.hub_sport_event 
(
  sport_event_pk  varchar(32) not null     
 ,sport_event_id  integer     not null
 ,load_dts        timestamp   not null       
 ,record_source   varchar(10) not null      
);

Note the following:

  • The preceding code assumes the MD5 hashing algorithm is used. If you use FNV_HASH, the data type of the _PK column is BIGINT.
  • The Id column is the business key from the source feed. It’s passed into the hashing function for the _PK column.
  • In our example, there is only a single value for the business key. If a compound key is required, then more than one column can be added.
  • Load_DTS is populated via the staging schema or extract, transform, and load (ETL) code.
  • Record_Source is populated via the staging schema or ETL code (a minimal load sketch follows these notes).

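To make the population of these columns concrete, the following is a minimal hub load sketch, not a definitive implementation. It assumes a hypothetical staging table named staging.sport_event with an id column and a source system labeled 'TICKET_SYS'; the anti-join prevents a business key from being loaded twice.

-- Minimal hub load sketch (staging table name and record source are assumptions)
insert into raw_data_vault.hub_sport_event
  (sport_event_pk, sport_event_id, load_dts, record_source)
select
  md5(cast(stg.id as varchar)) as sport_event_pk  -- hash of the business key
 ,stg.id                       as sport_event_id  -- business key from the source feed
 ,getdate()                    as load_dts        -- when the key was first discovered
 ,'TICKET_SYS'                 as record_source   -- source system name
from staging.sport_event stg
where not exists (
  select 1
  from raw_data_vault.hub_sport_event hub
  where hub.sport_event_id = stg.id
);
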
Link entities

The link object is the occurrence of two or more business keys undertaking a business transaction, for example purchasing a ticket for a sporting event. Each business key is mastered in its respective hub, and a primary key is generated for the link comprising all of the business keys (typically separated by a delimiter character such as '^'). As with hubs, some common additional columns are added to links, including the Load DateTimeStamp, which records the date and time the transaction was first discovered, and the Record Source, which records the name of the source system where this transaction was first loaded. See the following code:

create table raw_data_vault.lnk_ticket_sport_event 
(
  ticket_sport_event_pk varchar(32)  not null    
 ,ticket_fk             varchar(32)  not null   
 ,sport_event_fk        varchar(32)  not null   
 ,load_dts              timestamp    not null   
 ,record_source         varchar(10)  not null   
);

Note the following:

  • The code assumes that the MD5 hashing algorithm is used. The _PK column holds the hashed value of the concatenated ticket and sporting event business keys from the source data feed, for example MD5(ticket_id||'^'||sporting_event_id).
  • The two _FK columns are foreign keys linked to the primary key of the respective hubs.
  • Load_DTS is populated via the staging schema or ETL code.
  • Record_Source is populated via the staging schema or ETL code.

Satellite entities

The history of data about the hub or link is stored in the satellite object. The Load DateTimeStamp is part of the compound key of the satellite, along with the primary key of either the hub or link, because data can change over time. There are choices within the Data Vault standards for how to store satellite data from multiple sources. A common approach is to append the name of the feed to the satellite name. This lets a single hub contain reference data from more than one source system and lets new sources be added without impacting the existing design. See the following code:

create table raw_data_vault.sat_sport_event 
(
  sport_event_pk    varchar(32) not null     
 ,load_dts          timestamp   not null  
 ,sport_type_name   varchar(50)
 ,start_datetime    timestamp
 ,sold_out          boolean     
 ,record_source     varchar(10) not null 
 ,hash_diff         varchar(32) not null 
);

Note the following:

  • The sport_event_pk value is inherited from the hub.
  • The compound key is the sport_event_pk and load_dts columns. This allows history to be maintained.
  • The business attributes are typically optional.
  • Load_DTS is populated via the staging schema or ETL code.
  • Record_Source is populated via the staging schema or ETL code.
  • Hash_Diff is a Data Vault technique to simplify the identification of data changes within satellites. The business attribute values are concatenated and hashed with your algorithm of choice. Then, during ETL processing, only the two hash values (one on the source record and one on the latest dated satellite record) need to be compared; a minimal sketch follows these notes.

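The following is a minimal sketch of that comparison, assuming the same hypothetical staging.sport_event staging table and 'TICKET_SYS' record source used in the hub sketch; it shows one possible ETL pattern, not the only one. Rows are inserted only when the business key is new or its Hash_Diff differs from the latest satellite row.

insert into raw_data_vault.sat_sport_event
  (sport_event_pk, load_dts, sport_type_name, start_datetime, sold_out, record_source, hash_diff)
select
  src.sport_event_pk, getdate(), src.sport_type_name, src.start_datetime,
  src.sold_out, 'TICKET_SYS', src.hash_diff
from (
  -- compute the primary key hash and the hash of the descriptive attributes once
  select
    md5(cast(id as varchar)) as sport_event_pk,
    sport_type_name, start_datetime, sold_out,
    md5(coalesce(sport_type_name, '') || '^'
        || coalesce(cast(start_datetime as varchar), '') || '^'
        || case when sold_out is null then '' when sold_out then '1' else '0' end) as hash_diff
  from staging.sport_event
) src
left join (
  -- latest satellite row per business key
  select sport_event_pk, hash_diff,
         row_number() over (partition by sport_event_pk order by load_dts desc) as rn
  from raw_data_vault.sat_sport_event
) cur
  on cur.sport_event_pk = src.sport_event_pk
 and cur.rn = 1
where cur.sport_event_pk is null       -- brand new business key
   or cur.hash_diff <> src.hash_diff;  -- descriptive attributes changed
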
Converted Data Vault Model

If we take the preceding three Data Vault entity types, we can convert the source data model into a Data Vault data model as follows:

The Business Data Vault contains business-centric calculations and performance de-normalizations that are read by the Information Marts. Some of the object types that are created in the Business Vault layer include the following:

  • PIT (point in time) tables – You can store data in more than one satellite for a single hub, each with a different Load DateTimeStamp depending on when the data was loaded. A PIT table simplifies access to all of this data by creating a table or materialized view that presents a single row with all of the relevant data to hand. The compound key of a PIT table is the primary key from the hub plus a snapshot date (or snapshot date and time) for the frequency of the population. Once a day is the most common frequency, but it could equally be every 15 minutes or once a month. A minimal PIT sketch follows this list.
  • Bridge tables – Similar to PIT tables, bridge tables take the data from more than one link or link satellite and again de-normalize into a single row. This single row view makes accessing complex datasets over multiple tables from the Raw Data Vault much more straightforward and performant. Like a PIT table, the bridge table can be either a table or materialized view.
  • KPI tables – The pre-computed business rules calculate KPIs and store them in dedicated tables.
  • Type 2 tables – You can apply additional processing in the Business Data Vault to calculate Type 2-like time periods, because the data in the Raw Data Vault follows an insert-only pattern.
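
The following is a minimal PIT sketch for a single daily snapshot and a single satellite; a production PIT table typically spans several satellites and is repopulated for each snapshot, and the business_data_vault schema name is an assumption.

-- Minimal daily PIT sketch (business_data_vault schema name is assumed)
create table business_data_vault.pit_sport_event as
select
  h.sport_event_pk
 ,current_date      as snapshot_dt                -- snapshot date portion of the compound key
 ,max(s.load_dts)   as sat_sport_event_load_dts   -- latest satellite row as of the snapshot
from raw_data_vault.hub_sport_event h
left join raw_data_vault.sat_sport_event s
  on s.sport_event_pk = h.sport_event_pk
 and s.load_dts <= getdate()
group by h.sport_event_pk;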

The architecture of Amazon Redshift allows flexibility in the design of the Data Vault platform. You can use the Amazon Redshift RA3 instance type to separate compute resources from the data storage layer and seamlessly share data between different Amazon Redshift clusters. This flexibility allows highly performant and cost-effective Data Vault platforms to be built. For example, the Staging and Raw Data Vault layers can be populated 24 hours a day in micro-batches by one Amazon Redshift cluster, the Business Data Vault layer can be built once a day and its cluster paused to save costs when it completes, and any number of consumer Amazon Redshift clusters can access the results. Depending on the processing complexity of each layer, Amazon Redshift supports independently scaling the compute capacity required at each stage.
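
As a hedged illustration of that pattern, the following data sharing sketch exposes the Raw Data Vault schema from a producer cluster to a consumer cluster; the share name, database name, and namespace values are placeholders.

-- On the producer cluster: share the Raw Data Vault schema (names are placeholders)
create datashare raw_vault_share;
alter datashare raw_vault_share add schema raw_data_vault;
alter datashare raw_vault_share add all tables in schema raw_data_vault;
grant usage on datashare raw_vault_share to namespace '<consumer-cluster-namespace-guid>';

-- On a consumer cluster: surface the shared data as a local database
create database raw_vault_shared from datashare raw_vault_share of namespace '<producer-cluster-namespace-guid>';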

All of the underlying tables in Raw Data Vault can be loaded simultaneously. This makes great use of the massively parallel processing architecture in Amazon Redshift. For our business model, it makes sense to create a Business Data Vault layer, which can be read by an Information Mart to perform dimensional analysis on ticket sales. It can give us insights on the top home teams in fan attendance and how that correlates with specific sport locations or cities. Running these queries involves joining multiple tables. It’s important to design an optimal Business Data Vault layer to avoid excessive joins for deriving these insights.

For example, to get the number of tickets per city for a date range in June 2021, the SQL looks like the following code:

SELECT name,count(lpt.ticket_fk) as tickets_sold
FROM lnk_person_ticket lpt
  JOIN lnk_ticket_sport_event ltse on lpt.ticket_fk = ltse.ticket_fk
  JOIN lnk_sport_event_location lsel on ltse.sport_event_fk = lsel.sport_event_fk
  JOIN lnk_location_city llc on lsel.location_fk = llc.location_fk
  JOIN sat_city sc on llc.city_fk = sc.city_pk
  JOIN sat_sport_event sse on lsel.sport_event_fk = sse.sport_event_pk
Where start_date between '2021-06-05' and '2021-06-15' group by 1;

We can use the EXPLAIN command for the preceding query to get the Amazon Redshift query plan. The following plan shows that the specified joins require broadcasting data across nodes, since the join conditions are on different keys. This makes the query computationally expensive:

dev=# explain SELECT
  name, count(lpt.ticket_fk) as tickets_sold FROM lnk_person_ticket lpt
  JOIN lnk_ticket_sport_event ltse on lpt.ticket_fk = ltse.ticket_fk
  JOIN lnk_sport_event_location lsel on ltse.sport_event_fk = lsel.sport_event_fk
  JOIN lnk_location_city llc on lsel.location_fk = llc.location_fk
  JOIN sat_city sc on llc.city_fk = sc.city_pk
  JOIN sat_sport_event sse on lsel.sport_event_fk = sse.sport_event_pk
where
  start_date between '2021-06-05'
  and '2021-06-15'
GROUP BY
  1;
                            QUERY PLAN
----------------------------------------------------------------------
 XN HashAggregate  (cost=96331086541.29..96331086564.36 rows=9226 width=49)
   ->  XN Hash Join DS_BCAST_INNER  (cost=166693605.84..96331086495.16 rows=9226 width=49)
          Hash Cond: (("outer".ticket_fk)::text = ("inner".ticket_fk)::text)
          ->  XN Hash Join DS_BCAST_INNER  (cost=166690878.95..344629685.90 rows=621783 width=49)
                Hash Cond: (("outer".sport_event_fk)::text = ("inner".sport_event_fk)::text)
                ->  XN Seq Scan on lnk_ticket_sport_event ltse  (cost=0.00..147804.35 rows=14780435 width=72)
                ->  XN Hash  (cost=166690878.49..166690878.49 rows=185 width=85)
                      ->  XN Hash Join DS_BCAST_INNER  (cost=49690773.08..166690878.49 rows=185 width=85)
                            Hash Cond: (("outer".location_fk)::text = ("inner".location_fk)::text)
                            ->  XN Hash Join DS_BCAST_INNER  (cost=65.61..79200165.14 rows=179 width=108)
                                  Hash Cond: (("outer".sport_event_fk)::text = ("inner".sport_event_pk)::text)
                                  ->  XN Seq Scan on lnk_sport_event_location lsel  (cost=0.00..43.44 rows=4344 width=72)
                                  ->  XN Hash  (cost=65.16..65.16 rows=180 width=36)
                                        ->  XN Seq Scan on sat_sport_event sse  (cost=0.00..65.16 rows=180 width=36)
                                              Filter: ((start_date <= '2021-06-15'::date) AND (start_date >= '2021-06-05'::date))
                            ->  XN Hash  (cost=49690707.31..49690707.31 rows=63 width=49)
                                  ->  XN Hash Join DS_BCAST_INNER  (cost=0.78..49690707.31 rows=63 width=49)
                                        Hash Cond: (("outer".city_pk)::text = ("inner".city_fk)::text)
                                       ->  XN Seq Scan on sat_city sc  (cost=0.00..27909.51 rows=2790951 width=49)
                                       ->  XN Hash  (cost=0.62..0.62 rows=62 width=72)
                                             ->  XN Seq Scan on lnk_location_city llc  (cost=0.00..0.62 rows=62 width=72)
         ->  XN Hash  (cost=2181.51..2181.51 rows=218151 width=36)
               ->  XN Seq Scan on lnk_person_ticket lpt  (cost=0.00..2181.51 rows=218151 width=36)
(23 rows)

Let’s discuss the latest Amazon Redshift features that help optimize the performance of these queries on top of a Business Data Vault model.

Use Amazon Redshift features to query the Data Vault

Automatic table optimization

Traditionally, to optimize joins in Amazon Redshift, it's recommended to use distribution keys and styles to co-locate data on the same nodes, based on common join predicates. The Raw Data Vault layer has a very well-defined pattern, which is ideal for determining the distribution keys. However, the broad range of SQL queries applicable to the Business Data Vault makes it hard to predict the consumption patterns that would drive your distribution strategy.

Automatic table optimization lets you get the fastest performance quickly without needing to invest time to manually tune and implement table optimizations. Automatic table optimization continuously observes how queries interact with tables, and it uses machine learning (ML) to select the best sort and distribution keys to optimize performance for the cluster’s workload. If Amazon Redshift determines that applying a key will improve cluster performance, then tables are automatically altered within hours without requiring administrator intervention.

Automatic table optimization provided the following recommendations for the preceding query to get the number of tickets per city for June 2021. The recommendations suggest modifying the distribution style and sort keys for the tables involved in the query.

dev=# select * from svv_alter_table_recommendations;
   type    | database | table_id | group_id |                                           ddl                                           | auto_eligible
-----------+----------+----------+----------+-----------------------------------------------------------------------------------------+---------------
 diststyle | dev      |   127372 |        0 | ALTER TABLE "public"."lnk_person_ticket" ALTER DISTSTYLE KEY DISTKEY "ticket_fk"        | f
 sortkey   | dev      |   127421 |       -1 | ALTER TABLE "public"."lnk_ticket_sport_event" ALTER COMPOUND SORTKEY ("sport_event_fk") | f
 diststyle | dev      |   127421 |        0 | ALTER TABLE "public"."lnk_ticket_sport_event" ALTER DISTSTYLE KEY DISTKEY "ticket_fk"   | f
 sortkey   | dev      |   145032 |       -1 | ALTER TABLE "public"."sat_city" ALTER COMPOUND SORTKEY ("city_pk")                      | f

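If you prefer not to wait for automatic application, you can also apply the recommended DDL yourself. The following is a minimal sketch based on the recommendations above:

alter table public.lnk_person_ticket alter diststyle key distkey ticket_fk;
alter table public.lnk_ticket_sport_event alter diststyle key distkey ticket_fk;
alter table public.lnk_ticket_sport_event alter compound sortkey (sport_event_fk);
alter table public.sat_city alter compound sortkey (city_pk);
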
After the recommended distribution keys and sort keys were applied, the explain plan shows DS_DIST_NONE for the outermost join, which no longer requires data redistribution for this query. The data required for that join is co-located across Amazon Redshift nodes.

QUERY PLAN
----------------------------------------------------------------------
 XN HashAggregate  (cost=344646541.29..344646564.36 rows=9226 width=49)
   ->  XN Hash Join DS_DIST_NONE  (cost=166693605.84..344646495.16 rows=9226 width=49)
          Hash Cond: (("outer".ticket_fk)::text = ("inner".ticket_fk)::text)
          ->  XN Hash Join DS_BCAST_INNER  (cost=166690878.95..344629685.90 rows=621783 width=49)
                Hash Cond: (("outer".sport_event_fk)::text = ("inner".sport_event_fk)::text)
                ->  XN Seq Scan on lnk_ticket_sport_event ltse  (cost=0.00..147804.35 rows=14780435 width=72)
                ->  XN Hash  (cost=166690878.49..166690878.49 rows=185 width=85)
                      ->  XN Hash Join DS_BCAST_INNER  (cost=49690773.08..166690878.49 rows=185 width=85)
                            Hash Cond: (("outer".location_fk)::text = ("inner".location_fk)::text)
                            ->  XN Hash Join DS_BCAST_INNER  (cost=65.61..79200165.14 rows=179 width=108)
                                  Hash Cond: (("outer".sport_event_fk)::text = ("inner".sport_event_pk)::text)
                                  ->  XN Seq Scan on lnk_sport_event_location lsel  (cost=0.00..43.44 rows=4344 width=72)
                                  ->  XN Hash  (cost=65.16..65.16 rows=180 width=36)
                                        ->  XN Seq Scan on sat_sport_event sse  (cost=0.00..65.16 rows=180 width=36)
                                              Filter: ((start_date <= '2021-06-15'::date) AND (start_date >= '2021-06-05'::date))
                            ->  XN Hash  (cost=49690707.31..49690707.31 rows=63 width=49)
                                  ->  XN Hash Join DS_BCAST_INNER  (cost=0.78..49690707.31 rows=63 width=49)
                                        Hash Cond: (("outer".city_pk)::text = ("inner".city_fk)::text)
                                       ->  XN Seq Scan on sat_city sc  (cost=0.00..27909.51 rows=2790951 width=49)
                                       ->  XN Hash  (cost=0.62..0.62 rows=62 width=72)
                                             ->  XN Seq Scan on lnk_location_city llc  (cost=0.00..0.62 rows=62 width=72)
         ->  XN Hash  (cost=2181.51..2181.51 rows=218151 width=36)
               ->  XN Seq Scan on lnk_person_ticket lpt  (cost=0.00..2181.51 rows=218151 width=36)
(23 rows)

Materialized views in Amazon Redshift

The data analyst responsible for running this analysis benefits significantly from creating a materialized view in the Business Data Vault schema that pre-computes the results of the query. You can create it by running the following SQL:

CREATE MATERIALIZED VIEW bridge_city_ticket_aggregation_mv
AUTO REFRESH YES
AS SELECT name, count(lpt.ticket_fk) as tickets_sold
FROM lnk_person_ticket lpt
JOIN lnk_ticket_sport_event ltse on lpt.ticket_fk = ltse.ticket_fk 
JOIN lnk_sport_event_location lsel on ltse.sport_event_fk = lsel.sport_event_fk 
JOIN lnk_location_city llc on lsel.location_fk = llc.location_fk 
JOIN sat_city sc on llc.city_fk = sc.city_pk 
GROUP BY 1;

To get the latest satellite values, we must include load_dts in our join. For simplicity, we don't do that in this materialized view.
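
For completeness, the following is a minimal sketch of how you could restrict a satellite to its latest row per business key using load_dts before joining. It's shown as a plain query rather than a materialized view because constructs such as window functions can limit how a materialized view is refreshed, and it assumes sat_city follows the same satellite pattern with a load_dts column.

-- Restrict sat_city to its latest row per business key, then aggregate (a sketch)
with latest_sat_city as (
  select city_pk, name
  from (
    select s.*,
           row_number() over (partition by city_pk order by load_dts desc) as rn
    from sat_city s
  ) t
  where rn = 1
)
select name, count(lpt.ticket_fk) as tickets_sold
from lnk_person_ticket lpt
join lnk_ticket_sport_event ltse on lpt.ticket_fk = ltse.ticket_fk
join lnk_sport_event_location lsel on ltse.sport_event_fk = lsel.sport_event_fk
join lnk_location_city llc on lsel.location_fk = llc.location_fk
join latest_sat_city sc on llc.city_fk = sc.city_pk
group by 1;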

You can now reduce this query, in both length and complexity, to something as simple as the following:

SELECT * FROM bridge_city_ticket_aggregation_mv;

The run plan in this case is as follows:

XN Seq Scan on mv_tbl__bridge_city_ticket_aggregation_mv__0 derived_table1  (cost=0.00..0.36 rows=36 width=524)

More importantly, Amazon Redshift can automatically use the materialized view even if that’s not explicitly stated.

The preceding scenario addresses the needs of a specific analysis because the resulting materialized view is an aggregate. In a more general scenario, after reviewing our Data Vault ER diagram, you can observe that any query that involves ticket sales analysis per location requires a substantial number of joins, all of which use different join keys. Therefore, any such analysis comes at a significant performance cost. For example, to get the count of tickets sold per city and stadium name, you must run a query like the following:

SELECT sc.name city_name, ssl.name stadium_name, count(lpt.ticket_fk) tickets_sold
FROM lnk_person_ticket lpt
JOIN lnk_ticket_sport_event ltse ON lpt.ticket_fk = ltse.ticket_fk 
JOIN lnk_sport_event_location lsel ON ltse.sport_event_fk = lsel.sport_event_fk 
JOIN sat_location ssl ON lsel.location_fk = ssl.location_pk 
JOIN lnk_location_city llc ON lsel.location_fk = llc.location_fk 
JOIN sat_city sc ON llc.city_fk = sc.city_pk 
GROUP BY 1, 2;

You can use the EXPLAIN command for the preceding query to get the explain plan and know how expensive such an operation is:

XN HashAggregate  (cost=99574385259.46..99574385829.64 rows=228071 width=68)
  ->  XN Hash Join DS_BCAST_INNER  (cost=127173776.83..99574383548.93 rows=228071 width=68)
        Hash Cond: (("outer".sport_event_fk)::text = ("inner".sport_event_fk)::text)
        ->  XN Hash Join DS_BCAST_INNER  (cost=2726.89..95986925283.91 rows=219289 width=72)
              Hash Cond: (("outer".ticket_fk)::text = ("inner".ticket_fk)::text)
              ->  XN Seq Scan on lnk_ticket_sport_event ltse  (cost=0.00..147804.35 rows=14780435 width=72)
              ->  XN Hash  (cost=2181.51..2181.51 rows=218151 width=36)
                    ->  XN Seq Scan on lnk_person_ticket lpt  (cost=0.00..2181.51 rows=218151 width=36)
        ->  XN Hash  (cost=127171038.56..127171038.56 rows=4553 width=68)
              ->  XN Hash Join DS_BCAST_INNER  (cost=49690708.24..127171038.56 rows=4553 width=68)
                    Hash Cond: (("outer".location_fk)::text = ("inner".location_fk)::text)
                    ->  XN Hash Join DS_BCAST_INNER  (cost=0.78..39680186.12 rows=4416 width=127)
                          Hash Cond: (("outer".location_fk)::text = ("inner".location_pk)::text)
                          ->  XN Seq Scan on lnk_sport_event_location lsel  (cost=0.00..43.44 rows=4344 width=72)
                          ->  XN Hash  (cost=0.62..0.62 rows=62 width=55)
                                ->  XN Seq Scan on sat_location ssl  (cost=0.00..0.62 rows=62 width=55)
                    ->  XN Hash  (cost=49690707.31..49690707.31 rows=63 width=49)
                          ->  XN Hash Join DS_BCAST_INNER  (cost=0.78..49690707.31 rows=63 width=49)
                                Hash Cond: (("outer".city_pk)::text = ("inner".city_fk)::text)
                                ->  XN Seq Scan on sat_city sc  (cost=0.00..27909.51 rows=2790951 width=49)
                                ->  XN Hash  (cost=0.62..0.62 rows=62 width=72)
                                      ->  XN Seq Scan on lnk_location_city llc  (cost=0.00..0.62 rows=62 width=72)

We can identify commonly joined tables, like hub_sport_event, hub_ticket and hub_location, and then boost the performance of queries by creating materialized views that implement these joins ahead of time. For example, we can create a materialized view to join tickets to sport locations:

CREATE MATERIALIZED VIEW bridge_tickets_per_stadium_mv
AUTO REFRESH YES
AS select hsl.hub_sport_location_key location_id, hub_ticket_seq tickets_id , start_date date, "name" stadium_name
from hub_ticket
join lnk_ticket_sport_event ltse on hub_ticket_seq = hub_ticket_key
join hub_sport_event hse on hse.hub_sport_event_key = ltse.hub_sport_event_seq
join sat_sport_event sse on sse.hub_sport_event_key = hse.hub_sport_event_key
join lnk_sport_event_location lsel on hse.hub_sport_event_key = lsel.hub_sport_event_seq
join hub_sport_location hsl on hub_location_seq = hub_sport_location_key
join sat_sport_location ssl on ssl.hub_sport_location_key = hsl.hub_sport_location_key;

If we don’t make any edits to the expensive query that we ran before, then the run plan is as follows:

XN HashAggregate (cost=88052548.77..88064188.37 rows=4655838 width=457)
-> XN Hash Join DS_BCAST_INNER (cost=49690707.47..88017629.99 rows=4655838 width=457)
Hash Cond: (("outer".location_id)::text = ("inner".hub_location_seq)::text)
-> XN Seq Scan on mv_tbl__bridge_tickets_per_stadium_mv__0 derived_table1 (cost=0.00..147804.35 rows=14780435 width=510)
-> XN Hash (cost=49690707.31..49690707.31 rows=63 width=49)
-> XN Hash Join DS_BCAST_INNER (cost=0.78..49690707.31 rows=63 width=49)
Hash Cond: (("outer".hub_city_key)::text = ("inner".hub_city_seq)::text)
-> XN Seq Scan on hub_city hc (cost=0.00..27909.51 rows=2790951 width=49)
-> XN Hash (cost=0.62..0.62 rows=62 width=72)
-> XN Seq Scan on lnk_location_city llc (cost=0.00..0.62 rows=62 width=72)

Amazon Redshift now uses the materialized view for any future queries that involve joining tickets with sports locations. For example, a separate business intelligence (BI) team looking into the dates with the highest ticket sales can run a query like the following:

select start_date date, count(hub_ticket_seq) tickets
from hub_ticket
join lnk_ticket_sport_event ltse on hub_ticket_seq = hub_ticket_key
join hub_sport_event hse on hse.hub_sport_event_key  = ltse.hub_sport_event_seq 
join sat_sport_event sse on sse.hub_sport_event_key = hse.hub_sport_event_key 
join lnk_sport_event_location lsel on hse.hub_sport_event_key = lsel.hub_sport_event_seq
join hub_sport_location hsl on hub_location_seq = hub_sport_location_key
join sat_sport_location ssl on ssl.hub_sport_location_key = hsl.hub_sport_location_key 
group by 1
order by 2 desc
limit 10;

Amazon Redshift can implicitly understand that the query can be optimized by using the materialized view we already created, thereby avoiding joins that involve broadcasting data across nodes. This can be seen from the run plan:

XN Limit (cost=1000000221707.65..1000000221707.68 rows=10 width=40)
-> XN Merge (cost=1000000221707.65..1000000221707.75 rows=39 width=40)
Merge Key: count(derived_table1.tickets_id)
-> XN Network (cost=1000000221707.65..1000000221707.75 rows=39 width=40)
Send to leader
-> XN Sort (cost=1000000221707.65..1000000221707.75 rows=39 width=40)
Sort Key: count(derived_table1.tickets_id)
-> XN HashAggregate (cost=221706.52..221706.62 rows=39 width=40)
-> XN Seq Scan on mv_tbl__bridge_tickets_per_stadium_mv__0 derived_table1 (cost=0.00..147804.35 rows=14780435 width=40)

If we drop the materialized view, then the preceding query results in the following plan:

XN Limit (cost=7509421514303.64..7509421514303.66 rows=10 width=40)
-> XN Merge (cost=7509421514303.64..7509421514303.73 rows=39 width=40)
Merge Key: count(ltse.hub_ticket_seq)
-> XN Network (cost=7509421514303.64..7509421514303.73 rows=39 width=40)
Send to leader
-> XN Sort (cost=7509421514303.64..7509421514303.73 rows=39 width=40)
Sort Key: count(ltse.hub_ticket_seq)
-> XN HashAggregate (cost=6509421514302.51..6509421514302.61 rows=39 width=40)
-> XN Hash Join DS_BCAST_INNER (cost=54745206.40..6509421439263.58 rows=15007786 width=40)
Hash Cond: (("outer".hub_sport_event_seq)::text = ("inner".hub_sport_event_seq)::text)
-> XN Hash Join DS_BCAST_INNER (cost=184864.04..6507391239560.52 rows=14634339 width=148)
Hash Cond: (("outer".hub_ticket_seq)::text = ("inner".hub_ticket_key)::text)
-> XN Hash Join DS_BCAST_INNER (cost=108.60..3997288304.94 rows=14558405 width=148)
Hash Cond: (("outer".hub_sport_event_seq)::text = ("inner".hub_sport_event_key)::text)
-> XN Hash Join DS_BCAST_INNER (cost=54.30..2085599304.09 rows=14669000 width=112)
Hash Cond: (("outer".hub_sport_event_seq)::text = ("inner".hub_sport_event_key)::text)
-> XN Seq Scan on lnk_ticket_sport_event ltse (cost=0.00..147804.35 rows=14780435 width=72)
-> XN Hash (cost=43.44..43.44 rows=4344 width=40)
-> XN Seq Scan on sat_sport_event sse (cost=0.00..43.44 rows=4344 width=40)
-> XN Hash (cost=43.44..43.44 rows=4344 width=36)
-> XN Seq Scan on hub_sport_event hse (cost=0.00..43.44 rows=4344 width=36)
-> XN Hash (cost=147804.35..147804.35 rows=14780435 width=36)
-> XN Seq Scan on hub_ticket (cost=0.00..147804.35 rows=14780435 width=36)
-> XN Hash (cost=54560331.14..54560331.14 rows=4489 width=36)
-> XN Hash Join DS_BCAST_INNER (cost=1.55..54560331.14 rows=4489 width=36)
Hash Cond: (("outer".hub_location_seq)::text = ("inner".hub_sport_location_key)::text)
-> XN Hash Join DS_BCAST_INNER (cost=0.78..27280186.11 rows=4416 width=108)
Hash Cond: (("outer".hub_location_seq)::text = ("inner".hub_sport_location_key)::text)
-> XN Seq Scan on lnk_sport_event_location lsel (cost=0.00..43.44 rows=4344 width=72)
-> XN Hash (cost=0.62..0.62 rows=62 width=36)
-> XN Seq Scan on sat_sport_location ssl (cost=0.00..0.62 rows=62 width=36)
-> XN Hash (cost=0.62..0.62 rows=62 width=36)
-> XN Seq Scan on hub_sport_location hsl (cost=0.00..0.62 rows=62 width=36)

End-users of the data warehouse don't need to worry about refreshing the data in the materialized views, because we enabled automatic materialized view refresh when creating them. Future use cases involving new dimensions also benefit from the existence of these materialized views.
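
If you create a materialized view without the AUTO REFRESH YES option, or you want to force a refresh ahead of schedule, you can refresh it on demand:

refresh materialized view bridge_tickets_per_stadium_mv;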

Prepared statements in the data vault with materialized views in Amazon Redshift

Another type of query that we can run on top of the Business Data Vault schema is prepared statements with bind variables. It's quite common to see user interfaces integrated with data warehouses that let users dynamically change the value of a variable through a selection in a choice list or a link in a cross-tab. When the variable changes, so do the query condition and the report or dashboard contents. The following query is a prepared statement to get the count of tickets sold per city and stadium name. It takes the stadium name as a variable and provides the number of tickets sold in that stadium.

PREPARE prep_statement (varchar(100))
AS select hc.name city_name, ssl."name" stadium_name, count(hub_ticket_seq) tickets
 from hub_ticket
 join lnk_ticket_sport_event ltse on hub_ticket_seq = hub_ticket_key
 join hub_sport_event hse on hse.hub_sport_event_key = ltse.hub_sport_event_seq
 join sat_sport_event sse on sse.hub_sport_event_key = hse.hub_sport_event_key
 join lnk_sport_event_location lsel on hse.hub_sport_event_key = lsel.hub_sport_event_seq
 join hub_sport_location hsl on hub_location_seq = hub_sport_location_key
 join sat_sport_location ssl on ssl.hub_sport_location_key = hsl.hub_sport_location_key
 join lnk_location_city llc on llc.hub_location_seq = hsl.hub_sport_location_key
 join hub_city hc on llc.hub_city_seq = hc.hub_city_key
 where ssl."name"  = $1
 group by 1, 2;
PREPARE

Let’s run the query to see the city and tickets sold for different stadiums passed as a variable in this prepared statement:

dev=# EXECUTE prep_statement('Lucas Oil Stadium');
  city_name   |   stadium_name    | tickets
--------------+-------------------+---------
 Indianapolis | Lucas Oil Stadium |    8892
(1 row)

dev=# EXECUTE prep_statement('Ford Field');
 city_name | stadium_name | tickets
-----------+--------------+---------
 Detroit   | Ford Field   |   42720
(1 row)

Let's dive into the explain plan of this prepared statement to see whether Amazon Redshift can implicitly use the materialized view bridge_tickets_per_stadium_mv that we created earlier:

XN HashAggregate  (cost=87685290.31..87685914.69 rows=249748 width=66)
->  XN Hash Join DS_BCAST_INNER  (cost=49690707.47..87683417.20 rows=249748 width=66)
Hash Cond: (("outer".location_id)::text = ("inner".hub_location_seq)::text)
->  XN Seq Scan on mv_tbl__bridge_tickets_per_stadium_mv__0 derived_table1  (cost=0.00..184755.44 rows=242303 width=89)
Filter: ((stadium_name)::text = ($1)::text)
->  XN Hash  (cost=49690707.31..49690707.31 rows=63 width=49)
->  XN Hash Join DS_BCAST_INNER  (cost=0.78..49690707.31 rows=63 width=49)
Hash Cond: (("outer".hub_city_key)::text = ("inner".hub_city_seq)::text)
->  XN Seq Scan on hub_city hc  (cost=0.00..27909.51 rows=2790951 width=49)
->  XN Hash  (cost=0.62..0.62 rows=62 width=72)
->  XN Seq Scan on lnk_location_city llc  (cost=0.00..0.62 rows=62 width=72)

As the explain plan shows, Amazon Redshift optimized the query to implicitly use the materialized view created earlier, even for prepared statements.
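
Prepared statements last for the duration of the session; when you no longer need one, you can release it explicitly:

deallocate prep_statement;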

Conclusion

In this post, we demonstrated how to implement a Data Vault model in Amazon Redshift and leverage its out-of-the-box features. We also discussed how Amazon Redshift features such as seamless data sharing, automatic table optimization, materialized views, and automatic materialized view refresh can help you build data models that meet high performance requirements.


About the Authors

George Komninos is a solutions architect for the AWS Data Lab. He helps customers convert their ideas into production-ready data products. Before AWS, he spent three years at Alexa Information as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

Devika Singh is a Senior Solutions Architect at Amazon Web Services. Devika helps customers architect and build database and data analytics solutions to accelerate their path to production as part of the AWS Data Lab. She has expertise in database and data warehouse migrations to AWS, helping customers improve the value of their solutions with AWS.

Simon Dimaline has specialized in data warehousing and data modeling for more than 20 years. He currently works for the Data & Analytics practice within AWS Professional Services accelerating customers’ adoption of AWS analytics services.

Federate Amazon Redshift access with SecureAuth single sign-on

Post Syndicated from Srikanth Sopirala original https://aws.amazon.com/blogs/big-data/federate-amazon-redshift-access-with-secureauth-single-sign-on/

Amazon Redshift is the leading cloud data warehouse that delivers up to 3x better price performance compared to other cloud data warehouses by using massively parallel query execution, columnar storage on high-performance disks, and results caching. You can confidently run mission-critical workloads, even in highly regulated industries, because Amazon Redshift comes with out-of-the-box security and compliance.

You can use your corporate identity providers (IdPs), for example Azure AD, Active Directory Federation Services, Okta, or Ping Federate, with Amazon Redshift to provide single sign-on (SSO) to your users so they can use their IdP accounts to log in and access Amazon Redshift. With federation, you can centralize management and governance of authentication and permissions. For more information about the federation workflow using AWS Identity and Access Management (IAM) and an IdP, see Federate Database User Authentication Easily with IAM and Amazon Redshift.

This post shows you how to use the Amazon Redshift browser-based plugin with SecureAuth to enable federated SSO into Amazon Redshift.

Solution overview

To implement this solution, you complete the following high-level steps:

  1. Configure your SecureAuth IdP v20.06 or later, including users, groups, and SAML application.
  2. Configure your AWS service provider, including IdP, IAM policy, and IAM role.
  3. Set up your SQL Workbench/J client and test SSO.

The process flow for federated authentication includes the following steps:

  1. The user logs in from SQL Workbench/J.
  2. The IdP authenticates using the corporate user name and password, and returns a SAML assertion.
  3. The client uses the AWS SDK to call AWS Security Token Service (AWS STS) to assume a role with SAML.
  4. AWS STS returns temporary AWS credentials.
  5. The client uses the temporary AWS credentials to connect to Amazon Redshift.

The following diagram illustrates this process flow.

Prerequisites

Before starting this walkthrough, you must have the following:

  • An Amazon Redshift cluster
  • A SecureAuth account
  • A database user with superuser permissions
  • Amazon Redshift DB groups created in advance, with the necessary privileges assigned to them (a sketch follows this list)

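The following is a minimal sketch of that last prerequisite, assuming a group named finance (matching the example DB group used later in this post) and the public schema; adapt the group name, schema, and grants to your own objects.

-- Create a DB group and grant it the privileges it needs (schema and grants are assumptions)
create group finance;
grant usage on schema public to group finance;
grant select on all tables in schema public to group finance;
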
Set up a user data store at your IdP (SecureAuth)

For instructions on setting up your user data store integration with SecureAuth, see Add a User Data Store. The following screenshot shows a sample setup.

Configure your IdP (SecureAuth)

The next step is to create a new realm in your IdP.

  1. On the SecureAuth portal, on the Create Realms menu, choose Create New From Template.
  2. Select the application (for this post, Amazon Web Services) from the list to establish the target resource for the new realm.
  3. For Page Title/Header, enter the title to appear on the web admin and end-user login pages (for this post, we enter Amazon Redshift).
  4. For Data source, choose the type of Active Directory integration to use with your realm (the user data source we created earlier).
  5. For Start Location, choose At Identity Provider to specify the provider the end-user uses to initiate the login process.
  6. For Global Aux Fields, set Global Aux ID 1 to true.
  7. For SAML Consumer URL, enter http://localhost:7890/redshift/.
  8. Configure the SAML attributes as follows (attribute, SAML attribute name, and value):
  • Attribute 1 – https://aws.amazon.com/SAML/Attributes/Role – Full Group DN List
  • Attribute 2 – https://aws.amazon.com/SAML/Attributes/RoleSessionName – Authenticated User ID
  • Attribute 3 – https://redshift.amazon.com/SAML/Attributes/AutoCreate – Global Aux ID 1
  • Attribute 4 – https://redshift.amazon.com/SAML/Attributes/DbUser – Authenticated User ID

The value of Attribute 1 must be dynamically populated with the AWS roles associated with the user that you use to access the Amazon Redshift cluster. This can be a multi-valued SAML attribute to accommodate situations where a user belongs to multiple AD groups or AWS roles with access to the Amazon Redshift cluster. After it's dynamically constructed using the following steps, the content of Attribute 1 looks like arn:aws:iam::AWS-ACCOUNT-NUMBER:role/AWS-ROLE-NAME, arn:aws:iam::AWS-ACCOUNT-NUMBER:saml-provider/SAML-PROVIDER-NAME.

The dynamic population of Attribute 1 (https://aws.amazon.com/SAML/Attributes/Role) can be done using SecureAuth’s transformation engine.

  1. Enable the transformation engine and use a transformation script.
  2. Use the following sample XSLT transformation script as part of SecureAuth's transformation engine to take the AD group names a user belongs to and generate the IAM role information by rewriting the AD group names with ARN details.

We are assuming that the AD group name and IAM role name will be the same. This is key for the transformation to work.

You have to substitute the following in the script:

  • AWS-ACCOUNT-NUMBER – A 12-digit AWS account number where you configure the Amazon Redshift access IAM roles.
  • SAML-PROVIDER-NAME – The name of SAML provider that you create on the AWS side within IAM using the SAML provider metadata file from SecureAuth. Because you haven’t created the SAML provider yet in IAM, make a note of the name that you’re providing here and reuse that name when you are doing the AWS side of the configuration.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="urn:my-scripts">
<msxsl:script language="C#" implements-prefix="user">
<msxsl:using namespace="System.Globalization"/>
<![CDATA[
public static string Transform(string v)
        {
              int startPosition = v.IndexOf("=") + 1;
              string rolearn = v.Substring(startPosition,v.IndexOf(",", startPosition) - startPosition);
              return string.Format("arn:aws:iam::AWS-ACCOUNT-NUMBER:role/{0},arn:aws:iam::AWS-ACCOUNT-NUMBER:saml-provider/SAML-PROVIDER-NAME",rolearn);
        }
    ]]>
</msxsl:script>
  <xsl:template match="/">
    <user>
      <UserID>
        <xsl:value-of select="user/UserID" />
      </UserID>
      <Email1>
        <xsl:value-of select="user/Email1" />
      </Email1>
      <Email2>
        <xsl:value-of select="user/Email2" />
      </Email2>
      <Email3>
        <xsl:value-of select="user/Email3" />
      </Email3>
      <Email4>
        <xsl:value-of select="user/Email4" />
      </Email4>
      <AuxID1>
        <xsl:value-of select="user/AuxID1" />
      </AuxID1>
      <AuxID2>
        <xsl:value-of select="user/AuxID2" />
      </AuxID2>
      <AuxID3>
        <xsl:value-of select="user/AuxID3" />
      </AuxID3>
      <AuxID4>
        <xsl:value-of select="user/AuxID4" />
      </AuxID4>
      <AuxID5>
        <xsl:value-of select="user/AuxID5" />
      </AuxID5>
      <AuxID6>
        <xsl:value-of select="user/AuxID6" />
      </AuxID6>
      <AuxID7>
        <xsl:value-of select="user/AuxID7" />
      </AuxID7>
      <AuxID8>
        <xsl:value-of select="user/AuxID8" />
      </AuxID8>
      <AuxID9>
        <xsl:value-of select="user/AuxID9" />
      </AuxID9>
      <AuxID10>
        <xsl:value-of select="user/AuxID10" />
      </AuxID10>
      <FirstName>
        <xsl:value-of select="user/FirstName" />
      </FirstName>
      <LastName>
        <xsl:value-of select="user/LastName" />
      </LastName>
      <Phone1>
        <xsl:value-of select="user/Phone1" />
      </Phone1>
      <Phone2>
        <xsl:value-of select="user/Phone2" />
      </Phone2>
      <Phone3>
        <xsl:value-of select="user/Phone3" />
      </Phone3>
      <Phone4>
        <xsl:value-of select="user/Phone4" />
      </Phone4>
      <GroupList>
        <Groups>
          <xsl:for-each select="user/GroupList/Groups/Value">
            <Value>
              <xsl:value-of select="current()" />
            </Value>
          </xsl:for-each>
        </Groups>
        <FullGroups>
         <xsl:for-each select="user/GroupList/FullGroups/Value">
           <Value>
             <xsl:value-of select="user:Transform(current())"/>
           </Value>
         </xsl:for-each>
        </FullGroups>
     </GroupList>
    </user>
  </xsl:template>
</xsl:stylesheet>
  1. If you want to filter down the number of AD groups to the ones that are associated with Amazon Redshift access, you can apply a group filter expression for Attribute 1. For example, in the following screenshot, we’re assuming that all AD groups associated with Amazon Redshift access contain the word “Redshift”. This is represented as a regular expression.
  2. Set the value of Global Aux ID 1 to true.

When the setup is complete, your SecureAuth side settings should look similar to the following screenshots.

  1. When the configuration is complete, download the SAML provider metadata XML file from SecureAuth.

Configure your service provider (AWS)

The next step is to set up the service provider.

Create an IdP in IAM

To create an IdP in IAM, complete the following steps:

  1. On the IAM console, choose Identity providers in the navigation pane.
  2. Choose Create Provider.
  3. In the Configure Provider section, choose SAML as the provider type, enter the same provider name that you specified during the SecureAuth configuration, and upload the metadata document that you downloaded earlier from your IdP.
  4. On the next page, choose Create to complete the IdP creation within IAM.

You should see a page similar to the following after the IdP creation is complete.

Create a SAML 2.0 federation IAM role and corresponding policy

For each Amazon Redshift DB group of users, we need to create an IAM role and the corresponding IAM policy. Repeat the steps in this section for each DB group.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Select type of trusted entity, choose SAML 2.0 federation.
  4. For SAML provider, choose the IdP you created earlier.
  5. Select Allow programmatic access only.
  6. For Attribute, choose SAML:aud.
  7. For Value, enter http://localhost:7890/redshift/.
  8. Choose Next: Permissions.
  9. Under Attach permissions policies, choose Create policy.
  10. Choose the JSON tab on the Create policy page.
  11. Enter the following sample IAM policy, with the following information:
    1. REGION – The Region where your Amazon Redshift cluster exists. For example, us-east-1.
    2. AWS-ACCOUNT-NUMBER – The 12-digit account number in which your Amazon Redshift cluster exists.
    3. REDSHIFT-CLUSTER-IDENTIFIER – Your Amazon Redshift cluster identifier that you gathered earlier.
    4. DATABASE – Your Amazon Redshift cluster database name.
    5. UNIQUE-ROLE-IDENTIFIER – For now, leave it as is; you have to come back to the policy and change it after the role is created.
    6. REDSHIFT-DB-GROUP – The database group name within Amazon Redshift.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:REGION:AWS-ACCOUNT-NUMBER:dbname:REDSHIFT-CLUSTER-IDENTIFIER/DATABASE",
                "arn:aws:redshift:REGION:AWS-ACCOUNT-NUMBER:dbuser:REDSHIFT-CLUSTER-IDENTIFIER/${redshift:DbUser}",
                "arn:aws:redshift:REGION:AWS-ACCOUNT-NUMBER:cluster:REDSHIFT-CLUSTER-IDENTIFIER"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:userid": "UNIQUE-ROLE-IDENTIFIER:${redshift:DbUser}"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": "redshift:CreateClusterUser",
            "Resource": "arn:aws:redshift:REGION:AWS-ACCOUNT-NUMBER:dbuser:REDSHIFT-CLUSTER-IDENTIFIER/${redshift:DbUser}"
        },
        {
            "Effect": "Allow",
            "Action": "redshift:JoinGroup",
            "Resource": "arn:aws:redshift:REGION:AWS-ACCOUNT-NUMBER:dbgroup:REDSHIFT-CLUSTER-IDENTIFIER/REDSHIFT-DB-GROUP"
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters",
                "iam:ListRoles"
            ],
            "Resource": "*"
        }
    ]
}
  1. On the Review policy page, enter a name and description.
  2. Choose Create policy.
  3. When the policy creation is complete, go back to the Create role page, choose the refresh icon, and search for the newly created policy.
  4. Choose the policy and choose Next: Tags.
  5. Add any tags that you want and then Next: Review.
  6. In the Review section, provide a role name and choose Create role.

Make sure that your IAM role name matches the AD group that you're creating to support Amazon Redshift access. The transformation script that we discussed in the SecureAuth section assumes that the AD group name and the Amazon Redshift access IAM role name are the same.

We use the AWS Command Line Interface (AWS CLI) to fetch the unique role identifier for the role you just created.

  1. Configure the AWS CLI with the necessary access key ID and secret access key before running the following command:
aws iam get-role --role-name <value>

Replace <value> with the role-name that you just created. So, in this example, the command is:

aws iam get-role --role-name redshift-sales-role
  1. From the output JSON, note the value of RoleId.
  2. On the IAM console, open the policy you created earlier.
  3. Choose Edit policy.
  4. Choose the JSON tab.
  5. Replace UNIQUE-ROLE-IDENTIFIER with the RoleId fetched earlier.
  6. Choose Review policy and Save changes.

You’ve now created an IAM role and policy corresponding to the DB group for which you’re trying to enable IAM-based access.

Log in to Amazon Redshift using IdP-based credentials

To log in to Amazon Redshift using your IdP-based credentials, complete the following steps:

  1. Download the latest Amazon Redshift JDBC driver with the AWS SDK for Java.
  2. Launch the SQL Workbench/J app.
  3. Under Manage Drivers, add the Amazon Redshift JDBC driver.
  4. Create a new connection profile for the Amazon Redshift connection.
  5. Choose com.amazon.redshift.jdbc.Driver as the driver.
  6. For URL, enter jdbc:redshift:iam://REDSHIFT-CLUSTER-ENDPOINT:PORT#/DATABASE.

For example, jdbc:redshift:iam://sample-redshift-cluster-1.cxqXXXXXXXXX.us-east-1.redshift.amazonaws.com:5439/dev.

  1. Leave the Username and Password fields empty.
  2. Select Autocommit.
  3. Choose Extended Properties and provide the following values:
    1. plugin_name – com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
    2. login_url – The login URL from your IdP when you did the setup on your IdP side. For example, https://XYZ.identity.secureauth.com/SecureAuth2/.
    3. dbgroups – The DB group that you use for login (this is a required parameter). This DB group must exist in Amazon Redshift. For example, finance.
    4. preferred_role – The preferred IAM role that you want to use. If only one IAM role is populated in the SAML token provided by IdP, this isn’t required.
  4. Choose Test to check if the configuration is working.

You should be redirected to your IdP in a browser window for authentication.

You’ll see the multi-factor authentication screen if it has been set up within SecureAuth.

You should see the successful login in the browser as well as in the SQL Workbench/J app.

  1. Connect to Amazon Redshift and run a sample query, such as select current_user, which should show the currently logged-in user. A sketch of an additional verification query follows.
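
The following is a hedged verification sketch: the first statement confirms the federated user, and the second is a leader-node catalog query that lists the DB groups the session user currently belongs to.

-- Confirm the federated user name
select current_user;

-- List the DB groups the current user belongs to (leader-node catalog query)
select g.groname
from pg_group g, pg_user u
where u.usesysid = any(g.grolist)
  and u.usename = current_user;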

Summary

Amazon Redshift supports stringent compliance and security requirements with no extra cost, which makes it ideal for highly regulated industries. With federation, you can centralize management and governance of authentication and permissions by managing users and groups within the enterprise IdP and use them to authenticate to Amazon Redshift. SSO enables users to have a seamless user experience while accessing various applications in the organization.

In this post, we walked you through a step-by-step guide to configure and use SecureAuth as your IdP and enable federated SSO to an Amazon Redshift cluster. You can follow these steps to set up federated SSO for your organization and manage access privileges based on read/write needs or business function by passing group membership defined in your SecureAuth IdP to your Amazon Redshift cluster.

If you have any questions or suggestions, please leave a comment.


About the Authors

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road biking.

Sandeep Veldi is a Sr. Solutions Architect at AWS. He helps AWS customers with prescriptive architectural guidance based on their use cases and navigating their cloud journey. In his spare time, he loves to spend time with his family.

BP Yau is a Sr Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

Jamey Munroe is Head of Database Sales, Global Verticals and Strategic Accounts. He joined AWS as one of its first Database and Analytics Sales Specialists, and he’s passionate about helping customers drive bottom-line business value throughout the full lifecycle of data. In his spare time, he enjoys solving home improvement DIY project challenges, fishing, and competitive cycling.

Use the Amazon Redshift SQLAlchemy dialect to interact with Amazon Redshift

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/use-the-amazon-redshift-sqlalchemy-dialect-to-interact-with-amazon-redshift/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. You can interact with an Amazon Redshift database in several different ways. One method is using an object-relational mapping (ORM) framework. ORM is widely used by developers as an abstraction layer upon the database, which allows you to write code in your preferred programming language instead of writing SQL. SQLAlchemy is a popular Python ORM framework that enables the interaction between Python code and databases.

A SQLAlchemy dialect is the system used to communicate with various types of DBAPI implementations and databases. Previously, the SQLAlchemy dialect for Amazon Redshift used psycopg2 for communication with the database. Because psycopg2 is a Postgres connector, it doesn’t support Amazon Redshift specific functionality such as AWS Identity and Access Management (IAM) authentication for secure connections and Amazon Redshift specific data types such as SUPER and GEOMETRY. The new Amazon Redshift SQLAlchemy dialect uses the Amazon Redshift Python driver (redshift_connector) and lets you securely connect to your Amazon Redshift database. It natively supports IAM authentication and single sign-on (SSO). It also supports Amazon Redshift specific data types such as SUPER, GEOMETRY, TIMESTAMPTZ, and TIMETZ.

In this post, we discuss how you can interact with your Amazon Redshift database using the new Amazon Redshift SQLAlchemy dialect. We demonstrate how you can securely connect using Okta and perform various DDL and DML operations. Because the new Amazon Redshift SQLAlchemy dialect uses redshift_connector, users of this package can take full advantage of the connection options provided by redshift_connector, such as authenticating via IAM and identity provider (IdP) plugins. Additionally, we also demonstrate the support for IPython SqlMagic, which simplifies running interactive SQL queries directly from a Jupyter notebook.

Prerequisites

The following are the prerequisites for this post:

Get started with the Amazon Redshift SQLAlchemy dialect

It’s easy to get started with the Amazon Redshift SQLAlchemy dialect for Python. You can install the sqlalchemy-redshift library using pip. To demonstrate this, we start with a Jupyter notebook. Complete the following steps:

  1. Create a notebook instance (for this post, we call it redshift-sqlalchemy).
  2. On the Amazon SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  3. Find the instance you created and choose Open Jupyter.
  4. Open your notebook instance and create a new conda_python3 Jupyter notebook.
  5. Run the following commands to install sqlalchemy-redshift and redshift_connector:
pip install sqlalchemy-redshift
pip install redshift_connector


redshift_connector provides many different connection options that help customize how you access your Amazon Redshift cluster. For more information, see Connection Parameters.

Connect to your Amazon Redshift cluster

In this step, we show you how to connect to your Amazon Redshift cluster using two different methods: Okta SSO federation, and direct connection using your database user and password.

Connect with Okta SSO federation

As a prerequisite, set up your Amazon Redshift application in your Okta configuration. For more information, see Federate Amazon Redshift access with Okta as an identity provider.

To establish a connection to the Amazon Redshift cluster, we utilize the create_engine function. The SQLAlchemy create_engine() function produces an engine object based on a URL. The sqlalchemy-redshift package provides a custom interface for creating an RFC-1738 compliant URL that you can use to establish a connection to an Amazon Redshift cluster.

We build the SQLAlchemy URL as shown in the following code. URL.create() is available for SQLAlchemy version 1.4 and above. When authenticating using IAM, the host and port don’t need to be specified by the user. To connect with Amazon Redshift securely using SSO federation, we use the Okta user name and password in the URL.

import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy import orm as sa_orm

from sqlalchemy_redshift.dialect import SUPER, TIMESTAMPTZ, TIMETZ


# build the sqlalchemy URL. When authenticating using IAM, the host
# and port do not need to be specified by the user.
url = URL.create(
    drivername='redshift+redshift_connector',  # indicates the redshift_connector driver and dialect will be used
    database='dev',  # Amazon Redshift database
    username='[email protected]',  # Okta username
    password='<PWD>'  # Okta password
)

# a dictionary is used to store additional connection parameters
# that are specific to redshift_connector or cannot be URL encoded.
conn_params = {
    "iam": True,  # must be enabled when authenticating via IAM
    "credentials_provider": "OktaCredentialsProvider",
    "idp_host": "<prefix>.okta.com",
    "app_id": "<appid>",
    "app_name": "amazon_aws_redshift",
    "region": "<region>",
    "cluster_identifier": "<clusterid>",
    "ssl_insecure": False,  # ensures certificate verification occurs for idp_host
}

engine = sa.create_engine(url, connect_args=conn_params)

Connect with an Amazon Redshift database user and password

You can connect to your Amazon Redshift cluster using your database user and password. We construct a URL and use the URL.create() constructor, as shown in the following code:

import sqlalchemy as sa
from sqlalchemy.engine.url import URL

# build the sqlalchemy URL
url = URL.create(
    drivername='redshift+redshift_connector',  # indicates the redshift_connector driver and dialect will be used
    host='<clusterid>.xxxxxx.<aws-region>.redshift.amazonaws.com',  # Amazon Redshift host
    port=5439,  # Amazon Redshift port
    database='dev',  # Amazon Redshift database
    username='awsuser',  # Amazon Redshift username
    password='<pwd>'  # Amazon Redshift password
)

engine = sa.create_engine(url)

Next, we create a session using the engine we established above.

Session = sa_orm.sessionmaker()
Session.configure(bind=engine)
session = Session()

# Define Session-based Metadata
metadata = sa.MetaData(bind=session.bind)

Create a database table using Amazon Redshift data types and insert data

With the new Amazon Redshift SQLAlchemy dialect, you can create tables with Amazon Redshift specific data types such as SUPER, GEOMETRY, TIMESTAMPTZ, and TIMETZ.

In this step, you create a table with TIMESTAMPTZ, TIMETZ, and SUPER data types.

Optionally, you can define your table’s distribution style, sort key, and compression encoding. See the following code:

import datetime
import uuid
import random

table_name = 'product_clickstream_tz'

RedshiftDBTable = sa.Table(
    table_name,
    metadata,
    sa.Column('session_id', sa.VARCHAR(80)),
    sa.Column('click_region', sa.VARCHAR(100), redshift_encode='lzo'),
    sa.Column('product_id', sa.BIGINT),
    sa.Column('click_datetime', TIMESTAMPTZ),
    sa.Column('stream_time', TIMETZ),
    sa.Column('order_detail', SUPER),
    redshift_diststyle='KEY',
    redshift_distkey='session_id',
    redshift_sortkey='click_datetime'
)

# Drop the table if it already exists
if sa.inspect(engine).has_table(table_name):
    RedshiftDBTable.drop(bind=engine)

# Create the table (execute the "CREATE TABLE" SQL statement for "product_clickstream_tz")
RedshiftDBTable.create(bind=engine)

In this step, you populate the table by preparing and running the insert command.

# create sample data set
# generate a UUID for this row
session_id = str(uuid.uuid1())

# create Region information
click_region = "US / New York"

# create Product information
product_id = random.randint(1,100000)

# create a datetime object with timezone
click_datetime = datetime.datetime(year=2021, month=10, day=20, hour=10, minute=12, second=40, tzinfo=datetime.timezone.utc)

# create a time object with timezone
stream_time = datetime.time(hour=10, minute=14, second=56, tzinfo=datetime.timezone.utc)

# create SUPER information
order_detail = '[{"o_orderstatus":"F","o_clerk":"Clerk#0000001991","o_lineitems":[{"l_returnflag":"R","l_tax":0.03,"l_quantity":4,"l_linestatus":"F"}]}]'

# create the insert SQL statement
insert_data_row = RedshiftDBTable.insert().values(
    session_id=session_id,
    click_region=click_region,
    product_id=product_id,
    click_datetime=click_datetime,
    stream_time=stream_time,
    order_detail=order_detail
)

# execute the insert SQL statement
session.execute(insert_data_row)
session.commit()

Query and fetch results from the table

The SELECT statements generated by SQLAlchemy ORM are constructed by a query object. You can use several different methods, such as all(), first(), count(), order_by(), and join(). The following screenshot shows how you can retrieve all rows from the queried table.
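
Because the screenshot isn't reproduced here, the following short sketch (reusing the table and session defined earlier) shows the same idea:

# Retrieve all rows from the table created earlier
all_rows = session.query(RedshiftDBTable).all()
for row in all_rows:
    print(row.session_id, row.product_id, row.click_datetime, row.order_detail)

# Other query-object methods work the same way, for example:
row_count = session.query(RedshiftDBTable).count()
first_row = session.query(RedshiftDBTable).order_by(RedshiftDBTable.c.click_datetime).first()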

Use IPython SqlMagic with the Amazon Redshift SQLAlchemy dialect

The Amazon Redshift SQLAlchemy dialect now supports SqlMagic. To establish a connection, you can build the SQLAlchemy URL with the redshift_connector driver. More information about SqlMagic is available on GitHub.

In the next section, we demonstrate how you can use SqlMagic. Make sure that you have the ipython-sql package installed; if not, install it by running the following command:

pip install ipython-sql

Connect to Amazon Redshift and query the data

In this step, you build the SQLAlchemy URL to connect to Amazon Redshift and run a sample SQL query. For this demo, we have prepopulated TPCH data in the cluster from GitHub. See the following code:

import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import Session
%reload_ext sql
%config SqlMagic.displaylimit = 25

connect_to_db = URL.create(
    drivername='redshift+redshift_connector',
    host='cluster.xxxxxxxx.region.redshift.amazonaws.com',
    port=5439,
    database='dev',
    username='awsuser',
    password='xxxxxx'
)
%sql $connect_to_db
%sql select current_user, version();

You can view the data in tabular format by converting the result set to a pandas DataFrame with its DataFrame() method.

If you installed matplotlib, you can use the result set’s .plot(), .pie(), and .bar() methods for quick plotting.
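
As an illustration (the lineitem table name assumes the TPCH sample data mentioned earlier), the following sketch captures a SqlMagic result set, converts it to a DataFrame, and plots it:

# Run a query and capture the SqlMagic result set
result = %sql select l_returnflag, count(*) as line_count from lineitem group by l_returnflag;

# Convert the result set to a pandas DataFrame for tabular inspection
df = result.DataFrame()
print(df)

# Quick plotting helper, available when matplotlib is installed
result.bar()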

Clean up

Make sure that SQLAlchemy resources are closed and cleaned up when you're done with them. SQLAlchemy uses a connection pool to provide access to an Amazon Redshift cluster. By default, these connections remain open after use; if they're not properly cleaned up, this can lead to connectivity issues with your cluster. Use the following code to clean up your resources:

session.close()

# If the connection was accessed directly, ensure it is invalidated
conn = engine.connect()
conn.invalidate()

# Clean up the engine
engine.dispose()

Summary

In this post, we discussed the new Amazon Redshift SQLAlchemy dialect. We demonstrated how it lets you securely connect to your Amazon Redshift database using SSO as well as a direct connection using the SQLAlchemy URL, and how SQLAlchemy supports the TIMESTAMPTZ, TIMETZ, and SUPER data types without explicit casting. We also showcased how redshift_connector and the dialect support SqlMagic with Jupyter notebooks, which enables you to run interactive queries against Amazon Redshift.


About the Authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 16 years of experience in data warehousing and analytical space.

Brooke White is a Software Development Engineer at AWS. She enables customers to get the most out of their data through her work on Amazon Redshift drivers. Prior to AWS, she built ETL pipelines and analytics APIs at a San Francisco Bay Area startup.

How GE Aviation built cloud-native data pipelines at enterprise scale using the AWS platform

Post Syndicated from Alcuin Weidus original https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/

This post was co-written with Alcuin Weidus, Principal Architect from GE Aviation.

GE Aviation, an operating unit of GE, is a world-leading provider of jet and turboprop engines, as well as integrated systems for commercial, military, business, and general aviation aircraft. GE Aviation has a global service network to support these offerings.

From the turbosupercharger to the world’s most powerful commercial jet engine, GE’s history of powering the world’s aircraft features more than 90 years of innovation.

In this post, we share how GE Aviation built cloud-native data pipelines at enterprise scale using the AWS platform.

A focus on the foundation

At GE Aviation, we’ve been invested in the data space for many years. Witnessing the customer value and business insights that could be extracted from data at scale has propelled us forward. We’re always looking for new ways to evolve, grow, and modernize our data and analytics stack. In 2019, this meant moving from a traditional on-premises data footprint (with some specialized AWS use cases) to a fully AWS Cloud-native design. We understood the task was challenging, but we were committed to its success. We saw the tremendous potential in AWS, and were eager to partner closely with a company that has over a decade of cloud experience.

Our goal from the outset was clear: build an enterprise-scale data platform to accelerate and connect the business. Using the best of cloud technology would set us up to deliver on our goal and prioritize performance and reliability in the process. From an early point in the build, we knew that if we wanted to achieve true scale, we had to start with solid foundations. This meant first focusing on our data pipelines and storage layer, which serve as the ingest point for hundreds of source systems. Our team chose Amazon Simple Storage Service (Amazon S3) as our foundational data lake storage platform.

Amazon S3 was the first choice as it provides an optimal foundation for a data lake store, delivering virtually unlimited scalability and 11 9s of durability. In addition to its scalable performance, it has ease-of-use features, native encryption, and access control capabilities. Equally important, Amazon S3 integrates with a broad portfolio of AWS services, such as Amazon Athena, the AWS Glue Data Catalog, AWS Glue ETL (extract, transform, and load), Amazon Redshift, Amazon Redshift Spectrum, and many third-party tools, providing a growing ecosystem of data management tools.

How we started

The journey started with an internal hackathon that brought cross-functional team members together. We organized around an initial design and established an architecture to start the build using serverless patterns. A combination of Amazon S3, AWS Glue ETL, and the Data Catalog were central to our solution. These three services in particular aligned to our broader strategy to be serverless wherever possible and build on top of AWS services that were experiencing heavy innovation in the way of new features.

We felt good about our approach and promptly got to work.

Solution overview

Our cloud data platform built on Amazon S3 is fed from a combination of enterprise ELT systems. We have an on-premises system that handles change data capture (CDC) workloads and another that works more in a traditional batch manner.

Our design has the on-premises ELT systems dropping files into an S3 bucket set up to receive raw data for both situations. We made the decision to standardize our processed data layer into Apache Parquet format for our cataloged S3 data lake in preparation for more efficient serverless consumption.

Our enterprise CDC system can already land files natively in Parquet; however, our batch files are limited to CSV, so the landing of CSV files triggers another serverless process to convert these files to Parquet using AWS Glue ETL.

The following diagram illustrates this workflow.
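
In addition to the diagram, the following simplified PySpark sketch shows the general shape of such a conversion job; it is not our actual production code, and the bucket paths are hypothetical:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw CSV files from the landing bucket
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-raw-bucket/batch/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the same records back out as Parquet for the cataloged data lake
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://example-processed-bucket/batch/"},
    format="parquet",
)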

When raw data is present and ready in Apache Parquet format, we have an event-triggered solution that processes the data and loads it to another mirror S3 bucket (this is where our users access and consume the data).

Pipelines are developed to support loading at a table level. We have specific AWS Lambda functions to identify schema errors by comparing each file’s schema against the last successful run. Another function validates that a necessary primary key file is present for any CDC tables.

Data partitioning and CDC updates

When our preprocessing Lambda functions are complete, the files are processed in one of two distinct paths based on the table type. Batch table loads are by far the simpler of the two and are handled via a single Lambda function.

For CDC tables, we use AWS Glue ETL to load and perform the updates against our tables stored in the mirror S3 bucket. The AWS Glue job uses Apache Spark data frames to combine historical data, filter out deleted records, and union with any records inserted. For our process, updates are treated as delete-then-insert. After performing the union, the entire dataset is written out to the mirror S3 bucket in a newly created bucket partition.

The following diagram illustrates this workflow.
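
To make the delete-then-insert pattern concrete, here is a simplified sketch using plain Spark DataFrames; the paths, the primary key column, and the CDC operation flag are assumptions for illustration, not our production code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

history = spark.read.parquet("s3://example-mirror-bucket/example_table/current/")
changes = spark.read.parquet("s3://example-raw-bucket/example_table/cdc/")

# Drop any historical rows whose primary key appears in the change set
keys = changes.select("primary_key").distinct()
retained = history.join(keys, on="primary_key", how="left_anti")

# Filter out records flagged as deletes, then union the inserted/updated records
inserts = changes.filter(changes["cdc_operation"] != "D").drop("cdc_operation")
merged = retained.unionByName(inserts)

# Write the full dataset to a newly created partition in the mirror bucket
merged.write.parquet("s3://example-mirror-bucket/example_table/load_ts=2021-10-20/")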

We write data into a new partition for each table load, so we can provide read consistency in a way that makes sense to our consuming business partners.

Building the Data Catalog

When each Amazon S3 mirror data load is complete, another separate serverless branch is triggered to handle catalog management.

The branch updates the location property within the catalog for pre-existing tables, indicating each newly added partition. When loading a table for the first time, we trigger a series of purpose-built Lambda functions to create the AWS Glue Data Catalog database (only required when it’s an entirely new source schema), create an AWS Glue crawler, start the crawler, and delete the crawler when it’s complete.

The following diagram illustrates this workflow.
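
The following simplified boto3 sketch (names and ARNs are placeholders, not our actual resources) shows the crawler lifecycle that these functions automate for a first-time table load:

import boto3

glue = boto3.client("glue")

# Create the Data Catalog database (only required for an entirely new source schema)
glue.create_database(DatabaseInput={"Name": "example_source_schema"})

# Create and start a crawler pointed at the newly loaded table location
glue.create_crawler(
    Name="example-first-load-crawler",
    Role="arn:aws:iam::123456789012:role/example-glue-crawler-role",
    DatabaseName="example_source_schema",
    Targets={"S3Targets": [{"Path": "s3://example-mirror-bucket/example_table/"}]},
)
glue.start_crawler(Name="example-first-load-crawler")

# When the crawler completes, it can be deleted
# glue.delete_crawler(Name="example-first-load-crawler")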

These event-driven design patterns allow us to fully automate the catalog management piece of our architecture, which became a big win for our team because it lowered the operational overhead associated with onboarding new source tables. Every achievement like this mattered because it realized the potential the cloud had to transform how we build and support products across our technology organization.

Final implementation architecture and best practices

The solution evolved several times throughout the development cycle, typically as we learned something new about serverless and cloud-native development and worked further with the AWS Solutions Architect and AWS Professional Services teams. Along the way, we've discovered many cloud-native best practices and accelerated our serverless data journey to AWS.

The following diagram illustrates our final architecture.

We strategically added Amazon Simple Queue Service (Amazon SQS) between purpose-built Lambda functions to decouple the architecture. Amazon SQS gave our system a level of resiliency and operational observability that otherwise would have been a challenge.

Another best practice arose from using Amazon DynamoDB as a state table to help ensure our entire serverless integration pattern was writing to our mirror bucket with ACID guarantees.

On the topic of operational observability, we use Amazon EventBridge to capture and report on operational metadata like table load status, time of the last load, and row counts.
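
As a simplified illustration only (the event source, detail type, and fields are assumptions rather than our actual schema), this kind of metadata can be published to EventBridge with boto3:

import json
import boto3

events = boto3.client("events")

# Publish a table load status event for downstream monitoring
events.put_events(
    Entries=[
        {
            "Source": "example.data-pipeline",
            "DetailType": "TableLoadStatus",
            "Detail": json.dumps({
                "table": "example_schema.example_table",
                "status": "SUCCEEDED",
                "row_count": 12345,
                "loaded_at": "2021-10-20T10:12:40Z",
            }),
        }
    ]
)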

Bringing it all together

At the time of writing, we’ve had production workloads running through our solution for the better part of 14 months.

Production data is integrated from more than 30 source systems at present and totals several hundred tables. This solution has given us a great starting point for building our cloud data ecosystem. The flexibility and extensibility of AWS’s many services have been key to our success.

Appreciation for the AWS Glue Data Catalog has been an essential element. Without knowing it at the time we started building a data lake, we’ve been embracing a modern data architecture pattern and organizing around our transactionally consistent and cataloged mirror storage layer.

The introduction of a more seamless Apache Hudi experience within AWS has been a big win for our team. We’ve been busy incorporating Hudi into our CDC transaction pipeline and are thrilled with the results. We’re able to spend less time writing code managing the storage of our data, and more time focusing on the reliability of our system. This has been critical in our ability to scale. Our development pipeline has grown beyond 10,000 tables and more than 150 source systems as we approach another major production cutover.

Looking ahead, we're intrigued by the potential for AWS Lake Formation governed tables to further accelerate our momentum and management of CDC table loads.

Conclusion

Building our cloud-native integration pipeline has been a journey. What started as an idea has turned into much more in a brief time. It's hard to appreciate how far we've come when there's always more to be done. That being said, the entire process has been extraordinary. We built deep and trusted partnerships with AWS, learned more about our internal value statement, and aligned more of our organization to a cloud-centric way of operating.

The ability to build solutions in a serverless manner opens up many doors for our data function and, most importantly, our customers. Speed to delivery and the pace of innovation is directly related to our ability to focus our engineering teams on business-specific problems while trusting a partner like AWS to do the heavy lifting of data center operations like racking, stacking, and powering servers. It also removes the operational burden of managing operating systems and applications with managed services. Finally, it allows us to focus on our customers and business process enablement rather than on IT infrastructure.

The breadth and depth of data and analytics services on AWS make it possible to solve our business problems by using the right resources to run whatever analysis is most appropriate for a specific need. AWS Data and Analytics has deep integrations across all layers of the AWS ecosystem, giving us the tools to analyze data using any approach quickly. We appreciate AWS’s continual innovation on behalf of its customers.


About the Authors

Alcuin Weidus is a Principal Data Architect for GE Aviation. Serverless advocate, perpetual data management student, and cloud native strategist, Alcuin is a data technology leader on a team responsible for accelerating technical outcomes across GE Aviation. Connect with him on LinkedIn.

Suresh Patnam is a Sr. Solutions Architect at AWS. He works with customers to build IT strategy, making digital transformation through the cloud more accessible by focusing on big data, data lakes, and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family. Connect with him on LinkedIn.

Apply CI/CD DevOps principles to Amazon Redshift development

Post Syndicated from Ashok Srirama original https://aws.amazon.com/blogs/big-data/apply-ci-cd-devops-principles-to-amazon-redshift-development/

CI/CD in the context of application development is a well-understood topic, and developers can choose from numerous patterns and tools to build their pipelines to handle the build, test, and deploy cycle when a new commit gets into version control. For stored procedures or even schema changes that are directly related to the application, this is typically part of the code base and is included in the code repository of the application. These changes are then applied when the application gets deployed to the test or prod environment.

This post demonstrates how you can apply the same set of approaches to stored procedures, and even schema changes to data warehouses like Amazon Redshift.

Stored procedures are considered code and as such should undergo the same rigor as application code. This means that the pipeline should involve running tests against changes to make sure that no regressions are introduced to the production environment. Because we automate the deployment of both stored procedures and schema changes, this significantly reduces inconsistencies in between environments.

Solution overview

The following diagram illustrates our solution architecture. We use AWS CodeCommit to store our code, AWS CodeBuild to run the build process and test environment, and AWS CodePipeline to orchestrate the overall deployment, from source, to test, to production.

Database migrations and tests also require connection information to the relevant Amazon Redshift cluster; we demonstrate how to integrate this securely using AWS Secrets Manager.

We discuss each service component in more detail later in the post.

You can see how all these components work together by completing the following steps:

  1. Clone the GitHub repo.
  2. Deploy the AWS CloudFormation template.
  3. Push code to the CodeCommit repository.
  4. Run the CI/CD pipeline.

Clone the GitHub repository

The CloudFormation template and the source code for the example application are available in the GitHub repo. Before you get started, you need to clone the repository using the following command:

git clone https://github.com/aws-samples/amazon-redshift-devops-blog

This creates a new folder, amazon-redshift-devops-blog, with the files inside.

Deploy the CloudFormation template

The CloudFormation stack creates the VPC, Amazon Redshift clusters, CodeCommit repository, CodeBuild projects for both test and prod, and the pipeline using CodePipeline to orchestrate the change release process.

  1. On the AWS CloudFormation console, choose Create stack.
  2. Choose With new resources (standard).
  3. Select Upload a template file.
  4. Choose Choose file and locate the template file (<cloned_directory>/cloudformation_template.yml).
  5. Choose Next.
  6. For Stack name, enter a name.
  7. In the Parameters section, provide the primary user name and password for both the test and prod Amazon Redshift clusters.

The username must be 1–128 alphanumeric characters, and it can’t be a reserved word.

The password has the following criteria:

  • Must be 8-64 characters
  • Must contain at least one uppercase letter
  • Must contain at least one lowercase letter
  • Must contain at least one number
  • Can only contain ASCII characters (ASCII codes 33–126), except ' (single quotation mark), " (double quotation mark), /, \, or @

Please note that production credentials could be created separately by privileged admins, and you could pass in the ARN of a pre-existing secret instead of the actual password if you so choose.

  1. Choose Next.
  2. Leave the remaining settings at their default and choose Next.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources.
  4. Choose Create stack.

You can choose the refresh icon on the stack’s Events page to track the progress of the stack creation.

Push code to the CodeCommit repository

When stack creation is complete, go to the CodeCommit console. Locate the redshift-devops-repo repository that the stack created. Choose the repository to view its details.

Before you can push any code into this repo, you have to set up your Git credentials using the instructions in Setup for HTTPS users using Git credentials. At Step 4 of Setup for HTTPS users using Git credentials, copy the HTTPS URL, and instead of cloning, add the CodeCommit repo as a remote of the repository that we cloned earlier:

git remote add codecommit <repo_https_url> 
git push codecommit main

The last step populates the repository; you can check it by refreshing the CodeCommit console. If you get prompted for a user name and password, enter the Git credentials that you generated and downloaded from Step 3 of Setup for HTTPS users using Git credentials.

Run the CI/CD pipeline

After you push the code to the CodeCommit repository, this triggers the pipeline to deploy the code into both the test and prod Amazon Redshift clusters. You can monitor the progress on the CodePipeline console.

To dive deeper into the progress of the build, choose Details.

You’re redirected to the CodeBuild console, where you can see the run logs as well as the result of the test.

Components and dependencies

Although from a high-level perspective the test and prod environment look the same, there are some nuances with regards to how these environments are configured. Before diving deeper into the code, let’s look at the components first:

  • CodeCommit – This is the version control system where you store your code.
  • CodeBuild – This service runs the build process and test using Maven.
    • Build – During the build process, Maven uses Flyway to connect to Amazon Redshift to determine the current version of the schema and what needs to be run to bring it up to the latest version.
    • Test – In the test environment, Maven runs JUnit tests against the test Amazon Redshift cluster. These tests may involve loading data and testing the behavior of the stored procedures. The results of the unit tests are published into the CodeBuild test reports.
  • Secrets Manager – We use Secrets Manager to securely store connection information to the various Amazon Redshift clusters. This includes host name, port, user name, password, and database name. CodeBuild refers to Secrets Manager for the relevant connection information when a build gets triggered. The underlying CodeBuild service role needs to have the corresponding permission to access the relevant secrets.
  • CodePipeline – CodePipeline is responsible for the overall orchestration from source to test to production.

As referenced in the components, we also use some additional dependencies at the code level:

  • Flyway – This framework is responsible for keeping different Amazon Redshift clusters in different environments in sync as far as schema and stored procedures are concerned.
  • JUnit – Unit testing framework written in Java.
  • Apache Maven – A dependency management and build tool. Maven is where we integrate Flyway and JUnit.

In the following sections, we dive deeper into how these dependencies are integrated.

Apache Maven

For Maven, the configuration file is pom.xml. For an example, you can check out the pom file from our demo app. The pertinent part of the xml is the build section:

<build>
        <plugins>
            <plugin>
                <groupId>org.flywaydb</groupId>
                <artifactId>flyway-maven-plugin</artifactId>
                <version>${flyway.version}</version>
                <executions>
                    <execution>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>migrate</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>${surefire.version}</version>
            </plugin>
        </plugins>
    </build>

This section describes two things:

  • By default, the Surefire plugin triggers during the test phase of Maven. The plugin runs the unit tests and generates reports based on the results of those tests. These reports are stored in the target/surefire-reports folder. We reference this folder in the CodeBuild section.
  • Flyway is triggered during the process-resources phase of Maven, and it triggers the migrate goal of Flyway. Looking at Maven’s lifecycle, this phase is always triggered first and deploys the latest version of stored procedures and schemas to the test environment before running test cases.

Flyway

Changes to the database are called migrations, and these can be either versioned or repeatable. Developers indicate the type of migration through the file naming convention, which Flyway uses to determine which is which. The following diagram illustrates the naming convention.

A versioned migration consists of the regular SQL script that is run and an optional undo SQL script to reverse that specific version. You have to create this undo script in order to enable the undo functionality for a specific version. For example, a regular SQL script consists of creating a new table, and the corresponding undo script consists of dropping that table. Flyway is responsible for keeping track of which version a database is currently at, and runs as many migrations as needed to bring the target database from its current version up to the latest one. Versioned migrations are the most common use of Flyway and are primarily used to maintain table schema and keep reference or lookup tables up to date by running data loads or updates via SQL statements. Versioned migrations are applied in order exactly one time.

Repeatable migrations don’t have a version; instead they’re rerun every time their checksum changes. They’re useful for maintaining user-defined functions and stored procedures. Instead of having multiple files to track changes over time, we can just use a single file and Flyway keeps track of when to rerun the statement to keep the target database up to date.

By default, these migration files are located in the classpath under db/migration, the full path being src/main/resources/db/migration. For our example application, you can find the source code on GitHub.

JUnit

When Flyway finishes running the migrations, the test cases are run. These test cases are under the folder src/test/java. You can find examples on GitHub that run a stored procedure via JDBC and validate the output or the impact.

Another aspect of unit testing to consider is how the test data is loaded and maintained in the test Amazon Redshift cluster. There are a couple of approaches to consider:

  • As per our example, we're packaging the test data as part of our version control and loading the data when the first unit test is run. The advantage of this approach is that you get flexibility in when and where you run the test cases. You can start with either a completely empty or partially populated test cluster and still end up with the right environment for the test case to run. Other advantages are that you can test data loading queries and have more granular control over the datasets that are being loaded for each test. The downside of this approach is that, depending on how big your test data is, it may add additional time for your test cases to complete.
  • Using an Amazon Redshift snapshot dedicated to the test environment is another way to manage the test data. With this approach, you have a couple more options:
    • Transient cluster – You can provision a transient Amazon Redshift cluster based on the snapshot when the CI/CD pipeline gets triggered. This cluster stops after the pipeline completes to save cost. The downside of this approach is that you have to factor in Amazon Redshift provisioning time in your end-to-end runtime.
    • Long-running cluster – Your test cases can connect to an existing cluster that is dedicated to running test cases. The test cases are responsible for making sure that data-related setup and teardown are done accordingly depending on the nature of the test that’s running. You can use @BeforeAll and @AfterAll JUnit annotations to trigger the setup and teardown, respectively.

CodeBuild

CodeBuild provides an environment where all of these dependencies run. As shown in our architecture diagram, we use CodeBuild for both test and prod. The differences are in the actual commands that run in each of those environments. These commands are stored in the buildspec.yml file. In our example, we provide a separate buildspec file for test and a different one for prod. During the creation of a CodeBuild project, we can specify which buildspec file to use.

There are a few differences between the test and prod CodeBuild project, which we discuss in the following sections.

Buildspec commands

In the test environment, we use mvn clean test and package the Surefire reports so the test results can be displayed via the CodeBuild console. In the prod environment, we just run mvn clean process-resources. The reason is that in the prod environment, we only need to run the Flyway migrations, which are hooked up to the process-resources Maven lifecycle phase, whereas in the test environment, we not only run the Flyway migrations, but also make sure that they didn't introduce any regressions by running test cases. These test cases might have an impact on the underlying data, which is why we don't run them against the production Amazon Redshift cluster. If you want to run the test cases against production data, you can use an Amazon Redshift production cluster snapshot and run the test cases against that.

Secrets via Secrets Manager

Both Flyway and JUnit need information to identify and connect to Amazon Redshift. We store this information in Secrets Manager. Using Secrets Manager has several benefits:

  • Secrets are encrypted automatically
  • Access to secrets can be tightly controlled via fine-grained AWS Identity and Access Management (IAM) policies
  • All activity with secrets is recorded, which enables easy auditing and monitoring
  • You can rotate secrets securely and safely without impacting applications

For our example application, we define the secret as follows:

{
  "username": "<Redshift username>",
  "password": "<Redshift password>",
  "host": "<Redshift hostname>",
  "port": <Redshift port>,
  "dbName": "<Redshift DB Name>"
}

CodeBuild is integrated with Secrets Manager, so we define the following environment variables as part of the CodeBuild project:

  • TEST_HOST: arn:aws:secretsmanager:<region>:<AWS Account Id>:secret:<secret name>:host
  • TEST_JDBC_USER: arn:aws:secretsmanager:<region>:<AWS Account Id>:secret:<secret name>:username
  • TEST_JDBC_PASSWORD: arn:aws:secretsmanager:<region>:<AWS Account Id>:secret:<secret name>:password
  • TEST_PORT: arn:aws:secretsmanager:<region>:<AWS Account Id>:secret:<secret name>:port
  • TEST_DB_NAME: arn:aws:secretsmanager:<region>:<AWS Account Id>:secret:<secret name>:dbName
  • TEST_REDSHIFT_IAM_ROLE: <ARN of IAM role> (This can be in plaintext and should be attached to the Amazon Redshift cluster)
  • TEST_DATA_S3_BUCKET: <bucket name> (This is where the test data is staged)

CodeBuild automatically retrieves the parameters from Secrets Manager and they’re available in the application as environment variables. If you look at the buildspec_prod.yml example, we use the preceding variables to populate the Flyway environment variables and JDBC connection URL.

VPC configuration

For CodeBuild to be able to connect to Amazon Redshift, you need to configure which VPC it runs in. This includes the subnets and security group that it uses. The Amazon Redshift cluster’s security group also needs to allow access from the CodeBuild security group.

CodePipeline

To bring all these components together, we use CodePipeline to orchestrate the flow from the source code through prod deployment. CodePipeline also has additional capabilities. For example, you can add an approval step between test and prod so a release manager can review the results of the tests before releasing the changes to production.

Example scenario

You can use tests as a form of documentation of what is the expected behavior of a function. To further illustrate this point, let’s look at a simple stored procedure:

create or replace procedure merge_staged_products()
as $$
BEGIN
    update products set status='CLOSED' where product_name in (select product_name from products_staging) and status='ACTIVE';
    insert into products(product_name,price) select product_name, price from products_staging;
END;
$$ LANGUAGE plpgsql;

If you deployed the example app from the previous section, you can follow along by copying the stored procedure code and pasting it in src/main/resources/db/migration/R__MergeStagedProducts.sql. Save it and push the change to the CodeCommit repository by issuing the following commands (assuming that you’re at the top of the project folder):

git add src
git commit -m "<commit message>"
git push codecommit main

After you push the changes to the CodeCommit repository, you can follow the progress of the build and test stages on the CodePipeline console.

We implement a basic Slowly Changing Dimension Type 2 approach in which we mark old data as CLOSED and append newer versions of the data. Although the stored procedure works as is, our test has the following expectations:

  • The number of rows marked CLOSED in the products table needs to correspond to the number of matching entries in the staging table.
  • The products table has a close_date column that needs to be populated so we know when a product was deprecated.
  • At the end of the merge, the staging table needs to be cleared for subsequent ETL runs.
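
The pipeline expresses these expectations as JUnit tests in Java (available in the repository). Purely as an illustration of the checks themselves, the following hedged Python sketch shows equivalent assertions using redshift_connector; the connection details are placeholders, and this is not the code the pipeline runs:

import redshift_connector

# Placeholders: point these at the test cluster
conn = redshift_connector.connect(
    host='<test-cluster-endpoint>',
    database='dev',
    user='awsuser',
    password='<pwd>'
)
cursor = conn.cursor()

# Run the stored procedure under test
cursor.execute("CALL merge_staged_products()")
conn.commit()

# Expectation: every CLOSED row should have its close_date populated
cursor.execute("select count(*) from products where status = 'CLOSED' and close_date is null")
assert cursor.fetchone()[0] == 0

# Expectation: the staging table should be empty after the merge
cursor.execute("select count(*) from products_staging")
assert cursor.fetchone()[0] == 0

cursor.close()
conn.close()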

The stored procedure will pass the first test, but fail later tests. When we push this change to CodeCommit and the CI/CD process runs, we can see results like in the following screenshot.

The tests show that the second and third tests failed. Failed tests result in the pipeline stopping, which means these bad changes don’t end up in production.

We can update the stored procedure and push the change to CodeCommit to trigger the pipeline again. The updated stored procedure is as follows:

create or replace procedure merge_staged_products()
as $$
BEGIN
    update products set status='CLOSED', close_date=CURRENT_DATE where product_name in (select product_name from products_staging) and status='ACTIVE';
    insert into products(product_name,price) select product_name, price from products_staging;
    truncate products_staging;
END;
$$ LANGUAGE plpgsql; 

All the tests passed this time, which allows CodePipeline to proceed with deployment to production.

We used Flyway’s repeatable migrations to make the changes to the stored procedure. Code is stored in a single file and Flyway verifies the checksum of the file to detect any changes and reapplies the migration if the checksum is different from the one that’s already deployed.

Clean up

After you’re done, it’s crucial to tear down the environment to avoid incurring additional charges beyond your testing. Before you delete the CloudFormation stack, go to the Resources tab of your stack and make sure the two buckets that were provisioned are empty. If they’re not empty, delete all the contents of the buckets.

Now that the buckets are empty, you can go back to the AWS CloudFormation console and delete the stack to complete the cleanup of all the provisioned resources.

Conclusion

Using CI/CD principles in the context of Amazon Redshift stored procedures and schema changes greatly improves confidence when updates are getting deployed to production environments. Similar to CI/CD in application development, proper test coverage of stored procedures is paramount to capturing potential regressions when changes are made. This includes testing both success paths as well as all possible failure modes.

In addition, versioning migrations enables consistency across multiple environments and prevents issues arising from schema changes that aren’t applied properly. This increases confidence when changes are being made and improves development velocity as teams spend more time developing functionality rather than hunting for issues due to environment inconsistencies.

We encourage you to try building a CI/CD pipeline for Amazon Redshift using the steps described in this post.


About the Authors

Ashok Srirama is a Senior Solutions Architect at Amazon Web Services, based in Washington Crossing, PA. He specializes in serverless applications, containers, devops, and architecting distributed systems. When he’s not spending time with his family, he enjoys watching cricket, and driving his bimmer.

Indira Balakrishnan is a Senior Solutions Architect in the AWS Analytics Specialist SA Team. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems using data-driven decisions. Outside of work, she volunteers at her kids’ activities and spends time with her family.

Vaibhav Agrawal is an Analytics Specialist Solutions Architect at AWS. Throughout his career, he has focused on helping customers design and build well-architected analytics and decision support platforms.

Rajesh Francis is a Sr. Analytics Customer Experience Specialist at AWS. He specializes in Amazon Redshift and works with customers to build scalable Analytic solutions.

Jeetesh Srivastva is a Sr. Manager, Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions using Amazon Redshift and other AWS Analytic services. He has worked to deliver on-premises and cloud-based analytic solutions for customers in banking and finance and hospitality industry verticals.

Disaster Recovery with AWS Managed Services, Part I: Single Region

Post Syndicated from Dhruv Bakshi original https://aws.amazon.com/blogs/architecture/disaster-recovery-with-aws-managed-services-part-i-single-region/

This 3-part blog series discusses disaster recovery (DR) strategies that you can implement to ensure your data is safe and that your workload stays available during a disaster. In Part I, we’ll discuss the single AWS Region/multi-Availability Zone (AZ) DR strategy.

The strategy outlined in this blog post addresses how to integrate AWS managed services into a single-Region DR strategy. This will minimize maintenance and operational overhead, create fault-tolerant systems, ensure high availability, and protect your data with robust backup/recovery processes. This strategy replicates workloads across multiple AZs and continuously backs up your data to another Region with point-in-time recovery, so your application is safe even if all AZs within your source Region fail.

Implementing the single Region/multi-AZ strategy

The following sections list the components of the example application presented in Figure 1, which illustrates a multi-AZ environment with a secondary Region that is strictly utilized for backups. This example architecture refers to an application that processes payment transactions that has been modernized with AMS. We’ll show you which AWS services it uses and how they work to maintain the single Region/multi-AZ strategy.

Single Region/multi-AZ with secondary Region for backups

Figure 1. Single Region/multi-AZ with secondary Region for backups

Amazon EKS control plane

Amazon Elastic Kubernetes Service (Amazon EKS) runs the Kubernetes management infrastructure across multiple AZs to eliminate a single point of failure.

If your infrastructure or an AZ fails, Amazon EKS automatically scales control plane nodes based on load, detects and replaces unhealthy control plane instances, and restarts them across the AZs within the Region as needed.

Amazon EKS data plane

Instead of creating individual Amazon Elastic Compute Cloud (Amazon EC2) instances, create worker nodes using an Amazon EC2 Auto Scaling group. Join the group to a cluster, and the group will automatically replace any terminated or failed nodes if an AZ fails. This ensures that the cluster can always run your workload.

Amazon ElastiCache

Amazon ElastiCache continually monitors the state of the primary node. If the primary node fails, it will promote the read replica with the least replication lag to primary. A replacement read replica is then created and provisioned in the same AZ as the failed primary. This is to ensure high availability of the service and application.

An ElastiCache for Redis (cluster mode disabled) cluster with multiple nodes has three types of endpoints: the primary endpoint, the reader endpoint, and the node endpoints. The primary endpoint is a DNS name that always resolves to the primary node in the cluster.

Amazon Redshift

Currently, Amazon Redshift only supports single-AZ deployments. Although there are ways to work around this, we are focusing on cluster relocation. Parts II and III of this series will show you how to implement this service in a multi-Region DR deployment.

Cluster relocation enables Amazon Redshift to move a cluster to another AZ with no loss of data or changes to your applications. When Amazon Redshift relocates a cluster to a new AZ, the new cluster has the same endpoint as the original cluster. Your applications can reconnect to the endpoint and continue operations without modifications or loss of data.

Note: Amazon Redshift may also relocate clusters in non-AZ failure situations, such as when issues in the current AZ prevent optimal cluster operation or to improve service availability.
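
As a hedged example (the cluster identifier is a placeholder), relocation can be turned on for an existing cluster with the AvailabilityZoneRelocation parameter of the boto3 modify_cluster call:

import boto3

redshift = boto3.client("redshift")

# Allow Amazon Redshift to relocate this cluster to another AZ when needed
redshift.modify_cluster(
    ClusterIdentifier="example-cluster",
    AvailabilityZoneRelocation=True,
)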

Amazon OpenSearch Service

Deploying your data nodes into three AZs with Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) can improve the availability of your domain and increase your workload’s tolerance for AZ failures.

Amazon OpenSearch Service automatically deploys into three AZs when you select a multi-AZ deployment. This distribution helps prevent cluster downtime if an AZ experiences a service disruption. When you deploy across three AZs, Amazon OpenSearch Service distributes master nodes equally across all three AZs. That way, in the rare event of an AZ disruption, two master nodes will still be available.

Amazon OpenSearch Service also distributes primary shards and their corresponding replica shards to different zones. In addition to distributing shards by AZ, Amazon OpenSearch Service distributes them by node. When you deploy the data nodes across three AZs with one replica enabled, shards are distributed across the three AZs.

Note: For more information on multi-AZ configurations, please refer to the AZ disruptions table.

Amazon RDS PostgreSQL

Amazon Relational Database Service (Amazon RDS) handles failovers automatically so you can resume database operations as quickly as possible.

In a Multi-AZ deployment, Amazon RDS automatically provisions and maintains a synchronous standby replica in a different AZ. The primary DB instance is synchronously replicated across AZs to a standby replica. If an AZ or infrastructure fails, Amazon RDS performs an automatic failover to the standby. This minimizes the disruption to your applications without administrative intervention.

Backing up data across Regions

Here is how the managed services back up data to a secondary Region:

  • Manage snapshots of persistent volumes for Amazon EKS with Velero. Amazon Simple Storage Service (Amazon S3) stores these snapshots in an S3 bucket in the primary Region. Amazon S3 replicates these snapshots to an S3 bucket in another Region via S3 cross-Region replication.
  • Create a manual snapshot of Amazon OpenSearch Service clusters, which are stored in a registered repository like Amazon S3. You can do this manually or automate it via an AWS Lambda function, which automatically and asynchronously copies the snapshot objects across Regions.
  • Use manual backups and copy API calls for Amazon ElastiCache to establish a snapshot and restore strategy in a secondary Region. You can manually back your data up to an S3 bucket or automate the backup via Lambda. Once your data is backed up, a snapshot of the ElastiCache cluster will be stored in an S3 bucket. Then S3 cross-Region replication will asynchronously copy the backup to an S3 bucket in a secondary Region.
  • Take automatic, incremental snapshots of your data periodically with Amazon Redshift and save them to Amazon S3. You can precisely control when snapshots are taken and can create a snapshot schedule and attach it to one or more clusters. You can also configure a cross-Region snapshot copy, which automatically copies your automated and manual snapshots to another Region.
  • Use AWS Backup to support AWS resources and third-party applications. AWS Backup copies RDS backups to multiple Regions on demand or automatically as part of a scheduled backup plan.

Note: You can add a layer of protection to your backups through AWS Backup Vault Lock and S3 Object Lock.
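
As one concrete example of the cross-Region snapshot copy described for Amazon Redshift above, the following hedged boto3 sketch (identifiers and Regions are placeholders) enables the copy for a cluster:

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Automatically copy automated and manual snapshots to a second Region
redshift.enable_snapshot_copy(
    ClusterIdentifier="example-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days to retain copied automated snapshots
)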

Conclusion

The single Region/multi-AZ strategy safeguards your workloads against a disaster that disrupts an Amazon data center by replicating workloads across multiple AZs in the same Region. This post shows how AWS managed services automatically fail over between AZs without interruption during a localized disaster, and how backups to a separate Region ensure data protection.

In the next post, we will discuss a multi-Region warm standby strategy for the same application stack illustrated in this post.


How Roche democratized access to data with Google Sheets and Amazon Redshift Data API

Post Syndicated from Dr. Yannick Misteli original https://aws.amazon.com/blogs/big-data/how-roche-democratized-access-to-data-with-google-sheets-and-amazon-redshift-data-api/

This post was co-written with Dr. Yannick Misteli, João Antunes, and Krzysztof Wisniewski from the Roche global Platform and ML engineering team as the lead authors.

Roche is a Swiss multinational healthcare company that operates worldwide. Roche is the largest pharmaceutical company in the world and the leading provider of cancer treatments globally.

In this post, Roche’s global Platform and machine learning (ML) engineering team discuss how they used Amazon Redshift data API to democratize access to the data in their Amazon Redshift data warehouse with Google Sheets (gSheets).

Business needs

Go-To-Market (GTM) is the domain that lets Roche understand customers and create and deliver valuable services that meet their needs. This lets them get a better understanding of the health ecosystem and provide better services for patients, doctors, and hospitals. It extends beyond health care professionals (HCPs) to a larger Healthcare ecosystem consisting of patients, communities, health authorities, payers, providers, academia, competitors, etc. Data and analytics are essential to supporting our internal and external stakeholders in their decision-making processes through actionable insights.

For this mission, Roche embraced the modern data stack and built a scalable solution in the cloud.

Driving true data democratization requires not only providing business leaders with polished dashboards or data scientists with SQL access, but also addressing the requirements of business users that need the data. For this purpose, most business users (such as Analysts) leverage Excel—or gSheet in the case of Roche—for data analysis.

Providing access to data in Amazon Redshift to these gSheets users is a non-trivial problem. Without a powerful and flexible tool that lets data consumers use self-service analytics, most organizations will not realize the promise of the modern data stack. To solve this problem, we want to empower every data analyst who doesn’t have an SQL skillset with a means by which they can easily access and manipulate data in the applications that they are most familiar with.

The Roche GTM organization uses the Redshift Data API to simplify the integration between gSheets and Amazon Redshift, and thus facilitate the data needs of their business users for analytical processing and querying. The Amazon Redshift Data API lets you painlessly access data from Amazon Redshift with all types of traditional, cloud-native, and containerized serverless web service-based applications, as well as event-driven applications. The Data API simplifies data access, ingest, and egress from languages supported with the AWS SDK, such as Python, Go, Java, Node.js, PHP, Ruby, and C++, so that you can focus on building applications as opposed to managing infrastructure. The process they developed using the Amazon Redshift Data API has significantly lowered the barrier to entry for new users without any data warehousing experience.

Use-Case

In this post, you will learn how to integrate Amazon Redshift with gSheets to pull data sets directly back into gSheets. These mechanisms are facilitated through the use of the Amazon Redshift Data API and Google Apps Script. Google Apps Script is a programmatic way of manipulating and extending gSheets and the data that they contain.

Architecture

Because Apps Script is natively a cloud-based JavaScript platform, it is possible to include publicly available JS libraries such as jQuery QueryBuilder.

The jQuery QueryBuilder library facilitates the creation of standard SQL queries through a simple-to-use graphical user interface. Once a query is in place, the Redshift Data API retrieves the data directly into gSheets. The following diagram illustrates the overall process from a technical standpoint:

Even though Apps Script is, in fact, JavaScript, the AWS-provided SDKs for the browser (Node.js and React) cannot be used on the Google platform, because they require specific properties that are native to the underlying infrastructure. It is still possible to authenticate and access AWS resources through the available API calls. Here is an example of how to achieve that.

You can use an access key ID and a secret access key to authenticate the requests to AWS by using the code in the link example above. We recommend following the least privilege principle when granting access to this programmatic user, or assuming a role with temporary credentials. Since each user will require a different set of permissions on the Redshift objects—database, schema, and table—each user will have their own user access credentials. These credentials are safely stored under the AWS Secrets Manager service. Therefore, the programmatic user needs a set of permissions that enable them to retrieve secrets from the AWS Secrets Manager and execute queries against the Redshift Data API.

Code example for AppScript to use Data API

In this section, you will learn how to pull existing data back into a new gSheets Document. This section will not cover how to parse the data from the JQuery-builder library, as it is not within the main scope of the article.

The jQuery QueryBuilder library is included via the following script tag:

<script src="https://cdn.jsdelivr.net/npm/jQuery-QueryBuilder/dist/js/query-builder.standalone.min.js"></script>

  1. In the AWS console, go to Secrets Manager and create a new secret to store the database credentials to access the Redshift Cluster: username and password. These will be used to grant Redshift access to the gSheets user.
  2. In the AWS console, create a new IAM user with programmatic access, and generate the corresponding Access Key credentials. The only set of policies required for this user is to be able to read the secret created in the previous step from the AWS Secrets Manager service and to query the Redshift Data API.

    Below is the policy document:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
            "secretsmanager:GetSecretValue",
            "secretsmanager:DescribeSecret"
          ],
          "Resource": "arn:aws:secretsmanager:*::secret:*"
        },
        {
          "Sid": "VisualEditor1",
          "Effect": "Allow",
          "Action": "secretsmanager:ListSecrets",
          "Resource": "*"
        },
        {
          "Sid": "VisualEditor2",
          "Effect": "Allow",
          "Action": "redshift-data:*",
          "Resource": "arn:aws:redshift:*::cluster:*"
        }
      ]
    }

  3. Access the Google Apps Script console. Create an aws.gs file with the code available here. This will let you perform authenticated requests to the AWS services by providing an access key and a secret access key.
  4. Initialize the AWS variable by providing the access key and secret access key that you created for the IAM user:
    AWS.init("<ACCESS_KEY>", "<SECRET_KEY>");

  5. Request the Redshift username and password from the AWS Secrets Manager:
    function runGetSecretValue_(secretId) {

      var resultJson = AWS.request(
        getSecretsManagerTypeAWS_(),
        getLocationAWS_(),
        'secretsmanager.GetSecretValue',
        {"Version": getVersionAWS_()},
        method='POST',
        payload={
          "SecretId": secretId
        },
        headers={
          "X-Amz-Target": "secretsmanager.GetSecretValue",
          "Content-Type": "application/x-amz-json-1.1"
        }
      );

      Logger.log("GetSecretValue result: " + resultJson);
      return JSON.parse(resultJson);

    }

  6. Query a table using the Amazon Redshift Data API:
    function runExecuteStatement_(sql) {

      var resultJson = AWS.request(
        getTypeAWS_(),
        getLocationAWS_(),
        'RedshiftData.ExecuteStatement',
        {"Version": getVersionAWS_()},
        method='POST',
        payload={
          "ClusterIdentifier": getClusterIdentifierReshift_(),
          "Database": getDataBaseRedshift_(),
          "DbUser": getDbUserRedshift_(),
          "Sql": sql
        },
        headers={
          "X-Amz-Target": "RedshiftData.ExecuteStatement",
          "Content-Type": "application/x-amz-json-1.1"
        }
      );

      Logger.log("Execute Statement result: " + resultJson);
      // Return the parsed response so callers can read the statement ID
      return JSON.parse(resultJson);

    }

  7. The result can then be displayed as a table in gSheets:
    function fillGsheet_(recordArray) {

      adjustRowsCount_(recordArray);

      var rowIndex = 1;
      for (var i = 0; i < recordArray.length; i++) {

        var rows = recordArray[i];
        for (var j = 0; j < rows.length; j++) {
          var columns = rows[j];
          rowIndex++;
          var columnIndex = 'A';

          for (var k = 0; k < columns.length; k++) {

            var field = columns[k];
            var value = getFieldValue_(field);
            var range = columnIndex + rowIndex;
            addToCell_(range, value);

            columnIndex = nextChar_(columnIndex);

          }

        }

      }

    }

  8. Once finished, the Apps Script can be deployed as an add-on that lets end-users across an entire organization retrieve data from Amazon Redshift directly into their spreadsheets. Details on how Apps Script code can be deployed as an add-on can be found here.

How users access Google Sheets

  1. Open a gSheet and go to manage addons -> Install addon:
  2. Once the add-on is successfully installed, open the add-on menu and select Redshift Synchronization. A dialog appears, prompting the user to select the combination of database, schema, and table from which to load the data.
  3. After choosing the intended table, a new panel appears on the right side of the screen. The user is then prompted to select which columns to retrieve from the table, apply any filtering operations, and/or apply any aggregations to the data.
  4. Upon submitting the query, the Apps Script translates the user's selection into a query that is sent to the Amazon Redshift Data API. The returned data is then transformed and displayed as a regular gSheet table:

Security and Access Management

The scripts above integrate AWS Secrets Manager directly with Google Apps Script. They can extract the currently authenticated user's Google email address. Using this value and a set of annotated tags, the script can securely retrieve that user's credentials to authenticate the requests made to the Amazon Redshift cluster. Follow these steps to set up a new user in an existing Amazon Redshift cluster. Once the user has been created, follow these steps to create a new AWS Secrets Manager secret for your cluster. Make sure that the appropriate tag is applied with the key "email" and the corresponding user's Google email address as the value. Here is a sample configuration that is used for creating Redshift groups, users, and data shares via the Redshift Data API, followed by a sketch of the SQL it corresponds to:

connection:
  redshift_super_user_database: dev
  redshift_secret_name: dev_
  redshift_cluster_identifier: dev-cluster
  redshift_secrets_stack_name: dev-cluster-secrets
  environment: dev
  aws_region: eu-west-1
  tags:
    - key: "Environment"
      value: "dev"
users:
  - name: user1
    email: [email protected]
data_shares:
  - name: test_data_share
    schemas:
      - schema1
    redshift_namespaces:
      - USDFJIL234234WE
group:
  - name: readonly
    users:
      - user1
    databases:
      - database: database1
        exclude-schemas:
          - public
          - pg_toast
          - catalog_history
        include-schemas:
          - schema1
        grant:
          - select
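
Under the hood, a configuration like this maps to ordinary SQL statements that the automation issues through the Redshift Data API. The automation code itself isn't shown in this post; the following is only a minimal sketch of the kind of statements the sample configuration above might translate into (the password is a placeholder):

-- Users and groups from the "users" and "group" sections
CREATE USER user1 PASSWORD '<GENERATED_PASSWORD>';
CREATE GROUP readonly WITH USER user1;

-- Read-only access to schema1, per the "grant" entry
GRANT USAGE ON SCHEMA schema1 TO GROUP readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA schema1 TO GROUP readonly;

-- Data share from the "data_shares" section
CREATE DATASHARE test_data_share;
ALTER DATASHARE test_data_share ADD SCHEMA schema1;
GRANT USAGE ON DATASHARE test_data_share TO NAMESPACE 'USDFJIL234234WE';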

Operational Metrics and Improvement

Providing business users with direct access to live data hosted in Amazon Redshift and enabling true self-service decreases the burden on platform teams to provide data extracts or other mechanisms for delivering up-to-date information. Additionally, because different files and versions of the data no longer circulate, the business risk of reporting different key figures or KPIs is reduced and overall process efficiency improves.

The initial success of this add-on in GTM led to its extension to a broader audience, where we hope to serve hundreds of users with all of our internal and public data in the future.

Conclusion

In this post, you learned how to create new Amazon Redshift tables and pull existing Redshift tables into a Google Sheet so that business users can easily integrate with and manipulate the data. The integration is seamless and demonstrates how the Amazon Redshift Data API simplifies connecting external applications, such as Google Sheets, with Amazon Redshift. The use cases outlined above are just a few examples of how the Amazon Redshift Data API can be applied to simplify interactions between users and Amazon Redshift clusters.


About the Authors

Dr. Yannick Misteli is leading cloud platform and ML engineering teams in global product strategy (GPS) at Roche. He is passionate about infrastructure and operationalizing data-driven solutions, and he has broad experience in driving business value creation through data analytics.

João Antunes is a Data Engineer in the Global Product Strategy (GPS) team at Roche. He has a track record of deploying Big Data batch and streaming solutions for the telco, finance, and pharma industries.

Krzysztof Wisniewski is a back-end JavaScript developer in the Global Product Strategy (GPS) team at Roche. He is passionate about full-stack development from the front-end through the back-end to databases.

Matt Noyce is a Senior Cloud Application Architect at AWS. He works together primarily with Life Sciences and Healthcare customers to architect and build solutions on AWS for their business needs.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Migrating to an Amazon Redshift Cloud Data Warehouse from Microsoft APS

Post Syndicated from Sudarshan Roy original https://aws.amazon.com/blogs/architecture/migrating-to-an-amazon-redshift-cloud-data-warehouse-from-microsoft-aps/

Before cloud data warehouses (CDWs), many organizations used hyper-converged infrastructure (HCI) for data analytics. HCIs pack storage, compute, networking, and management capabilities into a single “box” that you can plug into your data centers. However, because of its legacy architecture, an HCI is limited in how much it can scale storage and compute and continue to perform well and be cost-effective. Using an HCI can impact your business’s agility because you need to plan in advance, follow traditional purchase models, and maintain unused capacity and its associated costs. Additionally, HCIs are often proprietary and do not offer the same portability, customization, and integration options as open-standards-based systems. Because of their proprietary nature, migrating HCIs to a CDW can present technical hurdles, which can impact your ability to realize the full potential of your data.

One of these hurdles includes using AWS Schema Conversion Tool (AWS SCT). AWS SCT is used to migrate data warehouses, and it supports several conversions. However, when you migrate Microsoft’s Analytics Platform System (APS) SQL Server Parallel Data Warehouse (PDW) platform using only AWS SCT, it results in connection errors due to the lack of server-side cursor support in Microsoft APS. In this blog post, we show you three approaches that use AWS SCT combined with other AWS services to migrate Microsoft’s Analytics Platform System (APS) SQL Server Parallel Data Warehouse (PDW) HCI platform to Amazon Redshift. These solutions will help you overcome elasticity, scalability, and agility constraints associated with proprietary HCI analytics platforms and future proof your analytics investment.

AWS Schema Conversion Tool

Though using AWS SCT alone results in server-side cursor errors, you can pair it with other AWS services to migrate your data warehouses to AWS. AWS SCT converts source database schema and code objects, including views, stored procedures, and functions, to be compatible with a target database. It highlights objects that require manual intervention. You can also scan your application source code for embedded SQL statements as part of a database schema conversion project. During this process, AWS SCT optimizes cloud-native code by converting legacy Oracle and SQL Server functions to their equivalent AWS services. This helps you modernize applications at the same time. Once conversion is complete, AWS SCT can also migrate data.

Figure 1 shows a standard AWS SCT implementation architecture.


Figure 1. AWS SCT migration approach

The next section shows you how to pair AWS SCT with other AWS services to migrate a Microsoft APS PDW to an Amazon Redshift CDW. We provide a base approach and two extensions to use for data warehouses with larger datasets and longer release outage windows.

Migration approach using SQL Server on Amazon EC2

The base approach uses Amazon Elastic Compute Cloud (Amazon EC2) to host a SQL Server in a symmetric multi-processing (SMP) architecture that is supported by AWS SCT, as opposed to Microsoft’s APS PDW’s massively parallel processing (MPP) architecture. By changing the warehouse’s architecture from MPP to SMP and using AWS SCT, you’ll avoid server-side cursor support errors.

Here’s how you’ll set up the base approach (Figure 2):

  1. Set up the SMP SQL Server on Amazon EC2 and AWS SCT in your AWS account.
  2. Set up Microsoft tools, including SQL Server Data Tools (SSDT), remote table copy, and SQL Server Integration Services (SSIS).
  3. Use the Application Diagnostic Utility (ADU) and SSDT to connect and extract table lists, indexes, table definitions, view definitions, and stored procedures.
  4. Generate data definition language (DDL) scripts using the outputs from step 3.
  5. Apply these DDLs to the SMP SQL Server on Amazon EC2.
  6. Run AWS SCT against the SMP SQL Server database to begin migrating schema and data to Amazon Redshift.
  7. Extract data using remote table copy from source, which copies data into the SMP SQL Server.
  8. Load this data into Amazon Redshift using AWS SCT or AWS Database Migration Service (AWS DMS).
  9. Use SSIS to load delta data from source to the SMP SQL Server on Amazon EC2.

Figure 2. Base approach using SMP SQL Server on Amazon EC2

Extending the base approach

The base approach overcomes server-side issues you would have during a direct migration. However, many organizations host terabytes (TB) of data. To migrate such a large dataset, you’ll need to adjust your approach.

The following sections extend the base approach. They still use the base approach to convert the schema and procedures, but the dataset is handled via separate processes.

Extension 1: AWS Snowball Edge

Note: AWS Snowball Edge is a Region-specific service. Verify that the service is available in your Region before planning your migration. See Regional Table to verify availability.

Snowball Edge lets you transfer large datasets to the cloud at faster-than-network speeds. Each Snowball Edge device can hold up to 100 TB and uses 256-bit encryption and an industry-standard Trusted Platform Module to ensure security and full chain-of-custody for your data. Furthermore, higher volumes can be transferred by clustering 5–10 devices for increased durability and storage.

Extension 1 enhances the base approach to allow you to transfer large datasets (Figure 3) while simultaneously setting up an SMP SQL Server on Amazon EC2 for delta transfers. Here’s how you’ll set it up:

  1. Once Snowball Edge is enabled in the on-premises environment, it allows data transfer via network file system (NFS) endpoints. The device can then be used with standard Microsoft tools like SSIS, remote table copy, ADU, and SSDT.
  2. While the device is being shipped back to an AWS facility, you’ll set up an SMP SQL Server database on Amazon EC2 to replicate the base approach.
  3. After your data is converted, you’ll apply a converted schema to Amazon Redshift.
  4. Once the Snowball Edge arrives at the AWS facility, data is transferred to the SMP SQL Server database.
  5. You’ll subsequently run schema conversions and initial and delta loads per the base approach.

Figure 3. Solution extension that uses Snowball Edge for large datasets

Note: Where sequence numbers overlap in the diagram, those steps can potentially run in parallel.

Extension 1 transfers the initial load first and applies the delta load later. This adds time to the project because of longer cutover release schedules. Additionally, you'll need to plan for multiple separate outages, Snowball lead times, and release management timelines.

Note that not all analytics systems are classified as business-critical systems, so they can withstand a longer outage, typically 1-2 days. This gives you an opportunity to use AWS DataSync as an additional extension to complete initial and delta load in a single release window.

Extension 2: AWS DataSync

DataSync speeds up data transfer between on-premises environments and AWS. It uses a purpose-built network protocol and a parallel, multi-threaded architecture to accelerate your transfers.

Figure 4 shows the solution extension, which works as follows:

  1. Create the SMP SQL Server on Amazon EC2 and the DDLs, as shown in the base approach.
  2. Deploy DataSync agent(s) in your on-premises environment.
  3. Provision and mount an NFS volume on the source analytics platform and DataSync agent(s).
  4. Define a DataSync transfer task after the agents are registered.
  5. Extract initial load from source onto the NFS mount that will be uploaded to Amazon Simple Storage Service (Amazon S3).
  6. Load data extracts into the SMP SQL Server on Amazon EC2 instance (created using base approach).
  7. Run delta loads per base approach, or continue using solution extension for delta loads.

Figure 4. Solution extension that uses DataSync for large datasets

Note: Where sequence numbers overlap in the diagram, those steps can potentially run in parallel.

Transfer rates for DataSync depend on the amount of data, I/O, and network bandwidth available. A single DataSync agent can fully utilize a 10 gigabit per second (Gbps) AWS Direct Connect link to copy data from on-premises to AWS. As such, depending on initial load size, transfer window calculations must be done prior to finalizing transfer windows.
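
For example, as a rough back-of-the-envelope check (the numbers here are illustrative): at a sustained 10 Gbps of usable bandwidth, a 50 TB initial load works out to roughly 50 TB × 8 bits per byte ÷ 10 Gbps ≈ 40,000 seconds, or about 11 hours, before accounting for protocol overhead and I/O limits on the source system.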

Conclusion

The approach and its extensions mentioned in this blog post provide mechanisms to migrate your Microsoft APS workloads to an Amazon Redshift CDW. They enable elasticity, scalability, and agility for your workload to future proof your analytics investment.


Accelerate self-service analytics with Amazon Redshift Query Editor V2

Post Syndicated from Bhanu Pittampally original https://aws.amazon.com/blogs/big-data/accelerate-self-service-analytics-with-amazon-redshift-query-editor-v2/

Amazon Redshift is a fast, fully managed cloud data warehouse. Tens of thousands of customers use Amazon Redshift as their analytics platform. Users such as data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift provides a web-based query editor in addition to supporting connectivity via ODBC/JDBC or the Redshift Data API. Query Editor V2 lets users explore, analyze, and collaborate on data. You can use Query Editor V2 to create databases, schemas, and tables, and to load data from Amazon Simple Storage Service (Amazon S3) using either the COPY command or a wizard. You can browse multiple databases and run queries on your Amazon Redshift data warehouse or data lake, or run federated queries against operational databases such as Amazon Aurora.

From the smallest start-ups to worldwide conglomerates, customers across the spectrum tell us they want to promote self-service analytics by empowering their end-users, such as data analysts and business analysts, to load data into their analytics platform. Analysts at these organizations create tables and load data in their own workspace, and they join that with the curated data available from the data warehouse to gain insight. This post will discuss how Query Editor V2 accelerates self-service analytics by enabling users to create tables and load data with simple wizards.

The Goal to Accelerate and Empower Data Analysts

A common practice that we see across enterprises today is that more and more of them let data analysts or business analysts load data into user or group workspaces that co-exist on the data warehouse platform. Enterprises call these personal workspaces, departmental schemas, project-based schemas, labs, and so on. The idea of this approach is to empower data analysts to load datasets themselves and join them with the curated datasets on the data warehouse platform to accelerate the data analysis process.

Amazon Redshift Query Editor V2 makes it easy for administrators to create these workspaces, and it enables data analysts to create tables and load data into them. Query Editor V2 also lets you easily create external schemas in the Redshift cluster to extend the data warehouse to a data lake, thereby accelerating analytics.

An example Use case

Let’s assume that an organization has a marketing department with some power users and regular users. In this example, let’s also consider that the organization already has an Enterprise Data Warehouse (EDW) powered by Amazon Redshift. The marketing department would like to have a workspace created for their team members.

A visual depiction of a Data Warehouse Environment may look like the following figure. Enterprises let user/group schemas be created along with an EDW, which contains curated data sets. Analysts can create and load exploratory data sets into user schemas, and then join curated data sets available in the EDW.

ScopeofSolution

Amazon Redshift provides several options to isolate your users' data from the enterprise data warehouse data. Amazon Redshift data sharing lets you share data from your EDW cluster with a separate consumer cluster. Your users can consume the EDW data and create their own workspace in the consumer cluster. Alternatively, you can create a separate database for your users' group workspace in the same cluster and isolate each user group in its own schema. Amazon Redshift supports queries that join data across databases, so users can join their tables with the curated data in the EDW. We recommend the data sharing option because it isolates both compute and data. Query Editor V2 supports both scenarios.
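
For reference, the following is a minimal sketch of the SQL behind the data sharing pattern; the share, schema, and table names as well as the namespace IDs are placeholders rather than values from this post:

-- On the EDW (producer) cluster: share the curated schema
CREATE DATASHARE edw_share;
ALTER DATASHARE edw_share ADD SCHEMA curated;
ALTER DATASHARE edw_share ADD ALL TABLES IN SCHEMA curated;
GRANT USAGE ON DATASHARE edw_share TO NAMESPACE '<consumer-namespace-id>';

-- On the consumer (workspace) cluster: surface the shared data as a database
CREATE DATABASE edw FROM DATASHARE edw_share OF NAMESPACE '<producer-namespace-id>';

-- Analysts can then join workspace tables with the curated data
SELECT w.campaign_id, c.segment
FROM analyst_ws.campaign_stage w
JOIN edw.curated.customer_segments c ON c.customer_id = w.customer_id;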

Once you have given your data analysts their own workspace and granted the relevant privileges, they can easily create schemas and tables and load data.

Prerequisites

  1. You have an Amazon Redshift cluster, and you have configured Query Editor V2. You can view the Simplify Data Analysis with Amazon Redshift Query Editor V2 post for instructions on setting up Query Editor V2.
  2. To load your data from Amazon S3 into Amazon Redshift, you start by creating an IAM role that provides permissions to access Amazon S3, and you grant that role to the Redshift cluster. By default, Redshift users assume the IAM role that is attached to the Redshift cluster. You can find the instructions in the Redshift getting started guide.
  3. For users who want to load data from Amazon S3, Query Editor V2 provides an option to browse S3 buckets. To use this feature, users should have List permission on the S3 bucket.

Create Schemas

Query Editor V2 supports schema creation. Admins can create both native and external schemas using the Create Schema wizard.

CreateSchemas

As a user, you can easily create a schema by opening the Create Schema wizard from the Create button and then selecting Schema from the drop-down list, as shown in the following screenshot.

If you select Schema from the drop-down list, a Create Schema wizard similar to the following screenshot is displayed. You can choose a local schema and provide a schema name.

Optionally, you can authorize a user to create objects in the schema. When the Authorize user check box is selected, Create and Usage access are granted to that user. Now Janedoe can create objects in this schema.
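
Behind the scenes, the Authorize user option corresponds to ordinary SQL grants. The following is a minimal sketch, assuming an illustrative schema named marketing_ws and a user named janedoe:

CREATE SCHEMA IF NOT EXISTS marketing_ws;

-- Equivalent of selecting the Authorize user check box for janedoe
GRANT CREATE, USAGE ON SCHEMA marketing_ws TO janedoe;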

Let's assume that the analyst user Janedoe logs in to Query Editor V2, connects to the database, and wants to create a table and load data into their personal workspace.

Creating Tables

Query Editor V2 provides a Create table wizard for users to create a table quickly. It allows power users to auto-create a table based on a data file. Users can upload the file from their local machine and let Query Editor V2 infer the data types and column widths. Optionally, you can change the column definitions, such as encoding, and the table properties.

Below is a sample CSV file with a row header and sample rows from the MarketingCampaign.csv file. We will demonstrate how to create a table based on this file in the following steps.

SampleData

The following screenshot shows the uploading of the MarketingCampaign.csv file into Query Editor V2.

Create Table Wizard has two sections:

  1. Columns

The Columns tab lets users select a file from their local desktop and upload it to Query Editor V2. Users can choose Schema from the drop-down option and provide a table name.

Query Editor V2 automatically infers columns and some data types for each column. It has several options to choose from to set as column properties. For example, you have the option to change column names, data types, and encoding. If you do not choose any encoding option, then the encoding choice will be selected automatically. You can also add new fields, for example, an auto-increment ID column, and define properties for that particular identity column.

  2. Table Details

You can use the Create Table wizard to create a temporary table or a regular table, with the option of including the table in automatic backups. A temporary table is available only for the duration of the session in which it is created and is automatically dropped when that session ends.

The "Table Details" section is optional, because Amazon Redshift's Automatic Table Optimization feature takes care of the distribution key and sort key on behalf of users.

Viewing the Create Table statement

Once the column and table-level details are set, Query Editor V2 gives you the option to view the CREATE TABLE statement in a Query Editor tab. This lets users save the definition for later use or share it with other users. Once the user has reviewed the definition, they can choose Run to run the query. Users can also create the table directly from the Create table wizard.

The following screenshot shows the Create table definition for the marketing campaign data set.

CreateTable3
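
Because the column list comes from the uploaded file, the generated statement varies; the following is only a hypothetical sketch of the kind of statement the wizard produces (the schema name and most columns are illustrative, except Dt_Customer, which is called out below as the sort key):

CREATE TABLE marketing.marketing_campaign (
  id          INTEGER,
  year_birth  INTEGER,
  education   VARCHAR(50),
  income      DECIMAL(10,2),
  dt_customer DATE
)
SORTKEY (dt_customer);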

Query Editor V2 lets users view table definitions in a table format. The following screenshot displays the table that we created earlier. Note that Redshift automatically inferred the encoding type for each column. As a best practice, it skipped encoding for "Dt_Customer" because that column was set as the sort key. When creating the table, we did not set the encodings for the columns, because Redshift automatically selects the best compression method for each column.

Query Editor V2 distinguishes columns by data types in a table by using distinct icons for them.

You can also view the table definition by right-clicking on the table and selecting the show definition option. You can also generate a template select command, and drop or truncate the table by right-clicking on a table.

Loading Data

Now that we have created a schema and a table, let’s learn how to upload the data to the table that we created earlier.

Query Editor V2 provides the ability to load data from S3 buckets into Redshift tables. The COPY command is the recommended way to load data into Amazon Redshift because it leverages the massively parallel processing capabilities of the cluster.

The Load Data wizard in the Query Editor V2 loads data into Redshift by generating the COPY command. As a data analyst, you don’t have to remember the intricacies of the COPY command.

You can quickly load data from CSV, JSON, ORC, or Parquet files into an existing table using the Load Data wizard. It supports all of the options of the COPY command, and it lets data analysts build a COPY command with an easy-to-use GUI.
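
For reference, the wizard ultimately generates a COPY statement similar to the following sketch; the table name, bucket, IAM role ARN, and Region are placeholders rather than values from this post, while the ';' delimiter and header option match the file settings described later:

COPY marketing.marketing_campaign
FROM 's3://your-bucket/MarketingCampaign.csv'
IAM_ROLE 'arn:aws:iam::111122223333:role/YourRedshiftCopyRole'
REGION 'eu-west-1'
CSV
DELIMITER ';'
IGNOREHEADER 1;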

The following screenshot shows an S3 bucket that contains our MarketingCampaign.csv file. This is a much larger version of the file that we used to create the table with the Create table wizard. We will use this file to walk you through the Load Data wizard.

LoadData1

The Load Data wizard lets you browse your available S3 buckets and select a file or folder from an S3 bucket. You can also use a manifest file. A manifest file lets you make sure that all of the files are loaded using the COPY command. You can find more information about manifest files here.

The Load Data Wizard lets you enter several properties, such as the Redshift Cluster IAM role and whether data is encrypted. You can also set file options. For example, in the case of CSV, you can set delimiter and quote parameters. If the file is compressed, then you can provide compression settings.

With the Data Conversion Parameters, you can select options like Escape Characters, time format, and if you want to ignore the header in your data file. The Load Operations option lets you set compression encodings and error handling options.

Query Editor V2 lets you browse S3 objects, making it easier to navigate buckets, folders, and files. The following screens display the flow.

Query Editor V2 supports loading data in many open formats, such as JSON, Delimiter, FixedWidth, AVRO, Parquet, ORC, and Shapefile.

In our example, we are loading CSV files. As you can see, we have selected our MarketingCampaign.csv file, set the Region, and selected the Redshift cluster IAM role.

For the CSV file, under additional File Options, the Delimiter Character and Quote Character are set to ";" and an empty quote in the following screen.

Once the required parameters are set, continue to the next step to load the data. The Load Data operation builds a COPY command, automatically loads it into a Query Editor tab, and then invokes the query.

LoadData5

Data is loaded into the target table successfully, and now you can run a query to view that data. The following screen shows the result of the select query executed on our target table:

LoadData6

Viewing load errors

If your COPY command fails, the errors are logged in the STL_LOAD_ERRORS system table. Query Editor V2 simplifies viewing the most common errors by showing them in place, as shown in the following screenshot:

LoadData7
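
If you prefer to inspect load errors directly, you can also query the system table yourself; a minimal sketch:

-- Most recent load errors, newest first
SELECT starttime, filename, line_number, colname, err_code, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;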

Saving and reusing the queries

You can save load queries for future use by saving the query and providing a name for it.

You will probably want to reuse the load query in the future to load data from another S3 location. In that case, you can parameterize the query by replacing the S3 URL with a parameter, as shown in the following screenshot:

SavingQ2

You can save the query, and then share the query with another user.

When you or other users run the query, a prompt for the parameter will appear as in the following screenshot:

SavingQ3

We discussed how data analysts could load data into their own or the group’s workspace.

We will now discuss using Query Editor V2 to create an external schema to extend your data warehouse to the data lake.

Extending the Data Warehouse to the Data Lake

Extending data warehouses to data lakes is part of modern data architecture practice. Amazon Redshift enables this through seamless integration with data lakes running on AWS, using Amazon Redshift Spectrum. You can access data lakes from the Redshift data warehouse by creating Redshift external schemas.

Query Editor V2 lets you create an external schema that references an external database in the AWS Glue Data Catalog.

To extend your data warehouse to a data lake, you need an S3 data lake and an AWS Glue Data Catalog database defined for the data lake. You also need to grant permission on AWS Glue to the Redshift cluster role. You can find more information about external schemas here.

You can navigate to Create External Schema by using the Create Schema wizard and then selecting External Schema, as shown in the following screenshot:

Query Editor V2 makes the schema creation experience easy by hiding the intricacies of the CREATE EXTERNAL SCHEMA syntax. You can use the simple interface and provide the required parameters, such as the AWS Glue Data Catalog Region, the external database name, and the IAM role. You can browse the Glue Catalog and view the database name.
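
For reference, the statement the wizard generates looks roughly like the following sketch; the schema name, Glue database name, Region, and IAM role ARN are placeholders:

CREATE EXTERNAL SCHEMA IF NOT EXISTS datalake
FROM DATA CATALOG
DATABASE 'your_glue_database'
REGION 'eu-west-1'
IAM_ROLE 'arn:aws:iam::111122223333:role/YourRedshiftSpectrumRole';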

After you use the create schema option, you can see the schemas in the tree-view. The Query Editor V2 uses distinct icons to distinguish between native Schema and external Schema.

Viewing External Table Definitions

The Query Editor V2 lets data analysts quickly view objects available in external databases and understand their metadata.

You can view tables and columns for a given table by clicking on external Schema and then on a table. When a particular table is selected, its metadata information is displayed in the bottom portion below the tree-view panel. This is a powerful feature, as an analyst can easily understand the data residing externally in the data lake.

You can now run queries against external tables in the external Schema.

In our fictitious enterprise, Marketing Department team members can load data in their own workspace and join the data from their own user/group workspace with the curated data in the enterprise data warehouse or data lake.

Conclusion

This post demonstrated how the Query Editor V2 enabled data analysts to create tables and load data from Amazon S3 easily with a simple wizard.

We also discussed how Query Editor V2 lets you extend the data warehouse to a data lake. The data analysts can easily browse tables in your local data warehouse, data shared from another cluster, or tables in the data lake. You can run queries that can join tables in your data warehouse and data lake. The Query Editor V2 also provides several features for the collaboration of query authoring. You can view the earlier blog to learn more about how the Query Editor V2 simplifies data analysis.

These features let organizations accelerate self-service analytics and let end-users deliver insights faster.

Happy querying!


About the Authors

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouse architecture, development, and administration. He has been in the data and analytics field for over 13 years. His LinkedIn profile is here.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Cansu Aksu is a Front End Engineer at AWS with several years of experience in developing user interfaces. She is detail-oriented, eager to learn, and passionate about delivering products and features that solve customer needs and problems.

Chengyang Wang is a Frontend Engineer on the Redshift Console team. He has worked on a number of new features delivered by Redshift in the past 2 years. He strives to deliver high-quality products and aims to improve the customer experience from the UI.

Query hierarchical data models within Amazon Redshift

Post Syndicated from Nadim Rana original https://aws.amazon.com/blogs/big-data/query-hierarchical-data-models-within-amazon-redshift/

In a hierarchical database model, information is stored in a tree-like, parent-child structure, where each record can have a single parent but many children. Hierarchical databases are useful when you need to represent data in a tree-like hierarchy. A classic example of a hierarchical data model is the navigation of files and folders or the sitemap of a website. Another widely used example is a company organization chart or staffing information, where employees are linked with each other via an employee-manager relationship.

Some relational databases provide functionality to support a hierarchical data model via common table expressions (CTEs). Recursive CTE syntax and semantics are defined as part of ANSI SQL for querying a hierarchical data model in relational database management systems.

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that supports the hierarchical database model through ANSI-compliant recursive CTEs, which let you query hierarchical data stored in parent-child relationships. Recursive CTE support was introduced in Amazon Redshift in April 2021.

In this post, we present a simple hierarchical data model and write queries to show how easy it is to retrieve information from this model using recursive CTEs in Amazon Redshift.

Common use cases of hierarchical data models

Amazon Redshift's support for hierarchical data models is provided through recursive common table expressions. The hierarchical data model is built upon a parent-child relationship within the same table or view. The model represents a tree-like structure, where each record has a single parent. Although a parent can have one or more child records, each child can have only a single parent. For example, most companies have a hierarchical structure when it comes to their org chart. In an org chart, you have a CEO at the root of the tree and the staffing structure underneath the root. You can traverse the hierarchy bottom-up or top-down to discover all employees under a manager. Another example of a hierarchical data model is the components and subcomponents of an engine. To query a hierarchical dataset, databases provide special query constructs, usually recursive CTEs.

Common Table Expressions

Amazon Redshift provides the WITH statement, which allows us to construct auxiliary statements (CTEs) for use in a SQL query. CTEs are like temporary tables that exist only while running the query.

ANSI CTEs are a great technique to simplify complex queries by breaking a query into simple individual pieces. Using CTEs with a WITH clause results in a more readable and maintainable query than using subqueries. The following is a simple query to illustrate how to write a CTE:

WITH outer_products as (
  SELECT *
  FROM Products
  WHERE active_product is true
)

SELECT a.*
FROM dim_customers_product a
INNER JOIN outer_products ON (outer_products.product_id = a.product_id);

In the preceding query, the outer_products block is constructed using a WITH clause. You can use this block within subqueries, join conditions, or any other SQL construct, just like any table. Without this block, complex queries would replicate the same code multiple times, which increases complexity and reduces readability.

Recursive Common Table Expressions

A recursive common table expression is a CTE that references itself. The SQL WITH construct, using a CTE, is useful for other functions than just recursive queries, but recursive CTEs are very effective in querying a hierarchical data model. Without the support for recursive CTEs, it’s very hard to query hierarchical data in a relational database. For example, without recursive CTEs, procedural code is needed along with temporary tables to query hierarchical data represented in a relational model. The approach is not very elegant and is also error-prone. A recursive CTE is usually needed when querying hierarchical data such as organization charts where one employee reports to a manager, or a multi-level bill of materials when a product consists of many components, and each component also consists of many subcomponents.

The following is an example of a recursive CTE to generate a sequence of numbers. The recursive part nums is being used in the inner query that references the outer CTE, which is defined as RECURSIVE nums(n).

WITH RECURSIVE nums (n) AS (
    SELECT 1
    UNION ALL
    SELECT n+1 FROM nums WHERE n+1 <= 10   --Recursive query referencing outer CTE nums
)
SELECT n FROM nums;

Org chart use case

Let’s look at an organization chart where employees are related to each other in a staffing data model.

The following queries run on a table called employees, which contains the employee ID, employee name, and manager of the employee.

Let’s look at how to return the number of employees that directly or indirectly report to Bob.

Because this data is hierarchical in nature, we run a hierarchical query:

with bob_org(id, name, manager) as
( select id, name, manager
  from employee
  where name = 'Bob'
 
  union all
 
  select e.id, e.name, e.manager
  from employee e, bob_org b
  where e.manager = b.id
  )
 select count(*) from bob_org;

Let’s break down the preceding query and look at individual pieces that construct the components of a hierarchal query.

In general, a recursive CTE is a UNION ALL subquery with three parts within the WITH clause:

  • Anchor member – The first part is a SELECT query that doesn’t have recursive reference to cte_name. It returns the base result of the CTE, which is the initial set of the recursion. This initial query is often called the initial member, seed member, or anchor member. In the preceding query, bob_org is cte_name, and the first SELECT query is the anchor member.
  • Recursive member – The second query after UNION ALL is a recursive query that references the CTE, called the recursive member. The recursive member is unioned with the anchor member (the initial seed) using the set operator UNION ALL. UNION ALL must be performed for the recursive CTEs. In the preceding example, the second query is the recursive query, and it references the CTE bob_org.
  • Termination condition – This condition must be specified in the recursive member (the second query after UNION ALL). The termination condition is usually provided via a WHERE clause or JOIN operation. In the preceding example, the termination condition is provided via a WHERE clause using e.manager = b.id.
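
The DDL for the staffing table isn't shown here; the example above uses the shorter names id, name, and manager, while the next example uses employee_id, manager_id, and employee_name. A minimal sketch of the latter, consistent with the results shown below (the full organization would of course contain more rows):

CREATE TABLE employees (
  employee_id   INTEGER,
  manager_id    INTEGER,
  employee_name VARCHAR(100)
);

-- Sample rows matching the results shown in the next example
INSERT INTO employees VALUES
  (302, 201, 'Mary'),
  (402, 302, 'Sam'),
  (408, 302, 'Erin');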

Let’s use the same org chart to get all the subordinates of the manager with ID 302:

WITH RECURSIVE subordinates(employee_id, manager_id, employee_name) AS (
SELECT  employee_id,
        manager_id,
        employee_name
FROM    employees 
WHERE   employee_id = 302

UNION ALL 
SELECT E1.employee_id,
       E1.manager_id,
       E1.employee_name
FROM   employees E1
INNER JOIN subordinates s 
On s.employee_id = E1.manager_id
)
SELECT * FROM subordinates;

The first SELECT query (the anchor member) returns the base result, which is the employee with ID 302:

Employee_id                 Manager_id                 Employee_name
302                           201                          Mary

Next, the recursive member returns all the members underneath the tree rooted at employee ID 302 as well as the root, which is ID 302:

Employee_id                 Manager_id                 Employee_name
302                           201                           Mary
402                           302                           Sam
408                           302                           Erin

When we start at the seed level, recursive queries can traverse a hierarchical tree in an elegant manner. This is not easily possible without hierarchical support, which is achieved by recursive CTEs.

Display “hierarchical path” via recursive CTE

Displaying hierarchy levels and a hierarchical path along with the data is also often desired in these types of use cases. You can easily achieve this by using a recursive CTE with the following approach.

Let’s modify the preceding query to display the hierarchical path and levels:

WITH RECURSIVE subordinates(level, employee_id, manager_id, employee_name, path) AS (
SELECT  1,
        employee_id,
        manager_id,
        employee_name,
        CAST(employee_id AS VARCHAR(1000))  -- See note below for CAST
FROM    employees 
WHERE   employee_id = 302

UNION ALL 
SELECT s.level + 1,
       E1.employee_id,
       E1.manager_id,
       E1.employee_name,
       concat( concat( s.path, '/'), E1.employee_id) AS path
FROM   employees E1
INNER JOIN subordinates s 
On s.employee_id = E1.manager_id
)
SELECT * FROM subordinates;

In the preceding query, it’s very important to use CAST with the appropriate data type to define the maximum width of the result column. The main reason for using CAST is to define the column length for the resultant column. The result column data type and its length is determined by the initialization branch, namely the anchor member or first SELECT. In this example, if the employee_id column is of varchar(400) type but its actual content is much smaller, for example, 20 bytes, CAST isn’t needed when there are 10 hierarchical levels. This is because the recursive CTE’s result column path would be varchar(400), which is wide enough to hold the result data. However, if the employee_id data type is varchar(30) and the actual data length is 20 bytes, the result column path of the recursive CTE is longer than 30 bytes only after two levels of hierarchy. To accommodate 10 levels of hierarchy, CAST(employee_id as varchar(400)) is needed here. This defines the data type and size of the result column. In this case, the result column path maximum data size is 400 bytes, which is long enough to accommodate the result data. Therefore, we highly recommend using explicit CAST to define the data type and size of the result column in a recursive CTE.

The result of the preceding query is as follows:

Level    Employee_id    Manager_id     Employee_name     Path
1        302            201            Mary              302
2        402            302            Sam               302/402
2        408            302            Erin              302/408

Because both employees 402 and 408 are one level below Mary (302), their level is 1 plus Mary’s level. Also, it should be noted that the path column now indicates the hierarchical path rooted at the seed level, which was 302 in our scenario.

With these recursive CTE queries, it’s extremely easy to query hierarchical data within Amazon Redshift. No more complicated stored procedures or temporary tables are needed to query hierarchical data in Redshift.

Recursive CTE troubleshooting: Top-down vs. bottom-up

You can use both top-down and bottom-up approaches via recursive CTE to traverse the hierarchical structure. In the preceding example, we presented a top-down approach. To traverse bottom-up, you should specify the starting point in the first part of the query. Then the second query needs to join the recursive component using the parent ID.
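
A minimal sketch of a bottom-up traversal, assuming the same employees table and starting from employee 402 (the starting ID is illustrative), looks like the following. The recursive member joins on the parent (manager) ID, so each iteration moves one level up the chain:

WITH RECURSIVE management_chain(employee_id, manager_id, employee_name) AS (
  -- Anchor member: start from the employee at the bottom of the chain
  SELECT employee_id, manager_id, employee_name
  FROM employees
  WHERE employee_id = 402

  UNION ALL

  -- Recursive member: join on the manager (parent) ID to move up one level
  SELECT e.employee_id, e.manager_id, e.employee_name
  FROM employees e
  INNER JOIN management_chain c ON e.employee_id = c.manager_id
)
SELECT * FROM management_chain;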

In both top-down and bottom-up strategies, you must clearly specify the termination condition. If the termination condition isn't specified correctly, you might run into cyclic conditions that cause an infinite loop and produce Recursive CTE out of working buffers errors in Amazon Redshift. Therefore, make sure you add a condition to stop the recursion and prevent infinite loops.

Summary

This post discussed the usage and benefits of the new capability in Amazon Redshift that provides support for querying hierarchical data models via recursive CTEs. This functionality can also provide tremendous benefits when combined with the SUPER data type, which provides support for querying semi-structured data like JSON. Try it out and share your experience.


About the Author

Nadim Rana is a Data Warehouse and MPP Architect at Amazon Web Services. He is a big data enthusiast and collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.