All posts by Jagadish Kumar

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/petabyte-scale-log-analytics-with-amazon-s3-amazon-opensearch-service-and-amazon-opensearch-ingestion/

Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance.

With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging. With a modern data architecture on AWS, you can rapidly build scalable data lakes; use a broad and deep collection of purpose-built data services; ensure compliance via unified data access, security, and governance; scale your systems at a low cost without compromising performance; and share data across organizational boundaries with ease, allowing you to make decisions with speed and agility at scale.

You can take all your data from various silos, aggregate that data in your data lake, and perform analytics and machine learning (ML) directly on top of that data. You can also store other data in purpose-built data stores to analyze and get fast insights from both structured and unstructured data. This data movement can be inside-out, outside-in, around the perimeter or sharing across.

For example, application logs and traces from web applications can be collected directly in a data lake, and a portion of that data can be moved out to a log analytics store like Amazon OpenSearch Service for daily analysis. We think of this concept as inside-out data movement. The analyzed and aggregated data stored in Amazon OpenSearch Service can again be moved to the data lake to run ML algorithms for downstream consumption from applications. We refer to this concept as outside-in data movement.

Let’s look at an example use case. Example Corp. is a leading Fortune 500 company that specializes in social content. They have hundreds of applications generating data and traces at approximately 500 TB per day and have the following criteria:

  • Have logs available for fast analytics for 2 days
  • Beyond 2 days, have data available in a storage tier that can be made available for analytics with a reasonable SLA
  • Retain the data beyond 1 week in cold storage for 30 days (for purposes of compliance, auditing, and others)

In the following sections, we discuss three possible solutions to address similar use cases:

  • Tiered storage in Amazon OpenSearch Service and data lifecycle management
  • On-demand ingestion of logs using Amazon OpenSearch Ingestion
  • Amazon OpenSearch Service direct queries with Amazon Simple Storage Service (Amazon S3)

Solution 1: Tiered storage in OpenSearch Service and data lifecycle management

OpenSearch Service supports three integrated storage tiers: hot, UltraWarm, and cold storage. Based on your data retention, query latency, and budgeting requirements, you can choose the best strategy to balance cost and performance. You can also migrate data between different storage tiers.

Hot storage is used for indexing and updating, and provides the fastest access to data. Hot storage takes the form of an instance store or Amazon Elastic Block Store (Amazon EBS) volumes attached to each node.

UltraWarm offers significantly lower costs per GiB for read-only data that you query less frequently and doesn’t need the same performance as hot storage. UltraWarm nodes use Amazon S3 with related caching solutions to improve performance.

Cold storage is optimized to store infrequently accessed or historical data. When you use cold storage, you detach your indexes from the UltraWarm tier, making them inaccessible. You can reattach these indexes in a few seconds when you need to query that data.

For more details on data tiers within OpenSearch Service, refer to Choose the right storage tier for your needs in Amazon OpenSearch Service.

Solution overview

The workflow for this solution consists of the following steps:

  1. Incoming data generated by the applications is streamed to an S3 data lake.
  2. Data is ingested into Amazon OpenSearch using S3-SQS near-real-time ingestion through notifications set up on the S3 buckets.
  3. After 2 days, hot data is migrated to UltraWarm storage to support read queries.
  4. After 5 days in UltraWarm, the data is migrated to cold storage for 21 days and detached from any compute. The data can be reattached to UltraWarm when needed. Data is deleted from cold storage after 21 days.
  5. Daily indexes are maintained for easy rollover. An Index State Management (ISM) policy automates the rollover or deletion of indexes that are older than 2 days.

The following is a sample ISM policy that rolls over data into the UltraWarm tier after 2 days, moves it to cold storage after 5 days, and deletes it from cold storage after 21 days:

{
    "policy": {
        "description": "hot warm delete workflow",
        "default_state": "hot",
        "schema_version": 1,
        "states": [
            {
                "name": "hot",
                "actions": [
                    {
                        "rollover": {
                            "min_index_age": "2d",
                            "min_primary_shard_size": "30gb"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "warm"
                    }
                ]
            },
            {
                "name": "warm",
                "actions": [
                    {
                        "replica_count": {
                            "number_of_replicas": 5
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "cold",
                        "conditions": {
                            "min_index_age": "5d"
                        }
                    }
                ]
            },
            {
                "name": "cold",
                "actions": [
                    {
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "1h"
                        },
                        "cold_migration": {
                            "start_time": null,
                            "end_time": null,
                            "timestamp_field": "@timestamp",
                            "ignore": "none"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "21d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "cold_delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": {
            "index_patterns": [
                "log*"
            ],
            "priority": 100
        }
    }
}

Considerations

UltraWarm uses sophisticated caching techniques to enable querying for infrequently accessed data. Although the data access is infrequent, the compute for UltraWarm nodes needs to be running all the time to make this access possible.

When operating at PB scale, to reduce the area of effect of any errors, we recommend decomposing the implementation into multiple OpenSearch Service domains when using tiered storage.

The next two patterns remove the need to have long-running compute and describe on-demand techniques where the data is either brought when needed or queried directly where it resides.

Solution 2: On-demand ingestion of logs data through OpenSearch Ingestion

OpenSearch Ingestion is a fully managed data collector that delivers real-time log and trace data to OpenSearch Service domains. OpenSearch Ingestion is powered by the open source data collector Data Prepper. Data Prepper is part of the open source OpenSearch project.

With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. You configure your data producers to send data to OpenSearch Ingestion. It automatically delivers the data to the domain or collection that you specify. You can also configure OpenSearch Ingestion to transform your data before delivering it. OpenSearch Ingestion is serverless, so you don’t need to worry about scaling your infrastructure, operating your ingestion fleet, and patching or updating the software.

There are two ways that you can use Amazon S3 as a source to process data with OpenSearch Ingestion. The first option is S3-SQS processing. You can use S3-SQS processing when you require near-real-time scanning of files after they are written to S3. It requires an Amazon Simple Queue Service (Amazon S3) queue that receives S3 Event Notifications. You can configure S3 buckets to raise an event any time an object is stored or modified within the bucket to be processed.

Alternatively, you can use a one-time or recurring scheduled scan to batch process data in an S3 bucket. To set up a scheduled scan, configure your pipeline with a schedule at the scan level that applies to all your S3 buckets, or at the bucket level. You can configure scheduled scans with either a one-time scan or a recurring scan for batch processing.

For a comprehensive overview of OpenSearch Ingestion, see Amazon OpenSearch Ingestion. For more information about the Data Prepper open source project, visit Data Prepper.

Solution overview

We present an architecture pattern with the following key components:

  • Application logs are streamed into to the data lake, which helps feed hot data into OpenSearch Service in near-real time using OpenSearch Ingestion S3-SQS processing.
  • ISM policies within OpenSearch Service handle index rollovers or deletions. ISM policies let you automate these periodic, administrative operations by triggering them based on changes in the index age, index size, or number of documents. For example, you can define a policy that moves your index into a read-only state after 2 days and then deletes it after a set period of 3 days.
  • Cold data is available in the S3 data lake to be consumed on demand into OpenSearch Service using OpenSearch Ingestion scheduled scans.

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. Incoming data generated by the applications is streamed to the S3 data lake.
  2. For the current day, data is ingested into OpenSearch Service using S3-SQS near-real-time ingestion through notifications set up in the S3 buckets.
  3. Daily indexes are maintained for easy rollover. An ISM policy automates the rollover or deletion of indexes that are older than 2 days.
  4. If a request is made for analysis of data beyond 2 days and the data is not in the UltraWarm tier, data will be ingested using the one-time scan feature of Amazon S3 between the specific time window.

For example, if the present day is January 10, 2024, and you need data from January 6, 2024 at a specific interval for analysis, you can create an OpenSearch Ingestion pipeline with an Amazon S3 scan in your YAML configuration, with the start_time and end_time to specify when you want the objects in the bucket to be scanned:

version: "2"
ondemand-ingest-pipeline:
  source:
    s3:
      codec:
        newline:
      compression: "gzip"
      scan:
        start_time: 2023-12-28T01:00:00
        end_time: 2023-12-31T09:00:00
        buckets:
          - bucket:
              name: <bucket-name>
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::<acct num>:role/PipelineRole"
    
    acknowledgments: true
  processor:
    - parse_json:
    - date:
        from_time_received: true
        destination: "@timestamp"           
  sink:
    - opensearch:                  
        index: "logs_ondemand_20231231"
        hosts: [ "https://search-XXXX-domain-XXXXXXXXXX.us-east-1.es.amazonaws.com" ]
        aws:                  
          sts_role_arn: "arn:aws:iam::<acct num>:role/PipelineRole"
          region: "us-east-1"

Considerations

Take advantage of compression

Data in Amazon S3 can be compressed, which reduces your overall data footprint and results in significant cost savings. For example, if you are generating 15 PB of raw JSON application logs per month, you can use a compression mechanism like GZIP, which can reduce the size to approximately 1PB or less, resulting in significant cost savings.

Stop the pipeline when possible

OpenSearch Ingestion scales automatically between the minimum and maximum OCUs set for the pipeline. After the pipeline has completed the Amazon S3 scan for the specified duration mentioned in the pipeline configuration, the pipeline continues to run for continuous monitoring at the minimum OCUs.

For on-demand ingestion for past time durations where you don’t expect new objects to be created, consider using supported pipeline metrics such as recordsOut.count to create Amazon CloudWatch alarms that can stop the pipeline. For a list of supported metrics, refer to Monitoring pipeline metrics.

CloudWatch alarms perform an action when a CloudWatch metric exceeds a specified value for some amount of time. For example, you might want to monitor recordsOut.count to be 0 for longer than 5 minutes to initiate a request to stop the pipeline through the AWS Command Line Interface (AWS CLI) or API.

Solution 3: OpenSearch Service direct queries with Amazon S3

OpenSearch Service direct queries with Amazon S3 (preview) is a new way to query operational logs in Amazon S3 and S3 data lakes without needing to switch between services. You can now analyze infrequently queried data in cloud object stores and simultaneously use the operational analytics and visualization capabilities of OpenSearch Service.

OpenSearch Service direct queries with Amazon S3 provides zero-ETL integration to reduce the operational complexity of duplicating data or managing multiple analytics tools by enabling you to directly query your operational data, reducing costs and time to action. This zero-ETL integration is configurable within OpenSearch Service, where you can take advantage of various log type templates, including predefined dashboards, and configure data accelerations tailored to that log type. Templates include VPC Flow Logs, Elastic Load Balancing logs, and NGINX logs, and accelerations include skipping indexes, materialized views, and covered indexes.

With OpenSearch Service direct queries with Amazon S3, you can perform complex queries that are critical to security forensics and threat analysis and correlate data across multiple data sources, which aids teams in investigating service downtime and security events. After you create an integration, you can start querying your data directly from OpenSearch Dashboards or the OpenSearch API. You can audit connections to ensure that they are set up in a scalable, cost-efficient, and secure way.

Direct queries from OpenSearch Service to Amazon S3 use Spark tables within the AWS Glue Data Catalog. After the table is cataloged in your AWS Glue metadata catalog, you can run queries directly on your data in your S3 data lake through OpenSearch Dashboards.

Solution overview

The following diagram illustrates the solution architecture.

This solution consists of the following key components:

  • The hot data for the current day is stream processed into OpenSearch Service domains through the event-driven architecture pattern using the OpenSearch Ingestion S3-SQS processing feature
  • The hot data lifecycle is managed through ISM policies attached to daily indexes
  • The cold data resides in your Amazon S3 bucket, and is partitioned and cataloged

The following screenshot shows a sample http_logs table that is cataloged in the AWS Glue metadata catalog. For detailed steps, refer to Data Catalog and crawlers in AWS Glue.

Before you create a data source, you should have an OpenSearch Service domain with version 2.11 or later and a target S3 table in the AWS Glue Data Catalog with the appropriate AWS Identity and Access Management (IAM) permissions. IAM will need access to the desired S3 buckets and have read and write access to the AWS Glue Data Catalog. The following is a sample role and trust policy with appropriate permissions to access the AWS Glue Data Catalog through OpenSearch Service:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "directquery.opensearchservice.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

The following is a sample custom policy with access to Amazon S3 and AWS Glue:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": "es:ESHttp*",
            "Resource": "arn:aws:es:*:<acct_num>:domain/*"
        },
        {
            "Sid": "Statement2",
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3:Put*",
                "s3:Describe*"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket-name>",
                "arn:aws:s3:::<bucket-name>/*"
            ]
        },
        {
            "Sid": "GlueCreateAndReadDataCatalog",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateDatabase",
                "glue:GetDatabases",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreatePartition",
                "glue:BatchCreatePartition",
                "glue:GetUserDefinedFunctions"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:<acct_num>:catalog",
                "arn:aws:glue:us-east-1:<acct_num>:database/*",
                "arn:aws:glue:us-east-1:<acct_num>:table/*"
            ]
        }
    ]
}

To create a new data source on the OpenSearch Service console, provide the name of your new data source, specify the data source type as Amazon S3 with the AWS Glue Data Catalog, and choose the IAM role for your data source.

After you create a data source, you can go to the OpenSearch dashboard of the domain, which you use to configure access control, define tables, set up log type-based dashboards for popular log types, and query your data.

After you set up your tables, you can query your data in your S3 data lake through OpenSearch Dashboards. You can run a sample SQL query for the http_logs table you created in the AWS Glue Data Catalog tables, as shown in the following screenshot.

Best practices

Ingest only the data you need

Work backward from your business needs and establish the right datasets you’ll need. Evaluate if you can avoid ingesting noisy data and ingest only curated, sampled, or aggregated data. Using these cleaned and curated datasets will help you optimize the compute and storage resources needed to ingest this data.

Reduce the size of data before ingestion

When you design your data ingestion pipelines, use strategies such as compression, filtering, and aggregation to reduce the size of the ingested data. This will permit smaller data sizes to be transferred over the network and stored in your data layer.

Conclusion

In this post, we discussed solutions that enable petabyte-scale log analytics using OpenSearch Service in a modern data architecture. You learned how to create a serverless ingestion pipeline to deliver logs to an OpenSearch Service domain, manage indexes through ISM policies, configure IAM permissions to start using OpenSearch Ingestion, and create the pipeline configuration for data in your data lake. You also learned how to set up and use the OpenSearch Service direct queries with Amazon S3 feature (preview) to query data from your data lake.

To choose the right architecture pattern for your workloads when using OpenSearch Service at scale, consider the performance, latency, cost and data volume growth over time in order to make the right decision.

  • Use Tiered storage architecture with Index State Management policies when you need fast access to your hot data and want to balance the cost and performance with UltraWarm nodes for read-only data.
  • Use On Demand Ingestion of your data into OpenSearch Service when you can tolerate ingestion latencies to query your data not retained in your hot nodes. You can achieve significant cost savings when using compressed data in Amazon S3 and ingesting data on demand into OpenSearch Service.
  • Use Direct query with S3 feature when you want to directly analyze your operational logs in Amazon S3 with the rich analytics and visualization features of OpenSearch Service.

As a next step, refer to the Amazon OpenSearch Developer Guide to explore logs and metric pipelines that you can use to build a scalable observability solution for your enterprise applications.


About the Authors

Jagadish Kumar (Jag) is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.


Muthu Pitchaimani
is a Senior Specialist Solutions Architect with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.


Sam Selvan
is a Principal Specialist Solution Architect with Amazon OpenSearch Service.

Accelerate Amazon Redshift secure data use with Satori – Part 1

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/accelerate-amazon-redshift-secure-data-use-with-satori-part-1/

This post is co-written by Lisa Levy, Content Specialist at Satori.

Data democratization enables users to discover and gain access to data faster, improving informed data-driven decisions and using data to generate business impact. It also increases collaboration across teams and organizations, breaking down data silos and enabling cross-functional teams to work together more effectively.

A significant barrier to data democratization is ensuring that data remains secure and compliant. The ability to search, locate, and mask sensitive data is critical for the data democratization process. Amazon Redshift provides numerous features such as role-based access control (RBAC), row-level security (RLS), column-level security (CLS), and dynamic data masking to facilitate the secure use of data.

In this two-part series, we explore how Satori, an Amazon Redshift Ready partner, can help Amazon Redshift users automate secure access to data and provide their data users with self-service data access. Satori integrates natively with both Amazon Redshift provisioned clusters and Amazon Redshift Serverless for easy setup of your Amazon Redshift data warehouse in the secure Satori portal.

In part 1, we provide detailed steps on how to integrate Satori with your Amazon Redshift data warehouse and control how data is accessed with security policies.

In part 2, we will explore how to set up self-service data access with Satori to data stored in Amazon Redshift.

Satori’s data security platform

Satori is a data security platform that enables frictionless self-service access for users with built-in security. Satori accelerates implementing data security controls on datawarehouses like Amazon Redshift, is straightforward to integrate, and doesn’t require any changes to your Amazon Redshift data, schema, or how your users interact with data.

Integrating Satori with Amazon Redshift accelerates organizations’ ability to make use of their data to generate business value. This faster time-to-value is achieved by enabling companies to manage data access more efficiently and effectively.

By using Satori with the Modern Data Architecture on AWS, you can find and get access to data using a personalized data portal, and companies can set policies such as just-in-time access to data and fine-grained access control. Additionally, all data access is audited. Satori seamlessly works with native Redshift objects, external tables that can be queried through Amazon Redshift Spectrum, as well shared database objects through Redshift data sharing.

Satori anonymizes data on the fly, based on your requirements, according to users, roles, and datasets. The masking is applied regardless of the underlying database and doesn’t require writing code or making changes to your databases, data warehouses, and data lakes. Satori continuously monitors data access, identifies the location of each dataset, and classifies the data in each column. The result is a self-populating data inventory, which also classifies the data for you and allows you to add your own customized classifications.

Satori integrates with identity providers to enrich its identity context and deliver better analytics and more accurate access control policies. Satori interacts with identity providers either via API or by using the SAML protocol. Satori also integrates with business intelligence (BI) tools like Amazon QuickSight, Tableau, Power BI etc. to monitor and enforce security and privacy policies for data consumers who use BI tools to access data.

In this post, we explore how organizations can accelerate secure data use in Amazon Redshift with Satori, including the benefits of integration and the necessary steps to start. We’ll go through an example of integrating Satori with a Redshift cluster and view how security policies are applied dynamically when queried through DBeaver.

Prerequisites

You should have the following prerequisites:

  • An AWS account.
  • A Redshift cluster and Redshift Severless endpoint to store and manage data. You can create and manage your cluster through the AWS Management Console, AWS Command Line Interface (AWS CLI), or Redshift API.
  • A Satori account and the Satori connector for Amazon Redshift.
  • A Redshift security group. You’ll need to configure your Redshift security group to allow inbound traffic from the Satori connector for Amazon Redshift. Note that Satori can be deployed as a software as a service (SaaS) data access controller or within your VPC.

Prepare the data

To set up our example, complete the following steps:

  1. On the Amazon Redshift console, navigate to Query Editor v2.

If you’re familiar with SQL Notebooks, you can download this SQL notebook for the demonstration and import it to quickly get started.

  1. In the Amazon Redshift provisioned Cluster, Use the following code to create a table, populate it, and create roles and users:
-- 1- Create Schema
create schema if not exists customer_schema;

-- 2- Create customer and credit_cards table
CREATE TABLE customer_schema.credit_cards (
customer_id INT,
name TEXT,
is_fraud BOOLEAN,
credit_card TEXT
);


create table customer_schema.customer (
id INT,
first_name TEXT,
last_name TEXT,
email TEXT,
gender TEXT,
ssn TEXT
);

-- 3- Populate the tables with sample data
INSERT INTO customer_schema.credit_cards
VALUES
(100,'John Smith','n', '4532109867542837'),
(101,'Jane Doe','y', '4716065243786267'),
(102,'Mahendra Singh','n', '5243111024532276'),
(103,'Adaku Zerhouni','n', '6011011238764578'),
(104,'Miguel Salazar','n', '6011290347689234'),
(105,'Jack Docket','n', '3736165700234635');

INSERT INTO customer_schema.customer VALUES
(1,'Yorke','Khomishin','[email protected]','Male','866-95-2246'),
(2,'Tedd','Donwell','[email protected]','Male','726-62-3033'),
(3,'Lucien','Keppe','[email protected]','Male','865-28-6322'),
(4,'Hester','Arnefield','[email protected]','Female','133-72-9078'),
(5,'Abigale','Bertouloume','[email protected]','Female','780-69-6814'),
(6,'Larissa','Bremen','[email protected]','Female','121-78-7749');

-- 4-  GRANT  SELECT permissions on the table
GRANT SELECT ON customer_schema.credit_cards TO PUBLIC;
GRANT SELECT ON customer_schema.customer TO PUBLIC;

-- 5- create roles
CREATE ROLE customer_service_role;
CREATE ROLE auditor_role;
CREATE ROLE developer_role;
CREATE ROLE datasteward_role;


-- 6- create four users
CREATE USER Jack WITH PASSWORD '1234Test!';
CREATE USER Kim WITH PASSWORD '1234Test!';
CREATE USER Mike WITH PASSWORD '1234Test!';
CREATE USER Sarah WITH PASSWORD '1234Test!';


-- 7- Grant roles to above users
GRANT ROLE customer_service_role TO Jack;
GRANT ROLE auditor_role TO Kim;
GRANT ROLE developer_role TO Mike;
GRANT ROLE datasteward_role TO Sarah;
  1. Get namespaces for the Redshift provisioned cluster and Redshift Serverless endpoint

Connect to provisioned cluster through Query Editor V2 and run the following SQL:

select current_namespace; -- (Save as <producer_namespace>)

Repeat the above step for Redshift Serverless endpoint and get the namespace:

select current_namespace; -- (Save as <consumer_namespace>
  1. Connect to Redshift provisioned cluster and create an outbound data share (producer) with the following SQL
-- Creating a datashare

CREATE DATASHARE cust_share SET PUBLICACCESSIBLE TRUE;

-- Adding schema to datashare

ALTER DATASHARE cust_share ADD SCHEMA customer_schema;

-- Adding customer table to datshares. We can add all the tables also if required

ALTER DATASHARE cust_share ADD TABLE customer_schema.credit_cards;

GRANT USAGE ON DATASHARE cust_share TO NAMESPACE '<consumer_namespace>'; -- (replace with consumer namespace created in prerequisites 4)
  1. Connect to Redshift Serverless endpoint and execute the below statements to setup the inbound datashare.
CREATE DATABASE cust_db FROM DATASHARE cust_share OF NAMESPACE '< producer_namespace >'; -- (replace with producer namespace created in prerequisites 4)
  1. Optionally, create the credit_cards table as an external table by using this sample file in Amazon S3 and adding the table to AWS Glue Data Catalog through Glue Crawler. Once the table is available in Glue Data Catalog, you can create the external schema in your Amazon Redshift Serverless endpoint using the below SQL
CREATE external SCHEMA satori_external

FROM data catalog DATABASE 'satoriblog'

IAM_ROLE default

CREATE external DATABASE if not exists;

Verify that the external table credit_cards is available from your Redshift Serverless endpoint

select * from satori_external.credit_cards ;

Connect to Amazon Redshift

If you don’t have a Satori account, you can either create a test drive account or get Satori from the AWS Marketplace. Then complete the following steps to connect to Amazon Redshift:

  1. Log in to Satori.
  2. Choose Data Stores in the navigation pane, choose Add Datastore, and choose Amazon Redshift.

DatastoreSetup001

  1. Add your cluster identifier from the Amazon Redshift console. Satori will automatically detect the Region where your cluster resides within your AWS account.
  2. Satori will generate a Satori hostname for your cluster, which you will use to connect to your Redshift cluster
  3. In this demonstration, we will add a Redshift provisioned cluster and a Redshift Serverless endpoint to create two datastores in Satori

DatastoreProvisioned003

Datastore Serverless002

  1. Allow inbound access for the Satori IP addresses listed in your Redshift cluster security group.

For more details on connecting Satori to your Redshift cluster, refer to Adding an AWS Redshift Data Store to Satori.

  1. Under Authentication Settings, enter your root or superuser credentials for each datastore.

AuthenticationSettings004

  1. Leave the rest of the tabs with their default settings and choose Save.

Now your data stores are ready to be accessed through Satori.

Create a dataset

Complete the following steps to create a dataset:

  1. Choose Datasets in the navigation pane and choose Add New Dataset.
  2. Select your datastore and enter the details of your dataset.

CustomerDataset005

A dataset can be a collection of database objects that you categorize as a dataset. For Redshift provisioned cluster, we created a customer dataset with details on the database and schema. You can also optionally choose to focus on a specific table within the schema or even exclude certain schemas or tables from the dataset.

For Redshift Serverless, we created a dataset that with all datastore locations, to include the shared table and External table

ServerlessDataset006

  1. Choose Save.
  1. For each dataset, navigate to User Access Rules and create dataset user access policies for the roles we created.

UserAccessRoles007

  1. Enable Give Satori Control Over Access to the Dataset.
  2. Optionally, you can add expiration and revoke time configurations to the access policies to limit how long access is granted to the Redshift cluster.

Create a security policy for the dataset

Satori provides multiple masking profile templates that you can use as a baseline and customize before adding them to your security policies. Complete the following steps to create your security policy:

  1. Choose Masking Profiles in the navigation pane and use the Restrictive Policy template to create a masking policy.

MaskingProfiles008

  1. Provide a descriptive name for the policy.
  2. You can customize the policy further to add custom fields and their respective masking policies. The following example shows the additional field Credit Card Number that was added with the action to mask everything but the last four characters.

  1. Choose Security Policies in the navigation pane and create a security policy called Customer Data Security Policy.

  1. Associate the policy with the masking profile created in the previous step.

  1. Associate the created security policy with the datasets by editing the dataset and navigating to the Security Policies tab.

Now that the integration, policy, and access controls are set, let’s query the data through DBeaver.

Query secure data

To query your data, connect to the Redshift cluster and Redshift Serverless endpoint using their respective Satori hostname that was obtained earlier.

When you query the data in Redshift provisioned cluster, you will see the security policies applied to the result set at runtime.

When you query the data in Redshift Serverless endpoint, you will see the security policies applied to credit_cards table shared from the Redshift provisioned cluster.

You will get similar results with policies applied if you query the external table in Amazon S3 from Redshift Serverless endpoint

Summary

In this post, we described how Satori can help you with secure data access from your Redshift cluster without requiring any changes to your Redshift data, schema, or how your users interact with data. In part 2, we will explore how to set up self-service data access to data stored in Amazon Redshift with the different roles we created as part of the initial setup.

Satori is available on the AWS Marketplace. To learn more, start a free trial or request a demo meeting.


About the authors

Jagadish Kumar is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

LisaLevyLisa Levy is a Content Specialist at Satori. She publishes informative content to effectively describe how Satori’s data security platform enhances organizational productivity.

Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion tool (SCT)

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/migrate-google-bigquery-to-amazon-redshift-using-aws-schema-conversion-tool-sct/

Amazon Redshift is a fast, fully-managed, petabyte scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. Using Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers. The following are just some of the notable capabilities:

  • Amazon Redshift seamlessly integrates with broader analytics services on AWS. This enables you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. Amazon Redshift lets you build lake house architectures and then perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning (ML), and more.
  • You don’t need to worry about workloads, such as ETL, dashboards, ad-hoc queries, and so on, interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get one hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is easy-to-use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault Tolerance is inbuilt. All of the data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Any hardware failures are automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, and serverless web services-based or event-driven applications and so on.
  • Redshift ML makes it easy for data scientists to create, train, and deploy ML models using familiar SQL. They can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row and column level security controls on sensitive data, and so on.
  • Amazon Redshift integrates seamlessly with other AWS services and third-party tools. You can move, transform, load, and query large datasets quickly and reliably.

In this post, we provide a walkthrough for migrating a data warehouse from Google BigQuery to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT is a service that makes heterogeneous database migrations predictable by automatically converting the majority of the database code and storage objects to a format that is compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. Furthermore, AWS SCT can scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your BigQuery project. First, we create an Amazon Redshift database into which BigQuery data is migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert BigQuery schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from BigQuery, upload it into the S3 bucket, and then copy to Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

  1. A workstation with AWS SCT, Amazon Corretto 11, and Amazon Redshift drivers.
    1. You can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or your local desktop as a workstation. In this walkthrough, we’re using Amazon EC2 Windows instance. To create it, use this guide.
    2. To download and install AWS SCT on the EC2 instance that you previously created, use this guide.
    3. Download the Amazon Redshift JDBC driver from this location.
    4. Download and install Amazon Corretto 11.
  2. A GCP service account that AWS SCT can use to connect to your source BigQuery project.
    1. Grant BigQuery Admin and Storage Admin roles to the service account.
    2. Copy the Service account key file, which was created in the Google cloud management console, to the EC2 instance that has AWS SCT.
    3. Create a Cloud Storage bucket in GCP to store your source data during migration.

This walkthrough covers the following steps:

  • Create an Amazon Redshift Serverless Workgroup and Namespace
  • Create the AWS S3 Bucket and Folder
  • Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT
    • Connecting to the Google BigQuery Source
    • Connect to the Amazon Redshift Target
    • Convert BigQuery schema to an Amazon Redshift
    • Analyze the assessment report and address the action items
    • Apply converted schema to target Amazon Redshift
  • Migrate data using AWS SCT data extraction agents
    • Generating Trust and Key Stores (Optional)
    • Install and start data extraction agent
    • Register data extraction agent
    • Add virtual partitions for large tables (Optional)
    • Create a local migration task
    • Start the Local Data Migration Task
  • View Data in Amazon Redshift

Create an Amazon Redshift Serverless Workgroup and Namespace

In this step, we create an Amazon Redshift Serverless workgroup and namespace. Workgroup is a collection of compute resources and namespace is a collection of database objects and users. To isolate workloads and manage different resources in Amazon Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create Amazon Redshift Serverless workgroup and namespace:

  • Navigate to the Amazon Redshift console.
  • In the upper right, choose the AWS Region that you want to use.
  • Expand the Amazon Redshift pane on the left and choose Redshift Serverless.
  • Choose Create Workgroup.
  • For Workgroup name, enter a name that describes the compute resources.
  • Verify that the VPC is the same as the VPC as the EC2 instance with AWS SCT.
  • Choose Next.

  • For Namespace name, enter a name that describes your dataset.
  • In Database name and password section, select the checkbox Customize admin user credentials.
    • For Admin user name, enter a username of your choice, for example awsuser.
    • For Admin user password: enter a password of your choice, for example MyRedShiftPW2022
  • Choose Next. Note that data in Amazon Redshift Serverless namespace is encrypted by default.
  • In the Review and Create page, choose Create.
  • Create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace, as described in the following. Note that there can only be one default IAM role.
    • Navigate to the Amazon Redshift Serverless Dashboard.
    • Under Namespaces / Workgroups, choose the namespace that you just created.
    • Navigate toSecurity and encryption.
    • Under Permissions, choose Manage IAM roles.
    • Navigate to Manage IAM roles. Then, choose the Manage IAM roles drop-down and choose Create IAM role.
    • Under Specify an Amazon S3 bucket for the IAM role to access, choose one of the following methods:
      • Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with a name starting with redshift.
      • Choose Any Amazon S3 bucket to allow the created IAM role to access all of the S3 buckets.
      • Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
    • Choose Create IAM role as default. Amazon Redshift automatically creates and sets the IAM role as default.
  • Capture the Endpoint for the Amazon Redshift Serverless workgroup that you just created.

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create the S3 bucket:

  • Navigate to the Amazon S3 console
  • Choose Create bucket. The Create bucket wizard opens.
  • For Bucket name, enter a unique DNS-compliant name for your bucket (e.g., uniquename-bq-rs). See rules for bucket naming when choosing a name.
  • For AWS Region, choose the region in which you created the Amazon Redshift Serverless workgroup.
  • Select Create Bucket.
  • In the Amazon S3 console, navigate to the S3 bucket that you just created (e.g., uniquename-bq-rs).
  • Choose “Create folder” to create a new folder.
  • For Folder name, enter incoming and choose Create Folder.

Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT

To convert BigQuery schema to the Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that we created previously, and then launch AWS SCT.

Follow these steps using AWS SCT:

Connect to the BigQuery Source

  • From the File Menu choose Create New Project.
  • Choose a location to store your project files and data.
  • Provide a meaningful but memorable name for your project, such as BigQuery to Amazon Redshift.
  • To connect to the BigQuery source data warehouse, choose Add source from the main menu.
  • Choose BigQuery and choose Next. The Add source dialog box appears.
  • For Connection name, enter a name to describe BigQuery connection. AWS SCT displays this name in the tree in the left panel.
  • For Key path, provide the path of the service account key file that was previously created in the Google cloud management console.
  • Choose Test Connection to verify that AWS SCT can connect to your source BigQuery project.
  • Once the connection is successfully validated, choose Connect.

Connect to the Amazon Redshift Target

Follow these steps to connect to Amazon Redshift:

  • In AWS SCT, choose Add Target from the main menu.
  • Choose Amazon Redshift, then choose Next. The Add Target dialog box appears.
  • For Connection name, enter a name to describe the Amazon Redshift connection. AWS SCT displays this name in the tree in the right panel.
  • For Server name, enter the Amazon Redshift Serverless workgroup endpoint captured previously.
  • For Server port, enter 5439.
  • For Database, enter dev.
  • For User name, enter the username chosen when creating the Amazon Redshift Serverless workgroup.
  • For Password, enter the password chosen when creating Amazon Redshift Serverless workgroup.
  • Uncheck the “Use AWS Glue” box.
  • Choose Test Connection to verify that AWS SCT can connect to your target Amazon Redshift workgroup.
  • Choose Connect to connect to the Amazon Redshift target.

Note that alternatively you can use connection values that are stored in AWS Secrets Manager. 

Convert BigQuery schema to an Amazon Redshift

After the source and target connections are successfully made, you see the source BigQuery object tree on the left pane and target Amazon Redshift object tree on the right pane.

Follow these steps to convert BigQuery schema to the Amazon Redshift format:

  • On the left pane, right-click on the schema that you want to convert.
  • Choose Convert Schema.
  • A dialog box appears with a question, The objects might already exist in the target database. Replace?. Choose Yes.

Once the conversion is complete, you see a new schema created on the Amazon Redshift pane (right pane) with the same name as your BigQuery schema.

The sample schema that we used has 16 tables, 3 views, and 3 procedures. You can see these objects in the Amazon Redshift format in the right pane. AWS SCT converts all of the BigQuery code and data objects to the Amazon Redshift format. Furthermore, you can use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects. However, some of the objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from the Main view to the Assessment Report view as follows:

The Summary tab shows objects that were converted automatically, and objects that weren’t converted automatically. Green represents automatically converted or with simple action items. Blue represents medium and complex action items that require manual intervention.

The Action Items tab shows the recommended actions for each conversion issue. If you select an action item from the list, AWS SCT highlights the object to which the action item applies.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database/schema show you the effort required to design and implement the recommendations for converting Action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. Amazon Redshift takes some actions automatically while converting the schema to Amazon Redshift. Objects with these actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it from the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the RECORD and JSON datatype columns in BigQuery table ncaaf_referee_data to the SUPER datatype in Amazon Redshift. The partition key in the ncaaf_referee_data table is converted to the distribution key and sort key in Amazon Redshift.

Apply converted schema to target Amazon Redshift

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and then choose Apply to database.

Migrate data from BigQuery to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this walkthrough, we show how to configure AWS SCT extraction agents to extract data from BigQuery and migrate to Amazon Redshift.

First, install AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For big datasets, you can use several data extraction agents to increase the data migration speed.

Generating trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all of the data passed between the applications remains private and integral. To use SSL communication, you must generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, navigate to Settings → Global Settings → Security.
  2. Choose Generate trust and key store.
  3. Enter the name and password for trust and key stores and choose a location where you would like to store them.
  4. Choose Generate.

Install and configure Data Extraction Agent

In the installation package for AWS SCT, you find a sub-folder agent (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Listening port, enter the port number on which the agent listens. It is 8192 by default.
  2. For Add a source vendor, enter no, as you don’t need drivers to connect to BigQuery.
  3. For Add the Amazon Redshift driver, enter YES.
  4. For Enter Redshift JDBC driver file or files, enter the location where you downloaded Amazon Redshift JDBC drivers.
  5. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data. The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.
  6. For Enable SSL communication, enter yes. Choose No here if you don’t want to use SSL.
  7. For Key store, enter the storage location chosen when creating the trust and key store.
  8. For Key store password, enter the password for the key store.
  9. For Enable client SSL authentication, enter yes.
  10. For Trust store, enter the storage location chosen when creating the trust and key store.
  11. For Trust store password, enter the password for the trust store.
*************************************************
*                                               *
*     AWS SCT Data Extractor Configuration      *
*              Version 2.0.1.666                *
*                                               *
*************************************************
User name: Administrator
User home: C:\Windows\system32\config\systemprofile
*************************************************
Listening port [8192]: 8192
Add a source vendor [YES/no]: no
No one source data warehouse vendors configured. AWS SCT Data Extractor cannot process data extraction requests.
Add the Amazon Redshift driver [YES/no]: YES
Enter Redshift JDBC driver file or files: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\redshift-jdbc42-2.1.0.9.jar
Working folder [C:\Windows\system32\config\systemprofile]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject
Enable SSL communication [YES/no]: YES
Setting up a secure environment at "C:\Windows\system32\config\systemprofile". This process will take a few seconds...
Key store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftKeyStore
Key store password:
Re-enter the key store password:
Enable client SSL authentication [YES/no]: YES
Trust store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftTrustStore
Trust store password:
Re-enter the trust store password:

Starting Data Extraction Agent(s)

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, double-click C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

  • On the computer that has the extraction agent installed, from a command prompt or terminal window, run the command listed following your operating system.
  • To check the status of the agent, run the same command but replace start with status.
  • To stop an agent, run the same command but replace start with stop.
  • To restart an agent, run the same RestartAgent.bat file.

Register the Data Extraction Agent

Follow these steps to register the Data Extraction Agent:

  1. In AWS SCT, change the view to Data Migration view (other) and choose + Register.
  2. In the connection tab:
    1. For Description, enter a name to identify the Data Extraction Agent.
    2. For Host name, if you installed the Data Extraction Agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT Data Extraction Agent is installed. It’s recommended to install the Data Extraction Agents on Linux for better performance.
    3. For Port, enter the number entered for the Listening Port when installing the AWS SCT Data Extraction Agent.
    4. Select the checkbox to use SSL (if using SSL) to encrypt the AWS SCT connection to the Data Extraction Agent.
  3. If you’re using SSL, then in the SSL Tab:
    1. For Trust store, choose the trust store name created when generating Trust and Key Stores (optionally, you can skip this if SSL connectivity isn’t needed).
    2. For Key Store, choose the key store name created when generating Trust and Key Stores (optionally, you can skip this if SSL connectivity isn’t needed).
  4. Choose Test Connection.
  5. Once the connection is validated successfully, choose Register.

Add virtual partitions for large tables (optional)

You can use AWS SCT to create virtual partitions to optimize migration performance. When virtual partitions are created, AWS SCT extracts the data in parallel for partitions. We recommend creating virtual partitions for large tables.

Follow these steps to create virtual partitions:

  1. Deselect all objects on the source database view in AWS SCT.
  2. Choose the table for which you would like to add virtual partitioning.
  3. Right-click on the table, and choose Add Virtual Partitioning.
  4. You can use List, Range, or Auto Split partitions. To learn more about virtual partitioning, refer to Use virtual partitioning in AWS SCT. In this example, we use Auto split partitioning, which generates range partitions automatically. You would specify the start value, end value, and how big the partition should be. AWS SCT determines the partitions automatically. For a demonstration, on the Lineorder table:
    1. For Start Value, enter 1000000.
    2. For End Value, enter 3000000.
    3. For Interval, enter 1000000 to indicate partition size.
    4. Choose Ok.

You can see the partitions automatically generated under the Virtual Partitions tab. In this example, AWS SCT automatically created the following five partitions for the field:

    1. <1000000
    2. >=1000000 and <=2000000
    3. >2000000 and <=3000000
    4. >3000000
    5. IS NULL

Create a local migration task

To migrate data from BigQuery to Amazon Redshift, create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local migration task:

  1. In AWS SCT, under the schema name in the left pane, right-click on Standard tables.
  2. Choose Create Local task.
  3. There are three migration modes from which you can choose:
    1. Extract source data and store it on a local pc/virtual machine (VM) where the agent runs.
    2. Extract data and upload it on an S3 bucket.
    3. Choose Extract upload and copy, which extracts data to an S3 bucket and then copies to Amazon Redshift.
  4. In the Advanced tab, for Google CS bucket folder enter the Google Cloud Storage bucket/folder that you created earlier in the GCP Management Console. AWS SCT stores the extracted data in this location.
  5. In the Amazon S3 Settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket that you created earlier. The AWS SCT data extraction agent uploads the data into the S3 bucket/folder before copying to Amazon Redshift.
  6. Choose Test Task.
  7. Once the task is successfully validated, choose Create.

Start the Local Data Migration Task

To start the task, choose the Start button in the Tasks tab.

  • First, the Data Extraction Agent extracts data from BigQuery into the GCP storage bucket.
  • Then, the agent uploads data to Amazon S3 and launches a copy command to move the data to Amazon Redshift.
  • At this point, AWS SCT has successfully migrated data from the source BigQuery table to the Amazon Redshift table.

View data in Amazon Redshift

After the data migration task executes successfully, you can connect to Amazon Redshift and validate the data.

Follow these steps to validate the data in Amazon Redshift:

  1. Navigate to the Amazon Redshift QueryEditor V2.
  2. Double-click on the Amazon Redshift Serverless workgroup name that you created.
  3. Choose the Federated User option under Authentication.
  4. Choose Create Connection.
  5. Create a new editor by choosing the + icon.
  6. In the editor, write a query to select from the schema name and table name/view name you would like to verify. Explore the data, run ad-hoc queries, and make visualizations and charts and views.

The following is a side-by-side comparison between source BigQuery and target Amazon Redshift for the sports data-set that we used in this walkthrough.

Clean up up any AWS resources that you created for this exercise

Follow these steps to terminate the EC2 instance:

  1. Navigate to the Amazon EC2 console.
  2. In the navigation pane, choose Instances.
  3. Select the check-box for the EC2 instance that you created.
  4. Choose Instance state, and then Terminate instance.
  5. Choose Terminate when prompted for confirmation.

Follow these steps to delete Amazon Redshift Serverless workgroup and namespace

  1. Navigate to Amazon Redshift Serverless Dashboard.
  2. Under Namespaces / Workgroups, choose the workspace that you created.
  3. Under Actions, choose Delete workgroup.
  4. Select the checkbox Delete the associated namespace.
  5. Uncheck Create final snapshot.
  6. Enter delete in the delete confirmation text box and choose Delete.

Follow these steps to delete the S3 bucket

  1. Navigate to Amazon S3 console.
  2. Choose the bucket that you created.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket in the text input field.
  5. Choose Delete bucket.

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. Following this walkthrough, you can understand how a data migration task extracts, downloads, and then migrates data from BigQuery to Amazon Redshift. The solution that we presented in this post performs a one-time migration of database objects and data. Data changes made in BigQuery when the migration is in progress won’t be reflected in Amazon Redshift. When data migration is in progress, put your ETL jobs to BigQuery on hold or replay the ETLs by pointing to Amazon Redshift after the migration. Consider using the best practices for AWS SCT.

AWS SCT has some limitations when using BigQuery as a source. For example, AWS SCT can’t convert sub queries in analytic functions, geography functions, statistical aggregate functions, and so on. Find the full list of limitations in the AWS SCT user guide. We plan to address these limitations in future releases. Despite these limitations, you can use AWS SCT to automatically convert most of your BigQuery code and storage objects.

Download and install AWS SCT, sign in to the AWS Console, checkout Amazon Redshift Serverless, and start migrating!


About the authors

Cedrick Hoodye is a Solutions Architect with a focus on database migrations using the AWS Database Migration Service (DMS) and the AWS Schema Conversion Tool (SCT) at AWS. He works on DB migrations related challenges. He works closely with EdTech, Energy, and ISV business sector customers to help them realize the true potential of DMS service. He has helped migrate 100s of databases into the AWS cloud using DMS and SCT.

Amit Arora is a Solutions Architect with a focus on Database and Analytics at AWS. He works with our Financial Technology and Global Energy customers and AWS certified partners to provide technical assistance and design customer solutions on cloud migration projects, helping customers migrate and modernize their existing databases to the AWS Cloud.

Jagadish Kumar is an Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and enabling customers achieve success with their large-scale data projects.

Build a big data Lambda architecture for batch and real-time analytics using Amazon Redshift

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/build-a-big-data-lambda-architecture-for-batch-and-real-time-analytics-using-amazon-redshift/

With real-time information about customers, products, and applications in hand, organizations can take action as events happen in their business application. For example, you can prevent financial fraud, deliver personalized offers, and identify and prevent failures before they occur in near real time. Although batch analytics provides abilities to analyze trends and process data at scale that allow processing data in time intervals (such as daily sales aggregations by individual store), real-time analytics is optimized for low-latency analytics, ensuring that data is available for querying in seconds. Both paradigms of data processing operate in silos, which results in data redundancy and operational overhead to maintain them. A big data Lambda architecture is a reference architecture pattern that allows for the seamless coexistence of the batch and near-real-time paradigms for large-scale data for analytics.

Amazon Redshift allows you to easily analyze all data types across your data warehouse, operational database, and data lake using standard SQL. In this post, we collect, process, and analyze data streams in real time. With data sharing, you can share live data across Amazon Redshift clusters for read purposes with relative security and ease out of the box. In this post, we discuss how we can harness the data sharing ability of Amazon Redshift to set up a big data Lambda architecture to allow both batch and near-real-time analytics.

Solution overview

Example Corp. is a leading electric automotive company that revolutionized the automotive industry. Example Corp. operationalizes the connected vehicle data and improves the effectiveness of various connected vehicle and fleet use cases, including predictive maintenance, in-vehicle service monetization, usage-based insurance. and delivering exceptional driver experiences. In this post, we explore the real-time and trend analytics using the connected vehicle data to illustrate the following use cases:

  • Usage-based insurance – Usage-based insurance (UBI) relies on analysis of near-real-time data from the driver’s vehicle to access the risk profile of the driver. In addition, it also relies on the historical analysis (batch) of metrics (such as the number of miles driven in a year). The better the driver, the lower the premium.
  • Fleet performance trends – The performance of a fleet (such as a taxi fleet) relies on the analysis of historical trends of data across the fleet (batch) as well as the ability to drill down to a single vehicle within the fleet for near-real-time analysis of metrics like fuel consumption or driver distraction.

Architecture overview

In this section, we discuss the overall architectural setup for the Lambda architecture solution.

The following diagram shows the implementation architecture and the different computational layers:

  • Data ingestion from AWS IoT Core
  • Batch layer
  • Speed layer
  • Serving layer

Data ingestion

Vehicle telemetry data is ingested into the cloud through AWS IoT Core and routed to Amazon Kinesis Data Streams. The Kinesis Data Streams layer acts as a separation layer for the speed layer and batch layer, where the incoming telemetry is consumed by the speed layer’s Amazon Redshift cluster and Amazon Kinesis Data Firehose, respectively.

Batch layer

Amazon Kinesis Data Firehose is a fully managed service that can batch, compress, transform, and encrypt your data streams before loading them into your Amazon Simple Storage Service (Amazon S3) data lake. Kinesis Data Firehose also allows you to specify a custom expression for the Amazon S3 prefix where data records are delivered. This provides the ability to filter the partitioned data and control the amount of data scanned by each query, thereby improving performance and reducing cost.

The batch layer persists data in Amazon S3 and is accessed directly by an Amazon Redshift Serverless endpoint (serving layer). With Amazon Redshift Serverless, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.

The batch layer can also optionally precompute results as batch views from the immutable Amazon S3 data lake and persist them as either native tables or materialized views for very high-performant use cases. You can create these precomputed batch views using AWS Glue, Amazon Redshift stored procedures, Amazon Redshift materialized views, or other options.

The batch views can be calculated as:

batch view = function (all data)

In this solution, we build a batch layer for Example Corp. for two types of queries:

  • rapid_acceleration_by_year – The number of rapid accelerations by each driver aggregated per year
  • total_miles_driven_by_year – The total number of miles driven by the fleet aggregated per year

For demonstration purposes, we use Amazon Redshift stored procedures to create the batch views as Amazon Redshift native tables from external tables using Amazon Redshift Spectrum.

Speed layer

The speed layer processes data streams in real time and aims to minimize latency by providing real-time views into the most recent data.

Amazon Redshift Streaming Ingestion uses SQL to connect with one or more Kinesis data streams simultaneously. The native streaming ingestion feature in Amazon Redshift lets you ingest data directly from Kinesis Data Streams and enables you to ingest hundreds of megabytes of data per second and query it at exceptionally low latency—in many cases only 10 seconds after entering the data stream.

The speed cluster uses materialized views to materialize a point-in-time view of a Kinesis data stream, as accumulated up to the time it is queried. The real-time views are computed using this layer, which provide a near-real-time view of the incoming telemetry stream.

The speed views can be calculated as a function of recent data unaccounted for in the batch views:

speed view = function (recent data)

We calculate the speed views for these batch views as follows:

  • rapid_acceleration_realtime – The number of rapid accelerations by each driver for recent data not accounted for in the batch view rapid_acceleration_by_month
  • miles_driven_realtime – The number of miles driven by each driver for recent data not in miles_driven_by_month

Serving layer

The serving layer comprises an Amazon Redshift Serverless endpoint and any consumption services such as Amazon QuickSight or Amazon SageMaker.

Amazon Redshift Serverless (preview) is a serverless option of Amazon Redshift that makes it easy to run and scale analytics in seconds without the need to set up and manage data warehouse infrastructure. With Amazon Redshift Serverless, any user—including data analysts, developers, business professionals, and data scientists—can get insights from data by simply loading and querying data in the data warehouse.

Amazon Redshift data sharing enables instant, granular, and fast data access across Amazon Redshift clusters without the need to maintain redundant copies of data.

The speed cluster provides outbound data shares of the real-time materialized views to the Amazon Redshift Serverless endpoint (serving cluster).

The serving cluster joins data from the batch layer and speed layer to get near-real-time and historical data for a particular function with minimal latency. The consumption layer (such as Amazon API Gateway or QuickSight) is only aware of the serving cluster, and all the batch and stream processing is abstracted from the consumption layer.

We can view the queries to the speed layer from data consumption layer as follows:

query = function (batch views, speed views)

Deploy the CloudFormation template

We have provided an AWS CloudFormation template to demonstrate the solution. You can download and use this template to easily deploy the required AWS resources. This template has been tested in the us-east-1 Region.

The template requires you to provide the following parameters:

  • DatabaseName – The name of the first database to be created for speed cluster
  • NumberOfNodes – The number of compute nodes in the cluster.
  • NodeType – The type of node to be provisioned
  • MasterUserName – The user name that is associated with the master user account for the cluster that is being created
  • MasterUserPassword – The password that is associated with the master user account
  • InboundTraffic – The CIDR range to allow inbound traffic to the cluster
  • PortNumber – The port number on which the cluster accepts incoming connections
  • SQLForData – The source query to extract from AWS IOT Core topic

Prerequisites

When setting up this solution and using your own application data to push to Kinesis Data Streams, you can skip setting up the IoT Device Simulator and start creating your Amazon Redshift Serverless endpoint. This post uses the simulator to create related database objects and assumes use of the simulator in the solution walkthrough.

Set up the IoT Device Simulator

We use the IoT Device simulator to generate and simulate vehicle IoT data. The solution allows you to create and simulate hundreds of connected devices, without having to configure and manage physical devices or develop time-consuming scripts.

Use the following CloudFormation template to create the IoT Device Simulator in your account for trying out this solution.

Configure devices and simulations

To configure your devices and simulations, complete the following steps:

  1. Use the login information you received in the email you provided to log in to the IoT Device Simulator.
  2. Choose Device Types and Add Device Type.
  3. Choose Automotive Demo.
  4. For Device type name, enter testVehicles.
  5. For Topic, enter the topic where the sensor data is sent to AWS IoT Core.
  6. Save your settings.
  7. Choose Simulations and Add simulation.
  8. For Simulation name, enter testSimulation.
  9. For Simulation type¸ choose Automotive Demo.
  10. For Select a device type¸ choose the device type you created (testVehicles).
  11. For Number of devices, enter 15.

You can choose up to 100 devices per simulation. You can configure a higher number of devices to simulate large data.

  1. For Data transmission interval, enter 1.
  2. For Data transmission duration, enter 300.

This configuration runs the simulation for 5 minutes.

  1. Choose Save.

Now you’re ready to simulate vehicle telemetry data to AWS IoT Core.

Create an Amazon Redshift Serverless endpoint

The solution uses an Amazon Redshift Serverless endpoint as the serving layer cluster. You can set up Amazon Redshift Serverless in your account.

Set up Amazon Redshift Query Editor V2

To query data, you can use Amazon Redshift Query Editor V2. For more information, refer to Introducing Amazon Redshift Query Editor V2, a Free Web-based Query Authoring Tool for Data Analysts.

Get namespaces for the provisioned speed layer cluster and Amazon Redshift Serverless

Connect to speed-cluster-iot (the speed layer cluster) through Query Editor V2 and run the following SQL:

select current_namespace; -- (Save as <producer_namespace>)

Similarly, connect to the Amazon Redshift Serverless endpoint and get the namespace:

select current_namespace; -- (Save as <consumer_namespace>)

You can also get this information via the Amazon Redshift console.

Now that we have all the prerequisites set up, let’s go through the solution walkthrough.

Implement the solution

The workflow includes the following steps:

  1. Start the IoT simulation created in the previous section.

The vehicle IoT is simulated and ingested through IoT Device Simulator for the configured number of vehicles. The raw telemetry payload is sent to AWS IoT Core, which routes the data to Kinesis Data Streams.

At the batch layer, data is directly put from Kinesis Data Streams to Kinesis Data Firehose, which converts the data to parquet and delivers to Amazon with the prefix s3://<Bucketname>/vehicle_telematics_raw/year=<>/month=<>/day=<>/.

  1. When the simulation is complete, run the pre-created AWS Glue crawler vehicle_iot_crawler on the AWS Glue console.

The serving layer Amazon Redshift Serverless endpoint can directly access data from the Amazon S3 data lake through Redshift Spectrum external tables. In this demo, we compute batch views through Redshift Spectrum and store them as Amazon Redshift tables using Amazon Redshift stored procedures.

  1. Connect to the Amazon Redshift Serverless endpoint through Query Editor V2 and create the stored procedures using the following SQL script.
  2. Run the two stored procedures to create the batch views:
call rapid_acceleration_by_year_sp();
call total_miles_driven_by_year_sp();

The two stored procedures create batch views as Amazon Redshift native tables:

    • batchlayer_rapid_acceleration_by_year
    • batchlayer_total_miles_by_year

You can also schedule these stored procedures as batch jobs. For more information, refer to Scheduling SQL queries on your Amazon Redshift data warehouse.

At the speed layer, the incoming data stream is read and materialized by the speed layer Amazon Redshift cluster in the materialized view vehicleiotstream_mv.

  1. Connect to the provisioned speed-cluster-iot and run the following SQL script to create the required objects.

Two real-time views are created from this materialized view:

    • batchlayer_rapid_acceleration_by_year
    • batchlayer_total_miles_by_year
  1. Refresh the materialized view vehicleiotstream_mv at the required interval, which triggers Amazon Redshift to read from the stream and load data into the materialized view.
    REFRESH MATERIALIZED VIEW vehicleiotstream_mv;

Refreshes are currently manual, but can be automated using the query scheduler.

The real-time views are shared as an outbound data share by the speed cluster to the serving cluster.

  1. Connect to speed-cluster-iot and create an outbound data share (producer) with the following SQL:
    -- Create Datashare from Primary (Producer) to Serverless (Consumer)
    CREATE DATASHARE speedlayer_datashare SET PUBLICACCESSIBLE TRUE;
    ALTER DATASHARE speedlayer_datashare ADD SCHEMA public;
    ALTER DATASHARE speedlayer_datashare ADD ALL TABLES IN SCHEMA public;
    GRANT USAGE ON DATASHARE speedlayer_datashare TO NAMESPACE '<consumer_namespace>'; -- (replace with consumer namespace created in prerequisites 5)

  2. Connect to speed-cluster-iot and create an inbound data share (consumer) with the following SQL:
    CREATE DATABASE vehicleiot_shareddb FROM DATASHARE speedlayer_datashare OF NAMESPACE '< producer_namespace >'; -- (replace with producer namespace created in prerequisites 5)

Now that the real-time views are available for the Amazon Redshift Serverless endpoint, we can run queries to get real-time metrics or historical trends with up-to-date data by accessing the batch and speed layers and joining them using the following queries.

For example, to calculate total rapid acceleration by year with up-to-the-minute data, you can run the following query:

-- Rapid Acceleration By Year

select SUM(rapid_acceleration) rapid_acceleration, vin, year from 
(
select rapid_acceleration, vin,year
  from public.batchlayer_rapid_acceleration_by_year batch
union all
select rapid_acceleration, vin,year
from speedlayer_shareddb.public.speedlayer_rapid_acceleration_by_year speed)
group by VIN, year;

Similarly, to calculate total miles driven by year with up-to-the-minute data, run the following query:

-- Total Miles Driven By Year

select SUM(total_miles) total_miles_driven , year from 
(
select total_miles, year
  from public.batchlayer_total_miles_by_year batch
union all
select total_miles, year
from speedlayer_shareddb.public.speedlayer_total_miles_by_year speed)
group by year;

For only access to real-time data to power daily dashboards, you can run queries against real-time views shared to your Amazon Redshift Serverless cluster.

For example, to calculate the average speed per trip of your fleet, you can run the following SQL:

select CAST(measuretime as DATE) "date",
vin,
trip_id,
avg(vehicleSpeed)
from speedlayer_shareddb.public.vehicleiotstream_mv 
group by vin, date, trip_id;

Because this demo uses the same data as a quick start, there are duplicates in this demonstration. In actual implementations, the serving cluster manages the data redundancy and duplication by creating views with date predicates that consume non-overlapping data from batch and real-time views and provide overall metrics to the consumption layer.

You can consume the data with QuickSight for dashboards, with API Gateway for API-based access, or via the Amazon Redshift Data API or SageMaker for AI and machine learning (ML) workloads. This is not included as part of the provided CloudFormation template.

Best practices

In this section, we discuss some best practices and lessons learned when using this solution.

Provisioned vs. serverless

The speed layer is a continuous ingestion layer reading data from the IoT streams often running 24/7 workloads. There is less idle time and variability in the workloads and it is advantageous to have a provisioned cluster supporting persistent workloads that can scale elastically.

The serving layer can be provisioned (in case of 24/7 workloads) or Amazon Redshift Serverless in case of sporadic or ad hoc workloads. In this post, we assumed sporadic workloads, so serverless is the best fit. In addition, the serving layer can house multiple Amazon Redshift clusters, each consuming their data share and serving downstream applications.

RA3 instances for data sharing

Amazon Redshift RA3 instances enable data sharing to allow you to securely and easily share live data across Amazon Redshift clusters for reads. You can combine the data that is ingested in near-real time with the historical data using the data share to provide personalized driving characteristics to determine the insurance recommendation.

You can also grant fine-grained access control to the underlying data in the producer to the consumer cluster as needed. Amazon Redshift offers comprehensive auditing capabilities using system tables and AWS CloudTrail to allow you to monitor the data sharing permissions and usage across all the consumers and revoke access instantly when necessary. The permissions are granted by the superusers from both the producer and the consumer clusters to define who gets access to what objects, similar to the grant commands used in the earlier section. You can use the following commands to audit the usage and activities for the data share.

Track all changes to the data share and the shared database imported from the data share with the following code:

Select username, share_name, recordtime, action, 
         share_object_type, share_object_name 
  from svl_datashare_change_log
   order by recordtime desc;

Track data share access activity (usage), which is relevant only on the producer, with the following code:

Select * from svl_datashare_usage;

Pause and Resume

You can pause the producer cluster when batch processing is complete to save costs. The pause and resume actions on Amazon Redshift allow you to easily pause and resume clusters that may not be in operation at all times. It allows you to create a regularly-scheduled time to initiate the pause and resume actions at specific times or you can manually initiate a pause and later a resume. Flexible on-demand pricing and per-second billing gives you greater control of costs of your Redshift compute clusters while maintaining your data in a way that is simple to manage.

Materialized views for fast access to data

Materialized views allow pre-composed results from complex queries on large tables for faster access. The producer cluster exposes data as materialized views to simplify access for the consumer cluster. This also allows flexibility at the producer cluster to update the underlying table structure to address new business use cases, without affecting consumer-dependent queries and enabling a loose coupling.

Conclusion

In this post, we demonstrated how to process and analyze large-scale data from streaming and batch sources using Amazon Redshift as the core of the platform guided by the Lambda architecture principles.

You started by collecting real-time data from connected vehicles, and storing the streaming data in an Amazon S3 data lake through Kinesis Data Firehose. The solution simultaneously processes the data for near-real-time analysis through Amazon Redshift streaming ingestion.

Through the data sharing feature, you were able to share live, up-to-date data to an Amazon Redshift Serverless endpoint (serving cluster), which merges the data from the speed layer (near-real time) and batch layer (batch analysis) to provide low-latency access to data from near-real-time analysis to historical trends.

Click here to get started with this solution today and let us know how you implemented this solution in your organization through the comments section.


About the Authors

Jagadish Kumar is a Sr Analytics Specialist Solutions Architect at AWS. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS. He is an avid college football fan and enjoys reading, watching sports and riding motorcycle.

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Eesha Kumar is an Analytics Solutions Architect with AWS. He works with customers to realize business value of data by helping them building solutions leveraging AWS platform and tools.