Tag Archives: Amazon Redshift Spectrum

Automate data archival for Amazon Redshift time series tables

Post Syndicated from Nita Shah original https://aws.amazon.com/blogs/big-data/automate-data-archival-for-amazon-redshift-time-series-tables/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all of your data using standard SQL. Tens of thousands of customers today rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the most widely used cloud data warehouse. You can run and scale analytics in seconds on all your data without having to manage your data warehouse infrastructure.

A data retention policy is part of an organization’s overall data management. In a big data world, the size of data is consistently increasing, which directly affects the cost of storing the data in data stores. It’s necessary to keep optimizing your data in data warehouses for consistent performance, reliability, and cost control. It’s crucial to define how long an organization needs to hold on to specific data, and if data that is no longer needed should be archived or deleted. The frequency of data archival depends on the relevance of the data with respect to your business or legal needs.

Data archiving is the process of moving data that is no longer actively used in a data warehouse to a separate storage device for long-term retention. Archive data consists of older data that is still important to the organization and may be needed for future reference, as well as data that must be retained for regulatory compliance.

Data purging is the process of freeing up space in the database or deleting obsolete data that isn’t required by the business. The purging process can be based on the data retention policy, which is defined by the data owner or business need.

This post walks you through the process of how to automate data archival and purging of Amazon Redshift time series tables. Time series tables retain data for a certain period of time (days, months, quarters, or years) and need data to be purged regularly to maintain the rolling data to be analyzed by end-users.

Solution overview

The following diagram illustrates our solution architecture.

We use two database tables as part of this solution.

The arch_table_metadata database table stores the metadata for all the tables that need to be archived and purged. You add a row to this table for each table that you want to archive and purge. The arch_table_metadata table contains the following columns.

ColumnName Description
id Database-generated, automatically assigns a unique value to each record.
schema_name Name of the database schema of the table.
table_name Name of the table to be archived and purged.
column_name Name of the date column that is used to identify records to be archived and purged.
s3_uri Amazon S3 location where the data will be archived.
retention_days Number of days the data will be retained for the table. Default is 90 days.

The arch_job_log database table stores the run history of stored procedures. Records are added to this table by the stored procedure. It contains the following columns.

ColumnName Description
job_run_id Assigns unique numeric value per stored procedure run.
arch_table_metadata_id Id column value from table arch_table_metadata.
no_of_rows_bfr_delete Number of rows in the table before purging.
no_of_rows_deleted Number of rows deleted by the purge operation.
job_start_time Time in UTC when the stored procedure started.
job_end_time Time in UTC when the stored procedure ended.
job_status Status of the stored procedure run: IN-PROGRESS, COMPLETED, or FAILED.

Prerequisites

For this solution, complete the following prerequisites:

  1. Create an Amazon Redshift provisioned cluster or Amazon Redshift serverless workgroup.
  2. In Amazon Redshift query editor v2 or a compatible SQL editor of your choice, create the tables arch_table_metadata and arch_job_log. Use the following code for the table DDLs:
    create table arch_table_metadata
    (
    id integer identity(0,1) not null, 
    schema_name varchar(100) not null, 
    table_name varchar(100) not null, 
    column_name varchar(100) not null,
    s3_uri varchar(1000) not null,
    retention_days integer default 90
    );
    
    create table arch_job_log
    (
    job_run_id bigint not null, 
    arch_table_metadata_id  integer not null,
    no_of_rows_bfr_delete bigint,
    no_of_rows_deleted bigint,
    table_arch_start_time timestamp default current_timestamp,
    table_arch_end_time timestamp default current_timestamp,
    job_start_time timestamp default current_timestamp,
    job_end_time timestamp default current_timestamp,
    job_status varchar(20)
    );
    

  3. Create the stored procedure archive_data_sp with the following code snippet. The stored procedure takes the AWS Identity and Access Management (IAM) role ARN as an input argument if you’re not using the default IAM role. If you’re using the default IAM role for your Amazon Redshift cluster, you can pass the input parameter as default. For more information, refer to Creating an IAM role as default in Amazon Redshift.
    CREATE OR REPLACE PROCEDURE archive_data_sp(p_iam_role IN varchar(256))
    AS $$
    DECLARE
    
    v_command           varchar(2000);
    v_sql               varchar(2000);
    v_count_sql         text;
    
    v_table_id          int;
    v_schema_name       text;
    v_table_name        text;
    v_column_name       text;
    v_s3_bucket_url     text;
    v_s3_folder_name_prefix     text;
    v_retention_days            int = 0;
    v_no_of_rows_before_delete  int = 0;
    v_no_of_deleted_rows        int = 0;
    v_job_start_time            timestamp;
    v_job_status                int = 1;
    v_job_id                    int = 0;
    
    
    table_meta_data_cur CURSOR FOR
    SELECT id, schema_name, table_name, column_name,s3_uri,retention_days
    FROM arch_table_metadata;
    
    BEGIN
    
        SELECT NVL(MAX(job_run_id),0) + 1 INTO v_job_id FROM arch_job_log;
        RAISE NOTICE '%', v_job_id;
    
        OPEN table_meta_data_cur;
        FETCH table_meta_data_cur INTO v_table_id,v_schema_name, v_table_name, v_column_name, v_s3_bucket_url, v_retention_days;
        WHILE v_table_id IS NOT NULL LOOP
    
            v_count_sql = 'SELECT COUNT(*) AS v_no_of_rows_before_delete FROM ' || v_schema_name || '.' || v_table_name;
            RAISE NOTICE '%', v_count_sql;
            EXECUTE v_count_sql INTO v_no_of_rows_before_delete;
            RAISE NOTICE 'v_no_of_rows_before_delete %', v_no_of_rows_before_delete;
    
            v_job_start_time = GETDATE();
            v_s3_folder_name_prefix = v_schema_name || '.' || v_table_name || '/';
            v_sql = 'SELECT * FROM ' || v_schema_name || '.' || v_table_name || ' WHERE ' || v_column_name || ' <= DATEADD(DAY,-' || v_retention_days || ',CURRENT_DATE)';
    
            IF p_iam_role = 'default' THEN
                v_command = 'UNLOAD (''' || v_sql ||  ''') to ''' || v_s3_bucket_url || v_s3_folder_name_prefix || ''' IAM_ROLE default  PARQUET PARTITION BY (' || v_column_name || ') INCLUDE ALLOWOVERWRITE';
            ELSE
                v_command = 'UNLOAD (''' || v_sql ||  ''') to ''' || v_s3_bucket_url || v_s3_folder_name_prefix || ''' IAM_ROLE ''' || p_iam_role || ''' PARQUET PARTITION BY (' || v_column_name || ') INCLUDE ALLOWOVERWRITE';
            END IF;
            RAISE NOTICE '%', v_command;
            EXECUTE v_command;
    
            v_sql := 'DELETE FROM ' || v_schema_name || '.' || v_table_name || ' WHERE ' || v_column_name || ' <= DATEADD(DAY,-' || v_retention_days || ',CURRENT_DATE)';
            RAISE NOTICE '%', v_sql;
            EXECUTE v_sql;
    
            GET DIAGNOSTICS v_no_of_deleted_rows := ROW_COUNT;
            RAISE INFO '# of rows deleted = %', v_no_of_deleted_rows;
    
            v_sql = 'INSERT INTO arch_job_log (job_run_id, arch_table_metadata_id ,no_of_rows_bfr_delete,no_of_rows_deleted,job_start_time,job_end_time,job_status) VALUES ('
                        || v_job_id || ',' || v_table_id || ',' || v_no_of_rows_before_delete || ',' || v_no_of_deleted_rows || ',''' || v_job_start_time || ''',''' || GETDATE() || ''',' || v_job_status || ')';
            RAISE NOTICE '%', v_sql;
            EXECUTE v_sql;
    
            FETCH table_meta_data_cur INTO v_table_id,v_schema_name, v_table_name, v_column_name, v_s3_bucket_url, v_retention_days;
        END LOOP;
        CLOSE table_meta_data_cur;
    
        EXCEPTION
        WHEN OTHERS THEN
            RAISE NOTICE 'Error - % ', SQLERRM;
    END;
    $$ LANGUAGE plpgsql;
    

Archival and purging

For this use case, we use a table called orders, for which we want to archive and purge any records older than 30 days.

Use the following DDL to create the table in the Amazon Redshift cluster:

create table orders (
  O_ORDERKEY bigint NOT NULL,
  O_CUSTKEY bigint,
  O_ORDERSTATUS varchar(1),
  O_TOTALPRICE decimal(18,4),
  O_ORDERDATE Date,
  O_ORDERPRIORITY varchar(15),
  O_CLERK varchar(15),
  O_SHIPPRIORITY Integer,
  O_COMMENT varchar(79))
distkey (O_ORDERKEY)
sortkey (O_ORDERDATE);

The O_ORDERDATE column makes it a time series table, which you can use to retain the rolling data for a certain period.

To load the data into the orders table using the following COPY command, you need to have a default IAM role attached to your Redshift cluster, or replace the default keyword in the COPY command with the ARN of an IAM role attached to the Redshift cluster:

copy orders from 's3://redshift-immersionday-labs/data/orders/orders.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

When you query the table, you can see that this data is for 1998. To test this solution, you need to manually update some of the data to the current date by running the following SQL statement:

update orders set O_ORDERDATE = current_date where O_ORDERDATE < '1998-08-02';

The table looks like the following screenshot after running the update statement.

Now let’s run the following SQL to get the number of records to be archived and purged:

select count (*) from orders where O_ORDERDATE <= DATEADD(DAY,-30,CURRENT_DATE)

Before running the stored procedure, we need to insert a row into the arch_table_metadata table for the stored procedure to archive and purge records in the orders table. In the following code, provide the Amazon Simple Storage Service (Amazon S3) bucket name where you want to store the archived data:

INSERT INTO arch_table_metadata (schema_name, table_name, column_name, s3_uri, retention_days) VALUES ('public', 'orders', 'O_ORDERDATE', 's3://<your-bucketname>/redshift_data_archival/', 30);

The stored procedure performs the following high-level steps:

  1. Open a cursor to read and loop through the rows in the arch_table_metadata table.
  2. Retrieve the total number of records in the table before purging.
  3. Export and archive the records to be deleted into the Amazon S3 location as specified in the s3_uri column value. Data is partitioned in Amazon S3 based on the column_name field in arch_table_metadata. The stored procedure uses the IAM role passed as input for the UNLOAD operation.
  4. Run the DELETE command to purge the identified records based on the retention_days column value.
  5. Add a record in arch_job_log with the run details.

Now let’s run the stored procedure via the CALL statement, passing a role ARN as the input parameter, and then verify that the data was archived and purged correctly:

call archive_data_sp('arn:aws:iam::<your-account-id>:role/RedshiftRole-7OR1UWVPFI5J');

As shown in the following screenshot, the stored procedure ran successfully.

Now let’s validate the table was purged successfully by running the following SQL:

select count (*) from orders where O_ORDERDATE <= DATEADD(DAY,-30,CURRENT_DATE)

We can navigate to the Amazon S3 location to validate the archival process. The following screenshot shows the data has been archived into the Amazon S3 location specified in the arch_table_metadata table.

Now let’s run the following SQL statement to look at the stored procedure run log entry:

select a.* from arch_job_log a, arch_table_metadata b
where a.arch_table_metadata_id = b.id
and b.table_name = 'orders'

The following screenshot shows the query results.

In this example, we demonstrated how you can set up and validate your Amazon Redshift table archival and purging process.

Schedule the stored procedure

Now that you have learned how to set up and validate your Amazon Redshift tables for archival and purging, you can schedule this process. For instructions on how to schedule a SQL statement using either the AWS Management Console or the AWS Command Line Interface (AWS CLI), refer to Scheduling SQL queries on your Amazon Redshift data warehouse.

Archive data in Amazon S3

As part of this solution, data is archived in an S3 bucket before it’s deleted from the Amazon Redshift table. This helps reduce the storage on the Amazon Redshift cluster and enables you to analyze the data for any ad hoc requests without needing to load back into the cluster. In the stored procedure, the UNLOAD command exports the data to be purged to Amazon S3, partitioned by the date column, which is used to identify the records to purge. To save costs on Amazon S3 storage, you can manage the storage lifecycle with Amazon S3 lifecycle configuration.

Analyze the archived data in Amazon S3 using Amazon Redshift Spectrum

With Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3, and easily analyze the archived data in Amazon S3 without having to load it back into Amazon Redshift tables. For further analysis of your archived data (cold data) and frequently accessed data (hot data) on the cluster’s local disk, you can run queries joining Amazon S3 archived data with tables that reside on the Amazon Redshift cluster’s local disk. The following diagram illustrates this process.

Let’s take an example where you want to view the number of orders for the last 2 weeks of December 1998, which is archived in Amazon S3. You need to complete the following steps using Redshift Spectrum:

  1. Create an external schema in Amazon Redshift (a sketch of the external schema and table DDL follows these steps).
  2. Create a late-binding view to refer to the underlying Amazon S3 files with the following query:
    create view vw_orders_hist as select count(*) as order_count, o_orderdate
    from <external_schema>.orders
    where o_orderdate between '1998-12-15' and '1998-12-31' group by 2
    with no schema binding;
    

  3. To see a unified view of the orders historical data archived in Amazon S3 and the current data stored in the Amazon Redshift local table, you can use a UNION ALL clause to join the Amazon Redshift orders table and the Redshift Spectrum orders table:
    create view vw_orders_unified as 
    select * from <external_schema>.orders
    union all
    select * from public.orders
    with no schema binding;
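
To query the archived Parquet files with Redshift Spectrum, you first need an external schema and, if the files aren’t already cataloged by an AWS Glue crawler, an external table over the archive location. The following is a minimal sketch that assumes the S3 path used earlier and placeholder names (archive_ext, redshift_archive); adjust the IAM role, database name, and partition values for your environment:

create external schema if not exists archive_ext
from data catalog
database 'redshift_archive'
iam_role default
create external database if not exists;

create external table archive_ext.orders (
  o_orderkey bigint,
  o_custkey bigint,
  o_orderstatus varchar(1),
  o_totalprice decimal(18,4),
  o_orderpriority varchar(15),
  o_clerk varchar(15),
  o_shippriority integer,
  o_comment varchar(79))
partitioned by (o_orderdate date)
stored as parquet
location 's3://<your-bucketname>/redshift_data_archival/public.orders/';

-- Register each partition written by the archival job (repeat per partition value)
alter table archive_ext.orders
add if not exists partition (o_orderdate='1998-12-15')
location 's3://<your-bucketname>/redshift_data_archival/public.orders/o_orderdate=1998-12-15/';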

To learn more about the best practices for Redshift Spectrum, refer to Best Practices for Amazon Redshift Spectrum.

Best practices

The following are some best practices to reduce your storage footprint and optimize performance of your workloads:

Conclusion

In this post, we demonstrated the automatic archival and purging of data in Amazon Redshift tables to meet your compliance and business requirements, thereby optimizing your application performance and reducing storage costs. As an administrator, you can start working with application data owners to identify retention policies for Amazon Redshift tables to achieve optimal performance, prevent any storage issues specifically for DS2 and DC2 nodes, and reduce overall storage costs.


About the authors

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 15 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Prathap Thoguru is an Enterprise Solutions Architect at Amazon Web Services. He has over 15 years of experience in the IT industry and is a 9x AWS certified professional. He helps customers migrate their on-premises workloads to the AWS Cloud.

Accelerate self-service analytics with Amazon Redshift Query Editor V2

Post Syndicated from Bhanu Pittampally original https://aws.amazon.com/blogs/big-data/accelerate-self-service-analytics-with-amazon-redshift-query-editor-v2/

Amazon Redshift is a fast, fully managed cloud data warehouse. Tens of thousands of customers use Amazon Redshift as their analytics platform. Users such as data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift provides a web-based query editor in addition to supporting connectivity via ODBC/JDBC or the Redshift Data API. Query Editor V2 lets users explore, analyze, and collaborate on data. You can use Query Editor V2 to create databases, schemas, and tables, and to load data from Amazon Simple Storage Service (Amazon S3) using either the COPY command or a wizard. You can browse multiple databases and run queries on your Amazon Redshift data warehouse or data lake, or run federated queries to operational databases such as Amazon Aurora.

From the smallest start-ups to worldwide conglomerates, customers across the spectrum tell us they want to promote self-service analytics by empowering their end-users, such as data analysts and business analysts, to load data into their analytics platform. Analysts at these organizations create tables and load data in their own workspace, and they join that with the curated data available from the data warehouse to gain insight. This post will discuss how Query Editor V2 accelerates self-service analytics by enabling users to create tables and load data with simple wizards.

The Goal to Accelerate and Empower Data Analysts

A common practice that we see across enterprises today is that more and more enterprises are letting data analysts or business analysts load data into their user or group workspaces that co-exist on data warehouse platforms. Enterprises call these personal workspaces, departmental schemas, project-based schemas, labs, and so on. The idea of this approach is to empower data analysts to load data sets by themselves and join curated data sets on a data warehouse platform to accelerate the data analysis process.

Amazon Redshift Query Editor V2 makes it easy for administrators to create the workspaces, and it enables data analysts to create and load data into the tables. Query Editor V2 also lets you easily create external schemas in the Redshift cluster to extend the data warehouse to a data lake, thereby accelerating analytics.

An example Use case

Let’s assume that an organization has a marketing department with some power users and regular users. In this example, let’s also consider that the organization already has an Enterprise Data Warehouse (EDW) powered by Amazon Redshift. The marketing department would like to have a workspace created for their team members.

A visual depiction of a Data Warehouse Environment may look like the following figure. Enterprises let user/group schemas be created along with an EDW, which contains curated data sets. Analysts can create and load exploratory data sets into user schemas, and then join curated data sets available in the EDW.


Amazon Redshift provides several options to isolate your users’ data from the enterprise data warehouse data. Amazon Redshift data sharing lets you share data from your EDW cluster with a separate consumer cluster. Your users can consume the EDW data and create their own workspace in the consumer cluster. Alternatively, you can create a separate database for your users’ group workspace in the same cluster, and then isolate each user group to have their own schema. Amazon Redshift supports queries that join data across databases, so users can join their tables with the curated data in the EDW. We recommend the data sharing option because it lets you isolate both compute and data. Query Editor V2 supports both scenarios.
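
For reference, setting up a data share between the EDW (producer) cluster and a consumer cluster uses SQL along the following lines. This is a minimal sketch with placeholder names (edw_share, curated, and the namespace GUIDs), not commands taken from this post:

-- On the producer (EDW) cluster
create datashare edw_share;
alter datashare edw_share add schema curated;
alter datashare edw_share add all tables in schema curated;
grant usage on datashare edw_share to namespace '<consumer-namespace-guid>';

-- On the consumer cluster
create database edw_curated from datashare edw_share of namespace '<producer-namespace-guid>';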

Once you have enabled your data analysts to have their own workspace and granted the relevant privileges, they can easily create schemas and tables and load data.

Prerequisites

  1.  You have an Amazon Redshift cluster, and you have configured the Query Editor V2. You can view the Simplify Data Analysis with Amazon Redshift Query Editor V2 post for instructions on setting up Query Editor V2.
  2. For loading your data from Amazon S3 into Amazon Redshift, start by creating an IAM role that provides permissions to access Amazon S3 and attach that role to the Redshift cluster. Users can then use the IAM role attached to the Redshift cluster as the default role when loading data. You can find the instructions in the Redshift getting started guide.
  3. For users who want to load data from Amazon S3, Query Editor V2 provides an option to browse S3 buckets. To use this feature, users should have List permission on the S3 bucket.

Create Schemas

Query Editor V2 supports schema creation actions. Admins can create both native and external schemas using the Create Schema wizard.


As a user, you can easily create a schema by opening the Create Schema wizard from the Create button and then selecting Schema from the drop-down list, as shown in the following screenshot.

If you select Schema from the drop-down list, a Create Schema wizard similar to the following screenshot is displayed. You can choose a local schema and provide a schema name.

Optionally, you can authorize a user to create objects in the schema. When the Authorize user check box is selected, Create and Usage access are granted to that user. Now, Janedoe can create objects in this schema.
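
The grants performed by the wizard are roughly equivalent to the following SQL; the schema name marketing_ws is a placeholder for whatever name you choose in the wizard:

-- Create the workspace schema and let the analyst create and use objects in it
create schema marketing_ws;
grant create, usage on schema marketing_ws to janedoe;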

Let’s assume that the analyst user Janedoe logs in to Query Editor V2, connects to the database, and wants to create a table and load data into their personal workspace.

Creating Tables

The Query Editor V2 provides a Create table wizard for users to create a table quickly. It allows power users to auto-create a table based on a data file. Users can upload the file from their local machine and let Query Editor V2 figure out the data types and column widths. Optionally, you can change the column definition, such as encoding and table properties.

Below is a sample CSV file with a row header and sample rows from the MarketingCampaign.csv file. We will demonstrate how to create a table based on this file in the following steps.


The following screenshot shows the uploading of the MarketingCampaign.csv file into Query Editor V2.

Create Table Wizard has two sections:

  1. Columns

The Columns tab lets users select a file from their local desktop and upload it to Query Editor V2. Users can choose Schema from the drop-down option and provide a table name.

Query Editor V2 automatically infers columns and some data types for each column. It has several options to choose from to set as column properties. For example, you have the option to change column names, data types, and encoding. If you do not choose any encoding option, then the encoding choice will be selected automatically. You can also add new fields, for example, an auto-increment ID column, and define properties for that particular identity column.

  2. Table Details

You can use the Create Table wizard to create a temporary table or a regular table, with the option of including the regular table in automatic backups. A temporary table is available until the end of the session and can be used in queries; it is automatically dropped when the user session ends.

The Table Details section is optional, because Amazon Redshift’s Automatic Table Optimization feature takes care of the distribution key and sort key on behalf of users.

Viewing the Create Table statement

Once the column and table-level details are set, Query Editor V2 gives you an option to view the Create table statement in the Query Editor tab. This lets users save the definition for later use or share it with other users. After reviewing the create table definition, the user can choose Run to run the query. Users can also create the table directly from the Create table wizard.

The following screenshot shows the Create table definition for the marketing campaign data set.
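
A hedged sketch of what the generated statement could look like follows. The column names and types are assumptions based on a typical marketing campaign file (the post only confirms the Dt_Customer column), so the wizard output for your file will differ:

create table marketing_ws.marketing_campaign (
  id integer,
  year_birth integer,
  education varchar(50),
  marital_status varchar(50),
  income decimal(12,2),
  dt_customer date,
  recency integer,
  response integer)
sortkey (dt_customer);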


Query Editor V2 lets users view table definitions in a table format. The following screenshot displays the table that we created earlier. Note that Redshift automatically inferred the encoding type for each column. Following best practice, it skipped encoding for Dt_Customer because that column was set as the sort key. When creating the table, we did not set encodings for the columns, because Redshift automatically sets the best compression method for each column.

Query Editor V2 distinguishes columns by data types in a table by using distinct icons for them.

You can also view the table definition by right-clicking on the table and selecting the show definition option. You can also generate a template select command, and drop or truncate the table by right-clicking on a table.

Loading Data

Now that we have created a schema and a table, let’s learn how to upload the data to the table that we created earlier.

Query Editor V2 provides you with the ability to load data from S3 buckets into Redshift tables. The COPY command is recommended to load data in Amazon Redshift because it leverages the massively parallel processing capabilities of Redshift.

The Load Data wizard in the Query Editor V2 loads data into Redshift by generating the COPY command. As a data analyst, you don’t have to remember the intricacies of the COPY command.

You can quickly load data from CSV, JSON, ORC, or Parquet files into an existing table using the Load Data wizard. It supports all of the options of the COPY command and lets data analysts build a COPY command with an easy-to-use GUI.

The following screenshot shows an S3 bucket that has our MarketingCampaign.csv file. This is a much larger file than the one we used to create the table with the Create table wizard. We will use this file to walk you through the Load Data wizard.


The Load Data wizard lets you browse your available S3 bucket and select a file or folder from the S3 bucket. You can also use a manifest file. A manifest file lets you make sure that all of the files are loaded using the COPY command. You can find more information about manifest files here.

The Load Data Wizard lets you enter several properties, such as the Redshift Cluster IAM role and whether data is encrypted. You can also set file options. For example, in the case of CSV, you can set delimiter and quote parameters. If the file is compressed, then you can provide compression settings.

With the Data Conversion Parameters, you can select options like Escape Characters, time format, and if you want to ignore the header in your data file. The Load Operations option lets you set compression encodings and error handling options.

Query Editor V2 lets you browse S3 objects, thereby making it easier to navigate buckets, folders, and files. The following screens display the flow.

Query Editor V2 supports loading data in many open formats, such as JSON, Delimiter, FixedWidth, AVRO, Parquet, ORC, and Shapefile.

In our example, we are loading CSV files. As you can see, we have selected our MarketingCampaign.csv file, set the Region, and then selected the Redshift cluster IAM role.

For the CSV file, under additional file options, the Delimiter Character is set to ";" and the Quote Character is left empty, as shown in the following screen.

Once the required parameters are set, continue to the next step to load the data. The Load Data operation builds a COPY command, automatically loads it into the Query Editor tab, and then runs the query.
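
The generated statement follows standard COPY syntax. A hedged sketch for this file, using placeholder bucket, role, and Region values along with the delimiter and header options described above, might look like the following:

copy marketing_ws.marketing_campaign
from 's3://<your-bucketname>/MarketingCampaign.csv'
iam_role 'arn:aws:iam::<your-account-id>:role/<your-redshift-role>'
region '<your-region>'
csv delimiter ';'
ignoreheader 1;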


Data is loaded into the target table successfully, and now you can run a query to view that data. The following screen shows the result of the select query executed on our target table:


Viewing load errors

If your COPY command fails, the errors are logged in the STL_LOAD_ERRORS system table. Query Editor V2 simplifies viewing common errors by showing them in place, as shown in the following screenshot:
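
You can also query the system table directly. A minimal check of the most recent load errors might look like this:

select query, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;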


Saving and reusing the queries

You can save the load query for future use by choosing the save option and providing a name for the saved query.

You will probably want to reuse the load query in the future to load data from another S3 location. In that case, you can parameterize the query by replacing the S3 URL with a parameter, as shown in the following screenshot:


You can save the query, and then share the query with another user.

When you or other users run the query, a prompt for the parameter will appear as in the following screenshot:


We discussed how data analysts could load data into their own or the group’s workspace.

We will now discuss using Query Editor V2 to create an external schema to extend your data warehouse to the data lake.

Extending the Data Warehouse to the Data Lake

Extending data warehouses to data lakes is part of modern data architecture practices. Amazon Redshift enables this with seamless integration with data lakes running on AWS, using Redshift Spectrum to power the extension. You can access data lakes from the Redshift data warehouse by creating Redshift external schemas.

Query Editor V2 lets you create an external schema referencing an external database in the AWS Glue Data Catalog.

To extend your data warehouse to a data lake, you should have an S3 data lake and an AWS Glue Data Catalog database defined for the data lake, and you should grant permissions on AWS Glue to the Redshift cluster role. You can find more information about external schemas here.

You can navigate to Create External Schema by using the Create Schema wizard and then selecting External Schema, as shown in the following screenshot:

Query Editor V2 makes the schema creation experience easy by hiding the intricacies of the create external schema syntax. You can use the simple interface to provide the required parameters, such as the AWS Glue Data Catalog Region, the external database name, and the IAM role. You can browse the Glue Catalog and view the database names.

After you use the create schema option, you can see the schemas in the tree-view. Query Editor V2 uses distinct icons to distinguish between native schemas and external schemas.

Viewing External Table Definitions

The Query Editor V2 lets data analysts quickly view objects available in external databases and understand their metadata.

You can view tables and columns for a given table by clicking on external Schema and then on a table. When a particular table is selected, its metadata information is displayed in the bottom portion below the tree-view panel. This is a powerful feature, as an analyst can easily understand the data residing externally in the data lake.

You can now run queries against external tables in the external Schema.

In our fictitious enterprise, Marketing Department team members can load data in their own workspace and join the data from their own user/group workspace with the curated data in the enterprise data warehouse or data lake.
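
For example, a marketing analyst might join an exploratory table in their workspace with a curated table in the EDW. The names below are placeholders, and the three-part name applies when the curated data lives in another database (for example, one created from a datashare):

select c.customer_segment, count(*) as responders
from marketing_ws.marketing_campaign mc
join <edw_database>.curated.customer_dim c
  on c.customer_id = mc.id
where mc.response = 1
group by c.customer_segment;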

Conclusion

This post demonstrated how the Query Editor V2 enabled data analysts to create tables and load data from Amazon S3 easily with a simple wizard.

We also discussed how Query Editor V2 lets you extend the data warehouse to a data lake. The data analysts can easily browse tables in your local data warehouse, data shared from another cluster, or tables in the data lake. You can run queries that can join tables in your data warehouse and data lake. The Query Editor V2 also provides several features for the collaboration of query authoring. You can view the earlier blog to learn more about how the Query Editor V2 simplifies data analysis.

These features let organizations accelerate self-service analytics and enable end-users to deliver insights faster.

Happy querying!


About the Authors

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouse architecture, development, and administration. He has been in the data and analytics field for over 13 years. His LinkedIn profile is here.

Debu Panda is a Principal Product Manager at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Cansu Aksu is a Front End Engineer at AWS with several years of experience in developing user interfaces. She is detail oriented, eager to learn, and passionate about delivering products and features that solve customer needs and problems.

Chengyang Wang is a Frontend Engineer on the Redshift Console team. He has worked on a number of new features delivered by Redshift in the past 2 years. He strives to deliver high-quality products and aims to improve the customer experience from the UI.

Speed up data ingestion on Amazon Redshift with BryteFlow

Post Syndicated from Pradnya Bhandary original https://aws.amazon.com/blogs/big-data/speed-up-data-ingestion-on-amazon-redshift-with-bryteflow/

This is a guest post by Pradnya Bhandary, Co-Founder and CEO at Bryte Systems.

Data can be transformative for an organization. How and where you store your data for analysis and business intelligence is therefore an especially important decision that each organization needs to make. Should you choose an on-premises data warehouse solution or embrace the cloud?

On-premises data warehouses require servers to be in your data center, a team to manage them, and a large initial investment that tries to accommodate current and future data needs. Cloud data warehouses, on the other hand, are fully managed and can start with a small investment and grow as your business demands. You don’t need to provision for the future, but for present needs. This knowledge brings peace of mind that extra capacity can be added overnight, if needed.

One such cloud data warehouse is Amazon Redshift. Amazon Redshift is the most popular cloud data warehouse. It’s a fully managed, petabyte-scale cloud-based data warehouse product designed for large-scale dataset storage and analysis. It provides agility, flexibility, and cost-effectiveness.

To use Amazon Redshift effectively, getting your data efficiently from various data silos to the data warehouse is critical, because this determines how quickly you can access the data. Do you have to wait until the data is loaded and make do with stale data, or can you access your data across the organization in near-real time to derive fresh and rich data insights?

In this post, we explain how Origin Energy followed some of the best practices for data ingestion for Amazon Redshift and how they achieved faster ingestion rates into Amazon Redshift using the BryteFlow software.

Origin Energy

Origin Energy, a leading energy provider, was finding that its on-premises data warehouse was struggling to support growing data demands. They also needed to unlock siloed data from legacy databases like SAP, Oracle, SQL Server, MySQL, and more. Access to data was fragmented and time consuming.

Moving to a cloud enterprise analytics environment with centralized data access was the only viable option to support their data initiatives. For more information, see the Origin Energy case study.

To speed up data ingestion on Amazon Redshift, they followed data ingestion best practices.

Log-based CDC mechanism to get data to Amazon Redshift

When data is replicated from a source database to a target that could be another database, data warehouse, or cloud data storage object, changes to the data in the source need to be captured and replicated to the destination to keep data consistent and trustworthy. Change data capture (CDC) makes this possible. CDC captures the changes in the source data and updates only the data in the destination that has changed. This does away with the tedious task of bulk load updating and enables real-time data integration. It’s a continual, extremely reliable process and has no impact on source systems.

Parallel multi-threaded initial sync

When doing an initial ingest of data, especially of exceptionally large datasets to Amazon Redshift, parallel, multi-thread syncing to replicate data is extremely helpful in cutting down total data ingestion time. Data replication proceeds in parallel on multiple threads with optimized extraction and loading.

Parallel multi-threaded log-based capture and merge for Oracle

When source systems generate a large amount of incremental data, and transactional logs are written very often, the CDC mechanism or mining of logs for incremental data can lag behind if it can’t keep up with the source throughput. You can configure BryteFlow to have multiple parallel threads for mining. Furthermore, you can configure BryteFlow so that the logs can be mined on a completely different server and therefore there is zero load on the source and data replication is much faster with parallelism. You can make the transaction logs available on a shared mount on a remote server you spin up, and mine logs on this server. This puts zero load on the source systems and operational systems and they aren’t impacted even with mining huge volumes of data (or volumes of incremental data).

Best practices for loading data to Amazon Redshift

The following are the three best practices for loading data into Amazon Redshift.

Split large files into multiple files for high-performance loads

Amazon Redshift is a massively parallel processing (MPP) data warehouse, where several compute nodes work in parallel to ingest the data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. The following table shows the different node types and the number of slices.

Node Type Default Slices per Node
Dense Compute DC2
dc2.large 2
dc2.8xlarge 16
Dense Storage DS2
ds2.xlarge 2
ds2.8xlarge 16
RA3 Nodes
ra3.4xlarge 4
ra3.16xlarge 16

After you extract data into files, you can compress the files and split a single file to multiple files according to the number of slices, so that files are loaded with the compute being distributed evenly across the slices on Amazon Redshift. The number of multiple files is a configurable parameter in BryteFlow that can be set depending on the Amazon Redshift node types. The COPY command that ingests data into Amazon Redshift is configured optimally for fast data loads.
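
As a rough illustration of this practice, you can check how many slices your cluster has and then load from a common key prefix so that the split, compressed files are spread across the slices; the table, bucket, prefix, and role below are placeholders:

-- Number of slices per node in the current cluster
select node, count(*) as slices
from stv_slices
group by node
order by node;

-- Load all split, gzip-compressed files that share the prefix in parallel
copy sales_staging
from 's3://<your-bucket>/extracts/sales/part_'
iam_role 'arn:aws:iam::<your-account-id>:role/<your-redshift-role>'
gzip delimiter '|';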

Automatic creation of tables, default distribution keys, and distribution style

Data is distributed among the nodes on the basis of distribution style and distribution key of a particular table in Amazon Redshift. An even distribution of data enables Amazon Redshift to assign the workload evenly to slices and maximizes the benefit of parallel processing. This also helps during data ingestion. When ingesting data, BryteFlow automatically creates tables with the right DDL on Amazon Redshift. It also creates default distribution keys and distribution style, so that table ingestion is highly performant.

Optimum sort keys used for optimum loads support efficient columnar storage

Data in the Amazon Redshift data warehouse is stored in a columnar fashion, which drastically reduces the number of disk I/O requests and minimizes the amount of data loaded into the memory to run a query. Reduction in I/O speeds up queries, and loading less data means Amazon Redshift can perform more in-memory processing. Using the optimum table sort keys is the best practice for efficient loads.

Automatic sync on Amazon Redshift

CDC is an important element of syncing data with Amazon Redshift. BryteFlow automatically merges changes on Amazon Redshift with high performance, maintaining type2 history if configured. This means that data is ready to use as soon as it is ingested, without running lengthy ETL processes on it. The following diagram illustrates the type2 history feature.

Metadata for every extract and load can be captured on Amazon Aurora

Details of each extract and load process (for example, table names, the number of records affected, the start and end times, and various other details) are critical operational metadata that’s very useful in determining performance, tuning, and triggering other ETL processes. If this operational metadata is maintained on Amazon Redshift with every extract and load, constant single-row inserts and updates can hamper performance drastically, because Amazon Redshift is a columnar database and not an OLTP system.

At BryteFlow, we found that the best practice is to keep the operational metadata in an Amazon Aurora database, which is OLTP in nature and can store this metadata with constant updates and inserts and low latency.

Build your data lake on Amazon S3 and automatically query using Amazon Redshift Spectrum

BryteFlow enables you to build a continually refreshing data lake at scale on Amazon Simple Storage Service (Amazon S3) with continual replication and transformation of data. You can configure BryteFlow to use Amazon EMR on the incremental data with data on Amazon S3 and automatically merge and transform your data with an intuitive GUI. The EMR cluster can scale up or down depending on your data needs. You can then query data automatically on Amazon S3 using Amazon Redshift Spectrum.

Offload data ingestion and data preparation on Amazon S3 and load to Amazon Redshift

BryteFlow Blend helps with real-time data preparation of data ingested by BryteFlow, using Apache Spark on Amazon EMR with an intuitive GUI. You can load data prepared on Amazon S3 to Amazon Redshift via BryteFlow Blend or make it accessible to Amazon Redshift via Amazon Redshift Spectrum. This helps reserve the computational resources of Amazon Redshift for the actual querying (queries run much faster) while the Amazon S3 data lake handles data integration. The following diagram illustrates the distributed data integration architecture.

BryteFlow software supports all of the preceding ingestion best practices. Origin Energy used the BryteFlow software to build their analytics platform on Amazon Redshift. Origin’s data access has improved from several days to mere hours. BryteFlow software has helped achieve accurate data replication on the AWS Cloud with low latency, facilitating faster time-to-market for new and highly personalized customer offerings and a significant reduction in data costs.

Conclusion

Amazon Redshift delivers fast performance at scale for the most demanding workloads. Ingesting and preparing your data to Amazon Redshift using the BryteFlow software makes this an extremely attractive value proposition. You can liberate your data across data silos and quickly unlock the value on Amazon Redshift.

To see how this works for your project, you can get a free trial from our website. We offer complete support on your free trial including screen sharing, online support, and consultation.

Alternatively, you can go to AWS Marketplace for a free trial of BryteFlow Standard Edition or BryteFlow Enterprise Edition. Contact us to let us know so we can assist you through the trial.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors

Pradnya Bhandary is the Co-Founder and CEO at Bryte Systems.

 

 

 

 

Manage and control your cost with Amazon Redshift Concurrency Scaling and Spectrum

Post Syndicated from Vince Marchillo original https://aws.amazon.com/blogs/big-data/manage-and-control-your-cost-with-amazon-redshift-concurrency-scaling-and-spectrum/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.

This post shares the simple steps you can take to use the new Amazon Redshift usage controls feature to monitor and control your usage and associated cost for Amazon Redshift Spectrum and Concurrency Scaling features. Redshift Spectrum enables you to power a lake house architecture to directly query and join data across your data warehouse and data lake, and Concurrency Scaling enables you to support thousands of concurrent users and queries with consistently fast query performance.

Why this feature is important

With tens of thousands of customers, the Amazon Redshift team has the benefit of observing the workloads and behavior across a large variety of customers, including the internal teams at Amazon. We observed that across both internal and external customers, Amazon Redshift is running demanding workloads, such as extract, transform, and load (ETL) pipelines that condense several massive datasets, many of which are hundreds of terabytes in size, into single consumable tables for reporting and analytic purposes. These jobs take advantage of Concurrency Scaling to automatically scale Amazon Redshift query processing to handle burst workloads, and Redshift Spectrum to perform analytics on the transformed datasets by joining them with external data stored in a variety of open data formats in the data lake backed by Amazon Simple Storage Service (Amazon S3). Both these features are charged based on the usage, making cost management important, especially during peak periods when there is more activity for an organization. Although a combination of Amazon CloudWatch alarms and workload management (WLM) query monitoring rules can help to keep track and monitor usage, customers have asked to control their cost at the Amazon Redshift cluster level based on usage.

Usage limits to control Concurrency Scaling and Redshift Spectrum costs

With the new usage controls feature, you can now monitor and control the usage and associated costs for Redshift Spectrum and Concurrency Scaling. You can create daily, weekly, and monthly usage limits, and define actions to take if those limits are reached to maintain predictable spending. Actions include logging usage stats as an event to a system table, generating Amazon Simple Notification Service (Amazon SNS) alerts, and disabling Redshift Spectrum or Concurrency Scaling based on your defined thresholds. The new usage controls feature allows you to continue reaping the benefits provided by both Concurrency Scaling and Redshift Spectrum with the peace of mind that you can stay within budget simply by configuring the appropriate thresholds.

Setting up and managing usage controls

You can configure Amazon Redshift usage control options on the Amazon Redshift console or by using the AWS Command Line Interface (AWS CLI) or API. You can choose to set up to four limits per feature, allowing for multiple levels of logging or notifications before you disable Redshift Spectrum or Concurrency Scaling. The usage limit settings available are largely the same for both Concurrency Scaling and Redshift Spectrum usage—the main difference is that Concurrency Scaling usage limits are based on time spent (hours and minutes), while Redshift Spectrum usage limits are based on terabytes of data scanned. The fields you can adjust and select include the following:

  • Time – The time range for which your usage limits should be applied. You can choose daily, weekly, or monthly.
  • Usage limit – For Concurrency Scaling, this allows you to enter an integer value for hours and minutes to limit the amount of time this feature can be used before the usage limit kicks in. For Redshift Spectrum, you enter an integer value for the total number of terabytes you want to allow to be scanned before the limits apply.
  • Action – The action you want to take when your usage control limit has been reached. You can choose from Log to system table, Alert, or Disable feature. The Alert and Disable feature actions trigger a CloudWatch metric alarm, and you can optionally configure them to send Amazon SNS notifications.

The ability to configure up to four limits per feature, combined with the three available actions that you can take, provides accurate visibility into your current usage and a way to generate metrics of your usage patterns. You can use the Disable feature option to easily prevent going over budget, and the Alert and Log actions can provide valuable insights, such as how you are currently using Redshift Spectrum and Concurrency Scaling. For example, configuring a usage limit with a daily time period and an action to log to a system table allows you to easily generate metrics on which days you had higher utilization of Redshift Spectrum or Concurrency Scaling. These metrics can provide insights into high-traffic days and potential areas where your pipelines can be adjusted to better distribute your traffic. Another option is to configure a weekly alert limit at 25% of your desired monthly usage as a way to ensure that you are within your monthly expected budget.
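
If you configure the log action, you can review the recorded events with a simple query; this assumes the events land in the STL_USAGE_CONTROL system table described in the Amazon Redshift documentation:

-- Review logged usage limit events
select *
from stl_usage_control;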

Setting usage control limits on the Amazon Redshift console

To set usage limits for Concurrency Scaling and Redshift Spectrum using the new Amazon Redshift console, perform the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Select your desired cluster.
  3. From the Actions drop-down menu, choose Configure usage limit.

  4. To configure usage limits for Concurrency Scaling, choose Configure usage limit in the Concurrency scaling usage limit section.
  5. To configure usage limits for Redshift Spectrum, choose Configure usage limit in the Redshift Spectrum usage limit section.

  6. In the Configure usage limit section, select or deselect Concurrency scaling and Redshift Spectrum.

Selecting one of those options brings up the corresponding configuration windows.

  7. Choose a Time period (Daily, Weekly, or Monthly) from the drop-down menu.
  8. Enter your desired Usage limit.
  9. From the Action drop-down menu, choose an action (Alert, Log to system table, or Disable feature).

  10. To configure additional usage limits, choose Add another limit and action.
  11. When you have configured all your desired usage limits, choose Configure to confirm your usage limit settings.

Your configurations are now visible in the usage limit dashboard.

Managing usage control limits via the Amazon Redshift console

You can edit and delete usage limits on the Amazon Redshift console. The Edit option allows you to add limits or modify existing limit settings, and the Delete option deletes all configured limits for the corresponding service. To manage your configurations, perform the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Select your desired cluster.
  3. From the Actions drop-down menu, choose Configure usage limit.
  4. To edit your existing usage limit configurations, choose Edit in the corresponding service box.

The editing option lets you add a new usage limit, remove a usage limit, or modify an existing usage limit and corresponding action.

To modify the time period of an existing usage limit, you can remove and add it as a new usage limit.

To delete your Concurrency Scaling limits, choose Delete usage limit in the Concurrency scaling usage limit section.

To delete your Redshift Spectrum limits, choose Delete usage limit in the Redshift Spectrum usage limit section. Choosing Delete usage limit removes all limits configured for that service.

Setting usage control limits via the AWS CLI

You can also use the AWS CLI to add, edit, describe, or remove usage control configurations. The following examples outline the required CLI commands for each use case:

  • create-usage-limit – This command adds a new usage limit configuration for your Amazon Redshift cluster. The command should include the following parameters:
    • --cluster-identifier – The name of the cluster on which to apply the usage control.
    • --period – The time range for your usage limit. You can enter daily, weekly, or monthly for this parameter.
    • --feature-type – The service to which you want to apply this usage control. You can enter spectrum or concurrency-scaling for this parameter.
    • --limit-type – For Redshift Spectrum, this parameter should be set to data-scanned. For Concurrency Scaling, this should be set to time.
    • --amount – For Redshift Spectrum, this parameter should equal the total terabytes allowed to be scanned, in increments of 1 TB. For Concurrency Scaling, this parameter should be set to the total minutes (on the console, you can enter this in hh:mm) allowed before limit actions are applied.
    • --breach-action – The action to take when you reach your configured limit. Possible values are log, emit-metric, or disable. emit-metric sends metrics to CloudWatch.

See the following example code:

aws redshift create-usage-limit --cluster-identifier <yourClusterIdentifier> --period <daily|weekly|monthly> --feature-type <spectrum|concurrency-scaling> --limit-type <data-scanned|time> --amount <yourDesiredAmount> --breach-action <log|emit-metric|disable>
  • describe-usage-limits – This command returns a JSON response that lists the configured usage limits for the cluster you choose. The response includes all the configurable fields, such as the limit type and breach actions, and includes a usage limit ID, which is required for the modify and delete commands. The describe command should include the following parameter:
    • --cluster-identifier – The cluster identifier for which you want to obtain the configured usage limits.

See the following example code:

aws redshift describe-usage-limits --cluster-identifier <yourClusterIdentifier>
{
    "UsageLimits": [
        {
            "LimitType": "data-scanned",
            "Period": "daily",
            "BreachAction": "log",
            "FeatureType": "spectrum",
            "UsageLimitId": "4257b96e-5b12-4348-adc2-4922d2ceddd2",
            "Amount": 1,
            "ClusterIdentifier": "cost-controls-demo"
        },
  • modify-usage-limit – This command allows you to modify an existing usage limit configuration on your Amazon Redshift cluster. This command requires the UsageLimitID for the limit you want to modify, which you can obtain by running the describe-usage-limits command. The modify-usage-limit command should include the following parameters:
    • --usage-limit-id – The ID of the usage limit that you want to modify. You can obtain this by running the describe-usage-limits command.
    • --amount – The new value for your limit threshold.
    • --breach-action – The new action to take if you reach your limit threshold.

See the following example code:

aws redshift modify-usage-limit --usage-limit-id "<yourUsageLimitID>" --amount <newAmount> --breach-action <newBreachAction>
  • delete-usage-limit – This command deletes a configured usage limit from your Amazon Redshift cluster. This command requires the UsageLimitID, which you can obtain by running the describe-usage-limits command. This command should have the following parameter:
    • --usage-limit-id – The ID of the limit that you want to delete.

See the following example code:

aws redshift delete-usage-limit --usage-limit-id "<yourUsageLimitID>"

For more information, see Managing usage limits in Amazon Redshift.

Summary

The Amazon Redshift usage controls provide you with an easy way to monitor, alert, and limit the cost you incur when using Concurrency Scaling and Redshift Spectrum features. With up to four limits configurable per feature and options to log events, trigger Amazon SNS notifications, or disable the features altogether from the Redshift console and AWS CLI, you have all the tools needed to make sure you stay within your budget.

 


About the Authors

Vince Marchillo is a Solutions Architect within Amazon’s Business Data Technologies organization. Vince guides customers to leverage scalable, secure, and easy-to-use data lake architecture powered by AWS services and other technologies.

 

 

 

 

Maor Kleider is a product and database engineering leader for Amazon Redshift. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.

Satish Sathiya is a Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

Develop an application migration methodology to modernize your data warehouse with Amazon Redshift

Post Syndicated from Po Hong original https://aws.amazon.com/blogs/big-data/develop-an-application-migration-methodology-to-modernize-your-data-warehouse-with-amazon-redshift/

Data in every organization is growing in volume and complexity faster than ever. However, only a fraction of this invaluable asset is available for analysis. Traditional on-premises MPP data warehouses such as Teradata, IBM Netezza, Greenplum, and Vertica have rigid architectures that do not scale for modern big data analytics use cases. These traditional data warehouses are expensive to set up and operate, and require large upfront investments in both software and hardware. They cannot support modern use cases such as real-time or predictive analytics and applications that need advanced machine learning and personalized experiences.

Amazon Redshift is a fast, fully managed, cloud-native and cost-effective data warehouse that liberates your analytics pipeline from these limitations. You can run queries across petabytes of data in your Amazon Redshift cluster, and exabytes of data in-place on your data lake. You can set up a cloud data warehouse in minutes, start small for just $0.25 per hour, and scale to over a petabyte of compressed data for under $1,000 per TB per year – less than one-tenth the cost of competing solutions.

With tens of thousands of current global deployments (and rapid growth), Amazon Redshift has experienced tremendous demand from customers seeking to migrate away from their legacy MPP data warehouses. The AWS Schema Conversion Tool (SCT) makes this type of MPP migration predictable by automatically converting the source database schema and a majority of the database code objects, including views, stored procedures, and functions, to equivalent features in Amazon Redshift. SCT can also help migrate data from a range of data warehouses to Amazon Redshift by using built-in data migration agents.

Large-scale MPP data warehouse migration presents a challenge in terms of project complexity and poses a risk to execution in terms of resources, time, and cost. You can significantly reduce the complexity of migrating your legacy data warehouse and workloads with a subject- and object-level consumption-based data warehouse migration roadmap.

AWS Professional Services designed and developed this methodology based on many large-scale MPP data warehouse migration projects performed over the last few years. The approach is derived from lessons learned analyzing and dissecting ETL and reporting workloads, which often have intricate dependencies. It breaks a complex data warehouse migration project into multiple logical and systematic waves based on several dimensions: business priority, data dependency, workload profiles, and existing service level agreements (SLAs).

Consumption-based migration methodology

An effective and efficient method to migrate an MPP data warehouse is the consumption-based migration model, which moves workloads from the source MPP data warehouse to Amazon Redshift in a series of waves. You should run both the source MPP data warehouse and Amazon Redshift production environments in parallel for a certain amount of time before you can fully retire the source MPP data warehouse. For more information, see How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime.

A data warehouse has the following two logical components:

  • Subject area – A data source and data domain combination. It is typically associated with a business function, such as sales or payment.
  • Application – An analytic that consumes one or more subject areas to deliver value to customers.

The following diagram illustrates the workflow of data subject areas and information consumption.

Figure 1: Consuming applications (reports/analytics) and subject areas (data source/domain).

This methodology helps facilitate a customer’s journey to build a Data Driven Enterprise (D2E). Its benefits are: 1) it helps you deeply understand the customer’s business context and use cases, and 2) it contributes to shaping an enterprise data migration roadmap.

Affinity mapping between subject area and application

To decide which applications and their associated subject areas go into which wave, you need a detailed mapping between applications and subject areas. The following table shows an example of this type of mapping.

Figure 2: The affinity mapping between applications (analytics) and subject areas.

This mapping is built from the query execution metadata that is often stored in the system tables of legacy data warehouses, and it forms the basis for creating each wave — a single-step migration of an application’s objects and associated subject areas. You can similarly derive a more detailed mapping between data sources and subject areas (down to the level of individual tables), which helps with detailed project planning.

The sorting method in the preceding table is important. The right-most column shows the total number of times a subject area appears in applications (from the most common subject area to the least common subject area, top to bottom). The bottom row displays the number of subject areas that appear in an application (from the most dense application to the least dense, from left to right).

All other conditions being equal, which applications and analytics should you start with for the first wave? The best practice is to start somewhere in the middle (such as Analytic 8 or 9 in the preceding table). If you start from the left-most column (Analytic 1), the wave includes numerous objects (sources and tables, views, ETL scripts, data formatting, cleansing, and exposing routines), which makes it unwieldy and inordinately long to execute and complete. Alternatively, if you start from the right-most column (Analytic 19), the wave covers very few subject areas, which increases the number of waves required to complete the entire migration and stretches the overall time frame. This choice also fails to offer good insight into the complexity of the whole project.

Migration wave and subject area integration

The following table illustrates the wave-based migration approach (stair steps) for the preceding affinity map. In each wave (which may include one or more applications or analytics), there are always new subject areas (in green) to onboard and subject areas that were migrated in the previous waves (in blue). An optimal wave-based migration approach is to design migration waves to have fewer new builds with each subsequent wave. In the following example, as earlier waves finish, there are fewer new subject areas to integrate in the subsequent waves — another reason to start in the middle and work to the left on the affinity chart. This ultimately results in accelerated delivery of your migration outcomes.

Figure 3: Stair step model and subject area integration.

Wave 0 typically includes the shared or foundational dimensional data or tables that every application uses (for example, Time and Organization). Each wave should have at least one anchor application, and anchor applications need to include new subject areas or data sources. What factors should you consider when choosing anchor applications in a wave? An anchor application in one wave should have minimal dependencies on other anchor applications in other waves, and it is often considered important from the business perspective. The combination of anchor applications across all waves should cover all subject areas.

In the preceding example, there are six different migration waves. The following table summarizes their anchor applications:

Migration Wave Anchor Applications
Wave 1 Analytic 9
Wave 2 Analytic 8
Wave 3 Analytic 7
Wave 4 Analytic 6
Wave 5 Analytic 4 and Analytic 5
Wave 6 Analytic 3

All the other applications (analytics) are automatically taken care of, because the subject areas they depend on will already have been built in the waves mentioned above.

Application onboarding best practices

To determine how many waves there should be and what applications should go into each wave, consider the following factors:

  • Business priorities – An application’s value as part of a customer’s Data Driven Enterprise (D2E) journey
  • Workload profiles – Whether a workload is mostly ETL (write intensive) or query (read only)
  • Data-sharing requirements – Different applications may use data in the same tables
  • Application SLAs – The promised performance metric to end-users for each application
  • Dependencies – Functional dependencies among different applications

The optimal number of waves is determined by the above criteria; the best practice is to have up to 10 waves in one migration project so that you can effectively plan, execute, and monitor them.

Regarding what applications should go into which wave and why, interactions among applications and their performance impact are typically too complex to understand from first principles. The following are some best practices:

  • Perform experiments and tests to develop an understanding about how applications interact with each other and their performance impact.
  • Group applications based on common data-sharing requirements.
  • Be aware that not all workloads benefit from a large cluster. For example, simple dashboard queries may run faster on small clusters, while complex queries can take advantage of all the slices in a large Amazon Redshift cluster.
  • Consider grouping applications with different workload and access patterns.
  • Consider using a dedicated cluster for different waves of applications.
  • Develop a workload profile for each application.

Amazon Redshift cluster sizing guide

The Amazon Redshift node type determines the CPU, RAM, storage capacity, and storage drive type for each node. RA3 node types enable you to scale compute and storage independently; you pay separately for the amount of compute and Amazon Redshift Managed Storage (RMS) that you use. DS2 node types are optimized to store large amounts of data and use hard disk drive (HDD) storage. If you currently run on DS2 nodes, you should upgrade to RA3 node types to get up to 2x better performance and 2x more storage for the same cost. The dense compute (DC) node types are optimized for compute; DC2 node types are optimized for performance-intensive workloads because they use solid state drive (SSD) storage.

Amazon Redshift node types are available in different sizes. Node size and the number of nodes determine the total storage for a cluster. We recommend DC2 node types if you have less than 1 TB of compressed data, and RA3 node types (ra3.4xlarge or ra3.16xlarge) if you have more than 1 TB of compressed data. For more information, see Clusters and nodes in Amazon Redshift.

The node type that you choose depends on several factors:

  • The compute needs of downstream systems to meet Service Level Agreements (SLAs)
  • The complexity of the queries and concurrent operations that you need to support in the database
  • The trade-off between achieving the best performance for your workload and budget
  • The amount of data you want to store in a cluster

For more detailed information about Amazon Redshift cluster node type and cluster sizing, see Clusters and nodes in Amazon Redshift.
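As a minimal sketch of the preceding recommendation, the following AWS CLI call launches a small two-node RA3 cluster; the identifier, credentials, and node count are placeholders that you should adjust for your own workload:

aws redshift create-cluster \
  --cluster-identifier <yourClusterIdentifier> \
  --node-type ra3.4xlarge \
  --number-of-nodes 2 \
  --master-username awsuser \
  --master-user-password <yourMasterUserPassword> \
  --db-name dev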

As your data and performance needs change over time, you can easily resize your cluster to make the best use of the compute and storage options that Amazon Redshift provides.  You can use Elastic Resize to scale your Amazon Redshift cluster up and down in a matter of minutes to handle predictable spiky workloads and use the automated Concurrency Scaling feature to improve the performance of ad-hoc query workloads.
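For example, the following AWS CLI call uses elastic resize to change the node count ahead of a predictable spike; the cluster identifier and target node count are placeholders:

aws redshift resize-cluster --cluster-identifier <yourClusterIdentifier> --number-of-nodes 4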

Although you may be migrating all the data in your traditional MPP data warehouses into the managed storage of Amazon Redshift, it is also common to send data to different destinations.  You might send cold or historical data to an Amazon S3 data lake to save costs, and send hot or warm data to an Amazon Redshift cluster for optimal performance.  Amazon Redshift Spectrum allows you to easily query and join data across your Amazon Redshift data warehouse and Amazon S3 data lake.  The powerful serverless data lake approach using AWS Glue and AWS Lambda functions enables the lake house architecture that combines data in an Amazon S3 data lake with data warehousing in the cloud using a simplified ETL data pipeline, minimizing the need to load data into an Amazon Redshift cluster.  For more detailed information, see ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 1 and Build and automate a serverless data lake using an AWS Glue trigger for the Data Catalog and ETL jobs.

Conclusion

This post demonstrated how to develop a comprehensive, wave-based application migration methodology for a complex project to modernize a traditional MPP data warehouse with Amazon Redshift. It provided best practices and lessons learned by considering business priority, data dependency, workload profiles and existing service level agreements (SLAs).

We would like to acknowledge AWS colleagues Corina Radovanovich, Jackie Jiang, Hunter Grider, Srinath Madabushi, Matt Scaer, Dilip Kikla, Vinay Shukla, Eugene Kawamoto, Maor Kleider, Himanshu Raja, Britt Johnston and Jason Berkowitz for their valuable feedback and suggestions.

If you have any questions or suggestions, please leave your feedback in the comment section. For more information about modernizing your on-premises data warehouses by migrating to Amazon Redshift and finding a trusted AWS partner who can assist you on this endeavor, see Modernize your Data Warehouse.

About the authors

Po Hong, PhD, is a Principal Data Architect of Data & Analytics Global Specialty Practice, AWS Professional Services.

Anand Rajaram is the global head of the Amazon Redshift Professional Services Practice at AWS.

Restrict Amazon Redshift Spectrum external table access to Amazon Redshift IAM users and groups using role chaining

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/restrict-amazon-redshift-spectrum-external-table-access-to-amazon-redshift-iam-users-and-groups-using-role-chaining/

With Amazon Redshift Spectrum, you can query the data in your Amazon Simple Storage Service (Amazon S3) data lake using a central AWS Glue metastore from your Amazon Redshift cluster. This capability extends your petabyte-scale Amazon Redshift data warehouse to unbounded data storage limits, which allows you to scale to exabytes of data cost-effectively. Like Amazon EMR, you get the benefits of open data formats and inexpensive storage, and you can scale out to thousands of Redshift Spectrum nodes to pull data, filter, project, aggregate, group, and sort. Like Amazon Athena, Redshift Spectrum is serverless and there’s nothing to provision or manage. You only pay $5 for every 1 TB of data scanned. This post discusses how to configure Amazon Redshift security to enable fine grained access control using role chaining to achieve high-fidelity user-based permission management.

As you start using the lake house approach, which integrates Amazon Redshift with the Amazon S3 data lake using Redshift Spectrum, you need more flexibility when it comes to granting access to different external schemas on the cluster. For example, in the following use case, you have two Redshift Spectrum schemas, SA and SB, mapped to two databases, A and B, respectively, in an AWS Glue Data Catalog, in which you want to allow access for the following when queried from Amazon Redshift:

  • Select access for schema SA only to IAM user group Grp1
  • Select access for schema SB only to IAM user group Grp2
  • No access for IAM user group Grp3 to schemas SA and SB

By default, the policies defined under the AWS Identity and Access Management (IAM) role assigned to the Amazon Redshift cluster manage Redshift Spectrum table access, which is inherited by all users and groups in the cluster. This IAM role associated with the cluster cannot easily be restricted to different users and groups. This post details the configuration steps necessary to achieve fine-grained authorization policies for different users in an Amazon Redshift cluster and control access to different Redshift Spectrum schemas and tables using IAM role chaining. When using role chaining, you don’t have to modify the cluster; you can make all modifications on the IAM side. Adding new roles doesn’t require any changes in Amazon Redshift. Even when using AWS Lake Formation, as of this writing, you can’t achieve this level of isolated access control on Redshift Spectrum schemas and tables. For more information about cross-account queries, see How to enable cross-account Amazon Redshift COPY and Redshift Spectrum query for AWS KMS–encrypted data in Amazon S3.

Prerequisites

This post uses a TPC-DS 3 TB public dataset from Amazon S3 cataloged in AWS Glue by an AWS Glue crawler and an example retail department dataset. To get started, you must complete the following prerequisites. The first two prerequisites are outside of the scope of this post, but you can use your cluster and dataset in your Amazon S3 data lake.

  1. Create an Amazon Redshift cluster with or without an IAM role assigned to the cluster.
  2. Create an AWS Glue Data Catalog with a database using data from the data lake in Amazon S3, with either an AWS Glue crawler, Amazon EMR, AWS Glue, or Athena. The database should have one or more tables pointing to different Amazon S3 paths. This post uses an industry-standard TPC-DS 3 TB dataset, but you can also use your own dataset.
  3. Create IAM users and groups to use later in Amazon Redshift:
    • Create new IAM groups named grpA and grpB without any policies.
    • Create users a1 and b1 and add them to groups grpA and grpB, respectively. Use lower-case usernames.
    • Add the following policy to all the groups you created to allow IAM users to request temporary credentials when authenticating against Amazon Redshift:
      {
          "Version": "2012-10-17",
          "Statement": {
              "Effect": "Allow",
              "Action": "redshift:GetClusterCredentials",
              "Resource": "arn:aws:redshift:us-east-1:0123456789:dbuser:*/*"
          }
      }

      For additional security, you may want to restrict this policy to specific users and database groups in the cluster. A CLI sketch that scripts this IAM setup follows this prerequisite list.

  4. Create the IAM users and groups locally on the Amazon Redshift cluster without any password.
    • To create user a1, enter the following code:
      create user a1 password disable;

    • To create grpA, enter the following code:
      create group grpA with user a1;

    • Repeat these steps for user b1 and add the user to grpB.
  5. Install a JDBC SQL query client such as SqlWorkbenchJ on the client machine.
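The IAM portion of these prerequisites (step 3) can also be scripted. The following is a minimal AWS CLI sketch; the policy name AllowGetClusterCredentials is arbitrary, and it assumes the temporary-credentials policy shown above has been saved locally as redshift-credentials-policy.json (a hypothetical file name):

# Create the IAM groups and users, and add the users to their groups.
aws iam create-group --group-name grpA
aws iam create-group --group-name grpB
aws iam create-user --user-name a1
aws iam create-user --user-name b1
aws iam add-user-to-group --user-name a1 --group-name grpA
aws iam add-user-to-group --user-name b1 --group-name grpB

# Attach the GetClusterCredentials policy to both groups as an inline policy.
aws iam put-group-policy --group-name grpA --policy-name AllowGetClusterCredentials --policy-document file://redshift-credentials-policy.json
aws iam put-group-policy --group-name grpB --policy-name AllowGetClusterCredentials --policy-document file://redshift-credentials-policy.json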

Use case

In the following use case, you have an AWS Glue Data Catalog with a database named tpcds3tb. Tables in this database point to Amazon S3 under a single bucket, but each table is mapped to a different prefix under the bucket. The following screenshot shows the different table locations.

You use the tpcds3tb database and create a Redshift Spectrum external schema named schemaA. You create groups grpA and grpB with different IAM users mapped to the groups. The goal is to grant different access privileges to grpA and grpB on external tables within schemaA.

This post presents two options for this solution:

  • Use the Amazon Redshift grant usage statement to grant grpA access to external tables in schemaA. The group can access all tables in the data lake defined in that schema regardless of which Amazon S3 locations the tables are mapped to.
  • Configure role chaining for the external schemas to isolate group access to specific data lake locations and deny access to tables in the schema that point to different Amazon S3 locations.

Isolating user and group access using the grant usage privilege

You can use the Amazon Redshift grant usage privilege on schemaA, which allows grpA access to all objects under that schema. You don’t grant any usage privilege to grpB; users in that group should see access denied when querying.

  1. Create an IAM role named mySpectrum for the Amazon Redshift cluster to allow Redshift Spectrum to read Amazon S3 objects using the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*"
                ],
                "Resource": "*"
            }
        ]
    }

  2. Add a trust relationship to allow users in Amazon Redshift to assume roles assigned to the cluster. See the following code:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "redshift.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

  3. Choose your cluster name.
  4. Choose Properties.
  5. Choose Manage IAM roles.
  6. For Available IAM roles, choose your new role. If you don’t see the role listed, select Enter ARN and enter the role’s ARN.
  7. Choose Done, as shown in the following screenshot. (You can also script steps 1 through 7 with the AWS CLI; see the sketch at the end of this section.)
  8. Use the Amazon Redshift JDBC driver that includes the AWS SDK, which you can download from the Amazon Redshift console (see the following screenshot), and connect to the cluster using the IAM connection string from a SQL client such as SqlWorkbenchJ. The screenshots below show the JAR file type to select and the connection configuration in SqlWorkbenchJ.
  9. As an Amazon Redshift admin user, create an external schema named schemaA mapped to the AWS Glue database tpcds3tb (use the IAM role you created earlier to allow Redshift Spectrum access to Amazon S3). See the following code:
    create external schema schemaA from data catalog
    database 'tpcds3tb'
    iam_role 'arn:aws:iam::0123456789:role/mySpectrum' region 'us-east-1';

  10. Verify the schema is in the Amazon Redshift catalog with the following code:
    select * from svv_external_schemas;

  11. Grant usage privilege to grpA. See the following code:
    Grant usage on schema schemaA to group grpA;

  12. Query tables in schemaA as user a1 in grpA using your SQL client. See the following code:
    Select * from schemaA.customer limit 3;
    Select * from schemaA.call_center limit 3;

    The following screenshot shows the successful query results.

  13. Query tables in schemaA as user b1 in grpB. The following screenshot shows the error message you receive.

This option gives great flexibility to isolate user access on Redshift Spectrum schemas, but what if user b1 is authorized to access one or more tables in that schema, but not all of them? The second option provides finer-grained access control.
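If you prefer to script steps 1 through 7 instead of using the console, the following AWS CLI sketch creates the mySpectrum role and attaches it to the cluster. It assumes the permissions policy and trust policy shown earlier have been saved locally as spectrum-s3-policy.json and spectrum-trust-policy.json (hypothetical file names), the inline policy name is arbitrary, and the account ID and cluster name are the placeholders used throughout this post:

# Create the role with the Amazon Redshift trust policy, then attach the S3 permissions.
aws iam create-role --role-name mySpectrum \
  --assume-role-policy-document file://spectrum-trust-policy.json

aws iam put-role-policy --role-name mySpectrum \
  --policy-name spectrum-s3-access \
  --policy-document file://spectrum-s3-policy.json

# Associate the role with the cluster.
aws redshift modify-cluster-iam-roles --cluster-identifier <yourRedshiftClusterName> \
  --add-iam-roles arn:aws:iam::0123456789:role/mySpectrum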

Isolating user and group access using IAM policies and role chaining

You can use IAM policies mapped to IAM roles with a trust relationship to specific users and groups based on Amazon S3 location access and assign it to the cluster. For this use case, grpB is authorized to only access the table catalog_page located at s3://myworkspace009/tpcds3t/catalog_page/, and grpA is authorized to access all tables but catalog_page located at s3://myworkspace009/tpcds3t/*. The following steps help you configure for the given security requirement.

The following diagram depicts how role chaining works.

You first create IAM roles with policies specific to grpA and grpB. The first role is a generic cluster role that allows users to assume this role using a trust relationship defined in the role.

  1. On the IAM console, create a new role. See the following code:
    RoleName: myblog-redshift-assumeRole

  2. Add the following two policies to this role:
    • Add the managed policy named AWSGlueConsoleFullAccess. You might consider adding your own inline policy with least privileges instead of this managed policy for more restricted security needs.
    • Add an inline policy called myblog-redshift-assumerole-inline with the following rules:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "Stmt1487639602000",
                  "Effect": "Allow",
                  "Action": [
                      "sts:AssumeRole"
                  ],
                  "Resource": "arn:aws:iam::0123456789:role/*"
              }
          ]
      }

  3. Add a trust relationship that allows the users in the cluster to assume this role. You can choose to limit this to specific users as necessary. See the following code:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "redshift.amazonaws.com"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringLike": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/*"
            }
          }
        }
      ]
    }

  4. Create a new Redshift-customizable role specific to grpA with a policy allowing access only to the Amazon S3 locations that this group is allowed to access. Make sure you omit the Amazon S3 location for the catalog_page table; you don’t want to authorize this group to view that data.
    • Name the role myblog-grpA-role.
  5. Add the following two policies to this role:
    • Add the managed policy named AWSGlueConsoleFullAccess to the role. (Note: You might consider adding your own inline policy with least privileges instead of this managed policy for more restricted security needs.)
    • Add an inline policy named myblog-grpA-access-policy with the following rules (modify it to fit your security needs and allow minimal permissions):
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "AllowS3",
                  "Effect": "Allow",
                  "Action": [
                      "s3:*"
                  ],
                  "Resource": [
                      "arn:aws:s3:::myworkspace009/tpcds3t/call_center/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_returns/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_sales/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer_address/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer_demographics/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/date_dim/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/item/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/promotion/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/store/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/store_sales/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/web_page/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/web_sales/*"
                  ]
              }
          ]
      }

  6. Add a trust relationship that explicitly lists all users in grpA so that only they can assume this role (choose the Trust relationships tab and edit it to add the following policy, updating the relevant account details). The trust relationship has to be updated for each user added to this role, or you can build a new role for each user. It is fairly easy to automate updating this trust relationship with a script for each new user (see the CLI sketch before the conclusion of this section):
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": " arn:aws:iam::0123456789:role/myblog-redshift-assumeRole"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/a1"
            }
          }
        }
      ]
    }

  7. Create another Redshift-customizable role specific to grpB with a policy restricting access only to Amazon S3 locations where this group is allowed access.
    • Name the role myblog-grpB-role.
  8. Add the following two policies to this role. Create these policies to reflect the data access per DB group, and attach them to the roles that are assumed on the cluster.
    • Add the managed policy named AWSGlueConsoleFullAccess to the role. You might consider adding your own inline policy with least privileges instead of this managed policy for more restricted security needs.
    • Add an inline policy named myblog-grpB-access-policy with the following rules (modify it to fit your security needs and allow minimal permissions):
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "AllowS3",
                  "Effect": "Allow",
                  "Action": [
                      "s3:*"
                  ],
                  "Resource": [
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_page/*"
                  ]
              }
          ]
      }

  9. Add a trust relationship that explicitly lists all users in grpB so that only they can assume this role (choose the Trust relationships tab and edit it to add the following policy, updating the relevant account details). This trust relationship has to be updated for each user added to this role, or you can build a new role for each user. It is fairly easy to automate updating this trust relationship with a script:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": " arn:aws:iam::0123456789:role/myblog-redshift-assumeRole"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/b1"
            }
          }
        }
      ]
    }

  10. Attach the three roles to the Amazon Redshift cluster and remove any other roles mapped to the cluster. If you don’t find any roles in the drop-down menu, use the role ARN.
  11. As an admin user, create a new external schema for grpA and grpB, respectively, using role chaining with the two roles you created.
    • For grpA, enter the following code:
      create external schema rc_schemaA from data catalog
      database 'tpcds3tb'
      iam_role 'arn:aws:iam::0123456789:role/myblog-redshift-assumeRole,arn:aws:iam::0123456789:role/myblog-grpA-role' region 'us-east-1';
      
      grant usage on schema rc_schemaA to group grpA;

    • For grpB, enter the following code:
      create external schema rc_schemaB from data catalog
      database 'tpcds3tb'
      iam_role 'arn:aws:iam::0123456789:role/myblog-redshift-assumeRole,arn:aws:iam::0123456789:role/myblog-grpB-role' region 'us-east-1';
      
      grant usage on schema rc_schemaB to group grpB;

  12. Query the external schemas as users in grpA and grpB.
    • To query the customer table and catalog_page table as user a1 in grpA, enter the following code:
      select * from rc_schemaA.customer limit 3;
      select * from rc_schemaA.catalog_page limit 3;

The following screenshot shows the query results; user a1 can access the customer table successfully.

The following screenshot shows that user a1 can’t access catalog_page.

    • Query the customer table and catalog_page table as user b1 in grpB.

The following screenshot shows that user b1 can access catalog_page.

The following screenshot shows that user b1 can’t access the customer table.
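As noted in the trust relationship steps above, keeping the per-user trust policies up to date is easy to script. The following is a minimal AWS CLI sketch for grpA; the file name and the additional user a2 are hypothetical, and the account ID and cluster name are the placeholders used throughout this post:

# Write the updated trust policy, listing one ExternalId entry per user in grpA.
cat > trust-grpA.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::0123456789:role/myblog-redshift-assumeRole"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": [
            "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/a1",
            "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/a2"
          ]
        }
      }
    }
  ]
}
EOF

# Apply the updated trust policy to the group-specific role.
aws iam update-assume-role-policy --role-name myblog-grpA-role --policy-document file://trust-grpA.json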

Conclusion

This post demonstrated two different ways to isolate user and group access to external schemas and tables. With the first option of using grant usage statements, the granted group has access to all tables in the schema regardless of which Amazon S3 data lake paths the tables point to. This approach gives great flexibility to grant access with ease, but it doesn’t allow or deny access to specific tables in that schema.

With the second option, you manage user and group access at the grain of Amazon S3 objects, which gives more control of data security and lowers the risk of unauthorized data access. This approach has some additional configuration overhead compared to the first approach, but can yield better data security.

In both approaches, building the right governance model upfront for Amazon S3 paths, external schemas, and table mappings, based on how groups of users access them, is paramount to providing the best security with low operational overhead.

Special acknowledgment goes to AWS colleague Martin Grund for his valuable comments and suggestions.

About the Authors

Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

Harshida Patel is a Data Warehouse Specialist Solutions Architect with AWS.