All posts by Thiyagarajan Arumugam

Working with timestamp with time zone in your Amazon S3-based data lake

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/working-with-timestamp-with-time-zone-in-your-amazon-s3-based-data-lake/

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can use the purpose-built analytics services for a range of use cases, from analyzing petabyte-scale datasets to querying the metadata of a single object. AWS analytics services support open file formats such as Parquet, ORC, JSON, Avro, CSV, and more, so it’s convenient to analyze with the tool that is most appropriate for your use case. For more information, see Amazon S3 as the Data Lake Storage Platform.

The TIMESTAMP and TIMESTAMPTZ (TIMESTAMP with time zone) data types are key data elements associated with many time-based datasets (for example, clickstream, historical sales, and forecasting) in your data lake. But when you access the data across different analytical services, such as Amazon EMR-based ETL outputs being read by Amazon Redshift Spectrum, you may not know how the data will behave. Furthermore, lack of proper handling may cause accuracy issues in timestamp with time zone data types. This post delves into handling the TIMESTAMP and TIMESTAMPTZ data types in the context of a data lake by using a centralized data architecture. Because AWS analytical services cover a broad spectrum, we primarily focus on handling timestamps using Apache Hive, Apache Spark, Apache Parquet (using Amazon EMR and Amazon Athena), and Amazon Redshift to cover both the data lake and data warehouse.

Overview of TIMESTAMP and TIMESTAMPTZ data types in your data lake

Let’s start with some common definitions of the TIMESTAMP and TIMESTAMPTZ data types.

  • The TIMESTAMP data type stores values that include the date and time of day. For example, 12/17/1997 17:37:16. Timestamps are presented without time zone information.
  • The TIMESTAMPTZ data type stores values with the date, time of day, and time zone. For example, 12/17/1997 17:37:16 (PST).

Internally, a TIMESTAMP value is stored as an integer representing seconds since the epoch (1970-01-01 00:00:00 UTC), and TIMESTAMPTZ values are also stored as integers relative to Coordinated Universal Time (UTC).

When working with the TIMESTAMPTZ data type, reads and writes use the time zone of the client user machine. When no time zone is set up, or if it's left at the default (for example, in the JVM or SQL client), it defaults to UTC.
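As a quick illustration, the following is a minimal sketch you could run in an Amazon Redshift SQL client such as psql; the literal value and time zones are only examples, and the commented results show the expected behavior under those assumptions.

-- A TIMESTAMPTZ literal is stored internally as a UTC instant;
-- how it displays depends on the session time zone.
SHOW TIMEZONE;                                     -- UTC (session default in this sketch)
SELECT '1997-12-17 17:37:16-08'::timestamptz;      -- expected: 1997-12-18 01:37:16+00

SET timezone = 'America/Los_Angeles';
SELECT '1997-12-17 17:37:16-08'::timestamptz;      -- expected: 1997-12-17 17:37:16-08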

Timestamp behavior when accessed across the analytical services

For this post, we discuss handling timestamp with time zone data when it's accessed individually within a service as well as across services. The following diagram shows the architecture for this setup.

In this architecture, Parquet objects are stored in a centralized Amazon S3-based data lake, and Amazon EMR, Athena, and Amazon Redshift are used to access this centralized data. Data is also processed by these individual engines and accessed across these services through the Amazon S3 storage.

In this post, we illustrate the behavior of the different data types when data moves across different services from the Amazon S3 Parquet files.

Processing data in Amazon EMR (ETL) and accessing it with Amazon Redshift

In this use case, the Spark or Hive data pipeline generates Parquet files in the data lake and stores them in Amazon S3. The Parquet files stored in Amazon S3 are loaded into Amazon Redshift using the COPY command. The following diagram illustrates this workflow.

To test this setup, complete the following steps:

  1. Create a Hive table and insert a sample row (for this post, we use an EMR cluster spun up in us-west-2, PST):
    CREATE EXTERNAL TABLE clickstream_dwh.clickstream_hive
    (
      sessionid             BIGINT,
      click_region       STRING,
      click_datetime_utc       TIMESTAMP,
      pageid                INT,
      productid          INT
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
    LOCATION 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_hive/';
    
    
    insert into clickstream_dwh.clickstream_hive values (9074420482 ,'SEATTLE, US' ,'2014-04-06 02:40:13' ,3365,183876);

  2. Verify the Parquet file content using the Parquet tool in Amazon S3:
    $ parq 000000_0 --head 10
        sessionid click_region  click_datetime_utc  pageid  productid
    0  9074420482        SEATTLE, US 2014-04-06 09:40:13    3365     183876

In the preceding output, the Hive client running on Amazon EMR interprets the time zone with respect to the end-user client (PST) and converts it to UTC when writing to the Parquet file.

  3. Read through Hive and Spark (in Pacific time):
    Read through Hive (Pacific):
    
    PST:
    select * from clickstream_dwh.clickstream_hive; 
    
    +-------------+---------------+------------------------+---------+------------+
    |  sessionid  | click_region  |   click_datetime_utc   | pageid  | productid  |
    +-------------+---------------+------------------------+---------+------------+
    | 9074420482  | SEATTLE, US         | 2014-04-06 02:40:13.0  | 3365    | 183876     |

Amazon EMR Hive and Spark convert the underlying timestamp values, stored in Parquet in UTC, to the client user machine's local time (PST) when displaying the results.

  4. Copy the Parquet file to an Amazon Redshift table with the TIMESTAMP column data type (in UTC). We use the SQL command line client tool psql to query the results in Amazon Redshift.
    COPY the parquet file to Redshift table with timestamp column data type(UTC):
    
    
    CREATE TABLE clickstream_dwh.clickstream_ts
    (
      sessionid             BIGINT,
      click_region       VARCHAR(100),
      click_datetime_utc       TIMESTAMP,
      pageid                INT,
      productid          INT
    );
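    -- The COPY command for this step isn't shown above; a minimal sketch,
    -- assuming the same Parquet location and IAM role as the TIMESTAMPTZ
    -- example later in this post:
    COPY clickstream_dwh.clickstream_ts
    FROM 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_hive/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole'
    parquet;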
    
    dev=# SHOW TIMEZONE;
     TimeZone 
    ----------
     UTC
    (1 row)
    
    dev=# select * from clickstream_dwh.clickstream_ts;
     sessionid  | click_region | click_datetime_utc  | pageid | productid 
    ------------+--------------+---------------------+--------+-----------
     9074420482 | SEATTLE, US        | 2014-04-06 09:40:13 |   3365 |    183876
    (1 row)
    
    
    dev=# SET timezone = 'America/Los_Angeles';
    SET
    
    dev=# SHOW TIMEZONE;
          TimeZone       
    ---------------------
     America/Los_Angeles
    (1 row)
    
    
    dev=# select * from clickstream_dwh.clickstream_ts;
     sessionid  | click_region | click_datetime_utc  | pageid | productid 
    ------------+--------------+---------------------+--------+-----------
     9074420482 | SEATTLE, US        | 2014-04-06 09:40:13 |   3365 |    183876
    (1 row)
    
    Note: SET timezone = 'America/Los_Angeles' does not affect the TIMESTAMP column.

In the preceding output, the timestamp doesn't carry a time zone. All data is interpreted as UTC or whatever raw value it had when it was loaded into Amazon Redshift.

  5. Copy the Parquet file to the Amazon Redshift table using TIMESTAMPTZ (UTC & Pacific):
    COPY the parquet file to Redshift table with TIMESTAMPTZ(UTC & Pacific):
    
    
    CREATE TABLE clickstream_dwh.clickstream_tz
    (
      sessionid             BIGINT,
      click_region       VARCHAR(100),
      click_datetime_utc       TIMESTAMPTZ,
      pageid                INT,
      productid          INT
    );
    
    
    COPY clickstream_dwh.clickstream_tz
    FROM 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_hive/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole'
    parquet;
    
    
    
    dev=# SHOW TIMEZONE;
     TimeZone 
    ----------
     UTC
    (1 row)
    
    dev=# select * from clickstream_dwh.clickstream_tz;
     sessionid  | click_region |   click_datetime_utc    | pageid | productid 
    ------------+--------------+------------------------+--------+-----------
     9074420482 | SEATTLE, US        | 2014-04-06 09:40:13+00 |   3365 |    183876
    (1 row)
    
    dev=# SET timezone = 'America/Los_Angeles';
    SET
    dev=# 
    dev=# 
    dev=# SHOW TIMEZONE;
          TimeZone       
    ---------------------
     America/Los_Angeles
    (1 row)
    
    dev=# select * from clickstream_dwh.clickstream_tz;
     sessionid  | click_region |   click_datetime_utc    | pageid | productid 
    ------------+--------------+------------------------+--------+-----------
     9074420482 | SEATTLE, US        | 2014-04-06 02:40:13-07 |   3365 |    183876
    (1 row)

The output shows that TIMESTAMPTZ can interpret the client time zone and convert the value with respect to the end-user client (PST), though the actual values are stored in UTC.
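If you need the conversion to be explicit rather than dependent on the session time zone, Amazon Redshift provides the CONVERT_TIMEZONE function. The following is a minimal sketch against the table created above; the target time zone is only an example.

-- Render the stored UTC instant in a specific time zone, independent of the session setting
SELECT sessionid,
       click_datetime_utc,                                           -- displayed using the session time zone
       CONVERT_TIMEZONE('America/Los_Angeles', click_datetime_utc) AS click_datetime_pst
FROM clickstream_dwh.clickstream_tz;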

Processing data from Amazon Redshift and moving it to an Amazon S3 data lake

In the following use case, we copy data from Amazon Redshift to a data lake. Amazon Redshift stores the TIMESTAMP and TIMESTAMPTZ columns data types in a table. The table data is exported to Amazon S3 as Parquet files with the UNLOAD command. The following diagram illustrates this architecture.

To experiment with this setup, complete the following steps:

  1. Unload the Amazon Redshift table data to Amazon S3 (in UTC):
    UNLOAD ('select * from clickstream_dwh.clickstream_tz;')
    TO 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_rs/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole'
    parquet;

  2. Verify the Parquet file content:
    Check the parquet file content:
    
    $ parq 0016_part_00.parquet --head 10
        sessionid click_region   click_datetime_utc  pageid  productid
    0  9074420482        SEATTLE, US 2014-04-06 09:40:13    3365     183876

  3. Create a table in Hive and query it (in UTC):
    CREATE EXTERNAL TABLE clickstream_dwh.clickstream_rs
    (
      sessionid             BIGINT,
      click_region       STRING,
      click_datetime_utc       TIMESTAMP,
      pageid                INT,
      productid          INT
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
    LOCATION 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_rs/';
    
    
    
    > set hive.parquet.timestamp.skip.conversion=false;
    
    
    > select * from clickstream_dwh.clickstream_rs;
    
    
    +-------------+---------------+------------------------+---------+------------+
    |  sessionid  | click_region  |   click_datetime_utc   | pageid  | productid  |
    +-------------+---------------+------------------------+---------+------------+
    | 9074420482  | SEATTLE, US         | 2014-04-06 02:40:13.0  | 3365    | 183876     |
    +-------------+---------------+------------------------+---------+------------+
    1 row selected (0.888 seconds)

In the preceding output, the actual data in the Parquet file is stored in UTC, but Hive reads and displays it in the local time zone based on the client settings.

When using Hive, set hive.parquet.timestamp.skip.conversion=false. The pre-3.1.2 Hive implementation stores Parquet timestamps normalized to UTC on file; this flag allows you to skip that conversion when reading Parquet files created by other tools that may not have done so. Setting it to false treats legacy timestamps as UTC-normalized. For more information, see hive.parquet.timestamp.skip.conversion.

  4. Query using Spark-SQL (in Pacific time):
    spark-sql> select *  from clickstream_dwh.clickstream_rs;
    
    
    9074420482	SEATTLE, US	2014-04-06 02:40:13	3365	183876

In the preceding output, Spark converts the values with respect to the end-user client (PST), though the actual values are stored in UTC.
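If you want Spark to display the stored values without adjusting to the client machine's locale, you can pin the session time zone. The following sketch assumes the standard spark.sql.session.timeZone configuration; verify the behavior on your Spark version.

spark-sql> SET spark.sql.session.timeZone=UTC;
spark-sql> select * from clickstream_dwh.clickstream_rs;
-- With the session time zone pinned to UTC, the timestamp should display as the
-- stored UTC value (2014-04-06 09:40:13) instead of Pacific time.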

Accessing data through Athena

To access the data through Athena, you need to create the external table either in the AWS Glue Data Catalog or Hive metastore. In this example, we populate the Data Catalog.

To create a table using the Data Catalog, sign in to the Athena console and run the following DDL:

CREATE EXTERNAL TABLE IF NOT EXISTS default.blog_clickstream(
  `sessionid` bigint,
  `click_region` string,
  `click_datetime_utc` timestamp,
  `pageid` int,
  `productid` int 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_rs/'
TBLPROPERTIES ('has_encrypted_data'='false');

The following screenshot shows the query results.

The results show that Athena converts click_datetime_utc to the user’s local time zone (in this case, PST).
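For reference, the following is a hypothetical verification query you can run in the Athena query editor against the table created above; the LIMIT is arbitrary.

-- Inspect the rows and the Parquet-stored timestamp returned by Athena
SELECT sessionid, click_region, click_datetime_utc, pageid, productid
FROM default.blog_clickstream
LIMIT 10;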

Accessing data through Amazon Redshift Spectrum

To access the data through Amazon Redshift Spectrum, you need to create the following:

  • The external database and table in the Data Catalog or a Hive metastore. We can use the same table we already created in the preceding use case (default.blog_clickstream).
  • An Amazon Redshift external schema for the external database in the Data Catalog.

See the following code:

dev=# CREATE EXTERNAL SCHEMA ext_clickstream_blog
      from data catalog
      database 'default' 
      region 'us-west-2' 
      iam_role 'arn:aws:iam::123456789012:role/RedshiftDemoRole';


dev=#  select * from ext_clickstream_blog.blog_clickstream;

sessionid  | click_region | click_datetime_utc  | pageid | productid
-----------+--------------+---------------------+--------+----------
9074420482 | SEATTLE, US        | 2014-04-06 09:40:13 |   3365 |    183876

The output shows that Amazon Redshift Spectrum can convert click_datetime_utc to the local time zone of the user (PST).

Use cases for handling TIMESTAMP and TIMESTAMPTZ data types

When implementing the data model for your data lake, the choice between selecting the TIMESTAMP or TIMESTAMPTZ data type depends on how your end-users consume the data. In this section, we discuss two different use cases.

Using TIMESTAMP for a uniform display of one normalized time

When you want a uniform display of a standard time (in a particular time zone), use the TIMESTAMP data type and baseline the values that are stored into a particular time zone. For example, the following table shows collected clickstream data from a global website.

sessionid    click_region  click_datetime_utc    pageid  productid
3682484416   Chennai       2014-04-06T18:44:58   7156    309743
6367587374   London        2014-04-06T18:44:58   5298    749625
9074420482   Los Angeles   2014-04-06T09:40:13   3365    183876
1746153004   Perth         2014-04-04T06:13:28   3761    725195
1449344779   Singapore     2014-04-04T06:13:28   3527    140229
4115543521   New York      2014-04-07T09:22:28   3712    831655
2748081381   Paris         2014-04-07T09:22:28   8474    347742
1120684200   New York      2014-04-07T09:22:28   2731    568755

The clicks for the website come from users across the globe and are normalized to the UTC time zone in the click_datetime_utc column, using the TIMESTAMP data type. You can accomplish this during the data transformation process. Normalizing the data avoids confusion when it is analyzed across different regions, without needing explicit conversion.
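For example, if the source events carry local times, you can normalize them to UTC during the transformation step. The following Hive/Spark SQL sketch assumes a hypothetical staging table clickstream_raw with a local timestamp column and a column holding the source time zone name.

-- Normalize local event times to UTC while loading the curated table
INSERT INTO TABLE clickstream_dwh.clickstream_hive
SELECT sessionid,
       click_region,
       to_utc_timestamp(click_datetime_local, click_timezone) AS click_datetime_utc,
       pageid,
       productid
FROM clickstream_raw;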

Using TIMESTAMPTZ for a contextual display of data depending on the user’s local time zone

When you need a contextual display of date and time for users accessing the data, choose the TIMESTAMPTZ data type. For example, let's consider a customer service application that is accessing data from a centralized data warehouse. Individual users of the application are interested in analyzing data with respect to the location where the issue happened, rather than in a normalized time zone. See the following code:

create table public.customer_issue_log
(
customer_id	BIGINT NOT NULL,
customer_location	VARCHAR(50) NOT NULL,
customer_timezone	VARCHAR(10) NOT NULL,
issue_create_time timestamptz NOT NULL,
issue_create_time_utc timestamp  NULL,
issue_id	INTEGER NOT NULL,
issue_severity INTEGER NOT NULL
);

The following table summarizes the output.

customer_id  customer_location  customer_timezone  issue_create_time      issue_create_time_utc  issue_id  issue_severity
9589430063   Chennai, India     IST                4/5/2014 11:44:58 PM   4/6/2014 04:44:58 AM   2343      3
2796599493   New York, US       EST                4/6/2014 06:44:58 AM   4/6/2014 11:44:58 AM   2780      4
1836626118   Toronto, CA        EDT                4/4/2014 05:13:28 AM   4/4/2014 10:13:28 AM   7821      1
6790206978   Sydney, Australia  AEDT               4/5/2014 05:40:13 PM   4/5/2014 10:40:13 PM   3135      5

The issue_create_time column stores the date and time values and the time zone. When you query this table, you can view issue_create_time in your local time zone automatically (without any explicit conversion) by configuring SET timezone or by using a SQL client (such as SQL Workbench) that automatically adjusts the display with respect to your local computer settings.

In addition, you can introduce optional redundant columns such as issue_create_time_utc for ease of use when users analyze the data across different regions.
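The following sketch shows one way to populate the redundant UTC column at load time using CONVERT_TIMEZONE; the staging table name is hypothetical.

-- Derive the UTC column from the TIMESTAMPTZ value during the load
INSERT INTO public.customer_issue_log
SELECT customer_id,
       customer_location,
       customer_timezone,
       issue_create_time,
       CONVERT_TIMEZONE('UTC', issue_create_time) AS issue_create_time_utc,
       issue_id,
       issue_severity
FROM stg_customer_issue_log;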

Independent of the implementation approach, there is no loss of timestamp or time zone information, and you can perform data aggregation on both columns without explicit conversion because all data is stored in UTC in the underlying storage (Parquet in Amazon S3, or Amazon Redshift). For example, you can roll up data into weekly or monthly aggregates across the regions without any explicit conversion. The following example code calculates the weekly number of issues by severity across all locations:

select date_trunc('week', issue_create_time) wk, issue_severity, count(issue_id)
from public.customer_issue_log
group by date_trunc('week', issue_create_time), issue_severity;

Best practices for handling timestamps and time zones with data types

You should handle dates as DATE, TIMESTAMP, or TIMESTAMPTZ data types and not convert them to strings. When dates are stored as strings, you lose all the features and flexibility of working with date fields and date calculations, and you also lose processing efficiency. Moreover, casting or converting at runtime can be expensive.
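As a hedged illustration, compare date arithmetic on a typed column with the cast that a string column forces on every query (click_datetime_str and clickstream_str are hypothetical names):

-- Typed column: direct, efficient date functions
SELECT date_trunc('day', click_datetime_utc) AS click_day, count(*)
FROM clickstream_dwh.clickstream_ts
GROUP BY 1;

-- String column: every query pays for a cast, and bad values only fail at runtime
SELECT date_trunc('day', CAST(click_datetime_str AS timestamp)) AS click_day, count(*)
FROM clickstream_str
GROUP BY 1;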

When using TIMESTAMP or TIMESTAMPTZ data types, be aware of the client tools that access them. Client tool behavior largely depends on the local settings of the drivers and JVM, but it's possible to override this behavior, so always check the client tool's default behavior.

Use TIMESTAMPTZ only when absolutely necessary in the data model. In most use cases, TIMESTAMP simplifies data handling and avoids ambiguity when users access them.

Summary

In this post, we talked about handling and using TIMESTAMP and TIMESTAMPTZ data types with an Amazon S3-backed data lake. Most importantly, we covered how different AWS services like Amazon Redshift, Amazon EMR, Hive, and many other client tools interpret and interact with these data types. Choosing between using TIMESTAMP or TIMESTAMPTZ depends on the use case and how the end-user wants to visualize the data (a uniform display with one normalized time or a contextual display depending on time zone, respectively). Happy timestamping!


About the Authors

Thiyagarajan Arumugam is a Principal Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Srinivasan Krishnasamy is a Senior Big Data Consultant at Amazon Web Services. He joined AWS in 2015 and specializes in building and supporting big data solutions that help customers ingest, process, and analyze data at scale.

Satish Sathiya is a Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

Orchestrating an ETL process using AWS Step Functions for Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/orchestrating-an-etl-process-using-aws-step-functions-for-amazon-redshift/

Modern data lakes depend on extract, transform, and load (ETL) operations to convert bulk information into usable data. This post walks through implementing an ETL orchestration process that is loosely coupled using AWS Step Functions, AWS Lambda, and AWS Batch to target an Amazon Redshift cluster.

Because Amazon Redshift uses columnar storage, it is well suited for fast analytical insights using convenient ANSI SQL queries. You can rapidly scale your Amazon Redshift clusters up and down in minutes to meet demanding workloads for both your end-user reports and timely data refreshes into the data warehouse.

AWS Step Functions makes it easy to develop and use repeatable workflows that scale well. Step Functions lets you build automation workflows from individual Lambda functions. Each function performs a discrete task and lets you develop, test, and modify the components of your workflow quickly and seamlessly.

An ETL process refreshes your data warehouse from source systems, organizing the raw data into a format you can more readily use. Most organizations run ETL as a batch or as part of a real-time ingest process to keep the data warehouse current and provide timely analytics. A fully automated and highly scalable ETL process helps minimize the operational effort that you must invest in managing the regular ETL pipelines. It also ensures the timely and accurate refresh of your data warehouse. You can tailor this process to refresh data into any data warehouse or the data lake.

This post also provides an AWS CloudFormation template that launches the entire sample ETL process in one click to refresh the TPC-DS dataset. Find the template link in the Set up the entire workflow using AWS CloudFormation section.

Architectural overview

The following diagram illustrates the architectural overview of the different components involved in the orchestration of the ETL workflow. This workflow uses Step Functions to fetch source data from Amazon S3 to refresh the Amazon Redshift data warehouse.

Here are the core components of the workflow:

  • Amazon CloudWatch triggers the ETL process based on a schedule, through the AWS CLI, or using the various AWS SDKs in a Lambda function.
  • The ETL workflow uses Step Functions to implement the multi-step ETL process and coordinate AWS services into serverless workflows. You can build and easily iterate on these using JSON-based templates. For example, a typical ETL process may involve refreshing dimensions first and refreshing the fact tables later. You can declare your order of operations using a Step Functions state machine.
  • A Lambda function lets you build microservices to coordinate job submission and monitoring without needing to write code for workflow logic, parallel processes, error handling, timeouts, or retries.
  • AWS Batch runs several ETL jobs such as transforms and loads into Amazon Redshift. AWS Batch manages all the infrastructure for you, avoiding the complexities of provisioning, managing, monitoring, and scaling your batch computing jobs. It also lets you wait for the jobs to complete.
  • The source data in Amazon S3 refreshes an Amazon Redshift data warehouse through a psql container. To specify the ETL logic, I use .sql files that contain the SQL code for a particular step. For example, a .sql file for a typical dimension table refresh contains steps to load the data from Amazon S3 to a temporary staging table and INSERT/UPDATE the target table. Before beginning, review a sample dimensional table .sql file.

You can execute the workflow and monitor it using the state machine. You can trigger the ETL according to a schedule or an event (for example, as soon as all the data files arrive in S3).

Prerequisites

Before you get started, create a Docker image that can execute .sql files. AWS Batch uses this Docker image to create the resources for executing the ETL steps.

If this is your first time using AWS Batch, see Getting Started with AWS Batch. Create an environment to build and register the Docker image. For this post, register this image in an Amazon ECR repository. This is a private repository by default, making it useful for AWS Batch jobs.

Building the fetch-and-run psql Docker image

To build the Docker image, follow the steps outlined in the post Creating a Simple “Fetch & Run” AWS Batch Job.

Use the following Docker configuration and fetch-and-run psql script to build the image:

  1. DockerFetchRunPsqlUbundu
  2. fetch_and_run_psql.sh

Follow the steps in the post to import the Docker image into the ECR container registry. After you complete the previous steps, your Docker image is ready to trigger a .sql execution for an Amazon Redshift cluster.

Example: ETL process using TPC-DS dataset

This example uses a subset of the TPC-DS dataset to demonstrate a typical dimensional model refresh. Here is the Entity Relationship diagram of the TPC-DS data model that I use for this ETL application:

The ETL process refreshes table data for the Store_Sales fact table along with the Customer_Address and Item dimensions for a particular dataset date.

Setting up the ETL workflow using Step Functions

Step Functions makes complicated workflows more straightforward. You can set up dependency management and failure handling using a JSON-based template. Workflows are just a series of steps, with the output of one step acting as input into the next.

This example completes various dimensional table transforms and loads before triggering the Fact table load. Also, a workflow can branch out into multiple parallel steps whenever needed. You can monitor each step of execution as it happens, which means you can identify and fix problems quickly.

This illustration outlines the example ETL process set up through Step Functions:

For more information, see the detailed workflow diagram.

In the preceding workflow, the ETL process checks the DB connection in Step 1 and triggers the Customer_Address (Step 2.1) and Item_dimension (Step 2.2) steps, which execute in parallel. The Store_Sales fact table load (Step 3) waits for the dimensional table steps to complete. Each ETL step is autonomous, allowing you to monitor and respond to failures at any stage.

I now examine the Store_Sales step (Step 3) in detail. Other steps follow a similar pattern of implementation.

Here is the state implementation for Store_Sales step (Step 3):

{
   "Comment":"A simple ETL example that submits a Job to AWS Batch",
   "StartAt":"DBConnectionInit",
	...
      "Parallel":{
         "Type":"Parallel",
         "Next":"SalesFACTInit",
         "ResultPath":"$.status",
         "Branches":[
	...
      },
      "SalesFACTInit":{
         "Type":"Pass",
         "Next":"SubmitStoreSalesFACTJob",
         "Result":"SalesFACT",
         "ResultPath":"$.stepId"
      },
      "SubmitStoreSalesFACTJob":{
         "Type":"Task",
         "Resource":"arn:aws:lambda:us-west-2:1234567890:function:StepFunctionsSample-JobStatusPol-SubmitJobFunction-5M2HCJIG81R1",
         "Next":"GetStoreSalesFACTJobStatus"
      },
      "GetStoreSalesFACTJobStatus":{
         "Type":"Task",
         "Resource":"arn:aws:lambda:us-west-2:1234567890:function:StepFunctionsSample-JobStatusPoll-CheckJobFunction-1SKER18I6FU24",
         "Next":"CheckStoreSalesFACTJobStatus",
         "InputPath":"$",
         "ResultPath":"$.status"
      },
      "CheckStoreSalesFACTJobStatus":{
         "Type":"Choice",
         "Choices":[
            {
               "Variable":"$.status",
               "StringEquals":"FAILED",
               "Next":"FailState"
            },
            {
               "Variable":"$.status",
               "StringEquals":"SUCCEEDED",
               "Next":"GetFinalStoreSalesFACTJobStatus"
            }
         ],
         "Default":"StoreSalesFACTWait30Seconds"
      },
	...
   }
}

The Parallel state that loads all the dimension tables sets up a dependency on the subsequent Store_Sales fact transformation/load (SalesFACTInit) through the Next attribute. The SalesFACTInit step triggers the transformation using SubmitStoreSalesFACTJob, which submits the job to AWS Batch through the Lambda function JobStatusPol-SubmitJobFunction. GetStoreSalesFACTJobStatus polls through the Lambda function JobStatusPoll-CheckJobFunction every 30 seconds to check for completion. CheckStoreSalesFACTJobStatus validates the status and decides whether to succeed or fail the process, depending on the returned status.

Here is a snippet of the input for executing the state machine job for Step 3:

{  
"DBConnection":{  
..
   "SalesFACT":{  
      "jobName":"my-job",
      "jobDefinition":"arn:aws:batch:us-west-2:1234567890:job-definition/JobDefinition-cd6aa175c07fb2a:1",
      "jobQueue":"arn:aws:batch:us-west-2:1234567890:job-queue/JobQueue-217beecdb0caa3f",
      "wait_time":60,
      "containerOverrides":{  
         "environment":[  
            {  
               "name":"BATCH_FILE_TYPE",
               "value":"script_psql"
            },
            {  
               "name":"BATCH_FILE_S3_URL",
               "value":"s3://salamander-us-east-1/reinvent2018/ant353/etlscript/psql_rs.sh"
            },
            {  
               "name":"BATCH_FILE_SQL_S3_URL",
               "value":"s3://salamander-us-east-1/reinvent2018/ant353/etlscript/store_sales.sql"
            },
            {  
               "name":"DATASET_DATE",
               "value":"2003-01-02"
            }
         ]     
}}}

The input defines which .sql file each step invokes, along with the refresh date. You can represent any complex ETL workflow as a JSON workflow, which makes it easy to manage. This also decouples the inputs for each step from the workflow definition itself.

Executing the ETL workflow

AWS Batch executes each .sql script (store_sales.sql) that the state machine invokes, performing an incremental data refresh of the sales data for a particular date.

Here is the load and transformation implementation for the store_sales.sql:

\set s3datareadrolevar 'aws_iam_role=' :s3datareadrole
-- This transform ETL will refresh data for the store_sales table
-- Start a new transaction

begin transaction;

-- Create a stg_store_sales staging table and COPY data from the input S3 location, with updated rows from SALES_UPDATE

DROP TABLE if exists public.stg_store_sales;
CREATE TABLE public.stg_store_sales
(
	sold_date DATE   ENCODE lzo
	,sold_time INTEGER   ENCODE lzo
	,i_item_id CHAR(16)   ENCODE lzo
	,c_customer_id CHAR(16)   ENCODE lzo
	,cd_demo_sk INTEGER   ENCODE lzo
	,hd_income_band_sk INTEGER   ENCODE lzo
	,hd_buy_potential CHAR(15)   ENCODE lzo
	,hd_dep_count INTEGER   ENCODE lzo
	,hd_vehicle_count INTEGER   ENCODE lzo
	,ca_address_id CHAR(16)   ENCODE lzo
	,s_store_id CHAR(16)   ENCODE lzo
	,p_promo_id CHAR(16)   ENCODE lzo
	,ss_ticket_number INTEGER   ENCODE lzo
	,ss_quantity INTEGER   ENCODE lzo
	,ss_wholesale_cost NUMERIC(7,2)   ENCODE lzo
	,ss_list_price NUMERIC(7,2)   ENCODE lzo
	,ss_sales_price NUMERIC(7,2)   ENCODE lzo
	,ss_ext_discount_amt NUMERIC(7,2)   ENCODE lzo
	,ss_ext_sales_price NUMERIC(7,2)   ENCODE lzo
	,ss_ext_wholesale_cost NUMERIC(7,2)   ENCODE lzo
	,ss_ext_list_price NUMERIC(7,2)   ENCODE lzo
	,ss_ext_tax NUMERIC(7,2)   ENCODE lzo
	,ss_coupon_amt NUMERIC(7,2)   ENCODE lzo
	,ss_net_paid NUMERIC(7,2)   ENCODE lzo
	,ss_net_paid_inc_tax NUMERIC(7,2)   ENCODE lzo
	,ss_net_profit NUMERIC(7,2)   ENCODE lzo
)
BACKUP NO
DISTSTYLE EVEN
;


\set s3loc 's3://salamander-us-east-1/reinvent2018/ant353/store_sales/saledate=' :dt '/'
-- COPY input data to the staging table
copy public.stg_store_sales
from
:'s3loc'
CREDENTIALS :'s3datareadrolevar'
DELIMITER '~' gzip region 'us-east-1';

-- Delete any rows from target store_sales for the input date for idempotency
delete from store_sales where ss_sold_date_sk in (select d_date_sk from date_dim where d_date=:'dt');
--Insert data from staging table to the target TABLE
INSERT INTO store_sales
(
  ss_sold_date_sk,
  ss_sold_time_sk,
  ss_item_sk,
  ss_customer_sk,
  ss_cdemo_sk,
  ss_hdemo_sk,
  ss_addr_sk,
  ss_store_sk,
  ss_promo_sk,
  ss_ticket_number,
  ss_quantity,
  ss_wholesale_cost,
  ss_list_price,
  ss_sales_price,
  ss_ext_discount_amt,
  ss_ext_sales_price,
  ss_ext_wholesale_cost,
  ss_ext_list_price,
  ss_ext_tax,
  ss_coupon_amt,
  ss_net_paid,
  ss_net_paid_inc_tax,
  ss_net_profit
)
SELECT date_dim.d_date_sk ss_sold_date_sk,
       time_dim.t_time_sk ss_sold_time_sk,
       item.i_item_sk ss_item_sk,
       customer.c_customer_sk ss_customer_sk,
       customer_demographics.cd_demo_sk ss_cdemo_sk,
       household_demographics.hd_demo_sk ss_hdemo_sk,
       customer_address.ca_address_sk ss_addr_sk,
       store.s_store_sk ss_store_sk,
       promotion.p_promo_sk ss_promo_sk,
       ss_ticket_number,
       ss_quantity,
       ss_wholesale_cost,
       ss_list_price,
       ss_sales_price,
       ss_ext_discount_amt,
       ss_ext_sales_price,
       ss_ext_wholesale_cost,
       ss_ext_list_price,
       ss_ext_tax,
       ss_coupon_amt,
       ss_net_paid,
       ss_net_paid_inc_tax,
       ss_net_profit
FROM stg_store_sales AS store_sales
  JOIN date_dim ON store_sales.sold_date = date_dim.d_date
  LEFT JOIN time_dim ON store_sales.sold_time = time_dim.t_time
  LEFT JOIN item
         ON store_sales.i_item_id = item.i_item_id
        AND i_rec_end_date IS NULL
  LEFT JOIN customer ON store_sales.c_customer_id = customer.c_customer_id
  LEFT JOIN customer_demographics ON store_sales.cd_demo_sk = customer_demographics.cd_demo_sk
  LEFT JOIN household_demographics
         ON store_sales.hd_income_band_sk = household_demographics.hd_income_band_sk
        AND store_sales.hd_buy_potential = household_demographics.hd_buy_potential
        AND store_sales.hd_dep_count = household_demographics.hd_dep_count
        AND store_sales.hd_vehicle_count = household_demographics.hd_vehicle_count
  LEFT JOIN customer_address ON store_sales.ca_address_id = customer_address.ca_address_id
  LEFT JOIN store
         ON store_sales.s_store_id = store.s_store_id
        AND s_rec_end_date IS NULL
  LEFT JOIN promotion ON store_sales.p_promo_id = promotion.p_promo_id;

  --drop staging table

  DROP TABLE if exists public.stg_store_sales;

  -- End transaction and commit
  end transaction;

This ETL implementation runs through the following steps:

  1. A COPY command fast loads the data from S3 in bulk into the staging table stg_store_sales.
  2. Begin…end transactions encapsulate multiple steps in the transformation and load process. This leads to fewer commit operations in the end, making the process less expensive.
  3. The ETL implementation is idempotent. If it fails, you can retry the job without any cleanup. For example, it recreates the stg_store_sales table each time, and it deletes the target store_sales data for the particular refresh date before inserting.

For best practices used in the preceding implementation, see the Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift post.

Furthermore, Customer_Address demonstrates a Type 1 implementation and Item follows a Type 2 implementation in a typical dimensional model.
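For reference, a Type 1 dimension refresh such as Customer_Address overwrites attributes in place rather than keeping history. The following is a minimal sketch of that pattern under the same staging approach; the table and column names are illustrative, not the exact .sql file used in this workflow.

begin transaction;

-- Type 1: update existing dimension rows in place (no history kept)
UPDATE customer_address
SET    ca_city  = stg.ca_city,
       ca_state = stg.ca_state
FROM   stg_customer_address stg
WHERE  customer_address.ca_address_id = stg.ca_address_id;

-- Insert addresses that are not present yet
INSERT INTO customer_address
SELECT stg.*
FROM   stg_customer_address stg
LEFT JOIN customer_address ca ON stg.ca_address_id = ca.ca_address_id
WHERE  ca.ca_address_id IS NULL;

end transaction;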

Set up the entire workflow using AWS CloudFormation

The AWS CloudFormation template includes all the steps of this solution. This template creates all the required AWS resources and invokes initial data setup and the refresh of this data for a particular day. Here is a list of all the resources it creates inside the CloudFormation stack:

  • A VPC and associated subnets, security groups, and routes
  • IAM roles
  • An Amazon Redshift cluster
  • An AWS Batch job definition and compute environment
  • A Lambda function to submit and poll AWS Batch jobs
  • A Step Functions state machine to orchestrate the ETL workflow and refresh the data in the Amazon Redshift cluster

Here is the architecture of this setup that shows the Amazon Redshift setup in the VPC and the ETL process orchestrated using Step Functions:

Step 1: Create the stack with AWS CloudFormation

To deploy this application in your AWS account, start by launching this CloudFormation stack:

  • This stack uses the password Password#123. Change it as soon as possible. Use a minimum of eight characters, at least one uppercase letter, one lowercase letter, one number, and one special character.
  • Use the default values for all other parameters.

The stack takes about ten minutes to launch. Wait for the launch to complete, when the status changes to CREATE_COMPLETE.

Make a note of the value of ExecutionInput in the Output section of the stack. The JSON looks like the following code example:

{ "DBConnection":{ "jobName":"
…
alue":"s3://salamander-us-east-1/reinvent2018/ant353/etlscript/store_sales.sql" }, { "name":"DATASET_DATE", "value":"2003-01-02" } ] } } }

Note the Physical ID of JobDefinition and JobQueue in the Resources section of the stack.

Step 2: Set up TPC-DS 1-GB initial data in Amazon Redshift

The following steps load an initial 1 GB of TPCDS data into the Amazon Redshift cluster:

  • In the AWS Batch console, choose Job, select the job queue noted earlier, and choose Submit Job.
  • Set a new job name, for example, TPCDSdataload, and select the JobDefinition value that you noted earlier. Choose Submit Job. Wait for the job to completely load the initial 1 GB of TPCDS data into the Amazon Redshift cluster.
  • In the AWS Batch dashboard, monitor the completion of the TPCDS data load. This takes about ten minutes to complete.

Step 3: Execute the ETL workflow in the setup function

The ETL process is a multi-step workflow that refreshes the TPCDS dimensional model with data for 2003-01-02.

  1. In the Step Functions console, choose JobStatusPollerStateMachine-*.
  2. Choose Start execution and provide an optional execution name, for example, ETLWorkflowDataRefreshfor2003-01-02. In the execution input, enter the ExecutionInput value that you noted earlier. This kickstarts the ETL process. The state machine uses the Lambda poller to submit and monitor each step of the ETL job. Each input invokes the ETL workflow. You can monitor the process of the ETL by refreshing your browser.

Step 4: Verify the ETL data refresh in the Amazon Redshift cluster

In the Amazon Redshift console, choose Query Editor. Enter the following credentials:

  • Database: dev.
  • Database user: awsuser.
  • Password: This requires the password that you created in Step 1 (default Password#123).

After you are logged in to the public schema, execute the following query to check the data load for 2003-01-02:

SELECT c_last_name, 
               c_first_name, 
               ca_city, 
               bought_city, 
               ss_ticket_number, 
               amt, 
               profit 
FROM   (SELECT ss_ticket_number, 
               ss_customer_sk, 
               ca_city            bought_city, 
               Sum(ss_coupon_amt) amt, 
               Sum(ss_net_profit) profit 
        FROM   store_sales, 
               date_dim, 
               store, 
               household_demographics, 
               customer_address 
        WHERE  store_sales.ss_sold_date_sk = date_dim.d_date_sk 
               AND store_sales.ss_store_sk = store.s_store_sk 
               AND store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk 
               AND store_sales.ss_addr_sk = customer_address.ca_address_sk 
               AND ( household_demographics.hd_dep_count = 6 
                      OR household_demographics.hd_vehicle_count = 0 ) 
               AND d_date =  '2003-01-02'
        GROUP  BY ss_ticket_number, 
                  ss_customer_sk, 
                  ss_addr_sk, 
                  ca_city) dn, 
       customer, 
       customer_address current_addr 
WHERE  ss_customer_sk = c_customer_sk 
       AND customer.c_current_addr_sk = current_addr.ca_address_sk 
       AND current_addr.ca_city <> bought_city 
ORDER  BY c_last_name, 
          c_first_name, 
          ca_city, 
          bought_city, 
          ss_ticket_number
LIMIT 100;

The query should display the TPC-DS data for 2003-01-02 that the ETL process loaded.

Step 5: Cleaning up

When you finish testing this solution, remember to clean up all the AWS resources that you created using AWS CloudFormation. Use the AWS CloudFormation console or AWS CLI to delete the stack that you specified previously.

Conclusion

In this post, I described how to implement an ETL workflow using decoupled services in AWS and set up a highly scalable orchestration that lets you refresh data into an Amazon Redshift cluster.

You can easily expand on what you learned here. Here are some options that let you extend this solution to accommodate other analytical services or make it robust enough to be production ready:

  • This example invokes the state machine manually using Step Functions. You can instead trigger the state machine automatically using a CloudWatch event or an S3 event, such as whenever new files arrive in the source bucket. You can also drive the ETL invocation using a schedule. For useful information about automating your ETL workflow, see Schedule a Serverless Workflow.
  • You can add an alert mechanism in case of failures. To do this, create a Lambda function that sends you an email based on the status of each step in the Step Functions workflow.
  • Each step of the state machine is autonomous and can invoke any service with a Lambda function. You can integrate any analytical service into your workflow. For example, you can create a separate Lambda function to invoke AWS Glue and clean some of your data before transforming the data using Amazon Redshift. In this case, you add the AWS Glue job as a dependency in the step before the dimension load.

With this Step Functions-based workflow, you can decouple the different steps of ETL orchestration using any analytical service. Because of this, the solution is adaptable and interchangeable to a wide variety of applications.

If you have questions or suggestions, please leave a comment below.

About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/

An ETL (Extract, Transform, Load) process enables you to load data from source systems into your data warehouse. This is typically executed as a batch or near-real-time ingest process to keep the data warehouse current and provide up-to-date analytical data to end users.

Amazon Redshift is a fast, petabyte-scale data warehouse that enables you to easily make data-driven decisions. With Amazon Redshift, you can get insights into your big data in a cost-effective fashion using standard SQL. You can set up any type of data model, from star and snowflake schemas to simple denormalized tables, for running any analytical queries.

To operate a robust ETL platform and deliver data to Amazon Redshift in a timely manner, design your ETL processes to take account of Amazon Redshift’s architecture. When migrating from a legacy data warehouse to Amazon Redshift, it is tempting to adopt a lift-and-shift approach, but this can result in performance and scale issues long term. This post guides you through the following best practices for ensuring optimal, consistent runtimes for your ETL processes:

  • COPY data from multiple, evenly sized files.
  • Use workload management to improve ETL runtimes.
  • Perform table maintenance regularly.
  • Perform multiple steps in a single transaction.
  • Load data in bulk.
  • Use UNLOAD to extract large result sets.
  • Use Amazon Redshift Spectrum for ad hoc ETL processing.
  • Monitor daily ETL health using diagnostic queries.

1. COPY data from multiple, evenly sized files

Amazon Redshift is an MPP (massively parallel processing) database, where all the compute nodes divide and parallelize the work of ingesting data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. For example, each DS2.XLARGE compute node has two slices, whereas each DS2.8XLARGE compute node has 16 slices.

When you load data into Amazon Redshift, you should aim to have each slice do an equal amount of work. When you load the data from a single large file or from files split into uneven sizes, some slices do more work than others. As a result, the process runs only as fast as the slowest, or most heavily loaded, slice. In the example shown below, a single large file is loaded into a two-node cluster, resulting in only one of the nodes, “Compute-0”, performing all the data ingestion:

When splitting your data files, ensure that they are of approximately equal size – between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster. Also, I strongly recommend that you individually compress the load files using gzip, lzop, or bzip2 to efficiently load large datasets.

When loading multiple files into a single table, use a single COPY command for the table, rather than multiple COPY commands. Amazon Redshift automatically parallelizes the data ingestion. Using a single COPY command to bulk load data into a table ensures optimal use of cluster resources, and quickest possible throughput.
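Here is a minimal sketch of that pattern; the bucket, table, and IAM role names are hypothetical, and the files under the prefix are assumed to be evenly sized, gzipped parts.

-- One COPY command loads all the split files under the prefix in parallel across slices
COPY public.lineitem
FROM 's3://my-etl-bucket/lineitem/2017-07-02/'
IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftRole'
GZIP
DELIMITER '|';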

2. Use workload management to improve ETL runtimes

Use Amazon Redshift’s workload management (WLM) to define multiple queues dedicated to different workloads (for example, ETL versus reporting) and to manage the runtimes of queries. As you migrate more workloads into Amazon Redshift, your ETL runtimes can become inconsistent if WLM is not appropriately set up.

I recommend limiting the overall concurrency of WLM across all queues to around 15 or less. This WLM guide helps you organize and monitor the different queues for your Amazon Redshift cluster.

When managing different workloads on your Amazon Redshift cluster, consider the following for the queue setup:

  • Create a queue dedicated to your ETL processes. Configure this queue with a small number of slots (5 or fewer). Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to the commit queue. Because ETL is a commit-intensive process, having a separate queue with a small number of slots helps mitigate this issue.
  • Claim extra memory available in a queue. When executing an ETL query, you can take advantage of the wlm_query_slot_count to claim the extra memory available in a particular queue. For example, a typical ETL process might involve COPYing raw data into a staging table so that downstream ETL jobs can run transformations that calculate daily, weekly, and monthly aggregates. To speed up the COPY process (so that the downstream tasks can start in parallel sooner), the wlm_query_slot_count can be increased for this step.
  • Create a separate queue for reporting queries. Configure query monitoring rules on this queue to further manage long-running and expensive queries.
  • Take advantage of the dynamic memory parameters. They swap the memory from your ETL to your reporting queue after the ETL job has completed.
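For the second item in the preceding list, the following sketch shows a heavy COPY temporarily claiming extra slots and then returning to the default; the table, bucket, and role names are hypothetical, and the queue is assumed to have at least three slots.

-- Claim extra memory in the ETL queue for this session, run the COPY, then reset
SET wlm_query_slot_count TO 3;

COPY public.stg_daily_sales
FROM 's3://my-etl-bucket/daily_sales/2017-07-02/'
IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftRole'
GZIP DELIMITER '|';

SET wlm_query_slot_count TO 1;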

3. Perform table maintenance regularly

Amazon Redshift is a columnar database, which enables fast transformations for aggregating data. Performing regular table maintenance ensures that transformation ETLs are predictable and performant. To get the best performance from your Amazon Redshift database, you must ensure that database tables are regularly VACUUMed and ANALYZEd. The Analyze & Vacuum schema utility helps you automate the table maintenance task and have VACUUM & ANALYZE executed in a regular fashion.

  • Use VACUUM to sort tables and remove deleted blocks

During a typical ETL refresh process, tables receive new incoming records using COPY, and unneeded data (cold data) is removed using DELETE. New rows are added to the unsorted region in a table. Deleted rows are simply marked for deletion.

DELETE does not automatically reclaim the space occupied by the deleted rows. Adding and removing large numbers of rows can therefore cause the unsorted region and the number of deleted blocks to grow. This can degrade the performance of queries executed against these tables.

After an ETL process completes, perform VACUUM to ensure that user queries execute in a consistent manner. The complete list of tables that need VACUUMing can be found using the Amazon Redshift Util’s table_info script.

Use the following approaches to ensure that VACUUM is completed in a timely manner:

  • Use wlm_query_slot_count to claim all the memory allocated in the ETL WLM queue during the VACUUM process.
  • DROP or TRUNCATE intermediate or staging tables, thereby eliminating the need to VACUUM them.
  • If your table has a compound sort key with only one sort column, try to load your data in sort key order. This helps reduce or eliminate the need to VACUUM the table.
  • Consider using time series tables. This helps reduce the amount of data you need to VACUUM.

  • Use ANALYZE to update database statistics

Amazon Redshift uses a cost-based query planner and optimizer that relies on statistics about tables to make good decisions about the query plan for SQL statements. Regular statistics collection after ETL completion ensures that user queries run fast and that daily ETL processes are performant. The Amazon Redshift utility table_info script provides insights into the freshness of the statistics. Keeping the percentage of stale statistics (pct_stats_off) below 20% ensures effective query plans for your SQL queries.
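For example, a post-ETL maintenance step might look like the following sketch; the table name is illustrative, and the appropriate VACUUM options depend on your workload.

-- Reclaim deleted blocks and re-sort, then refresh planner statistics
VACUUM public.store_sales;
ANALYZE public.store_sales;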

4. Perform multiple steps in a single transaction

ETL transformation logic often spans multiple steps. Because commits in Amazon Redshift are expensive, if each ETL step performs a commit, multiple concurrent ETL processes can take a long time to execute.

To minimize the number of commits in a process, the steps in an ETL script should be surrounded by a BEGIN…END statement so that a single commit is performed only after all the transformation logic has been executed. Here is an example of a multi-step ETL script that performs one commit at the end:

BEGIN;
CREATE TEMPORARY TABLE staging_table (...);
INSERT INTO staging_table SELECT .. FROM source;         -- transformation logic
DELETE FROM daily_table WHERE dataset_date = ?;
INSERT INTO daily_table SELECT .. FROM staging_table;    -- daily aggregate
DELETE FROM weekly_table WHERE weekending_date = ?;
INSERT INTO weekly_table SELECT .. FROM staging_table;   -- weekly aggregate
COMMIT;

5. Load data in bulk

Amazon Redshift is designed to store and query petabyte-scale datasets. Using Amazon S3 you can stage and accumulate data from multiple source systems before executing a bulk COPY operation. The following methods allow efficient and fast transfer of these bulk datasets into Amazon Redshift:

  • Use a manifest file to ingest large datasets that span multiple files. The manifest file is a JSON file that lists all the files to be loaded into Amazon Redshift. Using a manifest file ensures that Amazon Redshift has a consistent view of the data to be loaded from S3, while also ensuring that duplicate files do not result in the same data being loaded more than one time.
  • Use temporary staging tables to hold the data for transformation. These tables are automatically dropped after the ETL session is complete. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. Explicitly specifying the CREATE TEMPORARY TABLE statement allows you to control the DISTRIBUTION KEY, SORT KEY, and compression settings to further improve performance.
  • Use ALTER TABLE APPEND to swap data from the staging tables to the target table. Data in the source table is moved to matching columns in the target table. Column order doesn't matter. After data is successfully appended to the target table, the source table is empty. ALTER TABLE APPEND is much faster than a similar CREATE TABLE AS or INSERT INTO operation because it doesn't involve copying or moving data. A minimal sketch of this pattern follows this list.
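In the following sketch, all table, bucket, and role names are hypothetical. Note that ALTER TABLE APPEND works on permanent tables, so the staging table here is created as a regular table and dropped afterward.

-- Create a staging table with the same structure as the target
CREATE TABLE stg_orders (LIKE public.orders);

-- Load and transform into the staging table
COPY stg_orders
FROM 's3://my-etl-bucket/orders/2017-07-02/'
IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftRole'
GZIP DELIMITER '|';

-- Move the rows into the target without copying data, then drop the staging table
ALTER TABLE public.orders APPEND FROM stg_orders;
DROP TABLE stg_orders;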

6. Use UNLOAD to extract large result sets

Fetching a large number of rows using SELECT is expensive and takes a long time. When a large amount of data is fetched from the Amazon Redshift cluster, the leader node has to hold the data temporarily until the fetches are complete. Further, data is streamed out sequentially, which results in longer elapsed time. As a result, the leader node can become hot, which not only affects the SELECT that is being executed, but also throttles resources for creating execution plans and managing the overall cluster resources. Here is an example of a large SELECT statement. Notice that the leader node is doing most of the work to stream out the rows:

Use UNLOAD to extract large results sets directly to S3. After it’s in S3, the data can be shared with multiple downstream systems. By default, UNLOAD writes data in parallel to multiple files according to the number of slices in the cluster. All the compute nodes participate to quickly offload the data into S3.

If you are extracting data for use with Amazon Redshift Spectrum, you should use the MAXFILESIZE parameter to keep files at around 150 MB, as in the sketch that follows. Similar to item 1 above, having many evenly sized files ensures that Redshift Spectrum can do the maximum amount of work in parallel.
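The query, bucket, and role names in this sketch are illustrative.

-- Unload a large result set directly to S3 as ~150 MB Parquet files
UNLOAD ('SELECT * FROM public.daily_aggregates WHERE sale_date = ''2017-07-02''')
TO 's3://my-etl-bucket/exports/daily_aggregates/2017-07-02/part_'
IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftRole'
PARQUET
MAXFILESIZE 150 MB;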

7. Use Redshift Spectrum for ad hoc ETL processing

Events such as data backfill, promotional activity, and special calendar days can trigger additional data volumes that affect the data refresh times in your Amazon Redshift cluster. To help address these spikes in data volumes and throughput, I recommend staging data in S3. After data is organized in S3, Redshift Spectrum enables you to query it directly using standard SQL. In this way, you gain the benefits of additional capacity without having to resize your cluster.
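For example, after a spike of data is staged in S3, you can define an external table over it and query it in place instead of loading it into the cluster. The following sketch assumes an external schema named spectrum_stage already exists in the AWS Glue Data Catalog; the table, column, and bucket names are illustrative.

-- External table over the staged backfill files (external schema spectrum_stage is assumed to exist)
CREATE EXTERNAL TABLE spectrum_stage.backfill_sales (
  ss_ticket_number BIGINT,
  ss_sold_date     DATE,
  ss_net_paid      DECIMAL(7,2)
)
STORED AS PARQUET
LOCATION 's3://my-etl-bucket/backfill/sales/';

-- Query the spike directly from S3 without resizing the cluster
SELECT ss_sold_date, SUM(ss_net_paid) AS net_paid
FROM spectrum_stage.backfill_sales
GROUP BY ss_sold_date;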

For tips on getting started with and optimizing the use of Redshift Spectrum, see the previous post, 10 Best Practices for Amazon Redshift Spectrum.

8. Monitor daily ETL health using diagnostic queries

Monitoring the health of your ETL processes on a regular basis helps identify the early onset of performance issues before they have a significant impact on your cluster. The following monitoring scripts can be used to provide insights into the health of your ETL processes:

  • commit_stats.sql – Commit queue statistics from past days, showing the largest queue length and queue time first.
    Use when: DML statements such as INSERT/UPDATE/COPY/DELETE operations take several times longer to execute when multiple of these operations are in progress.
    Solution: Set up separate WLM queues for the ETL process and limit the concurrency to < 5.

  • copy_performance.sql – COPY command statistics for the past days.
    Use when: Daily COPY operations take longer to execute.
    Solution: Follow the best practices for the COPY command. Analyze data growth with the incoming datasets and consider a cluster resize to meet the expected SLA.

  • table_info.sql – Table skew and unsorted statistics along with storage and key information.
    Use when: Transformation steps take longer to execute.
    Solution: Set up regular VACUUM jobs to address unsorted rows and reclaim the deleted blocks so that transformation SQL executes optimally. Consider a table redesign to avoid data skewness.

  • v_check_transaction_locks.sql – Monitor transaction locks.
    Use when: INSERT/UPDATE/COPY/DELETE operations on particular tables do not respond in a timely manner, compared to when they run after the ETL.
    Solution: Multiple DML statements are operating on the same target table at the same moment from different transactions. Set up ETL job dependencies so that they execute serially for the same target table.

  • v_get_schema_priv_by_user.sql – Get the schema that the user has access to.
    Use when: Reporting users can view intermediate tables.
    Solution: Set up separate database groups for reporting and ETL users, and grant access to objects using GRANT.

  • v_generate_tbl_ddl.sql – Get the table DDL.
    Use when: You need to create an empty table with the same structure as the target table for data backfill.
    Solution: Generate the DDL using this script for data backfill.

  • v_space_used_per_tbl.sql – Monitor space used by individual tables.
    Use when: Amazon Redshift data warehouse space growth is trending upwards more than normal.
    Solution: Analyze the individual tables that are growing at a higher rate than normal. Consider data archival using UNLOAD to S3 and Redshift Spectrum for later analysis. Use unscanned_table_summary.sql to find unused tables and archive or drop them.

  • top_queries.sql – Return the top 50 most time-consuming statements aggregated by their text.
    Use when: ETL transformations are taking longer to execute.
    Solution: Analyze the top transformation SQL and use EXPLAIN to find opportunities for tuning the query plan.

There are several other useful scripts available in the amazon-redshift-utils repository. The AWS Lambda Utility Runner runs a subset of these scripts on a scheduled basis, allowing you to automate much of the monitoring of your ETL processes.
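
In addition to the packaged scripts, a quick ad hoc check is sometimes enough. For example, the following query (a sketch, not one of the scripts above) lists the longest-running statements from the past day using the STL_QUERY system table:

SELECT query,
       DATEDIFF(second, starttime, endtime) AS duration_seconds,
       TRIM(querytxt) AS sql_text
FROM stl_query
WHERE starttime > DATEADD(day, -1, GETDATE())
ORDER BY duration_seconds DESC
LIMIT 50;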

Example ETL process

The following ETL process reinforces some of the best practices discussed in this post. Consider the following four-step daily ETL workflow where data from an RDBMS source system is staged in S3 and then loaded into Amazon Redshift. Amazon Redshift is used to calculate daily, weekly, and monthly aggregations, which are then unloaded to S3, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

Step 1:  Extract from the RDBMS source to a S3 bucket

In this ETL process, the data extract job fetches change data every hour and stages it in multiple hourly files. For example, the staged S3 folder looks like the following:

$ aws s3 ls s3://<<S3 Bucket>>/batch/2017/07/02/
2017-07-02 01:59:58   81900220 20170702T01.export.gz
2017-07-02 02:59:56   84926844 20170702T02.export.gz
2017-07-02 03:59:54   78990356 20170702T03.export.gz
…
2017-07-02 22:00:03   75966745 20170702T21.export.gz
2017-07-02 23:00:02   89199874 20170702T22.export.gz
2017-07-02 00:59:59   71161715 20170702T23.export.gz

Organizing the data into multiple, evenly sized files enables the COPY command to ingest this data using all available resources in the Amazon Redshift cluster. The files are also compressed (gzipped) to further reduce COPY times.

Step 2: Stage data to the Amazon Redshift table for cleansing

Ingesting the data can be accomplished using a JSON-based manifest file. Using the manifest file ensures that S3 eventual consistency issues can be eliminated and also provides an opportunity to dedupe any files if needed. A sample manifest20170702.json file looks like the following:

{
  "entries": [
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T01.export.gz", "mandatory":true},
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T02.export.gz", "mandatory":true},
    …
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T23.export.gz", "mandatory":true}
  ]
}

The data can be ingested using the following command:

SET wlm_query_slot_count TO <<max available concurrency in the ETL queue>>;
COPY stage_tbl FROM 's3://<<S3 Bucket>>/batch/manifest20170702.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;

Because the downstream ETL processes depend on this COPY command to complete, the wlm_query_slot_count is used to claim all the memory available to the queue. This helps the COPY command complete as quickly as possible.
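
After the staging COPY finishes, the slot count can be returned to its default of 1 so that other statements in the ETL queue are not starved of memory, for example:

SET wlm_query_slot_count TO 1;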

Step 3: Transform data to create daily, weekly, and monthly datasets and load into target tables

Data is staged in the stage_tbl table, from which it can be transformed into the daily, weekly, and monthly aggregates and loaded into target tables. The following job illustrates a typical weekly process:

BEGIN;
INSERT INTO ETL_LOG (..) VALUES (..);
DELETE FROM weekly_tbl WHERE dataset_week = <<current week>>;
INSERT INTO weekly_tbl (..)
SELECT date_trunc('week', dataset_day) AS week_begin_dataset_date,
       SUM(C1) AS C1,
       SUM(C2) AS C2
FROM stage_tbl
GROUP BY date_trunc('week', dataset_day);
INSERT INTO AUDIT_LOG VALUES (..);
COMMIT;

As shown above, multiple steps are combined into one transaction to perform a single commit, reducing contention on the commit queue.

Step 4: Unload the daily dataset to populate the S3 data lake bucket

The transformed results are now unloaded into another S3 bucket, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

UNLOAD ('SELECT * FROM weekly_tbl WHERE dataset_week = <<current week>>') TO 's3://<<S3 Bucket>>/datalake/weekly/20170526/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Summary

Amazon Redshift lets you easily operate petabyte-scale data warehouses in the cloud. This post summarized the best practices for operating scalable ETL natively within Amazon Redshift. I demonstrated efficient ways to ingest and transform data, along with close monitoring, and walked through a sample ETL workload that applies these best practices.

If you have questions or suggestions, please comment below.

 


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

 

Federate Database User Authentication Easily with IAM and Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/federate-database-user-authentication-easily-with-iam-and-amazon-redshift/

Managing database users through federation allows you to manage authentication and authorization procedures centrally. Amazon Redshift now supports database authentication with IAM, enabling user authentication through enterprise federation. There is no need to manage separate database users and passwords, which further eases database administration. You can now manage users outside of AWS and authenticate them for access to an Amazon Redshift data warehouse by integrating IAM authentication with a third-party SAML 2.0 identity provider (IdP), such as AD FS, PingFederate, or Okta. In addition, database users can be created automatically at their first login based on corporate permissions.

In this post, I demonstrate how you can extend the federation to enable single sign-on (SSO) to the Amazon Redshift data warehouse.

SAML and Amazon Redshift

AWS supports Security Assertion Markup Language (SAML) 2.0, which is an open standard for identity federation used by many IdPs. SAML enables federated SSO, which lets your users sign in to the AWS Management Console. Users can also make programmatic calls to AWS API actions by using assertions from a SAML-compliant IdP. For example, if you use Microsoft Active Directory for corporate directories, you may be familiar with how Active Directory and AD FS work together to enable federation. For more information, see the Enabling Federation to AWS Using Windows Active Directory, AD FS, and SAML 2.0 AWS Security Blog post.

Amazon Redshift now provides the GetClusterCredentials API operation, which allows you to generate temporary database user credentials for authentication. You can set up an IAM permissions policy that generates these credentials for connecting to Amazon Redshift. Extending IAM authentication, you can configure federation of AWS access through a SAML 2.0–compliant IdP. An IAM role can be configured to permit federated users to call the GetClusterCredentials action and generate temporary credentials to log in to Amazon Redshift databases. You can also set up policies to restrict access to Amazon Redshift clusters, databases, database user names, and user groups.

Amazon Redshift federation workflow

In this post, I demonstrate how you can use a JDBC- or ODBC-based SQL client to log in to the Amazon Redshift cluster using this feature. The SQL clients used with Amazon Redshift JDBC or ODBC drivers automatically manage the process of calling the GetClusterCredentials action, retrieving the database user credentials, and establishing a connection to your Amazon Redshift database. You can also use your database application to programmatically call the GetClusterCredentials action, retrieve database user credentials, and connect to the database. I demonstrate these features using an example company to show how different database user accounts can be managed easily using federation.

The following diagram shows how the SSO process works:

  1. JDBC/ODBC
  2. Authenticate using Corp Username/Password
  3. IdP sends SAML assertion
  4. Call STS to assume role with SAML
  5. STS Returns Temp Credentials
  6. Use Temp Credentials to get Temp cluster credentials
  7. Connect to Amazon Redshift using temp credentials

Walkthrough

Example Corp. is using Active Directory (IdP host: demo.examplecorp.com) to manage federated access for users in its organization. It has an AWS account, 123456789012, and currently manages an Amazon Redshift cluster with the cluster ID "examplecorp-dw" and the database "analytics" in the us-west-2 region for its Sales and Data Science teams. It wants the following access:

  • Sales users can access the examplecorp-dw cluster using the sales_grp database group.
  • Sales users access examplecorp-dw through a JDBC-based SQL client.
  • Sales users access examplecorp-dw through an ODBC connection for their reporting tools.
  • Data Science users access the examplecorp-dw cluster using the data_science_grp database group.
  • Partners access the examplecorp-dw cluster and query using the partner_grp database group.
  • Partners are not federated through Active Directory and are provided with separate IAM user credentials (with the IAM user name examplecorpsalespartner).
  • Partners can connect to the examplecorp-dw cluster programmatically, using a language such as Python.
  • All users are automatically created in Amazon Redshift when they log in for the first time.
  • (Optional) Internal users do not specify database user or group information in their connection strings; it is assigned automatically.
  • Data warehouse users can use SSO for the Amazon Redshift data warehouse using the preceding permissions.

Step 1:  Set up IdPs and federation

The Enabling Federation to AWS Using Windows Active Directory post demonstrated how to prepare Active Directory and enable federation to AWS. Using those instructions, you can establish trust between your AWS account and the IdP and enable user access to AWS using SSO.  For more information, see Identity Providers and Federation.

For this walkthrough, assume that this company has already configured SSO to their AWS account: 123456789012 for their Active Directory domain demo.examplecorp.com. The Sales and Data Science teams are not required to specify database user and group information in the connection string. The connection string can be configured by adding SAML Attribute elements to your IdP. Configuring these optional attributes enables internal users to conveniently avoid providing the DbUser and DbGroup parameters when they log in to Amazon Redshift.

The user-name attribute can be set up as follows, with a user ID (for example, nancy) or an email address (for example, nancy@examplecorp.com):

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbUser">  
  <AttributeValue>user-name</AttributeValue>
</Attribute>

The AutoCreate attribute can be defined as follows:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/AutoCreate">
    <AttributeValue>true</AttributeValue>
</Attribute>

The sales_grp database group can be included as follows:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbGroups">
    <AttributeValue>sales_grp</AttributeValue>
</Attribute>

For more information about attribute element configuration, see Configure SAML Assertions for Your IdP.

Step 2: Create IAM roles for access to the Amazon Redshift cluster

The next step is to create IAM policies with permissions to call GetClusterCredentials and provide authorization for Amazon Redshift resources. To grant a SQL client the ability to retrieve the cluster endpoint, region, and port automatically, include the redshift:DescribeClusters action with the Amazon Redshift cluster resource in the IAM role.  For example, users can connect to the Amazon Redshift cluster using a JDBC URL without the need to hardcode the Amazon Redshift endpoint:

Previous:  jdbc:redshift://endpoint:port/database

Current:  jdbc:redshift:iam://clustername:region/dbname

Use IAM to create the following policies. You can also use an existing user or role and assign these policies. For example, if you already created an IAM role for IdP access, you can attach the necessary policies to that role. Here is the policy created for sales users for this example:

Sales_DW_IAM_Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:userid": "AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp"
            ]
        }
    ]
}

The policy uses the following parameter values:

  • Region: us-west-2
  • AWS Account: 123456789012
  • Cluster name: examplecorp-dw
  • Database group: sales_grp
  • IAM role: AIDIODR4TAW7CSEXAMPLE

Each statement in the policy is described below.

{
  "Effect": "Allow",
  "Action": [
    "redshift:DescribeClusters"
  ],
  "Resource": [
    "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
  ]
}

This statement allows users to retrieve the cluster endpoint, region, and port automatically for the Amazon Redshift cluster examplecorp-dw. The resource is specified in the format arn:aws:redshift:region:account-id:cluster:clustername. For example, the SQL client JDBC URL can be specified in the format jdbc:redshift:iam://clustername:region/dbname.

For more information, see Amazon Resource Names.

{
  "Effect": "Allow",
  "Action": [
    "redshift:GetClusterCredentials"
  ],
  "Resource": [
    "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
    "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
  ],
  "Condition": {
    "StringEquals": {
      "aws:userid": "AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com"
    }
  }
}

This statement generates a temporary token to authenticate into the examplecorp-dw cluster. "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}" restricts the corporate user name to the database user name for that user. The resource is specified in the format arn:aws:redshift:region:account-id:dbuser:clustername/dbusername.

The Condition block enforces that the AWS user ID must match "AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com", so that individual users can authenticate only as themselves. The AIDIODR4TAW7CSEXAMPLE role has the Sales_DW_IAM_Policy policy attached.

{
  "Effect": "Allow",
  "Action": [
    "redshift:CreateClusterUser"
  ],
  "Resource": [
    "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
  ]
}

This statement automatically creates database users in examplecorp-dw when they log in for the first time. Subsequent logins reuse the existing database user.

{
  "Effect": "Allow",
  "Action": [
    "redshift:JoinGroup"
  ],
  "Resource": [
    "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp"
  ]
}

This statement allows sales users to join the sales_grp database group through the resource "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp", which is specified in the format arn:aws:redshift:region:account-id:dbgroup:clustername/dbgroupname.

Similar policies can be created for Data Science users with access to join the data_science_grp group in examplecorp-dw. You can now attach the Sales_DW_IAM_Policy policy to the role that is mapped to the IdP application for SSO. For more information about how to define the claim rules, see Configuring SAML Assertions for the Authentication Response.

Because partners are not authorized using Active Directory, they are provided with IAM credentials and added to the partner_grp database group. The Partner_DW_IAM_Policy is attached to the IAM users for partners. The following policy allows partners to log in using the IAM user name as the database user name.

Partner_DW_IAM_Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ],
            "Condition": {
                "StringEquals": {
                    "redshift:DbUser": "${aws:username}"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/partner_grp"
            ]
        }
    ]
}

"redshift:DbUser": "${aws:username}" forces an IAM user to use the IAM user name as the database user name.

With the previous steps configured, you can now establish the connection to Amazon Redshift through JDBC- or ODBC-based SQL clients.

Step 3: Set up database user access

Before you start connecting to Amazon Redshift using the SQL client, set up the database groups for appropriate data access. Log in to your Amazon Redshift database as superuser to create a database group, using CREATE GROUP.

Log in to examplecorp-dw/analytics as a superuser and create the following groups:

CREATE GROUP sales_grp;
CREATE GROUP datascience_grp;
CREATE GROUP partner_grp;

Use the GRANT command to define access permissions to database objects (tables/views) for the preceding groups.
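
For example, a minimal set of grants for the sales team might look like the following (the sales schema name is hypothetical):

GRANT USAGE ON SCHEMA sales TO GROUP sales_grp;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO GROUP sales_grp;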

Step 4: Connect to Amazon Redshift using the JDBC SQL client

Assume that sales user “nancy” is using the SQL Workbench client and JDBC driver to log in to the Amazon Redshift data warehouse. The following steps help set up the client and establish the connection:

  1. Download the latest Amazon Redshift JDBC driver from the Configure a JDBC Connection page
  2. Build the JDBC URL with the IAM option in the following format:
    jdbc:redshift:iam://examplecorp-dw:us-west-2/sales_db

Because the redshift:DescribeClusters action is assigned to the preceding IAM roles, it automatically resolves the cluster endpoints and the port. Otherwise, you can specify the endpoint and port information in the JDBC URL, as described in Configure a JDBC Connection.

Identify the following JDBC options for providing the IAM credentials (see the “Prepare your environment” section) and configure in the SQL Workbench Connection Profile:

plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider
idp_host=demo.examplecorp.com (the name of the corporate identity provider host)
idp_port=443 (the port of the corporate identity provider host)
user=examplecorp\nancy (corporate user name)
password=*** (corporate user password)

The SQL workbench configuration looks similar to the following screenshot:

Now, “nancy” can connect to examplecorp-dw by authenticating with the corporate Active Directory. Because the SAML attribute elements are already configured for nancy, she logs in as database user nancy and is assigned to the sales_grp database group. Similarly, other Sales and Data Science users can connect to the examplecorp-dw cluster. A custom Amazon Redshift ODBC driver can also be used to connect using a SQL client. For more information, see Configure an ODBC Connection.
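
After connecting, nancy can confirm the database user that the temporary credentials resolved to, for example:

SELECT current_user;   -- returns nancy when the SAML DbUser attribute is set to nancy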

Step 5: Connecting to Amazon Redshift using JDBC SQL Client and IAM Credentials

This optional step is necessary only when you want to enable users that are not authenticated with Active Directory. Partners are provided with IAM credentials that they can use to connect to the examplecorp-dw Amazon Redshift cluster. These IAM users have the Partner_DW_IAM_Policy attached, which allows them to join the partner_grp database group in Amazon Redshift. The following JDBC URL enables them to connect to the Amazon Redshift cluster:

jdbc:redshift:iam://examplecorp-dw:us-west-2/analytics?AccessKeyID=XXX&SecretAccessKey=YYY&DbUser=examplecorpsalespartner&DbGroup=partner_grp&AutoCreate=true

The AutoCreate option automatically creates a new database user the first time the partner logs in. There are several other options available to conveniently specify the IAM user credentials. For more information, see Options for providing IAM credentials.

Step 6: Connecting to Amazon Redshift using an ODBC client for Microsoft Windows

Assume that another sales user “uma” is using an ODBC-based client to log in to the Amazon Redshift data warehouse using Example Corp Active Directory. The following steps help set up the ODBC client and establish the Amazon Redshift connection in a Microsoft Windows operating system connected to your corporate network:

  1. Download and install the latest Amazon Redshift ODBC driver.
  2. Create a system DSN entry.
    1. In the Start menu, locate the driver folder or folders:
      • Amazon Redshift ODBC Driver (32-bit)
      • Amazon Redshift ODBC Driver (64-bit)
      • If you installed both drivers, you have a folder for each driver.
    2. Choose ODBC Administrator, and then type your administrator credentials.
    3. To configure the driver for all users on the computer, choose System DSN. To configure the driver for your user account only, choose User DSN.
    4. Choose Add.
  3. Select the Amazon Redshift ODBC driver, and choose Finish. Configure the following attributes:
    Data Source Name = any friendly name to identify the ODBC connection
    Database = analytics
    User = uma (corporate user name)
    Auth Type = Identity Provider: AD FS
    Password = leave blank (Windows automatically authenticates)
    Cluster ID = examplecorp-dw
    idp_host = demo.examplecorp.com (the name of the corporate IdP host)

This configuration looks like the following:

  4. Choose OK to save the ODBC connection.
  5. Verify that uma is set up with the SAML attributes, as described in the “Set up IdPs and federation” section.

The user uma can now use this ODBC connection to connect to the Amazon Redshift cluster with any ODBC-based or reporting tools, such as Tableau. Internally, uma authenticates using the IAM role that has the Sales_DW_IAM_Policy attached and is assigned to the sales_grp database group.

Step 7: Connecting to Amazon Redshift using Python and IAM credentials

To enable partners to connect to the examplecorp-dw cluster programmatically, use Python on a computer such as an Amazon EC2 instance. Reuse the IAM users that are attached to the Partner_DW_IAM_Policy policy defined in Step 2.

The following steps show this set up on an EC2 instance:

  1. Launch a new EC2 instance with an IAM role that has the Partner_DW_IAM_Policy attached, as described in Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances. Alternatively, you can attach an existing IAM role to the EC2 instance.
  2. This example uses Python PostgreSQL Driver (PyGreSQL) to connect to your Amazon Redshift clusters. To install PyGreSQL on Amazon Linux, use the following command as the ec2-user:
    sudo easy_install pip
    sudo yum install postgresql postgresql-devel gcc python-devel
    sudo pip install PyGreSQL

  3. The following code snippet demonstrates programmatic access to Amazon Redshift for partner users:
    #!/usr/bin/env python
    """
    Usage:
    python redshift-unload-copy.py <config file> <region>
    
    * Copyright 2014, Amazon.com, Inc. or its affiliates. All Rights Reserved.
    *
    * Licensed under the Amazon Software License (the "License").
    * You may not use this file except in compliance with the License.
    * A copy of the License is located at
    *
    * http://aws.amazon.com/asl/
    *
    * or in the "license" file accompanying this file. This file is distributed
    * on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
    * express or implied. See the License for the specific language governing
    * permissions and limitations under the License.
    """
    
    import sys
    import pg
    import boto3
    
    REGION = 'us-west-2'
    CLUSTER_IDENTIFIER = 'examplecorp-dw'
    DB_NAME = 'sales_db'
    DB_USER = 'examplecorpsalespartner'
    
    options = """keepalives=1 keepalives_idle=200 keepalives_interval=200
                 keepalives_count=6"""
    
    set_timeout_stmt = "set statement_timeout = 1200000"
    
    def conn_to_rs(host, port, db, usr, pwd, opt=options, timeout=set_timeout_stmt):
        rs_conn_string = """host=%s port=%s dbname=%s user=%s password=%s
                             %s""" % (host, port, db, usr, pwd, opt)
        print "Connecting to %s:%s:%s as %s" % (host, port, db, usr)
        rs_conn = pg.connect(dbname=rs_conn_string)
        rs_conn.query(timeout)
        return rs_conn
    
    def main():
        # describe the cluster and fetch the IAM temporary credentials
        global redshift_client
        redshift_client = boto3.client('redshift', region_name=REGION)
        response_cluster_details = redshift_client.describe_clusters(ClusterIdentifier=CLUSTER_IDENTIFIER)
        response_credentials = redshift_client.get_cluster_credentials(DbUser=DB_USER,DbName=DB_NAME,ClusterIdentifier=CLUSTER_IDENTIFIER,DurationSeconds=3600)
        rs_host = response_cluster_details['Clusters'][0]['Endpoint']['Address']
        rs_port = response_cluster_details['Clusters'][0]['Endpoint']['Port']
        rs_db = DB_NAME
        rs_iam_user = response_credentials['DbUser']
        rs_iam_pwd = response_credentials['DbPassword']
        # connect to the Amazon Redshift cluster
        conn = conn_to_rs(rs_host, rs_port, rs_db, rs_iam_user,rs_iam_pwd)
        # execute a query
        result = conn.query("SELECT sysdate as dt")
        # fetch results from the query
        for dt_val in result.getresult() :
            print dt_val
        # close the Amazon Redshift connection
        conn.close()
    
    if __name__ == "__main__":
        main()

You can save this Python program in a file (redshiftscript.py) and execute it at the command line as ec2-user:

python redshiftscript.py

Now partners can connect to the Amazon Redshift cluster using the Python script, and authentication is federated through the IAM user.

Summary

In this post, I demonstrated how to use federated access using Active Directory and IAM roles to enable single sign-on to an Amazon Redshift cluster. I also showed how partners outside an organization can be managed easily using IAM credentials.  Using the GetClusterCredentials API action, now supported by Amazon Redshift, lets you manage a large number of database users and have them use corporate credentials to log in. You don’t have to maintain separate database user accounts.

Although this post demonstrated the integration of IAM with AD FS and Active Directory, you can replicate this solution with your choice of SAML 2.0–compliant third-party identity providers (IdPs), such as PingFederate or Okta. For the different supported federation options, see Configure SAML Assertions for Your IdP.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to establish federated access to your AWS resources by using Active Directory user attributes.


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.