Tag Archives: Amazon Redshift

Synchronize and control your Amazon Redshift clusters maintenance windows

Post Syndicated from Ahmed Gamaleldin original https://aws.amazon.com/blogs/big-data/synchronize-and-control-your-amazon-redshift-clusters-maintenance-windows/

Amazon Redshift is a data warehouse that can expand to exabyte-scale. Today, tens of thousands of AWS customers (including NTT DOCOMO, Finra, and Johnson & Johnson) use Amazon Redshift to run mission-critical business intelligence dashboards, analyze real-time streaming data, and run predictive analytics jobs.

Amazon Redshift powers analytical workloads for Fortune 500 companies, startups, and everything in between. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business. Availability, therefore, is key in continuing to drive customer success, and AWS will use commercially reasonable efforts to make Amazon Redshift available with a Monthly Uptime Percentage for each multi-node cluster, during any monthly billing cycle, of at least 99.9% (our “Service Commitment”).

In this post, we present a solution to help you provide a predictable and repeatable experience to your Amazon Redshift end-users by taking control of recurring Amazon Redshift maintenance windows.

Amazon Redshift maintenance windows

Amazon Redshift periodically performs maintenance to apply fixes, enhancements, and new features to your cluster. This type of maintenance occurs during a 30-minute maintenance window set by default per Region from an 8-hour block on a random day of the week. You should change the scheduled maintenance window according to your business needs by modifying the cluster, either programmatically or by using the Amazon Redshift console. The window must be at least 30 minutes and not longer than 24 hours. For more information, see Managing clusters using the console.

If a maintenance event is scheduled for a given week, it starts during the assigned 30-minute maintenance window. While Amazon Redshift performs maintenance, it terminates any queries or other operations that are in progress. If there are no maintenance tasks to perform during the scheduled maintenance window, your cluster continues to operate normally until the next scheduled maintenance window. Amazon Redshift uses Amazon Simple Notification Service (Amazon SNS) to send notifications of Amazon Redshift events. You enable notifications by creating an Amazon Redshift event subscription. You can create an Amazon Redshift event notification subscription so you can be notified when an event occurs for a given cluster.

When an Amazon Redshift cluster is scheduled for maintenance, you receive an Amazon Redshift “Pending” event, as described in the following table.

Amazon Redshift category Event ID Event severity Description
Pending REDSHIFT-EVENT-2025 INFO Your database for cluster <cluster name> will be updated between <start time> and <end time>. Your cluster will not be accessible. Plan accordingly.
Pending REDSHIFT-EVENT-2026 INFO Your cluster <cluster name> will be updated between <start time> and <end time>. Your cluster will not be accessible. Plan accordingly.

Amazon Redshift also gives you the option to reschedule your cluster’s maintenance window by deferring your upcoming maintenance by up to 45 days. This option is particularly helpful if you want to maximize your cluster uptime by deferring a future maintenance window. For example, if your cluster’s maintenance window is set to Wednesday 8:30–9:00 UTC and you need to have nonstop access to your cluster for the next 2 weeks, you can defer maintenance to a date 2 weeks from now. We don’t perform any maintenance on your cluster when you have specified a deferment.

Deferring a maintenance window doesn’t apply to mandatory Amazon Redshift updates, such as vital security patches. Scheduled maintenance is different from Amazon Redshift mandatory maintenance. If Amazon Redshift needs to update hardware or make other mandatory updates during your period of deferment, we notify you and make the required changes. Your cluster isn’t available during these updates and such maintenance can’t be deferred. If a hardware replacement is required, you receive an event notification through the AWS Management Console and your SNS subscription as a “Pending” item, as shown in the following table.

Amazon Redshift category Event ID Event severity Description
Pending REDSHIFT-EVENT-3601 INFO A node on your cluster <cluster name> will be replaced between <start time> and <end time>. You can’t defer this maintenance. Plan accordingly.
Pending REDSHIFT-EVENT-3602 INFO A node on your cluster <cluster name> is scheduled to be replaced between <start time> and <end time>. Your cluster will not be accessible. Plan accordingly.

Challenge

As the number of the Amazon Redshift clusters you manage for an analytics application grows, the end-user experience becomes highly dependent on recurring maintenance. This means that you want to make sure maintenance windows across all clusters happen at the same time and fall on the same day each month. You want to avoid a situation in which, for example, your five Amazon Redshift clusters are each updated on a different day or week. Ultimately, you want to give your Amazon Redshift end-users an uninterrupted number of days where the clusters aren’t subject to any scheduled maintenance. This gives you the opportunity to announce a scheduled maintenance to your users well ahead of the maintenance date.

Amazon Redshift clusters are scheduled for maintenance depending on several factors, including when a cluster was created and its Region. For example, you may have two clusters on the same Amazon Redshift version that are scheduled for different maintenance windows. Regions implement the Amazon Redshift latest versions and patches at different times. A cluster running in us-east-1 might be scheduled for the same maintenance 1 week before another cluster in eu-west-1. This makes it harder for you to provide your end-users with a predictable maintenance schedule.

To solve this issue, you typically need to frequently check and synchronize the maintenance windows across all your clusters to a specific time, for example sat:06:00-sat:06:30. Additionally, you want to avoid having your clusters scheduled for maintenance at different intervals. For that, you need to defer maintenance across all your clusters to fall on the same exact day. For example, you can defer all maintenance across all clusters to happen 1 month from now regardless of when the cluster was last updated. This way you know that your clusters aren’t scheduled for maintenance for the next 30 days. This gives you enough time to announce the maintenance schedule to your Amazon Redshift users.

Solution overview

Having one day per month when Amazon Redshift scheduled maintenance occurs across all your clusters provides your users with a seamless experience. It gives you control and predictability over when clusters aren’t available. You can announce this maintenance window ahead of time to avoid sudden interruptions.

The following solution deploys an AWS Lambda function (RedshiftMaintenanceSynchronizer) to your AWS account. The function runs on a schedule configurable through an AWS CloudFormation template parameter frequency to run every 6, 12, or 24 hours. This function synchronizes maintenance schedules across all your Amazon Redshift clusters to happen at the same time on the same day. It also gives you the option to defer all future maintenance windows across all clusters by a number of days (deferment days) for up to 45 days. This enables you to provide your users with an uninterrupted number of days when your clusters aren’t subject to maintenance.

We use the following input parameters:

  • Deferment days – The number of days to defer all future scheduled maintenance windows. Amazon Redshift can defer maintenance windows by up to 45 days. The solution adds this number to the date of the last successfully completed maintenance across any of your clusters. The resulting date is the new maintenance date for all your clusters.
  • Frequency – The frequency to run this solution. You can configure it to run every 6, 12, or 24 hours.
  • Day – The preferred day of the week to schedule maintenance windows.
  • Hour – The preferred hour of the day to schedule maintenance windows (24-hour format).
  • Minute – The preferred minute of the hour to schedule maintenance windows.

The Lambda function performs the following steps:

  1. Lists all your Amazon Redshift clusters in the Region.
  2. Updates the maintenance window for all clusters to the same value of the day/hour/minute input parameters from the CloudFormation template.
  3. Checks for the last successfully completed maintenance across all clusters.
  4. Calculates the deferment maintenance date by adding the deferment input parameter to the date of the last successfully completed maintenance. For example, if the deferment parameter is 30 days and the last successful maintenance window completed on July 1, 2020, then the next deferment maintenance date is July 31, 2020.
  5. Defers the next maintenance window across all clusters to the deferment date calculated in the previous step.

Launch the solution

To get started, deploy the cloudFormation template to your AWS account.

  1. For Stack name, enter a name for your stack for easy reference.
  2. For Day, choose the day of week for the maintenance window to occur.
  3. For Hour, choose the hour for the window to start.
  4. For Minute, choose the minute within the hour for the window to start.

Your window should be a time with the least cluster activity and during off-peak work hours.

  1. For Deferment Days, choose the number of days to defer all future scheduled maintenance windows.
  2. For Solution Run Frequency, choose the frequency of 6, 12, or 24 hours.

cloudFormation Template

After the template has successfully deployed, the following resources are available:

  • The RedshiftMaintenanceSync Lambda function. This is a Python 3.8 Lambda function that syncs and defers the maintenance windows across all your Amazon Redshift clusters.

lamda_console

  • The RedshiftMaintenanceSyncEventRule Amazon CloudWatch event rule. This rule triggers on a schedule based on the Frequency input parameter. It triggers the RedshiftMaintenanceSync Lambda function to run the solution logic.

When the Lambda function starts and detects an already available deferment on any of your clusters, it doesn’t attempt to modify the existing deferment, and exits instead.

The solution logs any deferment it performs on any cluster in the associated CloudWatch log group.

Synchronize and defer maintenance windows

To demonstrate this solution, I have two Amazon Redshift clusters with different preferred maintenance windows and scheduled intervals.

Cluster A (see the following screenshot) has a preferred maintenance window set to Friday at 4:30–5:00 PM. It’s scheduled for maintenance in 2 days.

Cluster B has a preferred maintenance window set to Tuesday at 9:45–10:15 AM. It’s scheduled for maintenance in 6 days.

This means that my Amazon Redshift users have a 30-minute interruption twice in the next 2 and 6 days. Also, these interruptions happen at completely different times.

Let’s launch the solution and see how the cluster maintenance windows and scheduling intervals change.

The Lambda function does the following every time it runs:

  1. Lists all the clusters.
  2. Checks if any of the clusters have a deferment enabled and exits if it finds any.
  3. Syncs all preferred maintenance windows across all clusters to fall on the same time and day of week.
  4. Checks for the latest successfully completed maintenance date.
  5. Adds the deferment days to the last maintenance date. This becomes the new deferment date.
  6. Applies the new deferment date to all clusters.

Cluster A’s maintenance window was synced to the input parameters I passed to the CloudFormation template. Additionally, the next maintenance window was deferred until June 21, 2021, 8:06 AM (UTC +02:00).

Cluster B’s maintenance window is also synced to the same value from Cluster A, and the next maintenance window was deferred to the same exact day as Cluster A: June 21, 2021, 8:06 AM (UTC +02:00).

Finally, let’s check the CloudWatch log group to understand what the solution did.

Now both clusters have the same maintenance window and next scheduled maintenance, which is a month from now. In a production scenario, this gives you enough lead time to announce the maintenance window to your Amazon Redshift end-users and provide them with the exact day and time when clusters aren’t available.

Summary

This solution can help you provide a predictable and repeatable experience to your Amazon Redshift end-users by taking control of recurring Amazon Redshift maintenance windows. It enables you to provide your users with an uninterrupted number of days where clusters are always available—barring any scheduled mandatory upgrades. Click here to get started with Amazon Redshift today.


About the Author

Ahmed_Gamaleldin

Ahmed Gamaleldin is a Senior Technical Account Manager (TAM) at Amazon Web Services. Ahmed helps customers run optimized workloads on AWS and make the best out of their cloud journey.

Accelerate your data warehouse migration to Amazon Redshift – Part 2

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-2-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the second post in a multi-part series. We’re excited to shared dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially reduce your overall cost to migrate to Amazon Redshift. Check out the first post Accelerate your data warehouse migration to Amazon Redshift – Part 1 to learn more about automated macro conversion, case-insensitive string comparison, case-sensitive identifiers, and other exciting new features.

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other services like Amazon EMR, Amazon Athena, and Amazon SageMaker to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating their self-managed data warehouse engines to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of historical data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.

Until now, migrating a data warehouse to AWS was complex and involved a significant amount of manual effort.

Today, we’re happy to share additional enhancements to the AWS Schema Conversion Tool (AWS SCT) to automate your migrations to Amazon Redshift. These enhancements reduce the recoding needed for your data tables, and more importantly, the manual work needed for views, stored procedures, scripts, and other application code that use those tables.

In this post, we introduce automation for INTERVAL and PERIOD data types, automatic type casting, binary data support, and some other enhancements that have been requested by customers. We show you how to use AWS SCT to convert objects from a Teradata data warehouse and provide links to relevant documentation so you can continue exploring these new capabilities.

INTERVAL data types

An INTERVAL is an unanchored duration of time, like “1 year” or “2 hours,” that doesn’t have a specific start or end time. In Teradata, INTERVAL data is implemented as 13 distinct data types depending on the granularity of time being represented. The following table summarizes these types.

Year intervals Month intervals Day intervals Hour intervals Minute intervals Second intervals

INTERVAL YEAR

INTERVAL YEAR TO MONTH

INTERVAL MONTH

 

INTERVAL DAY

INTERVAL DAY TO HOUR

INTERVAL DAY TO MINUTE

INTERVAL DAY TO SECOND

INTERVAL HOUR

INTERVAL HOUR TO MINUTE

INTERVAL HOUR TO SECOND

INTERVAL MINUTE

INTERVAL MINUTE TO SECOND

INTERVAL SECOND

 

Amazon Redshift doesn’t support INTERVAL data types natively. Previously, if you used INTERVAL types in your data warehouse, you had to develop custom code as part of the database conversion process.

Now, AWS SCT automatically converts INTERVAL data types for you. AWS SCT converts an INTERVAL column into a CHARACTER VARYING column in Amazon Redshift. Then AWS SCT converts your application code that uses the column to emulate the INTERVAL semantics.

For example, consider the following Teradata table, which has a MONTH interval column. This table store different types of leaves of absences and the allowable duration for each.

CREATE TABLE testschema.loa_durations (
  loa_type_id INTEGER
, loa_name VARCHAR(100) CHARACTER SET LATIN
, loa_duration INTERVAL MONTH(2))
PRIMARY INDEX (loa_type_id);

AWS SCT converts the table to Amazon Redshift as follows. Because Amazon Redshift doesn’t have a native INTERVAL data type, AWS SCT replaces it with a VARCHAR data type.

CREATE TABLE testschema.loa_durations(
  loa_type_id INTEGER
, loa_name VARCHAR(100)
, loa_duration VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY
(
loa_type_id
)
SORTKEY
(
loa_type_id
);

Now, let’s suppose your application code uses the loa_duration column, like the following Teradata view. Here, the INTERVAL MONTH field is added to the current date to compute when a leave of absence ends if it starts today.

REPLACE VIEW testschema.loa_projected_end_date AS
SELECT
  loa_type_id loa_type_id
, loa_name loa_name
, loa_duration
, current_date AS today
, current_date + loa_duration AS end_date
FROM
testschema.loa_durations
;

Because the data is stored as CHARACTER VARYING, AWS SCT injects the proper type CAST into the Amazon Redshift code to interpret the string values as a MONTH interval. It then converts the arithmetic using Amazon Redshift date functions.

CREATE OR REPLACE VIEW testschema.loa_projected_end_date (loa_type_id, loa_name, loa_duration, today, end_date) AS
SELECT
  loa_type_id AS loa_type_id
, loa_name AS loa_name
, loa_duration
, CURRENT_DATE AS today
, dateadd(MONTH, CAST (loa_duration AS INTEGER),CURRENT_DATE)::DATE AS end_date
FROM testschema.loa_durations
;

Also, as a bonus, AWS SCT automatically converts any literal INTERVAL values that you might be using in your code.

For example, consider the following Teradata table. The table contains a DATE column, which records the last date when an employee was promoted.

CREATE TABLE TESTSCHEMA.employees (
  id INTEGER
, name VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC
, manager_id INTEGER
, last_promo_date DATE FORMAT 'YY/MM/DD'
)
UNIQUE PRIMARY INDEX ( id );

Now, suppose the database contains a view that computes the next date an employee is eligible for a promotion. We implement a business rule that employees who have never been promoted are eligible for promotion in 1.5 years. All other employees become eligible 2.5 years after their last promotion. See the following code:

REPLACE VIEW testschema.eligible_for_promo AS
SELECT 
  id
, name
, last_promo_date
, CASE WHEN last_promo_date is NULL THEN current_date + INTERVAL '18' MONTH
       ELSE last_promo_date + INTERVAL '2-06' YEAR TO MONTH
  END eligible_date
FROM employees
;

AWS SCT converts the INTERVAL values used in the CASE statement and translates the date expressions accordingly:

CREATE OR REPLACE VIEW testschema.eligible_for_promo (id, name, last_promo_date, eligible_date) AS
SELECT
  id
, name
, last_promo_date
, CASE
    WHEN last_promo_date IS NULL THEN dateadd(MONTH, 18, CURRENT_DATE)::DATE
    ELSE dateadd(MONTH, 30, last_promo_date)::DATE
  END AS eligible_date
FROM testschema.employees
;

We’re excited about INTERVAL automation in AWS SCT and encourage you to give it a try. For more information about getting started with AWS SCT, see Installing, verifying, and updating AWS SCT.

PERIOD data type

A PERIOD data value represents a duration of time with a specified start and end. For example, the Teradata literal “(2021-01-01 to 2021-01-31)” is a period with a duration of 31 days that starts and ends on the first and last day of January 2021, respectively. PERIOD data types can have three different granularities: DATE, TIME, or TIMESTAMP. The following table provides some examples.

Period Type Example
PERIOD(DATE) “(2021-01-01 to 2021-01-31) “
PERIOD(TIME) “(12:00:00 to 13:00:00)”
PERIOD(TIMESTAMP) “(2021-01-31 00:00:00 to 2021-01-31 23:59:59)”

As with INTERVAL, the PERIOD data type isn’t natively supported by Amazon Redshift. Previously, if you used these data types in your tables, you had to write custom code as part of the database conversion process.

Now, AWS SCT automatically converts PERIOD data types for you. AWS SCT converts a PERIOD column into two DATE (or TIME or TIMESTAMP) columns as appropriate on Amazon Redshift. Then AWS SCT converts your application code that uses the column to emulate the source engine semantics.

For example, consider the following Teradata table:

CREATE SET TABLE testschema.period_table (
  id INTEGER
, period_col PERIOD(timestamp)) 
UNIQUE PRIMARY INDEX (id);

AWS SCT converts the PERIOD(TIMESTAMP) column into two TIMESTAMP columns in Amazon Redshift:

CREATE TABLE IF NOT EXISTS testschema.period_table(
  id INTEGER
, period_col_begin TIMESTAMP
, period_col_end TIMESTAMP
)
DISTSTYLE KEY
DISTKEY
(id)
SORTKEY
(id);

Now, let’s look at a simple example of how you can use AWS SCT to convert your application code. A common operation in Teradata is to extract the starting (or ending) timestamps in a PERIOD value using the BEGIN and END built-in functions:

REPLACE VIEW testschema.period_view_begin_end AS 
SELECT 
  BEGIN(period_col) AS period_start
, END(period_col) AS period_end 
FROM testschema.period_table
;

AWS SCT converts the view to reference the transformed table columns:

CREATE OR REPLACE VIEW testschema.period_view_begin_end (period_start, period_end) AS
SELECT
  period_col_begin AS period_start
, period_col_end AS period_end
FROM testschema.period_table;

We’ll continue to build automation for PERIOD data conversion, so stay tuned for more improvements. In the meantime, you can try out the PERIOD data type conversion features in AWS SCT now. For more information, see Installing, verifying, and updating AWS SCT.

Type casting

Some data warehouse engines, like Teradata, provide an extensive set of rules to cast data values in expressions. These rules permit implicit casts, where the target data type is inferred from the expression, and explicit casts, which typically use a function to perform the type conversion.

Previously, you had to manually convert implicit cast operations in your SQL code. Now, we’re happy to share that AWS SCT automatically converts implicit casts as needed. This feature is available now for the following set of high-impact Teradata data types.

Category Source data type Target data types
Numeric CHAR BIGINT
NUMBER
TIMESTAMP
VARCHAR NUMBER
NUMERIC
DEC
CHAR
GEOMETRY
INTEGER DATE
DEC
BIGINT DATE
NUMBER CHARACTER
VARCHAR
DEC
DECIMAL DATE
TIMESTAMP
SMALLINT
DOUBLE PRECISION
FLOAT DEC
Time DATE BIGINT
INTEGER
DECIMAL
FLOAT
NUMBER
CHARACTER
TIMESTAMP
INTERVAL NUMBER
BIGINT
INTEGER
Other GEOMETRY DECIMAL

Let’s look at how to cast numbers to DATE. Many Teradata applications treat numbers and DATE as equivalent values. Internally, Teradata stores DATE values as INTEGER. The rules to convert between an INTEGER and a DATE are well-known and developers have commonly exploited this information to perform date calculations using INTEGER arithmetic.

For example, consider the following Teradata table:

CREATE TABLE testschema.employees (
  id INTEGER
, name VARCHAR(20) CHARACTER SET LATIN
, manager_id INTEGER
, last_promo_date DATE FORMAT 'YY/MM/DD')
UNIQUE PRIMARY INDEX ( id );

We insert a single row of data into the table:

select * from employees;

 *** Query completed. One row found. 4 columns returned. 
 *** Total elapsed time was 1 second.

         id  name                   manager_id  last_promo_date
-----------  --------------------  -----------  ---------------
        112  Britney                       201                ?

We use a macro to update the last_promo_date field for id = 112. The macro accepts a BIGINT parameter to populate the DATE field.

replace macro testschema.set_last_promo_date(emp_id integer, emp_promo_date bigint) AS (
update testschema.employees
set last_promo_date = :emp_promo_date
where id = :emp_id;
);

Now, we run the macro and check the value of the last_promo_date attribute:

exec testschema.set_last_promo_date(112, 1410330);

 *** Update completed. One row changed. 
 *** Total elapsed time was 1 second.


select * from employees;

 *** Query completed. One row found. 4 columns returned. 
 *** Total elapsed time was 1 second.

         id  name                   manager_id  last_promo_date
-----------  --------------------  -----------  ---------------
        112  Britney                       201         41/03/30

You can see the last_promo_date attribute is set to the date March 30, 2041.

Now, let’s use AWS SCT to convert the table and macro to Amazon Redshift. As we saw in Part 1 of this series, AWS SCT converts the Teradata macro into an Amazon Redshift stored procedure:

CREATE TABLE IF NOT EXISTS testschema.employees(
  id INTEGER
, name CHARACTER VARYING(20) 
, manager_id INTEGER
, last_promo_date DATE
)
DISTSTYLE KEY
DISTKEY
(id)
SORTKEY
(id);

CREATE OR REPLACE PROCEDURE testschema.set_last_promo_date(par_emp_id INTEGER, par_emp_promo_date BIGINT)
AS $BODY$
BEGIN
    UPDATE testschema.employees
    SET last_promo_date = TO_DATE((par_emp_promo_date + 19000000), 'YYYYMMDD')
        WHERE id = par_emp_id;
END;
$BODY$
LANGUAGE plpgsql;

Note that 20410330 = 1410330 + 19000000; so adding 19,000,000 to the input returns the correct date value 2041-03-30.

Now, when we run the stored procedure, it updates the last_promo_date as expected:

myredshift=# select * from testschema.employees;
 id  |  name   | manager_id | last_promo_date
 112 | Britney |        201 |
(1 row)

myredshift=# call testschema.set_last_promo_date(112, 1410330);
CALL

myredshift=# select * from testschema.employees;
 id  |  name   | manager_id | last_promo_date
 112 | Britney |        201 | 2041-03-30
(1 row)

Automatic data type casting is available in AWS SCT now. You can download the latest version and try it out.

BLOB data

Amazon Redshift doesn’t have native support for BLOB columns, which you use to store large binary objects like text or images.

Previously, if you were migrating a table with a BLOB column, you had to manually move the BLOB values to file storage, like Amazon Simple Storage Service (Amazon S3), then add a reference to the S3 file in the table. Using Amazon S3 as the storage target for binary objects is a best practice because these objects are large and typically have low analytic value.

We’re happy to share that AWS SCT now automates this process for you. AWS SCT replaces the BLOB column with a CHARACTER VARYING column on the target table. Then, when you use the AWS SCT data extractors to migrate your data, the extractors upload the BLOB value to Amazon S3 and insert a reference to the BLOB into the target table.

For example, let’s create a table in Teradata and populate it with some data:

CREATE SET TABLE TESTSCHEMA.blob_table (
  id INTEGER
, blob_col BLOB(10485760))
PRIMARY INDEX ( id );

select * from blob_table;

 *** Query completed. 2 rows found. 2 columns returned. 
 *** Total elapsed time was 1 second.

         id blob_col
----------- ---------------------------------------------------------------
          1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
          2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Now, we convert the table with AWS SCT and build it on Amazon Redshift:

myredshift=# \d testschema.blob_table;
                    Table "testschema.blob_table"
Column  | Type                     | Collation | Nullable | Default 
id      | integer                  |.          |          | 
blob_col | character varying(1300) |           |          |

Then we use the AWS SCT data extractors to migrate the table data from Teradata to Amazon Redshift.

When we look at the table in Amazon Redshift, you can see the paths to the S3 files that contain the BLOB values:

myredshift=# select * from testschema.blob_table;
(2 rows)

 id |                                                               blob_col                                                               
  2 | s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/308b6f0a902941e793212058570cdda5.dat
  1 | s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/a7686067af5549479b52d81e83c3871e.dat

And on Amazon S3, you can see the actual data files. There are two, one for each BLOB value:

$ aws s3 ls s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/
2021-05-13 23:59:47         23 522fee54fda5472fbae790f43e36cba1.dat
2021-05-13 23:59:47         24 5de6c53831f741629476e2c2cbc6b226.dat

BLOB support is available now in AWS SCT and the AWS SCT data extractors. Download the latest version of the application and try it out today.

Multi-byte CHARACTER conversion

Teradata supports multibyte characters in CHARACTER data columns, which are fixed length fields. Amazon Redshift supports multibyte characters in CHARACTER VARYING fields but not in fixed-length CHARACTER columns.

Previously, if you had fixed-length CHARACTER columns, you had to determine if they contained multibyte character data, and increase the target column size as appropriate.

AWS SCT now bridges this gap for you. If your Teradata tables contain CHARACTER columns with multibyte characters, AWS SCT automatically converts these columns to Amazon Redshift CHARACTER VARYING fields and sets the column sizes accordingly. Consider the following example, which contains four columns, a LATIN column that contains only single-byte characters, and UNICODE, GRAPHIC, and KANJISJIS columns that can contain multi-byte characters:

create table testschema.char_table (
  latin_col char(70) character set latin
, unicode_col char(70) character set unicode
, graphic_col char(70) character set graphic
, kanjisjis_col char(70) character set kanjisjis
);

AWS SCT translates the LATIN column to a fixed length CHARACTER column. The multi-byte columns are upsized and converted to CHARACTER VARYING:

CREATE TABLE IF NOT EXISTS testschema.char_table (
  latin_col CHARACTER(70)
, unicode_col CHARACTER VARYING(210)
, graphic_col CHARACTER VARYING(210)
, kanjisjis_col CHARACTER VARYING(210)
)
DISTSTYLE KEY
DISTKEY
(col1)
SORTKEY
(col1);

Automatic conversion for multibyte CHARACTER columns is available in AWS SCT now.

GEOMETRY data type size

Amazon Redshift has long supported geospatial data with a GEOMETRY data type and associated spatial functions.

Previously, Amazon Redshift restricted the maximum size of a GEOMETRY column to 64 KB, which constrained some customers with large objects. Now, we’re happy to share that the maximum size of GEOMETRY objects has been increased to just under 1 MB (specifically, 1,048,447 bytes).

For example, consider the following Teradata table:

create table geometry_table (
 id INTEGER
, geometry_col1 ST_GEOMETRY 
, geometry_col2 ST_GEOMETRY(1000)
, geometry_col3 ST_GEOMETRY(1048447) 
, geometry_col4 ST_GEOMETRY(10484470)
, geometry_col5 ST_GEOMETRY INLINE LENGTH 1000
)
;

You can use AWS SCT to convert it to Amazon Redshift. The converted table definition is as follows. A size specification isn’t needed on the converted columns because Amazon Redshift implicitly sets the column size.

CREATE TABLE IF NOT EXISTS testschema.geometry_table(
id INTEGER,
geometry_col1 GEOMETRY,
geometry_col2 GEOMETRY,
geometry_col3 GEOMETRY,
geometry_col4 GEOMETRY,
geometry_col5 GEOMETRY
)
DISTSTYLE KEY
DISTKEY
(
id
)
SORTKEY
(
id
);
ALTER TABLE testschema.geometry_table ALTER DISTSTYLE AUTO;
ALTER TABLE testschema.geometry_table ALTER SORTKEY AUTO;

Large GEOMETRY columns are available in Amazon Redshift now. For more information, see Querying spatial data in Amazon Redshift.

Conclusion

We’re happy to share these new features with you. If you’re contemplating a migration to Amazon Redshift, these capabilities can help automate your schema conversion and preserve your investment in existing reports, applications, and ETL, as well as accelerate your query performance.

This post described a few of the dozens of new features we have recently introduced to automate your data warehouse migrations to Amazon Redshift. We will share more in upcoming posts. You’ll hear about additional SQL automation, a purpose-built scripting language for Amazon Redshift with BTEQ compatibility, and automated support for proprietary SQL features.

Check back soon for more information. Until then, you can learn more about Amazon Redshift and the AWS Schema Conversion Tool on the AWS website. Happy migrating!


About the Author

Michael Soo is a database engineer with the AWS DMS and AWS SCT team at Amazon Web Services.

How Jobcase is using Amazon Redshift ML to recommend job search content at scale

Post Syndicated from Clay Martin original https://aws.amazon.com/blogs/big-data/how-jobcase-is-using-amazon-redshift-ml-to-recommend-job-search-content-at-scale/

This post is co-written with Clay Martin and Ajay Joshi from Jobcase as the lead authors.

Jobcase is an online community dedicated to empowering and advocating for the world’s workers. We’re the third-largest destination for job search in the United States, and connect millions of Jobcasers to relevant job opportunities, companies, and other resources on a daily basis. Recommendation is central to everything we do.

In this post, Jobcase data scientists discuss how Amazon Redshift ML helped us generate billions of job search recommendations in record time and with improved relevance.

The challenge: Scaling job recommendations

At Jobcase, we have used Amazon Redshift as our primary data warehouse for 8 years. Over the years, we built up a significant amount of historical job seeker and job content interaction data, which is stored in highly optimized compressed tables.

Our recommender system applies machine learning (ML) models to these big datasets, which poses a familiar problem: the data and ML models aren’t colocated on the same compute clusters, which involves moving large amounts of data across networks. In many cases, at batch inference time, data must be shuttled out of—and eventually brought back into—a data warehouse, which can be a time-consuming, expensive, and sometimes error-prone process. This also requires data engineering resources to set up data pipelines, and often becomes a bottleneck for data scientists to perform quick experimentation and drive business value. Until now, this data/model colocation issue has proven to be a big obstacle to applying ML to batch recommendation at scale.

How Amazon Redshift ML helped solve this challenge

Amazon Redshift ML, powered by Amazon SageMaker Autopilot, makes it easy for data analysts and database developers to create, train, and apply ML models using familiar SQL commands in Amazon Redshift data warehouses.

Amazon Redshift ML has proven to be a great solution to some of these problems at Jobcase. With Amazon Redshift ML’s in-database local inference capability, we now perform model inference on billions of records in a matter of minutes, directly in our Amazon Redshift data warehouse. In this post, we talk about our journey to Amazon Redshift ML, our previous attempts to use ML for recommendations, and where we go from here.

What we’re trying to solve

Job search is a unique and challenging domain for recommender system design and implementation. There are extraordinary variables to consider. For instance, many job openings only last for days or weeks, and they must be within a reasonable commute for job seekers. Some jobs require skills that only a subset of our members possess. These constraints don’t necessarily apply to, say, movie recommendations. On the other hand, job preferences are relatively stable; if a member is interested in truck driver jobs on Monday, there’s a good chance they’re still interested on Tuesday.

The Jobcase recommender system is responsible for generating job search content for over 10 million active Jobcasers per day. On average, every day we have about 20–30 million unique job listings in the eligible pool for recommendations. The system runs overnight and generates predictions in batch mode, and is expected to be completed by early morning hours. These recommendations are used throughout the day to engage with Jobcase members through various communication channels like email, SMS, and push notifications.

Our recommendation system

Each communication channel has its own peculiarities and deliverability constraints. To handle these constraints, our recommender system is broken into multiple steps and phases, where the final step is used to fine-tune the model for a particular channel. All of this multi-channel job seeker interaction data resides in our Amazon Redshift cluster.

In phase one, we apply unsupervised learning techniques to reduce the candidate set of items per member. We calculate item-item similarity scores from members’ engagement histories, and use these scores to generate the N most similar items per user. This collaborative filtering phase poses an important design trade-off: filter out too many relevant candidate items, and member engagement drops; filter out too few, and downstream inference remains computationally infeasible.

The second phase is a channel-specific supervised learning phase. It uses the similarity score from the first phase along with other predicted metrics as features and attributes, and tries to directly predict member engagement for that channel. In this example, let’s assume email is the channel and member engagement is captured by the dependent variable email click-through rate (CTR) = email clicks/email sends.

Here we also include job seeker features, such as educational attainment, commute preferences, location, and so on, as well as item or job content features, such as channel-specific historical macro or local engagement rates. Generating predictions for over 10 million job seekers paired to 200–300 items per member requires 2–3 billion predictions for just one channel.

Simplifying ML from within Amazon Redshift without any data movement

Until now, our data/model colocation problem has been challenging to solve from a cost and performance perspective. This is where Amazon Redshift ML has been instrumental in significantly improving the recommender system by enabling billions of non-linear model predictions in just a few minutes.

Before Amazon Redshift ML, we needed to write custom data pipelines to get data out of the data warehouse to Amazon Simple Storage Service (Amazon S3), then to ML inference instances, and finally pipe predictions back into the data warehouse for consumption. This added additional time delays and cost. Historically, it has been a challenge to improve this phase, and we had to rely on relatively simple linear models, optimized via A/B testing and hard-coded into SQL statements.

With Amazon Redshift ML, we were able to bring cutting-edge model classes with in-database local inference capabilities directly into our data warehouse. Therefore, the expressive power of the models that we could fit vastly increased. The following architecture diagram shows how we simplified our data pipeline with Amazon Redshift ML.

Our success story with Amazon Redshift ML

We have previously attempted to move one or both phases of our system out of Amazon Redshift. To improve our collaborative filtering phase, we tried using open-source libraries on Amazon Elastic Compute Cloud (Amazon EC2) instances that implement matrix factorization algorithms and natural language processing (NLP) inspired techniques such as Global Vectors (GloVe), which are distributed word representations. None of these solutions generated enough improvement in terms of member engagement to justify the increased data pipeline complexity, operational time delays, and operational expense. Pipelines to improve supervised user-item scoring had similar difficulties.

When Amazon Redshift ML was released in preview mode December 2020, we spun up a small Amazon Redshift cluster to test its capabilities against our use cases. We were immediately struck by the fact that Amazon Redshift ML makes fitting an XGBoost model or feed-forward neural network as easy as writing a SQL query. When Amazon Redshift ML became GA at the end of May 2021, we set it up in production within a day, and deployed a production model within a week. The following is a sample model that we trained and predicted with Amazon Redshift ML.

The following is the training code:

CREATE MODEL
f_ml_predict_email_content1_ctr
FROM (SELECT * FROM
Email_content1_click_history)
TARGET is_click
FUNCTION f_ml_predict_email_content1_ctr
PROBLEM_TYPE REGRESSION
OBJECTIVE 'MSE';

The following is the prediction code:

UPDATE email_content1_new_data
SET ctr_score = 
f_ml_predict_email_content1_ctr(
 feature1
,feature2,
 . . .
,feature_n)

Now we have several models in production, each performing billions of predictions in Amazon Redshift. The following are some of the key benefits we realized with Amazon Redshift ML:

  • Running model predictions at scale, performing billions of predictions in minutes, which we couldn’t achieve before implementing Amazon Redshift ML
  • Significant reduction in the model development cycle by eliminating the data pipelines
  • Significant reduction in model testing cycles by testing bigger cohort sizes, which helped reach our desired statistical significance quickly
  • Reduced cost by using Amazon Redshift ML’s local in-database inference capability, which saved cost on external ML frameworks and compute cost
  • 5–10% improvement in member engagement rates across several different email template types, resulting in increased revenue

Conclusion

In this post, we described how Amazon Redshift ML helped Jobcase effectively match millions of jobs to over 10 million active Jobcase members on a daily basis.

If you’re an Amazon Redshift user, Amazon Redshift ML provides an immediate and significant value add, with local in-database inference at no additional cost. It gives data scientists the ability to experiment quickly without data engineering dependencies. Amazon Redshift ML currently supports regression and classification model types, but is still able to achieve a great balance between speed, accuracy, complexity, and cost.


About Authors

Clay Martin is a Data Scientist at Jobcase Inc. He designs recommender systems to connect Jobcase members with the most relevant job content.

 

 

Ajay Joshi is Senior Software Engineer at Jobcase Inc. Ajay supports the data analytics and machine learning infrastructure at Jobcase and helps them design and maintain data warehouses powered by Amazon Redshift, Amazon SageMaker, and other AWS services.

 

 

Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data-driven solutions in different database and data warehousing technologies for over 15 years. He loves to learn new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS.

 

 

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Case-insensitive collation support for string processing in Amazon Redshift

Post Syndicated from Mengchu Cai original https://aws.amazon.com/blogs/big-data/case-insensitive-collation-support-for-string-processing-in-amazon-redshift/

Amazon Redshift is a fast, fully managed, cloud-native data warehouse. Tens of thousands of customers have successfully migrated their workloads to Amazon Redshift. We hear from customers that they need case-insensitive collation for strings in Amazon Redshift in order to maintain the same functionality and meet their performance goals when they migrate their existing workloads from legacy, on-premises data warehouses like Teradata, Oracle, or IBM. With that goal in mind, AWS provides an option to create case-insensitive and case-sensitive collation.

In this post, we discuss how to use case-insensitive collation and how to override the default collation. Also, we specifically explain the process to migrate your existing Teradata database using the native Amazon Redshift collation capability.

What is collation?

Collation is a set of rules that tells a database engine how to compare and sort the CHAR and VARCHAR columns data in SQL. A case-insensitive collation ignores the differences between uppercase and lowercase letters for string comparison and sorting, whereas a case-sensitive collation does not. For example, in case-insensitive collation, “A” and “a” are equal. Different operations such as LIKE predicates, group by, order by, Regex, similar to behave based on the collation defined while the stored data keeps its original case.

We can define collation at three levels:

  • Database
  • Column
  • Expression

Database-level collation

You can specify collation when creating a new database. All VARCHAR and CHAR columns in the current database pick up the database-level collation as default if no column-level override exists. If you don’t specify collation for a new database, the default collation continues to be the current semantic of case-sensitive.

To create a database with collation in Amazon Redshift, use the following syntax:

CREATE DATABASE database_name [ WITH ]
[ COLLATE CASE_SENSITIVE | COLLATE CASE_INSENSITIVE ]

To alter a database with collation in Amazon Redshift, use the following syntax:

ALTER DATABASE database_name
COLLATE CASE_SENSITIVE | COLLATE CASE_INSENSITIVE

This action only works when the database is empty. Therefore, it’s important to use collation as a first step before any database objects are created.

To find the collation of an existing database, use the following query.

select db_collation();

This provides output as case_sensitive or case_insensitive.

Column-level collation

You can specify collation for a VARCHAR or CHAR column when creating a new table. This overrides the database-level collation.

To define a table with column-level collation in Amazon Redshift, use the following syntax:

create table T1 (
col1 varchar(20) collate case_sensitive encode zstd,
col2 char(20) collate case_insensitive encode zstd,
col3 varchar(20) encode zstd /* not specified, use collation of current database */
);

To find the collation defined at the individual column level, use the following query:

 SELECT TABLE_NAME, COLUMN_NAME, data_type, COLLATION_NAME
FROM svv_columns WHERE TABLE_NAME = 't1' ORDER BY COLUMN_NAME;

Expression-level collation

You can use the collate function in a SQL query to override collation at the column level and database level.

To use the collate function in Amazon Redshift, use the following syntax. This example code converts col1 to case-insensitive and compares with the value john:

select * from T1 where collate(col1, 'case_insensitive') = 'john';

Solution overview

In the following use case, we discuss how to convert a legacy Teradata database’s collation syntax to Amazon Redshift syntax. In Teradata, based on the session mode, the default CHAR and VARCHAR columns collation changes. We can override this default collation at the column and expression level. In Amazon Redshift, we can define collation at the database, column, and expression levels.

In this use case, let’s consider the migration of two tables, invoice and customer, and the corresponding queries built using these two tables from the Teradata database ci_db. Make sure that ci_db is not an existing database in Amazon Redshift. We perform the following steps to walk through the migration process:

  1. Identify the collation in Teradata.
  2. Set up the Amazon Redshift database and schema.
  3. Set up the Amazon Redshift DDL.
  4. Load sample data.
  5. Test the Reporting queries for collation syntax.

Identify the collation in Teradata

In Teradata, based on the session mode, the default CHAR and VARCHAR column collation changes. If the Teradata session mode is in ANSI mode, the default is case-sensitive; if it’s in Teradata mode, it’s dependent on the DefaultCaseSpec parameter at cluster level. If DefaultCaseSpec parameter is TRUE, the default is case sensitive; if it’s FALSE, it’s case insensitive. By default, DefaultCaseSpec parameter is set to FALSE. We can override this default collation at the column and expression level.

SELECT transaction_mode FROM dbc.sessioninfo
   	WHERE sessionno = SESSION;

If the output is A, it’s in ANSI mode; if its T, it’s in Teradata mode. For this use case, let’s assume that the session is in Teradata mode.

To identify the collation override at the column level, run the following commands on the Teradata environment:

show table ci_db.invoice;
show table ci_db.customer;

We receive the following outputs:

CREATE MULTISET TABLE ci_db.invoice ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      invoice_id VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
      cust_nbr VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC
      )
PRIMARY INDEX ( invoice_id );


CREATE MULTISET TABLE ci_db.customer ,FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO,
     MAP = TD_MAP1
     (
      cust_nbr VARCHAR(100) CHARACTER SET LATIN CASESPECIFIC,
      cust_name VARCHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( cust_nbr );

Set up the Amazon Redshift database and schema

In Amazon Redshift a database contains one or more named schemas. Each schema in a database contains tables and other kinds of named objects. By default, a database has a single schema, which is named PUBLIC. In general, we recommend to create each Teradata database as a schema in Amazon Redshift.

To create similar functionality of case insensitivity at the database level in Amazon Redshift, we create a new database ci_database with CASE_INSENSITIVE collation:

create database ci_database WITH COLLATE CASE_INSENSITIVE; 

After the database is created in Amazon Redshift, connect to the database by altering the database name from the default database to ci_database in the connection details. Let’s create the schema ci_db in the current database ci_database.

The following code creates the schema in Amazon Redshift and sets it as the default schema for the session:

create schema ci_db;
set search_path to ci_db;

Set up Amazon Redshift DDL

Based on the Teradata DDL output from earlier, we can observe the following:

  • The ci_db.invoice table has all the fields as case-insensitive.
  • The ci_db.customer table has cust_nbr as case-sensitive and cust_name as case-insensitive.

Because we created the case-insensitive database in Amazon Redshift, we need to mention the collation override at the column level for the case-sensitive columns only.

To attain similar collation logic in Amazon Redshift, create the tables with the following syntax:

create table ci_db.invoice 
(
invoice_id VARCHAR(100) ENCODE ZSTD
,cust_nbr VARCHAR(100) ENCODE ZSTD
)
DISTKEY ( invoice_id );



create table ci_db.customer 
(
cust_nbr VARCHAR(100) collate case_sensitive ENCODE ZSTD
,cust_name VARCHAR(100) ENCODE ZSTD 
)
DISTKEY ( cust_nbr );

Load sample data

In general, we recommend using the AWS SCT agent tool to migrate the data from Teradata to Amazon Redshift. For this post, let’s load the following sample data into the tables invoice and customer on Amazon Redshift. Run the following insert statements:

insert into ci_db.invoice (invoice_id ,cust_nbr) values ('inv1','a1');
insert into ci_db.invoice (invoice_id ,cust_nbr) values ('INV1','A1');
insert into ci_db.invoice (invoice_id ,cust_nbr) values ('inv2','b1');
insert into ci_db.invoice (invoice_id ,cust_nbr) values ('INV2','B1');

insert into ci_db.customer (cust_nbr,cust_name) values ( 'a1','John');
insert into ci_db.customer (cust_nbr,cust_name) values ( 'A1','David');
insert into ci_db.customer (cust_nbr,cust_name) values ( 'b1','Bob');
insert into ci_db.customer (cust_nbr,cust_name) values ( 'B1','Mary');

Test the Reporting queries for collation syntax

Let’s review the data that we loaded into both the tables:

select * from ci_db.invoice;

invoice_id | cust_nbr 
------------+----------
 inv1       | a1
 INV1       | A1
 inv2       | b1
 INV2       | B1
(4 rows)

select * from ci_db.customer;

cust_nbr | cust_name 
----------+-----------
 a1       | John 
 A1       | David
 b1       | Bob
 B1       | Mary
(4 rows)

Default collation column query

Run the following query on both environments and observe that both inv1 and INV1 invoices are returned because the invoice_id column has the default database collation (case-insensitive):

select * from ci_db.invoice where invoice_id ='inv1';

invoice_id | cust_nbr 
------------+----------
 inv1       | a1
 INV1       | A1
(2 rows)

Case-sensitive collation column query

Run the following query on both environments and observe that only the customer a1 is returned; the customer A1 is ignored because the cust_nbr field collation is case-sensitive:

select * from ci_db.customer where cust_nbr ='a1';

cust_nbr | cust_name 
----------+-----------
 a1       | John
(1 row)

Case-insensitive expression override collation query

Run the following query and observe that by performing an expression-level override to case-insensitive for the case-sensitive column cust_nbr it returns both customers a1 and A1.

The following is the Teradata syntax:

select * from ci_db.customer where cust_nbr (not casespecific) ='a1';
cust_nbr | cust_name 
----------+-----------
 a1       | John
 A1       | David
(2 rows)

The following is the Amazon Redshift syntax:

select * from ci_db.customer where collate(cust_nbr ,'case_insensitive') ='a1';

cust_nbr | cust_name 
----------+-----------
 a1       | John
 A1       | David
(2 rows)

Expression-level override in join condition query

When different collation columns are joined, in Teradata by default it performs a case-sensitive join. To achieve similar functionality in Amazon Redshift, we have to perform a case-sensitive expression-level override for the case-insensitive column. In the following example, the cust_nbr column is case-insensitive in the invoice table, whereas it’s case-sensitive in the customer table.

The following is the Teradata syntax:

select 
 inv.invoice_id
,inv.cust_nbr as cust_nbr_from_invoice
,cust.cust_nbr as cust_nbr_from_customer
,cust.cust_name  

from ci_db.invoice inv
inner join ci_db.customer cust 
on inv.cust_nbr = cust.cust_nbr
where cust.cust_nbr = 'a1'
order by inv.invoice_id;

invoice_id | cust_nbr_from_invoice | cust_nbr_from_customer | cust_name 
------------+-----------------------+------------------------+-----------
 inv1       | a1                    | a1                     | John
(1 row)

The following is the Amazon Redshift syntax:

select 
 inv.invoice_id
,inv.cust_nbr as cust_nbr_from_invoice
,cust.cust_nbr as cust_nbr_from_customer
,cust.cust_name  

from ci_db.invoice inv
inner join ci_db.customer cust 
on collate (inv.cust_nbr , 'case_sensitive') = cust.cust_nbr
where cust.cust_nbr = 'a1'
order by inv.invoice_id;

invoice_id | cust_nbr_from_invoice | cust_nbr_from_customer | cust_name 
------------+-----------------------+------------------------+-----------
 inv1       | a1                    | a1                     | John
(1 row)

Materialized views with column-level override:

To perform complex queries on large tables in Amazon Redshift, we can create materialized views to reduce the time it takes to compute the results. We can create materialized views on top of the Amazon Redshift tables and the column-level collations defined are honored. 

The following code creates a materialized view in Amazon Redshift:

create materialized view customer_mv AS
select collate(cust_nbr,'case_insensitive') as cust_nbr_ci  ,cust_name from ci_db.customer;

Run the following query and observe that both customers a1 and A1 are returned because the materialized view has the case-insensitive override at the materialized view level even though in the base table it’s a case-sensitive override:

SELECT * FROM customer_mv WHERE cust_nbr_ci ='a1';

cust_nbr_ci | cust_name 
------------+-----------
 a1         | John
 A1         | David
(2 rows)

Identify the column-level collations in Teradata

For Teradata to Amazon Redshift migrations, it’s important to identify the list of all columns with column-level override collation logic in Teradata. You can use the following query to identify the collation override at each column level:

select databasename,tablename,columnname,uppercaseflag from dbc.columns where databasename ='ci_db';

If Uppercaseflag shows as C, it’s a case-sensitive column; N means not case-sensitive.

Things to consider

  • All string comparison operators, like, order by and group by clauses, aggregate functions, window functions, and scalar functions, follow the database and column collation
  • If a function or an operation returns the VARCHAR or CHAR type and takes multiple inputs with different collation types (case-sensitive and case-insensitive), you should use the expression-level override
  • For external queries, including Amazon Redshift Spectrum and Amazon Aurora PostgreSQL, federated queries use database-level collation only.

For other details and limitations, see Amazon Redshift Database Collation documentation.

We recommend using AWS SCT to accelerate your Teradata migration to Amazon Redshift. Refer to this blog for more details.

Conclusion

This post demonstrated how to use collation for string processing at the database, column, and expression level in Amazon Redshift. We also walked through migrating existing Teradata database collations to Amazon Redshift syntax.

The Amazon Redshift collation feature for string processing reduces the effort required when migrating from traditional on-premises MPP data warehouses such as Teradata to Amazon Redshift without refactoring your application code. This feature also helps you achieve your performance goals using Amazon Redshift by keeping the on-premises default case-insensitive feature.

We hope you can take advantage of this new Amazon Redshift feature to migrate to the AWS Cloud for database freedom.


About the authors

Mengchu Cai is a principal engineer at AWS. He works on redshift query optimization, query performance and SQL functionality challenges.

 

 

 

Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build data lakes and analytical applications on the AWS Cloud.

 

 

 

Yichao Xue is a Software Engineer with Amazon Redshift Query Processing team. He enjoys solving challenging problems for different components of Redshift, including workload management, Redshift Spectrum, federated queries, and recently case-insensitive collation. Outside work, he likes reading, watching movies, listening to music and traveling around the world.

Data preparation using Amazon Redshift with AWS Glue DataBrew

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/data-preparation-using-amazon-redshift-with-aws-glue-databrew/

With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, Amazon Aurora, and other Amazon Relational Database Service (Amazon RDS) databases. You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this post, we use DataBrew to clean data from an Amazon Redshift table, and transform and use different feature engineering techniques to prepare data to build a machine learning (ML) model. Finally, we store the transformed data in an S3 data lake to build the ML model in Amazon SageMaker.

Use case overview

For our use case, we use mock student datasets that contain student details like school, student ID, name, age, student study time, health, country, and marks. The following screenshot shows an example of our data.

For our use case, the data scientist uses this data to build an ML model to predict a student’s score in upcoming annual exam. However, this raw data requires cleaning and transformation. A data engineer must perform the required data transformation so the data scientist can use the transformed data to build the model in SageMaker.

Solution overview

The following diagram illustrates our solution architecture.

The workflow includes the following steps:

  1. Create a JDBC connection for Amazon Redshift and a DataBrew project.
  2. AWS DataBrew queries sample student performance data from Amazon Redshift and does the transformation and feature engineering to prepare the data to build ML model.
  3. The DataBrew job writes the final output to our S3 output bucket.
  4. The data scientist builds the ML model in SageMaker to predict student marks in an upcoming annual exam.

We cover steps 1–3 in this post.

Prerequisites

To complete this solution, you should have an AWS account.

Prelab setup

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

For our use case, we use a mock dataset. You can download the DDL and data files from GitHub.

  1. Create the Amazon Redshift cluster to capture the student performance data.
  2. Set up a security group for Amazon Redshift.
  3. Create a schema called student_schema and a table called study_details. You can use DDLsql to create database objects.
  4. We recommend using the COPY command to load a table in parallel from data files on Amazon S3. However, for this post, you can use study_details.sql to insert the data in the tables.

Create an Amazon Redshift connection

To create your Amazon Redshift connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.
  3. For Connection name, enter a name (for example, student-db-connection).
  4. For Connection type, select JDBC.
  5. Provide other parameters like the JDBC URL and login credentials.
  6. In the Network options section, choose the VPC, subnet, and security groups of your Amazon Redshift cluster.
  7. Choose Create connection.

Create datasets

To create the datasets, complete the following steps:

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.
  2. For Dataset name, enter a name (for example, student).
  3. For Your JDBC source, choose the connection you created (AwsGlueDatabrew-student-db-connection).
  4. Select the study_details table.
  5. For Enter S3 destination, enter an S3 bucket for Amazon Redshift to store the intermediate result.
  6. Choose Create dataset.

You can also configure a lifecycle rule to automatically clean up old files from the S3 bucket.

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, on the Projects page, choose Create project.
  2. For Project Name, enter student-proj.
  3. For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

  1. For Select a dataset, select My datasets.
  2. Select the student dataset.
  3. For Role name, choose the AWS Identity and Access Management (IAM) role to be used with DataBrew.
  4. Choose Create project.

You can see a success message along with our Amazon Redshift study_details table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Create a profiling job

DataBrew helps you evaluate the quality of your data by profiling it to understand data patterns and detect anomalies.

To create your profiling job, complete the following steps:

  1. On the DataBrew console, choose Jobs in the navigation pane.
  2. On the Profile jobs tab, choose Create job.
  3. For Job name, enter student-profile-job.
  4. Choose the student dataset.
  5. Provide the S3 location for job output.
  6. For Role name, choose the role to be used with DataBrew.
  7. Choose Create and run job.

Wait for the job to complete.

  1. Choose the Columns statistics tab.

You can see that the age column has some missing values.

You can also see that the study_time_in_hr column has two outliers.

Build a transformation recipe

All ML algorithms use input data to generate outputs. Input data comprises features usually in structured columns. To work properly, the features need to have specific characteristics. This is where feature engineering comes in. In this section, we perform some feature engineering techniques to prepare our dataset to build the model in SageMaker.

Let’s drop the unnecessary columns from our dataset that aren’t required for model building.

  1. Choose Column and choose Delete.
  2. For Source columns, choose the columns school_name, first_name, and last_name.
  3. Choose Apply.

We know from the profiling report that the age value is missing in two records. Let’s fill in the missing value with the median age of other records.

  1. Choose Missing and choose Fill with numeric aggregate.
  2. For Source column, choose age.
  3. For Numeric aggregate, choose Median.
  4. For Apply transform to, select All rows.
  5. Choose Apply.

We know from the profiling report that the study_time_in_hr column has two outliers, which we can remove.

  1. Choose Outliers and choose Remove outliers.
  2. For Source column, choose study_time_in_hr.
  3. Select Z-score outliers.
  4. For Standard deviation threshold, choose 3.
  5. Select Remove outliers.
  6. Under Remove outliers, select All outliers.
  7. Under Outlier removal options¸ select Delete outliers.
  8. Choose Apply.
  9. Choose Delete rows and click Apply.

The next step is to convert the categorical value to a numerical value for the gender column.

  1. Choose Mapping and choose Categorical mapping.
  2. For Source column, choose gender.
  3. For Mapping options, select Map top 1 values.
  4. For Map values, select Map values to numeric values.
  5. For M, choose 1.
  6. For Others, choose 2.
  7. For Destination column, enter gender_mapped.
  8. For Apply transform to, select All rows.
  9. Choose Apply.

ML algorithms often can’t work on label data directly, requiring the input variables to be numeric. One-hot encoding is one technique that converts categorical data that doesn’t have an ordinal relationship with each other to numeric data.

To apply one-hot encoding, complete the following steps:

  1. Choose Encode and choose One-hot encode column.
  2. For Source column, choose health.
  3. For Apply transform to, select All rows.
  4. Choose Apply.

The following screenshot shows the full recipe that we applied to our dataset before we can use it to build our model in SageMaker.

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name¸ enter student-performance.

We use CSV as the output format.

  1. For File type, choose CSV.
  2. For Role name, choose an existing role or create a new one.
  3. Choose Create and run job.
  4. Navigate to the Jobs page and wait for the student-performance job to complete.
  5. Choose the Destination link to navigate to Amazon S3 to access the job output.

Clean up

Delete the following resources that might accrue cost over time:

  • The Amazon Redshift cluster
  • The recipe job student-performance
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project student-proj and its associated recipe student-proj-recipe
  • The DataBrew datasets

Conclusion

In this post, we saw how to create a JDBC connection for an Amazon Redshift data warehouse. We learned how to use this connection to create a DataBrew dataset for an Amazon Redshift table. We also saw how easily we can bring data from Amazon Redshift into DataBrew, seamlessly apply transformations and feature engineering techniques, and run recipe jobs that refresh the transformed data for ML model building in SageMaker.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

 

 

 

Work with semistructured data using Amazon Redshift SUPER

Post Syndicated from Satish Sathiya original https://aws.amazon.com/blogs/big-data/work-with-semistructured-data-using-amazon-redshift-super/

With the new SUPER data type and the PartiQL language, Amazon Redshift expands data warehouse capabilities to natively ingest, store, transform, and analyze semi-structured data. Semi-structured data (such as weblogs and sensor data) fall under the category of data that doesn’t conform to a rigid schema expected in relational databases. It often contain complex values such as arrays and nested structures that are associated with serialization formats, such as JSON.

The schema of the JSON can evolve over time according to the business use case. Traditional SQL users who are experienced in handling structured data often find it challenging to deal with semi-structured data sources such as nested JSON documents due to lack of SQL support, the need to learn multiple complex functions, and the need to use third-party tools.

This post is part of a series that talks about ingesting and querying semi-structured data in Amazon Redshift using the SUPER data type.

With the introduction of the SUPER data type, Amazon Redshift provides a rapid and flexible way to ingest JSON data and query it without the need to impose a schema. This means that you don’t need to worry about the schema of the incoming document, and can load it directly into Amazon Redshift without any ETL to flatten the data. The SUPER data type is stored in an efficient binary encoded Amazon Redshift native format.

The SUPER data type can represent the following types of data:

  • An Amazon Redshift scalar value:
    • A null
    • A Boolean
    • Amazon Redshift numbers, such as SMALLINT, INTEGER, BIGINT, DECIMAL, or floating point (such as FLOAT4 or FLOAT8)
    • Amazon Redshift string values, such as VARCHAR and CHAR
  • Complex values:
    • An array of values, including scalar or complex
    • A structure, also known as tuple or object, that is a map of attribute names and values (scalar or complex)

For more information about the SUPER type, see Ingesting and querying semistructured data in Amazon Redshift.

After the semi-structured and nested data is loaded into the SUPER data type, you can run queries on it by using the PartiQL extension of SQL. PartiQL is backward-compatible to SQL. It enables seamless querying of semi-structured and structured data and is used by multiple AWS services, such as Amazon DynamoDB, Amazon Quantum Ledger Database (Amazon QLDB), and AWS Glue Elastic Views. With PartiQL, the query engine can work with schema-less SUPER data that originated in serialization formats, such as JSON. With the use of PartiQL, familiar SQL constructs seamlessly combine access to both the classic, tabular SQL data and the semi-structured data in SUPER. You can perform object and array navigation and also unnesting with simple and intuitive extensions to SQL semantics.

For more information about PartiQL, see Announcing PartiQL: One query language for all your data.

Use cases

The SUPER data type is useful when processing and querying semi-structured or nested data such as web logs, data from industrial Internet of Things (IoT) machines and equipment, sensors, genomics, and so on. To explain the different features and functionalities of SUPER, we use sample industrial IoT data from manufacturing.

The following diagram shows the sequence of events in which the data is generated, collected, and finally stored in Amazon Redshift as the SUPER data type.

Manufacturers are embracing the cloud solutions with connected machines and factories to transform and use data to make data-driven decisions so they can optimize operations, increase productivity, and improve availability while reducing costs.

As an example, the following diagram depicts a common asset hierarchy of a fictional smart manufacturing company.

This data is typically semi-structured and hierarchical in nature. To find insights from this data using traditional methods and tools, you need to extract, preprocess, load, and transform it into the proper structured format to run typical analytical queries using a data warehouse. The time to insight is delayed because of the initial steps required for data cleansing and transformation typically performed using third-party tools or other AWS services.

Amazon Redshift SUPER handles these use cases by helping manufacturers extract, load, and query (without any transformation) a variety of data sources collected from edge computing and industrial IoT devices. Let’s use this sample dataset to drive our examples to explain the capabilities of Amazon Redshift SUPER.

The following is an example subscription dataset for assets (such as crowder, gauge, and pneumatic cylinder) in the workshop, which collects metrics on different properties (such as air pressure, machine state, finished parts count, and failures). Data on these metrics is generated continuously in a time series fashion and pushed to the AWS Cloud using the AWS IoT SiteWise connector. This specific example collects the Machine State property for an asset. The following sample data called subscriptions.json has scalar columns like topicId and qos, and a nested array called messages, which has metrics on the assets.

The following is another dataset called asset_metadata.json, which describes different assets and their properties, more like a dimension table. In this example, the asset_name is IAH10 Line 1, which is a processing plant line for a specific site.

Load data into SUPER columns

Now that we covered the basics behind Amazon Redshift SUPER and the industrial IoT use case, let’s look at different ways to load this dataset.

Copy the JSON document into multiple SUPER data columns

In this use case for handling semi-structured data, the user knows the incoming data’s top-level schema and structure, but some of the columns are semi-structured and nested in structures or arrays. You can choose to shred a JSON document into multiple columns that can be a combination of the SUPER data type and Amazon Redshift scalar types. The following code shows what the table DDL looks like if we translate the subscriptions.json semi-structured schema into a SUPER equivalent:

CREATE TABLE subscription_auto (
topicId INT
,topicFilter VARCHAR
,qos INT
,messages SUPER
);

To load data into this table, specify the auto option along with FORMAT JSON in the COPY command to split the JSON values across multiple columns. The COPY matches the JSON attributes with column names and allows nested values, such as JSON arrays and objects, to be ingested as SUPER values. Run the following command to ingest data into the subscription_auto table. Replace the AWS Identity and Access Management (IAM) role with your own credentials. The ignorecase option passed along with auto is required only if your JSON attribute names are in CamelCase. In our case, our columns topicId and topicFilter scalar columns are in CamelCase.

COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 'auto ignorecase';

A select * from subscription_auto command looks like the following code. The messages SUPER column holds the entire array in this case.

-- A sample output
SELECT * FROM subscription_auto;
[ RECORD 1 ]
topicid     | 1001
topicfilter | $aws/sitewise/asset-models/+/assets/+/properties/+
qos         | 0
messages    | [{"format":"json","topic":"$aws\/sitewise\/asset-models\/8926cf44-14ea-4cd8-a7c6-e61af641dbeb\/assets\/0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c\/properties\/3ff67d41-bf69-4d57-b461-6f1513e127a4","timestamp":1616381297183,"payload":{"type":"PropertyValueUpdate","payload":{"assetId":"0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c","propertyId":"3ff67d41-bf69-4d57-b461-6f1513e127a4","values":[{"timestamp":{"timeInSeconds":1616381289,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381292,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381286,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381293,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381287,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":925000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381294,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381285,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381288,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}}]}}}]

Alternatively, you can specify jsonpaths to load the same data, as in the following code. The jsonpaths option is helpful if you want to load only selective columns from your JSON document. For more information, see COPY from JSON format.

COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 's3://redshift-downloads/semistructured/super-blog/jsonpaths/subscription_jsonpaths.json';

subscription_jsonpaths.json looks like the following:

{"jsonpaths": [
    "$.topicId",
    "$.topicFilter",
    "$.qos",
    "$.messages"
   ]
}

While we’re in the section about loading, let’s also create the asset_metadata table and load it with relevant data, which we need in our later examples. The asset_metadata table has more information about industry shop floor assets and their properties like asset_name, property_name, and model_id.

CREATE TABLE asset_metadata (
asset_id VARCHAR
,asset_name VARCHAR
,asset_model_id VARCHAR
,asset_property_id VARCHAR
,asset_property_name VARCHAR
,asset_property_data_type VARCHAR
,asset_property_unit VARCHAR
,asset_property_alias VARCHAR
);

COPY asset_metadata FROM 's3://redshift-downloads/semistructured/super-blog/asset_metadata/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 'auto';

Copy the JSON document into a single SUPER column

You can also load a JSON document into a single SUPER data column. This is typical when the schema of the incoming JSON is unknown and evolving. SUPER’s schema-less ingestion comes to the forefront here, letting you load the data in a flexible fashion.

For this use case, assume that we don’t know the names of the columns in subscription.json and want to load it into Amazon Redshift. It’s as simple as the following code:

CREATE TABLE subscription_noshred (s super);

After the table is created, we can use the COPY command to ingest data from Amazon Simple Storage Service (Amazon S3) into the single SUPER column. The noshred is a required option to go along with FORMAT JSON, which tells the COPY parser not to shred the document but load it into a single column.

COPY subscription_noshred FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 'noshred';

After the COPY has successfully ingested the JSON, the subscription_noshred table has a SUPER column s that contains the data of the entire JSON object. The ingested data maintains all the properties of the JSON nested structure but in a SUPER data type.

The following code shows how select star (*) into subscription_noshred looks; the entire JSON structure is in SUPER column s:

--A sample output
SELECT * FROM subscription_noshred;
[ RECORD 1 ]
s | {"topicId":1001,"topicFilter":"$aws\/sitewise\/asset-models\/+\/assets\/+\/properties\/+","qos":0,"messages":[{"format":"json","topic":"$aws\/sitewise\/asset-models\/8926cf44-14ea-4cd8-a7c6-e61af641dbeb\/assets\/0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c\/properties\/3ff67d41-bf69-4d57-b461-6f1513e127a4","timestamp":1616381297183,"payload":{"type":"PropertyValueUpdate","payload":{"assetId":"0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c","propertyId":"3ff67d41-bf69-4d57-b461-6f1513e127a4","values":[{"timestamp":{"timeInSeconds":1616381289,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381292,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381286,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381293,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381287,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":925000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381294,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381285,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381288,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}}]}}}]}

Similar to the noshred option, we can also use jsonpaths to load complete documents. This can be useful in cases where we want to extract additional columns, such as distribution and sort keys, while still loading the complete document as a SUPER column. In the following example, we map our first column to the root JSON object, while also mapping a distribution key to the second column, and a sort key to the third column.

subscription_sorted_jsonpaths.json looks like the following:

{"jsonpaths": [
"$",
"$.topicId",
"$.topicFilter"
]
}

CREATE TABLE subscription_sorted (
s super
,topicId INT
,topicFilter VARCHAR
) DISTKEY(topicFilter) SORTKEY(topicId);

COPY subscription_sorted FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 's3://redshift-downloads/semistructured/superblog/jsonpaths/subscription_sorted_jsonpaths.json';

Copy data from columnar formats like Parquet and ORC

If your semi-structured or nested data is already available in either Apache Parquet or Apache ORC formats, you can use the COPY command with the SERIALIZETOJSON option to ingest data into Amazon Redshift. The Amazon Redshift table structure should match the number of columns and the column data types of the Parquet or ORC files. Amazon Redshift can replace any Parquet or ORC column, including structure and array types, with SUPER data columns. The following are the COPY examples to load from Parquet and ORC format:

--Parquet 
COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions_parquet/' 
IAM_ROLE '<< your IAM role >>' FORMAT PARQUET SERIALIZETOJSON;

--ORC 
COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions_orc/' 
IAM_ROLE '<< your IAM role >>' FORMAT ORC SERIALIZETOJSON;

Apart from COPY, you can load the same data in columnar format in Amazon S3 using Amazon Redshift Spectrum and the INSERT command. The following example assumes the Redshift Spectrum external schema super_workshop and the external table subscription_parquet is already created and available in the database.

We need to set an important session-level configuration parameter for this to work: SET json_serialization_enable TO true. For more information, see Serializing complex nested JSON. This session-level parameter allows you to query top-level nested collection columns as serialized JSON.

/* -- Set the GUC for JSON serialization -- */ 
SET json_serialization_enable TO true;

INSERT INTO subscription_parquet 
SELECT topicId
,topicFilter
,qos
,JSON_PARSE(messages) FROM super_workshop.subscription_parquet;

As of this writing, we can’t use the SUPER column as such as a distribution or sort key, but if we need to use one of the attributes in a SUPER column as a distribution a sort key and keep the entire SUPER column intact as a separate column, we can use the following code as a workaround (apart from the jsonpaths example described earlier). For example, let’s assume the column format within the array messages needs to be used a distribution key for the Amazon Redshift table:

SET json_serialization_enable TO true;

DROP TABLEIF EXISTS subscription_messages;

CREATE TABLE subscription_messages (
m_format VARCHAR
,m_messages super
) distkey (m_format);

--'format' scalar column is extracted from 'messages' super column 
INSERT INTO subscription_messages 
SELECT element.format::VARCHAR
,supercol FROM (
SELECT JSON_PARSE(messages) AS supercol FROM super_workshop.subscription_parquet
) AS tbl,tbl.supercol element;

Apart from using the Amazon Redshift COPY command, we can also ingest JSON-formatted data into Amazon Redshift using the traditional SQL INSERT command:

--Insert example 
INSERT INTO subscription_auto VALUES (
0000
,'topicFiltervalue'
,0000
,JSON_PARSE('[{"format":"json"},{"topic": "sample topic"},[1,2,3]]')
); 

The JSON_PARSE() function parses the incoming data in proper JSON format and helps convert it into the SUPER data type. Without the JSON_PARSE() function, Amazon Redshift treats and ingests the value as a single string into SUPER instead of a JSON-formatted value.

Query SUPER columns

Amazon Redshift uses the PartiQL language to offer SQL-compatible access to relational, semi-structured, and nested data. PartiQL’s extensions to SQL are straightforward to understand, treat nested data as first-class citizens, and seamlessly integrate with SQL. The PartiQL syntax uses dot notation and array subscript for path navigation when accessing nested data.

The Amazon Redshift implementation of PartiQL supports dynamic typing for querying semi-structured data. This enables automatic filtering, joining, and aggregation on the combination of structured, semi-structured, and nested datasets. PartiQL allows the FROM clause items to iterate over arrays and use unnest operations.

The following sections focus on different query access patterns that involve navigating the SUPER data type with path and array navigation, unnest, or joins.

Navigation

We may want to find an array element by its index, or we may want to find the positions of certain elements in their arrays, such as the first element or the last element. We can reference a specific element simply by using the index of the element within square brackets (the index is 0-based) and within that element we can reference an object or struct by using the dot notation.

Let’s use our subscription_auto table to demonstrate the examples. We want to access the first element of the array, and within that we want to know the value of the attribute format:

SELECT messages[0].format FROM subscription_auto ;
--A sample output
 format
--------
 "json"
 "json"
 "json"
(3 rows)

Amazon Redshift can also use a table alias as a prefix to the notation. The following example is the same query as the previous example:

SELECT sub.messages[0].format FROM subscription_auto AS sub;
--A sample output
 format
--------
 "json"
 "json"
 "json"
(3 rows)

You can use the dot and bracket notations in all types of queries, such as filtering, join, and aggregation. You can use them in a query in which there are normally column references. The following example uses a SELECT statement that filters results:

SELECT COUNT(*) FROM subscription_auto WHERE messages[0].format IS NOT NULL;
-- A sample output
 count
-------
    11
(1 row)

Unnesting and flattening

To unnest or flatten an array, Amazon Redshift uses the PartiQL syntax to iterate over SUPER arrays by using the FROM clause of a query. We continue with the previous example in the following code, which iterates over the array attribute messages values:

SELECT c.*, o FROM subscription_auto c, c.messages AS o WHERE o.topic='sample topic';
-- A sample output
                             messages                 |        o
------------------------------------------------------+--------------------------
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | {"topic":"sample topic"}
(1 row)

The preceding query has one extension over standard SQL: the c.messages AS o, where o is the alias for the SUPER array messages and serves to iterate over it. In standard SQL, the FROM clause x (AS) y means “for each tuple y in table x.” Similarly, the FROM clause x (AS) y, if x is a SUPER value, translates to “for each (SUPER) value y in (SUPER) array value x.” The projection list can also use the dot and bracket notation for regular navigation.

When unnesting an array, if we want to get the array subscripts (starting from 0) as well, we can specify AT some_index right after unnest. The following examples iterates over both the array values and array subscripts:

SELECT o, index FROM subscription_auto c, c.messages AS o AT index 
WHERE o.topic='sample topic';

-- A sample output 
       o                  | index
--------------------------+-------
{"topic":"sample topic"}  | 1
(1 row)

If we have an array of arrays, we can do multiple unnestings to iterate into the inner arrays. The following example shows how we do it. Note that unnesting non-array expressions (the objects inside c.messages) are ignored, only the arrays are unnested.

SELECT messages, array, inner_array_element FROM subscription_auto c, c.messages AS array, array AS inner_array_element WHERE array = json_parse('[1,2,3]');

-- A sample output 
    messages                                          | array   | inner_array_element
------------------------------------------------------+---------+---------------------
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | [1,2,3] | 1
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | [1,2,3] | 2
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | [1,2,3] | 3
(3 rows)

Dynamic typing

With the schema-less nature of semi-structured data, the same attributes within the JSON might have values of different types. For example, asset_id from our example might have initially started with integers and then because of a business decision changed into alphanumeric (string) values before finally settling on array type. Amazon Redshift SUPER handles this situation by using dynamic typing. Schema-less SUPER data is processed without the need to statically declare the data types before using them in a query. Dynamic typing is most useful in joins and GROUP BY clauses. Although deep comparing of SUPER column is possible, we recommend restricting the joins and aggregations to use the leaf-level scalar attribute for optimal performance. The following example uses a SELECT statement that requires no explicit casting of the dot and bracket expressions to the usual Amazon Redshift types:

SELECT messages[0].format,
        messages[0].topic
FROM subscription_auto
WHERE messages[0].payload.payload."assetId" > 0;
 
--A sample output
 format | topic
--------+-------
"json"  | "sample topic" 
(1 rows)

When your JSON attribute names are in mixed case or CamelCase, the following session parameter is required to query the data. In the preceding query, assetId is in mixed case. For more information, see SUPER configurations.

SET enable_case_sensitive_identifier to TRUE;

The greater than (>) sign in this query evaluates to true when messages[0].payload.payload."assetId" is greater than 0. In all other cases, the equality sign evaluates to false, including the cases where the arguments of the equality are different types. The messages[0].payload.payload."assetId" attribute can potentially contain a string, integer, array, or structure, and Amazon Redshift only knows that it is a SUPER data type. This is where dynamic typing helps in processing the schemaless data by evaluating the types at runtime based on the query. For example, processing the first record of “assetId” may result in an integer, whereas the second record can be a string, and so on. When using an SQL operator or function with dot and bracket expressions that have dynamic types, Amazon Redshift produces results similar to using a standard SQL operator or function with the respective static types. In this example, when the dynamic type of the path expression is an integer, the comparison with the integer 0 is meaningful. Whenever the dynamic type of “assetId” is any other data type except being an integer, the equality returns false.

For queries with joins and aggregations, dynamic typing automatically matches values with different dynamic types without performing a long CASE WHEN analysis to find out what data types may appear. The following code is an example of an aggregation query in which we count the number of topics:

SELECT messages[0].format,
count(messages[0].topic)
FROM subscription_auto WHERE messages[0].payload.payload."assetId" > 'abc' GROUP BY 1;

-- a sample output
 format | count
--------+-------
"json"  |   2
(1 row)

For the next join query example, we unnest and flatten the messages array from subscription_auto and join with the asset_metadata table to get the asset_name and property_name based on the asset_id and property_id, which we use as join keys.

Joins on SUPER should preferably be on an extracted path and avoid deep compare of the entire nested field for performance. In the following examples, the join keys used are on extracted path keys and not on the whole array:

SELECT c.topicId
,c.qos
,a.asset_name
,o.payload.payload."assetId"
,a.asset_property_name
,o.payload.payload."propertyId"
FROM subscription_auto AS c
,c.messages AS o -- o is the alias for messages array
,asset_metadata AS a
WHERE o.payload.payload."assetId" = a.asset_id AND o.payload.payload."propertyId" = a.asset_property_id AND o.payload.type = 'PropertyValueUpdate';
 
--A sample output 
 topicid | qos |  asset_name  |                assetId                 | asset_property_name |               propertyId
---------+-----+--------------+----------------------------------------+---------------------+----------------------------------------
    1001 |   0 | IAH10 Line 1 | "0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c" | stop                | "3ff67d41-bf69-4d57-b461-6f1513e127a4"
(1 row)

The following code is another join query that is looking for a count on the quality of the metrics collected for a specific asset (in this case IAH10 Line) and it is property (Machine State) and categorizes it based on the quality:

SELECT v.quality::VARCHAR
,count(*)
FROM subscription_auto AS c
,c.messages AS o -- o is the alias for messages array
,o.payload.payload.VALUES AS v -- v is the alias for values array 
,asset_metadata AS a 
WHERE o."assetId" = a.asset_name 
AND o."propertyId" = a.asset_property_name 
AND a.asset_name = 'IAH10 Line 1' 
AND a.asset_property_name = 'Machine State' 
GROUP BY v.quality;

-- A sample output 
quality   | count
----------+-------
CRITICAL  | 152 
GOOD      | 2926 
FAIR      | 722

Summary

This post discussed the benefits of the new SUPER data type and use cases in which nested data types can help improve storage efficiency, performance, or simplify analysis. Amazon Redshift SUPER along with PartiQL enables you to write queries over relational and hierarchical data model with ease. The next post in this series will focus on how to speed up frequent queries on SUPER columns using materialized views. Try it out and share your experience!


About the Authors

Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

 

 

 

Runyao Chen is a Software Development Engineer at Amazon Redshift. He is passionate about MPP databases and has worked on SQL language features such as querying semistructured data using SUPER. In his spare time, he enjoys reading and exploring new restaurants.

 

 

 

Cody Cunningham is a Software Development Engineer at Amazon Redshift. He works on Redshift’s data ingestion, implementing new features and performance improvements, and recently worked on the ingestion support for SUPER.

Amazon Redshift identity federation with multi-factor authentication

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/amazon-redshift-identity-federation-with-multi-factor-authentication/

Password-based access control alone is not considered secure enough, and many organizations are adopting multi-factor authentication (MFA) and single sign-on (SSO) as a de facto standard to prevent unauthorized access to systems and data. SSO frees up time and resources for both administrators and end users from the painful process of password-based credential management. MFA capability adds an extra layer of protection, improving organizational security and confidence.

Amazon Redshift is a fast, fully-managed cloud data warehouse that provides browser-based plugins for JDBC/ODBC drivers, which helps you easily implement SSO capabilities added with MFA to secure your data warehouse, and also helps automation and enforcement of data access policies across the organization through directory federation.

You can integrate Amazon Redshift with different identity providers (IdPs) like Microsoft Azure Active Directory, Ping, Okta, OneLogin, and more. For more information, see Using a credentials provider plugin. You may already have pre-existing integrations for federating to AWS using industry standard Security Assertion Markup Language (SAML) with these IdPs. In this post, we explain the steps to integrate the Amazon Redshift browser-based SAML plugin to add SSO and MFA capability with your federation IdP.

Solution overview

With this integration, users get authenticated to Amazon Redshift using the SSO and MFA credentials of the IdP application, which uses SAML tokens to map the IdP user identity (like login name or email address) as the database user in Amazon Redshift. It can also map users’ directory group memberships to corresponding database groups in Amazon Redshift, which allows you to control authorization grants for database objects in Amazon Redshift.

The following diagram illustrates our solution architecture.

High-level steps for SSO-MFA integration

The following diagram illustrates the authentication flow with the browser SAML plugin.

We complete the following high-level steps to set up the SSO and MFA authentication capability to an Amazon Redshift data warehouse using a browser-based SAML plugin with our IdP:

  1. Create a custom SAML 2.0 application with the IdP with the following configurations:
    1. A redirect URI (for example, http://localhost:7890/redshift/).
    2. MFA capability enabled.
    3. Relevant SAML claim attributes.
    4. Appropriate directory groups and users with the IdP.
  2. Add appropriate AWS Identity and Access Management (IAM) permissions:
    1. Add an IdP.
    2. Add appropriate IAM roles for the IdP.
    3. Use IAM policies to add appropriate permissions to the roles to access the Amazon Redshift cluster.
  3. Set up Amazon Redshift with group-level access control:
    1. Connect to Amazon Redshift using superuser credentials.
    2. Set up appropriate database groups in Amazon Redshift.
    3. Grant access permissions appropriate to relevant groups.
  4. Connect to Amazon Redshift with your JDBC/ODBC SQL client:
    1. Configure connection attributes for the IdP.
    2. Enable browser-based MFA.
    3. Connect to Amazon Redshift.

Create a custom SAML 2.0 application with the IdP

The first step of this setup is to create a SAML 2.0 application with your IdP. The various directory groups and users that need to log in to the Amazon Redshift cluster should have access to this SAML application. Provide an appropriate redirect_uri (for example, http://localhost:7890/redshift/) in the SAML application configuration, so that the IdP can seamlessly redirect SAML responses over HTTPS to this URI, which then allows the Amazon Redshift JDBC/ODBC driver to authenticate and authorize the user.

The following screenshot shows a SAML application configuration with PingOne as the IdP (for more details on PingOne Amazon Redshift SSO federation, see Federating single sign-on access to your Amazon Redshift cluster with PingIdentity).

You need to download the metadata XML file from the provider (as shown in the preceding screenshot) and use it in a later step to create a SAML IdP in IAM.

Next, you can enable MFA for this application so that users are authorized to access Amazon Redshift only after they pass the two-factor authentication with MFA.

The following screenshot shows the MFA configuration settings with PingOne as the IdP.

As part of the IdP application setup, map the following claim attributes so that Amazon Redshift can access them using the SAML response.

Claim Attribute Namespace Description Example Value
Role https://aws.amazon.com/SAML/Attributes/Role aws_iam_role_arn_for_identity_provider, aws_identity_provider_arn arn:aws:iam::<account>:role/PingOne_Redshift_SSO_Role,arn:aws:iam::<account>:saml-provider/PingOne
RoleSessionName https://aws.amazon.com/SAML/Attributes/RoleSessionName Identification for the user session, which in most cases is the email_id of the user. email
AutoCreate https://redshift.amazon.com/SAML/Attributes/AutoCreate If this parameter is set, new users authenticated by the IdP are automatically created in Amazon Redshift. "true"
DbUser https://redshift.amazon.com/SAML/Attributes/DbUser Identification for the user session, which in most cases is the email_id of the user. email
DbGroups https://redshift.amazon.com/SAML/Attributes/DbGroups Amazon Redshift database group names for the user, which in most cases is the same as the directory groups the user belongs to. data_scientist

The following screenshot is an example of these claim attributes set up for PingOne as IdP.

Apart from setting up the SAML application, you also need to set up appropriate directory groups and users with your IdP, which you will use to grant SSO and MFA access to users for different applications like AWS Single Sign-On and Amazon Redshift application single sign-on.

The following screenshot is an example of this user group set up for PingOne as IdP.

Add appropriate permissions using IAM

After you complete the configuration settings with the IdP, the next step is to configure appropriate permissions in IAM in your AWS account for identity federation using IAM.

The first step is to add an IdP using the SAML metadata XML file downloaded from the IdP’s SAML application you created in the previous step.

After you add the IdP, you need to create an IAM role with that IdP as a trusted entity.

Set the value of the SAML:aud attribute to the same redirect URI defined in your IdP’s SAML application setup (for example, http://localhost:7890/redshift/).

Create a new IAM policy with the necessary permissions needed by the users to access Amazon Redshift, and attach it to the IAM role you created earlier. See the following sample policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser",
                "redshift:JoinGroup",
                "redshift:GetClusterCredentials",
                "redshift:ListSchemas",
                "redshift:ListTables",
                "redshift:ListDatabases",
                "redshift:ExecuteQuery",
                "redshift:FetchResults",
                "redshift:CancelQuery",
                "redshift:DescribeClusters",
                "redshift:DescribeQuery",
                "redshift:DescribeTable"
            ],
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<clusterName>",
                "arn:aws:redshift:<region>:<account>:dbuser:<clusterName>/${redshift:DbUser}",
                "arn:aws:redshift:<region>:<account>:dbname:<clusterName>/${redshift:DbName}",
                "arn:aws:redshift:<region>:<account>:dbgroup:<clusterName>/bi_users_group",
                "arn:aws:redshift:<region>:<account>:dbgroup:<clusterName>/etl_users_group",
                "arn:aws:redshift:<region>:<account>:dbgroup:<clusterName>/analysts_group"
            ]
        }
    ]
}

You can also use an AWS CloudFormation template to automate this IAM setup by uploading the IdP- specific SAML metadata XML file from the SAML application you created.

The following template takes care of creating the IAM resources required for this setup. You need to enter the following parameters to the template:

  • FederationProviderName – Enter a suitable name for the IAM IdP.
  • FederationXmlS3Location – Enter the Amazon Simple Storage Service (Amazon S3) URI where you uploaded the SAML metadata XML file from your IdP’s SAML application.
  • RedshiftClusterEndpoint – Enter the endpoint URI of your Amazon Redshift cluster. You can get this URI via the Amazon Redshift console. If you have multiple Amazon Redshift clusters, you may need to modify this CloudFormation template to add permissions for all the clusters in your account.

Grant group-level access control with Amazon Redshift

If you haven’t set up an Amazon Redshift cluster yet, see Getting started with Amazon Redshift for a step-by-step guide to create a new cluster in your AWS account.

If you already have an Amazon Redshift cluster, note the primary user credentials for that cluster and refer to the following resources to connect to that cluster using a SQL client like SQL Workbench/J and the latest Amazon Redshift JDBC driver with AWS SDK:

When you’re logged in, you need to set up the appropriate groups in Amazon Redshift. The following example code sets up three database groups for business intelligence (BI) users, analysts, and a cross-user group in Amazon Redshift:

CREATE GROUP bi_users_group;
CREATE GROUP analysts_group;
CREATE GROUP cross_user_group;

You can then set up database objects and appropriate access permissions for them. In the following code, we set up two schemas for analysts and BI users and then grant access on them to the relevant groups:

CREATE SCHEMA IF NOT EXISTS bi_schema;
CREATE SCHEMA IF NOT EXISTS analysts_schema;

GRANT SELECT ON ALL TABLES IN SCHEMA bi_schema TO GROUP bi_users_group, GROUP cross_user_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA bi_schema GRANT SELECT ON TABLES TO GROUP bi_users_group, GROUP cross_user_group;

GRANT SELECT ON ALL TABLES IN SCHEMA analysts_schema TO GROUP analysts_group, GROUP cross_user_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA analysts_schema GRANT SELECT ON TABLES TO GROUP analysts_group, GROUP cross_user_group;

These group-level grants allow federated users to access Amazon Redshift objects based on their associated permissions. As explained earlier in this post, you can map your IdP directory groups to their respective database groups in Amazon Redshift, which allows you to control both authentication and authorization to Amazon Redshift based on the IdP credentials.

However, you may choose to control the authorization part within the database itself instead of relying on IdP directory groups. In this case, you use the IdP only to facilitate system authentication to Amazon Redshift, but for data authorization, you map the users and groups manually using alter group statements, as in the following code:

CREATE USER "[email protected]" PASSWORD DISABLE;
ALTER GROUP bi_users_group ADD USER "[email protected]";
ALTER GROUP cross_user_group ADD USER "[email protected]";

In the preceding example, we create a new user, exampleuser, with password disabled. We can use the IdP credentials for this user to authenticate and therefore it doesn’t need a password. But to provide authorization, we added this user to the bi_user and cross_user groups, so that it can inherit the permissions granted to these groups and can work seamlessly with SSO and MFA federation.

Configure your JDBC/ODBC SQL client to use the browser-based plugin to connect to Amazon Redshift

In this step, you can test Amazon Redshift connectivity through your IdP using a SQL client like SQL Workbench/J.

You need to provide the following configurations in the SQL client.

Property Value
Driver Amazon Redshift
URL jdbc:redshift:iam://<your-redshift-cluster-endpoint>

Additionally, you need to set up the following extended properties.

Property Value
login_url This is the SAML application’s login page
plugin_name com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
idp_response_timeout Number of seconds to allow for SSO authentication to complete before timing out

The following screenshot shows the configurations to connect SQLWorkbench/J client with PingOne as IdP.

The following table summarizes our property values.

Property Value
login_url https://sso.connect.pingidentity.com/sso/sp/initsso?saasid=<your_saas_id>&idpid=<your_idp_id> (This is the SAML application’s SSO URL from your IDP)
plugin_name com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
idp_response_timeout 120

When you choose Test or OK, a new web browser window opens that shows the SAML application’s login page.

If this is the first time you’re logging in to PingOne, and haven’t set up MFA before, you can download and pair the PingID mobile app on iOS or Android.

After the PingID app is installed and paired, it pushes a notification to your phone to approve or deny the MFA authorization. When the MFA succeeds, the browser displays a success message on the redirect page.

After the connection is successful, let’s run a SQL query to confirm that the correct user identification was passed and also confirm that the correct database group was mapped for this SQL user session, based on the user’s directory group. In this case, the user manish was mapped to the bi_users_group directory group in PingOne. We should see the SQL session reflect the corresponding database group in Amazon Redshift.

We were able to successfully accomplish MFA-based SSO identity federation with PingOne using the browser-based plugin that Amazon Redshift provides.

IdP-specific configurations

As mentioned earlier, the first step of this process is to set up SSO for Amazon Redshift with your IdP. The setup steps for that may vary depending on the provider. For more information, see the following resources:

The following videos also cover these details if you want to view them in action:

Conclusion

Amazon Redshift makes it easy to integrate identity federation with your existing third-party identity providers, allowing you to centralize user and access management in a single corporate directory. In this post, we showed how the Amazon Redshift browser-based plugin works with popular SAML-based IdPs to provide an additional security layer with MFA authentication. You can also use the instructions in this post to set up various SAML-based IdPs (like Ping, Okta, JumpCloud, and OneLogin) to implement SSO and MFA with Amazon Redshift.


About the Authors

Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data driven solutions in different database and data warehousing technologies for over fifteen years. He loves to learn new technologies and solving, automating and simplifying customer problems with easy-to-use cloud data solutions on AWS.

 

 

Manish Vazirani is an Analytics Specialist Solutions Architect at Amazon Web Services.

 

 

 

 

 

 

Improve query performance using AWS Glue partition indexes

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/improve-query-performance-using-aws-glue-partition-indexes/

While creating data lakes on the cloud, the data catalog is crucial to centralize metadata and make the data visible, searchable, and queryable for users. With the recent exponential growth of data volume, it becomes much more important to optimize data layout and maintain the metadata on cloud storage to keep the value of data lakes.

Partitioning has emerged as an important technique for optimizing data layout so that the data can be queried efficiently by a variety of analytic engines. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. Over time, hundreds of thousands of partitions get added to a table, resulting in slow queries. To speed up query processing of highly partitioned tables cataloged in AWS Glue Data Catalog, you can take advantage of AWS Glue partition indexes.

Partition indexes are available for queries in Amazon EMRAmazon Redshift Spectrum, and AWS Glue extract, transform, and load (ETL) jobs (Spark DataFrame). When partition indexes are enabled on the heavily partitioned AWS Glue Data Catalog tables, all these query engines are accelerated. You can add partition indexes to both new tables and existing tables. This post demonstrates how to utilize partition indexes, and discusses the benefit you can get with partition indexes when working with highly partitioned data.

Partition indexes

AWS Glue partition indexes are an important configuration to reduce overall data transfers and processing, and reduce query processing time. In the AWS Glue Data Catalog, the GetPartitions API is used to fetch the partitions in the table. The API returns partitions that match the expression provided in the request. If no partition indexes are present on the table, all the partitions of the table are loaded, and then filtered using the query expression provided by the user in the GetPartitions request. The query takes more time to run as the number of partitions increase on a table with no indexes. With an index, the GetPartitions request tries to fetch a subset of the partitions instead of loading all the partitions in the table.

The following are key benefits of partition indexes:

  • Increased query performance
  • Increased concurrency as a result of fewer GetPartitions API calls
  • Cost savings:
    • Analytic engine cost (query performance is related to the charges in Amazon EMR and AWS Glue ETL)
    • AWS Glue Data Catalog API request cost

Setting up resources with AWS CloudFormation

This post provides an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use.

The CloudFormation template generates the following resources:

If you’re using AWS Lake Formation permissions, you need to ensure that the IAM user or role running AWS CloudFormation has the required permissions (to create a database on the Data Catalog).

The tables use sample data located in an Amazon Simple Storage Service (Amazon S3) public bucket. Initially, no partition indexes are configured in these AWS Glue Data Catalog tables.

To create your resources, complete the following steps:

  1. Sign in to the CloudFormation console.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For DatabaseName, leave as the default.
  5. Choose Next.
  6. On the next page, choose Next.
  7. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create.

Stack creation can take up to 5 minutes. When the stack is completed, you have two Data Catalog tables: table_with_index and table_without_index. Both tables point to the same S3 bucket, and the data is highly partitioned based on yearmonthday, and hour columns for more than 42 years (1980-2021). In total, there are 367,920 partitions, and each partition has one JSON file, data.json. In the following sections, you see how the partition indexes work with these sample tables.

Setting up a partition index on the AWS Glue console

You can create partition indexes at any time. If you want to create a new table with partition indexes, you can make the CreateTable API call with a list of PartitionIndex objects. If you want to add a partition index to an existing table, make the CreatePartitionIndex API call. You can also perform these actions on the AWS Glue console. You can create up to three partition indexes on a table.

Let’s configure a new partition index for the table table_with_index we created with the CloudFormation template.

  1. On the AWS Glue console, choose Tables.
  2. Choose the table table_with_index.
  3. Choose Partitions and indices.
  4. Choose Add new index.
  5. For Index name, enter year-month-day-hour.
  6. For Selected keys from schema, select year, month, day, and hour.
  7. Choose Add index.

The Status column of the newly created partition index shows the status as Creating. Wait for the partition index to be Active. The process takes about 1 hour because more number of partitions longer it takes for index creation and we have 367,920 partitions on this table.

Now the partition index is ready for the table table_with_index. You can use this index from various analytic engines when you query against the table. You see default behavior in the table table_without_index because no partition indexes are configured for this table.

You can follow (or skip) any of the following sections based on your interest.

Making a GetPartitions API call with an expression

Before we use the partition index from various query engines, let’s try making the GetPartitions API call using AWS Command Line Interface (AWS CLI) to see the difference. The AWS CLI get-partitions command makes multiple GetPartitions API calls if needed. In this section, we simply use the time command to compare the duration for each table, and use the debug logging to compare the number of API calls for each table.

  1. Run the get-partitions command against the table table_without_index with the expression year='2021' and month='04' and day='01':
    $ time aws glue get-partitions --database-name partition_index --table-name table_without_index --expression "year='2021' and month='04' and day='01'"
    ...
    real    3m57.438s
    user    0m2.872s
    sys    0m0.248s
    

The command took about 4 minutes. Note that you used only three partition columns out of four.

  1. Run the same command with debug logging to get the number of the GetPartitionsAPI calls:
    $ aws glue get-partitions --database-name partition_index --table-name table_without_index --expression "year='2021' and month='04' and day='01'" --debug 2>get-partitions-without-index.log
    $ cat get-partitions-without-index.log | grep x-amz-target:AWSGlue.GetPartitions | wc -l
         737

There were 737 GetPartitions API calls when the partition indexes aren’t used.

  1. Next, run the get-partitions command against table_with_index with the same expression:
    $ time aws glue get-partitions --database-name partition_index --table-name table_with_index --expression "year='2020' and month='07' and day='01' and hour='09'"
    ...
    real    0m2.697s
    user    0m0.442s
    sys    0m0.163s

The command took just 2.7 seconds. You can see how quickly the required partitions were returned.

  1. Run the same command with debug logging to get the number of the GetPartitionsAPI calls:
    $ aws glue get-partitions --database-name partition_index --table-name table_with_index --expression "year='2021' and month='04' and day='01'" --debug 2>get-partitions-with-index.log
    $ cat get-partitions-with-index.log | grep x-amz-target:AWSGlue.GetPartitions | wc -l
           4
    

There were only four GetPartitions API calls when the partition indexes are used.

Querying a table using Apache Spark on Amazon EMR

In this section, we explore querying a table using Apache Spark on Amazon EMR.

  1. Launch a new EMR cluster with Apache Spark.

For instructions, see Setting Up Amazon EMR. You need to specify the AWS Glue Data Catalog as the metastore. In this example, we use the default EMR cluster (release: emr-6.2.0, three m5.xlarge nodes).

  1. Connect to the EMR node using SSH.
  2. Run the spark-sql command on the EMR node to start an interactive shell for Spark SQL:
    $ spark-sql

  3. Run the following SQL against partition_index.table_without_index:
    spark-sql> SELECT count(*), sum(value) FROM partition_index.table_without_index WHERE year='2021' AND month='04' AND day='01';
    24    13840.894731640636
    Time taken: 35.518 seconds, Fetched 1 row(s)

The query took 35 seconds. Even though you aggregated records only in the specific partition, the query took so long because there are many partitions and the GetPartitions API call takes time.

Now let’s run the same query against table_with_index to see how much benefit the partition index introduces.

  1. Run the following SQL against partition_index.table_with_index:
    spark-sql> SELECT count(*), sum(value) FROM partition_index.table_with_index WHERE year='2021' AND month='04' AND day='01';
    24    13840.894731640636
    Time taken: 2.247 seconds, Fetched 1 row(s)

The query took just 2 seconds. The reason for the difference in query duration is because the number of GetPartitions calls is smaller because of the partition index.

The following chart shows the granular metrics for query planning time without and with the partition index. The query planning time with the index is far less than that without the index.

For more information about comparing metrics in Apache Spark, see Appendix 2 at the end of this post.

Querying a table using Redshift Spectrum

To query with Redshift Spectrum, complete the following steps:

  1. Launch a new Redshift cluster.

You need to configure an IAM role for the cluster to utilize Redshift Spectrum and the Amazon Redshift query editor. Choose dc2.large, 1 node in this example. You need to launch the cluster in the us-east-1 Region because you need to place your cluster in the same Region as the bucket location.

  1. Connect with the Redshift query editor. For instructions, see Querying a database using the query editor.
  2. Create an external schema for the partition_index database to use it in Redshift Spectrum: (replace <your IAM role ARN> with your IAM role ARN).
    create external schema spectrum from data catalog 
    database 'partition_index' 
    iam_role '<your IAM role ARN>'
    create external database if not exists;

  3. Run the following SQL against spectrum_schema.table_without_index:
    SELECT count(*), sum(value) FROM spectrum.table_without_index WHERE year='2021' AND month='04' AND day='01'

The following screenshot shows our output.

The query took more than 3 minutes.

  1. Run the following SQL against spectrum_schema.table_with_index:
    SELECT count(*), sum(value) FROM spectrum.table_with_index WHERE year='2021' AND month='04' AND day='01'

The following screenshot shows our output.

The query for the table using indexes took just 8 seconds, which is much faster than the table without indexes.

Querying a table using AWS Glue ETL

Let’s launch an AWS Glue development endpoint and an Amazon SageMaker notebook.

  1. Open the AWS Glue console, choose Dev endpoints.
  2. Choose Add endpoint.
  3. For Development endpoint name, enter partition-index.
  4. For IAM role, choose your IAM role.

For more information about roles, see Managing Access Permissions for AWS Glue Resources.

  1. For Worker type under Security configuration, script libraries, and job parameters (optional), choose 1X.
  2. For Number of workers, enter 4.
  3. For Dependent jar path, enter s3://crawler-public/json/serde/json-serde.jar.
  4. Select Use Glue data catalog as the Hive metastore under Catalog options (optional).
  5. Choose Next.
  6. For Networking, leave as is (by default, Skip networking configuration is selected), and choose Next.
  7. For Add an SSH public key (Optional), leave it blank, and choose Next.
  8. Choose Finish.
  9. Wait for the development endpoint partition-index to show as READY.

The endpoint may take up to 10 minutes to be ready.

  1. Select the development endpoint partition-index, and choose Create SageMaker notebook on the Actions
  2. For Notebook name, enter partition-index.
  3. Select Create an IAM role.
  4. For IAM role, enter partition-index.
  5. Choose Create notebook.
  6. Wait for the notebook aws-glue-partition-index to show the status as Ready.

The notebook may take up to 3 minutes to be ready.

  1. Select the notebook aws-glue-partition-index, and choose Open notebook.
  2. Choose Sparkmagic (PySpark)on the New
  3. Enter the following code snippet against table_without_index, and run the cell:
    %%time
    %%sql
    SELECT count(*), sum(value) FROM partition_index.table_without_index WHERE year='2021' AND month='04' AND day='01'

The following screenshot shows our output.

The query took 3 minutes.

  1. Enter the following code snippet against partition_index.table_with_index, and run the cell:
    %%time
    %%sql
    SELECT count(*), sum(value) FROM partition_index.table_with_index WHERE year='2021' AND month='04' AND day='01'

The following screenshot shows our output.

The cell took just 7 seconds. The query for the table using indexes is faster than the table without indexes.

Cleaning up

Now to the final step, cleaning up the resources:

  1. Delete the CloudFormation stack. 
  2. Delete the EMR cluster.
  3. Delete the Amazon Redshift cluster.
  4. Delete the AWS Glue development endpoint and SageMaker notebook.

Conclusion

In this post, we explained how to use partition indexes and how they accelerate queries in various query engines. If you have several millions of partitions, the performance benefit is significantly more. You can learn about partition indexes more deeply in Working with Partition Indexes.


Appendix 1: Setting up a partition index using AWS CLI

If you prefer using the AWS CLI, run the following create-partition-index command to set up a partition index:

$ aws glue create-partition-index --database-name partition_index --table-name table_with_index --partition-index Keys=year,month,day,hour,IndexName=year-month-day-hour

To get the status of the partition index, run the following get-partition-indexes command:

$ aws glue get-partition-indexes --database-name partition_index --table-name table_with_index
{
    "PartitionIndexDescriptorList": [
        {
            "IndexName": "year-month-day-hour",
            "Keys": [
                {
                    "Name": "year",
                    "Type": "string"
                },
                {
                    "Name": "month",
                    "Type": "string"
                },
                {
                    "Name": "day",
                    "Type": "string"
                },
                {
                    "Name": "hour",
                    "Type": "string"
                }
            ],
            "IndexStatus": "CREATING"
        }
    ]
}

Appendix 2: Comparing breakdown metrics in Apache Spark

If you’re interested in comparing the breakdown metrics for query planning time, you can register a SQL listener with the following Scala code snippet:

spark.listenerManager.register(new org.apache.spark.sql.util.QueryExecutionListener {
  override def onSuccess(funcName: String, qe: org.apache.spark.sql.execution.QueryExecution, durationNs: Long): Unit = {
    val metricMap = qe.tracker.phases.mapValues { ps => ps.endTimeMs - ps.startTimeMs }
    println(metricMap.toSeq)
  }
  override def onFailure(funcName: String, qe: org.apache.spark.sql.execution.QueryExecution, exception: Exception): Unit = {}
})

If you use spark-shell, you can register the listener as follows:

$ spark-shell
...
scala> spark.listenerManager.register(new org.apache.spark.sql.util.QueryExecutionListener {
     |   override def onSuccess(funcName: String, qe: org.apache.spark.sql.execution.QueryExecution, durationNs: Long): Unit = {
     |     val metricMap = qe.tracker.phases.mapValues { ps => ps.endTimeMs - ps.startTimeMs }
     |     println(metricMap.toSeq)
     |   }
     |   override def onFailure(funcName: String, qe: org.apache.spark.sql.execution.QueryExecution, exception: Exception): Unit = {}
     | })

Then run the same query without using the index to get the breakdown metrics:

scala> spark.sql("SELECT count(*), sum(value) FROM partition_index.table_without_index WHERE year='2021' AND month='04' AND day='01'").show()
Vector((planning,208), (optimization,29002), (analysis,4))
+--------+------------------+
|count(1)|        sum(value)|
+--------+------------------+
|      24|13840.894731640632|
+--------+------------------+

In this example, we use the same setup for the EMR cluster (release: emr-6.2.0, three m5.xlarge nodes). The console has additional line:

Vector((planning,208), (optimization,29002), (analysis,4)) 

Apache Spark’s query planning mechanism has three phases: analysis, optimization, and physical planning (shown as just planning). This line means that the query planning took 4 milliseconds in analysis, 29,002 milliseconds in optimization, and 208 milliseconds in physical planning.

Let’s try running the same query using the index:

scala> spark.sql("SELECT count(*), sum(value) FROM partition_index.table_with_index WHERE year='2021' AND month='04' AND day='01'").show()
Vector((planning,7), (optimization,608), (analysis,2))                          
+--------+------------------+
|count(1)|        sum(value)|
+--------+------------------+
|      24|13840.894731640634|
+--------+------------------+

The query planning took 2 milliseconds in analysis, 608 milliseconds in optimization, and 7 milliseconds in physical planning.


About the Authors

Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue and AWS Lake Formation. He is passionate about big data technology and open source software, and enjoys building and experimenting in the analytics area.

 

 

 

Sachet Saurabh is a Senior Software Development Engineer at AWS Glue and AWS Lake Formation. He is passionate about building fault tolerant and reliable distributed systems at scale.

 

 

 

Vikas Malik is a Software Development Manager at AWS Glue. He enjoys building solutions that solve business problems at scale. In his free time, he likes playing and gardening with his kids and exploring local areas with family.

 

 

 

 

Amazon Redshift ML Is Now Generally Available – Use SQL to Create Machine Learning Models and Make Predictions from Your Data

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-redshift-ml-is-now-generally-available-use-sql-to-create-machine-learning-models-and-make-predictions-from-your-data/

With Amazon Redshift, you can use SQL to query and combine exabytes of structured and semi-structured data across your data warehouse, operational databases, and data lake. Now that AQUA (Advanced Query Accelerator) is generally available, you can improve the performance of your queries by up to 10 times with no additional costs and no code changes. In fact, Amazon Redshift provides up to three times better price/performance than other cloud data warehouses.

But what if you want to go a step further and process this data to train machine learning (ML) models and use these models to generate insights from data in your warehouse? For example, to implement use cases such as forecasting revenue, predicting customer churn, and detecting anomalies? In the past, you would need to export the training data from Amazon Redshift to an Amazon Simple Storage Service (Amazon S3) bucket, and then configure and start a machine learning training process (for example, using Amazon SageMaker). This process required many different skills and usually more than one person to complete. Can we make it easier?

Today, Amazon Redshift ML is generally available to help you create, train, and deploy machine learning models directly from your Amazon Redshift cluster. To create a machine learning model, you use a simple SQL query to specify the data you want to use to train your model, and the output value you want to predict. For example, to create a model that predicts the success rate for your marketing activities, you define your inputs by selecting the columns (in one or more tables) that include customer profiles and results from previous marketing campaigns, and the output column you want to predict. In this example, the output column could be one that shows whether a customer has shown interest in a campaign.

After you run the SQL command to create the model, Redshift ML securely exports the specified data from Amazon Redshift to your S3 bucket and calls Amazon SageMaker Autopilot to prepare the data (pre-processing and feature engineering), select the appropriate pre-built algorithm, and apply the algorithm for model training. You can optionally specify the algorithm to use, for example XGBoost.

Architectural diagram.

Redshift ML handles all of the interactions between Amazon Redshift, S3, and SageMaker, including all the steps involved in training and compilation. When the model has been trained, Redshift ML uses Amazon SageMaker Neo to optimize the model for deployment and makes it available as a SQL function. You can use the SQL function to apply the machine learning model to your data in queries, reports, and dashboards.

Redshift ML now includes many new features that were not available during the preview, including Amazon Virtual Private Cloud (VPC) support. For example:

Architectural diagram.

  • You can also create SQL functions that use existing SageMaker endpoints to make predictions (remote inference). In this case, Redshift ML is batching calls to the endpoint to speed up processing.

Before looking into how to use these new capabilities in practice, let’s see the difference between Redshift ML and similar features in AWS databases and analytics services.

ML Feature Data Training
from SQL
Predictions
using SQL Functions
Amazon Redshift ML

Data warehouse

Federated relational databases

S3 data lake (with Redshift Spectrum)

Yes, using
Amazon SageMaker Autopilot
Yes, a model can be imported and executed inside the Amazon Redshift cluster, or invoked using a SageMaker endpoint.
Amazon Aurora ML Relational database
(compatible with MySQL or PostgreSQL)
No

Yes, using a SageMaker endpoint.

A native integration with Amazon Comprehend for sentiment analysis is also available.

Amazon Athena ML

S3 data lake

Other data sources can be used through Athena Federated Query.

No Yes, using a SageMaker endpoint.

Building a Machine Learning Model with Redshift ML
Let’s build a model that predicts if customers will accept or decline a marketing offer.

To manage the interactions with S3 and SageMaker, Redshift ML needs permissions to access those resources. I create an AWS Identity and Access Management (IAM) role as described in the documentation. I use RedshiftML for the role name. Note that the trust policy of the role allows both Amazon Redshift and SageMaker to assume the role to interact with other AWS services.

From the Amazon Redshift console, I create a cluster. In the cluster permissions, I associate the RedshiftML IAM role. When the cluster is available, I load the same dataset used in this super interesting blog post that my colleague Julien wrote when SageMaker Autopilot was announced.

The file I am using (bank-additional-full.csv) is in CSV format. Each line describes a direct marketing activity with a customer. The last column (y) describes the outcome of the activity (if the customer subscribed to a service that was marketed to them).

Here are the first few lines of the file. The first line contains the headers.

age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y 56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no

I store the file in one of my S3 buckets. The S3 bucket is used to unload data and store SageMaker training artifacts.

Then, using the Amazon Redshift query editor in the console, I create a table to load the data.

CREATE TABLE direct_marketing (
	age DECIMAL NOT NULL, 
	job VARCHAR NOT NULL, 
	marital VARCHAR NOT NULL, 
	education VARCHAR NOT NULL, 
	credit_default VARCHAR NOT NULL, 
	housing VARCHAR NOT NULL, 
	loan VARCHAR NOT NULL, 
	contact VARCHAR NOT NULL, 
	month VARCHAR NOT NULL, 
	day_of_week VARCHAR NOT NULL, 
	duration DECIMAL NOT NULL, 
	campaign DECIMAL NOT NULL, 
	pdays DECIMAL NOT NULL, 
	previous DECIMAL NOT NULL, 
	poutcome VARCHAR NOT NULL, 
	emp_var_rate DECIMAL NOT NULL, 
	cons_price_idx DECIMAL NOT NULL, 
	cons_conf_idx DECIMAL NOT NULL, 
	euribor3m DECIMAL NOT NULL, 
	nr_employed DECIMAL NOT NULL, 
	y BOOLEAN NOT NULL
);

I load the data into the table using the COPY command. I can use the same IAM role I created earlier (RedshiftML) because I am using the same S3 bucket to import and export the data.

COPY direct_marketing 
FROM 's3://my-bucket/direct_marketing/bank-additional-full.csv' 
DELIMITER ',' IGNOREHEADER 1
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
REGION 'us-east-1';

Now, I create the model straight form the SQL interface using the new CREATE MODEL statement:

CREATE MODEL direct_marketing
FROM direct_marketing
TARGET y
FUNCTION predict_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'my-bucket'
);

In this SQL command, I specify the parameters required to create the model:

  • FROM – I select all the rows in the direct_marketing table, but I can replace the name of the table with a nested query (see example below).
  • TARGET – This is the column that I want to predict (in this case, y).
  • FUNCTION – The name of the SQL function to make predictions.
  • IAM_ROLE – The IAM role assumed by Amazon Redshift and SageMaker to create, train, and deploy the model.
  • S3_BUCKET – The S3 bucket where the training data is temporarily stored, and where model artifacts are stored if you choose to retain a copy of them.

Here I am using a simple syntax for the CREATE MODEL statement. For more advanced users, other options are available, such as:

  • MODEL_TYPE – To use a specific model type for training, such as XGBoost or multilayer perceptron (MLP). If I don’t specify this parameter, SageMaker Autopilot selects the appropriate model class to use.
  • PROBLEM_TYPE – To define the type of problem to solve: regression, binary classification, or multiclass classification. If I don’t specify this parameter, the problem type is discovered during training, based on my data.
  • OBJECTIVE – The objective metric used to measure the quality of the model. This metric is optimized during training to provide the best estimate from data. If I don’t specify a metric, the default behavior is to use mean squared error (MSE) for regression, the F1 score for binary classification, and accuracy for multiclass classification. Other available options are F1Macro (to apply F1 scoring to multiclass classification) and area under the curve (AUC). More information on objective metrics is available in the SageMaker documentation.

Depending on the complexity of the model and the amount of data, it can take some time for the model to be available. I use the SHOW MODEL command to see when it is available:

SHOW MODEL direct_marketing

When I execute this command using the query editor in the console, I get the following output:

Console screenshot.

As expected, the model is currently in the TRAINING state.

When I created this model, I selected all the columns in the table as input parameters. I wonder what happens if I create a model that uses fewer input parameters? I am in the cloud and I am not slowed down by limited resources, so I create another model using a subset of the columns in the table:

CREATE MODEL simple_direct_marketing
FROM (
        SELECT age, job, marital, education, housing, contact, month, day_of_week, y
 	  FROM direct_marketing
)
TARGET y
FUNCTION predict_simple_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'my-bucket'
);

After some time, my first model is ready, and I get this output from SHOW MODEL. The actual output in the console is in multiple pages, I merged the results here to make it easier to follow:

Console screenshot.

From the output, I see that the model has been correctly recognized as BinaryClassification, and F1 has been selected as the objective. The F1 score is a metrics that considers both precision and recall. It returns a value between 1 (perfect precision and recall) and 0 (lowest possible score). The final score for the model (validation:f1) is 0.79. In this table I also find the name of the SQL function (predict_direct_marketing) that has been created for the model, its parameters and their types, and an estimation of the training costs.

When the second model is ready, I compare the F1 scores. The F1 score of the second model is lower (0.66) than the first one. However, with fewer parameters the SQL function is easier to apply to new data. As is often the case with machine learning, I have to find the right balance between complexity and usability.

Using Redshift ML to Make Predictions
Now that the two models are ready, I can make predictions using SQL functions. Using the first model, I check how many false positives (wrong positive predictions) and false negatives (wrong negative predictions) I get when applying the model on the same data used for training:

SELECT predict_direct_marketing, y, COUNT(*)
  FROM (SELECT predict_direct_marketing(
                   age, job, marital, education, credit_default, housing,
                   loan, contact, month, day_of_week, duration, campaign,
                   pdays, previous, poutcome, emp_var_rate, cons_price_idx,
                   cons_conf_idx, euribor3m, nr_employed), y
          FROM direct_marketing)
 GROUP BY predict_direct_marketing, y;

The result of the query shows that the model is better at predicting negative rather than positive outcomes. In fact, even if the number of true negatives is much bigger than true positives, there are much more false positives than false negatives. I added some comments in green and red to the following screenshot to clarify the meaning of the results.

Console screenshot.

Using the second model, I see how many customers might be interested in a marketing campaign. Ideally, I should run this query on new customer data, not the same data I used for training.

SELECT COUNT(*)
  FROM direct_marketing
 WHERE predict_simple_direct_marketing(
           age, job, marital, education, housing,
           contact, month, day_of_week) = true;

Wow, looking at the results, there are more than 7,000 prospects!

Console screenshot.

Availability and Pricing
Redshift ML is available today in the following AWS Regions: US East (Ohio), US East (N Virginia), US West (Oregon), US West (San Francisco), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (Paris), Europe (Stockholm), Asia Pacific (Hong Kong) Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (São Paulo). For more information, see the AWS Regional Services list.

With Redshift ML, you pay only for what you use. When training a new model, you pay for the Amazon SageMaker Autopilot and S3 resources used by Redshift ML. When making predictions, there is no additional cost for models imported into your Amazon Redshift cluster, as in the example I used in this post.

Redshift ML also allows you to use existing Amazon SageMaker endpoints for inference. In that case, the usual SageMaker pricing for real-time inference applies. Here you can find a few tips on how to control your costs with Redshift ML.

To learn more, you can see this blog post from when Redshift ML was announced in preview and the documentation.

Start getting better insights from your data with Redshift ML.

Danilo

Implementing multi-tenant patterns in Amazon Redshift using data sharing

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/implementing-multi-tenant-patterns-in-amazon-redshift-using-data-sharing/

Software service providers offer subscription-based analytics capabilities in the cloud with Analytics as a Service (AaaS), and increasingly customers are turning to AaaS for business insights. A multi-tenant storage strategy allows the service providers to build a cost-effective architecture to meet increasing demand.

Multi-tenancy means a single instance of software and its supporting infrastructure is shared to serve multiple customers. For example, a software service provider could generate data that is housed in a single data warehouse cluster, but accessed securely by multiple customers. This storage strategy offers an opportunity to centralize management of data, simplify ETL processes, and optimize costs. However, service providers have to constantly balance between cost and providing a better user experience for their customers.

With the new data sharing feature, you can use Amazon Redshift to scale and meet both objectives of managing costs by simplifying storage and ETL pipelines while still providing consistent performance to customers.  You can ingest data into a cluster designated as a producer cluster, and share this live data with one or more consumer clusters. Clusters accessing this shared data are isolated from each other, therefore performance of a producer cluster isn’t impacted by workloads on consumer clusters. This enables consuming clusters to get consistent performance based on individual compute capacity.

In this post, we focus on various AaaS patterns, and discuss how you can use data sharing in a multi-tenant architecture to scale for virtually unlimited users. We discuss detailed steps to use data sharing with different storage strategies.

Multi-tenant storage patterns

Multi-tenant storage patterns help simplify the architecture and long-term maintenance of the analytics platform. In a multi-tenant strategy, data is stored centrally in a single cluster for all tenants, enabling simplification of the ETL ingestion pipeline and data management. In the previously published whitepaper SaaS Storage Strategies, various models of storage and benefits are covered for a single cluster scenario.

The three strategies you can choose from are:

  • Pool model – Data is stored in a single database schema for all tenants, and a new column (tenant_id) is used to scope and control access to individual tenant data. Access to the multi-tenant data is controlled using views built on the tables.
  • Bridge model – Storage and access to data for each tenant is controlled at individual schema level in the same database.
  • Silo model – Storage and access control to data for each tenant is maintained in separate databases

The following diagram illustrates the architecture of these multi-tenant storage strategies.

The following diagram illustrates the architecture of these multi-tenant storage strategies.

In the following sections, we will discuss how these multi-tenant strategies can be implemented using Amazon Redshift data sharing feature with a multi-cluster architecture.

Scaling your multi-tenant architecture using data sharing

AaaS providers implementing multi-tenant architectures were previously limited to resources of a single cluster to meet the compute and concurrency requirements of users across all the tenants. As the number of tenants increased, you could either turn on concurrency scaling or create additional clusters. However, the addition of new clusters means additional ingestion pipelines and increased operational overhead.

With data sharing in Amazon Redshift, you can easily and securely share data across clusters. Data ingested into the producer cluster is shared with one or more consumer clusters, which allows total separation of ETL and BI workloads. Several consumer clusters can read data from the managed storage of a producer cluster. This enables instant, granular, and high-performance access without data copies and movement. Workloads accessing shared data are isolated from each other and the producer. You can distribute workloads across multiple clusters while simplifying and consolidating the ETL ingestion pipeline into one main producer cluster, providing optimal price for performance.

Consumer clusters can in turn be producers for the data sets they own. Customers can optimize costs even further by collocating multiple tenants on the same consumer cluster. For instance, you can group low volume tier 3 tenants into a single consumer cluster to provider a lower cost offering, while high volume tier 1 tenants get their own isolated compute clusters. Consumer clusters can be created in the same account as producer or in a different AWS account. With this you can have separate billing for the consumer clusters, where you can chargeback to the business group that uses the consumer cluster or even allow your customers to use their own Redshift cluster in their account, so they pay for usage of the consumer cluster. The following diagram shows the difference in ETL and consumer access patterns in a multi-tenant architecture using data sharing versus a single cluster approach without data sharing.

Consumer clusters can in turn be producers for the data sets they own.

Multi-tenant architecture with data sharing compared to single cluster approach

Creating a multi-tenant architecture for an AaaS solution

For this post, we use a simple data model with a fact and a dimension table to demonstrate how to leverage data sharing to design a scalable multi-tenant AaaS solution. We cover detailed steps involved for each storage strategy using this data model. The tables are as follows:

  • Customer – dimension table containing customer details
  • Sales – fact table containing sales transactions

We use two Amazon Redshift ra3.4xl clusters, with 2 nodes each, and designate one cluster as producer and other as consumer.

The high-level steps involved in enabling data sharing across clusters are as follows:

  1. Create a data share in the producer cluster and assign database objects to the data share.
  2. From the producer cluster, grant usage on the data share to consumer clusters, identified by namespace or AWS account.
  3. From the consumer cluster, create an external database using the data share from the producer
  4. Query the tables in the data share through the external shared database in the consumer cluster. Grant access to other users to access this shared database and objects.

Creating producer and consumer Amazon Redshift clusters

Let us start by creating two Amazon Redshift ra3.4xl clusters with 2-nodes each, one for the producer and other for consumer.

  1. On the Amazon Redshift cluster, create two clusters of RA3 instance type, and name them ds-producer and ds-consumer-c1, respectively.
  1. Next, log in to Amazon Redshift using the query editor. You can also use a SQL client tool like DBeaver, SQL Workbench, or Aginity Workbench. For configuration information, see Connecting to an Amazon Redshift cluster using SQL client tools.

Get the cluster namespace of the producer and consumer clusters from the console. We will use the namespaces to create the tenant table and to create and access the data shares. You can also get the cluster namespaces by logging into each of the clusters and executing the SELECT CURRENT_NAMESPACE statement in the query editor.

Please note to replace the corresponding namespaces in the code sections wherever producercluster_namespace, consumercluster1_namespace, and consumercluster_namespace is referenced.

The following screenshot shows the namespace on the Amazon Redshift console.

Now that we have the clusters created, we will go through the detailed steps for the three models. First, we will cover the Pool model, followed by Bridge model and finally the Silo model.

Pool model

The pool model represents an all-in, multi-tenant model where all tenants share the same storage constructs and provides the most benefit in simplifying the AaaS solution.

With this model, data storage is centralized in one cluster database, and data is stored for all tenants in the same set of data models. To scope and control access to tenant data, we introduce a column (tenant_id) that serves as a unique identifier for each tenant.

Security management to prevent cross-tenant access is one of the main aspects to address with the pool model. We can implement row-level security and provide secure access to the data by creating database views and set application-level policies by creating groups with specific access and assigning users to the groups. The following diagram illustrates the pool model architecture.

The following diagram illustrates the pool model architecture.

To create a multi-tenant solution using the pool model, you create data shares for the pool model in the producer cluster, and share data with the consumer cluster. We provide more detail on these steps in the following sections.

Creating data shares for the pool model in the producer cluster

To create data shares for the pool model in the producer cluster, complete the following steps:

  1. Log in to the producer cluster as an admin user and run the following script.

Note that we have a tenant table to store unique identifiers for each tenant or consumer (tenant).

We add a column (tenant_id) to the sales and customer tables to uniquely identify tenant data. This tenant_id references the tenant_id in the tenant table to uniquely identify the tenant and consumer records. See the following code:

/**********************************************/
/*       datasharing datasetup pool model     */
/**********************************************/

-- Create Schema
create schema sales;

CREATE TABLE IF NOT EXISTS sales.tenant ( 
t_tenantid int8 not null,
t_name varchar(50) not null,
t_namespace varchar(50),
t_account varchar(16)
)
DISTSTYLE AUTO
SORTKEY AUTO;

-- Create tables for multi-tenant sales schema
drop table sales.customer;
CREATE TABLE IF NOT EXISTS sales.customer(
  c_tenantid int8 not null,
  c_custid int8 not null,
  c_name varchar(25) not null,
  c_region varchar(40) not null,
  Primary Key(c_tenantid, c_custid)
) 
DISTSTYLE AUTO
SORTKEY AUTO;

CREATE TABLE IF NOT EXISTS sales.sales (
  s_tenantid int8 not null,
  s_orderid int8 not null,
  s_custid int8 not null,
  s_totalprice numeric(12,2) not null,
  s_orderdate date not null,
  Primary Key(s_tenantid, s_orderid)
) 
DISTSTYLE AUTO
SORTKEY AUTO;
  1. Set up the tenant table with the details for each consumer cluster, and ingest data into the customer dimension and sales fact tables. Using the COPY command is the recommended way to ingest data into Amazon Redshift, but for illustration purposes, we use INSERT statements:
    -- Ingest data 
    insert into sales.tenant values
    (0, 'primary', '<producercluster_namespace>',''),
    (1, 'tenant1', '<consumercluster1_namespace>',''),
    (2, 'tenant2', '<consumercluster2_namespace>','');
    
    insert into sales.customer values
    (1, 1, 'Customer 1', 'NorthEast'),
    (1, 2, 'Customer 2', 'SouthEast'),
    (2, 1, 'Customer 3', 'NorthWest'),
    (2, 2, 'Customer 4', 'SouthEast');
    
    truncate table sales.sales;
    insert into sales.sales values
    (1, 1, 1, 2434.33, '2020-11-21'),
    (1, 2, 2, 54.90, '2020-5-5'),
    (1, 3, 2, 9678.99, '2020-3-8'),
    (2, 1, 2, 452.24, '2020-1-23'),
    (2, 2, 1, 76523.10, '2020-11-3'),
    (2, 3, 1, 6745.20, '2020-10-01');
    
    select count(*) from sales.tenant;
    select count(*) from sales.customer;
    select count(*) from sales.sales;

Securing data on the producer cluster by restricting access

In the pool model, no external user has direct access to underlying tables. All access is restricted using views.

  1. Create a view for each of the fact and dimension tables to include a condition to filter records from the consumer tenant’s namespace. In our example, we create v_customersales to combine sales fact and customer dimension tables with a restrictive filter for tenant.namespace = current_namespace. See the following code:
    /**********************************************/
    /* We will create late binding views          */
    /* but materialized views could also be used  */
    /**********************************************/
    
    create or replace view sales.v_customer as
    select * 
    from sales.customer c, sales.tenant t
    where c.c_tenantid = t.t_tenantid 
    and t.t_namespace  = current_namespace;
    
    create or replace view sales.v_sales as
    select * 
    from sales.sales s, sales.tenant t
    where s.s_tenantid = t.t_tenantid 
    and t.t_namespace  = current_namespace;
    
    create or replace view sales.v_customersales as 
    select c_tenantid, c_name, c_region, 
    	date_part(w, to_date(s_orderdate,'YYYY-MM-DD')) as "week", 
    	date_part(mon, to_date(s_orderdate,'YYYY-MM-DD')) as "month", 
    	date_part(dow, to_date(s_orderdate,'YYYY-MM-DD')) as "dow",
    	date_part(yr, to_date(s_orderdate,'YYYY-MM-DD')) as "year",
    	date_part(d, to_date(s_orderdate,'YYYY-MM-DD')) as "dom", 
    	t.t_namespace
    from sales.tenant t, sales.customer c, sales.sales s
    where t.t_tenantid = c.c_tenantid 
    and c.c_tenantid = s.s_tenantid 
    and c.c_custid = s.s_custid 
    and t.t_namespace = current_namespace 
    WITH NO SCHEMA BINDING;
    
    select * from sales.v_customersales;
          
    

Now that we have database objects created in the producer cluster, we can share the data with the consumer clusters.

Sharing data with the consumer cluster

To share data with the consumer cluster, complete the following steps:

  1. Create a data share for the sales data:
    /***************************************************/
    /* Create Datashare and add objects to the share    */
    /****************************************************/
    CREATE DATASHARE salesshare;
    

  1. Enter the following code to alter the data share, add the sales schema to be shared with the consumer clusters, and add all tables in the sales schema to be shared with the consumer cluster:
    /************************************************************/
    /* Add objects at desired granularities: schemas, tables,   */
    /* views include materialized, and SQL UDFs                 */
    /************************************************************/
    ALTER DATASHARE salesshare ADD SCHEMA sales;  -- New addition to create SCHEMA first
    
    /*For pool model, we share only the views and not tables */
    ALTER DATASHARE SalesShare ADD TABLE sales.v_customer;
    ALTER DATASHARE SalesShare ADD TABLE sales.v_customersales;
    

For the pool model, we share only the views with the consumer cluster and not the tables. The ALTER statement ADD TABLE is used to add both views and tables.

  1. Grant usage on the sales data share to the namespace of the BI consumer cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster. See the following code:
    /********************************************************************/
    /* Grant access to consumer clusters                                */
    /* login to Consumer BI Cluster and get the Namespace from          */
    /* the Redshift console or using SELECT CURRENT_NAMESPACE           */
    /********************************************************************/
    SELECT CURRENT_NAMESPACE;
    
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    GRANT USAGE ON DATASHARE salesshare TO NAMESPACE '<consumercluster1_namespace>'
    --Account numbers are 12 digit long
    GRANT USAGE ON DATASHARE salesshare TO ACCOUNT 'Consumer_AWSAccount';
    

  1. View data shares that are shared from the producer cluster:
    SELECT * FROM SVV_DATASHARES;

The following screenshot shows the output.

The following screenshot shows the output.

You can also see the data shares and their detailed objects and consumers using the following commands:

SHOW DATASHARES;
DESC DATASHARE salesshare;
select * from SVV_DATASHARE_OBJECTS;
select * from SVV_DATASHARE_CONSUMERS;

Viewing and querying data shares for the pool model from the consumer cluster

To view and query data shares from the consumer cluster, complete the following steps:

  1. Log in to the consumer cluster as an admin user and view the data share objects:
    /**********************************************************/
    /* Login to Consumer cluster as awsuser:                  */
    /* View datashares and create local database for querying */
    /**********************************************************/
    select * from SVV_DATASHARE_OBJECTS;

The following screenshot shows the results.

The following screenshot shows the results.

  1. Create a new database from the data share of the producer cluster:
    /**********************************************************/
    /* Create a local database and schema reference           */
    /**********************************************************/
    CREATE DATABASE sales_db FROM DATASHARE salesshare
    OF NAMESPACE '<producercluster_namespace>';

  1. Optionally, you can create an external schema in the consumer cluster pointing to the schema in the database of the producer cluster.

Creating a local external schema in the consumer cluster allows schema-level access controls within the consumer cluster, and uses a two-part notation when referencing shared data objects (localschema.table; vs. external_db.producerschema.table). See the following code:

/*********************************************************/
/* Create External Schema - Optional                     */
/* reason for schema: give specific access to schema     */
/* using shared alias get access to a secondary database */
/*********************************************************/
CREATE EXTERNAL SCHEMA sales_schema 
FROM REDSHIFT DATABASE 'sales_db' SCHEMA 'sales';
  1. Now you can query the shared data from the producer cluster by using the syntax tenant.schema.table:
    select * from sales_db.sales.customer;
    select * from sales_db.sales.v_customersales;

  1. From the tenant1 consumer cluster, you can view the databases and the tenants that are accessible to tenant1. tenant1_schema is as follows:
    select * from SVV_REDSHIFT_DATABASES;
    

The following screenshot shows the results.

The following screenshot shows the results.

Creating local consumer users and controlling access

You can control access to users in your consumer cluster by creating users and groups, and assigning access to the data share objects.

  1. Log in as an admin user on consumer cluster 1 and enter the following code to create tenant1_group, grant usage on the local database sales_db and schema sales_schema to the group, and assign the user tenant1_user to the tenant1_group:
    /********************************************************/
    /* Consumers can create own users and assign privileges */
    /* Create tenant1_group and assign privileges to read   */
    /* sales_db and the sales_schema                        */
    /* Create tenant1_user in tenant1_group                 */
    /********************************************************/
    create group tenant1_group;
    create user tenant1_user password 'Redshift#123!' in group tenant1_group;
    
    GRANT USAGE ON DATABASE sales_db TO tenant1_group;
    GRANT USAGE ON SCHEMA sales_schema TO GROUP tenant1_group;
    

  1. Now, login as tenant1_user to consumer cluster 1 and select data from the views v_customer and v_customersales:
    /*******************************************************/
    /* select from view returns only sales records related */
    /* to Consumer A namespace                             */
    /*******************************************************/
    select * from sales_db.sales.v_customer;
    select * from sales_db.sales.v_customersales;
    

You should see only the data relevant to tenant 1 and not the data that is associated with tenant 2.

You should see only the data relevant to tenant 1 and not the data that is associated with tenant 2.

 

Create Materialized views to optimize performance

Consumer clusters can have their own database objects which are local to the consumer. You can also create materialized views on the datashare objects and control when to refresh the dataset for your consumers. This provides another level of isolation from the producer cluster, and will ensure the consumer clusters go against their local dataset.

  1. Log in as an admin user on consumer cluster 1 and enter the following code to create a materialized view for customersales. This will create a local view that can be periodically refreshed from the consumer cluster.

 

/*******************************************************/
/* Create materialized view in consumer cluster        */
/*******************************************************/
create MATERIALIZED view tenant1_sales.mv_customersales as 
select c_tenantid, c_name, c_region, 
	date_part(w, to_date(s_orderdate,'YYYY-MM-DD')) as "week", 
	date_part(mon, to_date(s_orderdate,'YYYY-MM-DD')) as "month", 
	date_part(dow, to_date(s_orderdate,'YYYY-MM-DD')) as "dow",
	date_part(yr, to_date(s_orderdate,'YYYY-MM-DD')) as "year",
	date_part(d, to_date(s_orderdate,'YYYY-MM-DD')) as "dom", 
	t.t_namespace
from sales_db.tenant t, sales_db.customer c, sales_db.sales s
where t.t_tenantid = c.c_tenantid 
and c.c_tenantid = s.s_tenantid 
and c.c_custid = s.s_custid 
and t.t_namespace = current_namespace;

select * from tenant1_sales.mv_customersales top 100;

REFRESH MATERIALIZED VIEW tenant1_sales.mv_customersales;

 

With the preceding steps, we have demonstrated how you can control access to the tenant data in the same datastore using views. We also reviewed how data shares help efficiently share data between producer and consumer clusters with transaction consistency. We also saw how a local materialized view can be created to further isolate your BI workloads for your customers and provide a consistent, performant user experience. In the next section we will discuss the Bridge model.

Bridge model

In the bridge model, data for each tenant is stored in its own schema in a database and contains a similar set of tables. Data shares are created for each schema and shared with the corresponded consumer. This is an appealing balance between silo and pool model, providing both data isolation and ETL consolidation. With Amazon Redshift, you can create up to 9,900 schemas. For more information, see Quotas and limits in Amazon Redshift.

With data sharing, separate consumer clusters can be provisioned to use the same managed storage from producer cluster. Consumer clusters have all the capabilities of a producer cluster, and can in turn be producer clusters for data objects they own. Consumers can’t share data that is already shared with them. Without data sharing, queries from all customers are directed to a single cluster. The following diagram illustrates the bridge model.

The following diagram illustrates the bridge model.

To create a multi-tenant architecture using bridge model, complete the steps in the following sections.

Creating database schemas and tables for the bridge model in the producer cluster

As we did in the pool model, the first step is to create the database schema and tables. We log in to the producer cluster as an admin user and create separate schemas for each tenant. For our post, we create two schemas, tenant1 and tenant2, to store data for two tenants.

  1. Log in to the producer cluster as the admin user.
  1. Use the script below to create two schemas, tenant1 and tenant2, and create tables for customer dimension and sales facts under each of the two schemas. See the following code:
    /****************************************/
    /* Bridge -  Data Model */
    /****************************************/
    -- Create schemas tenant1 and tenant2
    create schema tenant1;
    create schema tenant2;
    
    -- Create tables for tenant1
    CREATE TABLE IF NOT EXISTS tenant1.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant1.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    
      
    
    -- Create tables for tenant2
    CREATE TABLE IF NOT EXISTS tenant2.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant2.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    

  1. Ingest data into the customer dimension and sales fact tables. Using the COPY command is the recommended way to ingest data into Amazon Redshift, but for illustration purposes, we use the INSERT statement:
    -- ingest data for tenant1
    -- ingest customer data
    insert into tenant1.customer values
    (1, 'Customer 1', 'NorthEast'),
    (2, 'Customer 2', 'SouthEast');
    
    -- ingest sales data
    insert into tenant1.sales values
    (1, 1, 2434.33, '2020-11-21'),
    (2, 2, 54.90, '2020-5-5'),
    (3, 2, 9678.99, '2020-3-8');
    
    select count(*) from tenant1.customer;
    select count(*) from tenant1.sales;
    
    
    -- ingest data for tenant2
    -- ingest customer data
    insert into tenant2.customer values
    (1, 'Customer 3', 'NorthWest'),
    (2, 'Customer 4', 'SouthEast');
    
    -- ingest sales data
    truncate table tenant2.sales;
    insert into tenant2.sales values
    (1, 2, 452.24, '2020-1-23'),
    (2, 1, 76523.10, '2020-11-3'),
    (3, 1, 6745.20, '2020-10-01');
    
    
    select count(*) from tenant2.customer;
    select count(*) from tenant2.sales;
    

Creating data shares and granting usage to the consumer cluster

In the following code, we create two data shares, tenant1share and tenant2share, to share the database objects under the two schemas to the respective consumer clusters.

  1. Create two datashares tenant1share and tenant2share to share the database objects under the two schemas to the respective consumer clusters.
    /******************************************************************/
    /*   Create Datashare and add database objects to the datashare   */
    /******************************************************************/
    CREATE DATASHARE tenant1share;
    CREATE DATASHARE tenant2share;

  1. Alter the datashare and add the schema(s) for respective tenants to be shared with the consumer cluster
    ALTER DATASHARE tenant1share ADD SCHEMA tenant1;
    ALTER DATASHARE tenant2share ADD SCHEMA tenant2;

  1. Alter the datashare and add all tables in the schema(s) to be shared with the consumer cluster
    ALTER DATASHARE tenant1share ADD ALL TABLES IN SCHEMA tenant1;
    ALTER DATASHARE tenant2share ADD ALL TABLES IN SCHEMA tenant2;

Getting the namespace of the first consumer cluster

  1. Log in to the consumer cluster and get the namespace from the console or by running the select current_namespace command:
    /* Grant access to tenant1 schema for Tenant1 BI Cluster */
    /* login to tenant1 BI Cluster and get the Namespace 
     * or get the Namespace from the Redshift console */
    SELECT CURRENT_NAMESPACE;

 

  1. Grant usage on the data share for the first tenant to the namespace of the BI cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster:
    -- Grant usage on the datashare to the first consumer cluster
    -- Namespace refers to the namespace GUID of the consumer cluster 
    GRANT USAGE ON DATASHARE tenant1share TO NAMESPACE '<consumercluster1_namespace>'
    --Account numbers are 12 digit long (optional)
    --GRANT USAGE ON DATASHARE tenant1share TO ACCOUNT '<Consumer_AWSAccount>';

Getting the namespace of the second consumer cluster

  1. Log in to the second consumer cluster and get the namespace from the console or by running the select current_namespace command:
    /* Grant access to tenant2 schema for Tenant2 BI Cluster */
    /*login to tenant1 BI Cluster and get the Namespace      */
    SELECT CURRENT_NAMESPACE;
    

  1. Grant usage on the data share for the second tenant to the namespace of the second consumer cluster you just got from the previous step:
    -- Grant usage on the datashare to the second consumer cluster
    GRANT USAGE ON DATASHARE tenant2share TO NAMESPACE '<consumercluster2_namespace>'
    --Account numbers are 12 digit long (optional)
    --GRANT USAGE ON DATASHARE tenant2share TO ACCOUNT '<Consumer_AWSAccount>';

  1. To view data shares from the producer cluster, enter the following code:
    /******************************************************************/
    /*   View Datashares created, Datashare objects and consumers     */
    /******************************************************************/
    select * from SVV_DATASHARES;
    select * from SVV_DATASHARE_OBJECTS;
    select * from SVV_DATASHARE_CONSUMERS;

The following screenshot shows the commands in the query editor.

The following screenshot shows the commands in the query editor.

The following screenshot shows the query results.

The following screenshot shows the query results.

Accessing data using the consumer cluster from the data share

To access data using the consumer cluster, complete the following steps:

  1. Log in to the first consumer cluster ds-consumer-c1, as an admin user.
  1. View data share objects from the SVV_DATASHARE_OBJECTS system view:
    /*****************************************************/
    /* Consumer cluster as adminuser
    /* List the shares available and review contents for each 
    /********************************************************/
    -- You can view datashare objects associated with the cluster
    -- using either of the two commands
    SHOW DATASHARES;
    select * from SVV_DATASHARES;

The following screenshot shows the query results.

The following screenshot shows the query results.

--View objects shared in inbound share for consumer
select * from SVV_DATASHARE_OBJECTS;

The following screenshot shows the query results.

The following screenshot shows the query results.

--View namespace or clusters granted usage to a datashare
select * from svv_datashare_consumers;
  1. Create a local database in the first consumer cluster, and an external schema to be able to provide controlled access to the specific schema to the consumer clusters:
    /*******************************************************/
    /* Create a local database and schema reference        */
    /* to the share objects                                */
    /*******************************************************/
    CREATE DATABASE tenant1_db FROM DATASHARE tenant1share
    OF NAMESPACE '<producercluster_namespace>';

  1. Query the database tables using the three-part notation db.tenant.table:
    select * from tenant1_db.tenant1.customer;
    select * from tenant1_db.tenant1.sales;

  1. Optionally, you can create an external schema.

There are two reasons to create an external schema: either to enable two-part notation access to the tables from the consumer cluster, or to provide restricted access to the specific schemas for selected users, when multiple schemas are shared from the producer cluster. See the following code for our external schema:

/* Create External Schema */
CREATE EXTERNAL SCHEMA tenant1_schema FROM REDSHIFT DATABASE 'tenant1_db' SCHEMA 'tenant1';
  1. If you created the local schemas, you can use the following two-part notation to query the database tables:
    select * from tenant1_schema.customer;
    select * from tenant1_schema.sales;

  1. You can view the shared databases by querying the SVV_REDSHIFT_DATABASES table:
    select * from SVV_REDSHIFT_DATABASES;

The following screenshot shows the query results.

The following screenshot shows the query results.

Creating consumer users for managing access

Still logged in as an admin user to the consumer cluster, you can create other users who have access to the database objects.

  1. Create users and groups, and assign users and object privileges to the groups with the following code:
    /*******************************************************/
    /* Consumer can create own users and assign privileges */
    /* Create tenant1_user and assign privileges to        */
    /* read datashare from tenant1 schema                  */
    /*******************************************************/
    create group tenant1_group;
    create user tenant1_user password 'Redshift#123!' in group tenant1_group;
    
    GRANT USAGE ON DATABASE tenant1_db TO tenant1_user;
    GRANT USAGE ON SCHEMA tenant1_schema TO GROUP tenant1_group;
    

Now tenant1_user can log in and query the shared tables from tenant schema.

  1. Log in to the consumer cluster as tenant1_user and query the tables:
    /************************************************************/
    /* Consumer cluster as tenant1_user - As a consumer cluster */
    /* administrator, you must follow these steps:              */
    /************************************************************/
    
    select * from tenant1_db.tenant1.customer;
    select * from tenant1_db.tenant1.sales;
    
    
    /************************************************************/
    /* If you have created and External Schema                  */
    /* you can use the two-part notation to query tables.       */
    /************************************************************/
    
    select * from tenant1_schema.customer;
    select * from tenant1_schema.sales;
    

Revoking access to a data share (optional)

  1. At any point, if you want to revoke access to the data share, you can the REVOKE USAGE command:
    /*************************************************************/
    /* To revoke access at any time use the REVOKE USAGE command */
    /*************************************************************/
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    REVOKE USAGE ON DATASHARE Salesshare FROM NAMESPACE '<consumercluster1_namespace>'
    --Account numbers are 12 digit long
    REVOKE USAGE ON DATASHARE Salesshare FROM ACCOUNT '<Consumer_AWSAccount>';
    

Silo model

The third option is to store data for each tenant in separate databases within a cluster. If you need your data isolated from other tenants, you can use the silo model and each database may have distinct data models, monitoring, management, and security footprints.

Amazon Redshift supports cross-database queries across databases, which allow you to simplify data organization. You can store common or granular datasets used across all tenants in a centralized database, and use the cross-database query capability to join relevant data for each tenant.

The steps to create a data share in a silo model is similar to a bridge model; however, unlike a bridge model (where data share is for each schema), the silo model has a data share created for each database. The following diagram illustrates the architecture of the silo model.

The following diagram illustrates the architecture of the silo model.

Creating data shares for the silo model in the producer cluster

To create data shares for the silo model in the producer cluster, complete the following steps:

  1. Log in to the producer cluster as an admin user and create separate databases for each tenant:
    /*****************************************************/
    /** Silo Model – Create databases for the 2 tenants **/
    /*****************************************************/
    create database tenant1_silodb;
    
    create database tenant2_silodb;
    

  1. Log in again to the producer cluster with the database name and user ID for the database that you want to share (tenant1_silodb) and create the schema and tables:
    /***********************************************************/
    /* login to tenant1_db and create schema tenant1 and tables*/
    /***********************************************************/
    create schema tenant1_siloschema;
    
    -- Create tables for tenant1
    CREATE TABLE IF NOT EXISTS tenant1_silodb.tenant1_siloschema.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant1_silodb.tenant1_siloschema.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    
    insert into tenant1_siloschema.customer values
    (1, 'Customer 1', 'NorthEast'),
    (2, 'Customer 2', 'SouthEast');
    
    truncate table tenant1_siloschema.sales;
    insert into tenant1.sales values
    (1, 1, 2434.33, '2020-11-21'),
    (2, 2, 54.90, '2020-5-5'),
    (3, 2, 9678.99, '2020-3-8');
    

  1. Create a data share with a name for the first tenant (for example, tenant1dbshare):
    /******************************************************************/
    /*   Create datashare and add database objects to the datashare   */
    /******************************************************************/
    CREATE DATASHARE tenant1_silodbshare;
    

  1. Run Alter datashare commands to add the schemas to be shared with the consumer cluster and add all tables in the schemas to be shared with the consumer cluster:
    ALTER DATASHARE tenant1_silodbshare ADD SCHEMA tenant1_siloschema;
    ALTER DATASHARE tenant1_silodbshare ADD ALL TABLES IN SCHEMA tenant1_siloschema;

  1. Grant usage on the data share for first tenant to the namespace of the BI cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster:
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    GRANT USAGE ON DATASHARE tenant1_silodbshare TO NAMESPACE ‘<consumercluster1_namespace>’
    --Account numbers are 12 digit long (optional)
    --GRANT USAGE ON DATASHARE tenant1_silodbshare TO ACCOUNT '<AWS-Account>';
    

Viewing and querying data shares for the silo model from the consumer cluster

To view and query your data shares, complete the following steps:

  1. Log in to the consumer cluster as an admin user.
  2. Create a new database from the data share of the producer cluster:
    CREATE DATABASE tenant1_silodb FROM DATASHARE tenant1_silodbshare
    OF NAMESPACE ‘<producercluster_namespace>’;

Now you can start querying the shared data from the producer cluster by using the syntax – tenant.schema.table. If you created an external schema, then you can also use the two-part notation to query the tables.

  1. Query the data with the following code:
    select * from tenant1_silodb.tenant1.customer;
    select * from tenant1_silodb.tenant1.sales;
    

  1. Optionally, you can create an external schema pointing to the schema in the database of the producer cluster. This allows you query shared tables using a two-part notation. See the following code:
    CREATE EXTERNAL SCHEMA tenant1_siloschema FROM REDSHIFT DATABASE 'tenant1_silodb' SCHEMA 'tenant1';
    
    --With this Schema, you can access using two-part notation to select from data share tables
    select * from tenant1_siloschema.customer;
    select * from tenant1_siloschema.sales;
    

  1. You can repeat the same steps for tenant2 to share the tenant2 database with tenant2 You can also control access to users in your consumer cluster by creating users and groups, and assigning access to the data share objects.

System views to view data shares

We have introduced new system tables and views to easily identify the data shares and related objects. You can use three different groups of system views to view the data share objects:

  • Views starting with SVV_DATASHARES – has detail of datashares and objects in a datashare.
View Name Purpose
SVV_DATASHARES View a list of data shares created on the cluster and data shares shared with the cluster
SVV_DATASHARE_OBJECTS View a list of objects in all data shares created on the cluster or shared with the cluster
SVV_DATASHARE_CONSUMERS View a list of consumers for data share created on the cluster
  • Views starting with SVV_REDSHIFT – contains details on both local and remote Redshift databases.
View Name Purpose
SVV_REDSHIFT_DATABASES List of all databases that a user has access to
SVV_REDSHIFT_SCHEMAS List of all schemas that user has access to
SVV_REDSHIFT_TABLES List of all tables that a user has access to
SVV_REDSHIFT_COLUMNS List of all columns that a user has access to
SVV_REDSHIFT_FUNCTIONS List of all functions that user has access to
  • Views starting with SVV_ALL– contain local and remote databases, external schemas including  spectrum and federated query, and external schema references to shared data.  If you create external schemas in consumer cluster, you need to use the SVV_ALL views to look at the objects.
View Name Purpose
SVV_ ALL _SCHEMAS Union of list of all schemas from SVV_REDSHIFT_SCHEMA view and consolidated list of all external tables and schemas that user has access to
SVV_ ALL _TABLES List of all tables that a user has access to
SVV_ ALL _COLUMNS List of all columns that a user has access to
SVV_ ALL _FUNCTIONS List of all functions that user has access to

Considerations for choosing a storage strategy

You can adopt a storage strategy or choose a hybrid approach based on business, technical, and operational requirements. Before deciding on a strategy, consider the quotas and limits for various objects in Amazon Redshift, and the number of databases per cluster or number of schemas per database to check if it meets your requirements. The following table summarizes these considerations.

  Pool Bridge Silo
Separation of tenant data Views Schema Database
ETL pipeline complexity Low Low Medium
Limits 100,000 tables (RA3 – 4x, 16x large clusters) 9,900 schemas per database 60 databases per cluster
Chargeback to consumer accounts Yes Yes Yes
Scalability High High High

Conclusion

In this post, we discussed how you can use the new data sharing feature of Amazon Redshift to implement an AaaS solution with a multi-tenant architecture while meeting SLAs for consumers using separate Amazon Redshift clusters. We demonstrated three types of models providing various levels of isolation for the tenant data. We compared and contrasted the models and provided guidance on when to choose an implementation model. We encourage you to try the data sharing feature to build your AaaS or software as a service (SaaS) solutions.


About the Authors

Rajesh Francis is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build scalable Analytic solutions.

 

 

 

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned Product Management and GTM leader, bringing over 20 years of experience in product vision, strategy and leadership roles in data products and platforms. Neeraja delivered products in analytics, databases, data Integration, application integration, AI/Machine Learning, large scale distributed systems across On-Premise and Cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica and Expedia.com.

 

Jeetesh Srivastva is a Sr. Analytics specialist solutions architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions leveraging Redshift and other AWS Analytic services. He has worked to deliver on premises and cloud based analytic solutions for customers in banking & finance and hospitality industry verticals.

Data monetization and customer experience optimization using telco data assets: Part 2

Post Syndicated from Vikas Omer original https://aws.amazon.com/blogs/big-data/part-2-data-monetization-and-customer-experience-optimization-using-telco-data-assets/

Part 1 of this series explains the importance of building and implementing a customer experience (CX) management and data monetization strategy for telecom service providers (TSPs), and the major challenges driving these initiatives. It also includes an AWS CloudFormation template to set up a demonstration of the solution using AWS services. It covers transforming and enriching multiple datasets, and offers information about data standardization, baselining an analytics data model to marry different datasets like deep packet inspection (DPI) engine embedded Packet Switch (PS) probe, CRM, subscriptions, media, carrier, device, and network configuration management in the data warehouse with AWS Glue, AWS Lambda, and Amazon Redshift.

In this post, I demonstrate how you can enable data analysts, scientists, and advanced business users to query data from Amazon Redshift or Amazon Simple Storage Service (Amazon S3) directly. I also demonstrate configuring a simple drag-and-drop interface for self-service analytics so you can prepare and publish insights based on enriched data stored in Amazon Redshift or Amazon S3 through Amazon QuickSight.

Solution overview

The following diagram illustrates the workflow of the solution.

In part 1 of this series, we discuss the overall workflow. In this post, we focus on the following steps:

  1. Catalog the processed raw, aggregate, and dimension data in the AWS Glue Data Catalog using the DPI processed data crawler.
  2. Interactively query data directly from Amazon S3 using Amazon Athena and visualize in QuickSight.
  3. Enable self-service analytics using QuickSight to prepare and publish insights based on data residing in the Amazon Redshift cluster.

Querying data using Amazon Redshift

After creating your Amazon Redshift cluster, you can immediately run queries by using the query editor on the Amazon Redshift console. Complete the following steps:

  1. On the Amazon Redshift console, in the navigation pane, choose Clusters.

A cluster with the identifier <redshift database name>-<cloudformation stack> should be present. For this example, the cluster is cemdm-telco.

  1. Choose Editor.
  2. Enter the required credentials to connect to the Amazon Redshift query editor. (Database name, Database user, and Database password are the ones you entered while creating the CloudFormation stack.)

  1. Choose Connect to database.

Upon successful authentication, you’re directed to the query editor.

  1. Run a few queries to check if data is in the tables.

In the following code, <table-name> is the Amazon Redshift table name:

select count(1) from cemdm.<table-name>;

The following query extracts the number of unique subscriber count by age group with Apple devices browsing retail domain websites or apps in or around shopping malls. You can also extract the list of subscribers and micro-segment them by consumption (total data volume) or by adding KPIs like recency and frequency.

select 
  dcd.age_range, 
  count(distinct f.customer_id)as "Unique Subs Count"
from 
  cemdm.f_daily_dpi f
inner join cemdm.d_customer_demographics dcd on f.customer_id = dcd.customer_id
inner join cemdm.d_tac dt on f.tac_code = dt.tac_sid
inner join cemdm.d_device dd on dt.device_sid = dd.device_sid
inner join cemdm.d_dpi_dictionary ddd on f.protocol_id = ddd.app_id
inner join cemdm.d_location dl on f.location_id = dl.location_id
where 
  dd.device_manufacturer = 'Apple' 
and ddd.media_category = 'Retail' 
and location_tier_4 ilike '%mall%'
group by 1 
order by 2 desc;

The following screenshot shows the output.

Unloading processed and enriched data from Amazon Redshift to Amazon S3

Amazon Redshift also includes Amazon Redshift Spectrum, which allows you to directly run SQL queries against exabytes of unstructured data in Amazon S3 data lakes. No loading or transformation is required, and you can use open data formats, including Avro, CSV, Ion, JSON, ORC, and Parquet. Amazon Redshift Spectrum automatically scales query compute capacity based on the data being retrieved, so queries against Amazon S3 run quickly, regardless of dataset size.

Amazon Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it. This is particularly helpful if you need to offload cold or historical data on Amazon Redshift to Amazon S3 in open data format. You can still access this data through Amazon Redshift via Amazon Redshift Spectrum plus any other application.

TSP data assets also include a lot of unstructured event data. This data is transient, and only valuable for a short amount of time. Therefore, you can leave it on Amazon S3 and access it from Amazon Redshift directly through Amazon Redshift Spectrum. You can use a lake house architecture approach, where hot, mostly static, and corporate data is in the warehouse, and the events data is in the data lake.

Alternatively, you can analyze data on Amazon S3 using Athena.

  1. Use the queries in the following table (in the Unload Statement column) in the Amazon Redshift query editor to unload data from Amazon Redshift to Amazon S3. For instructions, see Unloading data to Amazon S3. Provide the following information:
    • <aws-stack-name> – The name of the CloudFormation stack
    • <aws-region> – The Region in which you deployed the stack (for example, us-east-1)
    • <s3-bucket-name> – The bucket that you created while deploying the stack
    • <aws-account-id> – The AWS account ID in which you deployed the stack
    • <table-name> – The name of the Amazon Redshift table
Amazon Redshift Table Unload Statement

f_raw_dpi

f_hourly_dpi

unload ('select * from  cemdm.<table-name>') 
       to 's3://<s3-bucket-name>/dpi/processed/<table-name>/' 
       iam_role 'arn:aws:iam::<aws-account-id>:role /RedshiftBasicCustom-<aws-region>-<aws-stack-name>' 
       ALLOWOVERWRITE
       PARQUET 
       PARTITION BY (date_id, hour_id);

f_daily_dpi
unload ('select * from  cemdm.<table-name>') 
       to 's3://<s3-bucket-name>/dpi/processed/f_daily_dpi/' 
       iam_role 'arn:aws:iam::<aws-account-id>:role/RedshiftBasicCustom-<aws-region>-<aws-stack-name>' 
       ALLOWOVERWRITE
       PARQUET 
       PARTITION BY (date_id);

d_customer_demographics

d_device

d_dpi_dictionary

d_location

d_operator_plmn

d_tac

d_tariff_plan

d_tariff_plan_desc

unload ('select * from  cemdm.<table-name>') 
   to 's3://<s3-bucket-name>/dpi/processed/<table-name>/' 
       iam_role 'arn:aws:iam::<aws-account-id>:role /RedshiftBasicCustom-<aws-region>-<aws-stack-name>' 
       ALLOWOVERWRITE
       PARQUET;

Alternatively, you can copy the Amazon Redshift AWS Identity and Access Management (IAM) role ARN to unload data to Amazon S3 from the console under the cluster’s properties.

  1. Verify that the data has been unloaded to Amazon S3 under <s3-bucket-name>/dpi/processed/.
  2. On the AWS Glue console, in the navigation pane, choose Crawlers.
  3. Select DPIProcessedDataCrawler.
  4. Choose Run crawler.

  1. Wait for the crawler to show the status Stopping.

The tables added against the DPIProcessedDataCrawler crawlers should show 11.

  1. Under Databases, choose Tables.
  2. Verify the following 11 tables are created under the cemdm database:
    • processed_f_raw_dpi
    • processed_f_hourly_dpi
    • processed_f_daily_dpi
    • processed_d_customer_demographics
    • processed_d_device
    • processed_d_dpi_dictionary
    • processed_d_location
    • processed_d_operator_plmn
    • processed_d_tac
    • processed_d_tariff_plan
    • processed_d_tariff_plan_desc

Visualizing data using QuickSight

QuickSight is a business analytics service you can use to build visualizations, perform one-time analysis, and get business insights from your data. For more information, see What Is Amazon QuickSight?

To connect QuickSight to Amazon Redshift as your data source, complete the following steps:

  1. Create a private connection from Amazon QuickSight to an Amazon Redshift cluster.

These steps involve creating a new private subnet that the CloudFormation stack already created. Use the private subnet that isn’t used by Amazon Redshift cluster for your QuickSight connection.

QuickSight provides out-of-the-box integration with Amazon Redshift, making it simple to query and visualize your Redshift data. For more information, see Creating a Dataset from an Autodiscovered Amazon Redshift Cluster or Amazon RDS Instance.

  1. For Schema, choose cdmdm.
  2. For Tables, select f_daily_dpi.
  3. Choose Edit/Preview data.

  1. Add data and prepare the following table relationships in the Data Prep Use the information provided to create the relationships between different tables:
Table A Name Table A Attribute Join Type Table B Name Table B Attribute
f_daily_dpi customer_id LEFT d_tariff_plan customer_id
f_daily_dpi tac_code INNER d_tac tac_sid
f_daily_dpi sgsn_plmn_sid INNER d_operator_plmn plmn_sid
f_daily_dpi location_id LEFT d_location location_id
f_daily_dpi protocol_id INNER d_dpi_dictionary app_id
f_daily_dpi customer_id LEFT d_customer_demographics customer_id
d_tariff_plan tariff_plan_id INNER d_tariff_plan_desc tariff_plan_id
d_tac device_sid INNER d_device device_sid

You can join d_operator_plmn with sgsn_plmn_sid and home_plmn_sid, but because the sample data only contains home subscriber data, a second join of f_raw_dpi data with d_operator_plmn on home_plmn_sid and plmn_sid is not present in the given relationship of tables.

The following screenshot shows the table relationships.

  1. Name your analysis CEMDM.
  2. Choose Save & visualize.

The following screenshots demonstrate a few QuickSight analyses created from the dataset we created. For more information about creating analyses in QuickSight, see Working with Analyses. You can divide all analyses across all the available attributes. We use the use case from part 1 of this series.

The following screenshot shows visualizations of user demographics on the Demographics tab.

The following screenshot shows visualizations of user interest on the Interest Analysis tab.

The following screenshot shows visualizations of user locations on the Location tab.

The following screenshot shows visualizations of device information on the Device tab.

The following screenshot shows visualizations of subscription information on the Subscriptions tab.

The following screenshot shows visualizations of roaming users on the Roaming tab.

The following screenshot shows visualizations on the Sub Details tab. You can drill down to subscriber-level details from any dashboard across any dimension or apply global-level filters to narrow down the desired segment.

You can also build these reports using Athena as a data connector. QuickSight provides out-of-the-box integration with Athena, which lets you run SQL queries on top of the metadata in your AWS Glue Data Catalog. For more information, see Creating a Dataset Using Amazon Athena Data.

You can also use Amazon Redshift metadata as a business glossary and visualize it using QuickSight with the following custom SQL:

SELECT * FROM (
  select 
    n.nspname as "Schema",c1.relname as "Table Name", c.attname as "Column Name", 'Attribute' as "Type",
    c.attnum as "Ordinal Position",typnotnull as "Is Not Null",typdefault as "Default Value", t.typname as "Data Type",
    split_part(d.description,'|',1) as "Category", 
    split_part(d.description,'|',2) as "Source",
    split_part(d.description,'|',3) as "Transient/Derived",
    split_part(d.description,'|',4) as "Is PII",
    split_part(d.description,'|',5) as "Is Business Sensitive",
    split_part(d.description,'|',6) as "Description"  
  from pg_catalog.pg_attribute c
  inner join pg_class c1 on c.attrelid=c1.oid
  inner JOIN pg_type t on t.oid=c.atttypid
  inner join pg_catalog.pg_namespace n on c1.relnamespace=n.oid
  inner join pg_catalog.pg_description d on d.objoid=c1.oid AND c.attnum = d.objsubid
  where n.nspname='cemdm' and c.attnum > 0
  UNION ALL
  select 
    pn.nspname as "Schema",pc.relname "Table Name",null as "Column Name", 'Table' as "Type", 
    null as "Ordinal Position",null as "Is Not Null",null as "Default Value",null as "Data Type",
    split_part(pd.description,'|',1) as "Category", 
    split_part(pd.description,'|',2) as "Source",
    split_part(pd.description,'|',3) as "Transient/Derived",
    split_part(pd.description,'|',4) as "Is PII",
    split_part(pd.description,'|',5) as "Is Business Sensitive",
    split_part(pd.description,'|',6) as "Description"
  from pg_catalog.pg_description pd 
  inner join pg_class pc on pd.objoid = pc.oid
  inner join pg_catalog.pg_namespace pn on pc.relnamespace = pn.oid
  where pn.nspname = 'cemdm' and pd.objsubid = 0
) x
order by "Table Name", nvl("Ordinal Position",0);

The following screenshot shows a sample visualization which you can build on QuickSight.

For more information about running custom Amazon Redshift SQL using Amazon QuickSight, see Using the Query Editor.

QuickSight allows creating template from existing analysis. You can use the resulting template to create a dashboard. For more information, see Evolve your analytics with Amazon QuickSight’s new APIs and theming capabilitiesYou can also embed QuickSight dashboards into your own apps, websites, and wikis without the need to provision and manage users (readers) in QuickSight. For more information, see New in Amazon QuickSight – session capacity pricing for large scale deployments, embedding in public websites, and developer portal for embedded analytics.”

Cleaning up

To avoid incurring future charges, delete the resources you created. Manually delete anything created outside of the CloudFormation stack and then the stack itself.

Conclusion

In this post, I demonstrated how data analysts, data scientists, and advanced business users can easily query multiple data sources and generate actionable insights including user interest profiles, segments, and micro-segments. Downstream systems like campaign management systems, customer care portals, and customer-facing applications; internal teams like retention, marketing, CX, and network; and workloads like machine learning can greatly benefit from the insights generated from this solution. You can automate these insights and integrate them with northbound systems, and trigger them based on a schedule or an event.

I also demonstrated how business users are empowered with self-service analytics to help them perform data exploration and publish ready-made insights in the form of dashboards. You can also create stories to drive data-heavy conversations based on enriched data stored in Amazon Redshift or Amazon S3.

Perceiving customer behavior across multiple touchpoints is the key for any business to thrive. And the essence of this solution is to capitalize on data and drive CX and monetization initiatives holistically across your organization. This framework allows you to accelerate your journey towards improving CX and generating new revenue streams by using existing data assets.

You can progressively augment this solution by adding additional data sources to evolve into a customer data platform hosting 360° profiles of individual subscribers correlated from multiple data sources. This solution can further support new and existing marketing, partnerships, loyalty, retention, network planning, and network optimization initiatives to drive revenue growth and improve profitability while keeping subscribers happy and loyal. It also helps you define an organization-wide standard for data visualization, self-service analytics, metadata discovery, and data marketplace.

For more ways to expand this solution, consider the following services:

  • AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. You can merge it with in-house data assets to span existing insights across multiple domains.
  • Amazon Pinpoint is a flexible and scalable outbound and inbound marketing communications service. You can connect with customers over channels like email, SMS, push, or voice. You can segment and micro-segment your campaign audience for the right customer and personalize your messages with the right content.

As always, AWS welcomes feedback. This is a wide-open space to explore, so reach out to us if you want to dive deep into understanding how you can build this solution and more on AWS. Please submit comments or questions in the comments section.


About the Author

Vikas Omer is an analytics specialist solutions architect at Amazon Web Services. Vikas has a strong background in analytics, customer experience management (CEM) and data monetization, with over 11 years of experience in the telecommunications industry globally. With six AWS Certifications, including Analytics Specialty, he is a trusted analytics advocate to AWS customers and partners. He loves traveling, meeting customers, and helping them become successful in what they do.

Building a cost efficient, petabyte-scale lake house with Amazon S3 lifecycle rules and Amazon Redshift Spectrum: Part 2

Post Syndicated from Cristian Gavazzeni original https://aws.amazon.com/blogs/big-data/part-2-building-a-cost-efficient-petabyte-scale-lake-house-with-amazon-s3-lifecycle-rules-and-amazon-redshift-spectrum/

In part 1 of this series, we demonstrated building an end-to-end data lifecycle management system integrated with a data lake house implemented on Amazon Simple Storage Service (Amazon S3) with Amazon Redshift and Amazon Redshift Spectrum. In this post, we address the ongoing operation of the solution we built.

Data ageing process after a month (ongoing)

Let’s assume a month has elapsed since walking through the use case in the last post, and old historical data was classified and tiered accordingly to policy. You now need to enter the new monthly data generated into the lifecycle pipeline as follows:

  • June 2020 data – Produced and consolidated into Amazon Redshift local tables
  • December 2019 data – Migrated to Amazon S3
  • June 2019 data – Migrated from Amazon S3 to S3-IA
  • March 2019 data – Migrated from S3-IA to Glacier

The first step required is to increase the ageing counter of all the Parquet files (both midterm and shortterm prefixes), using aws s3api get-object-tagging to check the current value and increasing by 1 with aws s3api set-object-tagging. This can be cumbersome if you have many objects, but you can automate it with Amazon S3 CLI scripts or SDKs like Boto3 (Phyton).

The following is a simple Python script you can use to check the current tag settings for all keys in the prefix extract_shortterm:

from boto3 import client 
import re 
conn = client('s3') 
def printtags(mybucket, myprefix): 
    for key in conn.list_objects(Bucket = mybucket, Prefix = myprefix)['Contents']: 
        if key['Key'].endswith('.parquet'): 
            tagset = conn.get_object_tagging(Bucket=mybucket, Key=key['Key'])['TagSet'] 
            stringa = str(tagset) 
            stringtag = (re.findall("\d+", stringa)) 
            tagvalue = int(stringtag[0]) 
            print((key['Key']), "ageing = ", tagvalue) 
return 
#below set parameters bucket and prefix accordingly with your env 
printtags('rs-lakehouse-blog-post', 'extract_shortterm/')

This second Python script lists all current tag settings for all keys in the prefix extract_shortterm, increases by 1 ageing, and lists the keys and new tag values. If other tags were added to these objects prior to this step, this new tag overwrites the entire tagSet. The set object tagging operation is not an append, but a completely new PUT.

from boto3 import client 
import re
conn = client('s3') 
def updateTags(mybucket, myprefix): 
    for key in conn.list_objects(Bucket = mybucket, Prefix = myprefix)['Contents']: 
        if key['Key'].endswith('.parquet'): 
            tagset = conn.get_object_tagging(Bucket = mybucket, Key=key['Key'])['TagSet'] 
            stringa = str(tagset) 
            stringtag = (re.findall("\d+", stringa)) 
            tagvalue = int(stringtag[0]) 
            print((key['Key']), "Current ageing = ", tagvalue) 
            tagvalue = tagvalue+1 
            put_tags_response = conn.put_object_tagging(Bucket=mybucket, Key = key['Key'], Tagging = {'TagSet': [ { 'Key': 'ageing', 'Value': str(tagvalue) }, ] } ) 
return 
printtags('rs-lakehouse-blog-post', 'extract_shortterm/')

To run the pipeline described before, you need to perform the following:

  1. Unload the December 2019 data.
  2. Apply the tag ageing to 6.
  3. Add the new Parquet file as a new partition to the external table taxispectrum.taxi_archive.

For the relevant syntax, see [part 1]. You can automate these tasks using an AWS Lambda function and use a monthly schedule.

Check the results with a query to the external table and don’t forget to remove unloaded items from the Amazon Redshift local table, as you did in part 1 of this series.

In this use case, we know exactly the mapping of June 2019 to the Amazon S3 key name because we used a specific naming convention. If your use case is different, you can use the two pseudo-columns automatically created in every Amazon Redshift external table: $path and $size. See the following code:

select pickup, 
    "$path", 
    "$size" 
from taxispectrum.taxi_archive 
where pickup between '2019-06-01 00:00:00' and '2019-06-30 23:59:59' 
limit 10
;

 The following screenshot shows our results.

The following screenshot shows our results. 

We’re migrating the March 2019 Parquet file to Glacier, so you should remove the related partition from the AWS Glue Data Catalog:

ALTER TABLE taxispectrum.taxi_archive
DROP PARTITION (yearmonth=‘2019-03’)
;

Right to be forgotten

One of the pillar rules of GDPR is the “right to be forgotten” rule—the ability for a customer or employee to request deletion of any personal data.

Implementing this feature for external tables on Amazon Redshift requires a different approach than for local tables, because external tables don’t support delete and update operations.

You can implement two different approaches.

In and out

In this first approach, you copy the external table to a temporary Amazon Redshift table, delete the specific rows, unload the temporary table to Amazon S3 and overwrite the key name, and drop the temporary (internal table).

Let’s assume that the drivers in the dataset are identified with column pulocid. We want to delete all records related to a driver identified with pulocid 129 and who worked between October 2019 and November 2019.

  1. With the following code (from Amazon Redshift), you can identify a every single row matched with specific single Parquet file:
    select pickup,
    	dropoff,
    	pulocid,
    	“$path”
    from taxispectrum.taxi_shortterm
    where pulocid = 129 and pickup between ‘2019-10-01 00:00:00’ and ‘2019-11-30 23:59:59’
    ;

The following screenshot shows our results.

 The following screenshot shows our results.

  1. When checking the applied tags, note the value associated to the ageing tag, or save the output of the following command in a temporary JSON file:
    aws s3api get-object-tagging \
    --bucket rs-lakehouse-blog-post \
    --key extract_shortterm/green_tripdata_2019-10000.parquet > \
    /tmp/oldtag.json

  2. Create two temporary tables, one for each of the two Parquet files matching the query (October and November):
    create table temporaryoct (like greentaxi);

  3. Copy the Parquet file to the local table, using the format as parquet attribute:
    copy temporaryoct
    from ‘s3://rs-lakehouse-blog-post/extract_shortterm/green_tripdata_2019-10000.parquet’
    iam_role ‘arn:aws:iam::123456789012:role/BlogSpectrumRole’
    format as parquet
    ;

  4. Delete the records matching the “right to be forgotten” request criteria:
    delete from temporaryoct 
    where pulocid = 129 and pickup between '2019-10-01 00:00:00' and '2019-11-30 23:59:59'
    ;

  5. Overwrite the Parquet file with the UNLOAD command (note the allowoverwrite option):
    unload ('select * from temporaryoct') 
    to 's3://rs-lakehouse-blog-post/extract_shortterm/green_tripdata_2019-10000.parquet’ 
    iam_role 'arn:aws:iam::123456789012:role/BlogSpectrumRole' 
    parquet parallel off allowoverwrite
    ;

  6. Drop the temporary table:
    drop table temporaryoct;

In more complex use cases, user data might span multiple months, and our approach might not be effective. In these cases, using Spark to process and rewrite the Parquet could be a better and faster solution.

In other use cases, the number of records to be deleted could be a majority. If so, as an alternative to the delete and unload steps, you could use CREATE EXTERNAL TABLE AS (CTAS). CTAS creates an external table based on the column definition from a query and writes the results of that query on Amazon S3.

Edit your own

The second option is to use an external editor to access the Amazon S3 file and remove specific records. You could use a Spark script with the following steps:

  1. Create a DataFrame.
  2. Import a Parquet file in memory.
  3. Remove records matching your criteria.
  4. Overwrite the same Amazon S3 key with the new data.

Building a simple data ageing dashboard

Sometimes data temperature is very predictable and based on ageing, but in some cases, especially when data is originated and accessed from different entities, it’s not easy to build a model to fit the best storage transition strategy. For these scenarios, you can use Amazon S3 analytics storage class analysis and Amazon S3 access logs.

Storage class analytis observes the infrequent access patterns of a filtered set of data over a period of time. You can use the analysis results to help you improve your lifecycle policies. You can configure storage class analysis to analyze all the objects in a bucket rr, and configure filters to group objects together for analysis by a common prefix (objects that have names that begin with a common string), by object tags, or by both prefix and tags. Filtering by object groups is the best way to benefit from storage class analysis.

To achieve a better understanding of how data is accessed (and who accessed it, and when) and build a custom tiering strategy, you can use Amazon S3 access logs. This feature doesn’t have any additional costs, but log retention incurs Amazon S3 storage costs. You first define a recipient to store the logs.

  1. Create a new bucket named rs-lakehouse-blog-post-logs.
  2. To set up S3 Server access logging on the source bucket rs-lakehouse-blog-post on the Amazon S3 console, on the Properties tab, choose Server access logging.
  3. For Target bucket, enter rs-lakehouse-blog-post-logs.
  4. For Target prefix, leave blank.
  5. Choose Save.

You first define a recipient to store the logs.

Let’s assume that after few days of activities, you want to discover how users and applications accessed the data.

  1. On the Amazon Redshift console, create an external table to map the Amazon S3 access logs:
    CREATE EXTERNAL TABLE taxispectrum.s3accesslogs(
        BucketOwner                   varchar(256), 
        Bucket                        varchar(256), 
        RequestDateTime               varchar(256), 
        RemoteIP                      varchar(256), 
        Requester                     varchar(256), 
        RequestID                     varchar(256), 
        Operation                     varchar(256), 
        Key                           varchar(256), 
        RequestURI_operation          varchar(256),
        RequestURI_key                varchar(256),
        RequestURI_httpProtoversion   varchar(256),
        HTTPstatus                    varchar(256), 
        ErrorCode                     varchar(256), 
        BytesSent                     varchar(256), 
        ObjectSize                    varchar(256), 
        TotalTime                     varchar(256), 
        TurnAroundTime                varchar(256), 
        Referrer                      varchar(256), 
        UserAgent                     varchar(256), 
        VersionId                     varchar(256)
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*) ([^ ]*) ([^ ]*)\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)'
    )
    STORED AS TEXTFILE
    LOCATION 's3://rs-lakehouse-blog-post-logs/'
    ;

  2. Check the AWS Identity and Access Management (IAM) policy S3-Lakehouse-Policy created in part 1 and add the following two lines to the JSON definition file:
    "arn:aws:s3:::rs-lakehouse-blog-post-logs",
    "arn:aws:s3:::rs-lakehouse-blog-post-logs/*"

  3. Query a few columns:
    select bucket, 
        key, 
        requestdatetime, 
        remoteip, 
        operation, 
        useragent 
    from taxispectrum.s3accesslogs
    ;

The following screenshot shows our results.

The following screenshot shows our results.

This report is a starting point for both auditing purposes and for analyzing access patterns. As a next step, you could use a business intelligence (BI) visualization solution like Amazon QuickSight and create a dataset in order to create a dashboard showing the most and least accessed files.

Cleaning up

To clean up your resources after walking through this post, complete the following steps:

  1. Delete the Amazon Redshift cluster without the final cluster snapshot:
    aws redshift delete-cluster –-cluster-identifier redshift-cluster-1

  2. Delete the schema and table defined in AWS Glue:
    aws glue delete-table \
        –-database-name blogdb \
        –-name taxi_archive 
        
    aws glue delete-table \
        –-database-name blogdb \
        –-name s3accesslogs 
        
    aws glue delete-database –-name blogdb

  3. Delete the S3 buckets and all their content:
    aws s3 rb s3://rs-lakehouse-blog-post –-force
    aws s3 rb s3://rs-lakehouse-blog-post-logs –-force 

Conclusion

In the first post in this series, we demonstrated how to implement a data lifecycle system for a lake house using Amazon Redshift, Redshift Spectrum, and Amazon S3 lifecycle rules. In this post, we focused on how to operationalize the solution with automation scripts (with the AWS Boto3 library for Python) and S3 Server access logs.


About the Authors

Cristian Gavazzeni is a senior solution architect at Amazon Web Services. He has more than 20 years of experience as a pre-sales consultant focusing on Data Management, Infrastructure and Security. During his spare time he likes eating Japanese food and travelling abroad with only fly and drive bookings.

 

 

Francesco MarelliFrancesco Marelli is a senior solutions architect at Amazon Web Services. He has lived and worked in London for 10 years, after that he has worked in Italy, Switzerland and other countries in EMEA. He is specialized in the design and implementation of Analytics, Data Management and Big Data systems, mainly for Enterprise and FSI customers. Francesco also has a strong experience in systems integration and design and implementation of web applications. He loves sharing his professional knowledge, collecting vinyl records and playing bass.

Amazon Redshift 2020 year in review

Post Syndicated from Corina Radovanovich original https://aws.amazon.com/blogs/big-data/amazon-redshift-2020-year-in-review/

Today, more data is created every hour than in an entire year just 20 years ago. Successful organizations are leveraging this data to deliver better service to their customers, improve their products, and run an efficient and effective business. As the importance of data and analytics continues to grow, the Amazon Redshift cloud data warehouse service is evolving to meet the needs of our customers. Amazon Redshift was the first data warehouse built for the cloud in 2012, and we’ve constantly listened to our customers to deliver on our promise of a fast, scalable, and easy-to-use service that makes it possible to deliver insight across an organization with a single and consistent view of your data—even with the enormous growth in data we’re experiencing. That’s why tens of thousands of customers like Nasdaq, Dollar Shave Club, and VOO are betting on Amazon Redshift to gain the insight they need. You can hear from customers about how they’re using Amazon Redshift in the AWS re:Invent 2020 sessions How Twilio scaled business intelligence with a data lake powered by AWS and How Vyaire uses AWS analytics to scale ventilator product and save lives.

In 2020, we continued to innovate at a fast clip, releasing dozens of new features and capabilities to make it easier to analyze all your data with a Lake House Architecture, get fast performance at any scale, and lower your costs with predictable pricing. Some of these new features were major efforts that required extensive development and engineering work, and others were relatively minor, but when considered as an aggregate make a big difference to our customers’ ability to do things like migrate to Amazon Redshift from legacy on-premises data warehouses, or support new use cases.

Lake House and AWS integrated

At AWS, we believe in adopting a Lake House Architecture so you can easily integrate your data warehouse with the data lake on Amazon Simple Storage Service (Amazon S3), purpose-built data stores, and with other analytics services without moving and transforming your data explicitly. For more information about this approach, see the post Harness the power of your data with AWS Analytics by Rahul Pathak, and watch his AWS re:Invent 2020 analytics leadership session.

The following image shows how Amazon Redshift integrates with the data lake and other services.

The following image shows how Amazon Redshift integrates with the data lake and other services.

Since we released the Amazon Redshift Spectrum feature a couple years ago, customers have been querying exabytes of data directly in the lake in Apache Parquet, an open file format. With data lake export released in 2019, you can save the results of an Amazon Redshift query back into the lake. This means you can take advantage of (or be ready to evolve to) real-time analytics and machine learning (ML) and AI use cases without re-architecture, because Amazon Redshift is fully integrated with your data lake. In 2020, we also released new capabilities like Amazon Redshift data sharing (preview), so you can easily share data across Amazon Redshift clusters (both internally and externally) so every user has a live and consistent view of data.

Customers like Warner Bros. Interactive Entertainment, Yelp, Fannie Mae, and many more are benefitting from data sharing. Steven Moy from Yelp shared, “The data sharing feature seamlessly allows multiple Redshift clusters to query data located in our RA3 clusters and their managed storage. This eliminates our concerns with delays in making data available for our teams, and reduces the amount of data duplication and associated backfill headache. We now can concentrate even more of our time making use of our data in Redshift and enable better collaboration instead of data orchestration.”

With Amazon Redshift Federated Query, you can combine operational data that is stored in popular databases such as Amazon Relational Database Service (Amazon RDS) and Amazon Aurora PostgreSQL. We also offer Amazon RDS for MySQL and Amazon Aurora MySQL support in preview. For more information, see Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL.

We also launched a native integration with Amazon SageMaker, Amazon Redshift ML (preview), to make it easy to do more with your data with predictive analytics. Now you can create, train, and deploy ML models with SQL on your Amazon Redshift data without relying on an ML expert or learning new tools and languages.

Customers and partners like Datacoral, ZS Associates, Rackspace, and Slalom are benefiting from Amazon Redshift ML. Raghu Murthy from Datacoral shared, “We are really excited about the new Amazon Redshift ML feature. Typically, our mutual customers need to extract data from Amazon Redshift to perform inference for ML. Now that this can be done natively within Amazon Redshift, we see the potential for a huge performance and productivity improvement. We look forward to helping more customers use ML on the data in their Amazon Redshift data warehouse, and to speeding up the inference pipelines our customers are already using ML with this new capability.”

In addition to querying semi-structured data using Amazon Redshift Spectrum in the lake, in 2020 we launched native support for semi-structured data processing with the SUPER data type (preview). This new data type, SUPER, supports nested data formats such as JSON and enables you to ingest, store, and query nested data natively in Amazon Redshift. SUPER data can be queried using PartiQL, a SQL extension used for easily querying both semi-structured and structured data.

Other features we released in 2020 that support the Lake House Architecture and AWS integrations include AWS Lambda UDF, partner console integration (preview), AWS Glue Elastic Views (preview), support for writing to external tables in Amazon S3, the ability to query open-source Apache Hudi and Delta Lake, and much more.

Learn more about the Lake House Architecture in the AWS re:Invent 2020 session The lake house approach to data warehousing with Amazon Redshift, and dive deep into the new data sharing features in the sessions New use cases for Amazon Redshift and Introducing Amazon Redshift ML.

Performance at scale

Amazon Redshift has always been built for fast performance at scale—we know this is important to our customers because you want a data warehouse you can trust to deliver results quickly across all your data. With Amazon Redshift, you get up to 3x better price performance than other cloud data warehouses, and we recently published our benchmark results so you can learn more and even replicate the tests. The benchmarking test was performed with a single cluster, and for customers that have high concurrency workloads, we offer concurrency scaling to scale out your read workloads.

We know you count on Amazon Redshift to deliver consistently fast results from gigabytes to petabytes of data, and from a few users to thousands. As your users scale, the concurrency scaling capability of Amazon Redshift automatically deploys the necessary compute resources to manage the additional load. And, because we know your workloads are growing fast, we’re building Amazon Redshift for the new scale of data with features like AQUA (Advanced Query Accelerator), a new hardware accelerated cache that boosts queries up to 10x faster than other cloud data warehouses. AQUA is available in preview on RA3 4xl and 16xl nodes in select Regions, and will be generally available in January, 2021.

In 2020, we also invested a lot in making it easier to get the best performance by releasing new capabilities for Amazon Redshift to be a self-tuning and self-learning system. This allows you to get the best performance for your workloads without the undifferentiated heavy lifting of tuning your data warehouse with tasks such as defining sort keys, and distribution keys and new capabilities like materialized views, and automatic refresh and query rewrite of materialized views.

Based on internal benchmarking, optimizations made by the automatic table optimization feature have been shown to increase cluster performance by 24% and 34% using the 3 TB and 30 TB TPC-DS benchmarks, respectively, versus a cluster without automatic table optimization. When professional services firm ZS Associates started using automatic table optimizations, Nishesh Aggarwal shared, “When we tested ATO in our development environment, the performance of our queries was 25% faster than our production workload not using ATO, without requiring any additional effort by our administrators.”

Other features delivered in 2020 that support performance at scale include query compilation improvements, 100k table support, HyperLogLog, and much more.

Find out more about how Amazon.com uses Amazon Redshift to perform analytics at scale in the following AWS re:Invent 2020 session, dive deep into the new features with the session Getting the most out of Amazon Redshift automation, and learn more about AQUA with AQUA for Amazon Redshift (Preview).

Best value

We focus on our customers and innovate to ensure Amazon Redshift provides great value, whether you’re starting small at $0.25 per hour or committing with Reserved Instances that allow you save up to 75% compared to on-demand prices when you commit to a 1- or 3-year term. In 2020, we heard from many new and existing customers about the value and performance gains they experienced from the new generation instance type, RA3 with managed storage. By scaling and paying for storage and compute separately, you get the optimal amount of storage and compute for diverse workloads. RA3 allows you to choose the size of your Amazon Redshift cluster based on your performance requirements, and Amazon Redshift managed storage automatically scales your data warehouse storage capacity without you having to add and pay for additional compute instances. In early 2020, we released RA3.4xl, and more recently completed the family with the new and smallest instance size, RA3.xlplus.

Unlike other cloud DWs where you need premium versions for additional enterprise capabilities, Amazon Redshift pricing includes built-in security-like encryption, audit logs, and compliance, and launches within your virtual private cloud (VPC), as well as data compression and data transfer. Amazon Redshift also provides predictability in month-to-month cost even when you have unpredictable or highly concurrent workloads. Each Amazon Redshift cluster earns up to an hour of free concurrency scaling credits per day, which can be used to offset the cost of the transient clusters that are automatically added to handle high concurrency. Additionally, in 2020 we released new cost control features for Amazon Redshift Spectrum and concurrency scaling.

The automatic workload manager (WLM) was updated in 2020 to make it even more effective to help you run a complex mix of applications. A successful workload management scheme ensures SLAs for high-priority workloads, ensures highly efficient resource utilization, and maximizes return on investment (ROI). One approach to solve this problem is to simply add more resources, but this approach is problematic because it leads to unpredictable spend and high invoices. WLM in Amazon Redshift helps you maximize query throughput and get consistent performance for the most demanding analytics workloads, all while optimizing the resources that you’re already paying for. For example, with query priorities, you can now ensure that higher-priority workloads get preferential treatment in Amazon Redshift, including more resources during busy times for consistent query performance. Query monitoring rules provides ways to manage unexpected situations like detecting and preventing runaway or expensive queries from consuming system resources.

We also improved automatic WLM in several ways. It now uses ML to predict the amount of resources a query needs, allowing us to improve overall throughput. In addition, WLM now scales concurrency dynamically, and we enhanced SQA (short query acceleration) with what we call “turbo boost mode,” a feature that is automatically activated when queue buildup is detected and waiting queries don’t require a lot of resources. This allows for more consistent query performance for all queries regardless of priority, as well as more efficient utilization of resources overall.

Many of our customers have started using the Data API released in 2020 to build web services-based applications and to integrate with services like AWS LambdaAWS AppSync, and AWS Cloud9. The Data API simplifies data access, ingest, and egress from languages supported with AWS SDK such as Python, Go, Java, Node.js, PHP, Ruby, and C++, so you can focus on building applications versus managing infrastructure.

Other features delivered in 2020 that make sure you get the best value out of Amazon Redshift include cross-AZ cluster recovery, open-source JDBC and Python drivers, spatial data processing enhancements, TIME and TIMETZ data types, scheduling of SQL queries, pause and resume, and much more.

Summary

For an overview of the new features, check out the AWS re:Invent 2020 session What’s new with Amazon Redshift and go deeper with the deep dive on best practices for Amazon Redshift. If you’re still evaluating whether a move to the cloud makes sense, learn more about migrating a legacy data warehouse to Amazon Redshift.

Thanks for all your feedback over the years and cheers to the insights you’ll be gaining from your AWS analytics solutions in 2021.


About the Authors

Corina Radovanovich leads product marketing for cloud data warehousing at AWS. She’s worked in marketing and communications for the biggest tech companies worldwide and specializes in cloud data services.

 

 

Eugene Kawamoto is a director of product management for Amazon Redshift. Eugene leads the product management and database engineering teams at AWS. He has been with AWS for ~8 years supporting analytics and database services both in Seattle and in Tokyo. In his spare time, he likes running trails in Seattle, loves finding new temples and shrines in Kyoto, and enjoys exploring his travel bucket list.

Building a cost efficient, petabyte-scale lake house with Amazon S3 lifecycle rules and Amazon Redshift Spectrum: Part 1

Post Syndicated from Cristian Gavazzeni original https://aws.amazon.com/blogs/big-data/part-1-building-a-cost-efficient-petabyte-scale-lake-house-with-amazon-s3-lifecycle-rules-and-amazon-redshift-spectrum/

The continuous growth of data volumes combined with requirements to implement long-term retention (typically due to specific industry regulations) puts pressure on the storage costs of data warehouse solutions, even for cloud native data warehouse services such as Amazon Redshift. The introduction of the new Amazon Redshift RA3 node types helped in decoupling compute from storage growth. Integration points provided by Amazon Redshift Spectrum, Amazon Simple Storage Service (Amazon S3) storage classes, and other Amazon S3 features allow for compliance of retention policies while keeping costs under control.

An enterprise customer in Italy asked the AWS team to recommend best practices on how to implement a data journey solution for sales data; the objective of part 1 of this series is to provide step-by-step instructions and best practices on how to build an end-to-end data lifecycle management system integrated with a data lake house implemented on Amazon S3 with Amazon Redshift. In part 2, we show some additional best practices to operate the solution: implementing a sustainable monthly ageing process, using Amazon Redshift local tables to troubleshoot common issues, and using Amazon S3 access logs to analyze data access patterns.

Amazon Redshift and Redshift Spectrum

At re:Invent 2019, AWS announced new Amazon Redshift RA3 nodes. Even though this introduced new levels of cost efficiency in the cloud data warehouse, we faced customer cases where the data volume to be kept is an order of magnitude higher due to specific regulations that impose historical data to be kept for up to 10–12 years or more. In addition, this historical cold data must be accessed by other services and applications external to Amazon Redshift (such as Amazon SageMaker for AI and machine learning (ML) training jobs), and occasionally it needs to be queried jointly with Amazon Redshift hot data. In these situations, Redshift Spectrum is a great fit because, among other factors, you can use it in conjunction with Amazon S3 storage classes to further improve TCO.

Redshift Spectrum allows you to query data that resides in S3 buckets using already in place application code and logic used for data warehouse tables, and potentially performing joins and unions of Amazon Redshift local tables and data on Amazon S3.

Redshift Spectrum uses a fleet of compute nodes managed by AWS that increases system scalability. To use it, we need to define at least an external schema and an external table (unless an external schema and external database are already defined in the AWS Glue Data Catalog). Data definition language (DDL) statements used to define an external table include a location attribute to address S3 buckets and prefixes containing the dataset, which could be in common file formats like ORC, Parquet, AVRO, CSV, JSON, or plain text. Compressed and columnar file formats like Apache Parquet are preferred because they provide less storage usage and better performance.

For a data catalog, we could use AWS Glue or an external hive metastore. For this post, we use AWS Glue.

S3 Lifecycle rules

Amazon S3 storage classes include S3 Standard, S3-IA, S3 One-Zone, S3 Intelligent-Tiering, S3 Glacier, and S3 Glacier Deep Archive. For our use case, we need to keep data accessible for queries for 5 years and with high durability, so we consider only S3 Standard and S3-IA for this time frame, and S3 Glacier only for long term (5–12 years). Data access to S3 Glacier requires data retrieval in the range of minutes (if using expedited retrieval) and this can’t be matched with the ability to query data. We can adopt Glacier for very cold data if you implement a manual process to first restore the Glacier archive to a temporary S3 bucket, and then query this data defined via an external table.

S3 Glacier Select allows you to query on data directly in S3 Glacier, but it only supports uncompressed CSV files. Because the objective of this post is to propose a cost-efficient solution, we didn’t consider it. If for any reason you have constraints for storing in CSV file format (instead of compressed formats like Parquet), Glacier Select might also be a good fit.

Excluding retrieval costs, the cost for storage for S3-IA is typically around 45% cheaper than S3 Standard, and S3 Glacier is 68% cheaper than S3-IA. For updated pricing information, see Amazon S3 pricing.

We don’t use S3 Intelligent Tiering because it bases storage transition on the last access time, and this resets every time we need to query the data. We use the S3 Lifecycle rules that are based either on creation time or prefix or tag matching, which is consistent regardless of data access patterns.

Simulated use case and retention policy

For our use case, we need to implement the data retention strategy for trip records outlined in the following table.

Corporate Rule Dataset Start Dataset end Data Storage Engine
Last 6 months in Redshift Spectrum December 2019 May 2020 Amazon Redshift local tables Amazon Redshift
Months 6–11 in Amazon S3 June 2019 November 2019 S3 Standard Redshift Spectrum
Months 12–14 in S3-IA March 2019 May 2019 S3-IA Redshift Spectrum
After month 15 January 2019 February 2019 Glacier N/A

For this post, we create a new table in a new Amazon Redshift cluster and load a public dataset. We use the New York City Taxi and Limousine Commission (TLC) Trip Record Data because it provides the required historical depth.

We use the Green Taxi Trip Records, based on monthly CSV files containing 20 columns with fields like vendor ID, pickup time, drop-off time, fare, and other information. 

Preparing the dataset and setting up the Amazon Redshift environment

As a first step, we create an AWS Identity and Access Management (IAM) role for Redshift Spectrum. This is required to allow access to Amazon Redshift to Amazon S3 for querying and loading data, and also to allow access to the AWS Glue Data Catalog whenever we create, modify, or delete a new external table.

  1. Create a role named BlogSpectrumRole.
  2. Edit the following two JSON files, which have the IAM policies according to the bucket and prefix used in this post, as needed, and attach the policies to the role you created:
    S3-Lakehouse-Policy.JSON

    	{
    	    "Version": "2012-10-17",
    	    "Statement": [
    	        {
    	            "Sid": "VisualEditor0",
    	            "Effect": "Allow",
    	            "Action": "s3:*",
    	            "Resource": [
    	                "arn:aws:s3:::rs-lakehouse-blog-post",
    	                "arn:aws:s3:::rs-lakehouse-blog-post/extract_longterm/*",
    	                "arn:aws:s3:::rs-lakehouse-blog-post/extract_midterm/*",
    	                "arn:aws:s3:::rs-lakehouse-blog-post/extract_shortterm/*",
    	    "arn:aws:s3:::rs-lakehouse-blog-post/accesslogs/*"
    	            ]
    	        }
    	    ]
    	}

    Glue-Lakehouse-Policy.JSON

    {
    	"Version": "2012-10-17",
    	"Statement": [
    		{
    			"Sid": "VisualEditor0",
    			"Effect": "Allow",
    			"Action": [
    				"glue:CreateDatabase",
    				"glue:DeleteDatabase",
    				"glue:GetDatabase",
    				"glue:GetDatabases",
    				"glue:UpdateDatabase",
    				"glue:CreateTable",
    				"glue:DeleteTable",
    				"glue:BatchDeleteTable",
    				"glue:UpdateTable",
    				"glue:GetTable",
    				"glue:GetTables",
    				"glue:BatchCreatePartition",
    				"glue:CreatePartition",
    				"glue:DeletePartition",
    				"glue:BatchDeletePartition",
    				"glue:UpdatePartition",
    				"glue:GetPartition",
    				"glue:GetPartitions",
    				"glue:BatchGetPartition"
    				],
    			"Resource": [
    				"*"
    			]
    		}
    	]
    }

Now you create a single-node Amazon Redshift cluster based on a DC2.large instance type, attaching the newly created IAM role BlogSpectrumRole.

  1. On the Amazon Redshift console, choose Create cluster.
  2. Keep the default cluster identifier redshift-cluster-1.
  3. Choose the node type DC2.large.
  4. Set the configuration to single node.
  5. Keep the default dbname and ports.
  6. Set a primary user password.
  7. Choose Create cluster.
  8. After the cluster is configured, check the attached IAM role on the Properties tab for the cluster.
  9. Take note of the IAM role ARN, because you use it to create external tables.

Copying data into Amazon Redshift

To connect to Amazon Redshift, you can use a free client like SQL Workbench/J or use the AWS console embedded query editor with the previously created credentials.

  1. Create a table according to the dataset schema using the following DDL statement:
    	create table greentaxi(
    	vendor_id integer,
    	pickup timestamp,
    	dropoff timestamp,
    	storeandfwd char(2),
    	ratecodeid integer,
    	pulocid integer,
    	dolocid integer,
    	passenger_count integer,
    	trip_dist real,
    	fare_amount real,
    	extra real,
    	mta_tax real,
    	tip_amount real,
    	toll_amount real,
    	ehail_fee real,
    	improve_surch real,
    	total_amount real,
    	pay_type integer,
    	trip_type integer,
    	congest_surch real
    );

The most efficient method to load data into Amazon Redshift is using the COPY command, because it uses the distributed architecture (each slice can ingest one file at the same time).

  1. Load year 2020 data from January to June in the Amazon Redshift table with the following command (replace the IAM role value with the one you created earlier):
    copy greentaxi from ‘s3://nyc-tlc/trip data/green_tripdata_2020’ 
    	iam_role ‘arn:aws:iam::123456789012:role/BlogSpectrumRole’ 
    	delimiter ‘,’ 
    	CSV 
    	dateformat ‘auto’
    	region ‘us-east-1’ 
    ignoreheader as 1;

Now the greentaxi table includes all records starting from January 2019 to June 2020. You’re now ready to leverage Redshift Spectrum and S3 storage classes to save costs.

Extracting data from Amazon Redshift

You perform the next steps using the AWS Command Line Interface (AWS CLI). For download and installation instructions, see Installing, updating, and uninstalling the AWS CLI version 2.

Use the AWS CONFIGURE command to set the access key and secret access key of your IAM user and your selected AWS Region (same as your S3 buckets) of your Amazon Redshift cluster.

In this section, we evaluate two different use cases:

  • New customer – As a new customer, you don’t have much Amazon Redshift old data, and want to extract only the oldest monthly data and apply a lifecycle policy based on creation date. Storage tiering only affects future data and is fully automated.
  • Old customer – In this use case, you come from a multi-year data growth, and need to move existing Amazon Redshift data to different storage classes. In addition, you want a fully automated solution but with the ability to override and decide what and when to transition data between S3 storage classes. This requirement is due to many factors, like the GDPR rule “right to be forgotten.” You may need to edit historical data to remove specific customer records, which changes the file creation date. For this reason, you need S3 Lifecycle rules based on tagging instead of creation date.

New customer use case

The UNLOAD command uses the result of an embedded SQL query to extract data from Amazon Redshift to Amazon S3, producing different file formats such as CSV, text, and Parquet. To extract data from January 2019, complete the following steps:

  1. Create a destination bucket like the following:
    aws s3 mb s3://rs-lakehouse-blog-post

  2. Create a folder named archive in the destination bucket rs-lakehouse-blog-post:
    aws s3api put-object --bucket rs-lakehouse-blog-post -–key archive

  3. Use the following SQL code to implement the UNLOAD statement. The SELECT on Data data types requires quoting as well as a SELECT statement embedded in the UNLOAD command:
    unload ('select * from greentaxi where pickup between ''2019-01-01 00:00:00'' and ''2019-01-30 23:59:59''')
    to 's3://rs-lakehouse-blog-post/archive/5to12years_taxi'
    iam_role 'arn:aws:iam::123456789012:role/BlogSpectrumRole'; 

    1. You can perform a check with the AWS CLI:
      aws s3 ls s3://rs-lakehouse-blog-post/archive/
      2020-10-12 14:49:51          0 
      2020-11-04 09:51:00   34792620 5to12years_taxi0000_part_00
      2020-11-04 09:51:00   34792738 5to12years_taxi0001_part_00

  • The output shows that the UNLOAD statement generated two files of 33 MB each. By default, UNLOAD generates at least one file for each slice in the Amazon Redshift cluster. My cluster is a single node with DC2 type instances with two slices. This default file format is text, which is not storage optimized.

    To simplify the process, you create a single file for each month so that you can later apply lifecycle rules to each file. In real-world scenarios, extracting data with a single file isn’t the best practice in terms of performance optimization. This is just to simplify the process for the purpose of this post.

    1. Create your files with the following code:
      unload ('select * from greentaxi where pickup between ''2019-01-01 00:00:00'' and ''2019-01-31 23:59:59''')
      to 's3://rs-lakehouse-blog-post/archive/green_tripdata_2019-01'
      iam_role 'arn:aws:iam::123456789012:role/BlogSpectrumRole' parquet parallel off;

    The output of the UNLOAD commands is a single file (per month) in Parquet format, which takes 80% less space than the previous unload. This is important to save costs related to both Amazon S3 and Glacier, but also for costs associated to Redshift Spectrum queries, which is billed by amount of data scanned.

    1. You can check how efficient Parquet is compared to text format:
      aws s3 ls s3://rs-lakehouse-blog-post/archive/

      2020-08-12 17:17:42   14523090 green_tripdata_2019-01000.parquet 

    1. Clean up previous the text files:
      aws s3 rm s3://rs-lakehouse-blog-post/ --recursive \
      --exclude "*" --include "archive/5to12*"

      The next step is creating a lifecycle rule based on creation date to automate the migration to S3-IA after 12 months and to Glacier after 15 months. The proposed policy name is 12IA-15Glacier and it’s filtered on the prefix archive/.

    1. Create a JSON file containing the lifecycle policy definition named json:
      {
      	"Rules": [
      		{
      			"ID": "12IA-15Glacier",
      			"Filter": {
      				"Prefix": "archive"
      				},
      			"Status": "Enabled",
      			"Transitions": [
      			{
      				"Days": 365,
      				"StorageClass": "STANDARD_IA"
      			},
      			{
      	                    	"Days": 548,
      	                    	"StorageClass": "GLACIER"
      		}    
      		  ]
      }

    1. Run the following command to send the JSON file to Amazon S3:
      aws s3api put-bucket-lifecycle-configuration \ 
      --bucket rs-lakehouse-blog-post \ 
      --lifecycle-configuration file://lifecycle.json

    This lifecycle policy migrates all keys in the archive prefix from Amazon S3 to S3-IA after 12 months and from S3-IA to Glacier after 15 months. For example, if today were 2020-09-12, and you unload the 2020-03 data to Amazon S3, by 2021-09-12, this 2020-03 data is automatically migrated to S3-IA.

    If using this basic use case, you can skip the partition steps in the section Defining the external schema and external tables.

    Old customer use case

    In this use case, you extract data with different ageing in the same time frame. You extract all data from January 2019 to February 2019 and, because we assume that you aren’t using this data, archive it to S3 Glacier.

    Data from March 2019 to May 2019 is migrated as an external table on S3-IA, and data from June 2019 to November 2019 is migrated as an external table to S3 Standard. With this approach, you comply with customer long-term retention policies and regulations, and reduce TCO.

    You implement the retention strategy described in the Simulated use case and retention policy section.

    1. Create a destination bucket (if you also walked through the first use case, use a different bucket):
      aws s3 mb s3://rs-lakehouse-blog-post

    2. Create three folders named extract_longtermextract_midterm, and extract_shortterm in the destination bucket rs-lakehouse-blog-post. The following code is the syntax for creating the extract_longterm folder:
      aws s3api put-object --bucket rs-lakehouse-blog-post –key

    3. Extract the data:
      unload ('select * from greentaxi where pickup between ''2019-01-01 00:00:00'' and ''2019-01-31 23:59:59''')
      	to 's3://rs-lakehouse-blog-post/extract_longterm/green_tripdata_2019-01'
      iam_role 'arn:aws:iam::123456789012:role/BlogSpectrumRole' parquet parallel off;

    4. Repeat these steps for the February 2019 time frame.

    Managing data ageing with Amazon S3 storage classes and lifecycle policies

    In this section, you manage your data with storage classes and lifecycle policies.

    1. Migrate your keys in Parquet format to Amazon Glacier:
      aws s3api copy-object \
      --copy-source rs-lakehouse-blog-post/extract_longterm/green_tripdata_2019-01000.parquet \
      --storage-class GLACIER \
      --bucket rs-lakehouse-blog-post \
      --key extract_longterm/green_tripdata_2019-01000.parquet
      	 
      aws s3api copy-object \
      --copy-source rs-lakehouse-blog-post/extract_longterm/green_tripdata_2019-02000.parquet \
      --storage-class GLACIER \
      --bucket rs-lakehouse-blog-post \
      --key extract_longterm/green_tripdata_2019-02000.parquet

    2. Extract the data from March 2019 to May 2019 (months 12–15) and migrate them to S3-IA. The following code is for March:
      unload ('select * from greentaxi where pickup between ''2019-03-01 00:00:00'' and ''2019-03-31 23:59:59''')
      to 's3://rs-lakehouse-blog-post/extract_midterm/03/green_tripdata_2019-03'
      iam_role 'arn:aws:iam::123456789012:role/BlogSpectrumRole' parquet parallel off;

    3. Repeat the previous step for April and May.
    4. Migrate all three months to S3-IA using same process as before. The following code is for March:
      aws s3api copy-object \
      --copy-source rs-lakehouse-blog-post/extract_midterm/03/green_tripdata_2019-03000.parquet \
      --storage-class STANDARD_IA \ 8. --bucket rs-lakehouse-blog-post \
      --key extract_midterm/03/green_tripdata_2019-03000.parquet 

    5. Do the same for other two months.
    6. Check the newly applied storage class with the following AWS CLI command:
      aws s3api head-object \
      --bucket rs-lakehouse-blog-post \
      --key extract_midterm/03/green_tripdata_2019-03000.parquet
       
      {
          "AcceptRanges": "bytes",
          "LastModified": "2020-10-12T13:47:32+00:00",
          "ContentLength": 14087514,
          "ETag": "\"15bf39e6b3f32b10ef589d75e0988ce6\"",
          "ContentType": "application/x-www-form-urlencoded; charset=utf-8",
          "Metadata": {},
          "StorageClass": "STANDARD_IA"
      }

    In the next step, you tag every monthly file with a key value named ageing set to the number of months elapsed from the origin date.

    1. Set March to 14, April to 13, and May to 12:
      aws s3api put-object-tagging \
      --bucket rs-lakehouse-blog-post \
      --key extract_midterm/03/green_tripdata_2019-03000.parquet \
      --tagging '{"TagSet": [{ "Key": "ageing", "Value": "14"} ]}'
      	 
      aws s3api put-object-tagging \
      --bucket rs-lakehouse-blog-post \
      --key extract_midterm/04/green_tripdata_2019-04000.parquet \
      --tagging '{"TagSet": [{ "Key": "ageing", "Value": "13"} ]}'
      	 
      aws s3api put-object-tagging \
      --bucket rs-lakehouse-blog-post \
      --key extract_midterm/05/green_tripdata_2019-05000.parquet \
      --tagging '{"TagSet": [{ "Key": "ageing", "Value": "12"} ]}'

    In this set of three objects, the oldest file has the tag ageing set to value 14, and the newest is set to 12. In the second post in this series, you discover how to manage the ageing tag as it increases month by month.

    The next step is to create a lifecycle rule based on this specific tag in order to automate the migration to Glacier at month 15. The proposed policy name is 15IAtoGlacier and the definition is to limit the scope to only object with the tag ageing set to 15 in the specific bucket.

    1. Create a JSON file containing the lifecycle policy definition named json:
      {
          "Rules": [
              {
                  "ID": "15IAtoGlacier",
                  "Filter": {
                      "Tag": {
                          "Key": "ageing",
                          "Value": "15"
                      }
                  },
                  "Status": "Enabled",
                  "Transitions": [
                      {
                          "Days": 1,
                          "StorageClass": "GLACIER"
                      }
                  ]
              }
          ]
      }

      
      

    2. Run the following command to send the JSON file to Amazon S3:
      aws s3api put-bucket-lifecycle-configuration \
      --bucket rs-lakehouse-blog-post \
      --lifecycle-configuration file://lifecycle.json 

    This lifecycle policy migrates all objects with the tag ageing set to 15 from S3-IA to Glacier.

    Though I described this process as automating the migration, I actually want to control the process from the application level using the self-managed tag mechanism. I use this approach because otherwise, the transition is based on file creation date, and the objective is to be able to delete, update, or create a new file whenever needed (for example, to delete parts of records in order to comply to the GDPR “right to be forgotten” rule).

    Now you extract all data from June 2019 to November 2019 (7–11 months old) and keep them in Amazon S3 with a lifecycle policy to automatically migrate to S3-IA after ageing 12 months, using same process as described. These six new objects also inherit the rule created previously to migrate to Glacier after 15 months. Finally, you set the ageing tag as described before.

    Use the extract_shortterm prefix for these unload operations.

    1. Unload June 2019 with the following code:
      unload ('select * from greentaxi where pickup between ''2019-06-01 00:00:00 '' and ''2019-06-30 23:59:59''')
      to 's3://rs-lakehouse-blog-post/extract_shortterm/06/green_tripdata_2019-06'
      iam_role 'arn:aws:iam::123456789012:role/BlogSpectrumRole' parquet parallel off;

    2. Use the same logic for the remaining months up to October.
    3. For November, see the following code:
      	unload ('select * from greentaxi where pickup between ''2019-11-01 00:00:00'' and ''2019-11-30 23:59:59''')
      	to 's3://rs-lakehouse-blog-post/extract_shortterm/11/green_tripdata_2019-11''
      	iam_role 'arn:aws:iam::123456789012:role/BlogSpectrumRole' parquet parallel off;
      
      aws s3 ls –recursive s3://rs-lakehouse-blog-post/extract_shortterm/
      2020-10-12 14:52:11          0 extract_shortterm/
      2020-10-12 18:45:42          0 extract_shortterm/06/
      2020-10-12 18:46:49   10889436 extract_shortterm/06/green_tripdata_2019-06000.parquet
      2020-10-12 18:45:53          0 extract_shortterm/07/
      2020-10-12 18:47:03   10759747 extract_shortterm/07/green_tripdata_2019-07000.parquet
      2020-10-12 18:45:58          0 extract_shortterm/08/
      2020-10-12 18:47:24    9947793 extract_shortterm/08/green_tripdata_2019-08000.parquet
      2020-10-12 18:46:03          0 extract_shortterm/09/
      2020-10-12 18:47:45   10302432 extract_shortterm/09/green_tripdata_2019-09000.parquet
      2020-10-12 18:46:08          0 extract_shortterm/10/
      2020-10-12 18:48:00   10659857 extract_shortterm/10/green_tripdata_2019-10000.parquet
      2020-10-12 18:46:11          0 extract_shortterm/11/
      2020-10-12 18:48:14   10247201 extract_shortterm/11/green_tripdata_2019-11000.parquet

      aws s3 ls --recursive rs-lakehouse-blog-post/extract_longterm/

      2020-10-12 14:49:51          0 extract_longterm/
      2020-10-12 14:56:38   14403710 extract_longterm/green_tripdata_2019-01000.parquet
      2020-10-12 15:30:14   13454341 extract_longterm/green_tripdata_2019-02000.parquet

    4. Apply the tag ageing with range 11 to 6 (June 2019 to November 2019), using either the AWS CLI or console if you prefer.
    5. Create a new lifecycle rule named 12S3toS3IA, which transitions from Amazon S3 to S3-IA.
    6. With the AWS CLI, create a JSON file that includes the previously defined rule 15IAtoGlacier and new 12S3toS3IA, because the command s3api overwrites the current configuration (no incremental approach) with the new policy definition file (JSON). The following code is the new lifecycle.json:
      {
          "Rules": [
              {
                  "ID": "12S3toS3IA",
                  "Filter": {
                      "Tag": {
                          "Key": "ageing",
                          "Value": "12"
                      }
                  },
                  "Status": "Enabled",
                  "Transitions": [
                      {
                          "Days": 30,
                          "StorageClass": "STANDARD_IA"
                      }
                  ]
              },
              {
                  "ID": "15IAtoGlacier",
                  "Filter": {
                      "Tag": {
                          "Key": "ageing",
                          "Value": "15"
                      }
                  },
                  "Status": "Enabled",
                  "Transitions": [
                      {
                          "Days": 1,
                          "StorageClass": "GLACIER"
                      }
                  ]
              }
          ]
      }

      aws s3api get-bucket-lifecycle-configuration \
      --bucket rs-lakehouse-blog-post

    7. Check the applied policies with the following command:
      aws s3api get-bucket-lifecycle-configuration \
      --bucket rs-lakehouse-blog-post

  • You get in stdout a single JSON with merge of 15IAtoGlacier and 12S3toS3IA.

    Defining the external schema and external tables

    Before deleting the records you extracted from Amazon Redshift with the UNLOAD command, we define the external schema and external tables to enable Redshift Spectrum queries for these Parquet files.

    1. Enter the following code to create your schema:
      create external schema taxispectrum
      	from data catalog
      	database 'blogdb'
      	iam_role 'arn:aws:iam::123456789012:role/BlogSpectrumRole'
      create external database if not exists;

    2. Create the external table taxi_archive in the taxispectrum external schema. If you’re walking through the new customer use case, replace the prefix extract_midterm with archive:
      create external table taxispectrum.taxi_archive(
      	vendor_id integer,
      	pickup timestamp,
      	dropoff timestamp,
      	storeandfwd char(2),
      	ratecodeid integer,
      	pulocid integer,
      	dolocid integer,
      	passenger_count integer,
      	trip_dist real,
      	fare_amount real,
      	extra real,
      	mta_tax real,
      	tip_amount real,
      	toll_amount real,
      	ehail_fee real,
      	improve_surch real,
      	total_amount real,
      	pay_type integer,
      	trip_type integer,
      	congest_surch real)
      	partitioned by (yearmonth char(7))
      	stored as parquet
      location 's3://rs-lakehouse-blog-post/extract_midterm/'

    3. Add the six files stored in Amazon S3 and three files stored in S3-IA as partitions (if you’re walking through the new customer use case, you can skip the following partitioning steps). The following code shows March and April:
      ALTER TABLE taxispectrum.taxi_archive
      ADD PARTITION (yearmonth='2019-03') 
      LOCATION 's3://rs-lakehouse-blog-post/extract_midterm/03/';
      ALTER TABLE taxispectrum.taxi_archive
      ADD PARTITION (yearmonth='2019-04') 
      LOCATION 's3://rs-lakehouse-blog-post/extract_midterm/04/';

    4. Continue this process up to December 2019, using extract_shortterm instead of extract_midterm.
    5. Check the table isn’t empty with the following SQL statement:
      Select count (*) from taxispectrum.taxi_archive 

    You get the number of entries in this external table.

    1. Optionally, you can check the partitions mapped to this table with a query to the Amazon Redshift internal table:
      select * from svv_external_partitions

      Run a SELECT command using partitioning in order to optimize costs related to Redshift Spectrum scanning:

      select * from taxispectrum.taxi_archive where yearmonth='2019-11' and fare_amount > 20

    2. Redshift Spectrum scans only specific partitions matching yearmonth.

    The final step is cleaning all the records extracted from the Amazon Redshift local tables:

    delete from public.greentaxi where pickup between '2019-01-01 00:00:00' and '2019-11-30 23:59:59'

    Conclusion

    We demonstrated how to extract historical data from Amazon Redshift and implement an archive strategy with Redshift Spectrum and Amazon S3 storage classes. In addition, we showed how to optimize Redshift Spectrum scans with partitioning.

    In the next post in this series, we show how to operate this solution day by day, especially for the old customer use case, and share some best practices.


    About the Authors

    Cristian Gavazzeni is a senior solution architect at Amazon Web Services. He has more than 20 years of experience as a pre-sales consultant focusing on Data Management, Infrastructure and Security. During his spare time he likes eating Japanese food and travelling abroad with only fly and drive bookings.

     

     

    Francesco MarelliFrancesco Marelli is a senior solutions architect at Amazon Web Services. He has lived and worked in London for 10 years, after that he has worked in Italy, Switzerland and other countries in EMEA. He is specialized in the design and implementation of Analytics, Data Management and Big Data systems, mainly for Enterprise and FSI customers. Francesco also has a strong experience in systems integration and design and implementation of web applications. He loves sharing his professional knowledge, collecting vinyl records and playing bass.

    The best new features for data analysts in Amazon Redshift in 2020

    Post Syndicated from Helen Anderson original https://aws.amazon.com/blogs/big-data/the-best-new-features-for-data-analysts-in-amazon-redshift-in-2020/

    This is a guest post by Helen Anderson, data analyst and AWS Data Hero

    Every year, the Amazon Redshift team launches new and exciting features, and 2020 was no exception. New features to improve the data warehouse service and add interoperability with other AWS services were rolling out all year.

    I am part of a team that for the past 3 years has used Amazon Redshift to store source tables from systems around the organization and usage data from our software as a service (SaaS) product. Amazon Redshift is our one source of truth. We use it to prepare operational reports that support the business and for ad hoc queries when numbers are needed quickly.

    When AWS re:Invent comes around, I look forward to the new features, enhancements, and functionality that make things easier for analysts. If you haven’t tried Amazon Redshift in a while, or even if you’re a longtime user, these new capabilities are designed with analysts in mind to make it easier to analyze data at scale.

    Amazon Redshift ML

    The newly launched preview of Amazon Redshift ML lets data analysts use Amazon SageMaker over datasets in Amazon Redshift to solve business problems without the need for a data scientist to create custom models.

    As a data analyst myself, this is one of the most interesting announcements to come out in re:Invent 2020. Analysts generally use SQL to query data and present insights, but they don’t often do data science too. Now there is no need to wait for a data scientist or learn a new language to create predictive models.

    For information about what you need to get started with Amazon Redshift ML, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

    For information about what you need to get started with Amazon Redshift ML, see the Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML blog post.

    Federated queries

    As analysts, we often have to join datasets that aren’t in the same format and sometimes aren’t ready for use in the same place. By using federated queries to access data in other databases or Amazon Simple Storage Service (Amazon S3), you don’t need to wait for a data engineer or ETL process to move data around.

    re:Invent 2019 featured some interesting talks from Amazon Redshift customers who were tackling this problem. Now that federated queries over operational databases like Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL are generally available and querying Amazon RDS for MySQL and Amazon Aurora MySQL is in preview, I’m excited to hear more.

    For a step-by-step example to help you get started, see Build a Simplified ETL and Live Data Query Solution Using Redshift Federated Query.

    SUPER data type

    Another problem we face as analysts is that the data we need isn’t always in rows and columns. The new SUPER data type makes JSON data easy to use natively in Amazon Redshift with PartiQL.

    PartiQL is an extension that helps analysts get up and running quickly with structured and semistructured data so you can unnest and query using JOINs and aggregates. This is really exciting for those who deal with data coming from applications that store data in JSON or unstructured formats.

    For use cases and a quickstart, see Ingesting and querying semistructured data in Amazon Redshift (preview).

    Partner console integration

    The preview of the native console integration with partners announced at AWS re:Invent 2020 will also make data analysis quicker and easier. Although analysts might not be doing the ETL work themselves, this new release makes it easier to move data from platforms like Salesforce, Google Analytics, and Facebook Ads into Amazon Redshift.

    Matillion, Sisense, Segment, Etleap, and Fivetran are launch partners, with other partners coming soon. If you’re an Amazon Redshift partner and would like to integrate into the console, contact [email protected].

    RA3 nodes with managed storage

    Previously, when you added Amazon Redshift nodes to a cluster, both storage and compute were scaled up. This all changed with the 2019 announcement of RA3 nodes, which upgrade storage and compute independently.

    In 2020, the Amazon Redshift team introduced RA3.xlplus nodes, which offer even more compute sizing options to address a broader set of workload requirements.

    AQUA for Amazon Redshift

    As analysts, we want our queries to run quickly so we can spend more time empowering the users of our insights and less time watching data slowly return. AQUA, the Advanced Query Accelerator for Amazon Redshift tackles this problem at an infrastructure level by bringing the stored data closer to the compute power

    This hardware-accelerated cache enables Amazon Redshift to run up to 10 times faster as it scales out and processes data in parallel across many nodes. Each node accelerates compression, encryption, and data processing tasks like scans, aggregates, and filtering. Analysts should still try their best to write efficient code, but the power of AQUA will speed up the return of results considerably.

    AQUA is available on Amazon Redshift RA3 instances at no additional cost. To get started with AQUA, sign up for the preview.

    The following diagram shows Amazon Redshift architecture with an AQUA layer.

    AQUA is available on Amazon Redshift RA3 instances at no additional cost.

    Figure 1: Amazon Redshift architecture with AQUA layer

    Automated performance tuning

    For analysts who haven’t used sort and distribution keys, the learning curve can be steep. A table created with the wrong keys can mean results take much longer to return.

    Automatic table optimization tackles this problem by using machine learning to select the best keys and tune the physical design of tables. Letting Amazon Redshift determine how to improve cluster performance reduces manual effort.

    Summary

    These are just some of the Amazon Redshift announcements made in 2020 to help analysts get query results faster. Some of these features help you get access to the data you need, whether it’s in Amazon Redshift or somewhere else. Others are under-the-hood enhancements that make things run smoothly with less manual effort.

    For more information about these announcements and a complete list of new features, see What’s New in Amazon Redshift.


    About the Author

    Helen Anderson is a Data Analyst based in Wellington, New Zealand. She is well known in the data community for writing beginner-friendly blog posts, teaching, and mentoring those who are new to tech. As a woman in tech and a career switcher, Helen is particularly interested in inspiring those who are underrepresented in the industry.

    Running queries securely from the same VPC where an Amazon Redshift cluster is running

    Post Syndicated from Seetha Sarma original https://aws.amazon.com/blogs/big-data/running-queries-securely-from-the-same-vpc-where-an-amazon-redshift-cluster-is-running/

    Customers who don’t need to set up a VPN or a private connection to AWS often use public endpoints to access AWS. Although this is acceptable for testing out the services, most production workloads need a secure connection to their VPC on AWS. If you’re running your production data warehouse on Amazon Redshift, you can run your queries in Amazon Redshift query editor or use Amazon WorkSpaces from Amazon Virtual Private Cloud (Amazon VPC) to connect to Amazon Redshift securely and analyze and graph a data summary in your favorite business intelligence (BI) or data visualization desktop tool.

    With Amazon Redshift, you can query petabytes of structured and semi-structured data across your data warehouse, operational database, and your data lake using standard SQL. Amazon WorkSpaces is a managed, secure Desktop-as-a-Service (DaaS) solution deployed within an Amazon VPC. In this post, we show how you can run SQL queries on Amazon Redshift securely without VPN using Amazon Redshift query editor and Amazon WorkSpaces. First, we discuss how to run queries that return large datasets from the Amazon Redshift query editor using the UNLOAD command. Next, we discuss how to set up Amazon WorkSpaces and use it to securely run queries on Amazon Redshift. We cover the detailed steps for setting up Amazon WorkSpaces and show different scenarios to test Amazon Redshift queries with Amazon WorkSpaces.

    The following diagram illustrates these architectures.

    Using the Amazon Redshift query editor with UNLOAD command

    Amazon Redshift has a query editor on its console that is typically used to run short queries and explore the data in the Amazon Redshift database. You may have a production scenario with queries that return large result sets. For instance, you may want to unload CSV data for use by a downstream process. In this case, you can run your query in the query editor and use the UNLOAD command to send the output directly to Amazon Simple Storage Service (Amazon S3) and get notified when the data is uploaded.

    1. Create separate S3 buckets for each user. You can use the default configuration for the bucket.

    Create separate S3 buckets for each user. You can use the default configuration for the bucket.

    1. On the Amazon Redshift console, choose Editor.
    2. In the Connect to database section, enter the database connection details.
    3. Choose Connect to database.

    Choose Connect to database.

    1. Use the following format for the queries run from the query editor using the IAM role for Amazon Redshift, so the results get uploaded to Amazon S3:
      UNLOAD 
      ('select id, name
      from <your_ schema>.<your_table>)
      TO 's3://<username>/<yy-mm-dd-hh24-mi-ss>/'
      FORMAT as CSV
      iam_role 'arn:aws:iam:<myaccount>:role/MyAmazon_RedshiftUnloadRole';

    Use the following format for the queries run from the query editor.

    This query creates multiple files in the designated user’s S3 bucket under the date/time prefix. The user can preview or download the individual files on the Amazon S3 console.

    This query creates multiple files in the designated user’s S3 bucket under the date/time prefix.

    A large unload may take some time. You can configure Amazon Simple Notification Service (Amazon SNS) to send a notification when the results are uploaded to Amazon S3.

    1. On the Amazon SNS console, choose Topics.
    2. Choose Create topic.

    Choose Create topic.

    1. Create an SNS topic with a meaningful description text, like Your query results are uploaded to S3.

    In the next steps, you edit the access policy of the SNS topic to give permission for Amazon S3 to publish to it.

    1. Change the Principal from "AWS": "*" to "Service": "s3.amazonaws.com".
    2. Scroll down to “Action” and delete everything except “SNS:Publish”. Make sure to delete the extra comma.
    3. Scroll down to “Condition” and modify the text "StringEquals": { "AWS:SourceOwner": <Your account id>} to "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:<user-bucket-name>" }.

    In the next steps, you edit the access policy of the SNS topic to give permission for Amazon S3 to publish to it.

    1. In the navigation pane, choose Subscriptions.
    2. Choose Create subscription.

    Choose Create subscription.

    1. Subscribe to the SNS notification with the user’s email address as the endpoint.

    Subscribe to the SNS notification with the user’s email address as the endpoint.

    1. Make sure the user chooses confirms the subscription from their email.
    2. On the Amazon S3 console, choose the Properties tab of the user’s S3 bucket.
    3. Under Event Notifications, choose Create event notification.

    Under Event Notifications, choose Create event notification.

    1. Select All object create events.

    Select All object create events.

    1. For Destination, select SNS topic.
    2. For Specify SNS topic, select Choose from your SNS topics.
    3. For SNS topic, choose the topic you created.

    For SNS topic, choose the topic you created.

    1. Save your settings.
    2. To test the notification, on the Amazon Redshift console, open the query editor.
    3. Edit the UNLOAD query and change the S3 bucket name to the current date and time.
    4. Run the query and check if the user gets the email notification.

    Using Amazon WorkSpaces to run Amazon Redshift queries

    In this section, we cover setting up Amazon WorkSpaces, including Amazon VPC prerequisites, creating an Amazon VPC endpoint for Amazon S3, launching Amazon WorkSpaces in the same VPC where an Amazon Redshift cluster is running, setting up an Amazon WorkSpaces client, installing PSQL or a SQL client, and connecting to the client.

    When setup is complete, we show different scenarios to test with Amazon WorkSpaces, such as testing a SQL command from the Amazon WorkSpaces client, testing SCREEN program to run SQL in the background, and testing PSQL with Amazon S3 and getting a notification through Amazon SNS.

    Prerequisites

    By default, AWS Identity and Access Management (IAM) users and roles can’t perform tasks using the AWS Management Console and they don’t have permission to create or modify Amazon VPC resources. Make sure you have administrator privileges or an administrator creates IAM policies that grants sufficient permissions to edit the route table, edit the VPC security group, and enable a DNS hostname for the VPC.

    When you have the correct permissions, complete the following prerequisite steps:

    1. On the Amazon Redshift console, in config, check the cluster subnet groups to make sure the Amazon Redshift cluster is created in an Amazon VPC with at least two subnets that are in separate Availability Zones.
    2. On the Amazon VPC console, edit the route table and make sure to associate these two subnets.
    3. Make sure the Amazon VPC security group has a self-referencing inbound rule for the security group for all traffic (not all tcp). The self-referencing rule restricts the source to the same security group in the VPC, and it’s not open to all networks. Consider limiting to just the protocol and port needed for Redshift to talk to Workspaces.
    4. Edit the DNS hostname of the Amazon VPC and enable it.

    Creating an Amazon VPC endpoint for Amazon S3 for software downloads

    In this step, you create your Amazon VPC endpoint for Amazon S3. This gives you Amazon S3 access to download PSQL from the Amazon repository. Alternatively, you could set up a NAT Gateway and download PSQL or other SQL clients from the internet.

    1. On the Amazon VPC console, choose Endpoints.
    2. Choose Create endpoint.
      Choose Create endpoint.
    3. Search for Service Name: S3
    4. Select the S3 service gateway
      Select the S3 service gateway
    5. Select the Amazon VPC where the Amazon Redshift cluster is running
    6. Select the route table
      Select the route table
    7. Enter the following custom policy for the endpoint to access the Amazon Linux AMI
      {
          "Version": "2008-10-17",
          "Statement": [
              {
                  "Sid": "Amazon Linux AMI Repository Access",
                  "Effect": "Allow",
                  "Principal": "*",
                  "Action": "s3:GetObject",
                  "Resource": [
                      "arn:aws:s3:::*.amazonaws.com",
                      "arn:aws:s3:::*.amazonaws.com/*"
                  ]
              }
          ]
      }

      Select the Amazon VPC where the Amazon Redshift cluster is running

    8. Create the endpoint

    Launching Amazon WorkSpaces in the VPC where the Amazon Redshift cluster runs

    You’re now ready to launch Amazon WorkSpaces.

    1. On the Amazon WorkSpaces console, choose Launch WorkSpaces.

    On the Amazon WorkSpaces console, choose Launch WorkSpaces.

    1. For Directory types, select Simple AD.

    Directory Service Solutions helps you store information and manage access to resources. For this post, we choose Simple AD.

    Directory Service Solutions helps you store information and manage access to resources. For this post, we choose Simple AD.

    1. For Directory size, select Small.
    2. Enter your directory details.

    Enter your directory details.

    1. For VPC, choose the VPC where the Amazon Redshift cluster is running.
    2. For Subnets, choose the two subnets you created.

    For Subnets, choose the two subnets you created.

    It may take a few minutes to provision the directory. You see the status show as Active when it’s ready.

    It may take a few minutes to provision the directory. You see the status show as Active when it’s ready.

    1. When the directory is provisioned, choose the directory and subnets you created.
    2. Choose Next Step.

    Choose Next Step.

    1. Create and identify your users.

    Create and identify your users.

    1. Use the default settings for the compute bundle.
    2. For Running Mode, select AlwaysOn.

    Alternatively, select AutoStop and adjust the time in order to run long-running queries.

    Alternatively, select AutoStop and adjust the time in order to run long-running queries.

    1. Review and launch WorkSpaces.

    It may take up to 20 minutes to become available.

    Setting up the Amazon WorkSpaces client

    In this section, you configure your Amazon WorkSpaces client.

    1. Use the link from your email to download the Amazon WorkSpaces client.
    2. Register it using the registration code from the email.
    3. Login with your username in the email and your newly created password
    4. In the Amazon WorkSpaces client, open the terminal.

    In the Amazon WorkSpaces client, open the terminal.

    1. Run the following command to capture the IP address:
    hostname -I | awk '{print $2}'

    The following screenshot shows your output.

    The following screenshot shows your output.

    1. On the Amazon Redshift console, choose Clusters.
    2. Choose your cluster.
    3. Save the endpoint information to use later.
    4. Choose the Properties tab.

    Choose the Properties tab.

    1. In the Network and security section, note the VPC security group.
      In the Network and security section, note the VPC security group.
    2. On the Amazon VPC console, under Security, choose Security Groups.
    3. Select the security group that the Amazon Redshift cluster uses.

    Select the security group that the Amazon Redshift cluster uses.

    1. Add an inbound rule with the type Redshift and the source value of the IP Address you captured/32.

    Add an inbound rule with the type Redshift and the source value of the IP Address you captured/32.

    1. On the Amazon WorkSpaces client, use the Amazon Redshift hostname from the endpoint you saved earlier and verify the VPC setup with the following code:
       nslookup <Amazon Redshift hostname>

    If you see an IP address within your subnet range, the private endpoint setup for Amazon Redshift was successful.

    If you see an IP address within your subnet range, the private endpoint setup for Amazon Redshift was successful.

    Testing a SQL command from PSQL or a SQL client in the Amazon WorkSpaces client

    To test a SQL command, complete the following steps:

    1. From the terminal in the Amazon WorkSpaces client, run the following command to install PostgreSQL:
      sudo yum install postgresql-server

    Alternatively, setup a NAT Gateway and download a SQL client such as SQL Workbench on the Amazon WorkSpaces client:

    sudo wget https://www.sql-workbench.eu/Workbench-Build125-with-optional-libs.zip

    Then unzip the content of the downloaded file and save it to a directory:

    unzip Workbench-Build125-with-optional-libs.zip -d ~/Workbench
    1. Use the Amazon Redshift hostname, port, and database names of the Amazon Redshift cluster endpoint you copied earlier and try connecting to the database:
      psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W

    1. Enter your password when prompted.

    Enter your password when prompted.

    1. Run a SQL command and check the results.

    Testing the SCREEN program to run SQL in the background

    You can use the SCREEN program to run the SQL command in the background and resume to see the results.

    1. From the terminal in the Amazon WorkSpaces client, install the SCREEN program:
      sudo yum install screen

    1. Run the program:
      screen

    1. Connect to PSQL:
      psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W

    1. Enter the password when prompted.
    2. Run the SQL command
    3. Enter the command ctrl A D to detach from the screen.

    The SQL command is now running in the background. You can check by running ps -ef | grep psql.

    The SQL command is now running in the background. You can check by running ps -ef | grep psql.

    1. To go back to that screen, run the following command:
      screen -r

    1. To quit SCREEN, enter the following command:
      ctrl A \

    Testing PSQL with Amazon S3 and Amazon SNS

    Similar to the UNLOAD command we used from the Amazon Redshift query editor in the beginning of this post, you can run PSQL from the Amazon WorkSpaces client, send the output to an S3 bucket, and get an Amazon SNS notification for an object create event.

    1. From the terminal in the Amazon WorkSpaces client, run aws configure to set up AWS credentials with write access to the S3 bucket.
    2. Run the following command to write a single output file to Amazon S3 and send an email notification:
      psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W -c 'select column1, column2 from myschema.mytable' > s3://<username>/<yy-mm-dd-hh24-mi-ss>/

    Conclusion

    This post explained how to securely query Amazon Redshift databases running in an Amazon VPC using the UNLOAD command with the Amazon Redshift query editor and using Amazon WorkSpaces running in the same VPC. Try out this solution to securely access Amazon Redshift databases without a VPN and run long-running queries. If you have any questions or comments, please share your thoughts in the comments section.


    About the Authors

    Seetha Sarma is a Senior Database Specialist Solutions Architect with Amazon Web Services. Seetha provides guidance to customers on using AWS services for distributed data processing. In her spare time she likes to go on long walks and enjoy nature.

     

     

     

    Moiz MianMoiz Mian is a Solutions Architect for AWS Strategic Accounts. He focuses on enabling customers to build innovative, scalable, and secure solutions for the cloud. In his free time, he enjoys building out Smart Home systems and driving at race tracks.

    Scheduling SQL queries on your Amazon Redshift data warehouse

    Post Syndicated from Sain Das original https://aws.amazon.com/blogs/big-data/scheduling-sql-queries-on-your-amazon-redshift-data-warehouse/

    Amazon Redshift is the most popular cloud data warehouse today, with tens of thousands of customers collectively processing over 2 exabytes of data on Amazon Redshift daily. Amazon Redshift is fully managed, scalable, secure, and integrates seamlessly with your data lake. In this post, we discuss how to set up and use the new query scheduling feature on Amazon Redshift.

    Amazon Redshift users often need to run SQL queries or routine maintenance tasks at a regular schedule. You can now schedule statements directly from the Amazon Redshift console or by using the AWS Command Line Interface (AWS CLI) without having to use scripting and a scheduler like cron. You can schedule and run the SQL statement using Amazon EventBridge and the Amazon Redshift Data API. The results are available for 24 hours after running the SQL statement. This helps you in a variety of scenarios, such as when you need to do the following:

    • Run SQL queries during non-business hours
    • Load data using COPY statements every night
    • Unload data using UNLOAD nightly or at regular intervals throughout the day
    • Delete older data from tables as per regulatory or established data retention policies
    • Refresh materialized views manually at a regular frequency
    • Back up system tables every night

    EventBridge is a serverless event bus service that makes it easy to connect your applications with data from a variety of sources. EventBridge delivers a stream of real-time data from your own applications, software as a service (SaaS) applications, and AWS services and routes that data to targets including Amazon Redshift clusters. Amazon Redshift integrates with EventBridge to allow you to schedule your SQL statements on recurring schedules and enables you to build event-driven applications. Beside scheduling SQL, you can also invoke the Amazon Redshift Data API in response to any other EventBridge event.

    When creating a schedule using the Amazon Redshift console, you create an EventBridge rule with the specified schedule and attach a target (with the Amazon Redshift cluster information, login details, and SQL command run) to the rule. This rule then runs as per the schedule using EventBridge.

    In this post, we describe how you can schedule SQL statements on Amazon Redshift using EventBridge, and also go over the required security privileges and the steps to schedule a SQL statement using both the AWS Management Console and the AWS CLI.

    Prerequisites

    To set this up, we need to make sure that the AWS Identity and Access Management (IAM) user (which we use to create the schedule and access the schedule history), the IAM role, and the AWS secret (which stores the database user name and password) are configured correctly.

    1. Make sure that the IAM user who is going to create the schedule has the AmazonEventBridgeFullAccess IAM policy attached to it. The IAM user should also have appropriate access to Amazon Redshift as per their role. We use the placeholder {USER_NAME} to refer to this IAM user in this post.
    1. Store the database credentials to be used for running the scheduled SQL statement securely using AWS Secrets Manager. If you already have a secret created, you can use that. If not, create a new secret to store the Amazon Redshift database name and user name.
    2. Create an IAM role for the Amazon Redshift service with the “Redshift Customizable” option and attach the AmazonRedshiftDataFullAccess AWS managed policy to it.
      1. Ensure that the role has the following trust relationships added to it:
        {
              "Sid": "S1",
              "Effect": "Allow",
              "Principal": {
                "Service": "events.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
            }

      2. Add the following AssumeRole permissions so the user can see the schedule history list. Replace the {ACCOUNT_ID} with your AWS account ID and {USER_NAME} with the IAM user you set up:
        {
              "Sid": "S2",
              "Effect": "Allow",
              "Principal": {
                "AWS": "arn:aws:iam::{ACCOUNT_ID}:user/{USER_NAME}"
              },
              "Action": "sts:AssumeRole"
        }

      We use the placeholder {ROLE_NAME} to refer to this role in this post.

    1. In addition to the preceding AssumeRole permissions, the IAM user has to be granted AssumeRole permissions on the IAM role you created in order to view the schedule history:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "S3",
                  "Effect": "Allow",
                  "Action": "sts:AssumeRole",
                  "Resource": "arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME}"
              }
          ]
      }

    Now that we have completed the security setup required to schedule SQL statements and view their history, let’s schedule a SQL statement to run at a regular frequency. We can do this using the console or the AWS CLI. We discuss both approaches in this post.

    Scheduling the SQL statement using the console

    Let’s take the example of a fairly common use case where data from a table has to be extracted daily (or at another regular frequency) into Amazon Simple Storage Service (Amazon S3) for analysis by data scientists or data analysts. You can accomplish this by scheduling an UNLOAD command to run daily to export data from the table to the data lake on Amazon S3. See the following code:

    unload ('select * from edw.trxns')
    to 's3://mybucket/'
    iam_role 'arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME_2}'
    PARQUET
    PARTITION BY (trxn_dt);
    ;

    {ROLE_NAME_2} in the preceding code is not the same as {ROLE_NAME}. {ROLE_NAME_2} should be an IAM role that has permissions to run the UNLOAD command successfully.

    We can schedule this UNLOAD statement to run every day at 4:00 AM UTC using the following steps:

    1. Sign in to the console. Make sure the IAM user has been granted the necessary permissions.
    2. On the Amazon Redshift console, open the query editor.
    3. Choose Schedule.

    Choose Schedule.

    1. In the Scheduler permissions section, for IAM role, choose the role you created earlier.

    If you don’t have IAM read permissions, you may not see the IAM role in the drop-down menu. In that case, you can enter the Amazon Resource Name (ARN) of the IAM role that you created.

    1. For Authentication, select AWS Secrets Manager.
    2. For Secret, choose the secret you configured earlier, which contains the database user name and password.

    You can also use Temporary credentials for authentication as explained in the AWS documentation.

    1. For Database name, enter a name.

    For Database name, enter a name.

    1. In the Query Information section, for Scheduled query name, enter a name for the query.
    2. For SQL query, enter the SQL code.

    You also have the ability to upload a SQL query from a file.

    You also have the ability to upload a SQL query from a file.

    1. In the Scheduling options section, for Schedule query by, select Run frequency.
    2. For Repeat by, choose Day.
    3. For Repeat every, enter 1.
    4. For Repeat at time (UTC), enter 04:00.

    For Repeat at time (UTC), enter 04:00.

    1. In the Monitoring section, you can choose to enable notifications via email or text using Amazon Simple Notification Service (Amazon SNS).

    If you decide to enable notifications, make sure that the Publish action has been granted to events.amazonaws.com on the SNS topic. You can do this by adding the following snippet to the access policy of the SNS topic. Replace ACCOUNT_ID and SNS_TOPIC_NAME with appropriate values. If you choose to create a new SNS topic during the schedule creation process, then the following access is granted automatically.

    {
          "Sid": "Allow_Publish_Events",
          "Effect": "Allow",
          "Principal": {
            "Service": "events.amazonaws.com"
          },
          "Action": "sns:Publish",
          "Resource": "arn:aws:sns:us-east-1:{ACCOUNT_ID}:{SNS_TOPIC_NAME}"
      }

    Scheduling the SQL statement using the AWS CLI

    You can also schedule SQL statements via the AWS CLI using EventBridge and the Amazon Redshift Data API. For more information about the Amazon Redshift Data API, see Using the Amazon Redshift Data API to interact with Amazon Redshift clusters.

    In the following example, we set up a schedule to refresh a materialized view (called mv_cust_trans_hist) on Amazon Redshift daily at 2:00 AM UTC.

    1. Create an event rule. The following command creates a rule named scheduled-refresh-mv-cust-trans-hist and schedules it to run daily at 2:00 AM UTC. The schedule expression can be provided using cron or rate expressions.
      aws events put-rule \
      --name scheduled-refresh-mv-cust-trans-hist \
      --schedule-expression "cron(0 22 * * ? *)"

    1. Create a JSON object that contains the Amazon Redshift Data API parameters:
      1. For ARN, enter the ARN of your Amazon Redshift cluster.
      2. For the RoleArn and the SecretManagerArn fields, use the ARNs for the role and secret you created earlier.
      3. Enter the database name and the SQL statement to be scheduled for the database and SQL fields.
      4. You can enter any name for the StatementName field.
      5. If you want an event to be sent after the SQL statement has been run, you can set the WithEvent parameter to true. You can then use that event to trigger other SQL statements if needed.
        {
        "Rule": "scheduled-refresh-mv-cust-trans-hist",
        "EventBusName": "default",
        "Targets": 
        [
        {
        "Id": "scheduled-refresh-mv-cust-trans-hist",
        "Arn": "arn:aws:redshift:us-east-1:{ACCOUNT_ID}:cluster:{REDSHIFT_CLUSTER_IDENTIFIER}",
        "RoleArn": "arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME}",
        "RedshiftDataParameters": 
        {
        "SecretManagerArn": "arn:aws:secretsmanager:us-east-1:{ACCOUNT_ID}:secret:{SECRET_NAME-xxxxxx}",
        "Database": "dev",
        "Sql": "REFRESH MATERIALIZED VIEW mv_cust_trans_hist;",
        "StatementName": "refresh-mv-cust-trans-hist",
        "WithEvent": true
        }
        }
        ]
        }

    1. Create an event target using the JSON file created in the previous step:
      aws events put-targets --cli-input-json file://data.json

    Additional commands

    We have now set up the schedule to refresh the materialized view using the AWS CLI. Let’s quickly go over a few more CLI commands that are useful while scheduling tasks:

    • To list all targets for a particular rule, use:
      aws events list-targets-by-rule --rule <rule-name> 

    • To list all rules, use:
      aws events list-rules

    • To remove a target from a specific rule, use:
      aws events remove-targets --rule <rule-name> --ids 2 

    • To delete a rule, use:
      aws events delete-rule --name <rule-name>

    • To view the schedule history for a particular scheduled SQL statement, use:
      aws redshift-data list-statements --status ALL --statement-name <statement-name>

    Retrieving the SQL status and results

    After the schedule is set up, scheduled queries might be listed in the following places:

    • On the Schedules tab of the details page of your cluster.
    • On the scheduled queries list that you can reach from the navigation pane. To see the list, in the navigation pane, choose Queries and Schedule query list.
    • On the Scheduled queries tab of the query editor (see the following screenshot).

    On the Scheduled queries tab of the query editor (see the following screenshot).

    1. Choose the schedule name to see more details about the scheduled query, including details about any previous runs of the schedule.
    2. In the Schedule history section, you can see the ID (which can be used to retrieve SQL statement results), start time, end time, status, and elapsed time for each previous run of the scheduled SQL statement.

    In the Schedule history section, you can see the ID.

    To retrieve the SQL results, you need to use the AWS CLI (or AWS SDK), but you have to assume the role you created earlier.

    1. To assume this role, run the following command on the command line using the IAM user you configured.
      aws sts assume-role --role-arn "arn:aws:iam::{Account_ID}:role/{Role_Name}" --role-session-name AWSCLI-Session

    The command returns a result set similar to the following code:

    {
        "Credentials": {
            "AccessKeyId": "XXXXXXXXXXXX",
            "SecretAccessKey": "XXXXXXXXXXXXXXXXX",
            "SessionToken": "xxxxxxxxxxxxxxxxxxxxx”
            "Expiration": "2020-10-09T17:13:23+00:00"
        },
        "AssumedRoleUser": {
            "AssumedRoleId": "XXXXXXXXXXXXXXXXX:AWSCLI-Session",
            "Arn": "arn:aws:sts::{Account_ID}:assumed-role/{Role_Name}/AWSCLI-Session"
        }
    }
    1. Create three environment variables to assume the IAM role by running the following commands. Use the access key, secret access key, and the session token from the preceding results.
      export AWS_ACCESS_KEY_ID=RoleAccessKeyID
      export AWS_SECRET_ACCESS_KEY=RoleSecretKey
      export AWS_SESSION_TOKEN=RoleSessionToken

    1. Run the following command to retrieve the results of the SQL statement. Replace the ID with the ID from the schedule history on the console.
      aws redshift-data get-statement-result --id xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx --region us-east-1 

    Conclusion

    The ability to schedule SQL statements using the Amazon Redshift Data API and EventBridge simplifies running routine tasks that previously required scripting. You can configure schedules and manage them either via the console or the AWS CLI. You can also see the previous runs of any scheduled SQL statements directly from the console and choose to be notified when it runs.


    About the Authors

    Sain Das is an Analytics Specialist Solutions Architect at AWS and helps customers build scalable cloud solutions that help turn data into actionable insights.

     

     

     

    Vijetha Parampally Vijayakumar is a full stack software development engineer with Amazon Redshift. She is specialized in building applications for Big data, Databases and Analytics.

     

     

     

    André Dias is a Systems Development Engineer working in the Amazon Redshift team. André’s passionate about learning and building new AWS Services and has worked in the Redshift Data API. He specializes in building highly available and cost-effective infrastructure using AWS.

    Dream11’s journey to building their Data Highway on AWS

    Post Syndicated from Pradip Thoke original https://aws.amazon.com/blogs/big-data/dream11s-journey-to-building-their-data-highway-on-aws/

    This is a guest post co-authored by Pradip Thoke of Dream11. In their own words, “Dream11, the flagship brand of Dream Sports, is India’s biggest fantasy sports platform, with more than 100 million users. We have infused the latest technologies of analytics, machine learning, social networks, and media technologies to enhance our users’ experience. Dream11 is the epitome of the Indian sports technology revolution.”

    Since inception, Dream11 has been a data-driven sports technology brand. 

    Since inception, Dream11 has been a data-driven sports technology brand. The systems that power Dream11, including their transactional data warehouse, run on AWS. As Dream11 hosts fantasy sports contests that are joined by millions of Indian sports fans, they have large volumes of transactional data that is organized in a well-defined Amazon Redshift data warehouse. Previously they were using 3rd party services to collect, analyze and build models over user interaction data combined with transactional data. Although this approach was convenient, it presented certain critical issues:

    • The approach wasn’t conducive to 360-degree user analytics. Dream11’s user interactions data wasn’t present on the cloud, where the rest of Dream11’s infrastructure and data were present (AWS, in this case). To get a complete picture of a user’s experience and journey, the user’s interaction data (client events) needs to be analyzed alongside their transactional data (server events). This is known as 360-degree user analytics.
    • It wasn’t possible to get accurate user journey funnel reports. Currently, there are limitations with every tool available on the market with respect to identifying and mapping a given user’s actions across multiple platforms (on the web, iOS, or Android), as well as multiple related apps. This use case is specifically important if your company is a parent to other companies.
    • The statistics on user behavior that Dream11 was getting weren’t as accurate as they wanted. Some of the popular services they were using for web & mobile analytics use the technique of sampling to be able to deal with high volumes of data. Although this is a well-regarded technique to deal with high volumes of data and provides reasonable accuracy in multiple cases, Dream11 wanted statistics to be as accurate as possible.
    • The analytics wasn’t real-time. Dream11 experiences intense use by their users just before and during the real-life sports matches, so real-time and near-real-time analytics is very critical for them. This need wasn’t sufficiently met by the plethora of services they were using.
    • Their approach was leading to high cost for custom analytics for Dream11’s user interactions data, consisting of hundreds of event types. Serverless query engines typically charge by the amount of data scanned and so it can get very expensive if events data isn’t organized properly in separate tables in a data lake to enable selective access.

    All these concerns and needs, led Dream11 to conclude that they needed their own centralized 360-degree analytics platform.

    All these concerns and needs, led Dream11 to conclude that they needed their own centralized 360-degree analytics platform. Therefore, they embarked on the Data Highway project on AWS.

    This project has additional advantages. It is increasingly becoming important to store and process data securely. Having everything in-house can help Dream11 with data security and data privacy objectives. The platform enables 360-degree customer analytics, which further allows Dream11 to do intelligent user segmentation in-house and share only those segments (without exposing underlying transactional or interactions data) with third-party messaging service providers. 

    Design goals

    Dream11 had the following design goals for the project:

    • The system should be easy to maintain and should be able to handle a very high volume of data, consisting of billions of events and terabytes of data daily.
    • The cost should be low and should be pay-as-you-go.
    • Dream11’s web and mobile developers regularly create new types of events to capture new types of interactions. Whenever they add new types of events, they should be immediately available in the system for analytics, and their statistics should immediately reflect in relevant dashboards and reports without any human intervention.
    • Certain types of statistics (such as concurrency) should be available in real-time and near-real time—within 5 minutes or less.
    • Dream11 should be able to use custom logic to calculate key statistics. The analytics should be accurate—no more sampling.
    • The data for various events should be neatly organized in separate tables and analytics-friendly file formats.
    • Although Dream11 will have a common data lake, they shouldn’t be constrained to use a single analytics engine for all types of analytics. Different types of analytics engines excel for different types of queries.
    • The Product Management team should have access to views they commonly use in their decision-making process, such as funnels and user flow diagrams.
    • The system should be extensible by adding lanes in the system. Lanes allow you to reuse your basic setup without mixing events data for different business units. It also potentially allows you to study user behavior across different apps.
    • The system should be able to build 360-degree user profiles
    • The system should provide alerting on important changes to key business metrics.
    • Last but not the least, the system should be secure and reliable with 6 nines of availability guarantee.

    Data Highway architecture

    In less than 3 months, Dream11’s data team built a system that met all the aforementioned goals. The following diagram shows the high-level architecture.

    The following diagram shows the high-level architecture.

    For this project, they used the following components:

    The rest of this post explains the various design choices and trade-offs made by the Dream11’s data engineers. 

    Event ingestion, segregation, and organization

    Dream11 has several hundred event types. These events have common attributes and specific attributes. The following diagram shows the logical structure of these events.

    The following diagram shows the logical structure of these events.

    When the front end receives an event, it saves fields up to common attributes into a message and posts it to Kafka_AllEvents_CommonAttributes. This Kafka topic is the source for the following systems:

    • Apache HBase on Amazon EMR – Provides real-time concurrency analytics
    • Apache Druid – Provides near real-time dimensional analytics
    • Amazon Redshift – Provides session analytics

    The front end also saves events, as they are, into Kafka_AllEvents_AllAttributes. These events are further picked by Apache Ni-Fi, which forwards them to their respective topics. Apache Ni-Fi supports data routing, transformation, and system mediation logic using powerful and scalable directed graphs. Data is transformed and published to Kafka by using a combination of RouteOnAttribute and JoltTransformJSON processors (to parse JSON). Apache Ni-Fi basically reads event names and posts to the Kafka topic with matching names. If Kafka doesn’t have a topic with that name, it creates a new topic with that name. You can configure your Kafka brokers to auto-create a topic when a message is received for a non-existent topic.

    The following diagram illustrates the Amazon S3 sink connector per Kafka topic.

      The following diagram illustrates the Amazon S3 sink connector per Kafka topic.

    The following diagram summarizes the overall design of the system for event ingestion, segregation, and organization.

     

    The following diagram summarizes the overall design of the system for event ingestion, segregation, and organization.

    Storage, cataloging, ETL, and scheduling

    In this section, we discuss how Dream11 updates their AWS Glue Data Catalog, performs extract, transform, and load (ETL) jobs with Amazon EMR Presto, and uses Apache Airflow for schedule management.

    Updating the AWS Glue Data Catalog with metadata for the target table

    The AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats. It provides out-of-the-box integration with Amazon S3, Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Amazon Redshift Spectrum, Athena, Amazon EMR, and any application compatible with the Apache Hive metastore. You can create your table definitions one time and query across engines. For more information, see FAQ: Upgrading to the AWS Glue Data Catalog.

    Because this Data Catalog is accessible from multiple services that were going to be used for the Data Highway project, Dream11 decided to use it to register all the table definitions.

    Registering tables with AWS Glue Data Catalog is easy. You can use an AWS Glue crawler. It can infer schema from files in Amazon S3 and register a table in the Data Catalog. It works quite well, but Dream11 needed additional actions, such as automatically configuring Kafka Amazon S3 sink connectors etc. Therefore, they developed two Python based crawlers.

    The first Python based crawler runs every 2 hours and looks up Kafka topics. If it finds a new topic, it configures a Kafka Amazon S3 connector sink to dump its data to Amazon S3 every 30 minutes in JSON Gzip format. It also registers a table with Glue Data Catalog so that users can query the JSON data directly, if needed. 

    The second Python based crawler runs once a day and registers a corresponding table for each new table created that day to hold flattened data (Parquet, Snappy). It infers schemas and registers tables with the Data Catalog using its Table API. It adds customization needed by the Dream11 team to the metadata. It then creates Amazon EMR Presto ETL jobs to convert JSON, Gzip data to Parquet, Snappy, and registers them with Apache Airflow to run every 24 hours.

    ETL with Amazon EMR Presto

    Dream11 has a multi node, long-running, multi-purpose EMR cluster. They decided to run scheduled ETL jobs on it for the Data Highway project.

    ETL for an event table involves a simple SELECT FROM -> INSERT INTO command to convert JSON (Gzip) to Parquet (Snappy). Converted data takes up to 70% less space, results in 10 times improvement in Athena query performance. ETL happens once a day. Tables are partitioned by day.

    Data received on Kafka_AllEvents_CommonAttributes topic is loaded to Redshift. ETL involves SELECT FROM -> INSERT INTO to convert JSON (Gzip) to CSV, followed by Amazon Redshift COPY.

    Apache Airflow for schedule management

    Apache Airflow is an open-source tool for authoring and orchestrating big data workflows. With Apache Airflow, data engineers define direct acyclic graphs (DAGs). DAGs describe how to run a workflow and are written in Python. Workflows are designed as a DAG that groups tasks that run independently. The DAG keeps track of the relationships and dependencies between tasks.

    Dream11 uses Apache Airflow to schedule Python scripts and over few hundred ETL jobs on Amazon EMR Presto to convert JSON (Gzip) data for over few hundred events to Parquet (Snappy) format, and converts JSON data containing common attributes for all events to CSV before loading to Amazon Redshift. For more information, see Orchestrate big data workflows with Apache Airflow, Genie, and Amazon EMR: Part 1.

    The following diagram shows the workflow to connect Apache Airflow to Amazon EMR.

    The following diagram shows the workflow to connect Apache Airflow to Amazon EMR.

    The following diagram summarizes the overall design of the system for storage, cataloging, ETL, and scheduling.

    The following diagram summarizes the overall design of the system for storage, cataloging, ETL, and scheduling. 

    Real-time and near-real-time analytics

    In this section, we discuss the real-time and near-real-time analytics performed on Dream11’s data.

    Concurrency analytics with Apache Druid

    Apache Druid is an OLAP-style data store. It computes facts and metrics against various dimensions while data is being loaded. This avoids the need to compute results when a query is run.

    Dream11’s web and mobile events are loaded from the Kafka_AllEvents_CommonAttributes topic to Apache Druid with the help of the Apache Druid Kafka indexing service. Dream11 has a dashboard with different granularity levels and dimensions such as app version, org, and other dimensions present in the common event attributes list.

    Finding active users with Amazon EMR HBase

    Dream11 also needs to identify individual active users at any given time or during a given window. This is required by other downstream teams such as the Data Science team and Digital User Engagement team.

    With the help of a Java consumer, they push all events from the Kafka_AllEvents_ CommonAttributes topic to HBase on an EMR cluster with just required user dimensions. They can query the data in HBase with SQL syntax supported by the Apache Phoenix interface. 

    Session analytics with Amazon Redshift

    Dream11 maintains their transactional data warehouse on Amazon Redshift multi node cluster. Amazon Redshift allows them to run complex SQL queries efficiently. Amazon Redshift would have been a natural choice for event analytics for hundreds of event types. However, in Dream11’s case, events data grows very rapidly and this would be a lot of data in Amazon Redshift. Also, this data loses its value rapidly as time passes (relatively speaking) compared with transactional data. Therefore, they decided to do only session analytics in Amazon Redshift to benefit from its complex SQL query capabilities and to do analytics for individual events with the help of Athena (which we discuss in the next section).

    Data received on Kafka_AllEvents_CommonAttributes is loaded into Amazon S3 every 30 minutes by the associated kafka connector sink. This data is in JSON format with Gzip compression. Every 24 hours, a job runs on Amazon EMR Presto that flattens this data into CSV format. The data is loaded into Amazon Redshift with the COPY command. The data gets loaded first into a staging table. Data in the staging table is aggregated to get sessions data. Amazon Redshift already has transactional data from other tables that, combined now with the session data, allows Dream11 to perform 360-degree user analytics. They can now easily segment users based on their interactions data and transactions data. They can then run campaigns for those users with the help of messaging platforms. 

    Event analytics with Athena

    Dream11 uses Athena to analyze the data in Amazon S3. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. It made perfect sense to organize data for over hundreds of event tables in Amazon S3 and analyze them with Athena on demand.

    With Athena, you’re charged based on the amount of data scanned by each query. You can get significant cost savings and performance gains by compressing, partitioning, or converting your data to a columnar format, because each of those operations reduces the amount of data that Athena needs to scan to run a query. For more information, see Top 10 Performance Tuning Tips for Amazon Athena.

    As discussed before, Dream11 has registered over hundreds of tables for events data in JSON format, and similar number of tables for events data in Parquet format with the AWS Glue Data Catalog. They observed a performance gain of 10 times on conversion of data format to Parquet, and an 80% savings in space. Data in Amazon S3 can be queried directly through the Athena UI with SQL queries. The other option they use is connecting to Athena using a JDBC driver from Looker and their custom Java UI for the Data Aware project.

    Athena helps Dream11 produce funnel analytics and user path analytics reports and visualizations.

      Athena helps Dream11 produce funnel analytics and user path analytics reports and visualizations.

     The following diagram summarizes the overall design of the system for real-time and near-real-time analytics and visualization.

     The following diagram summarizes the overall design of the system for real-time and near-real-time analytics and visualization.

     

    Conclusion

    This architecture has enabled Dream11 to achieve all the design goals they set out with. Results of analytics for real-time requirements are available under millisecond latency, and the system costs 40% less than the previous system. Analytics is performed with all the data without sampling, so results are accurate and reliable. All the data and analytics engines are within Dream11’s AWS account, improving data security and privacy.

    As of this writing, the system handles 14 TB of data per day and it has served 80 million requests per minute at peak during Dream11 IPL 2020.

    Doing all their analytics in-house on AWS has not just improved speed, accuracy, and data security, it has also enabled newer possibilities. Now Dream11 has a 360-degree view of their users. They can study their users’ progress across multiple platforms – web, Android, and IOS. This new system is enabling novel applications of machine learning, digital user engagement, and social media technologies at Dream11.


    About the Authors

    Pradip Thoke is a AVP Data Engineering at Dream11 and leads their Data Engineering team. The team involved in this implementation includes Vikas Gite, Salman Dhariwala, Naincy Suman, Lavanya Pulijala, Ruturaj Bhokre, Dhanraj Gaikwad, Vishal Verma, Hitesh Bansal, Sandesh Shingare, Renu Yadav, Yash Anand, Akshay Rochwani, Alokh P, Sunaim and Nandeesh Bijoor.

     

    Girish Patil is a Principal Architect AI, Big Data, India Scale Apps for Amazon.

    Building high-quality benchmark tests for Amazon Redshift using Apache JMeter

    Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/building-high-quality-benchmark-tests-for-amazon-redshift-using-apache-jmeter/

    In the introductory post of this series, we discussed benchmarking benefits and best practices common across different open-source benchmarking tools. As a reminder of why benchmarking is important, Amazon Redshift allows you to scale storage and compute independently, and for you to choose an appropriately balanced compute layer, you need to profile the compute requirements of various production workloads. Existing Amazon Redshift customers also desire an approach to scale up with eyes wide open, and benchmarking different Amazon Redshift cluster configurations against various production workloads can help you appropriately accommodate workload expansion. In addition, you may also use benchmark tests to proactively monitor a production cluster’s performance in real time.

    For prospective Amazon Redshift customers, benchmarking Amazon Redshift is often one of the main components of evaluation and a key source of insight into the price-to-performance ratio of different Amazon Redshift configurations.

    Open-source tools, with their cost-efficiency and vendor neutrality, are often the preferred choice for profiling your production workloads and benchmark tests. However, best practices for using these tools are scarce, possibly resulting in flawed compute profiles, flawed benchmark results, customer frustration, or bloated timelines.

    As mentioned, this series is divided into multiple installments, with the first installment discussing general best practices for benchmarking, and the subsequent installments discussing the strengths and challenges with different open-source tools such as SQLWorkbench, psql, and Apache JMeter. In this post, we discuss benchmarking Amazon Redshift with the Apache JMeter open-source tool.

    One final point before we get started: there is a lot that could be said about benchmarking—more than can be accommodated in a single post. Analytics Specialists Solutions Architects such as myself frequently and happily engage with current and prospective customers to help you evaluate your benchmarking strategy and approach at no charge. I highly recommend you take advantage of that benefit by reaching out to your AWS account SA.

    Apache JMeter

    Apache JMeter is an open-source load testing application written in Java that you can use to load test web applications, backend server applications, databases, and more. You can run it on Windows and a number of different Linux/UNIX systems; for this post we run it in a Windows environment. To install Apache JMeter on a Windows EC2 machine, complete the following steps:

    1. Launch a Windows EC2 instance using a Windows Server AMI (such as Microsoft Windows Server 2019 Base).
    2. Connect via RDP to the Windows EC2 Instance (RDP for macOS can be downloaded from Apple’s App Store).
    3. Download and unzip the Apache JMeter .zip file from the Apache JMeter download page.
    4. Download the Redshift JDBC driver and add the driver .jar file to JMeter’s /lib When setting up the JDBC connection in the JMeter GUI, use com.amazon.redshift.jdbc.Driver as the driver class name).
    5. Download the Apache Plugins Manager .jar file to JMeter’s /lib/ext The Apache Plugins Manager enables additional crucial functionality in Apache JMeter for benchmark testing (such as Ultimate Thread Group).
    6. Increase the JVM heap size for Apache JMeter by changing the corresponding JVM parameters in the jmeter.bat file located in the Apache JMeter /bin folder. For example, see the following code:
      edit C:\Dev\apache-jmeter-5.1.1\bin\jmeter.bat rem set HEAP=-Xms1g -Xmx1g -XX:MaxMetaspaceSize=256m set HEAP=-Xms5g -Xmx5g -XX:MaxMetaspaceSize=1g

    1. Choose the jmeter.bat file (double-click) to start Apache JMeter.

    Apache JMeter supports both GUI and CLI modes, and although you may find the Apache JMeter GUI straightforward with a relatively small learning curve, it’s highly recommended that you use the Apache JMeter GUI primarily for defining benchmark tests, and perhaps running small-to-medium-sized benchmark tests. For large load tests, it’s highly recommended that you use the Apache JMeter CLI to minimize the risk of the Apache JMeter GUI exhausting its host’s compute resources, causing it to enter a non-responsive state or fail with an out-of-memory error. Using the CLI for large load tests also helps minimize any impact on the benchmark results.

    In the following example, I demonstrate creating a straightforward load test using both the Apache JMeter GUI and CLI. The load test aims to measure query throughput while simulating 50 concurrent users with the following personas:

    • 20 users submit only small queries, which are of low complexity and typically have a runtime of 0–30 seconds in the current system, such as business intelligence analyst queries
    • 20 users submit only medium queries, which are of moderate complexity and typically have a runtime of 31–300 seconds in the current system, such as data engineer queries
    • 10 users submit only large queries, which are very complex and typically have a runtime over 5 minutes in the current system, such as data scientist queries

    The load test is configured to run for 15 minutes, which is a pretty short test duration, so you can increase that setting to 30 minutes or more. We rely on JMeter’s query throughput calculation, but we can also manually compute query throughput from the runtime metadata that is gathered if we so desire.

    For this post, I skip over discussing the possible Amazon Redshift cluster tweaks that you could use to squeeze every drop of performance out of Amazon Redshift, and instead rely on the strength of its default state to be optimized to achieve excellent query throughput on diverse workloads.

    Apache JMeter has a number of building blocks, such as thread groups, that can be used to define a wide variety of benchmark tests, and each building block can have a number of community implementations (for example, Arrivals Thread Group or Ultimate Thread Group).

    The following diagram provides a basic illustration of the various Apache JMeter building blocks to be leveraged in this load test, how they interact with each other, and the typical order in which are they created; in some cases, I mention the specific implementation of the building block to be used in parenthesis (such as Ultimate Thread Group).

    The following table delves deeper into the purpose that each building block serves in our load test.

    Apache JMeter Component Purpose
    Test Plan Represents an atomic test case (simulate 50 users concurrently querying a Redshift cluster with twice the baseline node count)
    JDBC Connection Configuration Represents all the JDBC information needed to connect to the Amazon Redshift cluster (such as JDBC URL, username, and password)
    User Defined Variables A collection of key-value pairs that can be used as parameters throughout the test plan and make it easier to maintain or change the test behavior
    Listener Captures and displays or writes test output such as SQL result sets
    Thread Group A simulated group of users that perform the test function (submit a SQL query)
    JDBC Request The action to be taken by the simulated users (SQL query text)

    Apache JMeter (GUI)

    The following screenshot is the resulting load test.

    The following screenshot is the resulting load test.

    The following screenshot provides a close up of the building block tree.

    The following screenshot provides a close up of the building block tree.

    In the following sections, we examine each building block in greater detail.

    Test Plan

    The test plan serves as the parent container for our entire benchmark test, and we can change its name in the visual tree that appears in the Apache JMeter GUI by editing the Name field.

    I take advantage of the User Defined Variables section to set my own custom variables that hold values needed by all components in the test case, such as the JDBC URL, test duration, and number of users submitting small, medium, and large queries. The baseDir variable is actually a variable that is intended to be embedded in other variables, rather than directly referenced by other test components. I left all other settings at their default on this page.

    The baseDir variable is actually a variable that is intended to be embedded in other variables, rather than directly referenced by other test components.

    JDBC Connection Configuration

    We use the JDBC Connection Configuration building block to create a database connection pool that is used by the simulated users to submit queries to Amazon Redshift. The value specified in Variable Name for created pool is the identifier that is used to reference this connection pool in other JMeter building blocks. In this example, I named it RedshiftJDBCConfig.

    By setting the Max Number of Connections to 0, the connection pool can grow as large as it needs to. That may not be the desired behavior for all test scenarios, so be sure to set it as you see fit.

    In the Init SQL statements section, I provide an example of how to use SQL to disable the result set cache in Amazon Redshift for every connection created, or perform other similar initialization code.

    Towards the end, I input the database JDBC URL (which is actually a variable reference to a variable defined in the test plan), JDBC driver class name, and database username and password. I left all other fields at their default on this page.

    I left all other fields at their default on this page.

    User Defined Variables

    You can add a User Defined Variables building block in several places, and it’s best to use this capability to limit the scope of each variable. For this post, we use an instance of the User Defined Variables building block to hold the output file names of each listener in this test plan (if you look closely, you can see the values of these variables reference the baseDir variable, which was defined in our test plan). You can also notice three other instances of the User Defined Variables building block for the small, medium, and large thread groups—again so that the scope of variables is kept appropriately narrow.

    You can add a User Defined Variables building block in several places, and it’s best to use this capability to limit the scope of each variable.

    Listeners

    Listeners control where test output is written and how it’s processed. There are many different kinds of listeners that, for example, allow you to capture your test output as a tree, table, or graph. Other listeners can summarize and aggregate test metadata (such as the number of test samples submitted during the test). I choose to add several listeners in this test plan just for demonstration, but I have found the listeners Aggregate Report and View Results in Table to be most helpful to me. The following screenshot shows the View Results in Table output.

    The following screenshot shows the View Results in Table output.

    The following screenshot shows the Aggregate Report output.

    The following screenshot shows the Aggregate Report output.

    You can also save output from listeners after a test run to a different file through the JMeter menu.

    Thread group: Ultimate Thread Group

    A thread group can be thought of as a group of simulated users, which is why for this post, I create three separate thread groups: one to represent each of three previously mentioned user personas being simulated (small, medium, and large). Each thread group is named accordingly.

    We use the Thread Schedule section to control how many users should be created and at what time interval. In this test, I chose to have all 20 small users created at start time without any delays. This is achieved by a one-row entry in the Thread Schedule and setting the Start Threads Count thread group property to 20 users (or the matching variable, as we do in the following screenshot).

    We use the Thread Schedule section to control how many users should be created and at what time interval.

    Alternatively, I could stagger user creation by creating multiple rows and setting the Initial Delay sec field to control each row’s startup delay. With the row entries in the following screenshot, an additional five users are created every 5 seconds.

    With the row entries in the following screenshot, an additional five users are created every 5 seconds.

    Thread group: User Defined Variables

    An additional User Defined Variables instance is added to each of the three thread groups to hold the variables in their individual scope, or that would preferably be configurable at an individual thread group level. For this post, I make the JDBC Connection Configuration a variable so that it’s customizable for each individual thread group (JDBC_Variable_Name_In_Pool). This allows me to, for example, rapidly switch two different test clusters.

    An additional User Defined Variables instance is added to each of the three thread groups to hold the variables in their individual scope.

    JDBC Request

    The JDBC Request can be thought of as the benchmark query or SQL test query to be submitted non-stop by each simulated user in this thread group. To configure this JDBC Request, I specified the appropriate JDBC Connection Configuration and some very simple test SQL. I could have also used Apache JMeter’s ability to parameterize queries so that they vary from one iteration to another using a predetermined set of parameter values. For example, for the SQL statement select * from customer where cust_id=<some value>, Apache JMeter could be configured to set the value in the filter clause to a randomly chosen value from a pre-compiled list of filter values for each sample submission. I left all other settings at their default.

    The JDBC Request can be thought of as the benchmark query or SQL test query to be submitted non-stop by each simulated user in this thread group.

    Apache JMeter (CLI)

    The Apache JMeter GUI saves test plans in .jmx files that can be used to run the same test plan in Apache JMeter’s console mode. The following CLI command demonstrates how you can use the LoadTestExample.jmx file that was created in the previous steps using the GUI to run the same load test:

    > <jmeter_install_dir>\bin\jmeter -n -t LoadTestExample.jmx -e -l test.out

    The sample output is from a 30-second run of LoadTestExample.jmx.

    The sample output is from a 30-second run of LoadTestExample.jmx.

    After the test has completed, several output files are created, such as a JMeter application log, query output files from the listeners (if any), and test statistics from listeners (if any). For this post, the statistical metrics captured for the test run are located in a JSON file inside the report-output directory. See the following screenshot.

    For this post, the statistical metrics captured for the test run are located in a JSON file inside the report-output directory.

    The \report-output\statistics.json file captures a lot of useful metrics, such as the total samples (like SQL queries) submitted during the test duration, achieved query throughput, and number of small, medium, and large queries and their individual throughput. The following screenshot shows a sampling of the data from statistics.json.

    The following screenshot shows a sampling of the data from statistics.json.

    Conclusion

    In this series of posts, we discussed several recommended best practices for conducting high-quality benchmark tests. Some of the best practices represented core principles that span all the open-source tools discussed (such as consistency in testing methodology). In this particular post, we reviewed the strengths and appropriateness of Apache JMeter for conducting benchmark tests. I hope this series has been helpful, and strongly encourage current and prospective customers to reach out to me or other AWS colleagues if you wish to delve deeper.


    About the Author

    Asser Moustafa is an Analytics Specialist Solutions Architect at AWS based out of Dallas, Texas. He advises customers in the Americas on their Amazon Redshift and data lake architectures and migrations, starting from the POC stage to actual production deployment and maintenance

    How FanDuel Group secures personally identifiable information in a data lake using AWS Lake Formation

    Post Syndicated from Damian Grech original https://aws.amazon.com/blogs/big-data/how-fanduel-group-secures-personally-identifiable-information-in-a-data-lake-using-aws-lake-formation/

    This post is co-written with Damian Grech from FanDuel

    FanDuel Group is an innovative sports-tech entertainment company that is changing the way consumers engage with their favorite sports, teams, and leagues. The premier gaming destination in the US, FanDuel Group consists of a portfolio of leading brands across gaming, sports betting, daily fantasy sports, advance-deposit wagering, and TV/media, including FanDuel, Betfair US, and TVG. FanDuel Group has a presence across 50 states and over 8.5 million customers. The company is based in New York with offices in California, New Jersey, Florida, Oregon, and Scotland. FanDuel Group is a subsidiary of Flutter Entertainment plc, the world’s largest sports betting and gaming operator with a portfolio of globally recognized brands and a constituent of the FTSE 100 index of the London Stock Exchange.

    In this post, we discuss how FanDuel used AWS Lake Formation and Amazon Redshift Spectrum to restrict access to personally identifiable information (PII) in their data lake.

    The challenge

    In 2018, a series of mergers led to the creation of FanDuel Group, and the combined data engineering team found themselves operating three data warehouses running on Amazon Redshift. The team decided to create a new single platform to replace the three separate warehouses, consisting of a data warehouse containing the core business data model and a data lake to catalog and hold all other types of data. FanDuel’s vision was to create an unified data platform that served their data requirements. This included the ability to ingest and organize real-time and batch datasets, and secure and govern PII.

    Because the end-users of the existing data warehouses were familiar with Amazon Redshift, it was critical that they be able to access the data lake using Amazon Redshift. Other important architecture considerations included a simplified user experience, the ability to scale to huge data volumes, and a robust security model to provision relevant data to analysts and data scientists.

    To accomplish the vision, FanDuel decided to modernize the data platform and introduce Amazon Simple Storage Service (Amazon S3)-based data lakes. Data lakes are a logical construct that allows data to be stored in its native format using open data formats. With a data lake architecture, FanDuel can enable data analysts to analyze large volume of data without significant modeling. Also, data lakes allow FanDuel to store structured and unstructured data.

    Some of the data to be stored in the data lake was customer PII, so access to this category of data needed to be carefully restricted to only employees who required access to perform their job functions. To address these security challenges, FanDuel first tested out a tag-based approach on Amazon S3 to restrict access to the PII data. The idea was to write two datasets for a single dataset—one with PII and another without PII—and apply tags for files where PII is stored, securing files using AWS Identity and Access Management (IAM) policies. This approach was complex and needed 100–200 hours of development time for every data source that was ingested.

    Solution overview

    FanDuel decided to use Lake Formation and Redshift Spectrum to solve this challenge. The following architectural diagram shows how FanDuel secured their data lake.

    The solution includes the following steps:

    1. The FanDuel team registered the S3 location in Lake Formation.

    After the location is registered, Lake Formation takes control of the data lake, thereby eliminating the need to set up complicated policies in IAM.

    1. FanDuel built AWS Glue ETL jobs to extract data from sources, including MySQL databases and flat files. They used AWS Glue to cleanse and transform raw data to form refined datasets stored in Parquet-formatted files. They also used AWS Glue crawlers to register the cleansed datasets in the Data Catalog.
    2. The team used Lake Formation to set up column-based permissions using two roles:
      1. LimitedPIIAnalyst – Granted access to all columns. Only analysts who needed access to PII data were assigned this role.
      2. NonPIIAnalyst – Granted access to non-PII columns. By default, analysts using the data lake were assigned this role.
    3. FanDuel created two external schemas using Redshift Spectrum: one using the NonPIIAnalyst role, and one using the LimitedPIIAnalyst The following code is an example of the DDL that uses the role that was set up in Lake Formation:
      CREATE EXTERNAL SCHEMA nonpii_data_lake FROM DATA CATALOG
      DATABASE 'fanduel_data_lake' REGION 'us-east-1'
      IAM_ROLE 'arn:aws:iam::123456789012:role/NonPIIAnalyst';
      
      CREATE EXTERNAL SCHEMA limitedpii_data_lake FROM DATA CATALOG
      DATABASE 'fanduel_data_lake' REGION 'us-east-1'
      IAM_ROLE 'arn:aws:iam::123456789012:role/LimitedPIIAnalyst';
      

    FanDuel could already manage access permissions by adding or removing users from a group in Amazon Redshift, so they already had a group consisting of only the analysts who should be permitted access to PII. The following code grants this group access to the limitedpii_data_lake schema, which effectively means only this group can query the data lake using the LimitedPIIAnalyst role:

    GRANT USAGE ON SCHEMA nonpii_data_lake TO base_group;
    GRANT SELECT ON ALL TABLES IN SCHEMA nonpii_data_lake TO base_group;
    GRANT USAGE ON SCHEMA limitedpii_data_lake TO pii_permitted_group;
    GRANT SELECT ON ALL TABLES IN SCHEMA limitedpii_data_lake TO pii_permitted_group;
    

    Benefits

    The ability to extend queries to the data lake with Redshift Spectrum and have column-level access control provides superior control over the S3 tag-based permissions approach that was originally considered. This architecture provided the following benefits for FanDuel:

    • FanDuel could offer new capabilities to data analysts. For example, data analysts could quickly access raw data with PII and combine it with existing data in Amazon Redshift. Lake Formation provided a single view for monitoring the data access patterns.
    • Lake Formation column-level access control allowed them to secure PII data, which otherwise would have taken a complex S3 tag-based approach. This saved 100–200 hours of development time for every new data source and data footprint, because the original approach required creating two files (one with PII and another without PII), tagging files, and setting up permissions based on tags.
    • The ability to extend access from Amazon Redshift to the data lake with appropriate access control has allowed FanDuel to reduce data stored in Amazon Redshift.

    Conclusion

    FanDuel will leverage its new data platform to ingest additional data sources with real-time data so analysts and data scientists can gain insights and improve customer experience.

    Questions or feedback? Send an email to [email protected].


    About the Authors

    Damian Grech is a Data Engineering Senior Manager at FanDuel. Damian has over 15 years of experience in software delivery and has worked with organizations ranging from large enterprises to start-ups at their infant stages. In his spare time, you can find him either experimenting in the kitchen or trailing the Scottish Highlands.

     

     

    Shiv Narayanan is Global Business Development Manager for Data Lakes and Analytics solutions at AWS. He works with AWS customers across the globe to strategize, build, develop and deploy modern data platforms. Shiv loves music, travel, food and trying out new tech.

     

     

     

    Sidhanth Muralidhar is a Senior Technical Account Manager at Amazon Web Services. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them in their cloud journey. In his spare time, he loves to play and watch football.