Tag Archives: Amazon Redshift

Query your Amazon Redshift cluster with the new Query Editor

Post Syndicated from Surbhi Dangi original https://aws.amazon.com/blogs/big-data/query-your-amazon-redshift-cluster-with-the-new-query-editor/

Data warehousing is a critical component for analyzing and extracting actionable insights from your data. Amazon Redshift is a fast, scalable data warehouse that makes it cost-effective to analyze all of your data across your data warehouse and data lake.

The Amazon Redshift console recently launched the Query Editor. The Query Editor is an in-browser interface for running SQL queries on Amazon Redshift clusters directly from the AWS Management Console. Using the Query Editor is the most efficient way to run queries on databases hosted by your Amazon Redshift cluster.

After creating your cluster, you can use the Query Editor immediately to run queries on the Amazon Redshift console. It’s a great alternative to connecting to your database with external JDBC/ODBC clients.

In this post, we show how you can run SQL queries for loading data in clusters and monitoring cluster performance directly from the console.

Using the Query Editor instead of your SQL IDE or tool

The Query Editor provides an in-browser interface for running SQL queries on Amazon Redshift clusters. For queries that are run on compute nodes, you can then view the query results and query execution plan next to your queries.

The ability to visualize queries and results in a convenient user interface lets you accomplish many tasks, both as a database administrator and a database developer. The visual Query Editor helps you do the following:

  • Build complex queries.
  • Edit and run queries.
  • Create and edit data.
  • View and export results.
  • Generate EXPLAIN plans on your queries.

With the Query Editor, you can also have multiple SQL tabs open at the same time. Colored syntax, query autocomplete, and single-step query formatting are all an added bonus!

Database administrators typically maintain a repository of commonly used SQL statements that they run regularly. If you have this written in a notepad somewhere, the saved queries feature is for you. This feature lets you save and reuse your commonly run SQL statements in one step. This makes it efficient for you to review, rerun, and modify previously run SQL statements. The Query Editor also has an exporter so that you can export the query results into a CSV format.

The Query Editor lets you perform common tasks, such as creating a schema and table on the cluster and loading data in tables. These common tasks are now possible with a few simple SQL statements that you run directly on the console. You can also do day-to-day administrative tasks from the console. These tasks can include finding long-running queries on the cluster, checking for potential deadlocks with long-running updates on a cluster, and checking for how much space is available in the cluster.

The Query Editor is available in 16 AWS Regions. It’s available on the Amazon Redshift console at no extra cost to you. Standard Amazon Redshift rates apply for your cluster usage and for Amazon Redshift Spectrum. To learn more, see Amazon Redshift pricing.

Let’s get started with the Query Editor

The following sections contain the steps for setting up your Amazon Redshift cluster with a sample dataset from an Amazon S3 bucket using the Query Editor directly from the console. For new users, this is an especially handy alternative to setting up JDBC/ODBC clients to establish a connection to your cluster. If you already have a cluster, you can complete these steps in 10 minutes or less.

In the following example, you use the Query Editor to perform these tasks:

  • Load a sample dataset in your cluster.
  • Run SQL queries on a sample dataset and view results and execution details.
  • Run administration queries on system tables and save frequently used queries.
  • Run SQL queries to join an internal and external table.

Use the following steps to set up your cluster for querying:

  1. On the Amazon Redshift console, create a cluster.For detailed steps, see the procedure described in Launch a Sample Amazon Redshift Cluster in the Amazon Redshift Getting Started Guide. Use any of the following currently supported node types: dc1.8xlarge, dc2.large, dc2.8xlarge, or ds2.8xlarge.For this post, we used the Quick launch cluster button on the Amazon Redshift dashboard to create a single-node, dc2.large cluster called demo-cluster in the us-east-1 Region. As you go through the tutorial, replace this cluster name with the name of the cluster that you launched, and the Region where you launched the cluster.

  1. Add Query Editor-related permissions to the AWS account.To access the Query Editor feature on the console, you need permissions. For detailed steps, see Enabling Access to the Query Editor in the Amazon Redshift Cluster Management Guide.
  1. To load and run queries on a sample dataset (including permissions to load data from S3 or to use the AWS Glue or Amazon Athena Data Catalogs), follow these steps:a. To load sample data from Amazon S3 using the COPY command, you must provide authentication for your cluster to access Amazon S3 on your behalf. Sample data for this procedure is provided in an Amazon S3 bucket that is owned by Amazon Redshift. The bucket permissions are configured to allow all authenticated AWS users read access to the sample data files.To perform this step:

• Attach the AmazonS3ReadOnlyAccess policy to the IAM role. The AmazonS3ReadOnlyAccess policy grants your cluster read-only access to all Amazon S3 buckets.

• If you’re using the AWS Glue Data Catalog, attach the AWSGlueConsoleFullAccess policy to the IAM role. If you’re using the Athena Data Catalog, attach the AmazonAthenaFullAccess policy to the IAM role.

b. In step 2 of the example, you run the COPY command to load the sample data. The COPY command includes a placeholder for the IAM role Amazon Resource Name (ARN). To load sample data, add the role ARN in the COPY The following is a sample COPY command:

COPY myinternalschema.event FROM 's3://aws-redshift-spectrum-sample-data-us-east-1/spectrum/event/allevents_pipe.txt'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-east-1';

After you complete these steps, your Amazon Redshift cluster is ready. The following sections describe three steps that demonstrate what you can do with the Query Editor:

  • Use the Query Editor for loading data.
  • Perform several day-to-day administration tasks.
  • Run a query on data stored in the Amazon Redshift cluster and Amazon S3 data lake, with no need for loading or other data preparation.

Step 1: Connect to your cluster in the Query Editor

To connect to your cluster:

  1. Using the left navigation pane on the Amazon Redshift console, navigate to the Query Editor.
  2. In the Credentials dialog box, in the Cluster drop-down list, choose the cluster name (demo-cluster). Choose the database and the database user for this cluster.
  3. If you created the cluster by using the service-provided default values, choose dev as your Database selection, and enter awsuser in the Database user box.
  4. Enter the password for the cluster. Commonly, Amazon Redshift database users log on by providing a database user name and password. As an alternative, if you don’t remember your password, you can retrieve it in an encrypted format by choosing Create a temporary password, as shown in the following example. For more information, see Using IAM Authentication to Generate Database User Credentials.

This connects to the cluster if you have Query Editor-related permissions for the AWS account. For more information, see the step to add the Query Editor-related permissions to the AWS account in the previous section.

Step 2: Prepare the cluster with a sample dataset

To prepare the cluster with a sample dataset:

  1. Run the following SQL in the Query Editor. This creates the schema myinternalschema in the Amazon Redshift cluster demo-cluster.
/* Create a schema */
CREATE SCHEMA myinternalschema

  1. Run the following SQL statement in the Query Editor to create a table for schema myinternalschema.
/* Create table */
CREATE TABLE myinternalschema.event(
	eventid integer not null distkey,
	venueid smallint not null,
	catid smallint not null,
	dateid smallint not null sortkey,
	eventname varchar(200),
	starttime timestamp);
  1. Run the following SQL statement with the COPY command to copy the sample dataset from Amazon S3 to your Amazon Redshift cluster, demo-cluster, in the us-east-1 The Amazon S3 path for the sample dataset is s3://aws-redshift-spectrum-sample-data-us-east-1/spectrum/event/allevents_pipe.txt.

Before choosing Run Query, remember to replace the placeholder in the example with the ARN for the IAM role that is associated with this AWS account. If your cluster is in another AWS Region, replace the Region in the region parameter and the Amazon S3 path, as shown in the following SQL command:

/* Load data */
COPY myinternalschema.event FROM 's3://aws-redshift-spectrum-sample-data-us-east-1/spectrum/event/allevents_pipe.txt'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-east-1';
  1. To ensure access to the public dataset in Amazon S3, make sure that this AWS account has the correct permissions to access Amazon S3, AWS Glue, and Athena. For more information, see the step to load and run queries on the sample dataset (Amazon S3 and AWS Glue/Amazon Athena Data Catalog permissions) earlier in this post.
  2. To verify the data in the previously created table in the Query Editor, browse through the tables in the schema viewer on the left. Choose the preview icon next to the table name to see the first 10 records from the event table. Choosing this option runs the following query for a preview of the table, displaying 10 rows from the table:
/* View a snippet of the same dataset in myinternalschema */ 
SELECT * FROM myinternalschema.event

You can also enter your own SQL statements. Use Ctrl + Space to autocomplete queries in the Query Editor, to verify the data in the table that you created.

Step 3: Helpful cluster management queries

You are all set to try Amazon Redshift! In day-to-day cluster management and monitoring, you can run the following SQL queries in the Query Editor. These frequently used queries let you find and shut down long-running queries, uncover deadlock situations, and check for available disk space on your Amazon Redshift cluster. Save these queries and get convenient access to them by choosing Saved queries in the left navigation on the console, as shown in the following example:

Kill malfunctioning or long-running queries on a cluster

If there is a malfunctioning query that must be shut down, locating the query can often be a multi-step process. Run the following SQL in the Query Editor to find all queries that are running on an Amazon Redshift cluster with a SQL statement:

/* Queries are currently in progress */ 
 , query
 , pid
 , starttime
 , left(text, 50) as text
FROM pg_catalog.stv_inflight

After locating the malfunctioning queries from the query result set, use the cancel <pid> <msg> command to kill a query. Be sure to use the process ID—pid in the previous SQL—and not the query ID. You can supply an optional message that is returned to the issuer of the query and logged.

Monitor disk space being used on a cluster

One of the most frequently used console functions is monitoring the percentage of disk space used by a cluster. Queries fail if there is limited space in the cluster to create temp tables used while the query is running. Vacuums can also fail if the cluster does not have free space to store intermediate data in the cluster restore process. Monitoring this metric is important for planning ahead before the cluster gets full and you have to resize or add more clusters.

If you suspect that you are experiencing high or full disk usage with Amazon Redshift, run the following SQL in the Query Editor to find disk space available and see individual table sizes on the cluster:

/* Disk space available on your Redshift cluster */
SELECT SUM(used)::float / SUM(capacity) as pct_full
FROM pg_catalog.stv_partitions
/* Find individual table sizes */
SELECT t.name, COUNT(tbl) / 1000.0 as gb
SELECT DISTINCT id, name FROM stv_tbl_perm
) t
JOIN stv_blocklist ON tbl=t.id

From here, you can either drop the unnecessary tables or resize your cluster to have more capacity. For more information, see Resizing Clusters in Amazon Redshift.

Watch for deadlock situations with suspiciously long-running updates on the cluster

If a cluster has a suspiciously long-running update, it might be in a deadlocked transaction. The stv_locks table indicates any transactions that have locks, along with the process ID of the relevant sessions. This pid can be passed to pg_terminate_backend(pid) to kill the offending session.

Run a SQL statement in the Query Editor to inspect the locks:

\/* Find all transactions that have locks along with the process id of the relevant sessions */ 
FROM pg_catalog.stv_locks 
ORDER BY last_update asc

To shut down the session, run select pg_terminate_backend(lock_owner_pid), using the value from stl_locks.

See the rows affected by the most recent vacuums of the cluster

By running a vacuum command on tables in the cluster, any free space because of delete and update operations is reclaimed. At the same time, the data of the table gets sorted. The result is a compact and sorted table, which improves the cluster performance.

Run the following SQL statement to see a count of rows that were deleted or resorted from the most recent vacuums from the svv_vacuum_summary table:

/* Deleted or restored rows from most recent vacuums */
select * from svv_vacuum_summary
where table_name = 'events'

Debug connection issues for Amazon Redshift clusters

Joining stv_sessions and stl_connection_log tables returns a list of all sessions (all connects, authenticates, and disconnects on the cluster) and the respective remote host and port information.

To list all connections, run the following SQL statement in the Query Editor:

/* List connections, along with remote host information */ 
 '' remotehost,
FROM stv_sessions
LEFT JOIN stl_connection_log ON pid = process
  AND starttime > recordtime - interval '1 second'
ORDER BY starttime DESC

Use the saved queries feature to save these commonly used SQL statements in your account and run them in the Query Editor with one click.

Bonus step 4: Query with Amazon Redshift Spectrum

With Amazon Redshift Spectrum, you can query data in Amazon S3 without the need to first load it into Amazon Redshift. Amazon Redshift Spectrum queries employ massive parallelism to quickly process large datasets in S3, without ingesting that data into Amazon Redshift. Much of the processing occurs in the Amazon Redshift Spectrum layer. Multiple clusters can concurrently query the same dataset in Amazon S3 without needing to make copies of the data for each cluster.

To get set up with Amazon Redshift Spectrum, run the following SQL statements in the Query Editor for demo-cluster. If your cluster is in another AWS Region, be sure to replace the Region in the region parameter and the Amazon S3 path in the following SQL statement.

To create a new schema from a data catalog to use with Amazon Redshift Spectrum:

/* Create external (Amazon S3) schema */
from data catalog
database 'myexternaldatabase'
region 'us-east-1'
create external database if not exists;

To create a table for the Amazon Redshift Spectrum S3 sample dataset:

/* Create external table */
CREATE EXTERNAL TABLE myexternalschema.sales(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint, pricepaid decimal(8,1), commission decimal(8,1), saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://aws-redshift-spectrum-sample-data-us-east-1/spectrum/sales/' 
table properties ('numRows'='171000');

Start querying!

This section provides an example scenario to start querying data from the external (Amazon S3) sales table and the internal (Amazon Redshift) event table. The join query in this scenario looks for all events (from the sales dataset loaded on the demo-cluster) with the sale price paid > 50 (from the Amazon Redshift Spectrum dataset in Amazon S3, s3://aws-redshift-spectrum-sample-data-us-east-1/spectrum/sales/).

/* Join a table from the sample dataset with a Spectrum table */
/* Join external (Amazon S3) and internal (Amazon Redshift) table */
    myexternalschema.sales.eventid = myinternalschema.event.eventid       
    and myexternalschema.sales.pricepaid > 50  
group by
order by
    1 desc;

In the Query results section, choose View execution to see the detailed execution plan. The query plan is available for all queries executed on compute nodes.

Note: Queries that do not reference user tables, such as administration queries that only use catalog tables, do not have an available query plan.

Optionally, download the query results to your local disk for offline use. Queries run for up to three minutes in the Query Editor. After a query is completed, the Query Editor provides two minutes to fetch results. Rerun the query and try again if you hit the two-minute threshold.

Load additional tables from the Amazon Redshift sample dataset by using the following SQL statements and get creative with your queries. Before choosing Run query in the Query Editor, remember to add the ARN for the IAM role that is associated with this AWS account in the placeholder in the following SQL statement. If your cluster is in another AWS Region, replace the Region in the region parameter and the Amazon S3 path in the following SQL statement.

copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
delimiter '|' region 'us-west-2';

copy venue from 's3://awssampledbuswest2/tickit/venue_pipe.txt' 
delimiter '|' region 'us-west-2';

copy category from 's3://awssampledbuswest2/tickit/category_pipe.txt' 
delimiter '|' region 'us-west-2';

copy date from 's3://awssampledbuswest2/tickit/date2008_pipe.txt' 
delimiter '|' region 'us-west-2';

copy listing from 's3://awssampledbuswest2/tickit/listings_pipe.txt' 
delimiter '|' region 'us-west-2';

copy sales from 's3://awssampledbuswest2/tickit/sales_tab.txt'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-west-2';


In this post, we introduced the Query Editor, an in-browser interface for running SQL queries on Amazon Redshift clusters. We showed how you can use it to run SQL queries for loading data in clusters and monitoring cluster performance directly on the console. To learn more about Amazon Redshift and start with Query Editor, visit the Amazon Redshift webpage.

If you like this feature, share your feedback by using the Send feedback link on the console, as shown following.

If you have any questions or suggestions, please leave a comment below.

Happy querying!


About the Authors

Surbhi Dangi is a senior product/design manager at AWS. Her work includes building user experiences for Database, Analytics & AI AWS consoles, launching new database and analytics products, working on new feature launches for existing products, and building broadly adopted internal tools for AWS teams. She enjoys traveling to new destinations to discover new cultures, trying new cuisines, and teaches product management 101 to aspiring PMs.



Raja Bhogi is an engineering manager at AWS. He is responsible for building delightful and easy-to-use web experiences for analytics and blockchain products. His work includes launching web experiences for new analytics products, and working on new feature launches for existing products. He is passionate about web technologies, performance insights, and tuning. He is a thrill seeker and enjoys everything from roller coasters to bungy jumping.




Federate Amazon Redshift access with Okta as an identity provider

Post Syndicated from Rajiv Gupta original https://aws.amazon.com/blogs/big-data/federate-amazon-redshift-access-with-okta-as-an-identity-provider/

Managing database users and access can be a daunting and error-prone task. In the past, database administrators had to determine which groups a user belongs to and which objects a user/group is authorized to use. These lists were maintained within the database and could easily get disjointed from the corporate directory.

With federation, you can manage users and groups within the enterprise identity provider (IdP) and pass them to Amazon Redshift at login. In a previous post, Federate Database User Authentication Easily with IAM and Amazon Redshift, I discussed the internals of the federation workflow using Active Directory Federation Service (AD FS) as our identity provider.

In this post, I focus on Okta as the identity provider. I provide step-by-step guidance showing how you can set up a trial Okta.com account, build users and groups within your organization’s directory, and enable single sign-on (SSO) into Amazon Redshift. You can do all of this while also maintaining group-level access controls within your data warehouse.

The steps in this post are structured into the following sections:

  • Identity provider (Okta) configuration – You set up Okta, which contains your users organized into logical groups.
  • AWS configuration – You set up a role that establishes a trust relationship between your identity provider and AWS and a role that Okta uses to access Amazon Redshift.
  • Identity provider (Okta) advanced configuration – You finalize the Okta configuration by inputting the roles that you just created. You also inform Okta about which groups are allowed to be passed to Amazon Redshift.
  • Amazon Redshift server/client setup – You set up groups within the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables. Finally, you set up your client tools to use your enterprise credentials and sign in to Amazon Redshift.

Identity provider (Okta) configuration

In this first step, you set up Okta, add users, and organize them into logical groups. You then add the Amazon Web Services Redshift Okta application.

Step 1: Create an Okta account

If you don’t already have access to an Okta account, you can start a 30-day free trial: https://www.okta.com/free-trial/.

Step 2: Set up your Okta directory

Sign in to Okta.com using the following URL, where <prefix> is specific to your account and was created at account setup:


Navigate to the Directory page to add people and groups into Okta that match your organization. Be sure to use lowercase group names because Amazon Redshift expects the group names to be lowercase.

In the following example, I added three users and two groups, where one of the users (Jorge) belongs to both the “sales” and “marketing” groups.

First, choose Admin in the upper-right corner.

To add users, choose Add Person. The following example shows the users that were created.

To add groups into Okta, choose Add Group. The following example shows three groups.

Step 3: Add the “Amazon Web Services Redshift” Okta application

Navigate to the Applications page. Choose Add Application, and search for the Amazon Web Services Redshift application. Proceed with the default settings.

Step 4: Download the Okta application metadata

Make sure that you have navigated to the Amazon Web Services Redshift application’s settings page, which appears as follows.

Choose Sign On, and then choose the Identity Provider metadata link to download the metadata file in xml format (for example, metadata.xml).

AWS configuration

Next, you set up a role that establishes a trust relationship between the identity provider and AWS. You also create a role that Okta uses to access Amazon Redshift.

Step 5: Create the SAML IAM identity provider

Switching to AWS Management Console, sign in using your AWS credentials. Then open the AWS Identity and Access Management (IAM) console.

On the IAM console, choose Identity providers, and then choose Create Provider, as shown following.

Provide a name for your IdP, and upload the metadata file that you downloaded in the previous step.

Step 6: Create the IAM SAML 2.0 federation role

On the IAM console, navigate to Roles and create a new SAML 2.0 federation role.  Reference the IdP that you created in the previous step, and choose Allow programmatic and AWS Management Console access.

Step 7: Add other permissions to query Amazon Redshift

Choose Next: Assign Permissions. Then choose Create policy.

Create the following custom policy, replacing the region, account, and cluster parameters. These permissions allow the role to use Amazon Redshift to query data, create users, and allow users to join groups.

    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
           "Action": [
           "Resource": "arn:aws:redshift:<region>:<account>:cluster:<cluster>"}]

There are a few important things to note:

  • The group membership lasts only for the duration of the user session.
  • There is no CreateGroup permission because groups need to be manually created and granted DB privileges.

The following image shows the summary page for the role.

Identity provider (Okta) advanced configuration

In this section, you finalize the Okta configuration by adding the roles that you just created. You also tell Okta which groups are allowed to be passed to Amazon Redshift.

Step 8: Configure the advanced sign-on settings

Switch back to Okta.com. Navigate to the settings page for the Amazon Web Services Redshift application. In the Sign-On section, scroll to Advanced Sign-On Settings.

Enter the previously created IdP and role ARNS, which are globally unique and ensure that Okta will be directed to your AWS account. Allowed DB Groups is a list of allowed groups that will be sent to Amazon Redshift in the DBGroup SAML assertion.

Don’t use the asterisk (*) wildcard. This will cause the Everyone group to be passed, and Amazon Redshift will complain because it expects the group names to be lowercase.  Note that the ${user.username} is sent in the DBUser SAML assertion.

Step 9: Authorize users

Authorize users to use the Amazon Web Services Redshift application by selecting their respective groups or individual user accounts. In this example, I authorized users by group.

Amazon Redshift server/client setup

Next, you set up groups in the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables. Finally, you set up your client tools to use your enterprise credentials and sign in to Amazon Redshift.

Step 10: Set up groups

Log in to your Amazon Redshift cluster with an admin account. Create groups that match the IdP group names, and grant the appropriate permissions to tables and schemas.

CREATE GROUP marketing;
GRANT ALL on TABLES to GROUP marketing;

Step 11: Configure the JDBC SQL client

Assuming that the Amazon Redshift JDBC driver is installed, set up a new connection to your cluster using your IdP credentials. In the following example, I am using SQLWorkbenchJ. For the URL, be sure to enter “iam” to instruct the driver to authenticate using IAM. For Username and Password, enter the values that you set in Okta.

Enter the extended properties as follows. For app_id and idp_host, refer to the URL for the application in your web browser:


Step 12: Configure the ODBC SQL client

Assuming that the Amazon Redshift ODBC driver is installed, set up a new connection to your cluster using your IdP credentials. In the following example, I modified the ~/Library/ODBC/odbc.ini file.  See the previous instructions for determining the <app_id> and <prefix> values.

[ODBC Data Sources]
Redshift DSN=Installed

[Redshift DSN]

Step 13: Test user access

You should now be able to sign on with the users created. In our example, [email protected] has access to the tables in the “sales” schema only. The user [email protected] has access to tables in the “marketing” schema only. And [email protected] has access to tables in both schemas. Using the [email protected] user, you get following results when trying to query data from each of the schemas:

select storeid From sales.stores


select * From marketing.campaign

An error occurred when executing the SQL command:
select * From marketing.campaign

[Amazon](500310) Invalid operation: permission denied for schema marketing;
1 statement failed.

Execution time: 0.16s


In this post, I provided a step-by-step guide for configuring and using Okta as your Identity Provider (IdP) to enable single sign-on to an Amazon Redshift cluster. I also showed how group membership within your IdP can be passed along, enabling you to manage user access to Amazon Redshift resources from within your IdP.

If you have questions or suggestions, please comment below.


About the Author

Rajiv Gupta is a data warehouse specialist solutions architect with Amazon Web Services.





New – Concurrency Scaling for Amazon Redshift – Peak Performance at All Times

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-concurrency-scaling-for-amazon-redshift-peak-performance-at-all-times/

Amazon Redshift is a data warehouse that can expand to exabyte-scale. Today, tens of thousands of AWS customers (including NTT DOCOMO, Finra, and Johnson & Johnson) use Redshift to run mission-critical BI dashboards, analyze real-time streaming data, and run predictive analytics jobs.

A challenge arises when the number of concurrent queries grows at peak times. When a multitude of business analysts all turn to their BI dashboards or long-running data science workloads compete with other workloads for resources, Redshift will queue queries until enough compute resources become available in the cluster. This ensures that all of the work gets done, but it can mean that performance is impacted at peak times. Two options present themselves:

  • Overprovision the cluster to meet peak needs. This option addresses the immediate issue, but wastes resources and costs more than necessary.
  • Optimize the cluster for typical workloads. This option forces you to wait longer for results at peak times, possibly delaying important business decisions.

New Concurrency Scaling
Today I would like to offer a third option. You can now configure Redshift to add more query processing power on an as-needed basis. This happens transparently and in a manner of seconds, and provides you with fast, consistent performance even as the workload grows to hundreds of concurrent queries. Additional processing power is ready in seconds and does not need to be pre-warmed or pre-provisioned. You pay only for what you use, with per-second billing and also accumulate one hour of concurrency scaling cluster credits every 24 hours while your main cluster is running. The extra processing power is removed when it is no longer needed, making this a perfect way to address the bursty use cases that I described above.

You can allocate the burst power to specific users or queues, and you can continue to use your existing BI and ETL applications. Concurrency Scaling Clusters are used to handle many forms of read-only queries, with additional flexibility in the works; read about Concurrency Scaling to learn more.

Using Concurrency Scaling
This feature can be enabled for an existing cluster in minutes! We recommend starting with a fresh Redshift Parameter Group for testing purposes, so I start by creating one:

Then I edit my cluster’s Workload Management Configuration, select the new parameter group, set the Concurrency Scaling Mode to auto, and click Save:

I will use the Cloud Data Warehouse Benchmark Derived From TPC-DS as a source of test data and test queries. I download the DDL, customize it with my AWS credentials, and use psql to connect to my cluster and create the test data:

sample=# create database sample;
sample=# \connect sample;
psql (9.2.24, server 8.0.2)
WARNING: psql version 9.2, server version 8.0.
         Some psql features might not work.
SSL connection (cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256)
You are now connected to database "sample" as user "awsuser".
sample=# \i ddl.sql

The DDL creates the tables and loads populates them using data stored in an S3 bucket:

sample=# \dt
                 List of relations
 schema |          name          | type  |  owner
 public | call_center            | table | awsuser
 public | catalog_page           | table | awsuser
 public | catalog_returns        | table | awsuser
 public | catalog_sales          | table | awsuser
 public | customer               | table | awsuser
 public | customer_address       | table | awsuser
 public | customer_demographics  | table | awsuser
 public | date_dim               | table | awsuser
 public | dbgen_version          | table | awsuser
 public | household_demographics | table | awsuser
 public | income_band            | table | awsuser
 public | inventory              | table | awsuser
 public | item                   | table | awsuser
 public | promotion              | table | awsuser
 public | reason                 | table | awsuser
 public | ship_mode              | table | awsuser
 public | store                  | table | awsuser
 public | store_returns          | table | awsuser
 public | store_sales            | table | awsuser
 public | time_dim               | table | awsuser
 public | warehouse              | table | awsuser
 public | web_page               | table | awsuser
 public | web_returns            | table | awsuser
 public | web_sales              | table | awsuser
 public | web_site               | table | awsuser
(25 rows)

Then I download the queries and open up a bunch of PuTTY windows so that I can generate a meaningful load for my Redshift cluster:

I run an initial set of parallel queries, and then ramp up over time, I can see them in the Cluster Performance tab for my cluster:

I can see the additional processing power come online as needed, and then go away when no longer needed, in the Database Performance tab:

As you can see, my cluster scales as needed in order to handle all of the queries as expeditiously as possible. The Concurrency Scaling Usage shows me how many seconds of additional processing power I have consumed (as I noted earlier, each cluster accumulates a full hour of concurrency credits every 24 hours).

I can use the parameter max_concurrency_scaling_clusters to control the number of Concurrency Scaling Clusters that can be used (the default limit is 10, but you can request an increase if you need more).

Available Today
You can start making use of Concurrency Scaling Clusters today in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo) Regions today, with more to come later this year.



Granting fine-grained access to the Amazon Redshift Management Console

Post Syndicated from Raj Jayaraman original https://aws.amazon.com/blogs/big-data/granting-fine-grained-access-to-the-amazon-redshift-management-console/

As a fully managed service, Amazon Redshift is designed to be easy to set up and use. In this blog post, we demonstrate how to grant access to users in an operations group to perform only specific actions in the Amazon Redshift Management Console. If you implement a custom IAM policy, you can set it up so these users can monitor and terminate running queries. At the same time, you can prevent these users from performing other more privileged operations such as modifying, restarting, or deleting an Amazon Redshift cluster.

An overview of Amazon Redshift access control

Since its release in February 2013, Amazon Redshift has quickly become a popular cloud-based data warehousing platform for thousands of customers worldwide.

Access to Amazon Redshift requires credentials that AWS can use to authenticate your requests. Those credentials must have permissions to access Amazon Redshift resources, such as an Amazon Redshift cluster or a snapshot. For more details on these credentials, see Authentication and Access Control for Amazon Redshift in the Amazon Redshift documentation.

Every AWS resource is owned by an AWS account, and permissions to create or access the resources are governed by AWS Identity and Access Management (IAM) policies. An AWS account administrator can attach permissions policies to IAM identities (users, groups, and roles). In particular, an AWS account administrator can attach an IAM permissions policy to a specific user. Such a policy grants permissions for that user to manage an Amazon Redshift resource, such as a snapshot or an event subscription.

When granting permissions, you can decide who gets the permissions and which Amazon Redshift resources they get permissions for. You can also decide on the specific actions that you want to allow on those resources. Policies attached to an IAM identity are referred to as identity-based IAM policies, and policies attached to a resource are referred to as resource-based policies. Amazon Redshift supports only identity-based IAM policies.

Use case: Setting limited access for a user

Consider the following use case. Suppose that an IAM user who is a member of a customer’s operations group needs to monitor and terminate queries running in an Amazon Redshift cluster. It’s best if they do so through the Amazon Redshift console. This user is not allowed to modify or delete any other Amazon Redshift resources.

To implement this use case, we need to implement a custom IAM policy that ensures this IAM user has read-only access to the Amazon Redshift console. Doing this means that the user can get descriptions of the available clusters and navigate to the Queries tab. Additionally, we want the IAM user to be able to cancel a running query through the Amazon Redshift console. To allow this, we use the redshift:CancelQuerySession IAM action. For descriptions of other allowed Amazon Redshift actions in an IAM policy and what each action means, see Actions Defined by Amazon Redshift in the Amazon Redshift documentation.

To create such a custom IAM policy, follow these instructions:

  1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
  2. In the navigation pane on the left, choose Policies.
  3. Choose Create policy.
  4. Choose the JSON tab and input the following policy:
    "Version": "2012-10-17",
    "Statement": [
            "Action": [
            "Effect": "Allow",
            "Resource": "*"

  1. On the Review policy page, type a value for Name and optionally for Description for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.
  2. Attach this policy to an existing or a new IAM user.

With this permission policy, an IAM user can select an Amazon Redshift cluster, list all running queries in the Queries tab, and terminate a query if needed. All the permissions are read-only. Thus, the user can’t create a new Amazon Redshift cluster or modify or delete an existing cluster. However, the user can view available clusters, cluster snapshots, parameter groups, and cluster subnet groups, and view other properties of existing clusters.

Validating the use case

With the above IAM policy in place, after the IAM user logs into the Amazon Redshift Management Console, the user can select and view details about the Amazon Redshift cluster or clusters in the account. After navigating to the Queries tab, the user can see both the running and completed queries.

To cancel or terminate a long running query, the user can select the query from the list and choose Terminate Query. However, this user can’t modify or delete anything else in the Amazon Redshift console. As an example, if the user tries to modify an Amazon Redshift cluster (to change its endpoint), that user encounters the following error.


In this post, we have walked through a detailed customer use case of providing fine-grained access to the Amazon Redshift console. Using a set of carefully tailored IAM policies, a customer’s operations personnel can have read-only access to the Amazon Redshift console. These personnel can cancel or terminate running queries without the ability to modify, add, or delete any other Amazon Redshift resources.

We want to acknowledge our fellow AWS co-workers Ryan Mich, Sulay Shah and Hunter Grider for their many useful comments and suggestions.

If you have any questions or suggestions, leave your feedback in the comment section. If you need any further assistance to optimize your Amazon Redshift implementation, contact your AWS account team or a trusted AWS partner.


About the authors

Raj Jayaraman is a cloud support engineer with AWS Support at Amazon Web Services.





Po Hong, Ph.D. is a senior data architect within the Global Data & Analytics Specialty Practice at AWS Professional Services.





Build a modern analytics stack optimized for sharing and collaborating with Mode and Amazon Redshift

Post Syndicated from Benn Stancil original https://aws.amazon.com/blogs/big-data/build-a-modern-analytics-stack-optimized-for-sharing-and-collaborating-with-mode-and-amazon-redshift/

Leading technology companies, such as Netflix and Airbnb, are building on AWS to solve problems on the edge of the data ecosystem. While these companies show us what data and analytics make possible, the complexity and scale of their problems aren’t typical. Most of our challenges aren’t figuring out how to process billions of records to provide real-time recommendations to millions of customers. Instead, we struggle with wrangling the data that we have, finding and sharing insights from that data, and then acting on them. This leads even the most savvy data teams that have adopted tools, such as Jupyter notebooks, to either be bottlenecked by slow data pipelines, or resort to a manual process to democratize insights for their stakeholders.

This blog post walks you through the following:

  1. The stages of BI modernization that illustrate what problems arise when organizations adopt modern BI tools without truly modernizing the analytics platform.
  2. What a cloud-first data science platform, with dashboarding and notebooking solutions integrated together with efficient data pipelines, could look like.
  3. How to replicate such a platform in a few steps. This combines a collaborative analytics solution, such as Mode (an Amazon Redshift partner), with a strong analytics foundation built with AWS data warehousing, ETL, and data exploration services.

The stages of BI modernization

There are two primary stages in the evolution of BI – the use of Microsoft Excel for creating reports and the use of dashboarding tools for sharing insights in a consumable format.

Stage 1: The Excel workflow

At AWS and Mode, we talk to thousands of companies who are looking to get more value out of their data. When we ask them how they use data today, the most common answer we get is shown in the following example:

Most companies recognize this workflow as broken. The data pipelines are impossible to manage, and the analysis requires manual effort to reproduce. And in the end, we don’t know if “budget_analysis_v3_final_revised_FINAL.xls” is indeed final.

Stage 2: The dash to the dashboards

When looking for a more effective solution, companies often turn to BI products like Tableau, Amazon QuickSight, PowerBI, and Looker. These products, which were either born in the cloud or are heavily invested in it now, make it efficient to create and share reports and dashboards. KPIs can be delivered through up-to-date URLs rather than emailed files. This helps ensure that everyone has the same view of what’s happening across the business. The BI process is shown in the following example:

While modern BI is a significant step forward, it’s an incomplete solution. Dashboards reveal what’s happening, but businesses that want to use that data for action must understand why things are happening. Before a company can respond to falling sales in one region of the country, for example, it must understand what’s driving the drop. Because dashboards can’t be easily modified, extended, or reused for further analysis, they are often the wrong tool for analysts and data scientists who are charged with answering open-ended exploratory questions. As a result, data infrastructures remain fragmented, and analytics and data science workflows are still built on manual processes.

A cloud-first data science platform

The ideal technology stack for modern data science teams unifies these two stages described in the previous section. Dashboards should serve as the start for exploratory questions for analysts, analysts’ work should be as accessible as company dashboards, and the platform should facilitate a close collaboration between data scientists and business stakeholders.

Pioneering data teams at leading tech companies have developed internal solutions to do exactly this. Uber built a data science workbench for data exploration, data preparation, adhoc analyses, model exploration, workflow scheduling, dashboarding, and collaboration. Netflix recently unveiled the Netflix Data Platform, which automates the execution and distribution of Jupyter notebooks. Instacart built Blazer for exploring and sharing data.

All of these platforms have three things in common:

  • They combine visualization tools and interactive analysis tools, such as R and Python notebooks, and a collaboration platform.
  • They are powered by a modern data warehouse that can scale to accommodate any size of data and any number of analysts.
  • They have reliable ETL pipelines that provide analysts and data scientists access to the data they need, when they need it.

Building a cloud-first data science platform

Fortunately, AWS and its partners offer solutions that check all these boxes and provide the same power to data science teams that aren’t able to build it themselves. Data warehousing services like Amazon Redshift and Athena are fast, scalable, and accessible to anyone who can write standard SQL. ETL partners like Fivetran, Segment, and Matillion provide reliable, push-button ETL services from hundreds of applications into Amazon Redshift and Amazon S3. Finally, a cloud-based analytics platform such as Mode combines visualizations tools, fully hosted R and Python notebooks, and a distribution platform.

This modern stack, which is as powerful as the tooling inside Netflix or Airbnb, provides fully automated BI and data science tooling. It can be deployed in a matter of days and at a fraction of the cost of legacy data science tools.

Three steps to building the platform

Implement this data science infrastructure by using the following three steps:

  1. Set up a data warehouse.
  2. Populate your warehouse with data from around your company.
  3. Add a data science solution on top of your warehouse.

These steps do not require a large investment into engineering teams and custom-built software.

There are many ways to customize this stack to fit your company’s needs. However, this section shows how to set up using Amazon Redshift for a warehouse, Fivetran for ETL, and Mode for data science.

Step 1: Setting up Amazon Redshift

For information about setting up an Amazon Redshift warehouse, see Getting Started with Amazon Redshift. While you need an AWS account to set it up, the process requires no code and only takes a few minutes.

Most configuration options, including the size of the cluster, can be adjusted after the initial setup. Therefore, it’s not necessary to get everything exact at first. If a different configuration is more appropriate later, you can go back and change most of the Amazon Redshift settings.

Step 2: Populating Amazon Redshift with data

Your warehouse is only as good as the data in it. Fortunately, a number of ETL tools make it more efficient to continuously stream data from around your business and the applications you use. Application databases, third party apps like Salesforce and Zendesk, even CSV files – all of these can be easily fed into Amazon Redshift without any engineering effort.

Fivetran, an Amazon Redshift partner, is one such ETL tool (it’s a tool that we’re happy with at Mode). To connect Fivetran to your Amazon Redshift database, first configure your database to allow Fivetran to connect. Fivetran supports a variety of options for connecting, including connecting directly or by using an SSH tunnel. For more information about the steps, see the connection options.

As a final step, create an Amazon Redshift user for Fivetran. We recommend that you use another user than the master user. To create this user, log into the Amazon Redshift query editor (or a SQL client of your choice) and run the following commands:

CREATE USER fivetran PASSWORD <password>;
GRANT CREATE ON DATABASE <database> TO fivetran;

After Amazon Redshift is configured:

  1. Create a new Fivetran account.
  2. Select I already have an existing warehouse, then choose Redshift.
  3. Fill out the form with your Amazon Redshift credentials, as shown in the following example, then choose Save.

  1. After Fivetran is connected to Amazon Redshift, connect it with the data sources that you want to pull into Amazon Redshift. This process is now more efficient.
  2. In Fivetran, choose Connectors.
  3. Choose Add connector, then choose the data source that you want to integrate. Though the specifics vary by source, most of them follow the same pattern.
  4. Choose a schema in Amazon Redshift that you want to write your data to, and then follow the authorization flow that Fivetran automatically steps you through.

The following are examples of connection flows:

Connection flow for Salesforce


Connection flow for Google Analytics


By using similar flows, you can also connect other databases, such as Amazon RDS Postgres or the MySQL database, and directly upload CSVs.

When these connections are set up, data automatically syncs between your data sources and Amazon Redshift. If you want more control, Fivetran lets you choose which data to sync, and how often it’s updated.

Can’t find the data source you’re looking for? Other ETL tools, including Stitch Data, Segment, and ETLeap, provide similar services that are just as easy to set up. We recommend this guide when making a decision about which tool is right for you.

Step 3: Connecting Amazon Redshift to Mode

Finally, by connecting Mode to your Amazon Redshift, you can provide your entire company access to your data in a collaborative analytics environment.

To connect Mode, configure your security groups so that Mode can access Amazon Redshift. If you’re connecting Mode directly to your cluster, follow the security groups documentation linked above to grant access to the following IP addresses:

Mode also offers alternative ways of connecting if you’re unable to modify your firewall.

After you’ve completed these steps, you need only enter your credentials on Mode’s data source connection page, as shown in the following example:

After the connection is made, choose who in your organization can access that connection. Then you can immediately query your data and build the analysis for your team from the Mode Editor, as shown in the following example:

In addition to a SQL environment and visualization builder, Mode also offers integrated Python and R notebooks. Choose New Notebook in the left navigation bar to start a new Python or R instance that’s automatically populated with your query results as DataFrames. This enables data scientists to seamlessly create and share analysis directly with everyone around the company. Ultimately, this approach lets you build the most flexible platform for your analytical needs. Your business analysts and data scientists can now work in the same environment. They can collaborate seamlessly, and access the same data at all times.


This new architecture lets organizations to do more with their data, faster. Data teams that use Python and R can go beyond sharing static dashboards and reports; instead, they can also use popular forecasting and machine learning libraries like Prophet and TensorFlow. These libraries help teams find insights that they couldn’t have found otherwise. This lets teams deliver regular updates that keep everyone informed, and also answer strategic and high-value questions that drive key decisions. Moreover, Mode makes these analyses accessible to everyone around the business. Because the notebooks are fully managed, data scientists can share their work directly with stakeholders without any extra work from IT departments.

By combining Mode with Amazon Redshift, data teams also remove common bottlenecks in data integration, cleansing, or ETL processes that loads data into Amazon Redshift. With Amazon Redshift Spectrum, they can query data directly in their Amazon S3 data lake from a Mode dashboard or notebook. Moreover, they can combine these queries with data already loaded into the data warehouse.

Try it yourself

We’ve built an experience for you to get a feel for this stack. If you think it could work for your case, you can get started using Mode with Amazon Redshift in a matter of minutes. If you’re not already using Amazon Redshift, you can get started with a 2-month free trial and deploy the solution, as suggested. With Mode connected to Amazon Redshift, you can start exploring your data right away or try using one of the publicly available datasets.


About the Authors

Benn Stancil is a co­founder and Chief Analyst at Mode, a company building collaborative tools for data scientists and analysts. Benn is responsible for overseeing Mode’s internal analytics efforts, and is also an active contributor to the data science community. In addition, Benn provides strategic guidance to Mode’s product direction as a member of the product leadership team.




Ayush Jain is a Product Marketer at Amazon Web Services. He loves growing cloud services and helping customers get more value from the cloud deployments. He has several years of experience in Software Development, Product Management and Product Marketing in developer and data services.




Himanshu Raja is a Senior Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.

How to rotate Amazon DocumentDB and Amazon Redshift credentials in AWS Secrets Manager

Post Syndicated from Apurv Awasthi original https://aws.amazon.com/blogs/security/how-to-rotate-amazon-documentdb-and-amazon-redshift-credentials-in-aws-secrets-manager/

Using temporary credentials is an AWS Identity and Access Management (IAM) best practice. Even Dilbert is learning to set up temporary credentials. Today, AWS Secrets Manager made it easier to follow this best practice by launching support for rotating credentials for Amazon DocumentDB and Amazon Redshift automatically. Now, with a few clicks, you can configure Secrets Manager to rotate these credentials automatically, turning a typical, long-term credential into a temporary credential.

In this post, I summarize the key features of AWS Secrets Manager. Then, I show you how to store a database credential for an Amazon DocumentDB cluster and how your applications can access this secret. Finally, I show you how to configure AWS Secrets Manager to rotate this secret automatically.

Key features of Secrets Manager

These features include the ability to:

  • Rotate secrets safely. You can configure Secrets Manager to rotate secrets automatically without disrupting your applications, turning long-term secrets into temporary secrets. Secrets Manager natively supports rotating secrets for all Amazon database services—Amazon RDS, Amazon DocumentDB, and Amazon Redshift—that require a user name and password. You can extend Secrets Manager to meet your custom rotation requirements by creating an AWS Lambda function to rotate other types of secrets.
  • Manage access with fine-grained policies. You can store all your secrets centrally and control access to these securely using fine-grained AWS Identity and Access Management (IAM) policies and resource-based policies. You can also tag secrets to help you discover, organize, and control access to secrets used throughout your organization.
  • Audit and monitor secrets centrally. Secrets Manager integrates with AWS logging and monitoring services to enable you to meet your security and compliance requirements. For example, you can audit AWS AWS CloudTrail logs to see when Secrets Manager rotated a secret or configure AWS CloudWatch Events to alert you when an administrator deletes a secret.
  • Pay as you go. Pay for the secrets you store in Secrets Manager and for the use of these secrets; there are no long-term contracts or licensing fees.
  • Compliance. You can use AWS Secrets Manager to manage secrets for workloads that are subject to U.S. Health Insurance Portability and Accountability Act (HIPAA), Payment Card Industry Data Security Standard (PCI-DSS), and ISO/IEC 27001, ISO/IEC 27017, ISO/IEC 27018, or ISO 9001.

Phase 1: Store a secret in Secrets Manager

Now that you’re familiar with the key features, I’ll show you how to store the credential for a DocumentDB cluster. To demonstrate how to retrieve and use the secret, I use a Python application running on Amazon EC2 that requires this database credential to access the DocumentDB cluster. Finally, I show how to configure Secrets Manager to rotate this database credential automatically.

  1. In the Secrets Manager console, select Store a new secret.
    Figure 1: Select "Store a new secret"

    Figure 1: Select “Store a new secret”

  2. Next, select Credentials for DocumentDB database. For this example, I store the credentials for the database masteruser. I start by securing the masteruser because it’s the most powerful database credential and has full access over the database.
    Figure 2: Select "Credentials for DocumentDB database"

    Figure 2: Select “Credentials for DocumentDB database”

    Note: To follow along, you need the AWSSecretsManagerReadWriteAccess managed policy because this policy grants permissions to store secrets in Secrets Manager. Read the AWS Secrets Manager Documentation for more information about the minimum IAM permissions required to store a secret.

  3. By default, Secrets Manager creates a unique encryption key for each AWS region and AWS account where you use Secrets Manager. I chose to encrypt this secret with the default encryption key.
    Figure 3: Select the default or your CMK

    Figure 3: Select the default or your CMK

  4. Next, view the list of DocumentDB clusters in my account and select the database this credential accesses. For this example, I select the DB instance documentdb-instance, and then select Next.
    Figure 4: Select the instance you created

    Figure 4: Select the instance you created

  5. In this step, specify values for Secret Name and Description. Based on where you will use this secret, give it a hierarchical name, such as Applications/MyApp/Documentdb-instancee, and then select Next.
    Figure 5: Provide a name and description

    Figure 5: Provide a name and description

  6. For the next step, I chose to keep the Disable automatic rotation default setting because in my example my application that uses the secret is running on Amazon EC2. I’ll enable rotation after I’ve updated my application (see Phase 2 below) to use Secrets Manager APIs to retrieve secrets. Select Next.
    Figure 6: Choose to either enable or disable automatic rotation

    Figure 6: Choose to either enable or disable automatic rotation

    Note:If you’re storing a secret that you’re not using in your application, select Enable automatic rotation. See AWS Secrets Manager getting started guide on rotation for details.

  7. Review the information on the next screen and, if everything looks correct, select Store. You’ve now successfully stored a secret in Secrets Manager.
  8. Next, select See sample code in Python.
    Figure 7: Select the "See sample code" button

    Figure 7: Select the “See sample code” button

  9. Finally, take note of the code samples provided. You will use this code to update your application to retrieve the secret using Secrets Manager APIs.
    Figure 8: Copy the code sample for use in your application

    Figure 8: Copy the code sample for use in your application

Phase 2: Update an application to retrieve a secret from Secrets Manager

Now that you’ve stored the secret in Secrets Manager, you can update your application to retrieve the database credential from Secrets Manager instead of hard-coding this information in a configuration file or source code. For this example, I show how to configure a Python application to retrieve this secret from Secrets Manager.

  1. I connect to my Amazon EC2 instance via Secure Shell (SSH).
        import DocumentDB
        import config
        def no_secrets_manager_sample()
        # Get the user name, password, and database connection information from a config file.
        database = config.database
        user_name = config.user_name
        password = config.password                

  2. Previously, I configured my application to retrieve the database user name and password from the configuration file. Below is the source code for my application.
        # Use the user name, password, and database connection information to connect to the database
        db = Database.connect(database.endpoint, user_name, password, database.db_name, database.port) 

  3. I use the sample code from Phase 1 above and update my application to retrieve the user name and password from Secrets Manager. This code sets up the client, then retrieves and decrypts the secret Applications/MyApp/Documentdb-instance. I’ve added comments to the code to make the code easier to understand.
        # Use this code snippet in your app.
        # If you need more information about configurations or implementing the sample code, visit the AWS docs:   
        # https://aws.amazon.com/developers/getting-started/python/
        import boto3
        import base64
        from botocore.exceptions import ClientError
        def get_secret():
            secret_name = "Applications/MyApp/Documentdb-instance"
            region_name = "us-west-2"
            # Create a Secrets Manager client
            session = boto3.session.Session()
            client = session.client(
            # In this sample we only handle the specific exceptions for the 'GetSecretValue' API.
            # See https://docs.aws.amazon.com/secretsmanager/latest/apireference/API_GetSecretValue.html
            # We rethrow the exception by default.
                get_secret_value_response = client.get_secret_value(
            except ClientError as e:
                if e.response['Error']['Code'] == 'DecryptionFailureException':
                    # Secrets Manager can't decrypt the protected secret text using the provided KMS key.
                    # Deal with the exception here, and/or rethrow at your discretion.
                    raise e
                elif e.response['Error']['Code'] == 'InternalServiceErrorException':
                    # An error occurred on the server side.
                    # Deal with the exception here, and/or rethrow at your discretion.
                    raise e
                elif e.response['Error']['Code'] == 'InvalidParameterException':
                    # You provided an invalid value for a parameter.
                    # Deal with the exception here, and/or rethrow at your discretion.
                    raise e
                elif e.response['Error']['Code'] == 'InvalidRequestException':
                    # You provided a parameter value that is not valid for the current state of the resource.
                    # Deal with the exception here, and/or rethrow at your discretion.
                    raise e
                elif e.response['Error']['Code'] == 'ResourceNotFoundException':
                    # We can't find the resource that you asked for.
                    # Deal with the exception here, and/or rethrow at your discretion.
                    raise e
                # Decrypts secret using the associated KMS CMK.
                # Depending on whether the secret is a string or binary, one of these fields will be populated.
                if 'SecretString' in get_secret_value_response:
                    secret = get_secret_value_response['SecretString']
                    decoded_binary_secret = base64.b64decode(get_secret_value_response['SecretBinary'])
            # Your code goes here.                          

  4. Applications require permissions to access Secrets Manager. My application runs on Amazon EC2 and uses an IAM role to obtain access to AWS services. I will attach the following policy to my IAM role. This policy uses the GetSecretValue action to grant my application permissions to read a secret from Secrets Manager. This policy also uses the resource element to limit my application to read only the Applications/MyApp/Documentdb-instance secret from Secrets Manager. You can visit the AWS Secrets Manager documentation to understand the minimum IAM permissions required to retrieve a secret.
        "Version": "2012-10-17",
        "Statement": {
        "Sid": "RetrieveDbCredentialFromSecretsManager",
        "Effect": "Allow",
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "arn:aws:secretsmanager:::secret:Applications/MyApp/Documentdb-instance"

Phase 3: Enable rotation for your secret

Rotating secrets regularly is a security best practice. Secrets Manager makes it easier to follow this security best practice by offering built-in integrations and supporting extensibility with Lambda. When you enable rotation, Secrets Manager creates a Lambda function and attaches an IAM role to this function to execute rotations on a schedule you define.

Note: Configuring rotation is a privileged action that requires several IAM permissions, and you should only grant this access to trusted individuals. To grant these permissions, you can use the AWS IAMFullAccess managed policy.

Now, I show you how to configure Secrets Manager to rotate the secret
Applications/MyApp/Documentdb-instance automatically.

  1. From the Secrets Manager console, I go to the list of secrets and choose the secret I created in phase 1, Applications/MyApp/Documentdb-instance.
    Figure 9: Choose the secret from Phase 1

    Figure 9: Choose the secret from Phase 1

  2. Scroll to Rotation configuration, and then select Edit rotation.
    Figure 10: Select the Edit rotation configuration

    Figure 10: Select the Edit rotation configuration

  3. To enable rotation, select Enable automatic rotation, and then choose how frequently Secrets Manager rotates this secret. For this example, I set the rotation interval to 30 days. Then, choose create a new Lambda function to perform rotation and give the function an easy to remember name. For this example, I choose the name RotationFunctionforDocumentDB.
    Figure 11: Chose to enable automatic rotation, select a rotation interval, create a new Lambda function, and give it a name

    Figure 11: Chose to enable automatic rotation, select a rotation interval, create a new Lambda function, and give it a name

  4. Next, Secrets Manager requires permissions to rotate this secret on your behalf. Because I’m storing the masteruser database credential, Secrets Manager can use this credential to perform rotations. Therefore, I select Use this secret, and then select Save.
    Figure12: Select credentials for Secret Manager to use

    Figure12: Select credentials for Secret Manager to use

  5. The banner on the next screen confirms that I successfully configured rotation and the first rotation is in progress, which enables you to verify that rotation is functioning as expected. Secrets Manager will rotate this credential automatically every 30 days.
    Figure 13: The banner at the top of the screen will show the status of the rotation

    Figure 13: The banner at the top of the screen will show the status of the rotation


I explained the key benefits of AWS Secrets Manager and showed how you can use temporary credentials to access your Amazon DocumentDB clusters and Amazon Redshift instances securely. You can follow similar steps to rotate credentials for Amazon Redshift.

Secrets Manager helps you protect access to your applications, services, and IT resources without the upfront investment and on-going maintenance costs of operating your own secrets management infrastructure. To get started, visit the Secrets Manager console. To learn more, read the Secrets Manager documentation. If you have comments about this post, submit them in the Comments section below. If you have questions about anything in this post, start a new thread on the Secrets Manager forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Apurv Awasthi

Apurv is the product manager for credentials management services at AWS, including AWS Secrets Manager and IAM Roles. He enjoys the “Day 1” culture at Amazon because it aligns with his experience building startups in the sports and recruiting industries. Outside of work, Apurv enjoys hiking. He holds an MBA from UCLA and an MS in computer science from University of Kentucky.

How to enable cross-account Amazon Redshift COPY and Redshift Spectrum query for AWS KMS–encrypted data in Amazon S3

Post Syndicated from Asim Kumar Sasmal original https://aws.amazon.com/blogs/big-data/how-to-enable-cross-account-amazon-redshift-copy-and-redshift-spectrum-query-for-aws-kms-encrypted-data-in-amazon-s3/

This post shows a step-by-step walkthrough of how to set up a cross-account Amazon Redshift COPY and Spectrum query using a sample dataset in Amazon S3. The sample dataset is encrypted at rest using AWS KMS-managed keys (SSE-KMS).

About AWS Key Management Service (AWS KMS)

With AWS Key Management Service (AWS KMS), you can have centralized control over the encryption keys used to protect your data at rest. You can create, import, rotate, disable, delete, define usage policies, and audit the use of encryption keys used to encrypt your data. AWS KMS uses FIPS 140-2 validated cryptographic modules to protect the confidentiality and integrity of your master keys.

AWS KMS is seamlessly integrated with most AWS services. This integration means that you can easily use customer master keys (CMKs) to control the encryption of the data you store within these services. When deciding to encrypt data in a service such as Amazon Redshift, you can choose to use an AWS-managed CMK that Amazon Redshift automatically creates in KMS. You can track the usage of the key, but it’s managed by the service on your behalf. In some cases, you might need direct control over the lifecycle of a CMK or want to allow other accounts to use it. In these cases, you can create and manage your own CMK that AWS services such as Amazon Redshift can use on your behalf. These customer-managed CMKs enable you to have full control over the access permissions that determine who can use the key and under which conditions. AWS KMS is integrated with AWS CloudTrail, a service that provides a record of actions performed by a user, role, or AWS service in AWS KMS.

About Amazon Redshift and Redshift Spectrum

Amazon Redshift is a petabyte scale, fully managed data warehouse service on AWS. It uses a distributed, massively parallel processing (MPP), shared-nothing architecture that scales horizontally to meet usage requirements.

Amazon Redshift Spectrum is a feature of Amazon Redshift that extends the analytic power of Amazon Redshift beyond the data that is stored on local disks in the data warehouse. In other words, Amazon Redshift Spectrum enables you to use the same ANSI SQL syntax of Amazon Redshift on the data that is stored in an Amazon S3 data lake. You do so using external tables, without having to ingest the data into Amazon Redshift first. A common pattern is to run queries that span both the frequently accessed “hot” data stored locally in Amazon Redshift and the “warm/cold” data stored cost-effectively in Amazon S3. That pattern separates compute and storage by enabling independent scaling of both to match the use case. This means you don’t have to pay for unused compute capacity just to add more storage. More importantly, this approach enables seamless interoperability between your data lake and Amazon Redshift.

The Amazon Redshift COPY command supports the following types of Amazon S3 encryption:

  • Server-side encryption with Amazon S3-managed keys (SSE-S3)
  • Server-side encryption with AWS KMS-managed keys (SSE-KMS)
  • Client-side encryption using a client-side symmetric master key

The Amazon Redshift COPY command doesn’t support the following types of Amazon S3 encryption:

  • Server-side encryption with customer-provided keys (SSE-C)
  • Client-side encryption using an AWS KMS–managed customer master key
  • Client-side encryption using a customer-provided asymmetric master key

About the use case

A multiple-account AWS environment is a common pattern across our customers for a variety of reasons. One of the common reasons for data lake customers in AWS is to separate ownership of data assets from different business units in the company. At the same time, business units might need to grant access to some of their data assets to each other for new business insights.

As illustrated in the following drawing, in our example Account A owns an S3 bucket with SSE-KMS encrypted data and Account B owns an Amazon Redshift cluster with Redshift Spectrum enabled. Account B needs access to the same data to load to the Amazon Redshift cluster using the COPY command and also to query using Redshift Spectrum.

Solution walkthrough

Following, we walk through a couple different options to support this use case.


The solution assumes that you already have the following set up:

    1. Access to two AWS accounts (we call them Account A and B) in the same AWS Region.*
    2. Grant the AdministratorAccess policy to the AWS accounts (which should be restricted further for production).
    3. Account A has a customer-managed CMK in AWS KMS with the following attributes:
      • Alias as kms_key_account_a
      • Description as Cross Account KMS Key in Account A
      • Administrator as current IAM user using which you signed in to the AWS console and created the KMS key
      • Account B added as External Accounts

      Copy and save the CMK Amazon Resource Name (ARN) to be used shortly

    4. Account A uses the following sample dataset from AWS:
      Customer - s3://awssampledbuswest2/ssbgz/customer0002_part_00.gz

    5. Account A has an S3 bucket called rs-xacct-kms-bucket with bucket encryption option set to AWS KMS using the KMS key kms_key_account_a created earlier.
    6. Use the following AWS CLI command to copy the customer table data from AWS sample dataset SSB – Sample Schema Benchmark, found in the Amazon Redshift documentation.Note: Because bucket names are global across all AWS customers, you need a unique bucket name for your test run. Be sure to replace rs-xacct-kms-bucket with your own bucket name in the following command:
      aws s3 cp s3://awssampledbuswest2/ssbgz/ s3://rs-xacct-kms-bucket/customer/ --recursive --exclude '*' --include 'customer*'

    7. After the copy is complete, check the KMS key ID for the file from S3 console, as shown following.
    8. Account B has an Amazon Redshift cluster:
      • The cluster name is rstest
      • It’s publicly accessible
      • It has an IAM role attached called redshift_role_account_b with the following two managed IAM policies:
        • AmazonS3ReadOnlyAccess
        • AWSGlueConsoleFullAccess

            Note: Be sure to update redshift_role_account_b with your own IAM role.

            You can set up a database session successfully from a client tool, such as SQL Workbench from your laptop.

* This walkthrough uses a publicly available AWS sample dataset from the US-West-2 (Oregon) Region. Hence, we recommend that you use the US-West-2 (Oregon) Region for your test run to reduce cross-region network latency and cost due to data movement.

Step-by-step walkthrough

Depending on which account’s AWS Glue Data Catalog you want to use for Redshift Spectrum, there are two solution options to choose from:

  1. AWS Glue Data Catalog in Account B
  2. AWS Glue Data Catalog in Account A

Option 1: AWS Glue Data Catalog in Account B

Set up permissions

  1. Sign in to Account A’s AWS console. Then, change the AWS Region to us-west-2 (Oregon). Add the following bucket policy for the rs-xacct-kms-bucket bucket so that Account B (which owns the Amazon Redshift cluster – rstest) can access the bucket.

Note: Replace <Account B> with AWS Account ID for Account B and rs-xacct-kms-bucket with your bucket name.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowS3",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Account B>:root"
            "Action": [
            "Resource": [
    1. Sign in to Account B’s AWS console. Then, change the AWS Region to us-west-2 (Oregon). Create IAM policies and roles as described following:

a) Create the following two IAM permission policies: rs_xacct_bucket_policy to give Account B access to the S3 bucket in Account A, and rs_xacct_kms_policy to give Account B access to the CMK in Account A.

Policy name: rs_xacct_kms_policy

Note: Replace <ARN of kms_key_account_a from Account A> with your KMS key ARN from Account A.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowUseOfTheKey",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            "Sid": "AllowAttachmentOfPersistentResources",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": true

Policy name: rs_xacct_bucket_policy

Note: Replace rs-xacct-kms-bucket with your bucket name.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowS3",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [

b) Create a new IAM role called xacct_kms_role_account_b for the Amazon Redshift service with the following IAM policies attached:


Save the Amazon Resource Name (ARN) of the IAM role. You’ll use it soon.

c) Now let’s set up the IAM role chaining for Amazon Redshift between the two IAM roles, redshift_role_account_b and xacct_kms_role_account_b.

To chain roles, you establish a trust relationship between the roles. A role that assumes another role (for example, Role A) must have a permission policy that allows it to assume the next chained role (for example, Role B). Similarly, the role that passes permissions (Role B) must have a trust policy that allows it to pass its permissions to the previous chained role (Role A).

The first role in the chain must be a role attached to the Amazon Redshift cluster. The first role and each subsequent role that assumes the next role in the chain must have a policy that includes a specific statement. This statement has the Allow effect on the sts:AssumeRole action and the ARN of the next role in a Resource element.

In our example, Role A is redshift_role_account_b, which needs the permission policy rs_xacct_assume_role_policy, which  allows it to assume Role B (which is xacct_kms_role_account_b). Both IAM roles are owned by AWS Account B.

d) Let’s create the IAM permission policy rs_xacct_assume_role_policy and attach the policy to the IAM role redshift_role_account_b.

Policy name: rs_xacct_assume_role_policy

Note: Replace <ARN for IAM role xacct_kms_role_account_b from Account B>.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "Stmt1487639602000",
            "Effect": "Allow",
            "Action": [
            "Resource": [
"<ARN for IAM role xacct_kms_role_account_b from Account B>"

e) Change the trust relationship for IAM role xacct_kms_role_account_b by choosing Edit trust relationship and replacing the existing trust policy with the following:

Note: Replace <Account B> with the AWS Account ID for Account B.

  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      "Action": "sts:AssumeRole"
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Account B>:root"
      "Action": "sts:AssumeRole"

f) Create an AWS Glue service IAM role called glue_service_role_account_b with the following policies attached:

• AWSGlueServiceRole (AWS managed policy)
• rs_xacct_bucket_policy (managed policy created earlier)
• rs_xacct_kms_policy (managed policy created earlier)

Note: Be sure to update glue_service_role_account_b with your own IAM role.

Perform the Amazon Redshift COPY

  1. Log in to the Amazon Redshift cluster from your query tool and create the customer table using the DDL following.
CREATE TABLE customer 
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL

2. Now you can run the COPY statement following successfully.

copy customer from 's3://rs-xacct-kms-bucket/customer/' 
iam_role '<IAM role ARN of redshift_role_account_b,IAM role ARN of xacct_kms_role_account_b>'
region 'us-west-2';

Note: Replace the IAM role ARNs from Account B separated by a comma without any spaces around it.

3. Run the following sample query to verify that the data was loaded successfully.

select * from customer limit 10;

Set up an AWS Glue Data Catalog table for Redshift Spectrum to query

Let’s now create an AWS Glue crawler in Account B to crawl the same customer data and create a table called customer in the AWS Glue Data Catalog database spectrumdb_account_b following these steps:

  1. Navigate to Databases on the AWS Glue console and choose Add database to create an AWS Glue Data Catalog database called spectrumdb_account_b, as shown following.

  1. Navigate to Crawlers on the AWS Glue console and choose Add crawler, as shown following.

  1. Create a crawler customerxacct, as shown following.

Note: The Crawler job name (customerxacct in this case) is not same as the table name created by the crawler (a common confusion). The table name is picked up automatically from the prefix and folder name from your S3 bucket and folder structure. You also have an option to attach a table name prefix if you want to.              

  1. Choose Next to enter Data store details of the customer table, as following.

  1. Choose Next to get to the Add another data store We leave the default, No, because we don’t have any other data stores to add.

  1. Choose Next to choose the IAM role created earlier, glue_service_role_account_b, for the crawler to use, as shown following.

  1. Choose Next to go to the Schedule page and choose the schedule that you want this crawler job to run. For this example, we can choose Run on demand.

  1. Choose Next to choose the AWS Glue Data Catalog database spectrumdb_account_b (created earlier by create external schema command) as the crawler output location.

  1. Choose Next to get to the review page.

  1. After reviewing the details, choose Finish to finish creating the crawler.

  1. Now, let’s run the crawler job by selecting the job as following and choosing Run crawler.

  1. Wait and watch for the job to complete. Its status changes from Starting to Stopping to Ready. You can choose the refresh button for the latest status.

  1. If the job fails, the failure is recorded in Amazon CloudWatch logs. To view the logs, choose Logs, shown in the screenshot preceding, which takes you to the CloudWatch logs.
  1. Now, let’s go to the AWS Glue Data Catalog database to make sure that the table exists.

Choose Databases, choose the spectrumdb_account_b database, and then choose View Tables, or choose the hyperlink of the database name. You should see the customer table, as shown following.

  1. Choose the customer hyperlink to get to the external table, details following.

Because the data file didn’t have a header record, the AWS Glue crawler has assigned a default column naming convention as shown preceding. For the customer table, this naming is column 0 to column 7

  1. Choose Edit Schema and assign appropriate column names, as per the mapping following.

c0 => c_custkey

c1 => c_name

c2 => c_address

c3 => c_city

c4 => c_nation

c5 => c_region

c6 => c_phone

c7 => c_mktsegment

When you are done, choose Save.

Perform the Redshift Spectrum query

Now that the customer table is created in AWS Glue Data Catalog, let’s query the table using Redshift Spectrum.

  1. Log in to the Amazon Redshift cluster from your query tool.
  2. Run the statements following to create an external schema called spectrumxacct for Redshift Spectrum pointing to the AWS Glue Data Catalog database. This database is spectrumdb_account_b in Account B, already created on the AWS Glue console.
    drop schema if exists spectrumxacct;
    create external schema spectrumxacct
    from data catalog 
    database 'spectrumdb_account_b'
    iam_role '<IAM role ARN of redshift_role_account_b,IAM role ARN of xacct_kms_role_account_b>'
    create external database if not exists;

    Note: Replace the IAM role ARNs from Account B separated by a comma without any spaces around it.

  3. Run the following sample query to verify that Redshift Spectrum can query the data successfully.
    select * from spectrumxacct.customer limit 10;

Note: Redshift Spectrum uses the AWS Glue Data Catalog in Account B, not Account A.

Option 2: AWS Glue Data Catalog in Account A


Set up permissions

1. Sign in to the Account A AWS console, then change the AWS Region to us-west-2 (Oregon).

    • a) Create the following IAM policies:

• rs-xacct-bucket-policy to give access to the S3 bucket in Account A
• rs_xacct_kms_policy to give access to the CMK in Account A

Policy name: rs_xacct_bucket_policy

Note: Replace the bucket name rs-xacct-kms-bucket with your bucket name.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowS3",
            "Effect": "Allow",
            "Action": [
            "Resource": [

Policy name: rs_xacct_kms_policy

Note: Replace <ARN of kms_key_account_a from Account A> with your KMS key ARN from Account A.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowUseOfTheKey",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            "Sid": "AllowAttachmentOfPersistentResources",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": true

b) Create a new IAM role called xacct_kms_role_account_a for the Amazon Redshift service with the following IAM policies:

AWSGlueConsoleFullAccess (this managed policy provides the required permissions for the AWS Glue Data Catalog)

Save the IAM role ARN to be used shortly.

c) Change the trust relationship for the IAM role xacct_kms_role_account_a by choosing Edit trust relationship and replacing the existing trust policy with the following:

Note: Replace <Account B> with the AWS account ID for Account B.

  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      "Action": "sts:AssumeRole"
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Account B>:root"
      "Action": "sts:AssumeRole"

d) Create an AWS Glue service IAM role called glue_service_role_account_a with the following policies attached:

AWSGlueServiceRole (AWS managed policy)
rs_xacct_bucket_policy (managed policy created earlier)
rs_xacct_kms_policy (managed policy created earlier)

Note: Be sure to update glue_service_role_account_a with your own IAM role

2. Sign in to Account B’s AWS console and change the AWS Region to us-west-2 (Oregon) if it’s not already selected.

a) Modify the existing IAM policy rs_xacct_assume_role_policy and replace the existing JSON policy with the following:

 Note: Replace <ARN for IAM role xacct_kms_role_account_a from Account A>.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "Stmt1487639602000",
            "Effect": "Allow",
            "Action": [
            "Resource": [
"<ARN for IAM role xacct_kms_role_account_a from Account A>"

Perform the Amazon Redshift COPY

1. Log in to the Amazon Redshift cluster from your query tool and create the customer table using the DDL following.

CREATE TABLE customer 
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL

2. Now you should be able to run the COPY statement following successfully.

copy customer from 's3://rs-xacct-kms-bucket/customer/' 
iam_role '<ARN for IAM role redshift_role_account_b from Account B,<ARN for IAM role xacct_kms_role_account_a from Account A>'
region 'us-west-2';

Note: Replace the IAM role ARNs separated by a comma without any spaces around it.

3. Run the sample query following to validate that the data was loaded successfully.

select * from customer limit 10;

Set up AWS Glue Data Catalog table for Redshift Spectrum to query

Let’s now create an AWS Glue crawler in Account A to crawl the same customer data and create a table called customer in the AWS Glue Data Catalog database spectrumdb_account_a in Account A following these steps:

Follow the same steps as outlined in Option 1 to create and run a crawler with the following changes:

  1. This time, create the crawler in Account A (as opposed to Account B for Option 1).
  2. Create an AWS Glue Data Catalog database spectrumdb_account_a in Account A (as opposed to spectrumdb_account_b in Account B), and choose that database for crawler to create the customer table.
  3. While providing S3 path, choose the option Specified path in my account (unlike Specified path in another account chosen for Option 1).
  4. Make sure to use glue_service_role_account_a created earlier as the AWS Glue service IAM role.=

Perform the Redshift Spectrum query

Now that the customer table is created in the AWS Glue Data Catalog, let’s query the table using Redshift Spectrum.

1. Log in to the Amazon Redshift cluster from your query tool and run the statements following. These create an external schema called spectrumxacct2 for Redshift Spectrum pointing to the AWS Glue Data Catalog database spectrumdb_account_a (created earlier from AWS Glue console) in Account A.

drop schema if exists spectrumxacct2;
create external schema spectrumxacct2
from data catalog 
database 'spectrumdb_account_a' 
iam_role '<ARN for IAM role redshift_role_account_b from Account B,<ARN for IAM role xacct_kms_role_account_a from Account A>'
create external database if not exists;

Note: Replace the IAM role ARNs separated by a comma without any spaces around it.

2. Run the following query, which should run successfully.

select * from spectrumxacct2.customer limit 10;

Note: Spectrum uses the AWS Glue Data Catalog in Account A, not Account B.


This post shows a step-by-step walkthrough of how to set up a cross-account Amazon Redshift COPY and query using Redshift Spectrum for a sample KMS encrypted dataset in Amazon S3. It demonstrates two solution options to choose from depending on which account’s AWS Glue Catalog you want to use for Redshift Spectrum.

If you have questions or suggestions, please leave a comment.


About the Author

Asim Kumar Sasmal is a Sr. Data Architect – IoT in the Global Specialty Practice of AWS Professional Services. He helps AWS customers around the globe to design and build data driven solutions by providing expert technical consulting, best practices guidance, and implementation services on AWS platform. He is passionate about working backwards from customer ask, help them to think big, and dive deep to solve real business problems by leveraging the power of AWS platform.

Bannerconnect uses Amazon Redshift to help clients improve digital marketing results

Post Syndicated from Danny Stommen original https://aws.amazon.com/blogs/big-data/bannerconnect-uses-amazon-redshift-to-help-clients-improve-digital-marketing-results/

Bannerconnect uses programmatic marketing solutions that empower advertisers to win attention and customers by getting their ads seen by the right person at the right time and place. Data-driven insights help large advertisers, trade desks, and agencies boost brand awareness and maximize the results of their digital marketing. Timely analysis of log data is critical to respond to dynamic changes in customer behavior, optimize marketing campaigns quickly, and to gain competitive advantage.

By moving to AWS and Amazon Redshift, our clients can now get near real-time analytics at their fingertips. In this blog post, we describe the challenges that we faced with our legacy, on-premises data warehouse, and discuss the benefits we received by moving to Amazon Redshift. Now, we can ingest data faster, do more sophisticated analytics, and help our clients make faster, data-driven decisions to improve their digital marketing.

Legacy on-premises situation and challenges

Our on-premises, legacy infrastructure consisted of the IBM PureData System as our log level data warehouse. We used a MySQL database for storing all metadata and all analytics data. In this physical, nonvirtualized environment, we needed to plan capacity carefully, far in advance, to handle data growth. We needed a sizeable team to manage upgrades, maintenance, backups, and the day-to-day management of workloads and query performance.

We faced many challenges. We had only 1 gigabyte of bandwidth available to load log-level data into the data warehouse. At peak loads, our extract, transform, load (ETL) server ran completely full, and bandwidth became a bottleneck that delayed when the data was available for analytics. Software and firmware upgrades to the data warehouse needed to be scheduled, and maintenance downtime sometimes took up to eight hours to complete. Our infrastructure was also fragile. We ran everything on one PureData System, and we didn’t have a separate development and test environment. Clients that had direct access to our production environment could submit incorrect SQL queries and pull down the entire data warehouse.

From the log-level data, we created aggregates and stored them in MySQL. Indexes slowed down the loading process significantly. Several aggregations that we wanted to do were simply not possible. Running ad hoc (one-time) queries against 200 gigabytes of uncompressed, row-based data took ages to complete. Many dashboard queries took 10–15 minutes or longer, and were ultimately cancelled. Users were frustrated, so we knew that we had to evolve to a more responsive solution, end-to-end. We chose AWS and Amazon Redshift for our data warehouse.

Moving to Amazon Redshift

Because our legacy software was not designed to run in the cloud, we decided to rebuild our applications using all available AWS components. This saved us the hassle of any migration process, and we could design our applications to use the full potential of AWS.

Our new infrastructure uses Amazon Redshift as our log-level data warehouse. We use a 40-node ds2.xlarge cluster for our production processes. Here, we run log-level queries to aggregate data for the analytics cluster, and run thousands of queries each day to optimize our marketing campaigns.

We set up a separate 30-node ds2.xlarge Amazon Redshift cluster for client access. We replicate the log level data to this cluster, and allow our clients to run queries here without jeopardizing our production processes. Our clients perform data science queries against the data in this cluster.

We also created a 24-node dc2.large cluster for high-performance queries that would not be affected by large, complex queries running on our other clusters. We use this cluster for ad hoc analytics on aggregated data, made available through our API.

We use Amazon S3 as our main data store, giving us infinite storage. Amazon EC2 hosts our ETL processes, API, and several other applications.

Bannerconnect architecture. Amazon S3 is not added to the flowchart to make it simpler. You can add S3 to almost every arrow in the chart.

Our biggest gains

Creating our next-generation solution on AWS and Amazon Redshift provides many benefits for us. We simply follow the best practices provided in Amazon Redshift documentation.

  • Managed service: We wanted to focus on what we are good at, developing software, and not managing and maintaining infrastructure. With Amazon Redshift, we no longer have to do software updates, firmware upgrades, or deal with broken hardware. We no longer need to plan capacity months in advance, or deal with integrating new servers into our environment. Amazon Redshift completely automates all of this for us, including scaling with our data growth, so we can focus on data and analytics to better serve our clients.
  • A full, separate development and test environment: We now have an isolated Amazon Redshift cluster (single node), where we can perform, develop, and test without worrying about breaking the production environment. In our on-premises setup, we did have a development database, but it was always on the production infrastructure. Furthermore, we have an exact copy of our entire infrastructure in our test environment (obviously all in small scale and small instance types). This lets us run automated tests on each deployment to verify that all data flows work as expected.
  • Infinite scalability: We can scale instantly to any capacity we need. Amazon S3 gives us infinite storage, and we can scale Amazon Redshift compute capacity in just a few clicks.
  • Separate clusters: Clients now can’t jeopardize our production processes. Clients still write incorrect SQL queries. However, by using query monitoring rules in Amazon Redshift, we can identify these queries and have Amazon Redshift automatically stop them. Bad queries might affect the client cluster momentarily, but they don’t affect our production processes at all.
  • Faster ad hoc analytics: Due to the massive parallel processing, data compression, and columnar-based storage capabilities in Amazon Redshift, we can create aggregates that were not possible in MySQL. In terms of performance, it’s hard to give good numbers to compare. Running a query against a smaller aggregate using an index on MySQL might be faster at times. However, the majority of our queries are significantly faster on Amazon Redshift. For example, our biggest aggregated table contains about 2 billion records and 500 GB of data (compressed). MySQL couldn’t handle this, but Amazon Redshift results are retrieved within seconds. Large, complex queries took a long time on MySQL. Amazon Redshift completes these in tens of seconds or less.

Building the multicluster environment

This section explores an easy option to build a multicluster setup using AWS CloudFormation templates. With the templates, you can launch multiple Amazon Redshift clusters inside a VPC in both private and public subnets in different Availability Zones. The private subnet enables internal applications, such as EC2 instances, to execute the ETL process to interact with the Amazon Redshift cluster to refresh data. You can use the public Amazon Redshift cluster for the external client tools and scripts. Here is the architecture of this setup:

Let’s walk through the configuration. For demonstration purposes, we use just two Amazon Redshift clusters in a private and public subnet, but you can modify these steps to add more parallel clusters. The configuration is a two-step process to first create the network stack and later launch the Amazon Redshift cluster in those stacks. This process creates the following:

  • VPC and associated subnets, security groups, and routes
  • IAM roles to load data from S3 to Amazon Redshift
  • Amazon Redshift cluster or clusters


Step 1 – Create a network stack

  1. Sign in to the AWS Management Console and navigate to CloudFormation, then do the following:
  • Choose the AWS Region to launch the stack in, for example US East (Ohio).
  • Choose Create Stack.
  • Choose Specify an Amazon S3 template URL.
  • Copy and paste this URL into the text box: https://s3.amazonaws.com/salamander-us-east-1/Bannerconnect/networkstack.json
  1. Choose Next and provide the following information:
    • Stack Name: Name the stack anything convenient.
    • CIDR Prefix: Enter a class B CIDR prefix (for example, 168, 10.1, or 172.16).
    • Environment Tag: Name the environment anything convenient to tag the resources.
    • Key Name: The EC2 key pair to allow access to EC2 instances. If you don’t already have one, see Amazon EC2 Key Pairs in the EC2 documentation.
    • Use the default values for all other parameters and choose Create.
  2. The stack takes 10 minutes to launch, after which the network stack is ready.
  3. Review the outputs of the stack when the launch is complete to note the resource names that were created.

Step 2 – Create an Amazon Redshift cluster or clusters

  1. Navigate back to the CloudFormation console and do the following:
  • Choose Create Stack.
  • Choose Specify an Amazon S3 template URL.
  • Copy and paste this URL into the text box: https://s3.amazonaws.com/salamander-us-east-1/Bannerconnect/reshiftstack.json
  1. Choose Next and provide the following information:
    • Stack Name: Name the stack anything convenient.
    • Cluster Type: Choose a multi-node or single-node cluster.
    • Inbound Traffic: Allow inbound traffic to the cluster from this CIDR range.
    • Master Username: The user name that is associated with the master user account for the cluster that is being created. The default is adminuser.
    • Master User Password: The password that is associated with the master user account for the cluster that is being created.
    • Network Stack Name: The name of the active CloudFormation stack that was created in step 1, which contains the networking resources such as the subnet and security group.
    • Node Type: The node type to be provisioned for the Amazon Redshift cluster.
    • Number Of Nodes: The number of compute nodes in the Amazon Redshift cluster:
      • When the cluster type is specified as single-node, this value should be 1.
      • When the cluster type is specified as multi-node, this value should be greater than 1.
    • Port Number: The port number on which the cluster accepts incoming connections. The default is 5439.
    • Public Access: Public access to the Amazon Redshift cluster, either true or false. When this value is true, the cluster is launched in a public subnet. When this value is false, the cluster is launched in a private subnet.
    • Use the default values for all other parameters and choose Create.
  2. The stack take 10 minutes to launch, after which the Amazon Redshift cluster is launched in the network stack.
  3. Review the outputs of the stack when the launch is complete to note the resource names that were created for the Amazon Redshift cluster.
  4. Repeat steps 5–8 to add more Amazon Redshift clusters to this network stack.

With this easy deployment using the AWS CloudFormation template, you can launch all the resources needed for a multicluster setup with a few clicks.


Migrating to Amazon Redshift and setting up the data warehouse on AWS enabled us to build highly scalable decoupled applications and to use different clusters for different use cases. Operationally, we were able to build robust dev, test, and prod systems independently that are easy to manage to implement complex data workflows.

Recently, we started using Amazon Redshift Spectrum to query data directly from Amazon S3, without needing to load the data into Amazon Redshift. This saves us loading time and speeds up time to analytics, creating many new possibilities for us. Loading dynamic data with different formats and columns becomes a lot easier with Amazon Redshift Spectrum.


About the Authors

Danny Stommen has been working for Bannerconnect for 10 years, with his current role being Solutions Architect and most of his time working on the CORE solution. Next to work, he enjoys spending quality time with his family and actively playing soccer.




Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Before working at AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Running Amazon Payments analytics on Amazon Redshift with 750TB of data

Post Syndicated from Bishwabandhu Newton original https://aws.amazon.com/blogs/big-data/running-amazon-payments-analytics-on-amazon-redshift-with-750tb-of-data/

The Amazon Payments Data Engineering team is responsible for data ingestion, transformation, and the computation and storage of data. It makes these services available for more than 300 business customers across the globe. These customers include product managers, marketing managers, program managers, data scientists, business analysts and software development engineers. They use the data for their scheduled and one-time queries to make the right business decisions. This data is also used for building weekly, monthly and quarterly business review metrics, which are reviewed by the leadership team.

We support various consumer payment business teams, including the following:

  • Amazon Payment Products (Credit cards, Shop with Points, Amazon Currency Convertor, International Payment Products)
  • Gift Cards
  • Payment Acceptance Experience
  • Amazon Business Payments.

We also feed into the machine learning recommendation engine. This engine suggests the best payment product to a customer on Amazon’s payment checkout page.

Challenges with old datawarehouse

This section describes our previous challenges with our data warehouse and analytical needs. With payment products launches and their extension to new marketplaces, we had exponential growth in data volume. Later, scaling our extract, transform, and load process (ETL) was met with severe challenges, and resulted in delays and operational burdens. Here are the specific challenges we faced with our data warehouse:

  • Upsert did not scale, so we got updates more than ~10MN per run. The consumer product catalog dataset has more than 6.5BN records listed in the US marketplace, and occasionally the daily updates exceeded 10MN mark. We saw a similar trend for the Order attributes dataset.
  • Data aggregation either took longer or never finished if we had to analyze even for 6 months of payments data. Often, business owners wanted to aggregate the data based on certain attributes. For example, the number of successful transactions and monetary value by certain types of cards.
  • Shared cluster, and thus shared storage and compute caused resource crunch and impacted all its users. Each team was given ~100TB each on the Data warehouse. Each team could bring their table and join with central data warehouse tables. Any bad query on the cluster impacted all other queries on the same cluster. It was difficult to identify the owner of those bad queries.
  • There were more than 30,000 production tables. it became almost impossible to host all of them on the same cluster.
  • Index corruption on a larger table was cumbersome to rebuild and backfill the table.
  • We required a Database Administrator to apply patches and updates.

Using Amazon Redshift as the new payments data warehouse

We started exploring different options which suits our analytical needs, which is fast, reliable and scales well for future data growth. With all of the previously described issues, Central Data warehouse moved towards separating the compute and storage layer and they decided to be responsible for storage. They built a data lake on Amazon S3, which is encrypted to store even the highly confidential critical data. Each consumer team got the guideline to bring their own compute capacity for their analytical needs. Our payments team started looking for the following advantages:

  • Expedient analytics.
  • Integrates with S3 and the other AWS services.
  • Affordable storage and compute rates.
  • Works for ETL processing.

We chose Amazon Redshift because of it has the following features:

  • Bulk uploads are faster. ~700MN data inserts into ~30 minutes.
  • Data upsert is exceptionally fast.
  • Aggregation query on multi-million dataset with fewer columns of data returns in a few seconds as compared to a few minutes.
  • No need for DBA time to be allocated to maintain the database. Data engineers can easily perform backups, re-snapshot to a new cluster, set up alarms in case of cluster issues, and add new nodes.
  • The ability to keep data on S3. This data is accessible from multiple independent Amazon Redshift clusters through Spectrum and also allows users to join Spectrum tables with other tables created locally on Amazon Redshift. It offloads some processing to the Spectrum layer while storing the data on S3.
  • The ability to use Amazon Redshift best practices to design our tables in regards to distribution keys, sort keys, and compression. As a result, the query performance exceeded our SLA expectations.
  • An effective compression factor. This saves more than 40 to 50 percent of space by choosing the right compression. This enables faster query and efficient storage option.

Sources of data and storage

We consume data from different sources, like PostgreSQL, Amazon DynamoDB live streams, Central data warehouse data lake and bank partners’ data through secure protocols. Data from PostgreSQL databases are in relational format, whereas DynamoDB has key value pairs. We translate the key/value data to relational format and store in Amazon Redshift and S3. Most frequently accessed data is kept in Amazon Redshift. Less frequently accessed and larger datasets are stored in S3 and accessed through Amazon Redshift Spectrum.

Central data lake hosts more than 30,000 tables from different teams, such as Orders, Shipments, and Refunds. Whereas, we as a payments team need approximately 200 tables from this data lake as source tables. Later, we built a data mart specific to payment products, which feeds both to scheduled and one-time data and reporting needs. All small and mid-size tables, smaller than 50 TB, are directly loaded in Amazon Redshift from data lake, which physically stores the data. Tables larger than 50 TB are not stored locally on Amazon Redshift. Instead, we pull from the data lake using EMR-Hive, convert the format from tsv to ORC/Parquet and store on S3. We create an Amazon Redshift Spectrum table on top of S3 data. Format conversion lowers the runtime for each analytical aggregation queries, whereas storing on S3 makes sure we do not fill up entire Amazon Redshift cluster with data rather use it for efficient computing.

Data Architecture

Different components

  1. Central data warehouse data lake (Andes) — Almost all the systems in Amazon, wanting to share their data with other teams, publish their data to this datalake. It is an encrypted storage built on top of Amazon S3 which has metadata attached along with datafiles. Every dataset has a onetime dump and then incremental delta files. Teams willing to consume the data by
  • Physically copying the data into their own Amazon Redshift cluster. It is efficient for smaller and mid-size tables which are accessed most frequently.
  • Using Amazon Redshift Spectrum to run analytical queries on datasets stored in data lake. It helps in accessing cold large data, generally larger than 50TB, thereby avoids scaling up your Amazon Redshift cluster just because all the space might be consumed by these larger data files.
  • Using the AWS Glue catalog to update the metadata in your team’s S3 bucket and use Amazon EMR to pull the data, apply transformation, change format and store the final data in S3 bucket, which can further be consumed using Amazon Redshift Spectrum. It is efficient when the dataset is large and need transformations before being consumed.
  1. Amazon Redshift clusters — Amazon Redshift has centric architecture and is best suited for being single place for all source of truth, but we are managing three clusters mostly because of having consistent SLA of our reports, decoupling the user query experience with central data lake ingestion process (which is resource intensive). Here are the cluster specific reasons for why we need these as separate clusters.
  • Staging cluster:
    • Our data sources are dynamic which are in the transition state and moving away from relational to non-relational sources; for example, Oracle to Postgres or to DynamoDB.
    • The mechanism to pull data from central data lake and store into Amazon Redshift, is also evolving and is resource intensive in current state.
    • Our datamart is payment specific, though the table names in our datamart looks similar to central data lakes tables, but our datamart data is different than central data lake datasets. We apply necessary transformation and filters before bringing the data to the user’s Amazon Redshift cluster.
  • User cluster: Our internal business users wanted to create the tables in public schema for their analysis. They also needed direct connect access for any adhoc analysis. Most of the users know SQL and are aware of best practices but there are users who are new to SQL and sometimes their query is not optimized and impact other running queries, we have Amazon Redshift workload manager (WLM) settings to protect our cluster from these bad queries.
  • Prod ETL cluster: We have tight SLA to make dataset available for data users. In order to minimize the impact of bad queries running on the system we have set up replica of user cluster. All prod transforms run here and the output data is copied to both user and prod clusters. It insures the SLA we commit to data business users.
  1. Near real time data ingestion — Many applications like promotional data, card registration, gift card issuance etc need realtime data collection to detect fraud. Application data is stored in Amazon DynamoDB, with DynamoDB Streams enabled. We consume the data from these streams through an AWS Lambda function and Amazon Kinesis Data Firehose. Kinesis Firehose delivers the data to S3 and submits the copy command to load the data into Redshift. We have micro batch of 15 mins which makes sure not all the connections are consumed by these near-real time data applications.
  2. Alternate compute on Amazon EMR — We track customer behavior through website clickstream data. There are few metrics on conversion rate (customer applied for payment products after they saw the banner at different placements like payment checkout page etc), payment product application, website hit waterfall analysis, funnel reports which tracks the journey of a customer starting from seeing these adds, clicking to apply button, filling the application form to finally submitting the application. We get almost 10B website hits records on a given day. Before we generate any report, we look for cheaper storage for these massive and not frequently accessed datasets. We decided to choose S3 as a storage option and applied the transformations using Amazon EMR. With this approach, we made sure we do not fill up the database with massive cold data and at the same time we enable data access on S3 using Amazon Redshift Spectrum, which provided similar query performance. As Amazon Redshift is a columnar database and is exceptionally fast for any sort of aggregation if we choose fewer dimensional columns. We wanted similar performance for the data we stored on S3. We were able to do it using Amazon EMR and by changing the data format from TSV to ORC or Parquet. Every day, we created new partition data on S3 and refreshed the Amazon Redshift Spectrum table definition to include new partition data. These Spectrum table were accessed by business users for their one-time analysis using any Amazon Redshift SQL client or for scheduling any ETL pipeline.
  3. Publish the data to data warehouse datalake for non-payment users — We built payment specific datasets. For example, decline transaction behavior, wallet penetration and others. Sometimes non-payments business users are also interested in consuming these datasets. We publish these datasets to central data warehouse data lake for them. Additionally, payments application teams are the source for payment product application data. Data engineering team consumes these datasets, apply needed transformation and publish it both for payments and non-payments user through Amazon Redshift and the data lake.

Schema management

We have a prod schema which stores all production tables and only platform team has access to make any changes to it. We also provide payment product specific sandboxes which is accessible by product specific member. There is generic public schema for any payments data users. They can create, load, truncate/drop the tables in this schema.

Database and ETL lookup

Here are few fun facts about our Amazon Redshift database objects.

  • Number of Databases: 4
    • Staging Database DB1: ds2.8xlarge x 20 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
    • User Database DB2: ds2.8xlarge x 24 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
    • Platform Database DB3: ds2.8xlarge x 24 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
    • Reporting Database DB4: ds2.8xlarge x 4 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
  • Size of Databases:
    • Total memory: 17 TB
    • Total storage: 1.15 Petabytes
  • Number of tables:
    • Analytics prod db: 6500
    • Analytics staging db: 390

User cluster

Here are some stats around the database that is exposed to the users. It has both core tables and users are allowed to create their own tables based on their need. We have the mirror image of the same database, which hosts all of the tables, except user=created tables. Another database is used to run ETL platform related prod pipeline. Most of these tables have entire history, except the snapshot tables like clickstream datasets, which have been archived to S3.

Staging cluster

Here are some stats around the staging database. It is the landing zone for all the data coming from other teams or from the central data warehouse data lake. Table retention has been applied to all the tables as most of the ELT downstream jobs look for the last updated date, pulls just incremental data and store in user and replica databases.

Scheduled ETL and Query load on database

  • Number of daily extraction ETL jobs: 2943
  • Number of loading ETL jobs: 1655

  • Total daily load processed volume: 119 BN
  • Total daily loading runtime: 11,415 mins

  • Total daily data extraction volume: 166 BN
  • Total daily date extraction runtimes: 25,585 mins

Both scheduled and one-time query loads on the database

  • Daily query load on database by different database users.

Best practices

  1. Design tables with right sort keys and distribution key: Query performance is dependent on how much data it scans and if the joines are co-located join. Choosing right sort key make sure we do not scan the data which we do not need and selecting right distribution key makes sure the joining data is present on the same node and there is less movement of data over network resulting in better query performance. For more information, see Amazon Redshift Best Practices for Designing Tables.
  1. Refer to Amazon Redshift Best Practices for Designing Queries while writing the query.
  1. Change the loading strategy by splitting larger files into smaller files, use bulk loads instead serial inserts. For more information, see Amazon Redshift Best Practices for Loading Data.
  1. Configure the appropriate WLM setting to avoid system abuse by allocating right run-times, memory, priority queues, etc. For more information, see Tutorial: Configuring Workload Management (WLM) Queues to Improve Query Processing.
  1. Use Amazon Redshift advisor to identify potential tables needing compression, tables with missing stats, uncompressed data loads, and further fine tune your ETL pipelines. For more information, see Working Recommendations from Amazon Redshift Advisor.
  1. Identify the tables with most of the wasted space and vacuum them frequently. It releases the wasted space and, at the same time, increases the query performance. For more information, see Vacuuming Tables.
  1. Analyze the SQLs submitted to DB and identify the pattern on table usage and expensive joins. It helps data engineers build more denormalized tables by pre-joining these tables, and helping users access a single table, which is fast and efficient.

Conclusion and Next Steps

Amazon Redshift clusters with total capacity of 1.15 PB, ~6500 tables, 4500 scheduled ETL runs, 13,000 ETL queries a day, is solving almost all the ETL need of business users in payments team. However, the recent volume growth is filling up our dbs more than we expected, Next step could be choosing cheaper storage option by building a datalake on S3 and access them using Amazon Redshift spectrum without even bothering about scaling challenges and with seamless user experience.


About the authors

Bishwabandhu Newton is a senior data engineer with Amazon Consumer Payments team. He has over 12 years of data warehousing experience, with 9+ years at Amazon.com.




Matt Scaer is a Principal Data Warehousing Specialist Solution Architect, with over 20 years of data warehousing experience, with 11+ years at both AWS and Amazon.com.

Scale your Amazon Redshift clusters up and down in minutes to get the performance you need, when you need it

Post Syndicated from Ayush Jain original https://aws.amazon.com/blogs/big-data/scale-your-amazon-redshift-clusters-up-and-down-in-minutes-to-get-the-performance-you-need-when-you-need-it/

Amazon Redshift is the cloud data warehouse of choice for organizations of all sizes—from fast-growing technology companies such as Turo and Yelp to Fortune 500 companies such as 21st Century Fox and Johnson & Johnson. With quickly expanding use cases, data sizes, and analyst populations, these customers have a critical need for scalable data warehouses.

Since we launched Amazon Redshift, our customers have grown with us. Working closely with them, we have learned how their needs evolve as their data scales. We commonly encounter scenarios like the following for data analytics:

  • A US-based retail company runs a large number of scheduled queries and complex BI reports. Their Amazon Redshift usage peaks between 8 AM and 6 PM, when their data scientists and analysts run heavy workloads. At night, they might have an occasional user who queries the data and generates a small report. Consequently, they don’t need the same cluster capacity at night as they do during the day.
  • A healthcare consulting company is rapidly growing their Data as a Service (DaaS) business. They want to quickly create a duplicate environment and provide the cluster endpoint to their clients. After the duplicate cluster is created, it needs to be right-sized quickly based on the cost and performance requirements of the client.
  • An IoT service provider is on a rapid growth trajectory. Whenever a major event takes place, their sensors send in terabytes of new data that needs to be ingested into Amazon Redshift and analyzed soon after it arrives.

When database administrators (DBAs) don’t have the nimbleness to react to these scenarios, the analysts can experience longer response times for mission-critical workloads. Or, they might be locked out altogether if the data warehouse is down for a resize. The DBAs, in turn, cannot support Service Level Agreements (SLAs) that they have set with their business stakeholders.

With Amazon Redshift, you can already scale quickly in three ways. First, you can query data in your Amazon S3 data lakes in place using Amazon Redshift Spectrum, without needing to load it into the cluster. This flexibility lets you analyze growing data volumes without waiting for extract, transform, and load (ETL) jobs or adding more storage capacity. Second, you can resize your Amazon Redshift clusters by adding more nodes or changing node types in just a few hours. During this time, analysts can continue to run read queries with no downtime. This gives you more agility compared to on-premises data warehouses that take days to scale. Third, you can spin up multiple Amazon Redshift clusters by quickly restoring data from a snapshot. This allows you to add compute resources that might be needed to support high concurrency.

Introducing elastic resize

We’re excited to introduce elastic resize, a new feature that enables you to add or remove nodes in an Amazon Redshift cluster in minutes. This further increases your agility to get better performance and more storage for demanding workloads, and to reduce cost during periods of low demand. You can resize manually from the AWS Management Console or programmatically with a simple API call.

With elastic resize, you can start small and scale up on-demand as your needs grow, as illustrated in the following diagram.

Amazon Redshift customers who have been previewing elastic resize before launch have immediately benefited by the scalability that it unlocks for them. Here is what some of our customers had to say about elastic resize:


Amazon Prime Video uses advanced data analytics to customize viewing recommendations and measure fan viewing experiences. “Redshift’s new elastic resize feature reduced our operational resizing time from 6 hours down to 15 minutes, allowing us to dynamically scale our infrastructure according to the diverse nature of our workloads and optimizing costs while maximizing performance.” Sergio Diaz Bautista, Data Engineer at Amazon Prime Video



Yelp uses Amazon Redshift to analyze mobile app usage data and ad data on customer cohorts, auctions, and ad metrics. “Yelp is at the forefront of using data analytics to drive business decisions and enhance its users’ experience. Using elastic resize, we can confidently optimize for the best performance and keep costs low by configuring the cluster to scale up whenever demand increases beyond the usual variability window and scale down during off-peak hours. The ability to scale our data warehouse containing hundreds of terabytes of data, in minutes, is amazing,” says Shahid Chohan, data architect at Yelp.com


“Coupang is disrupting how the world shops on phones. We cannot always predict analytical demand because of evolving business needs and ad hoc analyses that are unexpectedly required. With elastic resize, we can scale compute and storage quickly to finish large ETL jobs faster and serve more users querying the data,” says Hara Ketha, senior manager of data engineering at Coupang.


OLX uses Amazon Redshift to power personalization and relevance, run reporting, and generate actionable customer insights. “With OLX, millions of people across the world buy and sell from each other daily. Redshift has been at the core of advanced analytics and big data innovation at OLX. For our ever increasing data needs, we can now add nodes using elastic resize blazingly fast. It’s also easy to size down during periods of low activity in order to save costs. Thank you Redshift!” said Michał Adamkiewicz, data architect for the Europe data team at OLX.com

How elastic resize works

Elastic resize is fundamentally different from the existing classic resize operation available in Amazon Redshift. Unlike classic resize, which creates a new cluster and transfers data to it, elastic resize modifies the number of nodes in your existing cluster. It enables you to deploy the additional nodes in minutes with minimal disruption to ongoing read or write queries. Hence, you can size up your cluster quickly for faster performance and size down when the job is finished to save cost.

You can still use classic resize when you want to change the node type (for instance, if you are upgrading from DC1 to DC2 nodes). The following stages describe what happens behind the scenes when you trigger elastic resize.

Stage 1: Preparing for resize while the cluster is fully available

At the start of elastic resize, Amazon Redshift first updates the snapshot on Amazon S3 with the most recent data. Today, Amazon Redshift takes an automated snapshot every 8 hours or 5 GB of data change, whichever comes first. In addition, you can also take snapshots manually. Each snapshot is incremental to capture changes that occurred since the last automated or manual snapshot. While the first snapshot is taken, the cluster is fully available for read and write queries.

On the console and via the describe-clusters API operation, the cluster status shows as available, prep-for-resize during this stage. Additionally, an event notification (REDSHIFT-EVENT-3012) lets you know that a request for elastic resize has been received.

Stage 2: Resizing the cluster while the cluster is unavailable

This stage, when the resizing actually happens, takes just a few minutes to finish. As the resize begins, the existing connections are put on hold. No new connections are accepted until the resize finishes, and the cluster is unavailable for querying. Some of the held connections or queries might fail if they were part of ongoing transactions. New nodes are added (for scaling up) or removed (for scaling down) during this period. Another incremental backup is taken to account for any data updates made by users during stage 1.

On the console and using the describe-clusters API operation, the cluster status now shows as resizing. Additionally, an event notification (REDSHIFT-EVENT-3011) lets you know that elastic resize started on the cluster.

Stage 3: Data transferring while the cluster is fully available

After resizing is finished in stage 2, the cluster is fully available for read and write queries. Queries that were being held are queued for execution automatically. During this stage, data is transferred from Amazon S3 to the nodes. Because Amazon Redshift only subtracts or adds nodes, only a fraction of the data needs to be transferred. For example, if you resized a three-node dc2.8xlarge cluster to six nodes, only 50 percent of the data needs to be moved.

We also improved the rate at which we restore data blocks from S3 by 2x, cutting down the time it would have taken to finish stage 3. In addition, this improvement also makes restores from a snapshot faster.

Moreover, Amazon Redshift moves the data intelligently to minimize the impact of data transfer on queries during this stage. The most frequently accessed data blocks are moved first, followed by other blocks based on their access frequency. This results in most incoming queries finding the data blocks that they are trying to access on disk.

On the console and using the describe-clusters API operation, the cluster status now shows as available, transferring data. Additionally, an event notification (REDSHIFT-EVENT-3534) lets you know that elastic resize has finished and the cluster is available for reads and writes.

After the data transfer is finished, the cluster status on the console and via the describe-clusters API operation once again shows as available.

Elastic resize constraints

Before deciding whether elastic resize is appropriate for your use case, consider the following constraints:

  • The new node configuration must have enough storage for existing data. Even when you add nodes, your new configuration might not have enough storage because of the way that data is redistributed.
  • You can resize only by a factor of 2, up or down, for dc2.large or ds2.xlarge node types. For example, you can resize a four-node cluster up to eight nodes or down to two nodes. This limitation exists to avoid data skew between nodes caused by an uneven distribution of slices.
  • For dc2.8xlarge or ds2.8xlarge node types, you can resize up to two times the original node count, or down to one-half the original node count. For example, you can resize a 16-node cluster to any size up to 32 nodes, or any size down to 8 nodes. This limitation exists to avoid data skew between nodes caused by an uneven distribution of slices.


Elastic resize significantly improves your ability to scale your Amazon Redshift clusters on-demand. Together with features such as Amazon Redshift Spectrum, it enables you to independently scale storage and compute so that you can adapt to the evolving analytical needs of your business.

To learn more about elastic resize and understand how you can size your clusters, watch the AWS Online Tech Talk: Best Practices for Scaling Amazon Redshift. You can also refer to the documentation for Resizing Clusters in the Amazon Redshift Cluster Management Guide.

About the Authors

Ayush Jain is a Product Marketer at Amazon Web Services. He loves growing cloud services and helping customers get more value from the cloud deployments. He has several years of experience in Software Development, Product Management and Product Marketing in developer and data services.




Himanshu Raja is a Senior Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.

Create cross-account and cross-region AWS Glue connections

Post Syndicated from Pankaj Malhotra original https://aws.amazon.com/blogs/big-data/create-cross-account-and-cross-region-aws-glue-connections/

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. AWS Glue uses connections to access certain types of source and target data stores, as described in the AWS Glue documentation.

By default, you can use AWS Glue to create connections to data stores in the same AWS account and AWS Region as the one where you have AWS Glue resources. In this blog post, we describe how to access data stores in an account or AWS Region different from the one where you have AWS Glue resources.

AWS Glue connections

AWS Glue uses a connection to crawl and catalog a data store’s metadata in the AWS Glue Data Catalog, as the documentation describes. AWS Glue ETL jobs also use connections to connect to source and target data stores. AWS Glue supports connections to Amazon Redshift, Amazon RDS, and JDBC data stores.

A connection contains the properties needed by AWS Glue to access a data store. These properties might include connection information such as user name and password, data store subnet IDs, and security groups.

If the data store is located inside an Amazon VPC, AWS Glue uses the VPC subnet ID and security group ID connection properties to set up elastic network interfaces in the VPC containing the data store. Doing this enables ETL jobs and crawlers to connect securely to the data store in the VPC.

AWS Glue can create this elastic network interface setup if the VPC containing the data store is in the same account and AWS Region as the AWS Glue resources. The security groups specified in a connection’s properties are applied on each of the network interfaces. The security group rules and network ACLs associated with the subnet control network traffic through the subnet. Correct rules for allowing outbound traffic through the subnet ensure that AWS Glue can establish network connectivity with all subnets in the VPC containing the data store, and therefore access the source or target data store.

VPC components can be interlinked only if they are present in the same AWS Region. Therefore, AWS Glue cannot create the elastic network interfaces inside a VPC in another region. If the VPC containing the data store is in another region, you have to add the network routes and create additional network interfaces which allow network interfaces set up by AWS Glue to establish network connectivity with the data store.

In this blog post, we describe how to configure the networking routes and interfaces to give AWS Glue access to a data store in an AWS Region different from the one with your AWS Glue resources. In our example, we connect AWS Glue, located in Region A, to an Amazon Redshift data warehouse located in Region B.

Note: The examples here assume that the Amazon Redshift cluster is in a different AWS Region, but belongs to the same account. The same setup and instructions are also valid for an Amazon Redshift cluster in a different account.

Setting up VPC components for AWS Glue

AWS Glue requires a VPC with networking routes to the data stores to which it connects. In our solution, the security groups and route tables are configured to enable elastic network interfaces set up by AWS Glue in a private subnet to reach the internet or connect to data stores outside the VPC. The following diagram shows the necessary components and the network traffic flow.

Required components for VPC setup:

  • AWS Glue resources in a private subnet in Region A.
  • A NAT gateway with an Elastic IP address attached to it in a public subnet in Region A.
  • A private route table containing a route allowing outbound network traffic from the private subnet to pass through the NAT gateway.
  • An internet gateway in Region A.
  • A public route table with a route allowing outbound network traffic from the public subnet to pass through the internet gateway.

Note: We must update the default security group of the VPC to include a self-referencing inbound rule and an outbound rule to allow all traffic from all ports. Later in the example, we attach this security group to an AWS Glue connection to let network interfaces set up by AWS Glue communicate with each other within a private subnet.

Network traffic flow through the components:

Outbound network traffic from AWS Glue resources in the private subnet to any destination or data store outside the private subnet is routed through the NAT gateway.

The NAT gateway is present in a public subnet and has an associated Elastic IP address. It forwards network traffic from AWS Glue resources to internet by using an internet gateway.

When AWS Glue tries to establish a connection with a data store outside of the private subnet, the incoming network traffic on the data store side appears to come from the NAT Gateway.

On the data store side, you allow the data store or its security group to accept incoming network traffic from the Elastic IP address attached to the NAT gateway. This is shown in the section “Allow Amazon Redshift to accept network traffic from AWS Glue,” following.

Creating VPC components using AWS CloudFormation

You can automate the creation of a VPC and all the components described preceding using the vpc_setup.yaml CloudFormation template, hosted on GitHub. Follow these step-by-step instructions to create the stack in your AWS account:

  1. Deploy the stack in the US Oregon (us-west-2) Region:

Note: In this example, we create the AWS Glue resources and connection in the us-west-2 Region. You can change this to the AWS Region where you have your AWS Glue connection and resources.

You are directed to the AWS CloudFormation console, with the stack name and URL template fields pre-filled.

  1. Choose Next.
  2. Use the default IP ranges and choose Next.
  3. Skip this step and choose Next.
  4. Review and choose Create.
  5. Wait for stack creation to complete. After completion, all the VPC components and necessary setup required are created.
  6. Navigate to the VPC console and copy the Elastic IP address for the newly created NAT.
    Note: This IP address is used for outbound network flow from AWS Glue resources and so should be whitelisted on the data store side. For more detail, see “Allow Amazon Redshift to accept network traffic from AWS Glue,” following.

Before creating and testing an AWS Glue connection to your data store, you need an IAM role that lets AWS Glue access the VPC components that you just created.

Creating an IAM role to let AWS Glue access Amazon VPC components

For this example, we create a role called TestAWSGlueConnectionIAMRole with a managed IAM policy AWSGlueServiceRole attached to it.

  1. Choose the Roles tab from the AWS Identity and Access Management (IAM) console.
  2. Choose Create role and select AWS Glue as a trusted entity.

  1. Attach an IAM policy to the role that allows AWS Glue to access the VPC components. In this example, we are using the default AWSGlueServiceRole policy, which contains all the required permissions for the setup.

  1. We name the role TestAWSGlueConnectionRole.

Note: The default GlueServiceRole policy that we attached to our custom role TestAWSGlueConnectionIAMRole has permissions for accessing VPC components. If you are using a custom policy instead of the default one, it should also contain the same permissions to be able to access VPC components.

Creating an Amazon Redshift cluster using AWS CloudFormation

For this example, we create a sample Amazon Redshift cluster in a VPC in the US N. Virginia (us-east-1) Region. Follow these step-by-step instructions to create the stack in your AWS account:

  1. Navigate to the CloudFormation console in region us-east-1 and create a new stack using this CloudFormation template, described in the documentation.
  2. Provide the configuration for the cluster and MasterUsername and MasterUserPassword. MasterUserPassword must follow the following constraints:
  • It must be 8–64 characters in length.
  • It must contain at least one uppercase letter, one lowercase letter, and one number.
  • It can use any printable ASCII characters (ASCII code 33–126) except ‘ (single quote), ” (double quote), :, \, /, @, or space.

  1. Choose Next and proceed with the stack creation.
  2. Review the configuration and choose Create.

  1. Wait for stack creation to complete, which can take a few minutes.

  1. Navigate to the Amazon Redshift console and choose the cluster name to see the cluster properties.

  1. Note the JDBC URL for the cluster and the attached security group for later use.

Note: We created a sample Amazon Redshift cluster in a public subnet present inside a VPC in Region B. We recommend that you follow the best practices for increased security and availability while setting up a new Amazon Redshift cluster, as shown in our samples on GitHub.

Creating an AWS Glue connection

Now you have the required VPC setup, Amazon Redshift cluster, and IAM role in place. Next, you can create an AWS Glue connection and test it as follows:

  1. Choose Add Connection under the Connections tab in AWS Glue console. The AWS Region in which we are creating this connection is the same as for our VPC setup, that is US Oregon (us-west-2).

  1. Choose a JDBC connection type. You can choose to enforce JDBC SSL or not, depending on the configuration for your data store.

  1. Add the connection-specific configuration. Note the URL for our Amazon Redshift cluster. It shows that the Amazon Redshift cluster is present in us-east-1.

Note: We use the VPC (VPCForGlue) and the private subnet (GluePrivateSubnet) we created for this connection. For security groups, we use the default security group for the VPC. This security group has a self-referencing inbound rule and an outbound rule that allows all traffic.

  1. Review configuration and choose Finish.

The AWS Glue console should now show that the connection was created successfully.

Note: Completing this step just means that an AWS Glue connection was created. It doesn’t guarantee that AWS Glue can actually connect to your data store. Before we test the connection, we also need to allow Amazon Redshift to accept network traffic coming from AWS Glue.

Allow Amazon Redshift to accept network traffic from AWS Glue

The Amazon Redshift cluster in a different AWS Region (us-east-1) from AWS Glue must allow incoming network traffic from AWS Glue.

For this, we update the security group attached to the Amazon Redshift cluster, and whitelist the Elastic IP address attached to the NAT gateway for the AWS Glue VPC.

Testing the AWS Glue connection

As a best practice, before you use a data store connection in an ETL job, choose Test connection. AWS Glue uses the parameters in your connection to confirm that it can access your data store and reports back any errors.

  1. Select the connection TestAWSGlueConnection that we just created and choose Test Connection.
  2. Select the TestAWSGlueConnectionIAMRole that we created for allowing AWS Glue resources to access VPC components.

  1. After you choose the Test connection button in the previous step, it can take a few seconds for AWS Glue to successfully connect to the data store. When it does, the console shows a message saying it “connected successfully to your instance.”


By creating a VPC setup similar to the one we describe, you can let AWS Glue connect to a data store in a different account or AWS Region. By doing this, you establish network connectivity between AWS Glue resources and your data store. You can now use this AWS Glue connection in ETL jobs and AWS Glue crawlers to connect with the data store.

If you have questions or suggestions, please leave a comment following.

Additional Reading

If you found this post helpful, be sure to check out Connecting to and running ETL jobs across multiple VPCs using a dedicated AWS Glue VPC, and How to access and analyze on-premises data stores using AWS Glue.


About the Author

Pankaj Malhotra is a Software Development Engineer at Amazon Web Services. He enjoys solving problems related to cloud infrastructure and distributed systems. He specializes in developing multi-regional, resilient services using serverless technologies.




Connecting to and running ETL jobs across multiple VPCs using a dedicated AWS Glue VPC

Post Syndicated from Nivas Shankar original https://aws.amazon.com/blogs/big-data/connecting-to-and-running-etl-jobs-across-multiple-vpcs-using-a-dedicated-aws-glue-vpc/

Many organizations use a setup that includes multiple VPCs based on the Amazon VPC service, with databases isolated in separate VPCs for security, auditing, and compliance purposes. This blog post shows how you can use AWS Glue to perform extract, transform, load (ETL) and crawler operations for databases located in multiple VPCs.

The solution presented here uses a dedicated AWS Glue VPC and subnet to perform the following operations on databases located in different VPCs:

  • Scenario 1: Ingest data from an Amazon RDS for MySQL database, transform it in AWS Glue, and output the results to an Amazon Redshift data warehouse.
  • Scenario 2: Ingest data from an Amazon RDS for MySQL database, transform it in AWS Glue, and output the results to an Amazon RDS for PostgreSQL database.

In this blog post, we’ll go through the steps needed to build an ETL pipeline that consumes from one source in one VPC and outputs it to another source in a different VPC. We’ll set up in multiple VPCs to reproduce a situation where your database instances are in multiple VPCs for isolation related to security, audit, or other purposes.

For this solution, we create a VPC dedicated to AWS Glue. Next, we set up VPC peering between the AWS Glue VPC and all of the other database VPCs. Then we configure an Amazon S3 endpoint, route tables, security groups, and IAM so that AWS Glue can function properly. Lastly, we create AWS Glue connections and an AWS Glue job to perform the task at hand.

Step 1: Set up a VPC

To simulate these scenarios, we create four VPCs with their respective IPv4 CIDR ranges. (Note: CIDR ranges can’t overlap when you use VPC peering.)

VPC 1Amazon Redshift172.31.0.0/16
VPC 2Amazon RDS for MySQL172.32.0.0/16
VPC 3Amazon RDS for PostgreSQL172.33.0.0/16
VPC 4AWS Glue172.30.0.0/16

Key configuration notes:

  1. The AWS Glue VPC needs at least one private subnet for AWS Glue to use.
  2. Ensure that DNS hostnames are enabled for all of your VPCs (unless you plan to refer to your databases by IP address later on, which isn’t recommended).

Step 2: Set up a VPC peering connection

Next, we peer our VPCs together to ensure that AWS Glue can communicate with all of the database targets and sources. This approach is necessary because AWS Glue resources are created with private addresses only. Thus, they can’t use an internet gateway to communicate with public addresses, such as public database endpoints. If your database endpoints are public, you can alternatively use a network address translation (NAT) gateway with AWS Glue rather than peer across VPCs.

Create the following peering connections.

Peer 1172.30.0.0/16- VPC 4172.31.0.0/16- VPC 1
Peer 2172.30.0.0/16- VPC 4172.32.0.0/16 -VPC 2
Peer 3172.30.0.0/16- VPC 4172.33.0.0/16- VPC 3

These peering connections can be across separate AWS Regions if needed. The database VPCs are not peered together; they are all peered with the AWS Glue VPC instead. We do this because AWS Glue connects to each database from its own VPC. The databases don’t connect to each other.

Key configuration notes:

  1. Create a VPC peering connection, as described in the Amazon VPC documentation. Select the AWS Glue VPC as the requester and the VPC for your database as the accepter.
  2. Accept the VPC peering request. If you are peering to a different AWS Region, switch to that AWS Region to accept the request.

Important: Enable Domain Name Service (DNS) settings for each of the peering connections. Doing this ensures that AWS Glue can retrieve the private IP address of your database endpoints. Otherwise, AWS Glue resolves your database endpoints to public IP addresses. AWS Glue can’t connect to public IP addresses without a NAT gateway.

Step 3: Create an Amazon S3 endpoint for the AWS Glue subnet

We need to add an Amazon S3 endpoint to the AWS Glue VPC (VPC 4). During setup, associate the endpoint with the route table that your private subnet uses. For more details on creating an S3 endpoint for AWS Glue, see Amazon VPC Endpoints for Amazon S3 in the AWS Glue documentation.

AWS Glue uses S3 to store your scripts and temporary data to load into Amazon Redshift.

Step 4: Create a route table configuration

Add the following routes to the route tables used by the respective services’ subnets. These routes are configured along with existing settings.

VPC 4—AWS GlueDestinationTarget
Route table172.33.0.0/16- VPC 3Peer 3 VPC 1Peer 1 VPC 2Peer 2


VPC 1—Amazon RedshiftDestinationTarget
Route table172.30.0.0/16- VPC 4Peer 1


VPC 2—Amazon RDS MySQLDestinationTarget
Route table172.30.0.0/16- VPC 4Peer 2


VPC 3—Amazon RDS PostgreSQLDestinationTarget
Route table172.30.0.0/16- VPC 4Peer 3

Key configuration notes:

  • The route table for the AWS Glue VPC has peering connections to all VPCs. It has these so that AWS Glue can initiate connections to all of the databases.
  • All of the database VPCs have a peering connection back to the AWS Glue VPC. They have these connections to allow return traffic to reach AWS Glue.
  • Ensure that your S3 endpoint is present in the route table for the AWS Glue VPC.

Step 5: Update the database security groups

Each database’s security group must allow traffic to its listening port (3306, 5432, 5439, and so on) from the AWS Glue VPC for AWS Glue to be able to connect to it. It’s also a good idea to restrict the range of source IP addresses as much as possible.

There are two ways to accomplish this. If your AWS Glue job will be in the same AWS Region as the resource, you can define the source as the security group that you use for AWS Glue. If you are using AWS Glue to connect across AWS Regions, specify the IP range from the private subnet in the AWS Glue VPC instead. The examples following use a security group as our AWS Glue job, and data sources are all in the same AWS Region.

In addition to configuring the database’s security groups, AWS Glue requires a special security group that allows all inbound traffic from itself. Because it isn’t secure to allow traffic from, we create a self-referencing rule that simply allows all traffic originating from the security group. You can create a new security group for this purpose, or you can modify an existing security group. In the example following, we create a new security group to use later when AWS Glue connections are created.

The security group Amazon RDS for MySQL needs to allow traffic from AWS Glue:

Amazon RDS for PostgreSQL allows traffic to its listening port from the same:

Amazon Redshift does it as so:

AWS Glue does it as so:

Step 6: Set up IAM

Make sure that you have an AWS Glue IAM role with access to Amazon S3. You might want to provide your own policy for access to specific Amazon S3 resources. Data sources require s3:ListBucket and s3:GetObject permissions. Data targets require s3:ListBucket, s3:PutObject, and s3:DeleteObject permissions. For more information on creating an Amazon S3 policy for your resources, see Policies and Permissions in the IAM documentation.

The role should look like this:

Or you can create an S3 policy that’s more restricted to suit your use case.

Step 7: Set up an AWS Glue connection

The Amazon RDS for MySQL connection in AWS Glue should look like this:

The Amazon Redshift connection should look like this:

The Amazon RDS for PostgreSQL connection should look like this:

Step 8: Set up an AWS Glue job

Key configuration notes:

  1. Create a crawler to import table metadata from the source database (Amazon RDS for MySQL) into the AWS Glue Data Catalog. The scenario includes a database in the catalog named gluedb, to which the crawler adds the sample tables from the source Amazon RDS for MySQL database.
  2. Use either the source connection or destination connection to create a sample job as shown following. (This step is required for the AWS Glue job to establish a network connection and create the necessary elastic network interfaces with the databases’ VPCs and peered connections.)
  3. This scenario uses pyspark code and performs the load operation from Amazon RDS for MySQL to Amazon Redshift. The ingest from Amazon RDS for MySQL to Amazon RDS for PostgreSQL includes a similar job.
  4. After running the job, verify that the table exists in the target database and that the counts match.

The following screenshots show the steps to create a job in the AWS Glue Management Console.

Following are some of examples of loading data from source tables to target instances. These are simple one-to-one mappings, with no transformations applied. Notice that the data sources and data sink (target) connection configuration access multiple VPCs from a single AWS Glue job.

Sample script 1 (Amazon RDS for MySQL to Amazon Redshift)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource = glueContext.create_dynamic_frame.from_catalog(database = "gluedb", table_name = "mysqldb_events", transformation_ctx = "datasource")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource, catalog_connection = "Redshift", connection_options = {"dbtable": "mysqldb_events", "database": "dmartblog"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink")


Sample script 2:  Amazon RDS for MySQL to Amazon RDS for PostgreSQL (can also change with other RDS endpoint)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource = glueContext.create_dynamic_frame.from_catalog(database = "gluedb", table_name = "mysqldb_events", transformation_ctx = "datasource")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "vpc-pgsql", connection_options = {"dbtable": "mysqldb_events", "database": "mypgsql"}, transformation_ctx = "datasink")



In this blog post, you learn how to configure AWS Glue to run in a separate VPC so that it can execute jobs for databases located in multiple VPCs.

The benefits of doing this include the following:

  • A separate VPC and dedicated pool on the running AWS Glue job, isolated from database and compute nodes.
  • Dedicated ETL developer access to a single VPC for better security control and provisioning.


Additional Reading

If you found this post useful, be sure to check out Restrict access to your AWS Glue Data Catalog with resource-level IAM permissions and resource-based policies, and Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production.


About the Author

Nivas Shankar is a Senior Big Data Consultant at Amazon Web Services. He helps and works closely with enterprise customers building big data applications on the AWS platform. He holds a Masters degree in physics and is highly passionate about theoretical physics concepts. He enjoys spending time with his wife and two adorable kids. In his spare time, he takes his kids to tennis and football practice.



Ian Eberhart is a Cloud Support Engineer on the Big Data team for AWS Premium Support. He works with customers on a daily basis to find solutions for moving and sorting their data on the AWS platform. In his spare time, Ian enjoys seeing independent and weird movies, riding his bike, and hiking in the mountains.


Chasing earthquakes: How to prepare an unstructured dataset for visualization via ETL processing with Amazon Redshift

Post Syndicated from Ian Funnell original https://aws.amazon.com/blogs/big-data/chasing-earthquakes-how-to-prepare-an-unstructured-dataset-for-visualization-via-etl-processing-with-amazon-redshift/

As organizations expand analytics practices and hire data scientists and other specialized roles, big data pipelines are growing increasingly complex. Sophisticated models are being built using the troves of data being collected every second.

The bottleneck today is often not the know-how of analytical techniques. Rather, it’s the difficulty of building and maintaining ETL (extract, transform, and load) jobs using tools that might be unsuitable for the cloud.

In this post, I demonstrate a solution to this challenge. I start with a noisy semistructured dataset of seismic events, spanning several years and recorded at different locations across the globe. I set out to obtain broad insights about the nature of the rocks forming the Earth’s surface itself—the tectonic plate structure—to be visualized using the mapping capability in Amazon QuickSight.

To accomplish this, I use several AWS services, orchestrated together using Matillion ETL for Amazon Redshift:

Tectonic plate structure context

An earthquake is caused by a buildup of pressure that gets suddenly released. Earthquakes tend to be more severe at the boundaries of destructive tectonic plates. These boundaries are formed when a heavier and denser oceanic plate collides with a lighter continental plate, or when two oceanic plates collide. Due to the difference in density, the oceanic lithosphere is pushed underneath the continental plate, forming what is called a subduction zone. (See the following diagram.) In subduction zones, earthquakes can occur at depths as great as 700 kilometers.

Photography by KDS4444 [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons

For our analysis, we ultimately want to visualize the depth of an earthquake focus to locate subduction zones, and therefore find places on earth with the most severe earthquakes.

Seismographic data source

The data source is from the International Federation of Digital Seismograph Networks (FDSN). The event data is in JSON format (from the European Mediterranean Seismological Centre, or EMSC). An external process accumulates files daily into an Amazon S3 bucket, as shown following.

Each individual file contains all the seismic events for one day—usually several hundred—in an embedded array named “features,” as shown in the following example:

  "type": "FeatureCollection",
  "metadata": {
    "totalCount": 103
  "features": [
      "geometry": {
        "type": "Point",
        "coordinates": [26.76, 45.77, -140]
      "type": "Feature",
      "id": "20180302_0000103",
      "properties": {
        "lastupdate": "2018-03-02T23:27:00.0Z",
        "lon": 26.76, "lat": 45.77, "depth": 140,
        "mag": 3.7,
        "time": "2018-03-02T23:22:52.1Z",
        "flynn_region": "ROMANIA"
      "geometry": {
        "type": "Point",

Architecture overview

Athena reads and flattens the S3 data and makes it available for Matillion ETL to load into Amazon Redshift via JDBC. Matillion orchestrates this data movement, and it also provides a graphical framework to design and build the more complex data enrichment and aggregation steps to be performed by Amazon Redshift. Finally, the prepared data is queried by Amazon QuickSight for visualization.

Amazon Athena setup

You can use Athena to query data in S3 using standard SQL, via a serverless infrastructure that is managed entirely by AWS on your behalf. Before you can query your data, start by creating an external table. By doing this, you are defining the schema to apply to the data when it is being queried.

You can choose to use an AWS Glue crawler to assist in automatically discovering the schema and format of your source data files.

The following is the CREATE TABLE statement that you can copy and paste into the Athena console to create the schema needed to query the seismic data. Make sure that you substitute the correct S3 path to your seismic data in the LOCATION field of the statement.

  `type` string COMMENT 'from deserializer', 
  `metadata` struct<totalcount:int> COMMENT 'from deserializer', 
  `features` array<struct<geometry:struct<type:string,coordinates:array<double>>,type:string,id:string,properties:struct<lastupdate:string,magtype:string,evtype:string,lon:double,auth:string,lat:double,depth:double,unid:string,mag:double,time:string,source_id:string,source_catalog:string,flynn_region:string>>> COMMENT 'from deserializer')
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-bucket/SeismicPortal'

After the table is created, you are ready to query it. Athena uses the in-memory, distributed SQL engine called Apache Presto. It provides the ability to unnest arrays, which you use next.

Transfer to Amazon Redshift

The embedded array in every source record gets flattened out and converted to individual records during the JDBC export (download the .jar file) into Amazon Redshift. You use a Matillion ETL Database Query component to assist with the data transfer during this step, as shown in the following image.

This component simplifies ETL by automating the following steps:

Runs the SQL SELECT statement (shown in the following example).

  1. Streams query results across the network from Athena and into temporary storage in S3.
  2. Performs a bulk data load into Amazon Redshift.

Athena executes the following SQL statement:

SELECT f.id,
	   f.properties.time AS event_time,
FROM “seismic”.”sp_events”
CROSS JOIN UNNEST (features) as t(f)

The CROSS JOIN UNNEST syntax flattens the embedded array, generating hundreds of individual event records per day.

Now that the data has been copied and flattened into individual event records (shown in the following image), it’s ready for enrichment and aggregation.

Data enrichment

Earthquakes occur along a continuous range of spatial coordinates. In order to aggregate them, as we’ll be doing very soon, it’s necessary to first group them together. A convenient method is to assign every event into a Universal Transverse Mercator (UTM) zone. These zones are six-degree bands of longitudes that convert the spherical latitude/longitude coordinates into a 2D representation. Performing this conversion provides good granularity for visualization later.

The calculation to convert a spherical longitude/latitude coordinate into a two-dimensional UTM coordinate is complex. It can be performed ideally using an Amazon Redshift user-defined function (UDF). I chose a UDF for the ability to invoke it, via a Matillion component, in the next step.

CREATE OR REPLACE FUNCTION f_ll_utm (Lat float, Long float)
AS $$
From math import pi, sin, cos, tan, sqrt

_deg2rad = pi / 180.0
_rad2deg = 180.0 / pi

_EquatorialRadius = 1
_eccentricitySquared = 2
_ellipsoid = [ “WGS-84”, 6378137, 0.00669438]

The UDF has to return three pieces of information:

  • UTM Zone code
  • Easting (x-axis measurement in meters)
  • Northing (ditto, for the y-axis)

A scalar UDF can only return a single value. Therefore the three results were returned as a pipe-delimited string, in which the three values are pipe-separated:

To bring the values out into individual fields, the UDF is first invoked using a Matillion ETL Calculator component, followed by a field splitter and a Calculator to perform data type conversion and rounding.

Data aggregation

To reiterate, we’re interested in the depth of earthquake focus specifically on destructive plate boundaries. Knowing the depth helps us estimate the potential severity of earthquakes.

We need to find the average event depth within each UTM zone, in the expectation that a spatial pattern will appear that will highlight the subduction zones.

The last three steps in the Matillion transformation (shown in the following diagram) perform the necessary aggregation, add a depth quartile, and create an output table from the resulting data.

The ”Aggregate to UTM ref” step gets Amazon Redshift to perform a GROUP BY function in SQL, which approximates every event to the appropriate UTM zone. While doing this aggregation, you simultaneously do the following:

  • Count the events (which determines the size of the visual representation).
  • Find the average depth (which determines the color of the visual representation).
  • Determine the average latitude and longitude (which approximates to the center of the UTM zone, and determines the position of the visual representation).

The following image shows the aggregation type for each column:

Average depth is a useful measure, but to maximize the visual impact of the final presentation, we also take the opportunity to rank the results into quartiles. This allows the zones with the deepest quartile to stand out consistently on the map.

NTILE(4) OVER (ORDER BY "avg_depth")

Amazon Redshift is great at performing this type of analytics, which is delivered inside another Matillion ETL Calculator component.

The Recreate Output step materializes the dataset into an Amazon Redshift table, ready for Amazon QuickSight to visualize.

Amazon QuickSight visualization

The Amazon QuickSight “points on map” visualization is perfect for this 2D rendering. The values for the field wells come straight from the aggregated data in Amazon Redshift:

  • Geospatial — the average lat/long per UTM grid.
  • Size — the record count, in other words, the number of events in that UTM zone.
  • Color — the Depth Ntile, with the fourth quartile in pink.

The resulting map shows the global subduction zones highlighted clearly in pink, these being the areas with the deepest earthquake’s focus on average.

Recap and summary

In this post, I used seismological data as an example to explore challenges around the visualization of unstructured data and to provide best practices. I suggested a way to overcome these challenges with an architecture that is also applicable for datasets from a wide array of sources, beyond geology. I then explored how to orchestrate activities of different data processing tasks between S3, Athena, and Amazon Redshift using Matillion ETL for Amazon Redshift.

If you’re trying to solve similar analytics challenges and want to see how Amazon Redshift and Matillion can help, launch a 14 day free trial of Matillion ETL for Amazon Redshift on the AWS Marketplace or schedule a demo today. If you have questions or suggestions, please comment below.

Additional Reading

If you found this post helpful, be sure to check out Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift, and Orchestrate multiple ETL jobs using AWS Step Functions and AWS Lambda.


About the Author

Ian Funnell, a lead solution architect at Matillion, is a seasoned cloud and data warehousing specialist. Ian is dedicated to supporting customers in driving their data transformation forward and solving their deepest technical challenges. Ian’s has 25+ years experience in the tech industry.




Performance matters: Amazon Redshift is now up to 3.5x faster for real-world workloads

Post Syndicated from Ayush Jain original https://aws.amazon.com/blogs/big-data/performance-matters-amazon-redshift-is-now-up-to-3-5x-faster-for-real-world-workloads/

Since we launched Amazon Redshift, thousands of customers have trusted us to get uncompromising speed for their most complex analytical workloads. Over the course of 2017, our customers benefited from a 3x to 5x performance gain, resulting from short query acceleration, result caching, late materialization, and many other under-the-hood improvements. In this post, we highlight recent improvements to Amazon Redshift and how our continued focus on performance enhancements is benefiting customers. We also discuss performance testing derived from industry-standard benchmarks that help us measure the impact of these ongoing improvements.

Recent performance improvements

With the largest number of data warehousing deployments in the cloud, we have the ability to analyze usage patterns across a variety of analytical workloads and uncover opportunities to improve performance. We leverage these insights to deliver improvements that seamlessly benefit thousands of customers. Major improvements in performance over the past six months include the following:

  • Improved resource management for memory-intensive queries: Amazon Redshift improved how joins and aggregations consume and reserve memory. This improved cache efficiency for the majority of the hash tables and reduced spilling for memory-intensive joins and aggregations by up to 1.6x.
  • Improved performance for commits: As a central component of write transactions, commit has a direct impact on the performance of data update and data ingestion workloads, such as ETL (extract, transform, and load) jobs. Since November 2017, we’ve delivered a series of commit performance optimizations such as batching multiple commits in a single operation, improved usage of commit locks, and a locality-aware metadata defragmenter. These and other related optimizations have resulted in a 4x commit time reduction on average for HDD-based clusters. For heavy transactions (the top 5 percent of commit operations in Amazon Redshift), the delivered optimizations resulted in a 7.5x improvement.
  • Improved performance for repeated queries: With Amazon Redshift’s result caching, dashboards, visualization, and business intelligence (BI) tools that execute queries repeatedly now see a significant boost in performance. In addition, result caching frees up resources that can improve the performance of all other queries.
  • Query processing improvements: Amazon Redshift now performs 2x–6x faster for scenarios such as repeated subqueries, advanced analytics functions with predicates, and complex query plans by eliminating duplicate work and streamlining steps.
  • Faster string manipulation: Amazon Redshift yields 5x better performance for frequently used string functions because of more efficient code generation techniques.

We’ve also complemented these out-of-the-box improvements with tailored recommendations to help you get better performance at a lower cost with Amazon Redshift Advisor. Advisor has already provided close to 50,000 recommendations since it launched in July 2018.

All of these optimizations have transparently boosted customers’ ability to get faster insights from their AWS analytics platform and saved thousands of hours of execution time on a daily basis. This applies to even the largest deployments, where customers have multiple petabytes of data in Redshift clusters, and seamless access to even larger data volumes in their Amazon S3 data lakes with Amazon Redshift Spectrum. “Redshift’s query performance and scalability has been increasing, even though our data has grown.” said Minero Aoki, Senior Data Engineer, Cookpad Inc. “In the last 10 months, we have seen commit performance increase by 500% without any increase in cost.”

Using benchmarks to measure success

To measure the impact of these ongoing improvements, we measure performance on a nightly basis and run queries derived from industry-standard benchmarks such as TPC-DS. We also occasionally benchmark Amazon Redshift against other data warehouse services. We set up these measurements to reflect our customers’ real-world usage, as highlighted earlier. This enables us to accurately gauge whether Amazon Redshift is getting better with each release, which happens every two weeks.

Comparing Amazon Redshift releases over the past few months, we observed that Amazon Redshift is now 3.5x faster versus six months ago, running all 99 queries derived from the TPC-DS benchmark. This is shown in the following chart.

Note: We used a Cloud DW benchmark derived from TPC-DS for this study. As such, the Cloud DW benchmark is not comparable to published TPC-DS results. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council.

For this post, we also compared the latest Amazon Redshift release with Microsoft Azure SQL Data Warehouse using the Cloud DW benchmark derived from TPC-DS. Queries ran against a 3 TB dataset on a 4-node cluster on both services, using dc2.8xlarge for Amazon Redshift and DW2000c Gen2 for Azure SQL Data Warehouse. We could not run a larger dataset because Azure could not allocate the DW15000c cluster required for a 30 TB dataset owing to capacity constraints at the time of publishing.

We observed that Amazon Redshift is 15x faster than Azure SQL Data Warehouse running all 99 queries with one user, and 14x faster with four concurrent users. There were a couple of outlier queries that took Azure SQL Data Warehouse several hours to complete. Excluding the two long running queries, Amazon Redshift is 2x faster than Azure SQL Data Warehouse with 1 user and 1.6x faster with four concurrent users. The following charts compare the two services.

Note: We used queries derived from TPC-DS v2.9 for this study. Amazon Redshift and Azure SQL DW do not support rollup queries, so we used TPC-DS provided variants for queries 5, 14, 18, 27, 36, 67, 70, 77, 80, and 86. We used out-of-the-box Workload Management configuration for Amazon Redshift, which allows for 5 concurrent queries, and ‘largerc’ resource class for Azure SQL DW, which has a lower limit of 4 concurrent queries. Amazon Redshift took 25 minutes to run all 99 queries, whereas Azure SQL Data Warehouse took 6.4 hours. Ignoring two queries that each took Azure SQL Data Warehouse more than 1 hour to execute (Q38 and Q67), Amazon Redshift took 22 minutes, while Azure SQL Data Warehouse took 42 minutes.


Evaluating Amazon Redshift

Although benchmarks against other data warehouse services are interesting, they are of limited value. First, there’s no one-size-fits-all benchmark. Each service has its unique real-world usage patterns and ways to configure and tune for them. We make a best effort to configure the services based on publicly available guidance, but we can’t guarantee optimal performance for any given service. We see this commonly with third-party benchmarks, for instance, where Amazon Redshift’s powerful distribution and sort keys are not used—even though the large majority of our customers use them.

Similarly, each benchmark query can only be run once, in contrast to real-world scenarios where 99.5 percent of queries we observe have components that can be found in the compilation cache (Amazon Redshift generates and compiles code for each query execution plan. The compiled code segments are stored in a least recently used cache and shared across sessions in a cluster). In other words, they are similar to queries that were run previously. So, the query run times measured by benchmarking studies can end up over-indexing on compilation times, which might not indicate the actual performance you can expect to get.

Secondly, these studies are, by necessity, a point-in-time assessment. As cloud vendors update and evolve their service, benchmark numbers might already be obsolete by the time they’re published.

Therefore, we don’t recommend that you make product selection decisions based on these benchmarks because your data and your query workloads have their own unique characteristics. If you’re evaluating Amazon Redshift for your analytics platform, we have created a Proof of Concept guide to help. You can also request assistance from us, or work with one of our System Integration and Consulting Partners and make a data-driven decision.

Finally, we invite you to watch the recent Fireside chat webinar and join us at re:Invent 2018 in Las Vegas, where we have a ton of exciting news to share with you. Happy querying!

If you would like instruction to reproduce the benchmark, please contact us at [email protected]. If you have questions or suggestions, please comment below.

About the Authors

Ayush Jain is a Product Marketer at Amazon Web Services. He loves growing cloud services and helping customers get more value from the cloud deployments. He has several years of experience in Software Development, Product Management and Product Marketing in developer and data services.




Mostafa Mokhtar is an engineer working on Redshift performance. Previously, he held similar roles at Cloudera, Hortonworks and on the SQL Server team at Microsoft.


Closing the customer journey loop with Amazon Redshift at Equinox Fitness Clubs

Post Syndicated from Ryan Kelly original https://aws.amazon.com/blogs/big-data/closing-the-customer-journey-loop-with-amazon-redshift-at-equinox-fitness-clubs/

Clickstream analysis tools handle their data well, and some even have impressive BI interfaces. However, analyzing clickstream data in isolation comes with many limitations. For example, a customer is interested in a product or service on your website. They go to your physical store to purchase it. The clickstream analyst asks, “What happened after they viewed the product?” and the commerce analyst asks themselves, “What happened before they bought it?”

It’s no surprise that clickstream data can enhance your other data sources. When used with purchase data, it helps you determine abandoned carts or optimize marketing spending. Similarly, it helps you analyze offline and online behavior, and also behavior before customers even registered an account. However, once the benefits of clickstream data feeds are evident, you must accommodate new requests quickly.

This blog post shows how we, at Equinox Fitness Clubs, moved our data from Amazon Redshift to Amazon S3 to use a late-binding view strategy with our clickstream data. Expect such fun things as Apache Spark, Apache Parquet, data lakes, hive partitioning, and external tables, all of which we will talk about extensively in this post!

When we first started passing our clickstream data from its own tool to our Amazon Redshift data warehouse, speed was the primary concern. Our initial use case was to tie Salesforce data and Adobe Analytics data together to get a better understanding about our lead process. Adobe Analytics could tell us what channel and campaigns people came from, what pages they looked at during the visit, and if they submitted a lead form on our site. Salesforce could tell us if a lead was qualified, if they ever met an advisor, and ultimately if they became a member. Tying these two datasets together helped us better understand and optimize our marketing.

To begin, we knew the steps involved to centralize Salesforce and Adobe Analytics data into Amazon Redshift. However, even when joined together in Redshift, they needed a common identifier to talk to each other. Our first step involved generating and sending the same GUID to both Salesforce and Adobe Analytics when a person submitted a lead form on our website.

Next, we had to pass Salesforce data to Redshift. Luckily, those feeds already existed, so we could add this new GUID attribute in the feed and describe it in Redshift.

Similarly, we had to generate data feeds from Adobe Analytics to Amazon Redshift. Adobe Analytics provides Amazon S3 as a destination option for our data, so we passed the data to S3 and then created a job to send it to Redshift. The job involved taking the daily Adobe Analytics feed – which comes with a data file containing hundreds of columns and hundreds of thousands of rows, a collection of lookup files like the headers for the data, and a manifest file that describes the files that were sent – and passing it all to Amazon S3 in its raw state. From there, we used Amazon EMR with Apache Spark to process the data feed files into a single CSV file, and then save it to S3 so that we could perform the COPY command to send the data to Amazon Redshift.

The job ran for a few weeks and worked well until we started to use the data more frequently. While the job was effective, backdating data with new columns (schema evolution) started to occur. That’s when we decided we needed greater flexibility because of the nature of the data.

Data lake to the rescue

When we decided to refactor the job, we had two things lined up for us. First, we were already moving towards more of a data lake strategy. Second, Redshift Spectrum had been recently released. It would enable us to query these flat files of clickstream data in our data lake without ever having to run the COPY command and store it in Redshift. Also, we could more efficiently join the clickstream data to other data sources stored inside of Redshift.

We wanted to take advantage of self-describing data which combines the schema of the data with the data itself. Converting the data to self-describing data would help us manage the wide clickstream datasets and prevent the schema-evolution related challenges. We could pull every column we desired into the data lake file and then use only the important columns in our queries to speed up processing. To accomplish this flexibility, we used the Apache Parquet file format, which is both self-describing and blazing fast due to its columnar storage technology. We used Apache Spark on Amazon EMR to convert from CSV to Parquet, and partition the data for scanning performance, as shown in the following code.


from datetime import date, timedelta
from pyspark.sql.types import *
from pyspark.sql import SparkSession
import json
import argparse
# Usage
# spark-submit all_omniture_to_parquet.py 2017-10-31 s3a:// eqxdl-prod-l-omniture eqxios eqxdl-prod eqeqxiosprod omniture_eqxios
# python -m tasks.s2w_all_omniture_to_parquet 2017-10-31

parser = argparse.ArgumentParser()
parser.add_argument('year_month_day_arg', help='Run date (yyyy-mm-dd)', type=str, default='XXX')
parser.add_argument('s3_protocol', help='S3 protocol i.e. s3a://',type=str, default='XXX')
parser.add_argument('source_bucket', help='Omniture source data bucket',type=str, default='XXX')
parser.add_argument('source_path', help='Omniture source data path',type=str, default='XXX')
parser.add_argument('target_bucket', help='Omniture target data bucket',type=str, default='XXX')
parser.add_argument('report_suite', help='Omniture report suite ID',type=str, default='XXX')
parser.add_argument('application', help='App name for job',type=str, default='XXX')
args = parser.parse_args()

spark = SparkSession\

sc = spark.sparkContext

def manifest_toJSON(file, location):
	text = sc.textFile(file).collect()
	manifest = {'lookup_files': [], 'data_files': location, 'total_rows': 0}
	for x in text:
		if 'Lookup-File:' in x:
			manifest['lookup_files'].append(location+x.split(': ')[1])
		elif 'Data-File: 01' in x:
			wildcard_path = x.replace('Data-File: 01','*')
			manifest['data_files'] += wildcard_path
		elif 'Record-Count:' in x:
			manifest['total_rows'] += int(x.split(': ')[1])
	return manifest

# Create metadata by stitching together the file paths for
# the header file and the data file from the manifest file
# base_filepath = '/Users/rkelly/projects/sparkyTest/project_remodeling/ios_test_data/'
base_filepath = '{}{}/{}/'.format(args.s3_protocol, args.source_bucket, args.source_path)
manifest_filepath = base_filepath+'{}_{}.txt'.format(args.report_suite, args.year_month_day_arg)
metadata = manifest_toJSON(manifest_filepath, base_filepath)

# Create a list of files and their data
# Look specifically for the column_headers.tsv data
# Split on \x00 to remove the garbage encoding and return a string of headers
lookup_files = sc.textFile(','.join(metadata['lookup_files'])).collect()
encoded_header = lookup_files[[idx for idx, s in enumerate(lookup_files) if 'column_headers.tsv' in s][0]].split('\x00')
header = encoded_header[[idx for idx, s in enumerate(encoded_header) if '\t' in s][0]]\
			.replace('\n', '')\
			.replace('(', '')\
			.replace(')', '')\
			.replace(' ', '-')

# Create a schema for the list from the header file splitting on tabs
# Cast everything as a string to avoid data type failures
schema = StructType([ StructField(field, StringType(), True) for field in header.split('\t')])

# Bypass RDD and write data file as a dataframe
# then save as parquet to tie headers to their respective values
df = spark.read.csv(metadata['data_files'], header=False, schema=schema, sep='\t', nullValue=None)
destination_filepath = '{}{}/{}/dt={}/'.format(args.s3_protocol, args.target_bucket, args.application, args.year_month_day_arg)

# Gracefully exit out of spark and this file


Using the AWS Glue Data Catalog allowed us to make our clickstream data available to be queried within Amazon Redshift and other query tools like Amazon Athena and Apache Spark. This is accomplished by mapping the Parquet file to a relational schema. AWS Glue, enables querying additional data in mere seconds. This is because schema changes can occur in real time. This means you can delete and add columns, reorder column indices, and change column types all at once. You can then query the data immediately after saving the schema. Additionally, Parquet format prevents failures when the shape of the data changes, or when certain columns are deprecated and removed from the data set.

We used the following query to create our first AWS Glue table for our Adobe Analytics website data. We ran this query in Amazon Redshift in SQL Workbench.


--First create your schema
create external schema omniture_prod
from data catalog 
database 'omniture' 
iam_role 'arn:aws:iam:::role

--Then create your “table” 
CREATE EXTERNAL TABLE omniture_prod.eqx_web (
  date_time		VARCHAR,
  va_closer_id		VARCHAR,
  va_closer_detail	VARCHAR,
  va_finder_detail	VARCHAR,
  va_finder_id		VARCHAR,
  ip			VARCHAR,
  domain		VARCHAR,
  post_evar1		VARCHAR
LOCATION 's3://eqxdl-prod/omniture/eqx_web/'
table properties ('parquet.compress'='SNAPPY');

--Check your databases, schemas, and tables
select * from pg_catalog.svv_external_databases;
select * from pg_catalog.svv_external_schemas;
select * from pg_catalog.svv_external_tables;


After running this query, we added additional columns to the schema upon request throughthe AWS Glue interface. We also used partitioning to make our queries faster and cheaper.

At this point, we had a new schema folder in our database. It contained the external table that could be queried, but we wanted to take it a step further. We needed to add some transformations to the data such as:

  • Renaming IDs to strings
  • Concatenating values
  • Manipulating strings, excluding bot traffic that we send from AWS to test the website
  • Changing the column names to be more user friendly.

To do this, we created a view of the external table, as follows:


create view edw_t.f_omniture_web as 
    REPLACE(dt, '-', '') as hive_date_key,
    va_closer_detail as last_touch_campaign,
        WHEN (va_closer_id) = '1' THEN 'Paid Search'
        WHEN (va_closer_id) = '2' THEN 'Natural Search'
        WHEN (va_closer_id) = '3' THEN 'Display'
        WHEN (va_closer_id) = '4' THEN 'Email Acq'
        WHEN (va_closer_id) = '5' THEN 'Direct'
        WHEN (va_closer_id) = '6' THEN 'Session Refresh'
        WHEN (va_closer_id) = '7' THEN 'Social Media'
        WHEN (va_closer_id) = '8' THEN 'Referring Domains'
        WHEN (va_closer_id) = '9' THEN 'Email Memb'
        WHEN (va_closer_id) = '10' THEN 'Social Placement'
        WHEN (va_closer_id) = '11' THEN 'Other Placement'
        WHEN (va_closer_id) = '12' THEN 'Partnership'
        WHEN (va_closer_id) = '13' THEN 'Other Eqx Sites'
        WHEN (va_closer_id) = '14' THEN 'Influencers'
        ELSE NULL
    END AS last_touch_channel,
    va_finder_detail as first_touch_campaign,
    va_finder_id as va_finder_id,
        WHEN (va_finder_id) = '1' THEN 'Paid Search'
        WHEN (va_finder_id) = '2' THEN 'Natural Search'
        WHEN (va_finder_id) = '3' THEN 'Display'
        WHEN (va_finder_id) = '4' THEN 'Email Acq'
        WHEN (va_finder_id) = '5' THEN 'Direct'
        WHEN (va_finder_id) = '6' THEN 'Session Refresh'
        WHEN (va_finder_id) = '7' THEN 'Social Media'
        WHEN (va_finder_id) = '8' THEN 'Referring Domains'
        WHEN (va_finder_id) = '9' THEN 'Email Memb'
        WHEN (va_finder_id) = '10' THEN 'Social Placement'
        WHEN (va_finder_id) = '11' THEN 'Other Placement'
        WHEN (va_finder_id) = '12' THEN 'Partnership'
        WHEN (va_finder_id) = '13' THEN 'Other Eqx Sites'
        WHEN (va_closer_id) = '14' THEN 'Influencers'
        ELSE NULL
    END AS first_touch_channel,
    ip as ip_address,
    domain as domain,
    post_evar1 AS internal_compaign,
    post_evar10 as site_subsection_nm,
    post_evar11 as IOS_app_view_txt,
    post_evar12 AS site_section_nm,
    post_evar15 AS transaction_id,
    post_evar23 as join_barcode_id,
    post_evar3 AS page_nm,
    post_evar32 as host_nm,
    post_evar41 as class_category_id,
    post_evar42 as class_id,
    post_evar43 as class_instance_id,
    post_evar60 AS referral_source_txt,
    post_evar69 as adwords_gclid,
    post_evar7 as usersec_tracking_id, 
    post_evar8 as facility_id,
    post_event_list as post_event_list,  
    post_visid_low||post_visid_high as unique_adobe_id,
    post_visid_type as post_visid_type,
    post_page_event as hit_type,
    visit_num as visit_number,
    post_evar25 as login_status,
    exclude_hit as exclude_hit,
    hit_source as hit_source,
    post_evar64 as api_error_msg,
    post_evar70 as page_load_time,
    post_evar78 as join_transaction_id,
    post_evar9 as page_url,
    visit_start_pagename as entry_pg,
    post_tnt as abtest_campaign,
    post_tnt_action as abtest_experience,
    user_agent as user_agent,
    mobile_id as mobile_id,
    cast(date_time as timestamp) as date_time,
        'America/New_York', -- timezone of origin
            when post_t_time_info like '%undefined%' then '0'
            when post_t_time_info is null then '0'
            when post_t_time_info = '' then '0'
            when cast(split_part(post_t_time_info,' ',4) as int) < 0
              then left(split_part(post_t_time_info,' ',4),4)
            else left(split_part(post_t_time_info,' ',4),3) end as int
        cast(date_time as timestamp)
    ) as date_time_local,
    post_t_time_info as local_timezone
from omniture_prod.eqx_web
where exclude_hit = '0'
and hit_source not in ('5','7','8','9')

and domain <> 'amazonaws.com'
and domain <> 'amazon.com'



Now we can perform queries from Amazon Redshift, blending our structured Salesforce data with our semi-structured, dynamic Adobe Analytics data. With these changes, our data became extremely flexible, friendly on storage size, and very performant when queried. Since then, we have started using Redshift Spectrum for many use cases from data quality checks, machine data, historical data archiving, and empowering our data analysts and scientists to more easily blend and onboard data.


with web_leads as (
  select transaction_id,last_touch_channel
  from edw_t.f_omniture_web
  where hive_date_key = '20170301'
      and post_event_list like '%201%'
      and transaction_id != '807f0cdc-80cf-42d3-8d75-e55e277a8718'
opp_lifecycle as (
  SELECT lifecycle,weblead_transactionid
  FROM edw_t.f_opportunity
  where weblead_transactionid is not null
  and created_date_key between '20170220'and '20170310'
  coalesce(opp_lifecycle.lifecycle, 'N/A') as Lifecycle
from web_leads
left join opp_lifecycle on web_leads.transaction_id = opp_lifecycle.weblead_transactionid


We were able to setup an efficient and flexible analytics platform for clickstream data by combining the Amazon S3 data lake with Amazon Redshift. This has eliminated the need to always load clickstream data into the data warehouse, and also made the platform adaptable to schema changes in the incoming data. You can read the Redshift documentation to get started with Redshift Spectrum, and also watch our presentation at the AWS Chicago Summit 2018 below.

Additional Reading

If you found this post helpful, be sure to check out From Data Lake to Data Warehouse: Enhancing Customer 360 with Amazon Redshift Spectrum, and Narrativ is helping producers monetize their digital content with Amazon Redshift.


About the Author

Ryan Kelly is a data architect at Equinox, where he helps outline and implement frameworks for data initiatives. He also leads clickstream tracking which helps aid teams with insights on their digital initiatives. Ryan loves making it easier for people to reach and ingest their data for the purposes of business intelligence, analytics, and product/service enrichment. He also loves exploring and vetting new technologies to see how they can enhance what they do at Equinox.




How Annalect built an event log data analytics solution using Amazon Redshift

Post Syndicated from Eric Kamm original https://aws.amazon.com/blogs/big-data/how-annalect-built-an-event-log-data-analytics-solution-using-amazon-redshift/

Ingesting and analyzing event log data into a data warehouse in near real-time is challenging. Data ingest must be fast and efficient. The data warehouse must be able to scale quickly to handle incoming data volumes. Storing massive amounts of historical data in the data warehouse for infrequent access is prohibitively expensive. If you store infrequently accessed data outside the data warehouse, loading time is unacceptable.

At Annalect, we found a way to overcome these challenges. In this article, we discuss how we built a solution on AWS to manage, enhance, and analyze event log data from our ad technology partners. We use Amazon S3 for storage, Amazon EC2 and AWS Batch for compute, AWS Glue Data Catalog, and Amazon Redshift and Amazon Redshift Spectrum for analytics. This scalable, on-demand architecture has proven to be a high-performance solution for our analytics users and is very cost-effective.

Architecture overview

In mid-2016, we were using Amazon Redshift with local tables to provide SQL access to event-level ad tech datasets for analysts and modelers to improve ad buying strategies. We retained up to six months of historical data for select data sources; this represented about 230 billion events. By early 2017, the same data footprint grew to over 385 billion events, and we increased our Amazon Redshift capacity to keep pace. As we looked forward, we wanted to support more data, both in terms of history and data source coverage. At the same time, we wanted to maintain or improve query performance, ideally without increasing costs.

Because we were already using Amazon Redshift as our data warehouse, we looked at Redshift Spectrum. Redshift Spectrum is a feature of Amazon Redshift that queries data stored in Amazon S3. It uses the same SQL capabilities as Amazon Redshift, without the need to load the data into Amazon Redshift tables stored on disks within an Amazon Redshift cluster.

With Redshift Spectrum, a table’s storage is separate from a cluster’s compute resources, which enables each to scale independently of the other. We can keep large data tables in S3 instead of local Amazon Redshift storage, which reduces the size of our Amazon Redshift clusters and lowers costs. Queries that use external Redshift Spectrum tables use Redshift Spectrum compute resources (separate from the Amazon Redshift cluster) to scan the data in S3. Queries are billed by AWS at a metered rate based on the amount of data scanned. For query efficiency, we convert the data stored in S3 into Parquet files using AWS Batch.

The result is an efficient, mostly serverless analytics environment where the data is always available to query with Redshift Spectrum. Although we manage some always-on Amazon Redshift clusters, we think of this as “mostly serverless.” This is because the cluster footprint is very small relative to the serverless compute power we use through Redshift Spectrum for analysis and through AWS Batch for data preparation.

The Amazon Redshift clusters are dedicated to performing analytics for our analysts, rather than having a portion of their time dedicated to load jobs. This is because no time is spent copying data into local tables in the Amazon Redshift clusters and because we do the data preparation work outside of Amazon Redshift.

With the evolution to Redshift Spectrum and storing table data in S3, by early 2018 we increased the amount of data that we deliver to analysts by an order of magnitude. At the same time, we kept our costs at the same level as 2017. Using Redshift Spectrum is a great cost saving to us. We now have up to 36 months of historical data available for many more data sources, representing trillions of events.

Our data management infrastructure, the Enhanced Event-Level Data Environment (EELDE), is structured as shown following.

The EELDE provides four primary functions:

  • Data ingest
  • Data transformation
  • Compute-intensive enrichment
  • Analyst ad hoc query environment

These four functions are described following in more detail.

Data ingest

We architected our ingest system to ensure smooth and rapid downstream data availability. We took into account the inherent inconsistencies of vendor data feeds, both in the contents of the feeds themselves and in different vendors’ data push conventions. We also took into account the large volume of data files that data vendors might deliver at a given time. The diagram following shows our data ingest process.

We set up two storage areas in Amazon S3 to initially hold data from our vendors. The Landing Zone serves as a repository for long-term storage of historical, unaltered raw log files. Log files that are pulled by us directly from vendors are stored immediately in the Landing Zone. The Datafeeds storage serves as a transient place for log files that are pushed out by the vendors. We curate and arrange the data files delivered into the Datafeeds storage by vendors as we pull them into the Landing Zone.

The pull processes are executed in the EELDE’s Job Execution Layer, which is autoscaling and can handle many data files in parallel. This ensures that data processing and availability are not delayed by sudden spikes in data volume. Some spikes might occur from the need to backfill a large dataset from a new vendor. Others might occur when a vendor delivers a large volume of logs accumulated after a delay on their side.

Whether the vendor logs are pushed or pulled, we monitor the Landing Zone. We send alerts when we detect changes in file formats or gaps in data that might have an impact on downstream analysis. We found that this architecture and process gives us the control, flexibility, and agility that we need to respond rapidly to any volume of data and any changes in the data.

Data transformation

The EELDE can process, enhance, and analyze large volumes of event data in a compute-efficient and cost-efficient manner. An important part of this ability is the initial transformation of files in the Landing Zone. The EELDE’s Job Execution Layer uses AWS Batch to parallelize the compute-intensive data transformation from the original source data. This data is typically in comma-separated value (CSV) or JSON format. AWS Batch parallelizes it into a columnar, compressed storage format, Apache Parquet. In the diagrams, you can see the Parquet storage location labeled as the “Level 2 (columnar)” bucket.

For our Parquet transformation jobs, we have had a good experience running many containers with local-mode Apache Spark using AWS Batch. The goal for us is supporting a high throughput of file transformations where each file’s size is in the range of a few megabytes to a few gigabytes. A given Parquet transformation might take anywhere from a few seconds to 30 minutes.

The resource isolation and job management provided by AWS Batch makes running hundreds of independent data transformation jobs straightforward. At the same time, it removes some of the challenges of running many tasks simultaneously within Amazon EMR. In contrast, to transform larger datasets, the parallel compute per job that EMR can provide is attractive.

The diagram following shows data transformation and cataloging.

We partition the large Parquet-backed event data tables by date. Date partitioning is strategically implemented to optimize both query cost and query performance by providing an efficient path to the required data for a given analytics process. After the files are converted to Parquet, they are exposed as table partitions in the AWS Glue Data Catalog. Although the data storage format has been transformed, the data still retains all of its original fields, which is important for supporting ad hoc analyses.

Once the Parquet conversion and AWS Glue Data Catalog management processes have run, the data is accessible through SQL queries using Redshift Spectrum. Redshift Spectrum uses the AWS Glue Data Catalog to reference external tables stored in S3 so that you can query them through the Amazon Redshift cluster. Using a columnar format like Parquet increases the speed and cost efficiency of performing queries with Redshift Spectrum.

Compute-intensive enrichment

Before analytics work, the Job Execution Layer might also run other compute intensive data transformations to enhance the logs so that they can support more advanced types of analysis. Event-level log combination and event attribute augmentation are two examples.

Although these enrichment tasks might have different compute requirements, they can all use Redshift Spectrum to gather relevant datasets, which are then processed in the Job Execution Layer. The results are made available again through Redshift Spectrum. This enables complex enrichment of large datasets in acceptable timeframes and at a low cost.

Analyst ad hoc query environment

Our environment provides analysts with a data landscape and SQL query environment for mission-critical analytics. The data is organized within the EELDE into separate external schemas by ad tech vendor and system seat or account. Our agency-based analysts and data scientists are granted access to specific schemas according to project needs.

By having access to various datasets within a single platform, analysts can gain unique insights that are not possible by analyzing data in separate silos. They can create new datasets from a combination of source vendors for use during specific analytic projects while respecting data separation requirements. Doing this helps support use cases such as predictive modeling, channel planning, and multisource attribution.

Best practices to keep costs low

Along the way, we have developed some best practices for Redshift Spectrum:

  • Use compressed columnar format. We use Apache Parquet for our large event-level tables. Doing this reduces query run time, data scan costs, and storage costs. Our large tables typically have 40–200 columns, yet most of our queries use only 4–8 columns at a time. Using Parquet commonly lowers the scan cost by over 90 percent compared to noncolumnar alternatives.
  • Partition tables. We partition our large event level tables by date, and we train users to use the partition column in WHERE clauses. Doing this provides cost savings proportional to the portion of dates used by the queries. For example, suppose that we expose six months of data and a query looks at only one month. In this case, the cost for that query is reduced by five-sixths by using the appropriate WHERE clause.
  • Actively manage table partitions. Although we have up to 36 months of historical data available as Parquet files, by default we expose only the most recent 6 months of data as external table partitions. This practice reduces the cost of queries when users fail to filter on the partition column. On request, we temporarily add older date partitions to the necessary tables to support queries that need more than 6 months of data. In this way, we support longer timeframes when needed. At the same time, we usually maintain a lower cost limit by not exposing more data than is commonly necessary.
  • Move external tables into Amazon Redshift when necessary. When a large number of queries repeatedly hit the same external table, we often create a temporary local table in Amazon Redshift to query against. Depending on the number of queries and their scope, it can be helpful to load the data into a local Amazon Redshift table with the appropriate distribution key and sort keys. This practice helps eliminate the repeated Redshift Spectrum scan costs.
  • Use spectrum_scan_size_mb WLM rules. Most of our users and jobs operate with an effective Redshift Spectrum data scan limit of 1 terabyte by setting a spectrum_scan_size_mb WLM rule on the default WLM queue. For a single analysis or modeling query, this limit rarely needs to be exceeded.
  • Add clusters to increase concurrency when needed. Multiple Amazon Redshift clusters have access to the same external tables in S3. If we have a lot of users running queries at the same time against external tables, we might choose to add more clusters. Redshift Spectrum and the management tools we have created for our environment make this an easy task.
  • Use short-lived Amazon Redshift clusters. For some scheduled production jobs, we use short-lived Amazon Redshift clusters with the necessary data made available through external schemas. Redshift Spectrum allows us to avoid lengthy data load times, so it is practical to launch a Amazon Redshift cluster for a couple of hours of work and then terminate it every day. Doing this can save 80–90 percent compared to having an always-on cluster for a project. It also avoids contention with other query activity, because the short-lived cluster is dedicated to the project.


By establishing a data warehouse strategy using Amazon S3 for storage and Redshift Spectrum for analytics, we increased the size of the datasets we support by over an order of magnitude. In addition, we improved our ability to ingest large volumes of data quickly, and maintained fast performance without increasing our costs. Our analysts and modelers can now perform deeper analytics to improve ad buying strategies and results.


Additional Reading

If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production and How I built a data warehouse using Amazon Redshift and AWS services in record time.


About the Author

Eric Kamm is a senior engineer and architect at Annalect. He has been creating and managing large-scale ETL workflows and providing data and compute environments for teams of analysts and modelers for over twenty years. He enjoys working with cloud technologies because they can accelerate project planning and implementation while increasing reliability and operational efficiency.

Store, Protect, Optimize Your Healthcare Data with AWS: Part 2

Post Syndicated from Stephen Jepsen original https://aws.amazon.com/blogs/architecture/store-protect-optimize-your-healthcare-data-with-aws-part-2/

Leveraging Analytics and Machine Learning Tools for Readmissions Prediction

This blog post was co-authored by Ujjwal Ratan, a senior AI/ML solutions architect on the global life sciences team.

In Part 1, we looked at various options to ingest and store sensitive healthcare data using AWS. The post described our shared responsibility model and provided a reference architecture that healthcare organizations could use as a foundation to build a robust platform on AWS to store and protect their sensitive data, including protected health information (PHI). In Part 2, we will dive deeper into how customers can optimize their healthcare datasets for analytics and machine learning (ML) to address clinical and operational challenges.

There are a number of factors creating pressures for healthcare organizations, both providers and payers, to adopt analytic tools to better understand their data: regulatory requirements, changing reimbursement models from volume- to value-based care, population health management for risk-bearing organizations, and movement toward personalized medicine. As organizations deploy new solutions to address these areas, the availability of large and complex datasets from electronic health records, genomics, images (for example, CAT, PET, MRI, ultrasound, X-ray), and IoT has been increasing. With these data assets growing in size, healthcare organizations want to leverage analytic and ML tools to derive new actionable insights across their departments.

One example of the use of ML in healthcare is diagnostic image analysis, including digital pathology. Pathology is extremely important in diagnosing and treating patients, but it is also extremely time-consuming and largely a manual process. While the complexity and quantity of workloads are increasing, the number of pathologists is decreasing. According to one study, the number of active pathologists could drop by 30 percent by 2030 compared to 2010 levels. (1) A cloud architecture and solution can automate part of the workflow, including sample management, analysis, storing, sharing, and comparison with previous samples to complement existing provider workflows effectively. A recent study using deep learning to analyze metastatic breast cancer tissue samples resulted in an approximately 85% reduction in human error rate. (2)

ML is also being used to assist radiologists in examining other diagnostic images such as X-rays, MRIs, and CAT scans. Having large quantities of images and metadata to train the algorithms that are the key to ML is one of the main challenges for ML adoption. To help address this problem, the National Institutes of Health recently released 90,000 X-ray plates tagged either with one of 14 diseases or tagged as being normal. Leading academic medical centers are using these images to build their neural networks and train their algorithms. With advanced analytics and ML, we can answer the hard questions such as “what is the next best action for my patient, the expected outcome, and the cost.”

The foundations for a great analytical layer

Let’s pick up from where we left off in Part 1. We have seen how providers can ingest data into AWS from their data centers and store it securely into different services depending on the type of data. For example:

  1. All object data is stored in Amazon S3, Amazon S3 Infrequent Access, or Amazon Glacier depending on how often they are used.
  2. Data from the provider’s database is either processed and stored as objects in Amazon S3 or aggregated into data marts on Amazon Redshift.
  3. Metadata of the objects on Amazon S3 are maintained in the DynamoDB database.
  4. Amazon Athena is used to query the objects directly stored on Amazon S3 to address ad hoc requirements.

We will now look at two best practices that are key to building a robust analytical layer using these datasets.

  1. Separating storage and compute: You should not be compelled to scale compute resources just to store more data. The scaling rules of the two layers should be separate.
  2. Leverage the vast array of AWS big data services when it comes to building the analytical platforms instead of concentrating on just a few of them. Remember, one size does not fit all.

Technical overview

In this overview, we will demonstrate how we can leverage AWS big data and ML services to build a scalable analytical layer for our healthcare data. We will use a single source of data stored in Amazon S3 for performing ad hoc analysis using Amazon Athena, integrate it with a data warehouse on Amazon Redshift, build a visual dashboard for some metrics using Amazon QuickSight, and finally build a ML model to predict readmissions using Amazon SageMaker. By not moving the data around and just connecting to it using different services, we avoid building redundant copies of the same data. There are multiple advantages to this approach:

  1. We optimize our storage. Not having redundant copies reduces the amount of storage required.
  2. We keep the data secure with only authorized services having access to it. Keeping multiple copies of the data can result in higher security risk.
  3. We are able to scale the storage and compute separately as needed.
  4. It becomes easier to manage the data and monitor usage metrics centrally such as how often the data has been accessed, who has been accessing it, and what has been the growth pattern of the data over a period of time. These metrics can be difficult to aggregate if the data is duplicated multiple times.

Let’s build out this architecture using the following steps:

  1. Create a database in AWS Glue Data Catalog

We will do this using a Glue crawler. First create a JSON file that contains the parameters for the Glue crawler.

"Name": "readmissions",
"Role": "arn of the role for Glue",
"DatabaseName": "readmissions",
"Description": "glue data catalog for storing readmission data",
"Targets": {
"S3Targets": [
"Path": "s3://<bucket>/<prefix>"
"Path": "s3://<bucket>/<prefix>"

As you can see, the crawler will crawl two locations in Amazon S3 and save the resulting tables in a new database called “readmissions.” Replace the role ARN and Amazon S3 locations with your corresponding details. Save this in a file create_crawler.json. Then from the AWS CLI, call the following command to create the crawler:

aws glue create-crawler --cli-input-json file://create_crawler.json

Once the crawler is created, run it by calling the following command:

aws glue start-crawler --name readmissions

Log on to the AWS Glue console, navigate to the crawlers, and wait until the crawler completes running.

This will create two tables — phi and non-phi — in a database named “readmissions” in the AWS Glue Data Catalog as shown below.

  1. Query the data using Athena

The Amazon Glue Data Catalog is seamlessly integrated with Amazon Athena. For details on how to enable this, see Integration with AWS Glue.

As a result of this integration, the tables created using the Glue crawler can now be queried using Amazon Athena. Amazon Athena allows you to do ad hoc analysis on the dataset. You can do exploratory analysis on the data and also determine its structure and quality. This type of upfront ad hoc analysis is invaluable for ensuring the data quality in your downstream data warehouse or your ML algorithms that will make use of this data for training models. In the next few sections, we will explore these aspects in greater detail.

To query the data using Amazon Athena, navigate to the Amazon Athena console.

NOTE: Make sure the region is the same as the region you chose in the previous step. If it’s not the same, switch the region by using the drop-down menu on the top right-hand corner of the screen.

Once you arrive in the Amazon Athena console, you should already see the tables and databases you created previously, and you should be able to see the data in the two tables by writing Amazon Athena queries. Here is a list of the top 10 rows from the table readmissions.nonphi:

Now that we are able to query the dataset, we can run some queries for exploratory analysis. Here are just a few examples:

AnalysisAmazon Athena Query
How many Patients have been discharged to home?SELECT count(*) from nonphi where discharge_disposition = ‘Discharged to home’
What’s the minimum and the maximum number of procedures carried out on a patient?SELECT min(num_procedures), max(num_procedures) from nonphi
How many patients were referred to this hospital by another physician?SELECT count(*) FROM nonphi group by admission_source having admission_source = ‘Physician Referral’
What were the top 5 specialties with positive readmissions?

SELECT count(readmission_result) as num_readmissions, medical_specialty from

(select readmission_result,medical_specialty from nonphi where readmission_result = ‘Yes’)

group by medical_specialty order by num_readmissions desc limit 5

Which payer was responsible for paying for treatments that involved more than 5 procedures?SELECT distinct payer_code from nonphi where num_procedures >5 and payer_code !='(null)’

While this information is valuable, you typically do not want to invest too much time and effort into building an ad hoc query platform like this because at this stage, you are not even sure if the data is of any value for your business-critical analytical applications. One benefit of using Amazon Athena for ad hoc analysis is that it requires little effort or time. It uses Schema-On-Read instead of schema on write, allowing you to work with various source data formats without worrying about the underlying structures. You can put the data on Amazon S3 and start querying immediately.

  1. Create an external table in Amazon Redshift Spectrum with the same data

Now that we are satisfied with the data quality and understand the structure of the data, we would like to integrate this with a data warehouse. We’ll use Amazon Redshift Spectrum to create external tables on the files in S3 and then integrate these external tables with a physical table in Amazon Redshift.

Amazon Redshift Spectrum allows you to run Amazon Redshift SQL queries against data on Amazon S3, extending the capabilities of your data warehouse beyond the physical Amazon Redshift clusters. You don’t need to do any elaborate ETL or move the data around. The data exists in one place in Amazon S3 and you interface with it using different services (Athena and Redshift Spectrum) to satisfy different requirements.

Before beginning, please look at this step by step guide to set up Redshift Spectrum.

After you have set up Amazon Redshift Spectrum, you can begin executing the steps below:

  1. Create an external schema called “readmissions.” Amazon Redshift Spectrum integrates with the Amazon Glue Data Catalog and allows you to create spectrum tables by referring the catalog. This feature allows you to build the external table on the same data that you analyzed with Amazon Athena in the previous step without the need for ETL. This can be achieved by the following:
create external schema readmissions
from data catalog
database 'readmissions'
iam_role 'arn for your redshift spectrum role '
region ‘region when the S3 data exists’;

NOTE: Make sure you select the appropriate role arn and region.

  1. Once the command executes successfully, you can confirm the schema was created by running the following:
select * from svv_external_schemas;

You should see a row similar to the one above with your corresponding region and role.

You can also see the external tables that were created by running the following command:

select * from SVV_EXTERNAL_TABLES;

  1. Let’s confirm we can see all the rows in the external table by counting the number of rows:
select count(*) from readmissions.phi;
select count(*) from readmissions.nonphi;

You should see 101,766 rows in both the tables, confirming that your external tables contain all the records that you read using the AWS Glue data crawler and analyzed using Athena.

  1. Now that we have all the external tables created, let’s create an aggregate fact table in the physical Redshift data warehouse. We can use the “As Select” clause of the Redshift create table query to do this:
create table readmissions_aggregate_fact as
avg(time_in_hospital) as avg_time_in_hospital,
min(num_procedures) as min_procedures,
max(num_procedures) as max_procedures,
avg(num_procedures) as avg_num_procedures,
avg(num_medications) as avg_num_medications,
avg(number_outpatient) as avg_number_outpatient,
avg(number_emergency) as avg_number_emergency,
avg(number_inpatient) as avg_number_inpatient,
avg(number_diagnoses) as avg_number_diagnoses
from readmissions.nonphi
group by readmission_result,admission_type,discharge_disposition,diabetesmed

Once this query executes successfully, you can see a new table created in the physical public schema of your Amazon Redshift cluster. You can confirm this by executing the following query:

select distinct(tablename) from pg_table_def where schemaname = 'public'

  1. Build a QuickSight Dashboard from the aggregate fact

We can now create dashboards to visualize the data in our readmissions aggregate fact table using Amazon QuickSight. Here are some examples of reports you can generate using Amazon QuickSight on the readmission data.

For more details on Amazon QuickSight, refer to the service documentation.

  1. Build a ML model in Amazon SageMaker to predict readmissions

As a final step, we will create a ML model to predict the attribute readmission_result, which denotes if a patient was readmitted or not, using the non-PHI dataset.

  1. Create a notebook instance in Amazon SageMaker that is used to develop our code.
  2. The code reads non-PHI data from the Amazon S3 bucket as a data frame in Python. This is achieved using the pandas.readcsv function.

  1. Use the pandas.get_dummies function to encode categorical values into numeric values for use with the model.

  1. Split the data into two, 80% for training and 20% for testing, using the numpy.random.rand function.

  1. Form train_X, train_y and test_X, test_y corresponding to training features, training labels, testing features, and testing labels respectively.

  1. Use the Amazon SageMaker Linear learner algorithm to train our model. The implementation of the algorithm uses dense tensor format to optimize the training job. Use the function write_numpy_to_dense_tensor from the Amazon SageMaker library to convert the numpy array into the dense tensor format.

  1. Create the training job in Amazon SageMaker with appropriate configurations and run it.

  1. Once the training job completes, create an endpoint in Amazon SageMaker to host our model, using the linear.deploy function to deploy the endpoint.

  1. Finally, run a prediction by invoking the endpoint using the linear_predictor.predict function.

You can view the complete notebook here.

Data, analytics, and ML are strategic assets to help you manage your patients, staff, equipment, and supplies more efficiently. These technologies can also help you be more proactive in treating and preventing disease. Industry luminaries share this opinion: “By leveraging big data and scientific advancements while maintaining the important doctor-patient bond, we believe we can create a health system that will go beyond curing disease after the fact to preventing disease before it strikes by focusing on health and wellness,” writes Lloyd B. Minor, MD, dean of the Stanford School of Medicine.

ML and analytics offer huge value in helping achieve the quadruple aim : improved patient satisfaction, improved population health, improved provider satisfaction, and reduced costs. Technology should never replace the clinician but instead become an extension of the clinician and allow them to be more efficient by removing some of the mundane, repetitive tasks involved in prevention, diagnostics, and treatment of patients.

(1) “The Digital Future of Pathology.” The Medical Futurist, 28 May 2018, medicalfuturist.com/digital-future-pathology.

(2) Wang, Dayong, et al. “Deep Learning for Identifying Metastatic Breast Cancer.” Deep Learning for Identifying Metastatic Breast Cancer, 18 June 2016, arxiv.org/abs/1606.05718.

About the Author

Stephen Jepsen is a Global HCLS Practice Manager in AWS Professional Services.


Narrativ is helping producers monetize their digital content with Amazon Redshift

Post Syndicated from Colin Nichols original https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/

Narrativ, in their own words: Narrativ is building monetization technology for the next generation of digital content producers. Our product portfolio includes a real-time bidding platform and visual storytelling tools that together generate millions of dollars of advertiser value and billions of data points each month.

At Narrativ, we have seen massive growth in our platform usage with a similar order-of-magnitude increase in data generated by our products over the past 15 months. In this blog post, we share our evolution to a robust, scalable, performant, and cost-effective analytics environment with AWS. We also discuss the best practices that we learned along the way in data warehousing and data lake analytics.

Anticipating Narrativ’s continued growth acceleration, we began planning late last year for the next order of magnitude. We have been using Amazon Redshift as our data warehouse, which has served us very well. As our data continued to grow, we utilized Amazon S3 as a data lake and used external tables in Amazon Redshift Spectrum to query data directly in S3. We were excited that this allowed us to easily scale storage and compute resources separately to meet our needs without trading off against cost or complexity.

In the process, we created Spectrify, which simplifies working with Redshift Spectrum and encapsulates a number of best practices. Spectrify accomplishes three things in one easy command. First, it exports an Amazon Redshift table to S3 in comma-separated value (CSV) format. Second, it converts exported CSVs to Apache Parquet files in parallel. Third, it creates the external tables in the Amazon Redshift cluster. Queries can now span massive volumes of Parquet data in S3 with data in Amazon Redshift, and return results in just moments.


The preceding diagram shows how Spectrify simplifies querying across Parquet data in Amazon S3 and data in Amazon Redshift, returning results in just moments.

Amazon Redshift at Narrativ

Amazon Redshift has been the foundation of our data warehouse since Narrativ’s inception. We use it to produce traffic statistics, to feed the data-driven algorithms in our programmatic systems, and as a window to explore data insights. Amazon Redshift has scaled well to support the company’s needs. Our cluster has grown from 3 nodes to 36 nodes, and queries per hour have increased significantly without losing performance. As our workload moved from statistics and accounting toward data-driven insights, Amazon Redshift kept pace with increasing query complexity.

When planning the next iteration of our data strategy, our requirements were as follows:

  • Be maintainable by a small engineering team (mostly hands-off)
  • Be able to query all historic data at any time
  • Be able to query the previous three months of data very quickly
  • Maintain performance at 10x current data volume

We considered a few ideas to meet these requirements:

  • Upgrade to a DC2 cluster and a DS2 cluster
  • Upgrade to a very large DC2 cluster
  • Move to an open-source alternative (difficult to maintain)
  • Move to another big data provider (high barrier to entry, costly, risky)

Then we discovered Redshift Spectrum, which changes the game by separating compute costs from storage costs. With Redshift Spectrum, we can offload older, infrequently used data to S3 while maintaining a modestly sized, yet fast, Amazon Redshift cluster. Compared to other solutions, Redshift Spectrum reduces complexity by allowing us to join to the same set of dimension tables for all of our queries. And finally, it even preserves, if not improves, performance (depending on the query). All of our requirements were met with Redshift Spectrum, and it was easy, fast, and cost-effective to implement.

Moving forward, we view Redshift Spectrum as a critical tool to scaling our data capabilities on AWS to the next order of magnitude. Redshift Spectrum allows us to define Amazon Redshift tables backed by data stored in Amazon S3. In this way, Redshift Spectrum turns Amazon Redshift into a hybrid data warehouse/data lake. It provides the best of both worlds—unlimited storage separated from compute costs, plus the performance and converged workflow of Amazon Redshift itself.

Unlocking Redshift Spectrum

Using best practices can provide massive performance improvements in Redshift Spectrum. We shared our Spectrify implementation best practices in an open-source Python library in GitHub to help others reap the same rewards that we have. Following are further details on our top suggestions for optimizing Redshift Spectrum, which you will find incorporated into Spectrify.

Use Apache Parquet

Apache Parquet is a columnar file format. When you store your data in Parquet, Redshift Spectrum can scan only the column data relevant to your query. In contrast, row-oriented formats like CSV or JSON require a scan of every column, regardless of the query. For Narrativ’s workload, using Parquet led to large performance gains and drastically lower bills.

Columnar formats also enable more efficient data compression. Data compresses better when similar data is grouped together. In this case, we saw about an 80 percent reduction in gzipped Parquet files vs. gzipped CSVs. This reduction means data can be ingested, and thus scanned, much faster.

When we first investigated converting our data to Parquet, we found no options that fit our needs. We don’t use Spark or any other project from the Hadoop ecosystem, which is the most common toolset for creating Parquet files. A few Python projects had write support for Parquet. However, we had reservations about data consistency caveats, especially in relation to nullable columns. Additionally, we didn’t find a library that supported the correct timestamp format.

We built Spectrify to solve all these problems. We worked with the Apache Arrow project to ensure full compatibility between Redshift Spectrum and Parquet files generated by Spectrify.

Partition your data

Narrativ stores timestamped event data in Redshift Spectrum. We partition our data on event creation date, allowing Redshift Spectrum to skip inspecting most of our data files for many of our queries. Generally, our queries are concerned with a small subset of the time domain, such as a month or even just a day (for example, answering a specific question from Black Friday 2017). Queries are rarely concerned with to the entire years-long history.

This drastically reduces the amount of data scanned, and we see most queries complete in one-tenth the time. Also, our bill is much lower. Partitioning alone reduced our average cost per query by a factor of 20.


As we prepared for an order-of-magnitude increase in data volumes, we turned to a data lake strategy built on Amazon S3 using Amazon Redshift and Redshift Spectrum for analytics. Redshift Spectrum enabled us to query data across S3 and Amazon Redshift, maintain or improve query performance, and do it easily and cost-effectively. Along the way, we learned best practices and built Spectrify to streamline data preparation and analytics even further. We hope that sharing our code examples and what we learned will also help you achieve similar results with your growing data.

For more information, see our Spectrify Usage documentation and code samples, and our full example on GitHub.


Additional Reading

If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production and How I built a data warehouse using Amazon Redshift and AWS services in record time.


About the Author

Colin Nichols leads big data initiatives at Narrativ. He enjoys working at the intersection of systems, and has a strong record of shaping engineering culture around the tenets of quality, automation, and continuous learning. In his spare time, Colin enjoys teaching his dog, MacGyver, new tricks.

Get sub-second query response times with Amazon Redshift result caching

Post Syndicated from Entong Shen original https://aws.amazon.com/blogs/big-data/get-sub-second-query-response-times-with-amazon-redshift-result-caching/

Customers tell us that their data warehouse and business intelligence users want extremely fast response times so that they can make equally fast decisions. They also tell us that their users often repeat the same queries over and over again, even when the data has not changed. Repeat queries consume compute resources each time they are executed, which slows down performance for all queries.

In this post, we take a look at query result caching in Amazon Redshift. Result caching does exactly what its name implies—it caches the results of a query. When the same query comes in against the same data, the prior results are retrieved from the cache and returned immediately, instead of rerunning the same query. Result caching reduces system use, making more resources available for other workloads. It delivers faster response times for users, improves throughput for all queries, and increases concurrency.

Result caching options

There are two main ways that you can implement data warehouse result caching. The first method is to save subsets of the data tables and cache query results outside the data warehouse. This method requires additional logic and memory outside the data warehouse. You must take great care to ensure that the cache is invalidated and a query is rerun when table data is modified.

The second method is to cache the results of a query inside the data warehouse and return the cached result for future repeat queries. This method delivers higher performance because it is faster to cache data and serve it from within the cluster.

Amazon Redshift uses the second method to cache query results within the cluster to achieve higher query throughput. Amazon Redshift result caching automatically responds to data and workload changes, transparently serving multiple BI applications and SQL tools. It is available by default for all Amazon Redshift customers for no additional charge.

Since Amazon Redshift introduced result caching, the feature has saved customers thousands of hours of execution time on a daily basis. “With Amazon Redshift result caching, 20 percent of our queries now complete in less than one second,” said Greg Rokita, Executive Director of Technology, Edmunds, at the AWS Summit in San Francisco. “Our cluster reliance on disk has decreased, and consequently the cluster is able to better serve the rest of our queries. Best of all, we didn’t have to change anything to get this speed-up with Redshift, which supports our mission-critical workloads.”

How does it work?

When a query is executed in Amazon Redshift, both the query and the results are cached in memory. As future queries come in, they are normalized and compared to the queries in the cache to determine whether there are repeat queries. In this comparison, Amazon Redshift also determines whether the underlying data has changed in any way. Any data modification language (DML) or data definition language (DDL) on a table or function invalidates only the cache entries that refer to it. Examples of such statements include INSERT, DELETE, UPDATE, COPY, and TRUNCATE. Amazon Redshift manages the cache memory to evict old entries, ensuring that optimal memory use is maintained for the cache itself.

Result caching is fully managed by Amazon Redshift, and it requires no changes in your application code. Amazon Redshift automatically selects the optimal configuration based on the specific condition of your cluster, and no tuning is required for you to get the most effective configuration.

When Amazon Redshift determines that a query is eligible to reuse prior query cached results, it bypasses query planning, the workload manager (WLM), and the query execution engine altogether. Cached result rows are returned to the client application immediately with sub-second performance. This method frees up cluster resources for ETL (extract, transform, and load) and other workloads that need the compute resources.

Design and usage

The following diagram illustrates the architecture of Amazon Redshift result caching.

The query result cache resides in the memory of the leader node and is shared across different user sessions to the same database. The feature is transparent, so it works by default without the need for user configurations. Result caching complies with Amazon Redshift multi-version concurrency control (MVCC). It acquires the proper locks on the table objects and manages the lifecycle of the cache entries when multiple user sessions read/write a table object at the same time. In addition, access control of the cached results is managed so that a user must have the required permission of the objects used in the query to retrieve result rows from the cache.

Which queries are served from the result cache?

Read-only queries are eligible for caching with some exceptions. Query results are not cached in the following circumstances:

  • When a query uses a function that must be evaluated each time it is run, such as getdate().
  • When a query refers to system tables or views.
  • When a query refers to external tables, that is, Amazon Redshift Spectrum tables.
  • When a query runs only on the leader node, or the result is too large.

Suppose that your query contains functions like current_date and you want to take advantage of the result cache. You can consider rewriting the query by materializing the value of current_date (for example, in your JDBC application), using the query text, and refreshing it as needed.

To determine which executed queries served results from the cache, a new column source_query has been added to system view SVL_QLOG to record the source query ID when a query is executed from the cache. You can use the following example query to find out which queries used cached results:

select userid, query, elapsed, source_query from svl_qlog where userid > 1
order by query desc;

userid | query  | elapsed  | source_query
   104 | 629035 |       27 |       628919
   104 | 629034 |       60 |       628900
   104 | 629033 |       23 |       628891
   102 | 629017 |  1229393 |             
   102 | 628942 |       28 |       628919
   102 | 628941 |       57 |       628900
   102 | 628940 |       26 |       628891
   100 | 628919 | 84295686 |             
   100 | 628900 | 87015637 |             
   100 | 628891 | 58808694 | 

For more information about result cache usage, see Result Caching in the Amazon Redshift documentation.


Amazon Redshift result caching helps ensure that no computing resources are wasted on repeat queries. It enables you to do more analytics in less time to support decision making and improve outcomes. In this post, we explained how Amazon Redshift result caching works and discussed the significant impact for Amazon Redshift customers. Result caching is enabled automatically, and we encourage you to see the difference it can make in your environment.

Additional Reading

If you found this post useful, be sure to check out Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required, Collect Data Statistics Up to 5x Faster by Analyzing Only Predicate Columns with Amazon Redshift and Amazon Redshift – 2017 Recap.


About the Authors

Entong Shen is a software engineer on the Amazon Redshift query processing team. He has been working on MPP databases for over 5 years and has focused on query optimization, statistics and SQL language features. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.



Larry Heathcote is a Principal Product Marketing Manager at Amazon Web Services for data warehousing and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys family time, home projects, grilling out and the taste of classic barbeque.



Maor Kleider is a Senior Product Manager for Amazon Redshift, a fast, simple and cost-effective data warehouse. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.



Meng Tong is a Senior Software Engineer with Amazon Redshift Query Processing team. His work has been focused on query performance improvements across rewriter, optimizer and executor, Redshift Spectrum, and most recently Redshift Result Caching. His passion lies in discovering simple, elegant solutions for customer needs in big data systems. He is a big Rafael Nadal fan and enjoys watching and playing tennis in his spare time.



Naresh Chainani is a Senior Software Development Manager at Amazon Redshift where he leads the Query Processing team. Naresh is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.




Streaming Events from Amazon Pinpoint to Redshift

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/messaging-and-targeting/streaming-events-from-amazon-pinpoint-to-redshift/

Note: This post was originally written by Ryan Idrigo-Lam, one of the founding members of the Amazon Pinpoint team.

You can use Amazon Pinpoint to segment, target, and engage with your customers directly from the console. The Pinpoint console also includes a variety of dashboards that you can use to keep track of how your customers use your applications, and measure how likely your customers are to engage with the messages you send them.

Some Pinpoint customers, however, have use cases that require a bit more than what these dashboards have to offer. For example, some customers want to join their Pinpoint data to external data sets, or to collect historical data beyond the six month window that Pinpoint retains. To help customers meet these needs, and many more, Amazon Pinpoint includes a feature called Event Streams.

This article provides information about using Event Streams to export your data from Amazon Pinpoint and into a high-performance Amazon Redshift database. Once your data is in Redshift, you can run queries against it, join it with other data sets, use it as a data source for analytics and data visualization tools, and much more.

Step 1: Create a Redshift Cluster

The first step in this process involves creating a new Redshift cluster to store your data. You can complete this step in a few clicks by using the Amazon Redshift console. For more information, see Managing Clusters Using the Console in the Amazon Redshift Cluster Management Guide.

When you create the new cluster, make a note of the values you specify for the Cluster Identifier, Database Name, Master User Name, and Master User Password. You’ll use all of these values when you set up Amazon Kinesis Firehose in the next section.

Step 2: Create a Firehose Delivery Stream with a Redshift Destination

After you create your Redshift cluster, you can create the Amazon Kinesis Data Firehose delivery stream that will deliver your Pinpoint data to the Redshift cluster.

To create the Kinesis Data Firehose delivery stream

  1. Open the Amazon Kinesis Data Firehose console at https://console.aws.amazon.com/firehose/home.
  2. Choose Create delivery stream.
  3. For Delivery stream name, type a name.
  4. Under Choose source, for Source, choose Direct PUT or other sources. Choose Next.
  5. On the Process records page, do the following:
    1. Under Transform source records with AWS Lambda, choose Enabled if you want to use a Lambda function to transform the data before Firehose loads it into Redshift. Otherwise, choose Disabled.
    2. Under Convert record format, choose Disabled, and then choose Next.
  6. On the Choose destination page, do the following:
    1. For Destination, choose Amazon Redshift.
    2. Under Amazon Redshift destination, specify the Cluster name, User name, Password, and Database for the Redshift database you created earlier. Also specify a name for the Table.
    3. Under Intermediate S3 destination, choose an S3 bucket to store data in. Alternatively, choose Create new to create a new bucket. Choose Next.
  7. On the Configure settings page, do the following:
    1. Under IAM role, choose an IAM role that Firehose can use to access your S3 bucket and KMS key. Alternatively, you can have the Firehose console create a new role. Choose Next.
    2. On the Review page, confirm the settings you specified on the previous pages. If the settings are correct, choose Create delivery stream.

Step 3: Create a JSONPaths file

The next step in this process is to create a JSONPaths file and upload it to an Amazon S3 bucket. You use the JSONPaths file to tell Amazon Redshift how to interpret the unstructured JSON that Amazon Pinpoint provides.

To create a JSONPaths file and upload it to Amazon S3

  1. In a text editor, create a new file.
  2. Paste the following code into the text file:
      "jsonpaths": [

  3. Modify the preceding code example to include the fields that you want to import into Redshift.
    Note: You can specify custom attributes or metrics by replacing my_custom_attribute or my_custom_metric in the example above with your custom attributes or metrics, respectively.
  4. When you finish modifying the code example, remove all whitespace, including spaces and line breaks, from the file. Save the file as json-paths.json.
  5. Open the Amazon S3 console at https://s3.console.aws.amazon.com/s3/home.
  6. Choose the S3 bucket you created when you set up the Firehose stream. Upload json-paths.json into the bucket.

Step 4: Configure the table in Redshift

At this point, it’s time to finish setting up your Redshift database. In this section, you’ll create a table in the Redshift cluster you created earlier. The columns in this table mirror the values you specified in the JSONPaths file in the previous section.

  1. Connect to your Redshift cluster by using a database tool such as SQL Workbench/J. For more information about connecting to a cluster, see Connect to the Cluster in the Amazon Redshift Getting Started Guide.
  2. Create a new table that contains a column for each field in the JSONPaths file you created in the preceding section. You can use the following example as a template.
    CREATE schema AWSMA;
      event_type VARCHAR(256) NOT NULL ENCODE LZO,
      event_timestamp TIMESTAMP NOT NULL ENCODE LZO,
      arrival_timestamp TIMESTAMP NULL ENCODE LZO,
      event_version CHAR(12) NULL ENCODE LZO,
      application_app_id VARCHAR(64) NOT NULL ENCODE LZO,
      application_package_name VARCHAR(256) NULL ENCODE LZO,
      application_version_name VARCHAR(256) NULL ENCODE LZO,
      application_version_code VARCHAR(256) NULL ENCODE LZO,
      application_title VARCHAR(256) NULL ENCODE LZO,
      application_cognito_identity_pool_id VARCHAR(64) NULL ENCODE LZO,
      application_sdk_name VARCHAR(256) NULL ENCODE LZO,
      application_sdk_version VARCHAR(256) NULL ENCODE LZO,
      client_cognito_id VARCHAR(64) NULL ENCODE LZO,
      device_model VARCHAR(256) NULL ENCODE LZO,
      device_make VARCHAR(256) NULL ENCODE LZO,
      device_platform_name VARCHAR(256) NULL ENCODE LZO,
      device_platform_version VARCHAR(256) NULL ENCODE LZO,
      device_locale_code VARCHAR(256) NULL ENCODE LZO,
      device_locale_language VARCHAR(64) NULL ENCODE LZO,
      device_locale_country VARCHAR(64) NULL ENCODE LZO,
      session_id VARCHAR(64) NULL ENCODE LZO,
      session_start_timestamp TIMESTAMP NULL ENCODE LZO,
      session_stop_timestamp TIMESTAMP NULL ENCODE LZO,
      monetization_transaction_id VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_store VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_item_id VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_quantity FLOAT8 NULL,
      monetization_transaction_price_reported VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_price_amount FLOAT8 NULL,
      monetization_transaction_price_currency_code VARCHAR(16) NULL ENCODE LZO,
      monetization_transaction_price_currency_symbol VARCHAR(32) NULL ENCODE LZO,
      - Custom Attributes
      a_campaign_id VARCHAR(4000),
      a_campaign_activity_id VARCHAR(4000),
      a_my_custom_attribute VARCHAR(4000),
      - Custom Metrics
      m_my_custom_metric float8
    SORTKEY ( application_app_id, event_timestamp, event_type);

Step 5: Configure the Firehose Stream

You’re getting close! At this point, you’re ready to point the Kinesis Data Firehose stream to your JSONPaths file so that Redshift parses the incoming data properly. You also need to list the columns of the table that your data will be copied into.

To configure the Firehose Stream

  1. Open the Amazon Kinesis Data Firehose console at https://console.aws.amazon.com/firehose/home.
  2. In the list of delivery streams, choose the delivery stream you created earlier.
  3. On the Details tab, choose Edit.
  4. Under Amazon Redshift destination, for COPY options, paste the following:
    JSON 's3://s3-bucket/json-paths.json'
    TIMEFORMAT 'epochmillisecs'

  5. Replace s3-bucket in the preceding code example with the path to the S3 bucket that contains json-paths.json.
  6. For Columns, list all of the columns that are present in the JSONPaths file you created earlier. Specify the column names in the same order as they’re listed in the json-paths.json file, using commas to separate the column names. When you finish, choose Save.

Step 6: Enable Event Streams in Amazon Pinpoint

The only thing left to do now is to tell Amazon Pinpoint to start sending data to Amazon Kinesis.

To enable Event Streaming in Amazon Pinpoint

  1. Open the Amazon Pinpoint console at https://console.aws.amazon.com/pinpoint/home.
  2. Choose the application or project that you want to enable event streams for.
  3. In the navigation pane, choose Settings.
  4. On the Event stream tab, choose Enable streaming of events to Amazon Kinesis.
  5. Under Stream to Amazon Kinesis, select Send events to an Amazon Kinesis Firehose delivery stream.
  6. For Amazon Kinesis Firehose delivery stream, choose the stream you created earlier.
  7. For IAM role, choose an existing role that allows the firehose:PutRecordBatch action, or choose Automatically create a role to have Amazon Pinpoint create a role with the appropriate permissions. If you choose to have Amazon Pinpoint create a role for you, type a name for the role. Choose Save.

That’s it! Once you complete this final step, Amazon Pinpoint starts exporting the data you specified into your Redshift cluster.

I hope this walk through was helpful. If you have any questions, please let us know in the comments or in the Amazon Pinpoint forum.