Tag Archives: Amazon Redshift

How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime

Post Syndicated from Guillermo Menendez Corral original https://aws.amazon.com/blogs/big-data/how-to-migrate-from-ibm-netezza-to-amazon-redshift-with-no-downtime/

A large EMEA company recently decided to migrate their on-premises IBM Netezza data warehouse to Amazon Redshift. Given the volume of data (270TB uncompressed and more than 27K tables), the number of interconnected systems that made use of this data (more than 4.5K business processes), and the zero downtime requirements, we understood that the project would be a challenge. Since the company planned to decommission the data center where the data warehouse was deployed in less than a year’s time, there were also time constraints in place.

The data warehouse is a central piece for the company; it allows users across units to gather data and generate the daily reports required to run the business. In just a few years, business units accessing the cluster increased almost 3x, with 5x the initial number of users, executing 50x the number of daily queries for which the cluster had been designed. The legacy data warehouse was not able to scale to cover their business needs anymore, resulting in nightly ETL processes running outside of their time boundaries, and live queries taking too long.

The general dissatisfaction among the business users — along with the proximity of the data center decommissioning — moved the company to plan the migration, putting its IT department in charge of the definition of the new architecture, and the execution of the project.

Amazon Redshift, Amazon Web Services’ (AWS) fast, scalable OLAP data warehouse that makes it simple and cost-effective to analyze all your data across your data warehouse and data lake, was the perfect fit to solve their problems. Not only does Amazon Redshift provide full elasticity for future growth, and features such as concurrency scaling to cover high demand peaks, it also offers a whole ecosystem of analytics services to be easily integrated.

In this article, we explain how this customer performed a large-scale data warehouse migration from IBM Netezza to Amazon Redshift without downtime, by following a thoroughly planned migration process, and leveraging AWS Schema Conversion Tool (SCT) and Amazon Redshift best practices.

Preparing the migration

Large enterprise customers typically use data warehouse systems as a central repository for data analytics, aggregating operational data from heterogeneous transactional databases and running analytical workloads to serve analysts teams through business intelligence applications and reports. Using AWS, customers can benefit from the flexibility of having multiple compute resources processing different workloads, each workload scaling as the demand grows.

In this section, we describe the steps that we followed to prepare the migration of this data warehouse from IBM Netezza to Amazon Redshift.

Identifying workloads and dependencies

Customers typically have three different types of workloads running in their data warehouses:

  1. Batch processes: Long-running processes that require many resources and low concurrency, such as ETL jobs.
  2. Ad hoc queries: Short queries with high concurrency, such as analysts querying data.
  3. Business workloads: Typically mixed workloads, such as BI applications, reports, and dashboards.

In this customer’s case, they were building business data marts through complex ETL jobs, running statistical models, generating reports, and allowing analysts to run ad hoc queries. In essence, these applications are divided into two groups of workloads: batch and ad hoc queries. The on-premises platform was always saturated and struggling to deal with the level of concurrency demanded by these workloads while offering acceptable performance.

The following diagram shows the architecture that customer had before the migration:

By using Amazon Redshift, the customer is able to fulfill the requirements of every analytical workload. Within the old data warehouse, shown in the above diagram, two different managed service providers managed two sets of independent data and workloads. For the new architecture, the decision was to split the data warehouse into two different Amazon Redshift clusters to serve those different workloads, as described in the following section. Within each of these clusters, the customer is able to manage resources and concurrency for different applications under the same workload by configuring Amazon Redshift Workload Management (WLM). A typical WLM setup is to match every workload with a queue, so in this case, each cluster had two queues configured: batch and ad hoc, each with a different number of slots and assigned memory.

Sizing the target architecture

For heterogeneous migrations, like this one, a comprehensive analysis should be performed on the source database, collecting enough data to design the new architecture that supports both data and applications.

The AWS Schema Conversion Tool was the perfect companion for this project, as the customer was able to automate the reports and generate an assessment that helped estimate the migration complexity for different objects, e.g. data types, UDFs, and stored procedures.

In a typical database migration, customers categorize large tables by number of rows. However, when migrating to columnar databases, such as Amazon Redshift, it is essential to also assess table width (that is, number of columns) from the very beginning. While columnar databases are generally more efficient than row-based databases in storing data, wide tables with few rows may have a negative impact on columnar databases. To estimate the minimum table size required for each table in Amazon Redshift, use this formula from the AWS Knowledge Center.

For this customer, there was a clear separation between core applications, and a large isolated business application with minimum data dependencies and a different set of users. As a result, one of the main architectural decisions was to use two different Amazon Redshift clusters:

  • Primary cluster: Holds the core schemas and most of the data, and serves most of the business applications. Due to high storage requirements and long batch processes that run here, the recommended Amazon Redshift node type for this cluster is the dense storage family.
  • Secondary cluster: Purpose built cluster for a single application, demands high I/O. The recommended Amazon Redshift node type for this cluster is the dense compute family.

Planning the migration

There are several approaches when it comes to database migration. A must-have requirement for many migrations is to minimize downtime, which was the main driver for the migration pattern described in this post.

One of the challenges when migrating databases is to keep the data updated in both systems, capturing the changes on the source, and applying them to the destination database during the migration. By definition, data warehouses shouldn’t be used to run transactional (OLTP) workloads, but long-running ETL processes and analytical workloads (OLAP). Usually, those ETL processes update data in batches, typically on a daily basis. This simplifies the migration, because when those ETL processes that load data into the data warehouse are run in parallel against both target systems during the migration, CDC is not required.

The following image summarizes how we planned this migration with the customer, following a parallel approach to minimize downtime on their production data warehouse:

The main steps of this process are (1) data migration, (2) technical validation, (3) data sync, and (4) business validation, as described in the following section.

Running the migration

Full data migration

The initial data migration was the first milestone of the project. The main requirements for this phase were: (1) minimize the impact on the data source, and (2) transfer the data as fast as possible. To do this, AWS offers several options, depending on the size of the database, network performance (AWS Direct Connect or AWS Snowball), and whether the migration is heterogeneous or not (AWS Database Migration Service or AWS Schema Conversion Tool).

For this heterogeneous migration, the customer used the AWS Schema Conversion Tool (SCT). The SCT enabled them to run the data migration, provisioning multiple virtual machines in the same data center where IBM Netezza was installed, each running an AWS SCT Data Extractor agent. These data extractors are Java processes that connect directly to the source database and migrate data in chunks to the target database.

Sizing data extraction agents

To estimate the number of data extractor agents needed for the migration, consider this rule of thumb: One data extractor agent per 1 TB of compressed data on the source. Another recommendation is to install extraction agents on individual computers.

For each agent, consider the following hardware general requirements:

CPU4Lots of transformations and a large number of packets to process during data migration
RAM16Data chunks are kept in memory before dumping to disk.
Disk100 GB / ~500 IOPSIntermediate results are stored on Disk.
NetworkAt least 1 Gbit (10 Gbit recommended)While provisioning the resources, it is recommended to reduce the number of network hops from the source to AWS SCT data extraction agents.

Follow this documentation in order to go through the installation steps for the data extraction agents.

Depending on the size of the data to be migrated and the network speed, you may also run data extractor agents on top of EC2. For large data warehouses and in order to minimize downtime and optimize data transfer, it is recommended to deploy the data extractor agents as close as possible to the source. For example, in this migration project, 24 individual SCT extractor agents were installed in the on-premises data center for concurrent data extraction, and in order to speed up the process. Due to the stress that these operations generate on the data source, every extraction phase was run during weekends and off-hours.

The diagram below depicts the migration architecture deployed during the data migration phases:

Creating data extraction tasks

The source tables were migrated in parallel, on a table-by-table basis, using the deployed SCT extraction agents. These extraction agents authenticate using a valid user on the data source, allowing to adjust the resources available for that user during the extraction. Data was processed locally by SCT agents and uploaded to S3 through the network (via AWS Direct Connect). Note that other migration scenarios might require the use of AWS Snowball devices. Check the Snowball documentation to decide which transfer method is better for your scenario.

As part of the analysis performed while planning the migration, the customer identified large tables, e.g. tables with more than 20 million rows or 1TB size. In order to extract data from those tables, they used the virtual partitioning feature on AWS SCT, creating several sub-tasks and parallelizing the data extraction process for this table. We recommend creating two groups of tasks for each schema that migrates; one for small tables and another one for large tables using virtual partitions.

These tasks can be defined and created before running, so everything is ready for the migration window. Visit the following documentation to create, run, and monitor AWS SCT Data Extraction Tasks.

Technical validation

Once the initial extracted data was loaded to Amazon Redshift, data validation tests were performed in parallel, using validation scripts developed by the partner teams involved in the migration. The goal at this stage is to validate production workloads, comparing IBM Netezza and Amazon Redshift outputs from the same inputs.

Typical activities covered during this phase are the following:

  • Count number of objects and rows on each table.
  • Compare the same random subset of data in both IBM Netezza and Amazon Redshift for all migrated tables, validating that data is exactly the same row by row.
  • Check incorrect column encodings.
  • Identify skewed table data.
  • Annotate queries not benefiting from sort keys.
  • Identify inappropriate join cardinality.
  • Deal with tables with large varchar columns.
  • Confirm that processes do not crash when connected with target environment.
  • Validate daily batch job runs (job duration, number of rows processed).

You’ll find the right techniques to execute most of those activities in Top 10 Performance Tuning Techniques for Amazon Redshift.

Data synchronization

During this phase, the customer again migrated the tables and schemas that lost synchronization with the source during the Technical Validation phase. By using the same mechanism described on the First Full Data Migration section, and as ETL processes that generate the data marts are already running on the future system, data is kept updated after this synchronization phase.

Business validation

After the second data migration was successfully performed and the data movement was technically validated, the last remaining task was to involve the data warehouse  users in the final validation. These users from different business units across the company accessed the data warehouse using a variety of tools and methods: JDBC/ODBC clients, Python code, PL/SQL procedures, custom applications, etc.  It was central to the migration to make sure that every end user had verified and adapted his processes to work seamlessly with Amazon Redshift before the final cut-over was performed.

This phase took around three months and consisted of several tasks:

  • Adapt business users’ tools, applications, and scripts to connect to Amazon Redshift endpoints.
  • Modify user’s data load and dump procedures, replacing data movement to / from shared storage via ODBC / JDBC with COPY / UNLOAD operations from / to S3.
  • Modify any incompatible query, taking into account Amazon Redshift PostgreSQL implementation nuances.
  • Run business processes, both against IBM Netezza and Amazon Redshift, and compare results and execution times, being sure to notify any issue or unexpected result to the team in charge of the migration, so the case can be analyzed in detail.
  • Tune query performance, taking into account table sort keys and making extensive use of the EXPLAIN command in order to understand how Amazon Redshift plans and executes queries.

This business validation phase was key to have all end users aligned and ready for the final cut-over. Following Amazon Redshift best practices enabled end users to leverage the capabilities of their new data warehouse.

Soft cut-over

After all the migration and validation tasks had been performed, every ETL, business process, external system, and user tool was successfully connected and tested against Amazon Redshift.

This is when every process can be disconnected from the old data warehouse, which can be safely powered off and decommissioned.


In this blog post, we described the steps taken to perform a successful large-scale data warehouse migration from an on-premises IBM Netezza to Amazon Redshift. These same steps can be extrapolated to any other source data warehouse.

While this article describes a pure lift-and-shift migration, this is just the beginning of a transformation process towards a full-fledged corporate data lake. There are a series of next steps necessary in order to gain full advantage of the powerful analytics tools and services provided by AWS:

  • Activate Amazon Redshift’s Concurrency scaling feature on interactive query queues so clusters scale seamlessly on high usage periods without needing to provision the clusters for peak capacity.
  • Create a data lake in S3 and offload less accessed data, keeping warm and hot data on the Amazon Redshift clusters for higher performance.
  • Leverage Amazon Redshift Spectrum to be able to combine cold and hot data on analytic queries when required by the business needs.
  • Use Amazon Athena to be able to query cold data without affecting the data warehouse performance.

It is worth pointing out several takeaways that we feel are central to achieving a successful large-scale migration to Amazon Redshift:

  • Start with a PoC to make an accurate initial sizing of the Amazon Redshift cluster.
  • Create a detailed migration plan, which includes a clear procedure for every affected system.
  • Have end users fully aligned with the migration process, and make sure all their processes are validated before the final cutover is performed.
  • Follow Amazon Redshift best practices and techniques to leverage its full capabilities and performance.
  • Engage with the AWS account team from early stages and throughout the whole process. They are the point of contact with AWS specialists, Professional Services, and partners in order to bring the migration project to a successful conclusion.

We hope you found this post useful. Please feel free to leave a comment or question.


About the Authors

Guillermo Menendez Corral is a solutions architect at Amazon Web Services. He has over 12 years of experience designing and building SW applications and currently provides architectural guidance to AWS customers, with a focus on Analytics and Machine Learning.





Arturo Bayo is a big data consultant at Amazon Web Services. He promotes a data-driven culture in enterprise customers around EMEA, providing specialized guidance on business intelligence and data lake projects while working with AWS customers and partners to build innovative solutions around data and analytics.





AWS Lake Formation – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-lake-formation-now-generally-available/

As soon as companies started to have data in digital format, it was possible for them to build a data warehouse, collecting data from their operational systems, such as Customer relationship management (CRM) and Enterprise resource planning (ERP) systems, and use this information to support their business decisions.

The reduction in costs of storage, together with an even greater reduction in complexity for managing large quantities of data, made possible by services such as Amazon S3, has allowed companies to retain more information, including raw data that is not structured, such as logs, images, video, and scanned documents.

This is the idea of a data lake: to store all your data in one, centralized repository, at any scale. We are seeing this approach with customers like Netflix, Zillow, NASDAQ, Yelp, iRobot, FINRA, and Lyft. They can run their analytics on this larger dataset, from simple aggregations to complex machine learning algorithms, to better discover patterns in their data and understand their business.

Last year at re:Invent we introduced in preview AWS Lake Formation, a service that makes it easy to ingest, clean, catalog, transform, and secure your data and make it available for analytics and machine learning. I am happy to share that Lake Formation is generally available today!

With Lake Formation you have a central console to manage your data lake, for example to configure the jobs that move data from multiple sources, such as databases and logs, to your data lake. Having such a large and diversified amount of data makes configuring the right access permission also critical. You can secure access to metadata in the Glue Data Catalog and data stored in S3 using a single set of granular data access policies defined in Lake Formation. These policies allow you to define table and column-level data access.

One thing I like the most of Lake Formation is that it works with your data already in S3! You can easily register your existing data with Lake Formation, and you don’t need to change existing processes loading your data to S3. Since data remains in your account, you have full control.

You can also use Glue ML Transforms to easily deduplicate your data. Deduplication is important to reduce the amount of storage you need, but also to make analyzing your data more efficient because you don’t have neither the overhead nor the possible confusion of looking at the same data twice. This problem is trivial if duplicate records can be identified by a unique key, but becomes very challenging when you have to do a “fuzzy match”. A similar approach can be used for record linkage, that is when you are looking for similar items in different tables, for example to do a “fuzzy join” of two databases that do not share a unique key.

In this way, implementing a data lake from scratch is much faster, and managing a data lake is much easier, making these technologies available to more customers.

Creating a Data Lake
Let’s build a data lake using the Lake Formation console. First I register the S3 buckets that are going to be part of my data lake. Then I create a database and grant permission to the IAM users and roles that I am going to use to manage my data lake. The database is registered in the Glue Data Catalog and holds the metadata required to analyze the raw data, such as the structure of the tables that are going to be automatically generated during data ingestion.

Managing permissions is one of the most complex tasks for a data lake. Consider for example the huge amount of data that can be part of it, the sensitive, mission-critical nature of some of the data, and the different structured, semi-structured, and unstructured formats in which data can reside. Lake Formation makes it easier with a central location where you can give IAM users, roles, groups, and Active Directory users (via federation) access to databases, tables, optionally allowing or denying access to specific columns within a table.

To simplify data ingestion, I can use blueprints that create the necessary workflows, crawlers and jobs on AWS Glue for common use cases. Workflows enable orchestration of your data loading workloads by building dependencies between Glue entities, such as triggers, crawlers and jobs, and allow you to track visually the status of the different nodes in the workflows on the console, making it easier to monitor progress and troubleshoot issues.

Database blueprints help load data from operational databases. For example, if you have an e-commerce website, you can ingest all your orders in your data lake. You can load a full snapshot from an existing database, or incrementally load new data. In case of an incremental load, you can select a table and one or more of its columns as bookmark keys (for example, a timestamp in your orders) to determine previously imported data.

Log file blueprints simplify ingesting logging formats used by Application Load Balancers, Elastic Load Balancers, and AWS CloudTrail. Let’s see how that works more in depth.

Security is always a top priority, and I want to be able to have a forensic log of all management operations across my account, so I choose the CloudTrail blueprint. As source, I select a trail collecting my CloudTrail logs from all regions into an S3 bucket. In this way, I’ll be able to query account activity across all my AWS infrastructure. This works similarly for a larger organization having multiple AWS accounts: they just need, when configuring the trail in the CloudTrial console, to apply the trail to their whole organization.

I then select the target database, and the S3 location for the data lake. As data format I use Parquet, a columnar storage format that will make querying the data faster and cheaper. The import frequency can be hourly to monthly, with the option to choose the day of the week and the time. For now, I want to run the workflow on demand. I can do that from the console or programmatically, for example using any AWS SDK or the AWS Command Line Interface (CLI).

Finally, I give the workflow a name, the IAM role to use during execution, and a prefix for the tables that will be automatically created by this workflow.

I start the workflow from the Lake Formation console and select to view the workflow graph. This opens the AWS Glue console, where I can visually see the steps of the workflow and monitor the progress of this run.

When the workflow is completed a new table is available in my data lake database. The source data remain as logs in the S3 bucket output of CloudTrail, but now I have them consolidated, in Parquet format and partitioned by date, in my data lake S3 location. To optimize costs, I can set up an S3 lifecycle policy that automatically expires data in the source S3 bucket after a safe amount of time has passed.

Securing Access to the Data Lake
Lake Formation provides secure and granular access to data stores in the data lake, via a new grant/revoke permissions model that augments IAM policies. It is simple to set up these permissions, for example using the console:

I simply select the IAM user or role I want to grant access to. Then I select the database and optionally the tables and the columns I want to provide access to. It is also possible to select which type of access to provide. For this demo, simple select permissions are sufficient.

Accessing the Data Lake
Now I can query the data using tools like Amazon Athena or Amazon Redshift. For example, I open the query editor in the Athena console. First, I want to use my new data lake to look into which source IP addresses are most common in my AWS Account activity:

SELECT sourceipaddress, count(*)
FROM my_trail_cloudtrail
GROUP BY  sourceipaddress

Looking at the result of the query, you can see which are the AWS API endpoints that I use the most. Then, I’d like to check which user identity types are used. That is an information stored in JSON format inside one of the columns. I can use some of the JSON functions available with Amazon Athena to get that information in my SQL statements:

SELECT json_extract_scalar(useridentity, '$.type'), count(*)
FROM "mylake"."my_trail_cloudtrail"
GROUP BY  json_extract_scalar(useridentity, '$.type')

Most of the times, AWS services are the ones creating activities in my trail. These queries are just an example, but give me quickly a deeper insight in what is happening in my AWS account.

Think of what could be a similar impact for your business! Using database and logs blueprints, you can quickly create workflows to ingest data from multiple sources within your organization, set the right permission at column level of who can have access to any information collected, clean and prepare your data using machine learning transforms, and correlate and visualize the information using tools like Amazon Athena, Amazon Redshift, and Amazon QuickSight.

Customizing Data Access with Column-Level Permissions
In order to follow data privacy guidelines and compliance, the mission-critical data stored in a data lake requires to create custom views for different stakeholders inside the company. Let’s compare the visibility of two IAM users in my AWS account, one that has full permissions on a table, and one that has only select access to a subset of the columns of the same table.

I already have a user with full access to the table containing my CloudTrail data, it’s called danilop. I create a new limitedview IAM user and I give it access to the Athena console. In the Lake Formation console, I only give this new user select permissions on three of the columns.

To verify the different access to the data in the table, I log in with one user at a time and go to the Athena console. On the left I can explore which tables and columns the logged-in user can see in the Glue Data Catalog. Here’s a comparison for the two users, side-by-side:

The limited user has access only to the three columns that I explicitly configured, and to the four columns used for partitioning the table, whose access is required to see any data. When I query the table in the Athena console with a select * SQL statement, logged in as the limitedview user, I only see data from those seven columns:

Available Now
There is no additional cost in using AWS Lake Formation, you pay for the use of the underlying services such as Amazon S3 and AWS Glue. One of the core benefits of Lake Formation are the security policies it is introducing. Previously you had to use separate policies to secure data and metadata access, and these policies only allowed table-level access. Now you can give access to each user, from a central location, only to the the columns they need to use.

AWS Lake Formation is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo). Redshift integration with Lake Formation requires Redshift cluster version 1.0.8610 or higher, your clusters should have been automatically updated by the time you read this. Support for Apache Spark with Amazon EMR will follow over the next few months.

I only scratched the surface of what you can do with Lake Formation. Building and managing a data lake for your business is now much easier, let me know how you are using these new capabilities!


Bringing your stored procedures to Amazon Redshift

Post Syndicated from Joe Harris original https://aws.amazon.com/blogs/big-data/bringing-your-stored-procedures-to-amazon-redshift/

Amazon always works backwards from the customer’s needs. Customers have made strong requests that they want stored procedures in Amazon Redshift, to make it easier to migrate their existing workloads from legacy, on-premises data warehouses.

With that primary goal in mind, AWS chose to implement PL/pqSQL stored procedure to maximize compatibility with existing procedures and simplify migrations. In this post, we discuss how and where to use stored procedures to improve operational efficiency and security. We also explain how to use stored procedures with AWS Schema Conversion Tool.

What is a stored procedure?

A stored procedure is a user-created object to perform a set of SQL queries and logical operations. The procedure is stored in the database and is available to users who have sufficient privileges to run it.

Unlike a user-defined function (UDF), a stored procedure can incorporate data definition language (DDL) and data manipulation language (DML) in addition to SELECT queries. A stored procedure doesn’t have to return a value. You can use the PL/pgSQL procedural language, including looping and conditional expressions, to control logical flow.

Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and business-specific operations. By combining multiple SQL steps into a stored procedure, you can reduce round trips between your applications and the database.

You can also use stored procedures for delegated access control. For example, you can create stored procedures to perform functions without giving a user access to the underlying tables.

Why would you use stored procedures?

Many customers migrating to Amazon Redshift have complex data warehouse processing pipelines built with stored procedures on their legacy data warehouse platform. Complex transformations and important aggregations are defined with stored procedures and reused in many parts of their processing. Re-creating the logic of these processes using an external programming language or a new ETL platform could be a large project. Using Amazon Redshift stored procedures allows you to migrate to Amazon Redshift more quickly.

Other customers would like to tighten security and limit the permissions of their database users. Stored procedures offer new options to allow DBAs to perform necessary actions without having to give permissions too widely. With the security definer concepts in stored procedures, it is now possible to allow users to perform actions they otherwise would not have permissions to run.

Additionally, using stored procedures in this way helps reduce the operations burden. An experienced DBA is able to define a well-tested process for some administrative or maintenance action. They can then allow other, less experienced operators to execute the process without entrusting them with full superuser permissions on the cluster.

Finally, some customers prefer using stored procedures to manage their ETL/ELT operations as an alternative to shell scripting or complex orchestration tools. It can be difficult to ensure that shell scripts correctly retrieve and interpret the state of each operation in an ETL/ELT process. It can also be challenging to take on the operation and maintenance of an orchestration tool with a small data warehouse team.

Stored procedures allow the ETL/ELT logical steps to be fully enclosed in a master procedure that is written so that it either succeeds completely or fails cleanly with no side effects. The stored procedure can be called with confidence from a simple scheduler like cron.

Create a stored procedure

To create a stored procedure in Amazon Redshift, use the following the syntax:

CREATE [ OR REPLACE ] PROCEDURE sp_procedure_name 
  ( [ [ argname ] [ argmode ] argtype [, ...] ] )
AS $$
$$ LANGUAGE plpgsql 
[ SET configuration_parameter { TO value | = value } ]

When you design stored procedures, think about the encapsulated functionality, input and output parameters, and security level. As an example, here’s how you can write a stored procedure to check primary key violations, given names of the schema, table, and primary key column, using dynamic SQL:

CREATE OR REPLACE PROCEDURE check_primary_key(schema_name varchar(128), 
table_name varchar(128), col_name varchar(128)) LANGUAGE plpgsql
AS $$
  cnt_var integer := 0;
  SELECT INTO cnt_var count(*) from pg_table_def where schemaname = schema_name and
  tablename = table_name and "column" = col_name;
  IF cnt_var = 0 THEN
    RAISE EXCEPTION 'Input table or column does not exist.';

  DROP TABLE IF EXISTS duplicates;
    $_$ CREATE TEMP TABLE duplicates as
    SELECT $_$|| col_name ||$_$, count(*) as counter
    FROM $_$|| table_name ||$_$
    GROUP BY 1
    HAVING count(*) > 1
    ORDER BY counter desc $_$;
  SELECT INTO cnt_var COUNT(*) FROM duplicates;
  IF cnt_var = 0
    THEN RAISE INFO 'No duplicates found';
    DROP TABLE IF EXISTS duplicates;
    RAISE INFO 'Duplicates exist for % value(s) in column %', cnt, col_name;
    RAISE INFO 'Check tmp table "duplicates" for duplicated values';

For details about the kinds of SQL queries and control flow logic that can be used inside a stored procedure, see Creating Stored Procedures in Amazon Redshift.

Invoke a stored procedure

Stored procedures must be invoked by the CALL command, which takes the procedure name and the input argument values. CALL can’t be part of any regular queries. As an example, here’s how to invoke the stored procedure created earlier:

db=# call check_primary_key('public', 'bar', 'b');
INFO:  Duplicates exist for 1 value(s) in column b
INFO:  Check tmp table "duplicates" for duplicated values

Amazon Redshift stored procedure calls can return results through output parameters or a result set. Nested and recursive calls are also supported. For details, see CALL command.

How to use security definer procedures

Now that you know how to create and invoke a stored procedure, here’s more about the security aspects. When you create a procedure, only you as the owner (creator) have the privilege to call or execute it. You can grant EXECUTE privilege to other users or groups, which enables them to execute the procedure. EXECUTE privileges do not automatically imply that the caller can access all database objects (tables, views, and so on) that are referenced in the stored procedure.

Take the example of a procedure, sp_insert_customers, created by user Mary. It has an INSERT statement that writes to table customers that is owned by Mary. If Mary grants EXECUTE privileges to user John, John still can’t INSERT into the table customers unless he has explicitly been granted INSERT privileges on customers.

However, it might make sense to allow John to call the stored procedure without giving him INSERT privileges on customers. To do this, Mary has to set the SECURITY attribute of the procedure to DEFINER when creating the procedure and then grant EXECUTE privileges to John. With this set, when John calls sp_insert_customers, it executes with the privileges of Mary and can insert into customers without him having been granted INSERT privileges on that table.

When the security attribute is not specified during procedure creation, its value is set to INVOKER by default. This means that the procedure executes with the privileges of the user that calls it. When the security attribute is explicitly set to DEFINER, the procedure executes with the privileges of the procedure owner.

Best practices with stored procedures in Amazon Redshift

Here are some best practices for using stored procedures.

Ensure that stored procedures are captured in your source control tool.

If you plan to use stored procedures as a key element of your data processing, you should also establish a practice of committing all stored procedure changes to a source control system.

You could also consider defining a specific user who is the owner of important stored procedures and automating the process of creating and modifying procedures.

You can retrieve the source for existing stored procedures using the following command:

SHOW procedure_name;

Consider the security scope of each procedure and who calls it

By default, stored procedures run with the permission of the user that calls them. Use the SECURITY DEFINER attribute to enable stored procedures to run with different permissions. For instance, explicitly revoke access to DELETE from an important table and define a stored procedure that executes the delete after checking a safe list.

When using SECURITY DEFINER, take care to:

  • Grant EXECUTE on the procedure to specific users, not to PUBLIC. This ensures that the procedure can’t be misused by general users.
  • Qualify all database objects that the procedure accesses with the schema names if possible. For example, use myschema.mytable instead of just mytable.
  • Set the search_path when creating the procedure by using the SET option. This prevents objects in other schemas with the same name from being affected by an important stored procedure.

Use set-based logic and avoid manually looping over large datasets

When manipulating data within your stored procedures, continue to use normal, set-based SQL as much as possible, for example, INSERT, UPDATE, DELETE.

Stored procedures provide new control structures such as FOR and WHILE loops. These are useful for iterating over a small number of items, such as a list of tables. However, you should avoid using the loop structures to replace a set-based SQL operation. For example, iterating over millions of values to update them one-by-one is inefficient and slow.

Be aware of REFCURSOR limits and use temp tables for larger result sets

Result sets may be returned from a stored procedure either as a REFCURSOR or using temp tables.  REFCURSOR is an in-memory data structure and is the simplest option in many cases.

However, there is a limit of one REFCURSOR per stored procedure. You may want to return multiple result sets, interact with results from multiple sub-procedures, or return millions of result rows (or more). In those cases, we recommend directing results to a temp table and returning a reference to the temp table as the output of your stored procedure.

Keep procedures simple and nest procedures for complex processes

Try to keep the logic of each stored procedure as simple possible. You maximize your flexibility and make your stored procedures more understandable by keeping them simple.

The code of your stored procedures can become complex as you refine and enhance them. When you encounter a long and complex stored procedure, you can often simplify by moving sub-elements into a separate procedure that is called from the original procedure.

Migrating a stored procedure with the AWS Schema Conversion Tool

With Amazon Redshift announcing the support for stored procedures, AWS also enhanced AWS Schema Conversion Tool to convert stored procedures from legacy data warehouses to Amazon Redshift.

AWS SCT already supports the conversion of Microsoft SQL Server data warehouse stored procedures to Amazon Redshift.

With build 627, AWS SCT can now convert Microsoft SQL Server data warehouse stored procedures to Amazon Redshift. Here are the steps in AWS SCT:

  1. Create a new OLAP project for a SQL Server data warehouse (DW) to Amazon Redshift conversion.
  2. Connect to the SQL Server DW and Amazon Redshift endpoints.
  3. Uncheck all nodes in the source tree.
  4. Open the context (right-click) menu for Schemas.
  5. Open the context (right-click) menu for the Stored Procedures node and choose Convert Script (just like when you convert database objects).
  6. (Optional) You can also choose to review the assessment report and apply the conversion.

Here is an example of a SQL Server DW stored procedure conversion:


Stored procedure support in Amazon Redshift is now generally available in every AWS Region. We hope you are as excited about running stored procedures in Amazon Redshift as we are.

With stored procedure support in Amazon Redshift and AWS Schema Conversion Tool, you can now migrate your stored procedures to Amazon Redshift without having to encode them in another language or framework. This feature reduces migration efforts. We hope more on-premises customers can take advantage of Amazon Redshift and migrate to the cloud for database freedom.


About the Authors

Joe Harris is a senior Redshift database engineer at AWS, focusing on Redshift performance. He has been analyzing data and building data warehouses on a wide variety of platforms for two decades. Before joining AWS he was a Redshift customer from launch day in 2013 and was the top contributor to the Redshift forum.



Abhinav Singh is a database engineer at AWS. He works on design and development of database migration projects as well as customers to provide guidance and technical assistance on database migration projects, helping them improve the value of their solutions when using AWS.




Entong Shen is a software engineer on the Amazon Redshift query processing team. He has been working on MPP databases for over 6 years and has focused on query optimization, statistics and SQL language features. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.



Vinay is a principal product manager at Amazon Web Services for Amazon Redshift. Previously, he was a senior director of product at Teradata and a director of product at Hortonworks. At Hortonworks, he launched products in Data Science, Spark, Zeppelin, and Security. Outside of work, Vinay loves to be on a yoga mat or on a hiking trail.



Sushim Mitra is a software development engineer on the Amazon Redshift query processing team. He focuses on query optimization problems, SQL Language features and Database security. When not at work, he enjoys reading fiction from all over the world.





Orchestrating an ETL process using AWS Step Functions for Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/orchestrating-an-etl-process-using-aws-step-functions-for-amazon-redshift/

Modern data lakes depend on extract, transform, and load (ETL) operations to convert bulk information into usable data. This post walks through implementing an ETL orchestration process that is loosely coupled using AWS Step Functions, AWS Lambda, and AWS Batch to target an Amazon Redshift cluster.

Because Amazon Redshift uses columnar storage, it is well suited for fast analytical insights using the convenient ANSI SQL queries. You can rapidly scale your Amazon Redshift clusters up and down in minutes to meet the demanding workloads for both your end-user reports and timely data refresh into the data warehouse.

AWS Step Functions makes it easy to develop and use repeatable workflows that scale well. Step Functions lets you build automation workflows from individual Lambda functions. Each function performs a discrete task and lets you develop, test, and modify the components of your workflow quickly and seamlessly.

An ETL process refreshes your data warehouse from source systems, organizing the raw data into a format you can more readily use. Most organizations run ETL as a batch or as part of a real-time ingest process to keep the data warehouse current and provide timely analytics. A fully automated and highly scalable ETL process helps minimize the operational effort that you must invest in managing the regular ETL pipelines. It also ensures the timely and accurate refresh of your data warehouse. You can tailor this process to refresh data into any data warehouse or the data lake.

This post also provides an AWS CloudFormation template that launches the entire sample ETL process in one click to refresh the TPC-DS dataset. Find the template link in the Set up the entire workflow using AWS CloudFormation section.

Architectural overview

The following diagram illustrates the architectural overview of the different components involved in the orchestration of the ETL workflow. This workflow uses Step Functions to fetch source data from Amazon S3 to refresh the Amazon Redshift data warehouse.

Here are the core components of the workflow:

  • Amazon CloudWatch triggers the ETL process based on a schedule, through the AWS CLI, or using the various AWS SDKs in a Lambda function.
  • The ETL workflow uses Step Functions for a multi-step ETL process and manages AWS services into serverless workflows. You can build and easily iterate these using JSON-based templates. For example, a typical ETL process may involve refreshing dimensions first and later refreshing the fact tables. You can declare your order of operations using a Step Functions state machine.
  • A Lambda function lets you build microservices to coordinate job submission and monitoring without needing to write code for workflow logic, parallel processes, error handling, timeouts, or retries.
  • AWS Batch runs several ETL jobs such as transforms and loads into Amazon Redshift. AWS Batch manages all the infrastructure for you, avoiding the complexities of provisioning, managing, monitoring, and scaling your batch computing jobs. It also lets you wait for the jobs to complete.
  • The source data in Amazon S3 refreshes an Amazon Redshift data warehouse through a PL/SQL container. To specify the ETL logic, I use.sql files that contain the SQL code for a particular step. For example, a .sql file for a typical dimension table refresh contains steps to load the data from Amazon S3 to a temporary staging table and INSERT/UPDATE the target table. Before beginning, review a sample dimensional table .sql file.

You can execute the workflow and monitor it using the state machine. You can trigger the ETL according to a schedule or an event (for example, as soon as all the data files arrive in S3).


Before you get started, create a Docker image that can execute .sql files. AWS Batch creates resources for executing the ETL steps using this Docker image. To create the Docker image, you need:

If this is your first time using AWS Batch, see Getting Started with AWS Batch. Create an environment to build and register the Docker image. For this post, register this image in an Amazon ECR repository. This is a private repository by default, making it useful for AWS Batch jobs.

Building the fetch and running psql Docker image

To build the Docker image, follow the steps outlined in the post Creating a Simple “Fetch & Run” AWS Batch Job.

Use the following Docker configuration and fetch and run psql scripts to build the images.

  1. DockerFetchRunPsqlUbundu
  2. fetch_and_run_psql.sh

Follow the steps in the post to import the Docker image into the ECR container registry. After you complete the previous steps, your Docker image is ready to trigger a .sql execution for an Amazon Redshift cluster.

Example: ETL process using TPC-DS dataset

This example uses a subset of the TPC-DS dataset to demonstrate a typical dimensional model refresh. Here is the Entity Relationship diagram of the TPC-DS data model that I use for this ETL application:

The ETL process refreshes table data for the Store_Sales fact table along with the Customer_Address and Item dimensions for a particular dataset date.

Setting up the ETL workflow using Step Functions

Step Functions make complicated workflows more straightforward. You can set up dependency management and failure handling using a JSON-based template. Workflows are just a series of steps, with the output of one step acting as input into the next.

This example completes various dimensional table transforms and loads before triggering the Fact table load. Also, a workflow can branch out into multiple parallel steps whenever needed. You can monitor each step of execution as it happens, which means you can identify and fix problems quickly.

This illustration outlines the example ETL process set up through Step Functions:

For more information, see the detailed workflow diagram.

In the above workflow, the ETL process checks the DB connection in Step 1 and triggers the Customer_Address (Step 2.1) and Item_dimension (Step 2.2) steps, which execute in parallel. The Store_Sales (Step 3) FACT table waits for the process to complete the dimensional tables. Each ETL step is autonomous, allowing you to monitor and respond to failures at any stage.

I now examine the Store_Sales step (Step 3) in detail. Other steps follow a similar pattern of implementation.

Here is the state implementation for Store_Sales step (Step 3):

   "Comment":"A simple ETL example that submits a Job to AWS Batch",

The Parallel process that loads all the dimension tables sets up a dependency on later Store Sales Fact transformation/load SalesFACTInit through the Next attribute. The SalesFACTInit step triggers the transformation using the SubmitStoreSalesFACTJob to AWS Batch triggered through AWS Lambda job JobStatusPol-SubmitJobFunction. GetStoreSalesFACTJobStatus polls through the AWS Lambda JobStatusPoll-CheckJobFunction every 30 seconds to check for completion. CheckStoreSalesFACTJobStatus validates the status and decides to succeed or fail the process depending on the returned status.

Here is snippet of input for executing the state machine job for Step 3:


The input defines which .sql file each step invokes, along with the refresh date. You can represent any complex ETL workflow as a JSON workflow, making it easy to manage. This also decouples the inputs to invoke for each step.

Executing the ETL workflow

AWS Batch executes each .sql script (store_sales.sql) that the state machine invokes by using an incremental data refresh for the sales data on a particular date.

Here is the load and transformation implementation for the store_sales.sql:

\set s3datareadrolevar 'aws_iam_role=' :s3datareadrole
-- This transform ETL will refresh data for the store_sales table
-- Start a new transaction

begin transaction;

-- Create a stg_store_sales staging table and COPY data from input S3 location it with updated rows from SALES_UPDATE

DROP TABLE if exists public.stg_store_sales;
CREATE TABLE public.stg_store_sales
	sold_date DATE   ENCODE lzo
	,sold_time INTEGER   ENCODE lzo
	,i_item_id CHAR(16)   ENCODE lzo
	,c_customer_id CHAR(16)   ENCODE lzo
	,cd_demo_sk INTEGER   ENCODE lzo
	,hd_income_band_sk INTEGER   ENCODE lzo
	,hd_buy_potential CHAR(15)   ENCODE lzo
	,hd_dep_count INTEGER   ENCODE lzo
	,hd_vehicle_count INTEGER   ENCODE lzo
	,ca_address_id CHAR(16)   ENCODE lzo
	,s_store_id CHAR(16)   ENCODE lzo
	,p_promo_id CHAR(16)   ENCODE lzo
	,ss_ticket_number INTEGER   ENCODE lzo
	,ss_quantity INTEGER   ENCODE lzo
	,ss_wholesale_cost NUMERIC(7,2)   ENCODE lzo
	,ss_list_price NUMERIC(7,2)   ENCODE lzo
	,ss_sales_price NUMERIC(7,2)   ENCODE lzo
	,ss_ext_discount_amt NUMERIC(7,2)   ENCODE lzo
	,ss_ext_sales_price NUMERIC(7,2)   ENCODE lzo
	,ss_ext_wholesale_cost NUMERIC(7,2)   ENCODE lzo
	,ss_ext_list_price NUMERIC(7,2)   ENCODE lzo
	,ss_ext_tax NUMERIC(7,2)   ENCODE lzo
	,ss_coupon_amt NUMERIC(7,2)   ENCODE lzo
	,ss_net_paid NUMERIC(7,2)   ENCODE lzo
	,ss_net_paid_inc_tax NUMERIC(7,2)   ENCODE lzo
	,ss_net_profit NUMERIC(7,2)   ENCODE lzo

\set s3loc 's3://salamander-us-east-1/reinvent2018/ant353/store_sales/saledate=' :dt '/'
-- COPY input data to the staging table
copy public.stg_store_sales
CREDENTIALS :'s3datareadrolevar'
DELIMITER '~' gzip region 'us-east-1';

-- Delete any rows from target store_sales for the input date for idempotency
delete from store_sales where ss_sold_date_sk in (select d_date_sk from date_dim where d_date=:'dt');
--Insert data from staging table to the target TABLE
INSERT INTO store_sales
SELECT date_dim.d_date_sk ss_sold_date_sk,
       time_dim.t_time_sk ss_sold_time_sk,
       item.i_item_sk ss_item_sk,
       customer.c_customer_sk ss_customer_sk,
       customer_demographics.cd_demo_sk ss_cdemo_sk,
       household_demographics.hd_demo_sk ss_hdemo_sk,
       customer_address.ca_address_sk ss_addr_sk,
       store.s_store_sk ss_store_sk,
       promotion.p_promo_sk ss_promo_sk,
FROM stg_store_sales AS store_sales
  JOIN date_dim ON store_sales.sold_date = date_dim.d_date
  LEFT JOIN time_dim ON store_sales.sold_time = time_dim.t_time
  LEFT JOIN item
         ON store_sales.i_item_id = item.i_item_id
        AND i_rec_end_date IS NULL
  LEFT JOIN customer ON store_sales.c_customer_id = customer.c_customer_id
  LEFT JOIN customer_demographics ON store_sales.cd_demo_sk = customer_demographics.cd_demo_sk
  LEFT JOIN household_demographics
         ON store_sales.hd_income_band_sk = household_demographics.hd_income_band_sk
        AND store_sales.hd_buy_potential = household_demographics.hd_buy_potential
        AND store_sales.hd_dep_count = household_demographics.hd_dep_count
        AND store_sales.hd_vehicle_count = household_demographics.hd_vehicle_count
  LEFT JOIN customer_address ON store_sales.ca_address_id = customer_address.ca_address_id
  LEFT JOIN store
         ON store_sales.s_store_id = store.s_store_id
        AND s_rec_end_date IS NULL
  LEFT JOIN promotion ON store_sales.p_promo_id = promotion.p_promo_id;

  --drop staging table

  DROP TABLE if exists public.stg_store_sales;

  -- End transaction and commit
  end transaction;

This ETL implementation runs through the following steps:

  1. A COPY command fast loads the data from S3 in bulk into the staging table stg_store_sales.
  2. Begin…end transactions encapsulate multiple steps in the transformation and load process. This leads to fewer commit operations in the end, making the process less expensive.
  3. ETL implementation is idempotent. If it fails, you can retry the job without any cleanup. For example, it recreates the stg_store_sales is each time, then deletes target table store_sales with the data for the particular refresh date each time.

For best practices used in the preceding implementation, see the Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift post.

Furthermore, Customer_Address demonstrates a Type 1 implementation and Item follows a Type 2 implementation in a typical dimensional model.

Set up the entire workflow using AWS CloudFormation

The AWS CloudFormation template includes all the steps of this solution. This template creates all the required AWS resources and invokes initial data setup and the refresh of this data for a particular day. Here is a list of all the resources it creates inside the CloudFormation stack:

  • A VPC and associated subnets, security groups, and routes
  • IAM roles
  • An Amazon Redshift cluster
  • An AWS Batch job definition and compute environment
  • A Lambda function to submit and poll AWS Batch jobs
  • A Step Functions state machine to orchestrate the ETL workflow and refresh the data in the Amazon Redshift cluster

Here is the architecture of this setup that shows the Amazon Redshift setup in the VPC and the ETL process orchestrated using Step Functions:

Step 1: Create the stack with AWS CloudFormation

To deploy this application in your AWS account, start by launching this CloudFormation stack:

  • This stack uses the password Password#123. Change it as soon as possible. Use a minimum of eight characters, at least one uppercase letter, one lowercase letter, one number, and one special character.
  1. Use the default values for all other parameters.

The stack takes about ten minutes to launch. Wait for it to complete when the status changes to CREATE_COMPLETE.

Make a note of the value of ExecutionInput in the Output section of the stack. The JSON looks like the following code example:

{ "DBConnection":{ "jobName":"
alue":"s3://salamander-us-east-1/reinvent2018/ant353/etlscript/store_sales.sql" }, { "name":"DATASET_DATE", "value":"2003-01-02" } ] } } }

Note the Physical ID of JobDefinition and JobQueue in the Resources section of the stack.

Step 2: Set up TPC-DS 1-GB initial data in Amazon Redshift

The following steps load an initial 1 GB of TPCDS data into the Amazon Redshift cluster:

  • In the AWS Batch console, choose Job, select the job queue noted earlier, and choose Submit Job.
  • Set a new job name, for example, TPCDSdataload, and select the JobDefinition value that you noted earlier. Choose Submit Job. Wait for the job to completely load the initial 1 GB of TPCDS data into the Amazon Redshift cluster.
  • In the AWS Batch dashboard and monitor for the completion of TPCDS data load. This takes about ten minutes to complete.

Step 3: Execute the ETL workflow in the setup function

The ETL process is a multi-step workflow to refresh the TPCDS dimensional model with data from 2010-10-10.

  1. In the Step Functions console, choose JobStatusPollerStateMachine-*.
  2. Choose Start execution and provide an optional execution name, for example, ETLWorkflowDataRefreshfor2003-01-02. In the execution input, enter the ExecutionInput value that you noted earlier. This kickstarts the ETL process. The state machine uses the Lambda poller to submit and monitor each step of the ETL job. Each input invokes the ETL workflow. You can monitor the process of the ETL by refreshing your browser.

Step 4: Verify the ETL data refresh in the Amazon Redshift cluster

In the Amazon Redshift console, choose Query Editor. Enter the following credentials:

  • Database: dev.
  • Database Use: awsuser.
  • Password: This requires the password that you created in Step 1 (default Password#123).

After you are logged in to the public schema, execute the following query to check the data load for 2010-10-10:

SELECT c_last_name, 
FROM   (SELECT ss_ticket_number, 
               ca_city            bought_city, 
               Sum(ss_coupon_amt) amt, 
               Sum(ss_net_profit) profit 
        FROM   store_sales, 
        WHERE  store_sales.ss_sold_date_sk = date_dim.d_date_sk 
               AND store_sales.ss_store_sk = store.s_store_sk 
               AND store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk 
               AND store_sales.ss_addr_sk = customer_address.ca_address_sk 
               AND ( household_demographics.hd_dep_count = 6 
                      OR household_demographics.hd_vehicle_count = 0 ) 
               AND d_date =  '2003-01-02'
        GROUP  BY ss_ticket_number, 
                  ca_city) dn, 
       customer_address current_addr 
WHERE  ss_customer_sk = c_customer_sk 
       AND customer.c_current_addr_sk = current_addr.ca_address_sk 
       AND current_addr.ca_city <> bought_city 
ORDER  BY c_last_name, 
LIMIT 100;

The query should display the TPC-DS dataset for 2010-10-10 that the ETL process loaded.

Step 5: Cleaning up

When you finish testing this solution, remember to clean up all the AWS resources that you created using AWS CloudFormation. Use the AWS CloudFormation console or AWS CLI to delete the stack that you specified previously.


In this post, I described how to implement an ETL workflow using decoupled services in AWS and set up a highly scalable orchestration that lets you refresh data into an Amazon Redshift cluster.

You can easily expand on what you learned here. Here are some options that you let you extend this solution to accommodate other analytical services or make it robust enough to be production ready:

  • This example invokes the state machine manually using Step Functions. You can instead trigger the state machine automatically using a CloudWatch event or S3 event, such as whenever new files arrive in the source bucket. You also drive the ETL invocation using a schedule. For useful information for automating your ETL workflow, see Schedule a Serverless Workflow.
  • You can add an alert mechanism in case of failures. To do this, create a Lambda function that sends you an email based on the status of each step in the Step Functions workflow.
  • Each step of the state machine is autonomous and can invoke any service with a Lambda function. You can integrate any analytical service into your workflow. For example, you can create a separate Lambda function to invoke AWS Glue and clean some of your data before transforming the data using Amazon Redshift. In this case, you add the AWS Glue job as a dependency in the step before the dimension load.

With this Step Functions-based workflow, you can decouple the different steps of ETL orchestration using any analytical service. Because of this, the solution is adaptable and interchangeable to a wide variety of applications.

If you have questions or suggestions, please leave a comment below.


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.




Why 3M Health Information Systems chose Amazon Redshift to build a healthcare analytics platform

Post Syndicated from Dhanraj Shriyan original https://aws.amazon.com/blogs/big-data/why-3m-health-information-systems-chose-amazon-redshift-to-build-a-healthcare-analytics-platform/

3M Health Information Systems (HIS), a business of 3M Health Care, works with providers, payers, and government agencies to anticipate and navigate a changing healthcare landscape. 3M provides healthcare performance measurement and management solutions, analysis and strategic services that help clients move from volume to value-based health care, resulting in millions of dollars in savings, improved provider performance and higher quality care. 3M’s innovative software is designed to raise the bar for computer-assisted coding, clinical documentation improvement, performance monitoring, quality outcomes reporting, and terminology management.

There was an ongoing initiative at 3M HIS to migrate applications installed on-premises or on other cloud hosting providers to Amazon Web Services (AWS). 3M HIS began migration to AWS to take advantage of compute, storage and network elasticity. We wanted to build on a solid foundation that would help us focus more of our efforts on delivering customer value while also scaling to support business growth that we expected over the next several years. 3M HIS was already processing healthcare data for many customers which is complex in nature, requiring a lot of complex transformations to get data into a format useful for analytics or machine learning.
After reviewing many solutions, 3M HIS chose Amazon Redshift as the appropriate data warehouse solution. We concluded Amazon Redshift met our needs; a fast, fully managed, petabyte-scale data warehouse solution that uses columnar storage to minimize I/O, provides high data compression rates, and offers fast performance. We quickly spun up a cluster in our development environment, built out the dimensional model, loaded data, and made it available to perform benchmarking and testing of the user data. An extract, transform, load (ETL) tool was used to process and load the data from various sources into Amazon Redshift.

After reviewing many solutions, 3M HIS chose Amazon Redshift as the appropriate data warehouse solution. We concluded Amazon Redshift met our needs; a fast, fully managed, petabyte-scale data warehouse solution that uses columnar storage to minimize I/O, provides high data compression rates, and offers fast performance. We quickly spun up a cluster in our development environment, built out the dimensional model, loaded data, and made it available to perform benchmarking and testing of the user data. An extract, transform, load (ETL) tool was used to process and load the data from various sources into Amazon Redshift.

3M legacy implementation

3M HIS processes a large volume of data through this data warehouse. We ingest healthcare data from clients, representing millions of procedure codes, diagnosis codes and all the associated meta data for each of those codes. The legacy process loaded this data into the data warehouse once every two weeks.

For reporting, we published 25 static reports and 6 static HTML reports for over 1000 customers every week. To provide business intelligence reports, we built analytical cubes on a legacy relational database and provided reports from those cubes.

With the ever-increasing amounts of data to be processed meeting the SLA’s was a challenge. It was time to replace our SQL database with a modern architecture and tooling that would auto scale based on the data to be processed and be performant.

How 3M modernized the data warehouse with Amazon Redshift

When choosing a new solution, first we had to ensure that we were able to load data in near real-time. Second, we had to ensure that the solution was scalable, because the amount of data stored in the data warehouse would be 10x larger as compared to the existing solution. Third, the solution needed to be able to provide reports for massive queries in reasonable amount of time without impacting the ETL processing that would be running 24/7. Last, we needed a cost-effective data warehouse that could integrate with other analytics services that were part of the overall solution.

We evaluated data warehouses such as Amazon Redshift and Snowflake. We chose Amazon Redshift because it fulfilled the preceding criteria and aligned with our preference for native AWS managed services. But also, we did so because Amazon Redshift would be a solution for the future that could keep pace with the growth of the business in an economically sustainable way.

To build the reporting tool, we migrated our multi-terabyte data warehouse to Amazon Redshift. The data was processed through the ETL workflow into an Amazon S3 bucket and then bulk-copied into Amazon Redshift. The GitHub repository provided by AWS, a collection of scripts and utilities, helped us set up the cluster and get the best performance possible from Amazon Redshift.

Top lessons that we learned during the implementation

During the initial development, we encountered challenges loading the data into the Amazon Redshift tables because we were trying to load the data from an Amazon RDS staging instance into Amazon Redshift. After some research, we identified that a bulk load from Amazon S3 bucket was the best practice to get large amounts of data loaded into the Amazon Redshift tables.

The second challenge was that the Amazon Redshift VACUUM and ANALYZE operations were holding up our ETL pipeline, because these operations had been baked into the ETL process. We performed frequent data loads into the Amazon Redshift tables and a lot of DELETE operations as part of the ETL processing. These two concerns meant that the VACUUM and ANALYZE operations had to be performed frequently, resulting in the tables being locked for the operations’ duration and conflicting with the ETL process. Triggering the process after all the loads were complete helped eliminate the performance issues we were encountering. VACUUM and ANALYZE have recently been automated, which we expect to prevent such issues arising in the future.

The final challenge was to find a way to use windowing functions which previously resided in the Analysis Services cube layer, whose functionality Amazon Redshift now fulfilled. However, most of the windowing functions that we need are built into Amazon Redshift allowing for an easy transition to port the existing functionality to Amazon Redshift and provide the same results.

During the port, we used Amazon Redshift’s comprehensive best practices guide and tuning techniques. We found these helped us set up the Amazon Redshift cluster for optimal performance.

Flow diagram of the new implementation

Benefits of the updated implementation

With the legacy solution, our implementation had grown complex, we found it difficult to support the growing volume of new data we needed to incorporate into the database and then report on in near real-time. Reports executing on the data were slowly drifting away from the initial SLA requirements. With Amazon Redshift, we’re working to solve those problems with less need for manual maintenance and care and feeding of the solution. First, it has the potential to allow us to store a larger quantity of data for a longer time. Second, adding nodes to the cluster when needed is simple and we can do it in minutes with the Elastic Resize feature. Likewise, we can scale back nodes when cost sensitivity is an issue. Third, Amazon Redshift also gives better support for computing analytics over large sets of grouped data than our previous solution. Often, we want to look at how recent data compares to historical data. In some cases, we want more than a year or two of historical data to rule out seasonality and we have found that Amazon Redshift is a more scalable solution for this type of operation.


At 3M HIS, we’re transforming health care from a system that treats disease to one that improves health and wellness with accurate health and clinical information from the start. 3M’s nearly 40 years of clinical knowledge, from data management and nosology to reimbursement and risk adjustment, opens the door for our providers and payers clients to find innovative solutions that improve outcomes across the continuum of care. We help our clients ensure accurate and compliant reimbursement, as well as leverage 3M’s analytical capabilities, powered by AWS, to improve health system and health plan performance while lowering costs.


About the Author

Dhanraj Shriyan is an Enterprise Data Architect at 3M Health Information Systems with a Masters in Predictive Analytics from Northwestern University, Chicago. He loves helping customers in exploring their data and providing valuable insights and implementing a scalable solution using the right database technology based on their needs. He has several years of experience in building large scale datawarehouse Business Intelligence solutions in cloud as well as on prem. Currently, exploring graph technologies and Lake formation services in AWS.





Query your Amazon Redshift cluster with the new Query Editor

Post Syndicated from Surbhi Dangi original https://aws.amazon.com/blogs/big-data/query-your-amazon-redshift-cluster-with-the-new-query-editor/

Data warehousing is a critical component for analyzing and extracting actionable insights from your data. Amazon Redshift is a fast, scalable data warehouse that makes it cost-effective to analyze all of your data across your data warehouse and data lake.

The Amazon Redshift console recently launched the Query Editor. The Query Editor is an in-browser interface for running SQL queries on Amazon Redshift clusters directly from the AWS Management Console. Using the Query Editor is the most efficient way to run queries on databases hosted by your Amazon Redshift cluster.

After creating your cluster, you can use the Query Editor immediately to run queries on the Amazon Redshift console. It’s a great alternative to connecting to your database with external JDBC/ODBC clients.

In this post, we show how you can run SQL queries for loading data in clusters and monitoring cluster performance directly from the console.

Using the Query Editor instead of your SQL IDE or tool

The Query Editor provides an in-browser interface for running SQL queries on Amazon Redshift clusters. For queries that are run on compute nodes, you can then view the query results and query execution plan next to your queries.

The ability to visualize queries and results in a convenient user interface lets you accomplish many tasks, both as a database administrator and a database developer. The visual Query Editor helps you do the following:

  • Build complex queries.
  • Edit and run queries.
  • Create and edit data.
  • View and export results.
  • Generate EXPLAIN plans on your queries.

With the Query Editor, you can also have multiple SQL tabs open at the same time. Colored syntax, query autocomplete, and single-step query formatting are all an added bonus!

Database administrators typically maintain a repository of commonly used SQL statements that they run regularly. If you have this written in a notepad somewhere, the saved queries feature is for you. This feature lets you save and reuse your commonly run SQL statements in one step. This makes it efficient for you to review, rerun, and modify previously run SQL statements. The Query Editor also has an exporter so that you can export the query results into a CSV format.

The Query Editor lets you perform common tasks, such as creating a schema and table on the cluster and loading data in tables. These common tasks are now possible with a few simple SQL statements that you run directly on the console. You can also do day-to-day administrative tasks from the console. These tasks can include finding long-running queries on the cluster, checking for potential deadlocks with long-running updates on a cluster, and checking for how much space is available in the cluster.

The Query Editor is available in 16 AWS Regions. It’s available on the Amazon Redshift console at no extra cost to you. Standard Amazon Redshift rates apply for your cluster usage and for Amazon Redshift Spectrum. To learn more, see Amazon Redshift pricing.

Let’s get started with the Query Editor

The following sections contain the steps for setting up your Amazon Redshift cluster with a sample dataset from an Amazon S3 bucket using the Query Editor directly from the console. For new users, this is an especially handy alternative to setting up JDBC/ODBC clients to establish a connection to your cluster. If you already have a cluster, you can complete these steps in 10 minutes or less.

In the following example, you use the Query Editor to perform these tasks:

  • Load a sample dataset in your cluster.
  • Run SQL queries on a sample dataset and view results and execution details.
  • Run administration queries on system tables and save frequently used queries.
  • Run SQL queries to join an internal and external table.

Use the following steps to set up your cluster for querying:

  1. On the Amazon Redshift console, create a cluster.For detailed steps, see the procedure described in Launch a Sample Amazon Redshift Cluster in the Amazon Redshift Getting Started Guide. Use any of the following currently supported node types: dc1.8xlarge, dc2.large, dc2.8xlarge, or ds2.8xlarge.For this post, we used the Quick launch cluster button on the Amazon Redshift dashboard to create a single-node, dc2.large cluster called demo-cluster in the us-east-1 Region. As you go through the tutorial, replace this cluster name with the name of the cluster that you launched, and the Region where you launched the cluster.

  1. Add Query Editor-related permissions to the AWS account.To access the Query Editor feature on the console, you need permissions. For detailed steps, see Enabling Access to the Query Editor in the Amazon Redshift Cluster Management Guide.
  1. To load and run queries on a sample dataset (including permissions to load data from S3 or to use the AWS Glue or Amazon Athena Data Catalogs), follow these steps:a. To load sample data from Amazon S3 using the COPY command, you must provide authentication for your cluster to access Amazon S3 on your behalf. Sample data for this procedure is provided in an Amazon S3 bucket that is owned by Amazon Redshift. The bucket permissions are configured to allow all authenticated AWS users read access to the sample data files.To perform this step:

• Attach the AmazonS3ReadOnlyAccess policy to the IAM role. The AmazonS3ReadOnlyAccess policy grants your cluster read-only access to all Amazon S3 buckets.

• If you’re using the AWS Glue Data Catalog, attach the AWSGlueConsoleFullAccess policy to the IAM role. If you’re using the Athena Data Catalog, attach the AmazonAthenaFullAccess policy to the IAM role.

b. In step 2 of the example, you run the COPY command to load the sample data. The COPY command includes a placeholder for the IAM role Amazon Resource Name (ARN). To load sample data, add the role ARN in the COPY The following is a sample COPY command:

COPY myinternalschema.event FROM 's3://aws-redshift-spectrum-sample-data-us-east-1/spectrum/event/allevents_pipe.txt'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-east-1';

After you complete these steps, your Amazon Redshift cluster is ready. The following sections describe three steps that demonstrate what you can do with the Query Editor:

  • Use the Query Editor for loading data.
  • Perform several day-to-day administration tasks.
  • Run a query on data stored in the Amazon Redshift cluster and Amazon S3 data lake, with no need for loading or other data preparation.

Step 1: Connect to your cluster in the Query Editor

To connect to your cluster:

  1. Using the left navigation pane on the Amazon Redshift console, navigate to the Query Editor.
  2. In the Credentials dialog box, in the Cluster drop-down list, choose the cluster name (demo-cluster). Choose the database and the database user for this cluster.
  3. If you created the cluster by using the service-provided default values, choose dev as your Database selection, and enter awsuser in the Database user box.
  4. Enter the password for the cluster. Commonly, Amazon Redshift database users log on by providing a database user name and password. As an alternative, if you don’t remember your password, you can retrieve it in an encrypted format by choosing Create a temporary password, as shown in the following example. For more information, see Using IAM Authentication to Generate Database User Credentials.

This connects to the cluster if you have Query Editor-related permissions for the AWS account. For more information, see the step to add the Query Editor-related permissions to the AWS account in the previous section.

Step 2: Prepare the cluster with a sample dataset

To prepare the cluster with a sample dataset:

  1. Run the following SQL in the Query Editor. This creates the schema myinternalschema in the Amazon Redshift cluster demo-cluster.
/* Create a schema */
CREATE SCHEMA myinternalschema

  1. Run the following SQL statement in the Query Editor to create a table for schema myinternalschema.
/* Create table */
CREATE TABLE myinternalschema.event(
	eventid integer not null distkey,
	venueid smallint not null,
	catid smallint not null,
	dateid smallint not null sortkey,
	eventname varchar(200),
	starttime timestamp);
  1. Run the following SQL statement with the COPY command to copy the sample dataset from Amazon S3 to your Amazon Redshift cluster, demo-cluster, in the us-east-1 The Amazon S3 path for the sample dataset is s3://aws-redshift-spectrum-sample-data-us-east-1/spectrum/event/allevents_pipe.txt.

Before choosing Run Query, remember to replace the placeholder in the example with the ARN for the IAM role that is associated with this AWS account. If your cluster is in another AWS Region, replace the Region in the region parameter and the Amazon S3 path, as shown in the following SQL command:

/* Load data */
COPY myinternalschema.event FROM 's3://aws-redshift-spectrum-sample-data-us-east-1/spectrum/event/allevents_pipe.txt'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-east-1';
  1. To ensure access to the public dataset in Amazon S3, make sure that this AWS account has the correct permissions to access Amazon S3, AWS Glue, and Athena. For more information, see the step to load and run queries on the sample dataset (Amazon S3 and AWS Glue/Amazon Athena Data Catalog permissions) earlier in this post.
  2. To verify the data in the previously created table in the Query Editor, browse through the tables in the schema viewer on the left. Choose the preview icon next to the table name to see the first 10 records from the event table. Choosing this option runs the following query for a preview of the table, displaying 10 rows from the table:
/* View a snippet of the same dataset in myinternalschema */ 
SELECT * FROM myinternalschema.event

You can also enter your own SQL statements. Use Ctrl + Space to autocomplete queries in the Query Editor, to verify the data in the table that you created.

Step 3: Helpful cluster management queries

You are all set to try Amazon Redshift! In day-to-day cluster management and monitoring, you can run the following SQL queries in the Query Editor. These frequently used queries let you find and shut down long-running queries, uncover deadlock situations, and check for available disk space on your Amazon Redshift cluster. Save these queries and get convenient access to them by choosing Saved queries in the left navigation on the console, as shown in the following example:

Kill malfunctioning or long-running queries on a cluster

If there is a malfunctioning query that must be shut down, locating the query can often be a multi-step process. Run the following SQL in the Query Editor to find all queries that are running on an Amazon Redshift cluster with a SQL statement:

/* Queries are currently in progress */ 
 , query
 , pid
 , starttime
 , left(text, 50) as text
FROM pg_catalog.stv_inflight

After locating the malfunctioning queries from the query result set, use the cancel <pid> <msg> command to kill a query. Be sure to use the process ID—pid in the previous SQL—and not the query ID. You can supply an optional message that is returned to the issuer of the query and logged.

Monitor disk space being used on a cluster

One of the most frequently used console functions is monitoring the percentage of disk space used by a cluster. Queries fail if there is limited space in the cluster to create temp tables used while the query is running. Vacuums can also fail if the cluster does not have free space to store intermediate data in the cluster restore process. Monitoring this metric is important for planning ahead before the cluster gets full and you have to resize or add more clusters.

If you suspect that you are experiencing high or full disk usage with Amazon Redshift, run the following SQL in the Query Editor to find disk space available and see individual table sizes on the cluster:

/* Disk space available on your Redshift cluster */
SELECT SUM(used)::float / SUM(capacity) as pct_full
FROM pg_catalog.stv_partitions
/* Find individual table sizes */
SELECT t.name, COUNT(tbl) / 1000.0 as gb
SELECT DISTINCT id, name FROM stv_tbl_perm
) t
JOIN stv_blocklist ON tbl=t.id

From here, you can either drop the unnecessary tables or resize your cluster to have more capacity. For more information, see Resizing Clusters in Amazon Redshift.

Watch for deadlock situations with suspiciously long-running updates on the cluster

If a cluster has a suspiciously long-running update, it might be in a deadlocked transaction. The stv_locks table indicates any transactions that have locks, along with the process ID of the relevant sessions. This pid can be passed to pg_terminate_backend(pid) to kill the offending session.

Run a SQL statement in the Query Editor to inspect the locks:

\/* Find all transactions that have locks along with the process id of the relevant sessions */ 
FROM pg_catalog.stv_locks 
ORDER BY last_update asc

To shut down the session, run select pg_terminate_backend(lock_owner_pid), using the value from stl_locks.

See the rows affected by the most recent vacuums of the cluster

By running a vacuum command on tables in the cluster, any free space because of delete and update operations is reclaimed. At the same time, the data of the table gets sorted. The result is a compact and sorted table, which improves the cluster performance.

Run the following SQL statement to see a count of rows that were deleted or resorted from the most recent vacuums from the svv_vacuum_summary table:

/* Deleted or restored rows from most recent vacuums */
select * from svv_vacuum_summary
where table_name = 'events'

Debug connection issues for Amazon Redshift clusters

Joining stv_sessions and stl_connection_log tables returns a list of all sessions (all connects, authenticates, and disconnects on the cluster) and the respective remote host and port information.

To list all connections, run the following SQL statement in the Query Editor:

/* List connections, along with remote host information */ 
 '' remotehost,
FROM stv_sessions
LEFT JOIN stl_connection_log ON pid = process
  AND starttime > recordtime - interval '1 second'
ORDER BY starttime DESC

Use the saved queries feature to save these commonly used SQL statements in your account and run them in the Query Editor with one click.

Bonus step 4: Query with Amazon Redshift Spectrum

With Amazon Redshift Spectrum, you can query data in Amazon S3 without the need to first load it into Amazon Redshift. Amazon Redshift Spectrum queries employ massive parallelism to quickly process large datasets in S3, without ingesting that data into Amazon Redshift. Much of the processing occurs in the Amazon Redshift Spectrum layer. Multiple clusters can concurrently query the same dataset in Amazon S3 without needing to make copies of the data for each cluster.

To get set up with Amazon Redshift Spectrum, run the following SQL statements in the Query Editor for demo-cluster. If your cluster is in another AWS Region, be sure to replace the Region in the region parameter and the Amazon S3 path in the following SQL statement.

To create a new schema from a data catalog to use with Amazon Redshift Spectrum:

/* Create external (Amazon S3) schema */
from data catalog
database 'myexternaldatabase'
region 'us-east-1'
create external database if not exists;

To create a table for the Amazon Redshift Spectrum S3 sample dataset:

/* Create external table */
CREATE EXTERNAL TABLE myexternalschema.sales(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint, pricepaid decimal(8,1), commission decimal(8,1), saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://aws-redshift-spectrum-sample-data-us-east-1/spectrum/sales/' 
table properties ('numRows'='171000');

Start querying!

This section provides an example scenario to start querying data from the external (Amazon S3) sales table and the internal (Amazon Redshift) event table. The join query in this scenario looks for all events (from the sales dataset loaded on the demo-cluster) with the sale price paid > 50 (from the Amazon Redshift Spectrum dataset in Amazon S3, s3://aws-redshift-spectrum-sample-data-us-east-1/spectrum/sales/).

/* Join a table from the sample dataset with a Spectrum table */
/* Join external (Amazon S3) and internal (Amazon Redshift) table */
    myexternalschema.sales.eventid = myinternalschema.event.eventid       
    and myexternalschema.sales.pricepaid > 50  
group by
order by
    1 desc;

In the Query results section, choose View execution to see the detailed execution plan. The query plan is available for all queries executed on compute nodes.

Note: Queries that do not reference user tables, such as administration queries that only use catalog tables, do not have an available query plan.

Optionally, download the query results to your local disk for offline use. Queries run for up to three minutes in the Query Editor. After a query is completed, the Query Editor provides two minutes to fetch results. Rerun the query and try again if you hit the two-minute threshold.

Load additional tables from the Amazon Redshift sample dataset by using the following SQL statements and get creative with your queries. Before choosing Run query in the Query Editor, remember to add the ARN for the IAM role that is associated with this AWS account in the placeholder in the following SQL statement. If your cluster is in another AWS Region, replace the Region in the region parameter and the Amazon S3 path in the following SQL statement.

copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
delimiter '|' region 'us-west-2';

copy venue from 's3://awssampledbuswest2/tickit/venue_pipe.txt' 
delimiter '|' region 'us-west-2';

copy category from 's3://awssampledbuswest2/tickit/category_pipe.txt' 
delimiter '|' region 'us-west-2';

copy date from 's3://awssampledbuswest2/tickit/date2008_pipe.txt' 
delimiter '|' region 'us-west-2';

copy listing from 's3://awssampledbuswest2/tickit/listings_pipe.txt' 
delimiter '|' region 'us-west-2';

copy sales from 's3://awssampledbuswest2/tickit/sales_tab.txt'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-west-2';


In this post, we introduced the Query Editor, an in-browser interface for running SQL queries on Amazon Redshift clusters. We showed how you can use it to run SQL queries for loading data in clusters and monitoring cluster performance directly on the console. To learn more about Amazon Redshift and start with Query Editor, visit the Amazon Redshift webpage.

If you like this feature, share your feedback by using the Send feedback link on the console, as shown following.

If you have any questions or suggestions, please leave a comment below.

Happy querying!


About the Authors

Surbhi Dangi is a senior product/design manager at AWS. Her work includes building user experiences for Database, Analytics & AI AWS consoles, launching new database and analytics products, working on new feature launches for existing products, and building broadly adopted internal tools for AWS teams. She enjoys traveling to new destinations to discover new cultures, trying new cuisines, and teaches product management 101 to aspiring PMs.



Raja Bhogi is an engineering manager at AWS. He is responsible for building delightful and easy-to-use web experiences for analytics and blockchain products. His work includes launching web experiences for new analytics products, and working on new feature launches for existing products. He is passionate about web technologies, performance insights, and tuning. He is a thrill seeker and enjoys everything from roller coasters to bungy jumping.




Federate Amazon Redshift access with Okta as an identity provider

Post Syndicated from Rajiv Gupta original https://aws.amazon.com/blogs/big-data/federate-amazon-redshift-access-with-okta-as-an-identity-provider/

Managing database users and access can be a daunting and error-prone task. In the past, database administrators had to determine which groups a user belongs to and which objects a user/group is authorized to use. These lists were maintained within the database and could easily get disjointed from the corporate directory.

With federation, you can manage users and groups within the enterprise identity provider (IdP) and pass them to Amazon Redshift at login. In a previous post, Federate Database User Authentication Easily with IAM and Amazon Redshift, I discussed the internals of the federation workflow using Active Directory Federation Service (AD FS) as our identity provider.

In this post, I focus on Okta as the identity provider. I provide step-by-step guidance showing how you can set up a trial Okta.com account, build users and groups within your organization’s directory, and enable single sign-on (SSO) into Amazon Redshift. You can do all of this while also maintaining group-level access controls within your data warehouse.

The steps in this post are structured into the following sections:

  • Identity provider (Okta) configuration – You set up Okta, which contains your users organized into logical groups.
  • AWS configuration – You set up a role that establishes a trust relationship between your identity provider and AWS and a role that Okta uses to access Amazon Redshift.
  • Identity provider (Okta) advanced configuration – You finalize the Okta configuration by inputting the roles that you just created. You also inform Okta about which groups are allowed to be passed to Amazon Redshift.
  • Amazon Redshift server/client setup – You set up groups within the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables. Finally, you set up your client tools to use your enterprise credentials and sign in to Amazon Redshift.

Identity provider (Okta) configuration

In this first step, you set up Okta, add users, and organize them into logical groups. You then add the Amazon Web Services Redshift Okta application.

Step 1: Create an Okta account

If you don’t already have access to an Okta account, you can start a 30-day free trial: https://www.okta.com/free-trial/.

Step 2: Set up your Okta directory

Sign in to Okta.com using the following URL, where <prefix> is specific to your account and was created at account setup:


Navigate to the Directory page to add people and groups into Okta that match your organization. Be sure to use lowercase group names because Amazon Redshift expects the group names to be lowercase.

In the following example, I added three users and two groups, where one of the users (Jorge) belongs to both the “sales” and “marketing” groups.

First, choose Admin in the upper-right corner.

To add users, choose Add Person. The following example shows the users that were created.

To add groups into Okta, choose Add Group. The following example shows three groups.

Step 3: Add the “Amazon Web Services Redshift” Okta application

Navigate to the Applications page. Choose Add Application, and search for the Amazon Web Services Redshift application. Proceed with the default settings.

Step 4: Download the Okta application metadata

Make sure that you have navigated to the Amazon Web Services Redshift application’s settings page, which appears as follows.

Choose Sign On, and then choose the Identity Provider metadata link to download the metadata file in xml format (for example, metadata.xml).

AWS configuration

Next, you set up a role that establishes a trust relationship between the identity provider and AWS. You also create a role that Okta uses to access Amazon Redshift.

Step 5: Create the SAML IAM identity provider

Switching to AWS Management Console, sign in using your AWS credentials. Then open the AWS Identity and Access Management (IAM) console.

On the IAM console, choose Identity providers, and then choose Create Provider, as shown following.

Provide a name for your IdP, and upload the metadata file that you downloaded in the previous step.

Step 6: Create the IAM SAML 2.0 federation role

On the IAM console, navigate to Roles and create a new SAML 2.0 federation role.  Reference the IdP that you created in the previous step, and choose Allow programmatic and AWS Management Console access.

Step 7: Add other permissions to query Amazon Redshift

Choose Next: Assign Permissions. Then choose Create policy.

Create the following custom policy, replacing the region, account, and cluster parameters. These permissions allow the role to use Amazon Redshift to query data, create users, and allow users to join groups.

    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
           "Action": [
           "Resource": "arn:aws:redshift:<region>:<account>:cluster:<cluster>"}]

There are a few important things to note:

  • The group membership lasts only for the duration of the user session.
  • There is no CreateGroup permission because groups need to be manually created and granted DB privileges.

The following image shows the summary page for the role.

Identity provider (Okta) advanced configuration

In this section, you finalize the Okta configuration by adding the roles that you just created. You also tell Okta which groups are allowed to be passed to Amazon Redshift.

Step 8: Configure the advanced sign-on settings

Switch back to Okta.com. Navigate to the settings page for the Amazon Web Services Redshift application. In the Sign-On section, scroll to Advanced Sign-On Settings.

Enter the previously created IdP and role ARNS, which are globally unique and ensure that Okta will be directed to your AWS account. Allowed DB Groups is a list of allowed groups that will be sent to Amazon Redshift in the DBGroup SAML assertion.

Don’t use the asterisk (*) wildcard. This will cause the Everyone group to be passed, and Amazon Redshift will complain because it expects the group names to be lowercase.  Note that the ${user.username} is sent in the DBUser SAML assertion.

Step 9: Authorize users

Authorize users to use the Amazon Web Services Redshift application by selecting their respective groups or individual user accounts. In this example, I authorized users by group.

Amazon Redshift server/client setup

Next, you set up groups in the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables. Finally, you set up your client tools to use your enterprise credentials and sign in to Amazon Redshift.

Step 10: Set up groups

Log in to your Amazon Redshift cluster with an admin account. Create groups that match the IdP group names, and grant the appropriate permissions to tables and schemas.

CREATE GROUP marketing;
GRANT ALL on TABLES to GROUP marketing;

Step 11: Configure the JDBC SQL client

Assuming that the Amazon Redshift JDBC driver is installed, set up a new connection to your cluster using your IdP credentials. In the following example, I am using SQLWorkbenchJ. For the URL, be sure to enter “iam” to instruct the driver to authenticate using IAM. For Username and Password, enter the values that you set in Okta.

Enter the extended properties as follows. For app_id and idp_host, refer to the URL for the application in your web browser:


Step 12: Configure the ODBC SQL client

Assuming that the Amazon Redshift ODBC driver is installed, set up a new connection to your cluster using your IdP credentials. In the following example, I modified the ~/Library/ODBC/odbc.ini file.  See the previous instructions for determining the <app_id> and <prefix> values.

[ODBC Data Sources]
Redshift DSN=Installed

[Redshift DSN]

Step 13: Test user access

You should now be able to sign on with the users created. In our example, [email protected] has access to the tables in the “sales” schema only. The user [email protected] has access to tables in the “marketing” schema only. And [email protected] has access to tables in both schemas. Using the [email protected] user, you get following results when trying to query data from each of the schemas:

select storeid From sales.stores


select * From marketing.campaign

An error occurred when executing the SQL command:
select * From marketing.campaign

[Amazon](500310) Invalid operation: permission denied for schema marketing;
1 statement failed.

Execution time: 0.16s


In this post, I provided a step-by-step guide for configuring and using Okta as your Identity Provider (IdP) to enable single sign-on to an Amazon Redshift cluster. I also showed how group membership within your IdP can be passed along, enabling you to manage user access to Amazon Redshift resources from within your IdP.

If you have questions or suggestions, please comment below.


About the Author

Rajiv Gupta is a data warehouse specialist solutions architect with Amazon Web Services.





New – Concurrency Scaling for Amazon Redshift – Peak Performance at All Times

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-concurrency-scaling-for-amazon-redshift-peak-performance-at-all-times/

Amazon Redshift is a data warehouse that can expand to exabyte-scale. Today, tens of thousands of AWS customers (including NTT DOCOMO, Finra, and Johnson & Johnson) use Redshift to run mission-critical BI dashboards, analyze real-time streaming data, and run predictive analytics jobs.

A challenge arises when the number of concurrent queries grows at peak times. When a multitude of business analysts all turn to their BI dashboards or long-running data science workloads compete with other workloads for resources, Redshift will queue queries until enough compute resources become available in the cluster. This ensures that all of the work gets done, but it can mean that performance is impacted at peak times. Two options present themselves:

  • Overprovision the cluster to meet peak needs. This option addresses the immediate issue, but wastes resources and costs more than necessary.
  • Optimize the cluster for typical workloads. This option forces you to wait longer for results at peak times, possibly delaying important business decisions.

New Concurrency Scaling
Today I would like to offer a third option. You can now configure Redshift to add more query processing power on an as-needed basis. This happens transparently and in a manner of seconds, and provides you with fast, consistent performance even as the workload grows to hundreds of concurrent queries. Additional processing power is ready in seconds and does not need to be pre-warmed or pre-provisioned. You pay only for what you use, with per-second billing and also accumulate one hour of concurrency scaling cluster credits every 24 hours while your main cluster is running. The extra processing power is removed when it is no longer needed, making this a perfect way to address the bursty use cases that I described above.

You can allocate the burst power to specific users or queues, and you can continue to use your existing BI and ETL applications. Concurrency Scaling Clusters are used to handle many forms of read-only queries, with additional flexibility in the works; read about Concurrency Scaling to learn more.

Using Concurrency Scaling
This feature can be enabled for an existing cluster in minutes! We recommend starting with a fresh Redshift Parameter Group for testing purposes, so I start by creating one:

Then I edit my cluster’s Workload Management Configuration, select the new parameter group, set the Concurrency Scaling Mode to auto, and click Save:

I will use the Cloud Data Warehouse Benchmark Derived From TPC-DS as a source of test data and test queries. I download the DDL, customize it with my AWS credentials, and use psql to connect to my cluster and create the test data:

sample=# create database sample;
sample=# \connect sample;
psql (9.2.24, server 8.0.2)
WARNING: psql version 9.2, server version 8.0.
         Some psql features might not work.
SSL connection (cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256)
You are now connected to database "sample" as user "awsuser".
sample=# \i ddl.sql

The DDL creates the tables and loads populates them using data stored in an S3 bucket:

sample=# \dt
                 List of relations
 schema |          name          | type  |  owner
 public | call_center            | table | awsuser
 public | catalog_page           | table | awsuser
 public | catalog_returns        | table | awsuser
 public | catalog_sales          | table | awsuser
 public | customer               | table | awsuser
 public | customer_address       | table | awsuser
 public | customer_demographics  | table | awsuser
 public | date_dim               | table | awsuser
 public | dbgen_version          | table | awsuser
 public | household_demographics | table | awsuser
 public | income_band            | table | awsuser
 public | inventory              | table | awsuser
 public | item                   | table | awsuser
 public | promotion              | table | awsuser
 public | reason                 | table | awsuser
 public | ship_mode              | table | awsuser
 public | store                  | table | awsuser
 public | store_returns          | table | awsuser
 public | store_sales            | table | awsuser
 public | time_dim               | table | awsuser
 public | warehouse              | table | awsuser
 public | web_page               | table | awsuser
 public | web_returns            | table | awsuser
 public | web_sales              | table | awsuser
 public | web_site               | table | awsuser
(25 rows)

Then I download the queries and open up a bunch of PuTTY windows so that I can generate a meaningful load for my Redshift cluster:

I run an initial set of parallel queries, and then ramp up over time, I can see them in the Cluster Performance tab for my cluster:

I can see the additional processing power come online as needed, and then go away when no longer needed, in the Database Performance tab:

As you can see, my cluster scales as needed in order to handle all of the queries as expeditiously as possible. The Concurrency Scaling Usage shows me how many seconds of additional processing power I have consumed (as I noted earlier, each cluster accumulates a full hour of concurrency credits every 24 hours).

I can use the parameter max_concurrency_scaling_clusters to control the number of Concurrency Scaling Clusters that can be used (the default limit is 10, but you can request an increase if you need more).

Available Today
You can start making use of Concurrency Scaling Clusters today in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo) Regions today, with more to come later this year.



Granting fine-grained access to the Amazon Redshift Management Console

Post Syndicated from Raj Jayaraman original https://aws.amazon.com/blogs/big-data/granting-fine-grained-access-to-the-amazon-redshift-management-console/

As a fully managed service, Amazon Redshift is designed to be easy to set up and use. In this blog post, we demonstrate how to grant access to users in an operations group to perform only specific actions in the Amazon Redshift Management Console. If you implement a custom IAM policy, you can set it up so these users can monitor and terminate running queries. At the same time, you can prevent these users from performing other more privileged operations such as modifying, restarting, or deleting an Amazon Redshift cluster.

An overview of Amazon Redshift access control

Since its release in February 2013, Amazon Redshift has quickly become a popular cloud-based data warehousing platform for thousands of customers worldwide.

Access to Amazon Redshift requires credentials that AWS can use to authenticate your requests. Those credentials must have permissions to access Amazon Redshift resources, such as an Amazon Redshift cluster or a snapshot. For more details on these credentials, see Authentication and Access Control for Amazon Redshift in the Amazon Redshift documentation.

Every AWS resource is owned by an AWS account, and permissions to create or access the resources are governed by AWS Identity and Access Management (IAM) policies. An AWS account administrator can attach permissions policies to IAM identities (users, groups, and roles). In particular, an AWS account administrator can attach an IAM permissions policy to a specific user. Such a policy grants permissions for that user to manage an Amazon Redshift resource, such as a snapshot or an event subscription.

When granting permissions, you can decide who gets the permissions and which Amazon Redshift resources they get permissions for. You can also decide on the specific actions that you want to allow on those resources. Policies attached to an IAM identity are referred to as identity-based IAM policies, and policies attached to a resource are referred to as resource-based policies. Amazon Redshift supports only identity-based IAM policies.

Use case: Setting limited access for a user

Consider the following use case. Suppose that an IAM user who is a member of a customer’s operations group needs to monitor and terminate queries running in an Amazon Redshift cluster. It’s best if they do so through the Amazon Redshift console. This user is not allowed to modify or delete any other Amazon Redshift resources.

To implement this use case, we need to implement a custom IAM policy that ensures this IAM user has read-only access to the Amazon Redshift console. Doing this means that the user can get descriptions of the available clusters and navigate to the Queries tab. Additionally, we want the IAM user to be able to cancel a running query through the Amazon Redshift console. To allow this, we use the redshift:CancelQuerySession IAM action. For descriptions of other allowed Amazon Redshift actions in an IAM policy and what each action means, see Actions Defined by Amazon Redshift in the Amazon Redshift documentation.

To create such a custom IAM policy, follow these instructions:

  1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
  2. In the navigation pane on the left, choose Policies.
  3. Choose Create policy.
  4. Choose the JSON tab and input the following policy:
    "Version": "2012-10-17",
    "Statement": [
            "Action": [
            "Effect": "Allow",
            "Resource": "*"

  1. On the Review policy page, type a value for Name and optionally for Description for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.
  2. Attach this policy to an existing or a new IAM user.

With this permission policy, an IAM user can select an Amazon Redshift cluster, list all running queries in the Queries tab, and terminate a query if needed. All the permissions are read-only. Thus, the user can’t create a new Amazon Redshift cluster or modify or delete an existing cluster. However, the user can view available clusters, cluster snapshots, parameter groups, and cluster subnet groups, and view other properties of existing clusters.

Validating the use case

With the above IAM policy in place, after the IAM user logs into the Amazon Redshift Management Console, the user can select and view details about the Amazon Redshift cluster or clusters in the account. After navigating to the Queries tab, the user can see both the running and completed queries.

To cancel or terminate a long running query, the user can select the query from the list and choose Terminate Query. However, this user can’t modify or delete anything else in the Amazon Redshift console. As an example, if the user tries to modify an Amazon Redshift cluster (to change its endpoint), that user encounters the following error.


In this post, we have walked through a detailed customer use case of providing fine-grained access to the Amazon Redshift console. Using a set of carefully tailored IAM policies, a customer’s operations personnel can have read-only access to the Amazon Redshift console. These personnel can cancel or terminate running queries without the ability to modify, add, or delete any other Amazon Redshift resources.

We want to acknowledge our fellow AWS co-workers Ryan Mich, Sulay Shah and Hunter Grider for their many useful comments and suggestions.

If you have any questions or suggestions, leave your feedback in the comment section. If you need any further assistance to optimize your Amazon Redshift implementation, contact your AWS account team or a trusted AWS partner.


About the authors

Raj Jayaraman is a cloud support engineer with AWS Support at Amazon Web Services.





Po Hong, Ph.D. is a senior data architect within the Global Data & Analytics Specialty Practice at AWS Professional Services.





Build a modern analytics stack optimized for sharing and collaborating with Mode and Amazon Redshift

Post Syndicated from Benn Stancil original https://aws.amazon.com/blogs/big-data/build-a-modern-analytics-stack-optimized-for-sharing-and-collaborating-with-mode-and-amazon-redshift/

Leading technology companies, such as Netflix and Airbnb, are building on AWS to solve problems on the edge of the data ecosystem. While these companies show us what data and analytics make possible, the complexity and scale of their problems aren’t typical. Most of our challenges aren’t figuring out how to process billions of records to provide real-time recommendations to millions of customers. Instead, we struggle with wrangling the data that we have, finding and sharing insights from that data, and then acting on them. This leads even the most savvy data teams that have adopted tools, such as Jupyter notebooks, to either be bottlenecked by slow data pipelines, or resort to a manual process to democratize insights for their stakeholders.

This blog post walks you through the following:

  1. The stages of BI modernization that illustrate what problems arise when organizations adopt modern BI tools without truly modernizing the analytics platform.
  2. What a cloud-first data science platform, with dashboarding and notebooking solutions integrated together with efficient data pipelines, could look like.
  3. How to replicate such a platform in a few steps. This combines a collaborative analytics solution, such as Mode (an Amazon Redshift partner), with a strong analytics foundation built with AWS data warehousing, ETL, and data exploration services.

The stages of BI modernization

There are two primary stages in the evolution of BI – the use of Microsoft Excel for creating reports and the use of dashboarding tools for sharing insights in a consumable format.

Stage 1: The Excel workflow

At AWS and Mode, we talk to thousands of companies who are looking to get more value out of their data. When we ask them how they use data today, the most common answer we get is shown in the following example:

Most companies recognize this workflow as broken. The data pipelines are impossible to manage, and the analysis requires manual effort to reproduce. And in the end, we don’t know if “budget_analysis_v3_final_revised_FINAL.xls” is indeed final.

Stage 2: The dash to the dashboards

When looking for a more effective solution, companies often turn to BI products like Tableau, Amazon QuickSight, PowerBI, and Looker. These products, which were either born in the cloud or are heavily invested in it now, make it efficient to create and share reports and dashboards. KPIs can be delivered through up-to-date URLs rather than emailed files. This helps ensure that everyone has the same view of what’s happening across the business. The BI process is shown in the following example:

While modern BI is a significant step forward, it’s an incomplete solution. Dashboards reveal what’s happening, but businesses that want to use that data for action must understand why things are happening. Before a company can respond to falling sales in one region of the country, for example, it must understand what’s driving the drop. Because dashboards can’t be easily modified, extended, or reused for further analysis, they are often the wrong tool for analysts and data scientists who are charged with answering open-ended exploratory questions. As a result, data infrastructures remain fragmented, and analytics and data science workflows are still built on manual processes.

A cloud-first data science platform

The ideal technology stack for modern data science teams unifies these two stages described in the previous section. Dashboards should serve as the start for exploratory questions for analysts, analysts’ work should be as accessible as company dashboards, and the platform should facilitate a close collaboration between data scientists and business stakeholders.

Pioneering data teams at leading tech companies have developed internal solutions to do exactly this. Uber built a data science workbench for data exploration, data preparation, adhoc analyses, model exploration, workflow scheduling, dashboarding, and collaboration. Netflix recently unveiled the Netflix Data Platform, which automates the execution and distribution of Jupyter notebooks. Instacart built Blazer for exploring and sharing data.

All of these platforms have three things in common:

  • They combine visualization tools and interactive analysis tools, such as R and Python notebooks, and a collaboration platform.
  • They are powered by a modern data warehouse that can scale to accommodate any size of data and any number of analysts.
  • They have reliable ETL pipelines that provide analysts and data scientists access to the data they need, when they need it.

Building a cloud-first data science platform

Fortunately, AWS and its partners offer solutions that check all these boxes and provide the same power to data science teams that aren’t able to build it themselves. Data warehousing services like Amazon Redshift and Athena are fast, scalable, and accessible to anyone who can write standard SQL. ETL partners like Fivetran, Segment, and Matillion provide reliable, push-button ETL services from hundreds of applications into Amazon Redshift and Amazon S3. Finally, a cloud-based analytics platform such as Mode combines visualizations tools, fully hosted R and Python notebooks, and a distribution platform.

This modern stack, which is as powerful as the tooling inside Netflix or Airbnb, provides fully automated BI and data science tooling. It can be deployed in a matter of days and at a fraction of the cost of legacy data science tools.

Three steps to building the platform

Implement this data science infrastructure by using the following three steps:

  1. Set up a data warehouse.
  2. Populate your warehouse with data from around your company.
  3. Add a data science solution on top of your warehouse.

These steps do not require a large investment into engineering teams and custom-built software.

There are many ways to customize this stack to fit your company’s needs. However, this section shows how to set up using Amazon Redshift for a warehouse, Fivetran for ETL, and Mode for data science.

Step 1: Setting up Amazon Redshift

For information about setting up an Amazon Redshift warehouse, see Getting Started with Amazon Redshift. While you need an AWS account to set it up, the process requires no code and only takes a few minutes.

Most configuration options, including the size of the cluster, can be adjusted after the initial setup. Therefore, it’s not necessary to get everything exact at first. If a different configuration is more appropriate later, you can go back and change most of the Amazon Redshift settings.

Step 2: Populating Amazon Redshift with data

Your warehouse is only as good as the data in it. Fortunately, a number of ETL tools make it more efficient to continuously stream data from around your business and the applications you use. Application databases, third party apps like Salesforce and Zendesk, even CSV files – all of these can be easily fed into Amazon Redshift without any engineering effort.

Fivetran, an Amazon Redshift partner, is one such ETL tool (it’s a tool that we’re happy with at Mode). To connect Fivetran to your Amazon Redshift database, first configure your database to allow Fivetran to connect. Fivetran supports a variety of options for connecting, including connecting directly or by using an SSH tunnel. For more information about the steps, see the connection options.

As a final step, create an Amazon Redshift user for Fivetran. We recommend that you use another user than the master user. To create this user, log into the Amazon Redshift query editor (or a SQL client of your choice) and run the following commands:

CREATE USER fivetran PASSWORD <password>;
GRANT CREATE ON DATABASE <database> TO fivetran;

After Amazon Redshift is configured:

  1. Create a new Fivetran account.
  2. Select I already have an existing warehouse, then choose Redshift.
  3. Fill out the form with your Amazon Redshift credentials, as shown in the following example, then choose Save.

  1. After Fivetran is connected to Amazon Redshift, connect it with the data sources that you want to pull into Amazon Redshift. This process is now more efficient.
  2. In Fivetran, choose Connectors.
  3. Choose Add connector, then choose the data source that you want to integrate. Though the specifics vary by source, most of them follow the same pattern.
  4. Choose a schema in Amazon Redshift that you want to write your data to, and then follow the authorization flow that Fivetran automatically steps you through.

The following are examples of connection flows:

Connection flow for Salesforce


Connection flow for Google Analytics


By using similar flows, you can also connect other databases, such as Amazon RDS Postgres or the MySQL database, and directly upload CSVs.

When these connections are set up, data automatically syncs between your data sources and Amazon Redshift. If you want more control, Fivetran lets you choose which data to sync, and how often it’s updated.

Can’t find the data source you’re looking for? Other ETL tools, including Stitch Data, Segment, and ETLeap, provide similar services that are just as easy to set up. We recommend this guide when making a decision about which tool is right for you.

Step 3: Connecting Amazon Redshift to Mode

Finally, by connecting Mode to your Amazon Redshift, you can provide your entire company access to your data in a collaborative analytics environment.

To connect Mode, configure your security groups so that Mode can access Amazon Redshift. If you’re connecting Mode directly to your cluster, follow the security groups documentation linked above to grant access to the following IP addresses:

Mode also offers alternative ways of connecting if you’re unable to modify your firewall.

After you’ve completed these steps, you need only enter your credentials on Mode’s data source connection page, as shown in the following example:

After the connection is made, choose who in your organization can access that connection. Then you can immediately query your data and build the analysis for your team from the Mode Editor, as shown in the following example:

In addition to a SQL environment and visualization builder, Mode also offers integrated Python and R notebooks. Choose New Notebook in the left navigation bar to start a new Python or R instance that’s automatically populated with your query results as DataFrames. This enables data scientists to seamlessly create and share analysis directly with everyone around the company. Ultimately, this approach lets you build the most flexible platform for your analytical needs. Your business analysts and data scientists can now work in the same environment. They can collaborate seamlessly, and access the same data at all times.


This new architecture lets organizations to do more with their data, faster. Data teams that use Python and R can go beyond sharing static dashboards and reports; instead, they can also use popular forecasting and machine learning libraries like Prophet and TensorFlow. These libraries help teams find insights that they couldn’t have found otherwise. This lets teams deliver regular updates that keep everyone informed, and also answer strategic and high-value questions that drive key decisions. Moreover, Mode makes these analyses accessible to everyone around the business. Because the notebooks are fully managed, data scientists can share their work directly with stakeholders without any extra work from IT departments.

By combining Mode with Amazon Redshift, data teams also remove common bottlenecks in data integration, cleansing, or ETL processes that loads data into Amazon Redshift. With Amazon Redshift Spectrum, they can query data directly in their Amazon S3 data lake from a Mode dashboard or notebook. Moreover, they can combine these queries with data already loaded into the data warehouse.

Try it yourself

We’ve built an experience for you to get a feel for this stack. If you think it could work for your case, you can get started using Mode with Amazon Redshift in a matter of minutes. If you’re not already using Amazon Redshift, you can get started with a 2-month free trial and deploy the solution, as suggested. With Mode connected to Amazon Redshift, you can start exploring your data right away or try using one of the publicly available datasets.


About the Authors

Benn Stancil is a co­founder and Chief Analyst at Mode, a company building collaborative tools for data scientists and analysts. Benn is responsible for overseeing Mode’s internal analytics efforts, and is also an active contributor to the data science community. In addition, Benn provides strategic guidance to Mode’s product direction as a member of the product leadership team.




Ayush Jain is a Product Marketer at Amazon Web Services. He loves growing cloud services and helping customers get more value from the cloud deployments. He has several years of experience in Software Development, Product Management and Product Marketing in developer and data services.




Himanshu Raja is a Senior Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.

How to rotate Amazon DocumentDB and Amazon Redshift credentials in AWS Secrets Manager

Post Syndicated from Apurv Awasthi original https://aws.amazon.com/blogs/security/how-to-rotate-amazon-documentdb-and-amazon-redshift-credentials-in-aws-secrets-manager/

Using temporary credentials is an AWS Identity and Access Management (IAM) best practice. Even Dilbert is learning to set up temporary credentials. Today, AWS Secrets Manager made it easier to follow this best practice by launching support for rotating credentials for Amazon DocumentDB and Amazon Redshift automatically. Now, with a few clicks, you can configure Secrets Manager to rotate these credentials automatically, turning a typical, long-term credential into a temporary credential.

In this post, I summarize the key features of AWS Secrets Manager. Then, I show you how to store a database credential for an Amazon DocumentDB cluster and how your applications can access this secret. Finally, I show you how to configure AWS Secrets Manager to rotate this secret automatically.

Key features of Secrets Manager

These features include the ability to:

  • Rotate secrets safely. You can configure Secrets Manager to rotate secrets automatically without disrupting your applications, turning long-term secrets into temporary secrets. Secrets Manager natively supports rotating secrets for all Amazon database services—Amazon RDS, Amazon DocumentDB, and Amazon Redshift—that require a user name and password. You can extend Secrets Manager to meet your custom rotation requirements by creating an AWS Lambda function to rotate other types of secrets.
  • Manage access with fine-grained policies. You can store all your secrets centrally and control access to these securely using fine-grained AWS Identity and Access Management (IAM) policies and resource-based policies. You can also tag secrets to help you discover, organize, and control access to secrets used throughout your organization.
  • Audit and monitor secrets centrally. Secrets Manager integrates with AWS logging and monitoring services to enable you to meet your security and compliance requirements. For example, you can audit AWS AWS CloudTrail logs to see when Secrets Manager rotated a secret or configure AWS CloudWatch Events to alert you when an administrator deletes a secret.
  • Pay as you go. Pay for the secrets you store in Secrets Manager and for the use of these secrets; there are no long-term contracts or licensing fees.
  • Compliance. You can use AWS Secrets Manager to manage secrets for workloads that are subject to U.S. Health Insurance Portability and Accountability Act (HIPAA), Payment Card Industry Data Security Standard (PCI-DSS), and ISO/IEC 27001, ISO/IEC 27017, ISO/IEC 27018, or ISO 9001.

Phase 1: Store a secret in Secrets Manager

Now that you’re familiar with the key features, I’ll show you how to store the credential for a DocumentDB cluster. To demonstrate how to retrieve and use the secret, I use a Python application running on Amazon EC2 that requires this database credential to access the DocumentDB cluster. Finally, I show how to configure Secrets Manager to rotate this database credential automatically.

  1. In the Secrets Manager console, select Store a new secret.
    Figure 1: Select "Store a new secret"

    Figure 1: Select “Store a new secret”

  2. Next, select Credentials for DocumentDB database. For this example, I store the credentials for the database masteruser. I start by securing the masteruser because it’s the most powerful database credential and has full access over the database.
    Figure 2: Select "Credentials for DocumentDB database"

    Figure 2: Select “Credentials for DocumentDB database”

    Note: To follow along, you need the AWSSecretsManagerReadWriteAccess managed policy because this policy grants permissions to store secrets in Secrets Manager. Read the AWS Secrets Manager Documentation for more information about the minimum IAM permissions required to store a secret.

  3. By default, Secrets Manager creates a unique encryption key for each AWS region and AWS account where you use Secrets Manager. I chose to encrypt this secret with the default encryption key.
    Figure 3: Select the default or your CMK

    Figure 3: Select the default or your CMK

  4. Next, view the list of DocumentDB clusters in my account and select the database this credential accesses. For this example, I select the DB instance documentdb-instance, and then select Next.
    Figure 4: Select the instance you created

    Figure 4: Select the instance you created

  5. In this step, specify values for Secret Name and Description. Based on where you will use this secret, give it a hierarchical name, such as Applications/MyApp/Documentdb-instancee, and then select Next.
    Figure 5: Provide a name and description

    Figure 5: Provide a name and description

  6. For the next step, I chose to keep the Disable automatic rotation default setting because in my example my application that uses the secret is running on Amazon EC2. I’ll enable rotation after I’ve updated my application (see Phase 2 below) to use Secrets Manager APIs to retrieve secrets. Select Next.
    Figure 6: Choose to either enable or disable automatic rotation

    Figure 6: Choose to either enable or disable automatic rotation

    Note:If you’re storing a secret that you’re not using in your application, select Enable automatic rotation. See AWS Secrets Manager getting started guide on rotation for details.

  7. Review the information on the next screen and, if everything looks correct, select Store. You’ve now successfully stored a secret in Secrets Manager.
  8. Next, select See sample code in Python.
    Figure 7: Select the "See sample code" button

    Figure 7: Select the “See sample code” button

  9. Finally, take note of the code samples provided. You will use this code to update your application to retrieve the secret using Secrets Manager APIs.
    Figure 8: Copy the code sample for use in your application

    Figure 8: Copy the code sample for use in your application

Phase 2: Update an application to retrieve a secret from Secrets Manager

Now that you’ve stored the secret in Secrets Manager, you can update your application to retrieve the database credential from Secrets Manager instead of hard-coding this information in a configuration file or source code. For this example, I show how to configure a Python application to retrieve this secret from Secrets Manager.

  1. I connect to my Amazon EC2 instance via Secure Shell (SSH).
        import DocumentDB
        import config
        def no_secrets_manager_sample()
        # Get the user name, password, and database connection information from a config file.
        database = config.database
        user_name = config.user_name
        password = config.password                

  2. Previously, I configured my application to retrieve the database user name and password from the configuration file. Below is the source code for my application.
        # Use the user name, password, and database connection information to connect to the database
        db = Database.connect(database.endpoint, user_name, password, database.db_name, database.port) 

  3. I use the sample code from Phase 1 above and update my application to retrieve the user name and password from Secrets Manager. This code sets up the client, then retrieves and decrypts the secret Applications/MyApp/Documentdb-instance. I’ve added comments to the code to make the code easier to understand.
        # Use this code snippet in your app.
        # If you need more information about configurations or implementing the sample code, visit the AWS docs:   
        # https://aws.amazon.com/developers/getting-started/python/
        import boto3
        import base64
        from botocore.exceptions import ClientError
        def get_secret():
            secret_name = "Applications/MyApp/Documentdb-instance"
            region_name = "us-west-2"
            # Create a Secrets Manager client
            session = boto3.session.Session()
            client = session.client(
            # In this sample we only handle the specific exceptions for the 'GetSecretValue' API.
            # See https://docs.aws.amazon.com/secretsmanager/latest/apireference/API_GetSecretValue.html
            # We rethrow the exception by default.
                get_secret_value_response = client.get_secret_value(
            except ClientError as e:
                if e.response['Error']['Code'] == 'DecryptionFailureException':
                    # Secrets Manager can't decrypt the protected secret text using the provided KMS key.
                    # Deal with the exception here, and/or rethrow at your discretion.
                    raise e
                elif e.response['Error']['Code'] == 'InternalServiceErrorException':
                    # An error occurred on the server side.
                    # Deal with the exception here, and/or rethrow at your discretion.
                    raise e
                elif e.response['Error']['Code'] == 'InvalidParameterException':
                    # You provided an invalid value for a parameter.
                    # Deal with the exception here, and/or rethrow at your discretion.
                    raise e
                elif e.response['Error']['Code'] == 'InvalidRequestException':
                    # You provided a parameter value that is not valid for the current state of the resource.
                    # Deal with the exception here, and/or rethrow at your discretion.
                    raise e
                elif e.response['Error']['Code'] == 'ResourceNotFoundException':
                    # We can't find the resource that you asked for.
                    # Deal with the exception here, and/or rethrow at your discretion.
                    raise e
                # Decrypts secret using the associated KMS CMK.
                # Depending on whether the secret is a string or binary, one of these fields will be populated.
                if 'SecretString' in get_secret_value_response:
                    secret = get_secret_value_response['SecretString']
                    decoded_binary_secret = base64.b64decode(get_secret_value_response['SecretBinary'])
            # Your code goes here.                          

  4. Applications require permissions to access Secrets Manager. My application runs on Amazon EC2 and uses an IAM role to obtain access to AWS services. I will attach the following policy to my IAM role. This policy uses the GetSecretValue action to grant my application permissions to read a secret from Secrets Manager. This policy also uses the resource element to limit my application to read only the Applications/MyApp/Documentdb-instance secret from Secrets Manager. You can visit the AWS Secrets Manager documentation to understand the minimum IAM permissions required to retrieve a secret.
        "Version": "2012-10-17",
        "Statement": {
        "Sid": "RetrieveDbCredentialFromSecretsManager",
        "Effect": "Allow",
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "arn:aws:secretsmanager:::secret:Applications/MyApp/Documentdb-instance"

Phase 3: Enable rotation for your secret

Rotating secrets regularly is a security best practice. Secrets Manager makes it easier to follow this security best practice by offering built-in integrations and supporting extensibility with Lambda. When you enable rotation, Secrets Manager creates a Lambda function and attaches an IAM role to this function to execute rotations on a schedule you define.

Note: Configuring rotation is a privileged action that requires several IAM permissions, and you should only grant this access to trusted individuals. To grant these permissions, you can use the AWS IAMFullAccess managed policy.

Now, I show you how to configure Secrets Manager to rotate the secret
Applications/MyApp/Documentdb-instance automatically.

  1. From the Secrets Manager console, I go to the list of secrets and choose the secret I created in phase 1, Applications/MyApp/Documentdb-instance.
    Figure 9: Choose the secret from Phase 1

    Figure 9: Choose the secret from Phase 1

  2. Scroll to Rotation configuration, and then select Edit rotation.
    Figure 10: Select the Edit rotation configuration

    Figure 10: Select the Edit rotation configuration

  3. To enable rotation, select Enable automatic rotation, and then choose how frequently Secrets Manager rotates this secret. For this example, I set the rotation interval to 30 days. Then, choose create a new Lambda function to perform rotation and give the function an easy to remember name. For this example, I choose the name RotationFunctionforDocumentDB.
    Figure 11: Chose to enable automatic rotation, select a rotation interval, create a new Lambda function, and give it a name

    Figure 11: Chose to enable automatic rotation, select a rotation interval, create a new Lambda function, and give it a name

  4. Next, Secrets Manager requires permissions to rotate this secret on your behalf. Because I’m storing the masteruser database credential, Secrets Manager can use this credential to perform rotations. Therefore, I select Use this secret, and then select Save.
    Figure12: Select credentials for Secret Manager to use

    Figure12: Select credentials for Secret Manager to use

  5. The banner on the next screen confirms that I successfully configured rotation and the first rotation is in progress, which enables you to verify that rotation is functioning as expected. Secrets Manager will rotate this credential automatically every 30 days.
    Figure 13: The banner at the top of the screen will show the status of the rotation

    Figure 13: The banner at the top of the screen will show the status of the rotation


I explained the key benefits of AWS Secrets Manager and showed how you can use temporary credentials to access your Amazon DocumentDB clusters and Amazon Redshift instances securely. You can follow similar steps to rotate credentials for Amazon Redshift.

Secrets Manager helps you protect access to your applications, services, and IT resources without the upfront investment and on-going maintenance costs of operating your own secrets management infrastructure. To get started, visit the Secrets Manager console. To learn more, read the Secrets Manager documentation. If you have comments about this post, submit them in the Comments section below. If you have questions about anything in this post, start a new thread on the Secrets Manager forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Apurv Awasthi

Apurv is the product manager for credentials management services at AWS, including AWS Secrets Manager and IAM Roles. He enjoys the “Day 1” culture at Amazon because it aligns with his experience building startups in the sports and recruiting industries. Outside of work, Apurv enjoys hiking. He holds an MBA from UCLA and an MS in computer science from University of Kentucky.

How to enable cross-account Amazon Redshift COPY and Redshift Spectrum query for AWS KMS–encrypted data in Amazon S3

Post Syndicated from Asim Kumar Sasmal original https://aws.amazon.com/blogs/big-data/how-to-enable-cross-account-amazon-redshift-copy-and-redshift-spectrum-query-for-aws-kms-encrypted-data-in-amazon-s3/

This post shows a step-by-step walkthrough of how to set up a cross-account Amazon Redshift COPY and Spectrum query using a sample dataset in Amazon S3. The sample dataset is encrypted at rest using AWS KMS-managed keys (SSE-KMS).

About AWS Key Management Service (AWS KMS)

With AWS Key Management Service (AWS KMS), you can have centralized control over the encryption keys used to protect your data at rest. You can create, import, rotate, disable, delete, define usage policies, and audit the use of encryption keys used to encrypt your data. AWS KMS uses FIPS 140-2 validated cryptographic modules to protect the confidentiality and integrity of your master keys.

AWS KMS is seamlessly integrated with most AWS services. This integration means that you can easily use customer master keys (CMKs) to control the encryption of the data you store within these services. When deciding to encrypt data in a service such as Amazon Redshift, you can choose to use an AWS-managed CMK that Amazon Redshift automatically creates in KMS. You can track the usage of the key, but it’s managed by the service on your behalf. In some cases, you might need direct control over the lifecycle of a CMK or want to allow other accounts to use it. In these cases, you can create and manage your own CMK that AWS services such as Amazon Redshift can use on your behalf. These customer-managed CMKs enable you to have full control over the access permissions that determine who can use the key and under which conditions. AWS KMS is integrated with AWS CloudTrail, a service that provides a record of actions performed by a user, role, or AWS service in AWS KMS.

About Amazon Redshift and Redshift Spectrum

Amazon Redshift is a petabyte scale, fully managed data warehouse service on AWS. It uses a distributed, massively parallel processing (MPP), shared-nothing architecture that scales horizontally to meet usage requirements.

Amazon Redshift Spectrum is a feature of Amazon Redshift that extends the analytic power of Amazon Redshift beyond the data that is stored on local disks in the data warehouse. In other words, Amazon Redshift Spectrum enables you to use the same ANSI SQL syntax of Amazon Redshift on the data that is stored in an Amazon S3 data lake. You do so using external tables, without having to ingest the data into Amazon Redshift first. A common pattern is to run queries that span both the frequently accessed “hot” data stored locally in Amazon Redshift and the “warm/cold” data stored cost-effectively in Amazon S3. That pattern separates compute and storage by enabling independent scaling of both to match the use case. This means you don’t have to pay for unused compute capacity just to add more storage. More importantly, this approach enables seamless interoperability between your data lake and Amazon Redshift.

The Amazon Redshift COPY command supports the following types of Amazon S3 encryption:

  • Server-side encryption with Amazon S3-managed keys (SSE-S3)
  • Server-side encryption with AWS KMS-managed keys (SSE-KMS)
  • Client-side encryption using a client-side symmetric master key

The Amazon Redshift COPY command doesn’t support the following types of Amazon S3 encryption:

  • Server-side encryption with customer-provided keys (SSE-C)
  • Client-side encryption using an AWS KMS–managed customer master key
  • Client-side encryption using a customer-provided asymmetric master key

About the use case

A multiple-account AWS environment is a common pattern across our customers for a variety of reasons. One of the common reasons for data lake customers in AWS is to separate ownership of data assets from different business units in the company. At the same time, business units might need to grant access to some of their data assets to each other for new business insights.

As illustrated in the following drawing, in our example Account A owns an S3 bucket with SSE-KMS encrypted data and Account B owns an Amazon Redshift cluster with Redshift Spectrum enabled. Account B needs access to the same data to load to the Amazon Redshift cluster using the COPY command and also to query using Redshift Spectrum.

Solution walkthrough

Following, we walk through a couple different options to support this use case.


The solution assumes that you already have the following set up:

    1. Access to two AWS accounts (we call them Account A and B) in the same AWS Region.*
    2. Grant the AdministratorAccess policy to the AWS accounts (which should be restricted further for production).
    3. Account A has a customer-managed CMK in AWS KMS with the following attributes:
      • Alias as kms_key_account_a
      • Description as Cross Account KMS Key in Account A
      • Administrator as current IAM user using which you signed in to the AWS console and created the KMS key
      • Account B added as External Accounts

      Copy and save the CMK Amazon Resource Name (ARN) to be used shortly

    4. Account A uses the following sample dataset from AWS:
      Customer - s3://awssampledbuswest2/ssbgz/customer0002_part_00.gz

    5. Account A has an S3 bucket called rs-xacct-kms-bucket with bucket encryption option set to AWS KMS using the KMS key kms_key_account_a created earlier.
    6. Use the following AWS CLI command to copy the customer table data from AWS sample dataset SSB – Sample Schema Benchmark, found in the Amazon Redshift documentation.Note: Because bucket names are global across all AWS customers, you need a unique bucket name for your test run. Be sure to replace rs-xacct-kms-bucket with your own bucket name in the following command:
      aws s3 cp s3://awssampledbuswest2/ssbgz/ s3://rs-xacct-kms-bucket/customer/ --recursive --exclude '*' --include 'customer*'

    7. After the copy is complete, check the KMS key ID for the file from S3 console, as shown following.
    8. Account B has an Amazon Redshift cluster:
      • The cluster name is rstest
      • It’s publicly accessible
      • It has an IAM role attached called redshift_role_account_b with the following two managed IAM policies:
        • AmazonS3ReadOnlyAccess
        • AWSGlueConsoleFullAccess

            Note: Be sure to update redshift_role_account_b with your own IAM role.

            You can set up a database session successfully from a client tool, such as SQL Workbench from your laptop.

* This walkthrough uses a publicly available AWS sample dataset from the US-West-2 (Oregon) Region. Hence, we recommend that you use the US-West-2 (Oregon) Region for your test run to reduce cross-region network latency and cost due to data movement.

Step-by-step walkthrough

Depending on which account’s AWS Glue Data Catalog you want to use for Redshift Spectrum, there are two solution options to choose from:

  1. AWS Glue Data Catalog in Account B
  2. AWS Glue Data Catalog in Account A

Option 1: AWS Glue Data Catalog in Account B

Set up permissions

  1. Sign in to Account A’s AWS console. Then, change the AWS Region to us-west-2 (Oregon). Add the following bucket policy for the rs-xacct-kms-bucket bucket so that Account B (which owns the Amazon Redshift cluster – rstest) can access the bucket.

Note: Replace <Account B> with AWS Account ID for Account B and rs-xacct-kms-bucket with your bucket name.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowS3",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Account B>:root"
            "Action": [
            "Resource": [
    1. Sign in to Account B’s AWS console. Then, change the AWS Region to us-west-2 (Oregon). Create IAM policies and roles as described following:

a) Create the following two IAM permission policies: rs_xacct_bucket_policy to give Account B access to the S3 bucket in Account A, and rs_xacct_kms_policy to give Account B access to the CMK in Account A.

Policy name: rs_xacct_kms_policy

Note: Replace <ARN of kms_key_account_a from Account A> with your KMS key ARN from Account A.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowUseOfTheKey",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            "Sid": "AllowAttachmentOfPersistentResources",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": true

Policy name: rs_xacct_bucket_policy

Note: Replace rs-xacct-kms-bucket with your bucket name.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowS3",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [

b) Create a new IAM role called xacct_kms_role_account_b for the Amazon Redshift service with the following IAM policies attached:


Save the Amazon Resource Name (ARN) of the IAM role. You’ll use it soon.

c) Now let’s set up the IAM role chaining for Amazon Redshift between the two IAM roles, redshift_role_account_b and xacct_kms_role_account_b.

To chain roles, you establish a trust relationship between the roles. A role that assumes another role (for example, Role A) must have a permission policy that allows it to assume the next chained role (for example, Role B). Similarly, the role that passes permissions (Role B) must have a trust policy that allows it to pass its permissions to the previous chained role (Role A).

The first role in the chain must be a role attached to the Amazon Redshift cluster. The first role and each subsequent role that assumes the next role in the chain must have a policy that includes a specific statement. This statement has the Allow effect on the sts:AssumeRole action and the ARN of the next role in a Resource element.

In our example, Role A is redshift_role_account_b, which needs the permission policy rs_xacct_assume_role_policy, which  allows it to assume Role B (which is xacct_kms_role_account_b). Both IAM roles are owned by AWS Account B.

d) Let’s create the IAM permission policy rs_xacct_assume_role_policy and attach the policy to the IAM role redshift_role_account_b.

Policy name: rs_xacct_assume_role_policy

Note: Replace <ARN for IAM role xacct_kms_role_account_b from Account B>.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "Stmt1487639602000",
            "Effect": "Allow",
            "Action": [
            "Resource": [
"<ARN for IAM role xacct_kms_role_account_b from Account B>"

e) Change the trust relationship for IAM role xacct_kms_role_account_b by choosing Edit trust relationship and replacing the existing trust policy with the following:

Note: Replace <Account B> with the AWS Account ID for Account B.

  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      "Action": "sts:AssumeRole"
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Account B>:root"
      "Action": "sts:AssumeRole"

f) Create an AWS Glue service IAM role called glue_service_role_account_b with the following policies attached:

• AWSGlueServiceRole (AWS managed policy)
• rs_xacct_bucket_policy (managed policy created earlier)
• rs_xacct_kms_policy (managed policy created earlier)

Note: Be sure to update glue_service_role_account_b with your own IAM role.

Perform the Amazon Redshift COPY

  1. Log in to the Amazon Redshift cluster from your query tool and create the customer table using the DDL following.
CREATE TABLE customer 
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL

2. Now you can run the COPY statement following successfully.

copy customer from 's3://rs-xacct-kms-bucket/customer/' 
iam_role '<IAM role ARN of redshift_role_account_b,IAM role ARN of xacct_kms_role_account_b>'
region 'us-west-2';

Note: Replace the IAM role ARNs from Account B separated by a comma without any spaces around it.

3. Run the following sample query to verify that the data was loaded successfully.

select * from customer limit 10;

Set up an AWS Glue Data Catalog table for Redshift Spectrum to query

Let’s now create an AWS Glue crawler in Account B to crawl the same customer data and create a table called customer in the AWS Glue Data Catalog database spectrumdb_account_b following these steps:

  1. Navigate to Databases on the AWS Glue console and choose Add database to create an AWS Glue Data Catalog database called spectrumdb_account_b, as shown following.

  1. Navigate to Crawlers on the AWS Glue console and choose Add crawler, as shown following.

  1. Create a crawler customerxacct, as shown following.

Note: The Crawler job name (customerxacct in this case) is not same as the table name created by the crawler (a common confusion). The table name is picked up automatically from the prefix and folder name from your S3 bucket and folder structure. You also have an option to attach a table name prefix if you want to.              

  1. Choose Next to enter Data store details of the customer table, as following.

  1. Choose Next to get to the Add another data store We leave the default, No, because we don’t have any other data stores to add.

  1. Choose Next to choose the IAM role created earlier, glue_service_role_account_b, for the crawler to use, as shown following.

  1. Choose Next to go to the Schedule page and choose the schedule that you want this crawler job to run. For this example, we can choose Run on demand.

  1. Choose Next to choose the AWS Glue Data Catalog database spectrumdb_account_b (created earlier by create external schema command) as the crawler output location.

  1. Choose Next to get to the review page.

  1. After reviewing the details, choose Finish to finish creating the crawler.

  1. Now, let’s run the crawler job by selecting the job as following and choosing Run crawler.

  1. Wait and watch for the job to complete. Its status changes from Starting to Stopping to Ready. You can choose the refresh button for the latest status.

  1. If the job fails, the failure is recorded in Amazon CloudWatch logs. To view the logs, choose Logs, shown in the screenshot preceding, which takes you to the CloudWatch logs.
  1. Now, let’s go to the AWS Glue Data Catalog database to make sure that the table exists.

Choose Databases, choose the spectrumdb_account_b database, and then choose View Tables, or choose the hyperlink of the database name. You should see the customer table, as shown following.

  1. Choose the customer hyperlink to get to the external table, details following.

Because the data file didn’t have a header record, the AWS Glue crawler has assigned a default column naming convention as shown preceding. For the customer table, this naming is column 0 to column 7

  1. Choose Edit Schema and assign appropriate column names, as per the mapping following.

c0 => c_custkey

c1 => c_name

c2 => c_address

c3 => c_city

c4 => c_nation

c5 => c_region

c6 => c_phone

c7 => c_mktsegment

When you are done, choose Save.

Perform the Redshift Spectrum query

Now that the customer table is created in AWS Glue Data Catalog, let’s query the table using Redshift Spectrum.

  1. Log in to the Amazon Redshift cluster from your query tool.
  2. Run the statements following to create an external schema called spectrumxacct for Redshift Spectrum pointing to the AWS Glue Data Catalog database. This database is spectrumdb_account_b in Account B, already created on the AWS Glue console.
    drop schema if exists spectrumxacct;
    create external schema spectrumxacct
    from data catalog 
    database 'spectrumdb_account_b'
    iam_role '<IAM role ARN of redshift_role_account_b,IAM role ARN of xacct_kms_role_account_b>'
    create external database if not exists;

    Note: Replace the IAM role ARNs from Account B separated by a comma without any spaces around it.

  3. Run the following sample query to verify that Redshift Spectrum can query the data successfully.
    select * from spectrumxacct.customer limit 10;

Note: Redshift Spectrum uses the AWS Glue Data Catalog in Account B, not Account A.

Option 2: AWS Glue Data Catalog in Account A


Set up permissions

1. Sign in to the Account A AWS console, then change the AWS Region to us-west-2 (Oregon).

    • a) Create the following IAM policies:

• rs-xacct-bucket-policy to give access to the S3 bucket in Account A
• rs_xacct_kms_policy to give access to the CMK in Account A

Policy name: rs_xacct_bucket_policy

Note: Replace the bucket name rs-xacct-kms-bucket with your bucket name.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowS3",
            "Effect": "Allow",
            "Action": [
            "Resource": [

Policy name: rs_xacct_kms_policy

Note: Replace <ARN of kms_key_account_a from Account A> with your KMS key ARN from Account A.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowUseOfTheKey",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            "Sid": "AllowAttachmentOfPersistentResources",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": true

b) Create a new IAM role called xacct_kms_role_account_a for the Amazon Redshift service with the following IAM policies:

AWSGlueConsoleFullAccess (this managed policy provides the required permissions for the AWS Glue Data Catalog)

Save the IAM role ARN to be used shortly.

c) Change the trust relationship for the IAM role xacct_kms_role_account_a by choosing Edit trust relationship and replacing the existing trust policy with the following:

Note: Replace <Account B> with the AWS account ID for Account B.

  "Version": "2012-10-17",
  "Statement": [
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      "Action": "sts:AssumeRole"
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Account B>:root"
      "Action": "sts:AssumeRole"

d) Create an AWS Glue service IAM role called glue_service_role_account_a with the following policies attached:

AWSGlueServiceRole (AWS managed policy)
rs_xacct_bucket_policy (managed policy created earlier)
rs_xacct_kms_policy (managed policy created earlier)

Note: Be sure to update glue_service_role_account_a with your own IAM role

2. Sign in to Account B’s AWS console and change the AWS Region to us-west-2 (Oregon) if it’s not already selected.

a) Modify the existing IAM policy rs_xacct_assume_role_policy and replace the existing JSON policy with the following:

 Note: Replace <ARN for IAM role xacct_kms_role_account_a from Account A>.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "Stmt1487639602000",
            "Effect": "Allow",
            "Action": [
            "Resource": [
"<ARN for IAM role xacct_kms_role_account_a from Account A>"

Perform the Amazon Redshift COPY

1. Log in to the Amazon Redshift cluster from your query tool and create the customer table using the DDL following.

CREATE TABLE customer 
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL

2. Now you should be able to run the COPY statement following successfully.

copy customer from 's3://rs-xacct-kms-bucket/customer/' 
iam_role '<ARN for IAM role redshift_role_account_b from Account B,<ARN for IAM role xacct_kms_role_account_a from Account A>'
region 'us-west-2';

Note: Replace the IAM role ARNs separated by a comma without any spaces around it.

3. Run the sample query following to validate that the data was loaded successfully.

select * from customer limit 10;

Set up AWS Glue Data Catalog table for Redshift Spectrum to query

Let’s now create an AWS Glue crawler in Account A to crawl the same customer data and create a table called customer in the AWS Glue Data Catalog database spectrumdb_account_a in Account A following these steps:

Follow the same steps as outlined in Option 1 to create and run a crawler with the following changes:

  1. This time, create the crawler in Account A (as opposed to Account B for Option 1).
  2. Create an AWS Glue Data Catalog database spectrumdb_account_a in Account A (as opposed to spectrumdb_account_b in Account B), and choose that database for crawler to create the customer table.
  3. While providing S3 path, choose the option Specified path in my account (unlike Specified path in another account chosen for Option 1).
  4. Make sure to use glue_service_role_account_a created earlier as the AWS Glue service IAM role.=

Perform the Redshift Spectrum query

Now that the customer table is created in the AWS Glue Data Catalog, let’s query the table using Redshift Spectrum.

1. Log in to the Amazon Redshift cluster from your query tool and run the statements following. These create an external schema called spectrumxacct2 for Redshift Spectrum pointing to the AWS Glue Data Catalog database spectrumdb_account_a (created earlier from AWS Glue console) in Account A.

drop schema if exists spectrumxacct2;
create external schema spectrumxacct2
from data catalog 
database 'spectrumdb_account_a' 
iam_role '<ARN for IAM role redshift_role_account_b from Account B,<ARN for IAM role xacct_kms_role_account_a from Account A>'
create external database if not exists;

Note: Replace the IAM role ARNs separated by a comma without any spaces around it.

2. Run the following query, which should run successfully.

select * from spectrumxacct2.customer limit 10;

Note: Spectrum uses the AWS Glue Data Catalog in Account A, not Account B.


This post shows a step-by-step walkthrough of how to set up a cross-account Amazon Redshift COPY and query using Redshift Spectrum for a sample KMS encrypted dataset in Amazon S3. It demonstrates two solution options to choose from depending on which account’s AWS Glue Catalog you want to use for Redshift Spectrum.

If you have questions or suggestions, please leave a comment.


About the Author

Asim Kumar Sasmal is a Sr. Data Architect – IoT in the Global Specialty Practice of AWS Professional Services. He helps AWS customers around the globe to design and build data driven solutions by providing expert technical consulting, best practices guidance, and implementation services on AWS platform. He is passionate about working backwards from customer ask, help them to think big, and dive deep to solve real business problems by leveraging the power of AWS platform.

Bannerconnect uses Amazon Redshift to help clients improve digital marketing results

Post Syndicated from Danny Stommen original https://aws.amazon.com/blogs/big-data/bannerconnect-uses-amazon-redshift-to-help-clients-improve-digital-marketing-results/

Bannerconnect uses programmatic marketing solutions that empower advertisers to win attention and customers by getting their ads seen by the right person at the right time and place. Data-driven insights help large advertisers, trade desks, and agencies boost brand awareness and maximize the results of their digital marketing. Timely analysis of log data is critical to respond to dynamic changes in customer behavior, optimize marketing campaigns quickly, and to gain competitive advantage.

By moving to AWS and Amazon Redshift, our clients can now get near real-time analytics at their fingertips. In this blog post, we describe the challenges that we faced with our legacy, on-premises data warehouse, and discuss the benefits we received by moving to Amazon Redshift. Now, we can ingest data faster, do more sophisticated analytics, and help our clients make faster, data-driven decisions to improve their digital marketing.

Legacy on-premises situation and challenges

Our on-premises, legacy infrastructure consisted of the IBM PureData System as our log level data warehouse. We used a MySQL database for storing all metadata and all analytics data. In this physical, nonvirtualized environment, we needed to plan capacity carefully, far in advance, to handle data growth. We needed a sizeable team to manage upgrades, maintenance, backups, and the day-to-day management of workloads and query performance.

We faced many challenges. We had only 1 gigabyte of bandwidth available to load log-level data into the data warehouse. At peak loads, our extract, transform, load (ETL) server ran completely full, and bandwidth became a bottleneck that delayed when the data was available for analytics. Software and firmware upgrades to the data warehouse needed to be scheduled, and maintenance downtime sometimes took up to eight hours to complete. Our infrastructure was also fragile. We ran everything on one PureData System, and we didn’t have a separate development and test environment. Clients that had direct access to our production environment could submit incorrect SQL queries and pull down the entire data warehouse.

From the log-level data, we created aggregates and stored them in MySQL. Indexes slowed down the loading process significantly. Several aggregations that we wanted to do were simply not possible. Running ad hoc (one-time) queries against 200 gigabytes of uncompressed, row-based data took ages to complete. Many dashboard queries took 10–15 minutes or longer, and were ultimately cancelled. Users were frustrated, so we knew that we had to evolve to a more responsive solution, end-to-end. We chose AWS and Amazon Redshift for our data warehouse.

Moving to Amazon Redshift

Because our legacy software was not designed to run in the cloud, we decided to rebuild our applications using all available AWS components. This saved us the hassle of any migration process, and we could design our applications to use the full potential of AWS.

Our new infrastructure uses Amazon Redshift as our log-level data warehouse. We use a 40-node ds2.xlarge cluster for our production processes. Here, we run log-level queries to aggregate data for the analytics cluster, and run thousands of queries each day to optimize our marketing campaigns.

We set up a separate 30-node ds2.xlarge Amazon Redshift cluster for client access. We replicate the log level data to this cluster, and allow our clients to run queries here without jeopardizing our production processes. Our clients perform data science queries against the data in this cluster.

We also created a 24-node dc2.large cluster for high-performance queries that would not be affected by large, complex queries running on our other clusters. We use this cluster for ad hoc analytics on aggregated data, made available through our API.

We use Amazon S3 as our main data store, giving us infinite storage. Amazon EC2 hosts our ETL processes, API, and several other applications.

Bannerconnect architecture. Amazon S3 is not added to the flowchart to make it simpler. You can add S3 to almost every arrow in the chart.

Our biggest gains

Creating our next-generation solution on AWS and Amazon Redshift provides many benefits for us. We simply follow the best practices provided in Amazon Redshift documentation.

  • Managed service: We wanted to focus on what we are good at, developing software, and not managing and maintaining infrastructure. With Amazon Redshift, we no longer have to do software updates, firmware upgrades, or deal with broken hardware. We no longer need to plan capacity months in advance, or deal with integrating new servers into our environment. Amazon Redshift completely automates all of this for us, including scaling with our data growth, so we can focus on data and analytics to better serve our clients.
  • A full, separate development and test environment: We now have an isolated Amazon Redshift cluster (single node), where we can perform, develop, and test without worrying about breaking the production environment. In our on-premises setup, we did have a development database, but it was always on the production infrastructure. Furthermore, we have an exact copy of our entire infrastructure in our test environment (obviously all in small scale and small instance types). This lets us run automated tests on each deployment to verify that all data flows work as expected.
  • Infinite scalability: We can scale instantly to any capacity we need. Amazon S3 gives us infinite storage, and we can scale Amazon Redshift compute capacity in just a few clicks.
  • Separate clusters: Clients now can’t jeopardize our production processes. Clients still write incorrect SQL queries. However, by using query monitoring rules in Amazon Redshift, we can identify these queries and have Amazon Redshift automatically stop them. Bad queries might affect the client cluster momentarily, but they don’t affect our production processes at all.
  • Faster ad hoc analytics: Due to the massive parallel processing, data compression, and columnar-based storage capabilities in Amazon Redshift, we can create aggregates that were not possible in MySQL. In terms of performance, it’s hard to give good numbers to compare. Running a query against a smaller aggregate using an index on MySQL might be faster at times. However, the majority of our queries are significantly faster on Amazon Redshift. For example, our biggest aggregated table contains about 2 billion records and 500 GB of data (compressed). MySQL couldn’t handle this, but Amazon Redshift results are retrieved within seconds. Large, complex queries took a long time on MySQL. Amazon Redshift completes these in tens of seconds or less.

Building the multicluster environment

This section explores an easy option to build a multicluster setup using AWS CloudFormation templates. With the templates, you can launch multiple Amazon Redshift clusters inside a VPC in both private and public subnets in different Availability Zones. The private subnet enables internal applications, such as EC2 instances, to execute the ETL process to interact with the Amazon Redshift cluster to refresh data. You can use the public Amazon Redshift cluster for the external client tools and scripts. Here is the architecture of this setup:

Let’s walk through the configuration. For demonstration purposes, we use just two Amazon Redshift clusters in a private and public subnet, but you can modify these steps to add more parallel clusters. The configuration is a two-step process to first create the network stack and later launch the Amazon Redshift cluster in those stacks. This process creates the following:

  • VPC and associated subnets, security groups, and routes
  • IAM roles to load data from S3 to Amazon Redshift
  • Amazon Redshift cluster or clusters


Step 1 – Create a network stack

  1. Sign in to the AWS Management Console and navigate to CloudFormation, then do the following:
  • Choose the AWS Region to launch the stack in, for example US East (Ohio).
  • Choose Create Stack.
  • Choose Specify an Amazon S3 template URL.
  • Copy and paste this URL into the text box: https://s3.amazonaws.com/salamander-us-east-1/Bannerconnect/networkstack.json
  1. Choose Next and provide the following information:
    • Stack Name: Name the stack anything convenient.
    • CIDR Prefix: Enter a class B CIDR prefix (for example, 168, 10.1, or 172.16).
    • Environment Tag: Name the environment anything convenient to tag the resources.
    • Key Name: The EC2 key pair to allow access to EC2 instances. If you don’t already have one, see Amazon EC2 Key Pairs in the EC2 documentation.
    • Use the default values for all other parameters and choose Create.
  2. The stack takes 10 minutes to launch, after which the network stack is ready.
  3. Review the outputs of the stack when the launch is complete to note the resource names that were created.

Step 2 – Create an Amazon Redshift cluster or clusters

  1. Navigate back to the CloudFormation console and do the following:
  • Choose Create Stack.
  • Choose Specify an Amazon S3 template URL.
  • Copy and paste this URL into the text box: https://s3.amazonaws.com/salamander-us-east-1/Bannerconnect/reshiftstack.json
  1. Choose Next and provide the following information:
    • Stack Name: Name the stack anything convenient.
    • Cluster Type: Choose a multi-node or single-node cluster.
    • Inbound Traffic: Allow inbound traffic to the cluster from this CIDR range.
    • Master Username: The user name that is associated with the master user account for the cluster that is being created. The default is adminuser.
    • Master User Password: The password that is associated with the master user account for the cluster that is being created.
    • Network Stack Name: The name of the active CloudFormation stack that was created in step 1, which contains the networking resources such as the subnet and security group.
    • Node Type: The node type to be provisioned for the Amazon Redshift cluster.
    • Number Of Nodes: The number of compute nodes in the Amazon Redshift cluster:
      • When the cluster type is specified as single-node, this value should be 1.
      • When the cluster type is specified as multi-node, this value should be greater than 1.
    • Port Number: The port number on which the cluster accepts incoming connections. The default is 5439.
    • Public Access: Public access to the Amazon Redshift cluster, either true or false. When this value is true, the cluster is launched in a public subnet. When this value is false, the cluster is launched in a private subnet.
    • Use the default values for all other parameters and choose Create.
  2. The stack take 10 minutes to launch, after which the Amazon Redshift cluster is launched in the network stack.
  3. Review the outputs of the stack when the launch is complete to note the resource names that were created for the Amazon Redshift cluster.
  4. Repeat steps 5–8 to add more Amazon Redshift clusters to this network stack.

With this easy deployment using the AWS CloudFormation template, you can launch all the resources needed for a multicluster setup with a few clicks.


Migrating to Amazon Redshift and setting up the data warehouse on AWS enabled us to build highly scalable decoupled applications and to use different clusters for different use cases. Operationally, we were able to build robust dev, test, and prod systems independently that are easy to manage to implement complex data workflows.

Recently, we started using Amazon Redshift Spectrum to query data directly from Amazon S3, without needing to load the data into Amazon Redshift. This saves us loading time and speeds up time to analytics, creating many new possibilities for us. Loading dynamic data with different formats and columns becomes a lot easier with Amazon Redshift Spectrum.


About the Authors

Danny Stommen has been working for Bannerconnect for 10 years, with his current role being Solutions Architect and most of his time working on the CORE solution. Next to work, he enjoys spending quality time with his family and actively playing soccer.




Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Before working at AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Running Amazon Payments analytics on Amazon Redshift with 750TB of data

Post Syndicated from Bishwabandhu Newton original https://aws.amazon.com/blogs/big-data/running-amazon-payments-analytics-on-amazon-redshift-with-750tb-of-data/

The Amazon Payments Data Engineering team is responsible for data ingestion, transformation, and the computation and storage of data. It makes these services available for more than 300 business customers across the globe. These customers include product managers, marketing managers, program managers, data scientists, business analysts and software development engineers. They use the data for their scheduled and one-time queries to make the right business decisions. This data is also used for building weekly, monthly and quarterly business review metrics, which are reviewed by the leadership team.

We support various consumer payment business teams, including the following:

  • Amazon Payment Products (Credit cards, Shop with Points, Amazon Currency Convertor, International Payment Products)
  • Gift Cards
  • Payment Acceptance Experience
  • Amazon Business Payments.

We also feed into the machine learning recommendation engine. This engine suggests the best payment product to a customer on Amazon’s payment checkout page.

Challenges with old datawarehouse

This section describes our previous challenges with our data warehouse and analytical needs. With payment products launches and their extension to new marketplaces, we had exponential growth in data volume. Later, scaling our extract, transform, and load process (ETL) was met with severe challenges, and resulted in delays and operational burdens. Here are the specific challenges we faced with our data warehouse:

  • Upsert did not scale, so we got updates more than ~10MN per run. The consumer product catalog dataset has more than 6.5BN records listed in the US marketplace, and occasionally the daily updates exceeded 10MN mark. We saw a similar trend for the Order attributes dataset.
  • Data aggregation either took longer or never finished if we had to analyze even for 6 months of payments data. Often, business owners wanted to aggregate the data based on certain attributes. For example, the number of successful transactions and monetary value by certain types of cards.
  • Shared cluster, and thus shared storage and compute caused resource crunch and impacted all its users. Each team was given ~100TB each on the Data warehouse. Each team could bring their table and join with central data warehouse tables. Any bad query on the cluster impacted all other queries on the same cluster. It was difficult to identify the owner of those bad queries.
  • There were more than 30,000 production tables. it became almost impossible to host all of them on the same cluster.
  • Index corruption on a larger table was cumbersome to rebuild and backfill the table.
  • We required a Database Administrator to apply patches and updates.

Using Amazon Redshift as the new payments data warehouse

We started exploring different options which suits our analytical needs, which is fast, reliable and scales well for future data growth. With all of the previously described issues, Central Data warehouse moved towards separating the compute and storage layer and they decided to be responsible for storage. They built a data lake on Amazon S3, which is encrypted to store even the highly confidential critical data. Each consumer team got the guideline to bring their own compute capacity for their analytical needs. Our payments team started looking for the following advantages:

  • Expedient analytics.
  • Integrates with S3 and the other AWS services.
  • Affordable storage and compute rates.
  • Works for ETL processing.

We chose Amazon Redshift because of it has the following features:

  • Bulk uploads are faster. ~700MN data inserts into ~30 minutes.
  • Data upsert is exceptionally fast.
  • Aggregation query on multi-million dataset with fewer columns of data returns in a few seconds as compared to a few minutes.
  • No need for DBA time to be allocated to maintain the database. Data engineers can easily perform backups, re-snapshot to a new cluster, set up alarms in case of cluster issues, and add new nodes.
  • The ability to keep data on S3. This data is accessible from multiple independent Amazon Redshift clusters through Spectrum and also allows users to join Spectrum tables with other tables created locally on Amazon Redshift. It offloads some processing to the Spectrum layer while storing the data on S3.
  • The ability to use Amazon Redshift best practices to design our tables in regards to distribution keys, sort keys, and compression. As a result, the query performance exceeded our SLA expectations.
  • An effective compression factor. This saves more than 40 to 50 percent of space by choosing the right compression. This enables faster query and efficient storage option.

Sources of data and storage

We consume data from different sources, like PostgreSQL, Amazon DynamoDB live streams, Central data warehouse data lake and bank partners’ data through secure protocols. Data from PostgreSQL databases are in relational format, whereas DynamoDB has key value pairs. We translate the key/value data to relational format and store in Amazon Redshift and S3. Most frequently accessed data is kept in Amazon Redshift. Less frequently accessed and larger datasets are stored in S3 and accessed through Amazon Redshift Spectrum.

Central data lake hosts more than 30,000 tables from different teams, such as Orders, Shipments, and Refunds. Whereas, we as a payments team need approximately 200 tables from this data lake as source tables. Later, we built a data mart specific to payment products, which feeds both to scheduled and one-time data and reporting needs. All small and mid-size tables, smaller than 50 TB, are directly loaded in Amazon Redshift from data lake, which physically stores the data. Tables larger than 50 TB are not stored locally on Amazon Redshift. Instead, we pull from the data lake using EMR-Hive, convert the format from tsv to ORC/Parquet and store on S3. We create an Amazon Redshift Spectrum table on top of S3 data. Format conversion lowers the runtime for each analytical aggregation queries, whereas storing on S3 makes sure we do not fill up entire Amazon Redshift cluster with data rather use it for efficient computing.

Data Architecture

Different components

  1. Central data warehouse data lake (Andes) — Almost all the systems in Amazon, wanting to share their data with other teams, publish their data to this datalake. It is an encrypted storage built on top of Amazon S3 which has metadata attached along with datafiles. Every dataset has a onetime dump and then incremental delta files. Teams willing to consume the data by
  • Physically copying the data into their own Amazon Redshift cluster. It is efficient for smaller and mid-size tables which are accessed most frequently.
  • Using Amazon Redshift Spectrum to run analytical queries on datasets stored in data lake. It helps in accessing cold large data, generally larger than 50TB, thereby avoids scaling up your Amazon Redshift cluster just because all the space might be consumed by these larger data files.
  • Using the AWS Glue catalog to update the metadata in your team’s S3 bucket and use Amazon EMR to pull the data, apply transformation, change format and store the final data in S3 bucket, which can further be consumed using Amazon Redshift Spectrum. It is efficient when the dataset is large and need transformations before being consumed.
  1. Amazon Redshift clusters — Amazon Redshift has centric architecture and is best suited for being single place for all source of truth, but we are managing three clusters mostly because of having consistent SLA of our reports, decoupling the user query experience with central data lake ingestion process (which is resource intensive). Here are the cluster specific reasons for why we need these as separate clusters.
  • Staging cluster:
    • Our data sources are dynamic which are in the transition state and moving away from relational to non-relational sources; for example, Oracle to Postgres or to DynamoDB.
    • The mechanism to pull data from central data lake and store into Amazon Redshift, is also evolving and is resource intensive in current state.
    • Our datamart is payment specific, though the table names in our datamart looks similar to central data lakes tables, but our datamart data is different than central data lake datasets. We apply necessary transformation and filters before bringing the data to the user’s Amazon Redshift cluster.
  • User cluster: Our internal business users wanted to create the tables in public schema for their analysis. They also needed direct connect access for any adhoc analysis. Most of the users know SQL and are aware of best practices but there are users who are new to SQL and sometimes their query is not optimized and impact other running queries, we have Amazon Redshift workload manager (WLM) settings to protect our cluster from these bad queries.
  • Prod ETL cluster: We have tight SLA to make dataset available for data users. In order to minimize the impact of bad queries running on the system we have set up replica of user cluster. All prod transforms run here and the output data is copied to both user and prod clusters. It insures the SLA we commit to data business users.
  1. Near real time data ingestion — Many applications like promotional data, card registration, gift card issuance etc need realtime data collection to detect fraud. Application data is stored in Amazon DynamoDB, with DynamoDB Streams enabled. We consume the data from these streams through an AWS Lambda function and Amazon Kinesis Data Firehose. Kinesis Firehose delivers the data to S3 and submits the copy command to load the data into Redshift. We have micro batch of 15 mins which makes sure not all the connections are consumed by these near-real time data applications.
  2. Alternate compute on Amazon EMR — We track customer behavior through website clickstream data. There are few metrics on conversion rate (customer applied for payment products after they saw the banner at different placements like payment checkout page etc), payment product application, website hit waterfall analysis, funnel reports which tracks the journey of a customer starting from seeing these adds, clicking to apply button, filling the application form to finally submitting the application. We get almost 10B website hits records on a given day. Before we generate any report, we look for cheaper storage for these massive and not frequently accessed datasets. We decided to choose S3 as a storage option and applied the transformations using Amazon EMR. With this approach, we made sure we do not fill up the database with massive cold data and at the same time we enable data access on S3 using Amazon Redshift Spectrum, which provided similar query performance. As Amazon Redshift is a columnar database and is exceptionally fast for any sort of aggregation if we choose fewer dimensional columns. We wanted similar performance for the data we stored on S3. We were able to do it using Amazon EMR and by changing the data format from TSV to ORC or Parquet. Every day, we created new partition data on S3 and refreshed the Amazon Redshift Spectrum table definition to include new partition data. These Spectrum table were accessed by business users for their one-time analysis using any Amazon Redshift SQL client or for scheduling any ETL pipeline.
  3. Publish the data to data warehouse datalake for non-payment users — We built payment specific datasets. For example, decline transaction behavior, wallet penetration and others. Sometimes non-payments business users are also interested in consuming these datasets. We publish these datasets to central data warehouse data lake for them. Additionally, payments application teams are the source for payment product application data. Data engineering team consumes these datasets, apply needed transformation and publish it both for payments and non-payments user through Amazon Redshift and the data lake.

Schema management

We have a prod schema which stores all production tables and only platform team has access to make any changes to it. We also provide payment product specific sandboxes which is accessible by product specific member. There is generic public schema for any payments data users. They can create, load, truncate/drop the tables in this schema.

Database and ETL lookup

Here are few fun facts about our Amazon Redshift database objects.

  • Number of Databases: 4
    • Staging Database DB1: ds2.8xlarge x 20 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
    • User Database DB2: ds2.8xlarge x 24 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
    • Platform Database DB3: ds2.8xlarge x 24 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
    • Reporting Database DB4: ds2.8xlarge x 4 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
  • Size of Databases:
    • Total memory: 17 TB
    • Total storage: 1.15 Petabytes
  • Number of tables:
    • Analytics prod db: 6500
    • Analytics staging db: 390

User cluster

Here are some stats around the database that is exposed to the users. It has both core tables and users are allowed to create their own tables based on their need. We have the mirror image of the same database, which hosts all of the tables, except user=created tables. Another database is used to run ETL platform related prod pipeline. Most of these tables have entire history, except the snapshot tables like clickstream datasets, which have been archived to S3.

Staging cluster

Here are some stats around the staging database. It is the landing zone for all the data coming from other teams or from the central data warehouse data lake. Table retention has been applied to all the tables as most of the ELT downstream jobs look for the last updated date, pulls just incremental data and store in user and replica databases.

Scheduled ETL and Query load on database

  • Number of daily extraction ETL jobs: 2943
  • Number of loading ETL jobs: 1655

  • Total daily load processed volume: 119 BN
  • Total daily loading runtime: 11,415 mins

  • Total daily data extraction volume: 166 BN
  • Total daily date extraction runtimes: 25,585 mins

Both scheduled and one-time query loads on the database

  • Daily query load on database by different database users.

Best practices

  1. Design tables with right sort keys and distribution key: Query performance is dependent on how much data it scans and if the joines are co-located join. Choosing right sort key make sure we do not scan the data which we do not need and selecting right distribution key makes sure the joining data is present on the same node and there is less movement of data over network resulting in better query performance. For more information, see Amazon Redshift Best Practices for Designing Tables.
  1. Refer to Amazon Redshift Best Practices for Designing Queries while writing the query.
  1. Change the loading strategy by splitting larger files into smaller files, use bulk loads instead serial inserts. For more information, see Amazon Redshift Best Practices for Loading Data.
  1. Configure the appropriate WLM setting to avoid system abuse by allocating right run-times, memory, priority queues, etc. For more information, see Tutorial: Configuring Workload Management (WLM) Queues to Improve Query Processing.
  1. Use Amazon Redshift advisor to identify potential tables needing compression, tables with missing stats, uncompressed data loads, and further fine tune your ETL pipelines. For more information, see Working Recommendations from Amazon Redshift Advisor.
  1. Identify the tables with most of the wasted space and vacuum them frequently. It releases the wasted space and, at the same time, increases the query performance. For more information, see Vacuuming Tables.
  1. Analyze the SQLs submitted to DB and identify the pattern on table usage and expensive joins. It helps data engineers build more denormalized tables by pre-joining these tables, and helping users access a single table, which is fast and efficient.

Conclusion and Next Steps

Amazon Redshift clusters with total capacity of 1.15 PB, ~6500 tables, 4500 scheduled ETL runs, 13,000 ETL queries a day, is solving almost all the ETL need of business users in payments team. However, the recent volume growth is filling up our dbs more than we expected, Next step could be choosing cheaper storage option by building a datalake on S3 and access them using Amazon Redshift spectrum without even bothering about scaling challenges and with seamless user experience.


About the authors

Bishwabandhu Newton is a senior data engineer with Amazon Consumer Payments team. He has over 12 years of data warehousing experience, with 9+ years at Amazon.com.




Matt Scaer is a Principal Data Warehousing Specialist Solution Architect, with over 20 years of data warehousing experience, with 11+ years at both AWS and Amazon.com.

Scale your Amazon Redshift clusters up and down in minutes to get the performance you need, when you need it

Post Syndicated from Ayush Jain original https://aws.amazon.com/blogs/big-data/scale-your-amazon-redshift-clusters-up-and-down-in-minutes-to-get-the-performance-you-need-when-you-need-it/

Amazon Redshift is the cloud data warehouse of choice for organizations of all sizes—from fast-growing technology companies such as Turo and Yelp to Fortune 500 companies such as 21st Century Fox and Johnson & Johnson. With quickly expanding use cases, data sizes, and analyst populations, these customers have a critical need for scalable data warehouses.

Since we launched Amazon Redshift, our customers have grown with us. Working closely with them, we have learned how their needs evolve as their data scales. We commonly encounter scenarios like the following for data analytics:

  • A US-based retail company runs a large number of scheduled queries and complex BI reports. Their Amazon Redshift usage peaks between 8 AM and 6 PM, when their data scientists and analysts run heavy workloads. At night, they might have an occasional user who queries the data and generates a small report. Consequently, they don’t need the same cluster capacity at night as they do during the day.
  • A healthcare consulting company is rapidly growing their Data as a Service (DaaS) business. They want to quickly create a duplicate environment and provide the cluster endpoint to their clients. After the duplicate cluster is created, it needs to be right-sized quickly based on the cost and performance requirements of the client.
  • An IoT service provider is on a rapid growth trajectory. Whenever a major event takes place, their sensors send in terabytes of new data that needs to be ingested into Amazon Redshift and analyzed soon after it arrives.

When database administrators (DBAs) don’t have the nimbleness to react to these scenarios, the analysts can experience longer response times for mission-critical workloads. Or, they might be locked out altogether if the data warehouse is down for a resize. The DBAs, in turn, cannot support Service Level Agreements (SLAs) that they have set with their business stakeholders.

With Amazon Redshift, you can already scale quickly in three ways. First, you can query data in your Amazon S3 data lakes in place using Amazon Redshift Spectrum, without needing to load it into the cluster. This flexibility lets you analyze growing data volumes without waiting for extract, transform, and load (ETL) jobs or adding more storage capacity. Second, you can resize your Amazon Redshift clusters by adding more nodes or changing node types in just a few hours. During this time, analysts can continue to run read queries with no downtime. This gives you more agility compared to on-premises data warehouses that take days to scale. Third, you can spin up multiple Amazon Redshift clusters by quickly restoring data from a snapshot. This allows you to add compute resources that might be needed to support high concurrency.

Introducing elastic resize

We’re excited to introduce elastic resize, a new feature that enables you to add or remove nodes in an Amazon Redshift cluster in minutes. This further increases your agility to get better performance and more storage for demanding workloads, and to reduce cost during periods of low demand. You can resize manually from the AWS Management Console or programmatically with a simple API call.

With elastic resize, you can start small and scale up on-demand as your needs grow, as illustrated in the following diagram.

Amazon Redshift customers who have been previewing elastic resize before launch have immediately benefited by the scalability that it unlocks for them. Here is what some of our customers had to say about elastic resize:


Amazon Prime Video uses advanced data analytics to customize viewing recommendations and measure fan viewing experiences. “Redshift’s new elastic resize feature reduced our operational resizing time from 6 hours down to 15 minutes, allowing us to dynamically scale our infrastructure according to the diverse nature of our workloads and optimizing costs while maximizing performance.” Sergio Diaz Bautista, Data Engineer at Amazon Prime Video



Yelp uses Amazon Redshift to analyze mobile app usage data and ad data on customer cohorts, auctions, and ad metrics. “Yelp is at the forefront of using data analytics to drive business decisions and enhance its users’ experience. Using elastic resize, we can confidently optimize for the best performance and keep costs low by configuring the cluster to scale up whenever demand increases beyond the usual variability window and scale down during off-peak hours. The ability to scale our data warehouse containing hundreds of terabytes of data, in minutes, is amazing,” says Shahid Chohan, data architect at Yelp.com


“Coupang is disrupting how the world shops on phones. We cannot always predict analytical demand because of evolving business needs and ad hoc analyses that are unexpectedly required. With elastic resize, we can scale compute and storage quickly to finish large ETL jobs faster and serve more users querying the data,” says Hara Ketha, senior manager of data engineering at Coupang.


OLX uses Amazon Redshift to power personalization and relevance, run reporting, and generate actionable customer insights. “With OLX, millions of people across the world buy and sell from each other daily. Redshift has been at the core of advanced analytics and big data innovation at OLX. For our ever increasing data needs, we can now add nodes using elastic resize blazingly fast. It’s also easy to size down during periods of low activity in order to save costs. Thank you Redshift!” said Michał Adamkiewicz, data architect for the Europe data team at OLX.com

How elastic resize works

Elastic resize is fundamentally different from the existing classic resize operation available in Amazon Redshift. Unlike classic resize, which creates a new cluster and transfers data to it, elastic resize modifies the number of nodes in your existing cluster. It enables you to deploy the additional nodes in minutes with minimal disruption to ongoing read or write queries. Hence, you can size up your cluster quickly for faster performance and size down when the job is finished to save cost.

You can still use classic resize when you want to change the node type (for instance, if you are upgrading from DC1 to DC2 nodes). The following stages describe what happens behind the scenes when you trigger elastic resize.

Stage 1: Preparing for resize while the cluster is fully available

At the start of elastic resize, Amazon Redshift first updates the snapshot on Amazon S3 with the most recent data. Today, Amazon Redshift takes an automated snapshot every 8 hours or 5 GB of data change, whichever comes first. In addition, you can also take snapshots manually. Each snapshot is incremental to capture changes that occurred since the last automated or manual snapshot. While the first snapshot is taken, the cluster is fully available for read and write queries.

On the console and via the describe-clusters API operation, the cluster status shows as available, prep-for-resize during this stage. Additionally, an event notification (REDSHIFT-EVENT-3012) lets you know that a request for elastic resize has been received.

Stage 2: Resizing the cluster while the cluster is unavailable

This stage, when the resizing actually happens, takes just a few minutes to finish. As the resize begins, the existing connections are put on hold. No new connections are accepted until the resize finishes, and the cluster is unavailable for querying. Some of the held connections or queries might fail if they were part of ongoing transactions. New nodes are added (for scaling up) or removed (for scaling down) during this period. Another incremental backup is taken to account for any data updates made by users during stage 1.

On the console and using the describe-clusters API operation, the cluster status now shows as resizing. Additionally, an event notification (REDSHIFT-EVENT-3011) lets you know that elastic resize started on the cluster.

Stage 3: Data transferring while the cluster is fully available

After resizing is finished in stage 2, the cluster is fully available for read and write queries. Queries that were being held are queued for execution automatically. During this stage, data is transferred from Amazon S3 to the nodes. Because Amazon Redshift only subtracts or adds nodes, only a fraction of the data needs to be transferred. For example, if you resized a three-node dc2.8xlarge cluster to six nodes, only 50 percent of the data needs to be moved.

We also improved the rate at which we restore data blocks from S3 by 2x, cutting down the time it would have taken to finish stage 3. In addition, this improvement also makes restores from a snapshot faster.

Moreover, Amazon Redshift moves the data intelligently to minimize the impact of data transfer on queries during this stage. The most frequently accessed data blocks are moved first, followed by other blocks based on their access frequency. This results in most incoming queries finding the data blocks that they are trying to access on disk.

On the console and using the describe-clusters API operation, the cluster status now shows as available, transferring data. Additionally, an event notification (REDSHIFT-EVENT-3534) lets you know that elastic resize has finished and the cluster is available for reads and writes.

After the data transfer is finished, the cluster status on the console and via the describe-clusters API operation once again shows as available.

Elastic resize constraints

Before deciding whether elastic resize is appropriate for your use case, consider the following constraints:

  • The new node configuration must have enough storage for existing data. Even when you add nodes, your new configuration might not have enough storage because of the way that data is redistributed.
  • You can resize only by a factor of 2, up or down, for dc2.large or ds2.xlarge node types. For example, you can resize a four-node cluster up to eight nodes or down to two nodes. This limitation exists to avoid data skew between nodes caused by an uneven distribution of slices.
  • For dc2.8xlarge or ds2.8xlarge node types, you can resize up to two times the original node count, or down to one-half the original node count. For example, you can resize a 16-node cluster to any size up to 32 nodes, or any size down to 8 nodes. This limitation exists to avoid data skew between nodes caused by an uneven distribution of slices.


Elastic resize significantly improves your ability to scale your Amazon Redshift clusters on-demand. Together with features such as Amazon Redshift Spectrum, it enables you to independently scale storage and compute so that you can adapt to the evolving analytical needs of your business.

To learn more about elastic resize and understand how you can size your clusters, watch the AWS Online Tech Talk: Best Practices for Scaling Amazon Redshift. You can also refer to the documentation for Resizing Clusters in the Amazon Redshift Cluster Management Guide.

About the Authors

Ayush Jain is a Product Marketer at Amazon Web Services. He loves growing cloud services and helping customers get more value from the cloud deployments. He has several years of experience in Software Development, Product Management and Product Marketing in developer and data services.




Himanshu Raja is a Senior Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.

Create cross-account and cross-region AWS Glue connections

Post Syndicated from Pankaj Malhotra original https://aws.amazon.com/blogs/big-data/create-cross-account-and-cross-region-aws-glue-connections/

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. AWS Glue uses connections to access certain types of source and target data stores, as described in the AWS Glue documentation.

By default, you can use AWS Glue to create connections to data stores in the same AWS account and AWS Region as the one where you have AWS Glue resources. In this blog post, we describe how to access data stores in an account or AWS Region different from the one where you have AWS Glue resources.

AWS Glue connections

AWS Glue uses a connection to crawl and catalog a data store’s metadata in the AWS Glue Data Catalog, as the documentation describes. AWS Glue ETL jobs also use connections to connect to source and target data stores. AWS Glue supports connections to Amazon Redshift, Amazon RDS, and JDBC data stores.

A connection contains the properties needed by AWS Glue to access a data store. These properties might include connection information such as user name and password, data store subnet IDs, and security groups.

If the data store is located inside an Amazon VPC, AWS Glue uses the VPC subnet ID and security group ID connection properties to set up elastic network interfaces in the VPC containing the data store. Doing this enables ETL jobs and crawlers to connect securely to the data store in the VPC.

AWS Glue can create this elastic network interface setup if the VPC containing the data store is in the same account and AWS Region as the AWS Glue resources. The security groups specified in a connection’s properties are applied on each of the network interfaces. The security group rules and network ACLs associated with the subnet control network traffic through the subnet. Correct rules for allowing outbound traffic through the subnet ensure that AWS Glue can establish network connectivity with all subnets in the VPC containing the data store, and therefore access the source or target data store.

VPC components can be interlinked only if they are present in the same AWS Region. Therefore, AWS Glue cannot create the elastic network interfaces inside a VPC in another region. If the VPC containing the data store is in another region, you have to add the network routes and create additional network interfaces which allow network interfaces set up by AWS Glue to establish network connectivity with the data store.

In this blog post, we describe how to configure the networking routes and interfaces to give AWS Glue access to a data store in an AWS Region different from the one with your AWS Glue resources. In our example, we connect AWS Glue, located in Region A, to an Amazon Redshift data warehouse located in Region B.

Note: The examples here assume that the Amazon Redshift cluster is in a different AWS Region, but belongs to the same account. The same setup and instructions are also valid for an Amazon Redshift cluster in a different account.

Setting up VPC components for AWS Glue

AWS Glue requires a VPC with networking routes to the data stores to which it connects. In our solution, the security groups and route tables are configured to enable elastic network interfaces set up by AWS Glue in a private subnet to reach the internet or connect to data stores outside the VPC. The following diagram shows the necessary components and the network traffic flow.

Required components for VPC setup:

  • AWS Glue resources in a private subnet in Region A.
  • A NAT gateway with an Elastic IP address attached to it in a public subnet in Region A.
  • A private route table containing a route allowing outbound network traffic from the private subnet to pass through the NAT gateway.
  • An internet gateway in Region A.
  • A public route table with a route allowing outbound network traffic from the public subnet to pass through the internet gateway.

Note: We must update the default security group of the VPC to include a self-referencing inbound rule and an outbound rule to allow all traffic from all ports. Later in the example, we attach this security group to an AWS Glue connection to let network interfaces set up by AWS Glue communicate with each other within a private subnet.

Network traffic flow through the components:

Outbound network traffic from AWS Glue resources in the private subnet to any destination or data store outside the private subnet is routed through the NAT gateway.

The NAT gateway is present in a public subnet and has an associated Elastic IP address. It forwards network traffic from AWS Glue resources to internet by using an internet gateway.

When AWS Glue tries to establish a connection with a data store outside of the private subnet, the incoming network traffic on the data store side appears to come from the NAT Gateway.

On the data store side, you allow the data store or its security group to accept incoming network traffic from the Elastic IP address attached to the NAT gateway. This is shown in the section “Allow Amazon Redshift to accept network traffic from AWS Glue,” following.

Creating VPC components using AWS CloudFormation

You can automate the creation of a VPC and all the components described preceding using the vpc_setup.yaml CloudFormation template, hosted on GitHub. Follow these step-by-step instructions to create the stack in your AWS account:

  1. Deploy the stack in the US Oregon (us-west-2) Region:

Note: In this example, we create the AWS Glue resources and connection in the us-west-2 Region. You can change this to the AWS Region where you have your AWS Glue connection and resources.

You are directed to the AWS CloudFormation console, with the stack name and URL template fields pre-filled.

  1. Choose Next.
  2. Use the default IP ranges and choose Next.
  3. Skip this step and choose Next.
  4. Review and choose Create.
  5. Wait for stack creation to complete. After completion, all the VPC components and necessary setup required are created.
  6. Navigate to the VPC console and copy the Elastic IP address for the newly created NAT.
    Note: This IP address is used for outbound network flow from AWS Glue resources and so should be whitelisted on the data store side. For more detail, see “Allow Amazon Redshift to accept network traffic from AWS Glue,” following.

Before creating and testing an AWS Glue connection to your data store, you need an IAM role that lets AWS Glue access the VPC components that you just created.

Creating an IAM role to let AWS Glue access Amazon VPC components

For this example, we create a role called TestAWSGlueConnectionIAMRole with a managed IAM policy AWSGlueServiceRole attached to it.

  1. Choose the Roles tab from the AWS Identity and Access Management (IAM) console.
  2. Choose Create role and select AWS Glue as a trusted entity.

  1. Attach an IAM policy to the role that allows AWS Glue to access the VPC components. In this example, we are using the default AWSGlueServiceRole policy, which contains all the required permissions for the setup.

  1. We name the role TestAWSGlueConnectionRole.

Note: The default GlueServiceRole policy that we attached to our custom role TestAWSGlueConnectionIAMRole has permissions for accessing VPC components. If you are using a custom policy instead of the default one, it should also contain the same permissions to be able to access VPC components.

Creating an Amazon Redshift cluster using AWS CloudFormation

For this example, we create a sample Amazon Redshift cluster in a VPC in the US N. Virginia (us-east-1) Region. Follow these step-by-step instructions to create the stack in your AWS account:

  1. Navigate to the CloudFormation console in region us-east-1 and create a new stack using this CloudFormation template, described in the documentation.
  2. Provide the configuration for the cluster and MasterUsername and MasterUserPassword. MasterUserPassword must follow the following constraints:
  • It must be 8–64 characters in length.
  • It must contain at least one uppercase letter, one lowercase letter, and one number.
  • It can use any printable ASCII characters (ASCII code 33–126) except ‘ (single quote), ” (double quote), :, \, /, @, or space.

  1. Choose Next and proceed with the stack creation.
  2. Review the configuration and choose Create.

  1. Wait for stack creation to complete, which can take a few minutes.

  1. Navigate to the Amazon Redshift console and choose the cluster name to see the cluster properties.

  1. Note the JDBC URL for the cluster and the attached security group for later use.

Note: We created a sample Amazon Redshift cluster in a public subnet present inside a VPC in Region B. We recommend that you follow the best practices for increased security and availability while setting up a new Amazon Redshift cluster, as shown in our samples on GitHub.

Creating an AWS Glue connection

Now you have the required VPC setup, Amazon Redshift cluster, and IAM role in place. Next, you can create an AWS Glue connection and test it as follows:

  1. Choose Add Connection under the Connections tab in AWS Glue console. The AWS Region in which we are creating this connection is the same as for our VPC setup, that is US Oregon (us-west-2).

  1. Choose a JDBC connection type. You can choose to enforce JDBC SSL or not, depending on the configuration for your data store.

  1. Add the connection-specific configuration. Note the URL for our Amazon Redshift cluster. It shows that the Amazon Redshift cluster is present in us-east-1.

Note: We use the VPC (VPCForGlue) and the private subnet (GluePrivateSubnet) we created for this connection. For security groups, we use the default security group for the VPC. This security group has a self-referencing inbound rule and an outbound rule that allows all traffic.

  1. Review configuration and choose Finish.

The AWS Glue console should now show that the connection was created successfully.

Note: Completing this step just means that an AWS Glue connection was created. It doesn’t guarantee that AWS Glue can actually connect to your data store. Before we test the connection, we also need to allow Amazon Redshift to accept network traffic coming from AWS Glue.

Allow Amazon Redshift to accept network traffic from AWS Glue

The Amazon Redshift cluster in a different AWS Region (us-east-1) from AWS Glue must allow incoming network traffic from AWS Glue.

For this, we update the security group attached to the Amazon Redshift cluster, and whitelist the Elastic IP address attached to the NAT gateway for the AWS Glue VPC.

Testing the AWS Glue connection

As a best practice, before you use a data store connection in an ETL job, choose Test connection. AWS Glue uses the parameters in your connection to confirm that it can access your data store and reports back any errors.

  1. Select the connection TestAWSGlueConnection that we just created and choose Test Connection.
  2. Select the TestAWSGlueConnectionIAMRole that we created for allowing AWS Glue resources to access VPC components.

  1. After you choose the Test connection button in the previous step, it can take a few seconds for AWS Glue to successfully connect to the data store. When it does, the console shows a message saying it “connected successfully to your instance.”


By creating a VPC setup similar to the one we describe, you can let AWS Glue connect to a data store in a different account or AWS Region. By doing this, you establish network connectivity between AWS Glue resources and your data store. You can now use this AWS Glue connection in ETL jobs and AWS Glue crawlers to connect with the data store.

If you have questions or suggestions, please leave a comment following.

Additional Reading

If you found this post helpful, be sure to check out Connecting to and running ETL jobs across multiple VPCs using a dedicated AWS Glue VPC, and How to access and analyze on-premises data stores using AWS Glue.


About the Author

Pankaj Malhotra is a Software Development Engineer at Amazon Web Services. He enjoys solving problems related to cloud infrastructure and distributed systems. He specializes in developing multi-regional, resilient services using serverless technologies.




Connecting to and running ETL jobs across multiple VPCs using a dedicated AWS Glue VPC

Post Syndicated from Nivas Shankar original https://aws.amazon.com/blogs/big-data/connecting-to-and-running-etl-jobs-across-multiple-vpcs-using-a-dedicated-aws-glue-vpc/

Many organizations use a setup that includes multiple VPCs based on the Amazon VPC service, with databases isolated in separate VPCs for security, auditing, and compliance purposes. This blog post shows how you can use AWS Glue to perform extract, transform, load (ETL) and crawler operations for databases located in multiple VPCs.

The solution presented here uses a dedicated AWS Glue VPC and subnet to perform the following operations on databases located in different VPCs:

  • Scenario 1: Ingest data from an Amazon RDS for MySQL database, transform it in AWS Glue, and output the results to an Amazon Redshift data warehouse.
  • Scenario 2: Ingest data from an Amazon RDS for MySQL database, transform it in AWS Glue, and output the results to an Amazon RDS for PostgreSQL database.

In this blog post, we’ll go through the steps needed to build an ETL pipeline that consumes from one source in one VPC and outputs it to another source in a different VPC. We’ll set up in multiple VPCs to reproduce a situation where your database instances are in multiple VPCs for isolation related to security, audit, or other purposes.

For this solution, we create a VPC dedicated to AWS Glue. Next, we set up VPC peering between the AWS Glue VPC and all of the other database VPCs. Then we configure an Amazon S3 endpoint, route tables, security groups, and IAM so that AWS Glue can function properly. Lastly, we create AWS Glue connections and an AWS Glue job to perform the task at hand.

Step 1: Set up a VPC

To simulate these scenarios, we create four VPCs with their respective IPv4 CIDR ranges. (Note: CIDR ranges can’t overlap when you use VPC peering.)

VPC 1Amazon Redshift172.31.0.0/16
VPC 2Amazon RDS for MySQL172.32.0.0/16
VPC 3Amazon RDS for PostgreSQL172.33.0.0/16
VPC 4AWS Glue172.30.0.0/16

Key configuration notes:

  1. The AWS Glue VPC needs at least one private subnet for AWS Glue to use.
  2. Ensure that DNS hostnames are enabled for all of your VPCs (unless you plan to refer to your databases by IP address later on, which isn’t recommended).

Step 2: Set up a VPC peering connection

Next, we peer our VPCs together to ensure that AWS Glue can communicate with all of the database targets and sources. This approach is necessary because AWS Glue resources are created with private addresses only. Thus, they can’t use an internet gateway to communicate with public addresses, such as public database endpoints. If your database endpoints are public, you can alternatively use a network address translation (NAT) gateway with AWS Glue rather than peer across VPCs.

Create the following peering connections.

Peer 1172.30.0.0/16- VPC 4172.31.0.0/16- VPC 1
Peer 2172.30.0.0/16- VPC 4172.32.0.0/16 -VPC 2
Peer 3172.30.0.0/16- VPC 4172.33.0.0/16- VPC 3

These peering connections can be across separate AWS Regions if needed. The database VPCs are not peered together; they are all peered with the AWS Glue VPC instead. We do this because AWS Glue connects to each database from its own VPC. The databases don’t connect to each other.

Key configuration notes:

  1. Create a VPC peering connection, as described in the Amazon VPC documentation. Select the AWS Glue VPC as the requester and the VPC for your database as the accepter.
  2. Accept the VPC peering request. If you are peering to a different AWS Region, switch to that AWS Region to accept the request.

Important: Enable Domain Name Service (DNS) settings for each of the peering connections. Doing this ensures that AWS Glue can retrieve the private IP address of your database endpoints. Otherwise, AWS Glue resolves your database endpoints to public IP addresses. AWS Glue can’t connect to public IP addresses without a NAT gateway.

Step 3: Create an Amazon S3 endpoint for the AWS Glue subnet

We need to add an Amazon S3 endpoint to the AWS Glue VPC (VPC 4). During setup, associate the endpoint with the route table that your private subnet uses. For more details on creating an S3 endpoint for AWS Glue, see Amazon VPC Endpoints for Amazon S3 in the AWS Glue documentation.

AWS Glue uses S3 to store your scripts and temporary data to load into Amazon Redshift.

Step 4: Create a route table configuration

Add the following routes to the route tables used by the respective services’ subnets. These routes are configured along with existing settings.

VPC 4—AWS GlueDestinationTarget
Route table172.33.0.0/16- VPC 3Peer 3 VPC 1Peer 1 VPC 2Peer 2


VPC 1—Amazon RedshiftDestinationTarget
Route table172.30.0.0/16- VPC 4Peer 1


VPC 2—Amazon RDS MySQLDestinationTarget
Route table172.30.0.0/16- VPC 4Peer 2


VPC 3—Amazon RDS PostgreSQLDestinationTarget
Route table172.30.0.0/16- VPC 4Peer 3

Key configuration notes:

  • The route table for the AWS Glue VPC has peering connections to all VPCs. It has these so that AWS Glue can initiate connections to all of the databases.
  • All of the database VPCs have a peering connection back to the AWS Glue VPC. They have these connections to allow return traffic to reach AWS Glue.
  • Ensure that your S3 endpoint is present in the route table for the AWS Glue VPC.

Step 5: Update the database security groups

Each database’s security group must allow traffic to its listening port (3306, 5432, 5439, and so on) from the AWS Glue VPC for AWS Glue to be able to connect to it. It’s also a good idea to restrict the range of source IP addresses as much as possible.

There are two ways to accomplish this. If your AWS Glue job will be in the same AWS Region as the resource, you can define the source as the security group that you use for AWS Glue. If you are using AWS Glue to connect across AWS Regions, specify the IP range from the private subnet in the AWS Glue VPC instead. The examples following use a security group as our AWS Glue job, and data sources are all in the same AWS Region.

In addition to configuring the database’s security groups, AWS Glue requires a special security group that allows all inbound traffic from itself. Because it isn’t secure to allow traffic from, we create a self-referencing rule that simply allows all traffic originating from the security group. You can create a new security group for this purpose, or you can modify an existing security group. In the example following, we create a new security group to use later when AWS Glue connections are created.

The security group Amazon RDS for MySQL needs to allow traffic from AWS Glue:

Amazon RDS for PostgreSQL allows traffic to its listening port from the same:

Amazon Redshift does it as so:

AWS Glue does it as so:

Step 6: Set up IAM

Make sure that you have an AWS Glue IAM role with access to Amazon S3. You might want to provide your own policy for access to specific Amazon S3 resources. Data sources require s3:ListBucket and s3:GetObject permissions. Data targets require s3:ListBucket, s3:PutObject, and s3:DeleteObject permissions. For more information on creating an Amazon S3 policy for your resources, see Policies and Permissions in the IAM documentation.

The role should look like this:

Or you can create an S3 policy that’s more restricted to suit your use case.

Step 7: Set up an AWS Glue connection

The Amazon RDS for MySQL connection in AWS Glue should look like this:

The Amazon Redshift connection should look like this:

The Amazon RDS for PostgreSQL connection should look like this:

Step 8: Set up an AWS Glue job

Key configuration notes:

  1. Create a crawler to import table metadata from the source database (Amazon RDS for MySQL) into the AWS Glue Data Catalog. The scenario includes a database in the catalog named gluedb, to which the crawler adds the sample tables from the source Amazon RDS for MySQL database.
  2. Use either the source connection or destination connection to create a sample job as shown following. (This step is required for the AWS Glue job to establish a network connection and create the necessary elastic network interfaces with the databases’ VPCs and peered connections.)
  3. This scenario uses pyspark code and performs the load operation from Amazon RDS for MySQL to Amazon Redshift. The ingest from Amazon RDS for MySQL to Amazon RDS for PostgreSQL includes a similar job.
  4. After running the job, verify that the table exists in the target database and that the counts match.

The following screenshots show the steps to create a job in the AWS Glue Management Console.

Following are some of examples of loading data from source tables to target instances. These are simple one-to-one mappings, with no transformations applied. Notice that the data sources and data sink (target) connection configuration access multiple VPCs from a single AWS Glue job.

Sample script 1 (Amazon RDS for MySQL to Amazon Redshift)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource = glueContext.create_dynamic_frame.from_catalog(database = "gluedb", table_name = "mysqldb_events", transformation_ctx = "datasource")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource, catalog_connection = "Redshift", connection_options = {"dbtable": "mysqldb_events", "database": "dmartblog"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink")


Sample script 2:  Amazon RDS for MySQL to Amazon RDS for PostgreSQL (can also change with other RDS endpoint)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource = glueContext.create_dynamic_frame.from_catalog(database = "gluedb", table_name = "mysqldb_events", transformation_ctx = "datasource")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "vpc-pgsql", connection_options = {"dbtable": "mysqldb_events", "database": "mypgsql"}, transformation_ctx = "datasink")



In this blog post, you learn how to configure AWS Glue to run in a separate VPC so that it can execute jobs for databases located in multiple VPCs.

The benefits of doing this include the following:

  • A separate VPC and dedicated pool on the running AWS Glue job, isolated from database and compute nodes.
  • Dedicated ETL developer access to a single VPC for better security control and provisioning.


Additional Reading

If you found this post useful, be sure to check out Restrict access to your AWS Glue Data Catalog with resource-level IAM permissions and resource-based policies, and Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production.


About the Author

Nivas Shankar is a Senior Big Data Consultant at Amazon Web Services. He helps and works closely with enterprise customers building big data applications on the AWS platform. He holds a Masters degree in physics and is highly passionate about theoretical physics concepts. He enjoys spending time with his wife and two adorable kids. In his spare time, he takes his kids to tennis and football practice.



Ian Eberhart is a Cloud Support Engineer on the Big Data team for AWS Premium Support. He works with customers on a daily basis to find solutions for moving and sorting their data on the AWS platform. In his spare time, Ian enjoys seeing independent and weird movies, riding his bike, and hiking in the mountains.


Chasing earthquakes: How to prepare an unstructured dataset for visualization via ETL processing with Amazon Redshift

Post Syndicated from Ian Funnell original https://aws.amazon.com/blogs/big-data/chasing-earthquakes-how-to-prepare-an-unstructured-dataset-for-visualization-via-etl-processing-with-amazon-redshift/

As organizations expand analytics practices and hire data scientists and other specialized roles, big data pipelines are growing increasingly complex. Sophisticated models are being built using the troves of data being collected every second.

The bottleneck today is often not the know-how of analytical techniques. Rather, it’s the difficulty of building and maintaining ETL (extract, transform, and load) jobs using tools that might be unsuitable for the cloud.

In this post, I demonstrate a solution to this challenge. I start with a noisy semistructured dataset of seismic events, spanning several years and recorded at different locations across the globe. I set out to obtain broad insights about the nature of the rocks forming the Earth’s surface itself—the tectonic plate structure—to be visualized using the mapping capability in Amazon QuickSight.

To accomplish this, I use several AWS services, orchestrated together using Matillion ETL for Amazon Redshift:

Tectonic plate structure context

An earthquake is caused by a buildup of pressure that gets suddenly released. Earthquakes tend to be more severe at the boundaries of destructive tectonic plates. These boundaries are formed when a heavier and denser oceanic plate collides with a lighter continental plate, or when two oceanic plates collide. Due to the difference in density, the oceanic lithosphere is pushed underneath the continental plate, forming what is called a subduction zone. (See the following diagram.) In subduction zones, earthquakes can occur at depths as great as 700 kilometers.

Photography by KDS4444 [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons

For our analysis, we ultimately want to visualize the depth of an earthquake focus to locate subduction zones, and therefore find places on earth with the most severe earthquakes.

Seismographic data source

The data source is from the International Federation of Digital Seismograph Networks (FDSN). The event data is in JSON format (from the European Mediterranean Seismological Centre, or EMSC). An external process accumulates files daily into an Amazon S3 bucket, as shown following.

Each individual file contains all the seismic events for one day—usually several hundred—in an embedded array named “features,” as shown in the following example:

  "type": "FeatureCollection",
  "metadata": {
    "totalCount": 103
  "features": [
      "geometry": {
        "type": "Point",
        "coordinates": [26.76, 45.77, -140]
      "type": "Feature",
      "id": "20180302_0000103",
      "properties": {
        "lastupdate": "2018-03-02T23:27:00.0Z",
        "lon": 26.76, "lat": 45.77, "depth": 140,
        "mag": 3.7,
        "time": "2018-03-02T23:22:52.1Z",
        "flynn_region": "ROMANIA"
      "geometry": {
        "type": "Point",

Architecture overview

Athena reads and flattens the S3 data and makes it available for Matillion ETL to load into Amazon Redshift via JDBC. Matillion orchestrates this data movement, and it also provides a graphical framework to design and build the more complex data enrichment and aggregation steps to be performed by Amazon Redshift. Finally, the prepared data is queried by Amazon QuickSight for visualization.

Amazon Athena setup

You can use Athena to query data in S3 using standard SQL, via a serverless infrastructure that is managed entirely by AWS on your behalf. Before you can query your data, start by creating an external table. By doing this, you are defining the schema to apply to the data when it is being queried.

You can choose to use an AWS Glue crawler to assist in automatically discovering the schema and format of your source data files.

The following is the CREATE TABLE statement that you can copy and paste into the Athena console to create the schema needed to query the seismic data. Make sure that you substitute the correct S3 path to your seismic data in the LOCATION field of the statement.

  `type` string COMMENT 'from deserializer', 
  `metadata` struct<totalcount:int> COMMENT 'from deserializer', 
  `features` array<struct<geometry:struct<type:string,coordinates:array<double>>,type:string,id:string,properties:struct<lastupdate:string,magtype:string,evtype:string,lon:double,auth:string,lat:double,depth:double,unid:string,mag:double,time:string,source_id:string,source_catalog:string,flynn_region:string>>> COMMENT 'from deserializer')
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-bucket/SeismicPortal'

After the table is created, you are ready to query it. Athena uses the in-memory, distributed SQL engine called Apache Presto. It provides the ability to unnest arrays, which you use next.

Transfer to Amazon Redshift

The embedded array in every source record gets flattened out and converted to individual records during the JDBC export (download the .jar file) into Amazon Redshift. You use a Matillion ETL Database Query component to assist with the data transfer during this step, as shown in the following image.

This component simplifies ETL by automating the following steps:

Runs the SQL SELECT statement (shown in the following example).

  1. Streams query results across the network from Athena and into temporary storage in S3.
  2. Performs a bulk data load into Amazon Redshift.

Athena executes the following SQL statement:

SELECT f.id,
	   f.properties.time AS event_time,
FROM “seismic”.”sp_events”
CROSS JOIN UNNEST (features) as t(f)

The CROSS JOIN UNNEST syntax flattens the embedded array, generating hundreds of individual event records per day.

Now that the data has been copied and flattened into individual event records (shown in the following image), it’s ready for enrichment and aggregation.

Data enrichment

Earthquakes occur along a continuous range of spatial coordinates. In order to aggregate them, as we’ll be doing very soon, it’s necessary to first group them together. A convenient method is to assign every event into a Universal Transverse Mercator (UTM) zone. These zones are six-degree bands of longitudes that convert the spherical latitude/longitude coordinates into a 2D representation. Performing this conversion provides good granularity for visualization later.

The calculation to convert a spherical longitude/latitude coordinate into a two-dimensional UTM coordinate is complex. It can be performed ideally using an Amazon Redshift user-defined function (UDF). I chose a UDF for the ability to invoke it, via a Matillion component, in the next step.

CREATE OR REPLACE FUNCTION f_ll_utm (Lat float, Long float)
AS $$
From math import pi, sin, cos, tan, sqrt

_deg2rad = pi / 180.0
_rad2deg = 180.0 / pi

_EquatorialRadius = 1
_eccentricitySquared = 2
_ellipsoid = [ “WGS-84”, 6378137, 0.00669438]

The UDF has to return three pieces of information:

  • UTM Zone code
  • Easting (x-axis measurement in meters)
  • Northing (ditto, for the y-axis)

A scalar UDF can only return a single value. Therefore the three results were returned as a pipe-delimited string, in which the three values are pipe-separated:

To bring the values out into individual fields, the UDF is first invoked using a Matillion ETL Calculator component, followed by a field splitter and a Calculator to perform data type conversion and rounding.

Data aggregation

To reiterate, we’re interested in the depth of earthquake focus specifically on destructive plate boundaries. Knowing the depth helps us estimate the potential severity of earthquakes.

We need to find the average event depth within each UTM zone, in the expectation that a spatial pattern will appear that will highlight the subduction zones.

The last three steps in the Matillion transformation (shown in the following diagram) perform the necessary aggregation, add a depth quartile, and create an output table from the resulting data.

The ”Aggregate to UTM ref” step gets Amazon Redshift to perform a GROUP BY function in SQL, which approximates every event to the appropriate UTM zone. While doing this aggregation, you simultaneously do the following:

  • Count the events (which determines the size of the visual representation).
  • Find the average depth (which determines the color of the visual representation).
  • Determine the average latitude and longitude (which approximates to the center of the UTM zone, and determines the position of the visual representation).

The following image shows the aggregation type for each column:

Average depth is a useful measure, but to maximize the visual impact of the final presentation, we also take the opportunity to rank the results into quartiles. This allows the zones with the deepest quartile to stand out consistently on the map.

NTILE(4) OVER (ORDER BY "avg_depth")

Amazon Redshift is great at performing this type of analytics, which is delivered inside another Matillion ETL Calculator component.

The Recreate Output step materializes the dataset into an Amazon Redshift table, ready for Amazon QuickSight to visualize.

Amazon QuickSight visualization

The Amazon QuickSight “points on map” visualization is perfect for this 2D rendering. The values for the field wells come straight from the aggregated data in Amazon Redshift:

  • Geospatial — the average lat/long per UTM grid.
  • Size — the record count, in other words, the number of events in that UTM zone.
  • Color — the Depth Ntile, with the fourth quartile in pink.

The resulting map shows the global subduction zones highlighted clearly in pink, these being the areas with the deepest earthquake’s focus on average.

Recap and summary

In this post, I used seismological data as an example to explore challenges around the visualization of unstructured data and to provide best practices. I suggested a way to overcome these challenges with an architecture that is also applicable for datasets from a wide array of sources, beyond geology. I then explored how to orchestrate activities of different data processing tasks between S3, Athena, and Amazon Redshift using Matillion ETL for Amazon Redshift.

If you’re trying to solve similar analytics challenges and want to see how Amazon Redshift and Matillion can help, launch a 14 day free trial of Matillion ETL for Amazon Redshift on the AWS Marketplace or schedule a demo today. If you have questions or suggestions, please comment below.

Additional Reading

If you found this post helpful, be sure to check out Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift, and Orchestrate multiple ETL jobs using AWS Step Functions and AWS Lambda.


About the Author

Ian Funnell, a lead solution architect at Matillion, is a seasoned cloud and data warehousing specialist. Ian is dedicated to supporting customers in driving their data transformation forward and solving their deepest technical challenges. Ian’s has 25+ years experience in the tech industry.




Performance matters: Amazon Redshift is now up to 3.5x faster for real-world workloads

Post Syndicated from Ayush Jain original https://aws.amazon.com/blogs/big-data/performance-matters-amazon-redshift-is-now-up-to-3-5x-faster-for-real-world-workloads/

Since we launched Amazon Redshift, thousands of customers have trusted us to get uncompromising speed for their most complex analytical workloads. Over the course of 2017, our customers benefited from a 3x to 5x performance gain, resulting from short query acceleration, result caching, late materialization, and many other under-the-hood improvements. In this post, we highlight recent improvements to Amazon Redshift and how our continued focus on performance enhancements is benefiting customers. We also discuss performance testing derived from industry-standard benchmarks that help us measure the impact of these ongoing improvements.

Recent performance improvements

With the largest number of data warehousing deployments in the cloud, we have the ability to analyze usage patterns across a variety of analytical workloads and uncover opportunities to improve performance. We leverage these insights to deliver improvements that seamlessly benefit thousands of customers. Major improvements in performance over the past six months include the following:

  • Improved resource management for memory-intensive queries: Amazon Redshift improved how joins and aggregations consume and reserve memory. This improved cache efficiency for the majority of the hash tables and reduced spilling for memory-intensive joins and aggregations by up to 1.6x.
  • Improved performance for commits: As a central component of write transactions, commit has a direct impact on the performance of data update and data ingestion workloads, such as ETL (extract, transform, and load) jobs. Since November 2017, we’ve delivered a series of commit performance optimizations such as batching multiple commits in a single operation, improved usage of commit locks, and a locality-aware metadata defragmenter. These and other related optimizations have resulted in a 4x commit time reduction on average for HDD-based clusters. For heavy transactions (the top 5 percent of commit operations in Amazon Redshift), the delivered optimizations resulted in a 7.5x improvement.
  • Improved performance for repeated queries: With Amazon Redshift’s result caching, dashboards, visualization, and business intelligence (BI) tools that execute queries repeatedly now see a significant boost in performance. In addition, result caching frees up resources that can improve the performance of all other queries.
  • Query processing improvements: Amazon Redshift now performs 2x–6x faster for scenarios such as repeated subqueries, advanced analytics functions with predicates, and complex query plans by eliminating duplicate work and streamlining steps.
  • Faster string manipulation: Amazon Redshift yields 5x better performance for frequently used string functions because of more efficient code generation techniques.

We’ve also complemented these out-of-the-box improvements with tailored recommendations to help you get better performance at a lower cost with Amazon Redshift Advisor. Advisor has already provided close to 50,000 recommendations since it launched in July 2018.

All of these optimizations have transparently boosted customers’ ability to get faster insights from their AWS analytics platform and saved thousands of hours of execution time on a daily basis. This applies to even the largest deployments, where customers have multiple petabytes of data in Redshift clusters, and seamless access to even larger data volumes in their Amazon S3 data lakes with Amazon Redshift Spectrum. “Redshift’s query performance and scalability has been increasing, even though our data has grown.” said Minero Aoki, Senior Data Engineer, Cookpad Inc. “In the last 10 months, we have seen commit performance increase by 500% without any increase in cost.”

Using benchmarks to measure success

To measure the impact of these ongoing improvements, we measure performance on a nightly basis and run queries derived from industry-standard benchmarks such as TPC-DS. We also occasionally benchmark Amazon Redshift against other data warehouse services. We set up these measurements to reflect our customers’ real-world usage, as highlighted earlier. This enables us to accurately gauge whether Amazon Redshift is getting better with each release, which happens every two weeks.

Comparing Amazon Redshift releases over the past few months, we observed that Amazon Redshift is now 3.5x faster versus six months ago, running all 99 queries derived from the TPC-DS benchmark. This is shown in the following chart.

Note: We used a Cloud DW benchmark derived from TPC-DS for this study. As such, the Cloud DW benchmark is not comparable to published TPC-DS results. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council.

For this post, we also compared the latest Amazon Redshift release with Microsoft Azure SQL Data Warehouse using the Cloud DW benchmark derived from TPC-DS. Queries ran against a 3 TB dataset on a 4-node cluster on both services, using dc2.8xlarge for Amazon Redshift and DW2000c Gen2 for Azure SQL Data Warehouse. We could not run a larger dataset because Azure could not allocate the DW15000c cluster required for a 30 TB dataset owing to capacity constraints at the time of publishing.

We observed that Amazon Redshift is 15x faster than Azure SQL Data Warehouse running all 99 queries with one user, and 14x faster with four concurrent users. There were a couple of outlier queries that took Azure SQL Data Warehouse several hours to complete. Excluding the two long running queries, Amazon Redshift is 2x faster than Azure SQL Data Warehouse with 1 user and 1.6x faster with four concurrent users. The following charts compare the two services.

Note: We used queries derived from TPC-DS v2.9 for this study. Amazon Redshift and Azure SQL DW do not support rollup queries, so we used TPC-DS provided variants for queries 5, 14, 18, 27, 36, 67, 70, 77, 80, and 86. We used out-of-the-box Workload Management configuration for Amazon Redshift, which allows for 5 concurrent queries, and ‘largerc’ resource class for Azure SQL DW, which has a lower limit of 4 concurrent queries. Amazon Redshift took 25 minutes to run all 99 queries, whereas Azure SQL Data Warehouse took 6.4 hours. Ignoring two queries that each took Azure SQL Data Warehouse more than 1 hour to execute (Q38 and Q67), Amazon Redshift took 22 minutes, while Azure SQL Data Warehouse took 42 minutes.


Evaluating Amazon Redshift

Although benchmarks against other data warehouse services are interesting, they are of limited value. First, there’s no one-size-fits-all benchmark. Each service has its unique real-world usage patterns and ways to configure and tune for them. We make a best effort to configure the services based on publicly available guidance, but we can’t guarantee optimal performance for any given service. We see this commonly with third-party benchmarks, for instance, where Amazon Redshift’s powerful distribution and sort keys are not used—even though the large majority of our customers use them.

Similarly, each benchmark query can only be run once, in contrast to real-world scenarios where 99.5 percent of queries we observe have components that can be found in the compilation cache (Amazon Redshift generates and compiles code for each query execution plan. The compiled code segments are stored in a least recently used cache and shared across sessions in a cluster). In other words, they are similar to queries that were run previously. So, the query run times measured by benchmarking studies can end up over-indexing on compilation times, which might not indicate the actual performance you can expect to get.

Secondly, these studies are, by necessity, a point-in-time assessment. As cloud vendors update and evolve their service, benchmark numbers might already be obsolete by the time they’re published.

Therefore, we don’t recommend that you make product selection decisions based on these benchmarks because your data and your query workloads have their own unique characteristics. If you’re evaluating Amazon Redshift for your analytics platform, we have created a Proof of Concept guide to help. You can also request assistance from us, or work with one of our System Integration and Consulting Partners and make a data-driven decision.

Finally, we invite you to watch the recent Fireside chat webinar and join us at re:Invent 2018 in Las Vegas, where we have a ton of exciting news to share with you. Happy querying!

If you would like instruction to reproduce the benchmark, please contact us at [email protected]. If you have questions or suggestions, please comment below.

About the Authors

Ayush Jain is a Product Marketer at Amazon Web Services. He loves growing cloud services and helping customers get more value from the cloud deployments. He has several years of experience in Software Development, Product Management and Product Marketing in developer and data services.




Mostafa Mokhtar is an engineer working on Redshift performance. Previously, he held similar roles at Cloudera, Hortonworks and on the SQL Server team at Microsoft.


Closing the customer journey loop with Amazon Redshift at Equinox Fitness Clubs

Post Syndicated from Ryan Kelly original https://aws.amazon.com/blogs/big-data/closing-the-customer-journey-loop-with-amazon-redshift-at-equinox-fitness-clubs/

Clickstream analysis tools handle their data well, and some even have impressive BI interfaces. However, analyzing clickstream data in isolation comes with many limitations. For example, a customer is interested in a product or service on your website. They go to your physical store to purchase it. The clickstream analyst asks, “What happened after they viewed the product?” and the commerce analyst asks themselves, “What happened before they bought it?”

It’s no surprise that clickstream data can enhance your other data sources. When used with purchase data, it helps you determine abandoned carts or optimize marketing spending. Similarly, it helps you analyze offline and online behavior, and also behavior before customers even registered an account. However, once the benefits of clickstream data feeds are evident, you must accommodate new requests quickly.

This blog post shows how we, at Equinox Fitness Clubs, moved our data from Amazon Redshift to Amazon S3 to use a late-binding view strategy with our clickstream data. Expect such fun things as Apache Spark, Apache Parquet, data lakes, hive partitioning, and external tables, all of which we will talk about extensively in this post!

When we first started passing our clickstream data from its own tool to our Amazon Redshift data warehouse, speed was the primary concern. Our initial use case was to tie Salesforce data and Adobe Analytics data together to get a better understanding about our lead process. Adobe Analytics could tell us what channel and campaigns people came from, what pages they looked at during the visit, and if they submitted a lead form on our site. Salesforce could tell us if a lead was qualified, if they ever met an advisor, and ultimately if they became a member. Tying these two datasets together helped us better understand and optimize our marketing.

To begin, we knew the steps involved to centralize Salesforce and Adobe Analytics data into Amazon Redshift. However, even when joined together in Redshift, they needed a common identifier to talk to each other. Our first step involved generating and sending the same GUID to both Salesforce and Adobe Analytics when a person submitted a lead form on our website.

Next, we had to pass Salesforce data to Redshift. Luckily, those feeds already existed, so we could add this new GUID attribute in the feed and describe it in Redshift.

Similarly, we had to generate data feeds from Adobe Analytics to Amazon Redshift. Adobe Analytics provides Amazon S3 as a destination option for our data, so we passed the data to S3 and then created a job to send it to Redshift. The job involved taking the daily Adobe Analytics feed – which comes with a data file containing hundreds of columns and hundreds of thousands of rows, a collection of lookup files like the headers for the data, and a manifest file that describes the files that were sent – and passing it all to Amazon S3 in its raw state. From there, we used Amazon EMR with Apache Spark to process the data feed files into a single CSV file, and then save it to S3 so that we could perform the COPY command to send the data to Amazon Redshift.

The job ran for a few weeks and worked well until we started to use the data more frequently. While the job was effective, backdating data with new columns (schema evolution) started to occur. That’s when we decided we needed greater flexibility because of the nature of the data.

Data lake to the rescue

When we decided to refactor the job, we had two things lined up for us. First, we were already moving towards more of a data lake strategy. Second, Redshift Spectrum had been recently released. It would enable us to query these flat files of clickstream data in our data lake without ever having to run the COPY command and store it in Redshift. Also, we could more efficiently join the clickstream data to other data sources stored inside of Redshift.

We wanted to take advantage of self-describing data which combines the schema of the data with the data itself. Converting the data to self-describing data would help us manage the wide clickstream datasets and prevent the schema-evolution related challenges. We could pull every column we desired into the data lake file and then use only the important columns in our queries to speed up processing. To accomplish this flexibility, we used the Apache Parquet file format, which is both self-describing and blazing fast due to its columnar storage technology. We used Apache Spark on Amazon EMR to convert from CSV to Parquet, and partition the data for scanning performance, as shown in the following code.


from datetime import date, timedelta
from pyspark.sql.types import *
from pyspark.sql import SparkSession
import json
import argparse
# Usage
# spark-submit all_omniture_to_parquet.py 2017-10-31 s3a:// eqxdl-prod-l-omniture eqxios eqxdl-prod eqeqxiosprod omniture_eqxios
# python -m tasks.s2w_all_omniture_to_parquet 2017-10-31

parser = argparse.ArgumentParser()
parser.add_argument('year_month_day_arg', help='Run date (yyyy-mm-dd)', type=str, default='XXX')
parser.add_argument('s3_protocol', help='S3 protocol i.e. s3a://',type=str, default='XXX')
parser.add_argument('source_bucket', help='Omniture source data bucket',type=str, default='XXX')
parser.add_argument('source_path', help='Omniture source data path',type=str, default='XXX')
parser.add_argument('target_bucket', help='Omniture target data bucket',type=str, default='XXX')
parser.add_argument('report_suite', help='Omniture report suite ID',type=str, default='XXX')
parser.add_argument('application', help='App name for job',type=str, default='XXX')
args = parser.parse_args()

spark = SparkSession\

sc = spark.sparkContext

def manifest_toJSON(file, location):
	text = sc.textFile(file).collect()
	manifest = {'lookup_files': [], 'data_files': location, 'total_rows': 0}
	for x in text:
		if 'Lookup-File:' in x:
			manifest['lookup_files'].append(location+x.split(': ')[1])
		elif 'Data-File: 01' in x:
			wildcard_path = x.replace('Data-File: 01','*')
			manifest['data_files'] += wildcard_path
		elif 'Record-Count:' in x:
			manifest['total_rows'] += int(x.split(': ')[1])
	return manifest

# Create metadata by stitching together the file paths for
# the header file and the data file from the manifest file
# base_filepath = '/Users/rkelly/projects/sparkyTest/project_remodeling/ios_test_data/'
base_filepath = '{}{}/{}/'.format(args.s3_protocol, args.source_bucket, args.source_path)
manifest_filepath = base_filepath+'{}_{}.txt'.format(args.report_suite, args.year_month_day_arg)
metadata = manifest_toJSON(manifest_filepath, base_filepath)

# Create a list of files and their data
# Look specifically for the column_headers.tsv data
# Split on \x00 to remove the garbage encoding and return a string of headers
lookup_files = sc.textFile(','.join(metadata['lookup_files'])).collect()
encoded_header = lookup_files[[idx for idx, s in enumerate(lookup_files) if 'column_headers.tsv' in s][0]].split('\x00')
header = encoded_header[[idx for idx, s in enumerate(encoded_header) if '\t' in s][0]]\
			.replace('\n', '')\
			.replace('(', '')\
			.replace(')', '')\
			.replace(' ', '-')

# Create a schema for the list from the header file splitting on tabs
# Cast everything as a string to avoid data type failures
schema = StructType([ StructField(field, StringType(), True) for field in header.split('\t')])

# Bypass RDD and write data file as a dataframe
# then save as parquet to tie headers to their respective values
df = spark.read.csv(metadata['data_files'], header=False, schema=schema, sep='\t', nullValue=None)
destination_filepath = '{}{}/{}/dt={}/'.format(args.s3_protocol, args.target_bucket, args.application, args.year_month_day_arg)

# Gracefully exit out of spark and this file


Using the AWS Glue Data Catalog allowed us to make our clickstream data available to be queried within Amazon Redshift and other query tools like Amazon Athena and Apache Spark. This is accomplished by mapping the Parquet file to a relational schema. AWS Glue, enables querying additional data in mere seconds. This is because schema changes can occur in real time. This means you can delete and add columns, reorder column indices, and change column types all at once. You can then query the data immediately after saving the schema. Additionally, Parquet format prevents failures when the shape of the data changes, or when certain columns are deprecated and removed from the data set.

We used the following query to create our first AWS Glue table for our Adobe Analytics website data. We ran this query in Amazon Redshift in SQL Workbench.


--First create your schema
create external schema omniture_prod
from data catalog 
database 'omniture' 
iam_role 'arn:aws:iam:::role

--Then create your “table” 
CREATE EXTERNAL TABLE omniture_prod.eqx_web (
  date_time		VARCHAR,
  va_closer_id		VARCHAR,
  va_closer_detail	VARCHAR,
  va_finder_detail	VARCHAR,
  va_finder_id		VARCHAR,
  ip			VARCHAR,
  domain		VARCHAR,
  post_evar1		VARCHAR
LOCATION 's3://eqxdl-prod/omniture/eqx_web/'
table properties ('parquet.compress'='SNAPPY');

--Check your databases, schemas, and tables
select * from pg_catalog.svv_external_databases;
select * from pg_catalog.svv_external_schemas;
select * from pg_catalog.svv_external_tables;


After running this query, we added additional columns to the schema upon request throughthe AWS Glue interface. We also used partitioning to make our queries faster and cheaper.

At this point, we had a new schema folder in our database. It contained the external table that could be queried, but we wanted to take it a step further. We needed to add some transformations to the data such as:

  • Renaming IDs to strings
  • Concatenating values
  • Manipulating strings, excluding bot traffic that we send from AWS to test the website
  • Changing the column names to be more user friendly.

To do this, we created a view of the external table, as follows:


create view edw_t.f_omniture_web as 
    REPLACE(dt, '-', '') as hive_date_key,
    va_closer_detail as last_touch_campaign,
        WHEN (va_closer_id) = '1' THEN 'Paid Search'
        WHEN (va_closer_id) = '2' THEN 'Natural Search'
        WHEN (va_closer_id) = '3' THEN 'Display'
        WHEN (va_closer_id) = '4' THEN 'Email Acq'
        WHEN (va_closer_id) = '5' THEN 'Direct'
        WHEN (va_closer_id) = '6' THEN 'Session Refresh'
        WHEN (va_closer_id) = '7' THEN 'Social Media'
        WHEN (va_closer_id) = '8' THEN 'Referring Domains'
        WHEN (va_closer_id) = '9' THEN 'Email Memb'
        WHEN (va_closer_id) = '10' THEN 'Social Placement'
        WHEN (va_closer_id) = '11' THEN 'Other Placement'
        WHEN (va_closer_id) = '12' THEN 'Partnership'
        WHEN (va_closer_id) = '13' THEN 'Other Eqx Sites'
        WHEN (va_closer_id) = '14' THEN 'Influencers'
        ELSE NULL
    END AS last_touch_channel,
    va_finder_detail as first_touch_campaign,
    va_finder_id as va_finder_id,
        WHEN (va_finder_id) = '1' THEN 'Paid Search'
        WHEN (va_finder_id) = '2' THEN 'Natural Search'
        WHEN (va_finder_id) = '3' THEN 'Display'
        WHEN (va_finder_id) = '4' THEN 'Email Acq'
        WHEN (va_finder_id) = '5' THEN 'Direct'
        WHEN (va_finder_id) = '6' THEN 'Session Refresh'
        WHEN (va_finder_id) = '7' THEN 'Social Media'
        WHEN (va_finder_id) = '8' THEN 'Referring Domains'
        WHEN (va_finder_id) = '9' THEN 'Email Memb'
        WHEN (va_finder_id) = '10' THEN 'Social Placement'
        WHEN (va_finder_id) = '11' THEN 'Other Placement'
        WHEN (va_finder_id) = '12' THEN 'Partnership'
        WHEN (va_finder_id) = '13' THEN 'Other Eqx Sites'
        WHEN (va_closer_id) = '14' THEN 'Influencers'
        ELSE NULL
    END AS first_touch_channel,
    ip as ip_address,
    domain as domain,
    post_evar1 AS internal_compaign,
    post_evar10 as site_subsection_nm,
    post_evar11 as IOS_app_view_txt,
    post_evar12 AS site_section_nm,
    post_evar15 AS transaction_id,
    post_evar23 as join_barcode_id,
    post_evar3 AS page_nm,
    post_evar32 as host_nm,
    post_evar41 as class_category_id,
    post_evar42 as class_id,
    post_evar43 as class_instance_id,
    post_evar60 AS referral_source_txt,
    post_evar69 as adwords_gclid,
    post_evar7 as usersec_tracking_id, 
    post_evar8 as facility_id,
    post_event_list as post_event_list,  
    post_visid_low||post_visid_high as unique_adobe_id,
    post_visid_type as post_visid_type,
    post_page_event as hit_type,
    visit_num as visit_number,
    post_evar25 as login_status,
    exclude_hit as exclude_hit,
    hit_source as hit_source,
    post_evar64 as api_error_msg,
    post_evar70 as page_load_time,
    post_evar78 as join_transaction_id,
    post_evar9 as page_url,
    visit_start_pagename as entry_pg,
    post_tnt as abtest_campaign,
    post_tnt_action as abtest_experience,
    user_agent as user_agent,
    mobile_id as mobile_id,
    cast(date_time as timestamp) as date_time,
        'America/New_York', -- timezone of origin
            when post_t_time_info like '%undefined%' then '0'
            when post_t_time_info is null then '0'
            when post_t_time_info = '' then '0'
            when cast(split_part(post_t_time_info,' ',4) as int) < 0
              then left(split_part(post_t_time_info,' ',4),4)
            else left(split_part(post_t_time_info,' ',4),3) end as int
        cast(date_time as timestamp)
    ) as date_time_local,
    post_t_time_info as local_timezone
from omniture_prod.eqx_web
where exclude_hit = '0'
and hit_source not in ('5','7','8','9')

and domain <> 'amazonaws.com'
and domain <> 'amazon.com'



Now we can perform queries from Amazon Redshift, blending our structured Salesforce data with our semi-structured, dynamic Adobe Analytics data. With these changes, our data became extremely flexible, friendly on storage size, and very performant when queried. Since then, we have started using Redshift Spectrum for many use cases from data quality checks, machine data, historical data archiving, and empowering our data analysts and scientists to more easily blend and onboard data.


with web_leads as (
  select transaction_id,last_touch_channel
  from edw_t.f_omniture_web
  where hive_date_key = '20170301'
      and post_event_list like '%201%'
      and transaction_id != '807f0cdc-80cf-42d3-8d75-e55e277a8718'
opp_lifecycle as (
  SELECT lifecycle,weblead_transactionid
  FROM edw_t.f_opportunity
  where weblead_transactionid is not null
  and created_date_key between '20170220'and '20170310'
  coalesce(opp_lifecycle.lifecycle, 'N/A') as Lifecycle
from web_leads
left join opp_lifecycle on web_leads.transaction_id = opp_lifecycle.weblead_transactionid


We were able to setup an efficient and flexible analytics platform for clickstream data by combining the Amazon S3 data lake with Amazon Redshift. This has eliminated the need to always load clickstream data into the data warehouse, and also made the platform adaptable to schema changes in the incoming data. You can read the Redshift documentation to get started with Redshift Spectrum, and also watch our presentation at the AWS Chicago Summit 2018 below.

Additional Reading

If you found this post helpful, be sure to check out From Data Lake to Data Warehouse: Enhancing Customer 360 with Amazon Redshift Spectrum, and Narrativ is helping producers monetize their digital content with Amazon Redshift.


About the Author

Ryan Kelly is a data architect at Equinox, where he helps outline and implement frameworks for data initiatives. He also leads clickstream tracking which helps aid teams with insights on their digital initiatives. Ryan loves making it easier for people to reach and ingest their data for the purposes of business intelligence, analytics, and product/service enrichment. He also loves exploring and vetting new technologies to see how they can enhance what they do at Equinox.