Tag Archives: database

Get started with Amazon Redshift cross-database queries (preview)

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/get-started-with-amazon-redshift-cross-database-queries-preview/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL, business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

We’re excited to announce the public preview of the new cross-database queries capability to query across databases in an Amazon Redshift cluster. In this post, we provide an overview of the cross-database queries and a walkthrough of the key functionality that allows you to manage data and analytics at scale in your organization.

What are cross-database queries?

With cross-database queries, you can seamlessly query data from any database in your Amazon Redshift cluster, regardless of which database you’re connected to. Cross-database queries eliminate data copies and simplify your data organization to support multiple business groups on the same cluster. Support for cross-database queries is available on Amazon Redshift RA3 node types.

Data is organized across multiple databases in Amazon Redshift clusters to support multi-tenant configurations. However, you often need to query and join across these datasets by allowing read access. For example, different business groups and teams that own and manage their datasets in a specific database in the data warehouse need to collaborate with other groups. You might want to perform common ETL staging and processing while your raw data is spread across multiple databases. Organizing data in multiple Amazon Redshift databases is also a common scenario when migrating from traditional data warehouse systems.

With cross-database queries, you can now access data from any database on the Amazon Redshift cluster without having to connect to that specific database. You can also join datasets from multiple databases in a single query. You can access database objects such as tables, views with a simple three-part notation of <database>.<schema>.<object>, and analyze the objects using business intelligence (BI) or analytics tools. You can continue to set up granular access controls for users with standard Amazon Redshift SQL commands and ensure that users can only see the relevant subsets of the data they have permissions for.

Walkthrough overview

In this post, we walk through an end-to-end use case to illustrate cross-database queries, comprising the following steps:

  1. Set up permissions on the data.
  2. Access data and perform several cross-database queries.
  3. Connect from tools.

For this walkthrough, we use SQL Workbench, a SQL query tool, to perform queries on Amazon Redshift. For more information about connecting SQL Workbench to an Amazon Redshift cluster, see Connect to your cluster by using SQL Workbench/J .

Setting up permissions for cross-database queries

You can use standard Redshift SQL GRANT and REVOKE commands to configure appropriate permissions for users and groups. To configure permissions, we connect as an administrator to a database named TPCH_100G on an Amazon Redshift cluster that we set up with an industry standard dataset, TPC-H. You can set up this dataset in your environment using the code and scripts for this dataset on GitHub and the accompanying dataset hosted in a public Amazon Simple Storage Service (Amazon S3) bucket.

The following screenshot shows the configuration for your connection profile.

The TPCH_100G database consists of eight tables loaded in the schema PUBLIC, as shown in the following screenshot.

The following screenshot shows a test query on one of the TPC-H tables, customer.

The database administrator provides read permissions on the three of the tables, customer, orders, and lineitem, to an Amazon Redshift user called demouser. The user typically connects to and operates in their own team’s database TPCH_CONSUMERDB on the same Amazon Redshift cluster.

Performing cross-database queries using three-part notation

In this section, we see how cross-database queries work in action. With cross-database queries, you can connect to any database and query from all the other databases in the cluster without having to reconnect. In this use case, the user demouser connects to their database TPCH_CONSUMERDB (see the following screenshot).

While connected to TPCH_CONSUMERDB, demouser can also perform queries on the data in TPCH_100gG database objects that they have permissions to, referring to them using the simple and intuitive three-part notation TPCH_100G.PUBLIC.CUSTOMER (see the following screenshot).

You can refer to and query objects in any other database in the cluster using this <database>.<schema>.<object> notation as long as you have permissions to do so. The objects can be tables or views (including regular, late binding and materialized views).

In addition to performing queries on objects, you can create views on top of objects in other databases and apply granular access controls as relevant.

Joining data across databases

With cross-database queries, you can join datasets across databases. In the following screenshot, demouser queries and performs joins across the customer, lineitem, and orders tables in the TPCH_100G database.

You can also span joins on objects across databases. In the following query, demouser seamlessly joins the datasets from TPCH_100G (customer, lineitem, and orders tables) with the datasets in TPCH_CONSUMERDB (nation and supplier tables).

With cross-database queries, you get a consistent view of the data irrespective of the database you’re connected to.

Securely accessing relevant datasets by connecting from tools

To support the database hierarchy navigation and exploration introduced with cross-database queries, Amazon Redshift is introducing a new set of metadata views and modified versions of JDBC and ODBC drivers.

In addition, you can create aliases from one database to schemas in any other databases on the Amazon Redshift cluster. You create the aliases using the CREATE EXTERNAL SCHEMA command, which allows you to refer to the objects in cross-database queries with the two-part notation <external schema name>.<object>. For example, in the following screenshot, the database administrator connects to TPCH_CONSUMERDB and creates an external schema alias for the PUBLIC schema in TPC_100G database called TPC_100G_PUBLIC and grants the usage access on the schema to demouser.

Now, when demouser connects to TPCH_CONSUMERDB, they see the external schema in the object hierarchy (as in the following screenshot) with only the relevant objects that they have permissions to: CUSTOMER, LINEITEM, and ORDERS.

Now they can perform queries using the schema alias as if the data is local rather than using a three-part notation.

Summary and next steps

We provided you a glimpse into what you can accomplish with cross-database queries in Amazon Redshift. Cross-database queries allow you to organize and manage data across databases to effectively support multi-tenant data warehouse deployments for a wide variety of use cases. You can get started with your use case leveraging cross-database queries capability by trying out the preview. For more information, refer to the documentation cross-database queries.

About the Authors

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned Product Management and GTM leader, bringing over 20 years of experience in product vision, strategy and leadership roles in data products and platforms. Neeraja delivered products in analytics, databases, data Integration, application integration, AI/Machine Learning, large scale distributed systems across On-Premise and Cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica and Expedia.com.



Jenny Chen is a senior database engineer at Amazon Redshift focusing on all aspects of Redshift performance, like Query Processing, Concurrency, Distributed system, Storage, OS and many more. She works together with development team to ensure of delivering highest performance, scalable and easy-of-use database for customer. Prior to her career in cloud data warehouse, she has 10-year of experience in enterprise database DB2 for z/OS in IBM with focus on query optimization, query performance and system performance.


Sushim Mitra is a software development engineer on the Amazon Redshift query processing team. His interest areas are Query Optimization problems, SQL Language features and Database security. When not at work, he enjoys reading fiction from all over the world.




Suzhen Lin is a senior software development engineer on the Amazon Redshift transaction processing and storage team. Suzhen Lin has over 15 years of experiences in industry leading analytical database products including AWS Redshift, Gauss MPPDB, Azure SQL Data Warehouse and Teradata as senior architect and developer. Her experiences cover storage, transaction processing, query processing, memory/disk caching and etc in on-premise/cloud database management systems.



Architecting for database encryption on AWS

Post Syndicated from Jonathan Jenkyn original https://aws.amazon.com/blogs/security/architecting-for-database-encryption-on-aws/

In this post, I review the options you have to protect your customer data when migrating or building new databases in Amazon Web Services (AWS). I focus on how you can support sensitive workloads in ways that help you maintain compliance and regulatory obligations, and meet security objectives.

Understanding transparent data encryption

I commonly see enterprise customers migrating existing databases straight from on-premises to AWS without reviewing their design. This might seem simpler and faster, but they miss the opportunity to review the scalability, cost-savings, and feature capability of native cloud services. A straight lift and shift migration can also create unnecessary operational overheads, carry-over unneeded complexity, and result in more time spent troubleshooting and responding to events over time.

One example is when enterprise customers who are using Transparent Data Encryption (TDE) or Extensible Key Management (EKM) technologies want to reuse the same technologies in their migration to AWS. TDE and EKM are database technologies that encrypt and decrypt database records as the records are written and read to the underlying storage medium. Customers use TDE features in Microsoft SQL Server, Oracle 10g and 11g, and Oracle Enterprise Edition to meet requirements for data-at-rest encryption. This shouldn’t mean that TDE is the requirement. It’s infrequent that an organizational policy or compliance framework specifies a technology such as TDE in the actual requirement. For example, the Payment Card Industry Data Security Standard (PCI-DSS) standard requires that sensitive data must be protected using “Strong cryptography with associated key-management processes and procedures.” Nowhere does PCI-DSS endorse or require the use of a specific technology.

Understanding risks

It’s important that you understand the risks that encryption-at-rest mitigates before selecting a technology to use. Encryption-at-rest, in the context of databases, generally manages the risk that one of the disks used to store database data is physically stolen and thus compromised. In on-premises scenarios, TDE is an effective technology used to manage this risk. All data from the database—up to and including the disk—is encrypted. The database manages all key management and cryptographic operations. You can also use TDE with a hardware security module (HSM) so that the keys and cryptography for the database are managed outside of the database itself. In TDE implementations, the HSM is used only to manage the key encryption keys (KEK), and not the data encryption keys (DEK) themselves. The DEKs are in volatile memory in the database at runtime, and so the cryptographic operations occur on the database itself.

You can also use native operating system encryption technologies such as dm-crypt or LUKS (Linux Unified Key Setup). Dm-crypt is a full disk encryption (FDE) subsystem in Linux kernel version 2.6 and beyond. Dm-crypt can be used on its own or with LUKS as an extension to add more features. When using dm-crypt, the operating system kernel is responsible for encrypting and decrypting data as it’s written and read from the attached volumes. This would achieve the same outcome as TDE—data written and read to the disk volume is encrypted, and the risk related to physical disk compromise is managed. DEKs are in runtime memory of the machine running the database.

With some TDE implementations, you can encrypt tables, rows, columns, and cells with different DEKs to achieve granular separation of duties between operators. Customers can then configure TDE to authorize access to each DEK based on database login credentials and job function, helping to manage risks associated with unauthorized access. However, the most common configuration I’ve seen is to rely on whole database encryption when using TDE. This configuration gives similar protection against the identified risks as dm-crypt with LUKS used without an HSM, since the DEKs and KEKs are stored within the instance in both cases and the result is that the database data on disk is encrypted.

Using encryption to manage data at rest risks in AWS

When you move to AWS, you gain additional security capabilities that can simplify your security implementations. Since the announcement of the AWS Key Management Service (AWS KMS) in 2014, it has been tightly integrated with Amazon Elastic Block Store (Amazon EBS), Amazon Simple Storage Service (Amazon S3), and dozens of other services on AWS. This means that data is encrypted on disk by checking a single check box. Furthermore, you get the benefits of AWS KMS for key management and cryptographic operations, while being transparent to the Amazon Elastic Compute Cloud (Amazon EC2) instance where the data is being encrypted and decrypted. For simplicity, the authorization for access to the data is managed entirely by AWS Identity and Access Management (IAM) and AWS KMS key resource policies.

If you need more granular access control to the data, you can use the AWS Encryption SDK to encrypt data at the application layer. That provides the same effect as TDE cell-level protection, with a FIPS140-2 Level 2 validated HSM, as might be required by a recognizing standard.

If you must use a FIPS140-2 Level 3 validated HSM to meet more stringent compliance standards or regulations, then you can use the Custom Key Store capability of AWS KMS to achieve that—again in a transparent way. This option has a trade-off, as there is additional operational overhead in terms of managing an AWS CloudHSM cluster.

Many customers choose to migrate their database into the managed Amazon Relational Database Service (Amazon RDS), rather than managing the database instance themselves. Like the Amazon EC2 service, RDS uses Amazon EBS volumes for its data storage, and so can seamlessly use AWS KMS for encryption at rest functionality. When you do so, your management overhead for the protection of data-at-rest reduces to almost zero. This lets you focus on business value while AWS is responsible for the management of your database and the protection of the underlying data. The next section reviews this option and others in more detail.

You can review the available Amazon RDS database engines and versions via the Amazon RDS User Guide documentation, or by running the following AWS Command Line Interface (AWS CLI) command:

aws rds describe-db-engine-versions --query "DBEngineVersions[].DBEngineVersionDescription" --region <regionIdentifier>

Recommended Solutions

If you’re moving an existing database to AWS, you have the following solutions for data at rest encryption. I go into more detail for each option below.

Table 1 – Encryption options

OptionDatabase managementHostEncryptionKey management
1Amazon managedAmazon RDSAmazon EBSAWS KMS
2Amazon managedAmazon RDSAmazon EBSAWS KMS Custom Key Store
3Customer managedAmazon EC2Amazon EBSAWS KMS
4Customer managedAmazon EC2Amazon EBSAWS KMS Custom Key Store
5Customer managedAmazon EC2Amazon EBSLUKS
6Customer managedAmazon EC2DatabaseDatabase TDE
7Customer managedAmazon EC2DatabaseCloudHSM

Option 1 – Using Amazon RDS with Amazon EBS encryption and key management provided by AWS KMS

This approach uses the Amazon RDS service where AWS manages the operating system and database engine. You can configure this service to be a highly scalable resource spanning multiple Availability Zones within an AWS Region to provide resiliency. AWS KMS manages the keys that are used to encrypt the attached Amazon EBS volumes at rest.

Note: This configuration is recommended as your default database encryption approach.


  • No key management requirement on host; key management is automated and performed by AWS KMS
  • Meets FIPS140-2 Level 2 validation requirements
  • Simple vertical and horizontal scalability
  • Snapshots for recovery are encrypted automatically
  • AWS manages the patching, maintenance, and configuration of the operating system and database engine
  • Well-recognized configuration, with support offered through AWS Support
  • AWS KMS costs are comparatively low


  • Dependent on Amazon RDS supported engines and versions
  • Might require additional controls to manage unauthorized access at table, row, column, or cell level

Option 2 – Using Amazon RDS with Amazon EBS encryption and key management provided by AWS KMS custom key store

This approach uses the Amazon RDS service where AWS manages the operating system and database engine. You can configure this service to be a highly scalable resource spanning multiple Availability Zones within a Region to provide resiliency. CloudHSM keys are used via AWS KMS service integration to encrypt the Amazon EBS volumes at rest.

Note: This configuration is recommended where FIPS140-2 Level 3 validation is a specified compliance requirement.


  • No key management requirement on host; key management is performed by AWS KMS
  • Meets FIPS140-2 Level 3 validation requirements
  • Simple vertical and horizontal scalability
  • Snapshots for recovery are encrypted automatically
  • AWS manages the patching, maintenance, and configuration of the database engine
  • Well-recognized configuration with support offered through AWS Support


  • Dependent on Amazon RDS supported engines and versions
  • You are responsible for provisioning, configuration, scaling, maintenance, and costs of running CloudHSM cluster
  • Might require additional controls to manage unauthorized access at table, row, column or cell level

Option 3 – Customer-managed database platform hosted on Amazon EC2 with Amazon EBS encryption and key management provided by KMS

In this approach, the key difference is that you’re responsible for managing the EC2 instances, operating systems, and database engines. You can still configure your databases to be highly scalable resources spanning multiple Availability Zones within a Region to provide resiliency, but it takes more effort. AWS KMS manages the keys that are used to encrypt the attached Amazon EBS volumes at rest.

Note: This configuration is recommended when Amazon RDS doesn’t support the desired database engine type or version.


  • A 1:1 relationship for migration of database engine configuration
  • Key rotation and management is handled transparently by AWS
  • Data encryption keys are managed by the hypervisor, not by your EC2 instance
  • AWS KMS costs are comparatively low


  • You’re responsible for patching and updates of the database engine and OS
  • Might require additional controls to manage unauthorized access at table, row, column, or cell level

Option 4 – Customer-managed database platform hosted on Amazon EC2 with Amazon EBS encryption and key management provided by KMS custom key store

In this approach, you are again responsible for managing the EC2 instances, operating systems, and database engines. You can still configure your databases to be highly scalable resources spanning multiple Availability Zones within a Region to provide resiliency, but it takes more effort. And similar to Option 2, CloudHSM keys are used via AWS KMS service integration to encrypt the Amazon EBS volumes at rest.

Note: This configuration is recommended when Amazon RDS doesn’t support the desired database engine type or version and when FIPS140-2 Level 3 compliance is required.


  • A 1:1 relationship for migration of database engine configuration
  • Data encryption keys managed by the hypervisor, not by your EC2 instance
  • Keys managed by FIPS140-2 Level 3 validated HSM


  • You’re responsible for provisioning, configuration, scaling, maintenance, and costs of running CloudHSM cluster
  • You’re responsible for patching and updates of the database engine and OS
  • Might require additional controls to manage unauthorized access at table, row, column, or cell level

Option 5 – Customer-managed database platform hosted on Amazon EC2 with Amazon EBS encryption and key management provided by LUKS

In this approach, you’re still responsible for managing the EC2 instances, operating systems, and database engines. You also need to install LUKS onto the Linux instance to manage the encryption of data on Amazon EBS.


  • A 1:1 relationship for migration of database engine configuration
  • Transparent encryption is managed by OS with LUKS


  • You’re responsible for patching and updates of the database engine and OS
  • Data encryption keys are managed directly on the EC2 instance, and not a dedicated key management system
  • Scaling must be vertical, which is slow and costly
  • LUKS is supported through open-source licensing
  • Support for backup and recovery is LUKS specific, and require additional consideration
  • Might require additional controls to manage unauthorized access at table, row, column or cell level

Note: This approach limits you to only Linux instances and requires the most technical knowledge and effort on your part. Options, such as BitLocker and SQL Server Always Encrypted, exist for Windows hosts, and the complexity and challenges are similar to those of LUKS.

Option 6 – Customer-managed database platform hosted on Amazon EC2 with database encryption and key management provided by TDE

In this approach, you’re still responsible for managing the EC2 instances, operating systems, and database engines. However, instead of encrypting the Amazon EBS volume where the database is stored, you use TDE wallet keys managed by the database engine to encrypt and decrypt records as they are stored and retrieved.


  • A 1:1 relationship for migration of database engine configuration
  • Table, row, column, and cell level encryption are managed by TDE, reducing end point risks relating to unauthorized access


  • You’re responsible for patching and updates of the database engine and OS
  • Costly license for TDE feature
  • Data encryption keys are managed directly on the EC2 instance
  • Scaling is dependent on TDE functionality and Amazon EC2 scaling
  • Support is split between AWS and a third-party database vendor
  • Cannot share snapshots

Note: This approach is not available with Amazon RDS.

Option 7 – Customer-managed database platform hosted on Amazon EC2 with database encryption performed by TDE and key management provided by CloudHSM

In this approach, you’re still responsible for managing the EC2 instances, operating systems, and database engines. However, instead of encrypting the Amazon EBS volume where the database is stored, you use TDE wallet keys managed by a CloudHSM cluster to encrypt and decrypt records as they are stored and retrieved.


  • A 1:1 relationship for migration of database engine configuration
  • Wallet keys (KEK) are managed by a FIPS140-2 Level 3 validated HSM
  • Table, row, column, and cell level encryption are managed by TDE, reducing end point risks relating to unauthorized access


  • You’re responsible for patching and updates of the database engine and OS
  • Costly license for TDE feature
  • You are responsible for provisioning, configuration, scaling, maintenance, and costs of running CloudHSM cluster
  • Integration and support of CloudHSM with TDE might vary
  • Scaling is dependent on TDE functionality, Amazon EC2 scaling, and CloudHSM cluster.
  • Data encryption keys are managed on EC2 instance
  • Support is split between AWS and a third-party database vendor
  • Cannot share snapshots

Note: This approach is not available with Amazon RDS.


While you can operate in AWS similar to how you operate in your on-premises environment, the preceding configurations and recommendations show how you can significantly reduce your challenges and increase your benefits by using cloud-native security services like AWS KMS, Amazon RDS, and CloudHSM. Specifically, using Amazon RDS with Amazon EBS volumes encrypted by AWS KMS provides a highly scalable, resilient, and secure way to manage your keys in AWS.

While there might be some architectural redesign and configuration work needed to move an on-premises database into Amazon RDS, you can leverage AWS services to help you meet your compliance requirements with less effort. By offloading the OS and database maintenance responsibility to AWS, you simultaneously reduce operational friction and increase security. By migrating this way, you can benefit from the scalability and resilience of the AWS global infrastructure and expertise. Lastly, to get started with migrating your database to AWS, I encourage you to use the AWS Database Migration Service.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Jonathan Jenkyn

Jonathan is a Senior Security Growth Strategies Consultant with AWS Professional Services. He’s an active member of the People with Disabilities affinity group, and has built several Amazon initiatives supporting charities and social responsibility causes. Since 1998, he has been involved in IT Security at many levels, from implementation of cryptographic primitives to managing enterprise security governance. Outside of work, he enjoys running, cycling, fund-raising for the BHF and Ipswich Hospital Charity, and spending time with his wife and 5 children.


Scott Conklin

Scott is a Senior Security Consultant with AWS Professional Services (Global Specialty Practice). Based out of Chicago with 4 years tenure, he is an avid distance runner, crypto nerd, lover of unicorns, and enjoys camping, nature, playing Minecraft with his 3 kids, and binge watching Amazon Prime with his wife.

Migrating IBM Netezza to Amazon Redshift using the AWS Schema Conversion Tool

Post Syndicated from Mattia Berlusconi original https://aws.amazon.com/blogs/big-data/migrating-ibm-netezza-to-amazon-redshift-with-the-aws-sct/

The post How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime described a high-level strategy to move from an on-premises Netezza data warehouse to Amazon Redshift. In this post, we explain how a large European Enterprise customer implemented a Netezza migration strategy spanning multiple environments, using the AWS Schema Conversion Tool (AWS SCT) to accelerate schema and data migration. We also walk you through validating that the schema and data content were migrated as expected and followed Amazon Redshift best practices.

Solution overview

It’s important to build a migration plan unique to your organization’s processes and non-functional requirements. The following plan is a real-world use case from a large European Enterprise customer. It details the different environments migrated to and the tasks, tools, and scripts used to complete the work:

  1. Assess migration tasks
    1. Understand the scope of the migration
    2. Record objects to be migrated into a migration runbook
  2. Set up the migration environment
    1. Install AWS SCT
    2. Configure AWS SCT for Netezza source environments
  3. Migrate to the development environment
    1. Create users, groups, and schema
    2. Convert schema
    3. Migrate data
    4. Validate data
    5. Transform ETL, UDF, and procedures
  4. Migrate to other pre-production environments
    1. Create users, groups, and schema
    2. Convert schema
    3. Migrate data
    4. Validate data
    5. Transform ETL, UDF, and procedures
  5. Migrate to the production environment
    1. Create users, groups, and schema
    2. Convert schema
    3. Migrate data
    4. Validate data
    5. Transform ETL, UDF, and procedures
    6. Business validation (including optional dual-running)
    7. Cut over

Assessing migration tasks

To plan and keep track of the migration tasks, you should produce a tracker of all the Netezza databases, tables, and views in scope. This information forms a migration runbook that is updated during the migration to document the progress of data migration from Netezza to Amazon Redshift. For each table identified, record the number of rows and size in GB.

Some Netezza source systems contain two Netezza data warehouses, for example one for ETL loading throughout the day and one for end-user reporting users. Make sure it’s clear which data warehouses are in scope for the migration.

Setting up the migration environment

The migration strategy uses the AWS SCT to accelerate schema object conversion and migrate the data from the Netezza database to the Amazon Redshift cluster. The following diagram illustrates this architecture.

The migration should ensure the following:

  • The AWS SCT is installed within the AWS account onto an Amazon Elastic Compute Cloud (Amazon EC2) instance to facilitate migration operations, orchestrate the AWS SCT data extraction agents, and provide access via a user-friendly console.
  • The AWS SCT data extraction agents are installed and run as close to the Netezza data warehouse as possible. AWS strongly recommends installing them on premises within the same subnet as the Netezza data warehouse.

During the transfer of data from the on-premises data center to the AWS account, you can use either a direct connection or offline storage. AWS Snowball is a petabyte-scale offline solution for moving large amounts of data into the AWS account where sufficient bandwidth of a direct connection isn’t available. AWS Direct Connect is a cloud service solution that makes it easy to establish a dedicated network connection from your premises to an AWS account. You can establish private connectivity between your AWS account and your data center, office, or co-location environment by using Direct Connect, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections. Using Direct Connect also adds flexibility in case extract jobs need to be re-run.

Configuring AWS SCT for the Netezza source environment

The AWS SCT is installed on an EC2 instance running Microsoft Windows 10 with administrator privileges. Choosing Microsoft Windows as the operating system allows your users to graphically control the creation of projects, modify profiles, start and view the progress of the conversions, and view the output of the migration assessment reports.

Because you don’t perform the data migration directly on the AWS SCT console, a general purpose EC2 instance with 4 vCPU, 16 GB memory, 100 GB storage, and moderate network bandwidth is sufficient.

You should configure several AWS SCT data extraction agents to match the amount of data to be concurrently transferred and the number of Netezza connections available. You can install the data extraction agents on on-premises VM instances running Linux with root administration privileges. The size of each instance is 8 vCPU, 32 GB memory, and up to 10 Gb network capacity. For disk storage, we use 1TB of 500 IOPS Provisioned SSD because intermediate results are stored on disk.

It’s preferable that the on-premises instances are located as close as possible to the Netezza data warehouse, ideally only a single network hop away. This is important because each data extraction agent creates a table on the instance file system as storage for the extracted data. Also, for each agent, the CPU chosen is more powerful because the compression of the extracted data is processor intensive.

As stated earlier, the number of agents should be proportionate to the amount of concurrent data streams being transferred and the number of Netezza connections available for the transfer. A rule of thumb is to have one data extraction agent for each TB of compressed Netezza data to be migrated in parallel. For optimum performance, it’s recommended that each agent is installed on a single VM instance.

You should work with the DBA team to ensure as many Netezza concurrent connections are made available to the data extraction agents as possible. For the best performance, allocating all the available connections gives all the power of the source database, but if you need to run workloads in parallel with the data extracts, asking for a smaller amount (for example, 21) can suffice. This is a trade-off between resources available against the time required to migrate the data.

For this use case, we allocated seven extraction agents, because the largest project phase extracted 6 TB of Netezza data. The DBA team configured 21 Netezza concurrent connections, so each agent was configured with three parallel data extraction processes (known as threads; see the following configuration file).

Two parameters on the data extraction agents can impact the length of time it takes for the data to migrate from Netezza to the agents: the number of connections and the number of threads.

Tuning is required for each data extraction agent to maximize throughput during the data migration phase. Tuning is achieved by modifying the file /usr/share/aws/sct-extractor/conf/settings.properties, and the file must be applied against each agent. See the following code:

# Number of connections in the pool per agent

# Number of threads per agent

The preceding code has the following features:

  • extractor.source.connection.pool.size defines the number of connections the agent opens against the Netezza data warehouse.
  • extractor.extracting.thread.pool.size defines the number of parallel jobs the agent can spawn concurrently. The sum of this parameter for all the agents should be smaller than the maximum concurrent connections configured from Netezza.
  • It’s an AWS recommendation to have extractor.source.connection.pool.size 1.5 times larger than extractor.extracting.thread.pool.size. This is because while a task is running, the AWS SCT may need additional connections to retrieve metadata from Netezza to create additional tasks or other operations, such as to collect table statistics.

Migrating to the development environment

The first task to undertake is data model schema transformation. It consists of transforming the Netezza schema objects into Amazon Redshift-compliant syntax and deploying them into the Amazon Redshift development environment. Before migrating the Netezza tables and views, you must create the schemas, groups, and users.

Creating schemas, users, and groups

If you don’t follow this step, all the objects are created in the Amazon Redshift public schema, which isn’t recommended. The following best practices aren’t specific to Netezza migration, but you can use them as a checklist during this step:

  • Create schemas to logically separate views and tables.
  • Groups are easier to maintain than many users because you can grant permissions to groups, and you can add and remove users from groups. Also, groups can direct all traffic from all users in the group to a specific Amazon Redshift WLM queue (which can control priorities as well as QMR limits).
  • Grant permissions at the schema level to allow selected groups to access the schema. This is independent of the permissions for the objects within the schema.
  • Finally, assign users to groups.

Transforming the schema

The AWS SCT analyzes the Netezza data model schema, converts the syntax into Amazon Redshift-compliant DDL statements, and applies the target schema to the Amazon Redshift cluster. The AWS SCT accelerates this phase by making sure Amazon Redshift best practices are taken into account during the transformation.

Within Amazon Redshift, column-level encoding makes sure that the most performant level of compression is applied to every data block of storage for the tables. It’s recommended that the latest ZSTD encoding is applied to all varchar, char, Boolean, and geometry columns, and the AZ64 encoding is applied to all other columns, including integers and decimals.

To improve zone map performance, don’t encode the first column of a sort key (set to raw encoding).

Netezza supports both character-length and byte-length semantics.

If character length semantics (the default) is selected, the length is specified in terms of characters, and it can consume more bytes than the length indicates. For example, if the varchar datatype length is set to 100, it allows multi-byte characters from 1–4 bytes to a maximum of 400 bytes.

If the bytes length semantics is selected, length is specified in terms of bytes. It can support only the number of bytes specified in the varchar. For example, if the varchar datatype is set to 100 only, it allows storing characters up to 100 bytes. This includes single byte and multi byte.

As of this writing, Amazon Redshift doesn’t support character-length semantics, which can lead to String length exceeds DDL length errors while loading the data into Amazon Redshift tables. The simplest solution is to multiply the length of such attributes by 4. A more efficient solution requires determining the maximum length of each varchar column in bytes in Netezza, adding an additional 20% buffer to the maximum length, and setting that as the maximum value for the Amazon Redshift varchar datatype column.

If the Netezza column maximum length in bytes is less than Amazon Redshift column length in bytes, you don’t need to increase the size of the column length in Amazon Redshift. The following query gets the column datatype in Netezza:

	AND name IN ('tablename')

The following script generates a query to get the maximum amount of bytes actually used for each varchar column:

	'Select max(octet_length(' || ATTNAME || ')) from ' || DATABASE || '.' || name || ';'
	AND name IN ('TABLE')

During data migration, you can use the following query to identify the reason for the load failure:

	stl_load_errors err,
	svv_table_info ti
	starttime > 'YYYY-MM-DD'
	AND ti.table_id = err.tbl
	AND err.err_reason = 'String length exceeds DDL length'
	starttime DESC;

If the error reason is String length exceeds DDL length, you need to increase the length of the affected column.

As recommended earlier, based on the maximum column length in Netezza, you should add an additional 20% buffer to it and set that as maximum length to Amazon Redshift.

The following in Netezza is example output from the preceding SQL command:


The following code is output in Amazon Redshift:

ALTER TABLE schema_a.table_a 
ALTER COLUMN column_a TYPE varchar(60);

AWS SCT uses statistics from the source database with user-specified optimization strategies to determine the appropriate distribution key and sort key strategies for the target schema. These optimization strategies require collecting statistics from the source database in order to activate the most relevant optimization rule for each table.

It’s recommended to do the following:

  • Choose the current Netezza key distribution style as a good starting point for an Amazon Redshift table’s key distribution strategy. When the table is within Amazon Redshift with representative workloads, you can optimize the distribution choice if needed.
  • Set the Amazon Redshift distribution style to auto for all Netezza tables with random distribution. This makes sure that Amazon Redshift automatically chooses the most performant distribution style depending on the number of rows in the table.

Migrating the data

You use the AWS SCT to migrate the data from the source Netezza data warehouse to the Amazon Redshift cluster. The AWS SCT migrates data with a three-phase approach:

  • Extract – Extracts data from Netezza and stores it into the file system of on-premises AWS SCT data extraction agents
  • Upload – Uploads data from the agents to Amazon Simple Storage Service (Amazon S3)
  • Copy – Loads the data from Amazon S3 into Amazon Redshift via the COPY command

For any migration, especially ones with large volumes of data or many objects to migrate, it’s important to plan and migrate the tables in smaller tasks. This is where tracking the runs and progress via the migration runbook from the assessment phase is important.

Segment the source tables based on their size. The following choices were successful for a 60 TB Netezza migration:

  • One AWS SCT task for all tables less than 5 GB
  • One AWS SCT task for all tables 5–15 GB
  • Multiple AWS SCT tasks for tables under 50 GB; a few tables per task
  • One AWS SCT task for each table bigger than 50 GB

You should refine configuration according to the available migration windows. The approach ensures the following:

  • A task is an atomic process; if it succeeds, all the managed tables are migrated successfully
  • If it fails, it might be more convenient to run the entire task from scratch rather than double-check the status and consistency of each table
  • Task size should trade off between the mentioned opposite poles

To manage the substitution of special characters during these phases, set the following parameters:

  • For NULL values as a string, enter ~~~~. By default, this is not checked. Numeric and date type nulls are by default extracted as ‘\N’ and loaded to Amazon Redshift as nulls.
    • If it is unchecked or if it is checked and value is left black, AWS SCT extracts char/varchar type null as ‘\N’ and the COPY command has the NULL AS ‘\N’ parameter set. This causes issues during COPY operations when we have data with value ‘N’ in any column.
    • If checked and value is ~~~~, AWS SCT extracts char and varchar type null as ~~~~ and the COPY command has the NULL AS ~~~~ parameter set. Using junk characters (such as ~~~~) extracts char and varchar null values as ~~~~, and the COPY command replaces and loads ~~~~ as NULL. This way, we can extract and load char and varchar null values. This doesn’t cause issues during COPY when we have data with the value ‘N’ in any column.
    • If checked and value is ”, AWS SCT extracts the char and varchar type null as ” and the COPY command has the NULL AS ” parameter set. NULL AS ” is equivalent to EMPTYASNULL.
  • Deselect Use blank as null value. If BLANKASNULL is set (which is default setting), it replaces white space characters (‘ ‘) with NULL for char and varchar datatypes, and if the column is NOT NULL, inserting NULL fails. Deselecting BLANKASNULL loads the data as it is in the source.
  • Deselect Use empty as null value. If EMPTYASNULL is set (which is default setting), it replaces empty data (two delimiters in succession with no characters between the delimiters) with NULL for char and varchar datatypes. This is not needed.

The following screenshot shows our configuration for the AWS SCT tasks.

To keep track of the tasks and record them accurately in the migration runbook, on the AWS S3 settings tab, set the folder name to be the same as the task name. Using a consistent naming convention allows easier tracking of progress in the runbook, and is useful during troubleshooting for any issues encountered.

For each subject area in scope, the extraction can either occur while sharing the connections and threads with other process during the day, or it’s recommended for the initial data load to schedule the tasks during the evening, weekend, or agreed schedule with as many Netezza resources as possible.

Breaking the migration down into smaller tasks allows you to log the progress in the migration runbook and run individual tasks to completion during the allocated migration window.

It’s recommended to migrate a small sample table first to test the parameter settings. The following sample table contains specific examples of edge cases that can provide quick feedback as to the suitability of the parameter settings:

create table <schema>.test_dummy
       idrow integer,
       field1 integer,
       field2 character varying (50)
insert into <schema>.test_dummy (idrow, field1, field2) values (1, null, null);
insert into <schema>.test_dummy (idrow, field1, field2) values (2, null, '');
insert into <schema>.test_dummy (idrow, field1, field2) values (3, null, '    ');
insert into <schema>.test_dummy (idrow, field1, field2) values (4, 34, '  Test4   ');
insert into <schema>.test_dummy (idrow, field1, field2) values (5, 15, '');
insert into <schema>.test_dummy (idrow, field1, field2) values (6, 25, '   ');
insert into <schema>.test_dummy (idrow, field1, field2) values (7, 655, 'Test7');

When migrating large Netezza tables, data is migrated on a table-by-table basis using multiple data extraction agents. You should split large tables (for example, tables with more than 20 million rows or greater than 50 GB) into partitions using the AWS SCT virtual partitions functionality. Using virtual partitioning is a recommended best practice for data warehouse migrations using the AWS SCT extractors.

Virtual partitions decrease the migration timeline of a table by parallelizing the extraction of a configurable amount of subsections. You can migrate partitions in parallel, and extract failure is limited to a single partition instead of the entire table.

The AWS SCT creates a subtask for each table partition. Then, when the migration is running, AWS SCT assigns the subtask to an available data extractor to run. The AWS SCT orchestrates which subtask runs on which extractor, thereby keeping all extractors as busy as possible throughout the migration.

To use virtual partitioning, you should identify an attribute that you can use to evenly split the table. It’s important that the virtual partitions are well balanced in order to exploit the benefit of the parallelism. The AWS SCT usually virtually defines such partitions at extraction time—virtual partitions aren’t related to how data is stored into the source data warehouse.

AWS SCT provides three types of virtual partitioning: list, range, and auto split. For more information, see Use virtual partitioning in the AWS Schema Conversion Tool.

When using list partitioning, for very big tables (over 100 GB), the Netezza data slice IDs are an option for the partition key.

Migrating to other pre-production environments

After the data migration has successfully been proven in the development environment, you may choose to migrate to other pre-production environments. Apply the same steps and validation checks, including:

  • Validate that the schema deployment matches the development environment.
  • Validate the data migration has completed successfully, and that no data load errors are logged into the STL_LOAD_ERRORS table. The typical reasons for errors at this stage include schema mismatch, different input file formats, or insufficient varchar length for the input data.
  • Validate the ETL deployment is loading the data as expected.

Migrating to the production environment

Migration to the production environment follows the same processes as the non-production environments, with the addition of the following steps:

  • Undertake the task of business validation with your stakeholders to measure the accuracy of the migration in meeting the program goals:
    1. Undertake a period of dual-running the ETL deployment with production data being dual-loaded into the Netezza data warehouse and the production Amazon Redshift cluster.
    2. Compare the results sets from the Netezza data warehouse and the production Amazon Redshift cluster (the data validation scripts in the following section support this task).
    3. Update the migration runbook for each source table to record the number of records migrated, which validation checks have been run, and any discrepancies found during the checks.
    4. Run reports and dashboards against the Netezza data warehouse and the production Amazon Redshift cluster and ensure the results match.
    5. Obtain sign-off upon successful completion of these business validation tests.
  • After you successfully complete the dual-running of both ETL and reporting deployments, the source of truth is transferred from the Netezza data warehouse to the production Amazon Redshift cluster by decommissioning the Netezza ETL deployment and the Netezza data warehouse, and re-pointing all reporting and dashboard connections to the Amazon Redshift cluster.
  • When the Amazon Redshift cluster is live, monitor the cluster and ensure data model best practices are being followed.

Validating the data

After you migrate the data model schema and data contents to Amazon Redshift, you should perform data-validation tests to measure the migration’s success. The scripts included in this section cover checks commonly undertaken during migration engagements. All these scripts must be run by a superuser account.

Amazon Redshift utilities

The Amazon Redshift Utilities GitHub repo contains a set of utilities to accelerate troubleshooting or analysis on Amazon Redshift. Such utilities consist of queries, views, and scripts. These scripts aren’t deployed by default into Amazon Redshift clusters. The recommendation is to deploy the views into an admin schema.

Comparing source vs. target table and view counts

For Netezza, enter the following code:

SET CATALOG <database_name>;

	'<schema_name>' ,
	sum(CASE OBJTYPE WHEN 'TABLE' THEN 1 ELSE 0 END) AS table_count ,
	AND OBJNAME IN ('<table_name>')

For Amazon Redshift, enter the following code:

	trim(pg_namespace.nspname) AS schema_name ,
	sum(CASE pg_class.relkind WHEN 'r' THEN 1 ELSE 0 END) AS table_count ,
	sum(CASE pg_class.relkind WHEN 'v' THEN 1 ELSE 0 END) AS view_count
	pg_class.relnamespace = pg_namespace.oid
	AND pg_class.relkind IN ('r', 'v')
	AND schema_name IN ('<schema_name>')
	AND pg_class.relname IN ('<table_name>')

Comparing source vs. target table constraints

For Netezza, enter the following code:

SET CATALOG <database_name>;

	ISNULL(pk_count, 0) pk_count,
	ISNULL(fk_count, 0) fk_count,
	ISNULL(uk_count, 0) uk_count,
	ISNULL(ck_count, 0) ck_count,
	ISNULL(nn_t_count, 0) nn_count,
	ISNULL(pk_count, 0)+ ISNULL(fk_count, 0)+ ISNULL(uk_count, 0)+ ISNULL(ck_count, 0)+ ISNULL(nn_t_count, 0) Total_Count
		database, Schema_Name, Table_Name, Table_Id, sum(CASE Constraint_Type WHEN 'p' THEN 1 ELSE 0 END) AS pk_count, sum(CASE Constraint_Type WHEN 'f' THEN 1 ELSE 0 END) AS fk_count, sum(CASE Constraint_Type WHEN 'u' THEN 1 ELSE 0 END) AS uk_count, sum(CASE Constraint_Type WHEN 'c' THEN 1 ELSE 0 END) AS ck_count
			DISTINCT database, SCHEMA Schema_Name, relation Table_Name, contype Constraint_Type, constraintname, objid Table_Id
			_v_relation_keydata) in1
		database, Schema_Name, Table_Name, Table_Id) oc
		database, SCHEMA Schema_Name, name Table_Name, objid Table_Id, sum(CASE attnotnull WHEN TRUE THEN 1 ELSE 0 END) AS nn_t_count, sum(CASE attnotnull WHEN FALSE THEN 1 ELSE 0 END) AS nn_f_count, count(attnotnull) nn_total_count
		AND attnum>0
		database, Schema_Name, Table_Name, Table_Id) nc ON
	(oc.Table_Id = nc.Table_Id)
	nc.database = '<database_name>'
	AND nc.table_name IN ('<table_name>')
	1, 2;

For Amazon Redshift, enter the following code:

	ISNULL(nn_count, 0) nn_count,
	pk_count + fk_count + uk_count + ck_count + ISNULL(nn_count, 0) Total_Count
		Schema_Name, Table_Name, Table_Id, sum(CASE Constraint_Type WHEN 'p' THEN 1 ELSE 0 END) AS pk_count, sum(CASE Constraint_Type WHEN 'f' THEN 1 ELSE 0 END) AS fk_count, sum(CASE Constraint_Type WHEN 'u' THEN 1 ELSE 0 END) AS uk_count, sum(CASE Constraint_Type WHEN 'c' THEN 1 ELSE 0 END) AS ck_count
			trim(pg_namespace.nspname) Schema_Name, trim(pg_class.relname) Table_Name, trim(pg_constraint.conname) Constraint_Name, pg_constraint.contype Constraint_Type, pg_class.oid Table_Id
		INNER JOIN pg_class ON
			pg_namespace.oid = pg_class.relnamespace
		LEFT OUTER JOIN pg_constraint ON
			pg_constraint.conrelid = pg_class.oid
			schema_name NOT IN ('pg_catalog', 'pg_toast', 'information_schema')
			AND schema_name NOT LIKE '%_ext'
			AND pg_class.relkind = 'r')
		Schema_Name, Table_Name, Table_Id) oc
		attrelid Table_Id, count(attnotnull) nn_count
		attnotnull = TRUE
		AND attnum>0
		Table_Id) nc ON
	(oc.Table_Id = nc.Table_Id)
	schema_name IN ('<schema_name>')
	AND table_name IN ('<table_name>')
	1, 2;
Amazon Redshift

Generating missing constraints from Netezza

Run the following SQL statements in Netezza to generate the DDL statements to add any missing constraints in Amazon Redshift:

-- Generate Primary Key Constraints DDL
SET CATALOG <database_name>;

	'ALTER TABLE <schema_name>.' || relation || ' ADD CONSTRAINT ' || constraintname || ' PRIMARY KEY (' || attname || ')'
	DATABASE = '<database_name>'
	AND relation IN ('<table_name>')
	AND contype = 'p';

-- Generate Unique Key Constraints DDL
SET CATALOG <database_name>;

	'ALTER TABLE <schema_name>.' || relation || ' ADD CONSTRAINT ' || constraintname || ' UNIQUE (' || attname || ')'
	DATABASE = '<database_name>'
	AND relation IN ('<table_name>')
	AND contype = 'u';

-- Generate Foreign Key Constraints DDL  
SET CATALOG <database_name>;

	'ALTER TABLE <schema_name>.' || relation || ' ADD CONSTRAINT ' || constraintname || ' FOREIGN KEY (' || attname || ') REFERENCES <schema_name>.' || pkrelation || '(' || pkattname || ')' refconstrname
	DATABASE = '<database_name>'
	AND relation IN ('<table_name>')
	AND contype = 'f';

Run the generated script against the Amazon Redshift database.

Identifying tables with insufficient varchar column length

For Netezza, enter the following code:

-- Generate SQL for Varchar Column Length in Bytes   
SET CATALOG <database_name>;

	'SELECT ''' || database || ''' database_name,''' || SCHEMA || ''' schema_name,''' || name || ''' table_name, ''' || attname || ''' column_name, ''' || format_type || ''' data_type, ''' || attcolleng || ''' data_type_length_char, ' || 'MAX(OCTET_LENGTH(' || attname || ')) max_bytes FROM ' || SCHEMA || '.' || name || ' UNION ALL'
	AND database = '<database_name>'
	AND name IN ('<table_name>');

For Amazon Redshift, enter the following code:

-- Varchar Column Length in Bytes 
	trim(pg_namespace.nspname) Schema_Name,
	trim(pg_class.relname) Table_Name,
	trim(pg_attribute.attname) Column_Name,
	trim(pg_type.typname) Data_Type,
	pg_attribute.atttypmod-4 Data_Type_Length_Bytes
JOIN pg_type ON
	pg_type.oid = pg_attribute.atttypid
JOIN pg_class ON
	pg_class.oid = pg_attribute.attrelid
JOIN pg_namespace ON
	pg_namespace.oid = pg_class.relnamespace
	trim(pg_type.typname) LIKE 'varchar%'
	AND Data_Type_Length_Bytes <> 1
	AND Schema_Name IN ('<schema_name>')
	AND Table_Name IN ('<table_name>')
	1, 2, 3;

Comparing source vs. target row count

Remove the final UNION ALL from the following two scripts output before running.

For Netezza, enter the following:

SET CATALOG <database_name>;

	'SELECT ''' || database || ''' database_name,''' || SCHEMA || ''' schema_name,''' || tablename || ''' table_name,COUNT(*) count_of_rows from ' || SCHEMA || '.' || tablename || ' UNION ALL'
	AND database = '<database_name>'
	AND tablename IN ('<table_name>');

For Amazon Redshift, enter the following code:

	'SELECT ''' || schema_name || ''' schema_name,''' || table_name || ''' table_name,COUNT(*) count_of_rows from ' || schema_name || '.' || table_name || ' UNION ALL'
		trim(pg_namespace.nspname) schema_name, trim(pg_class.relname) table_name
	INNER JOIN pg_class ON
		pg_namespace.oid = pg_class.relnamespace
		pg_class.relkind = 'r'
		AND schema_name IN ('<schema_name>')
		AND table_name IN ('<table_name>')
		1, 2 );

Comparing source vs. target columns

For Netezza, enter the following code:

SET CATALOG <database_name>;

	SCHEMA Schema_Name,
	name Table_Name,
	attname Column_Name,
	Format_Type Data_Type
	AND attnum>0
	AND database = '<database_name>'
	AND name IN ('<table_name>')

For Amazon Redshift, enter the following code:

SET search_path TO <schema_name1>,<schema_name2>;

	trim(pgn.nspname) AS Schema_Name,
	trim(pgc.relname) AS Table_Name,
	det.attname AS Column_Name,
	def.type Data_Type
	pg_class AS pgc
JOIN pg_namespace AS pgn ON
	pgn.oid = pgc.relnamespace
		attrelid, attname, attnum
		attnum>0) AS det ON
	det.attrelid = pgc.oid
LEFT OUTER JOIN pg_table_def def ON
	(def.schemaname = pgn.nspname
	AND def.tablename = pgc.relname
	AND def."column" = det.attname)
	Schema_Name IN ('<schema_name>')
	AND Table_Name IN ('<table_name>')
	1, 2, 3;

Comparing source vs. target distribution key

For Netezza, enter the following code:

SET CATALOG <database_name>;

	td.attname AS Dist_Key
	_v_table t
LEFT OUTER JOIN _v_table_dist_map td ON
	t.tablename = td.tablename
	AND (td.distseqno IS NULL
	OR td.distseqno = 1)
	t.database = '<database_name>'
	AND t.tablename IN ('<table_name>')

For Amazon Redshift, enter the following code:

	trim(pgn.nspname) AS Schema_Name,
	trim(pgc.relname) AS Table_Name,
	decode(pgc.reldiststyle, 0, 'even', 1, 'key', 8, 'all') AS dist_style,
	det.dist_key AS dist_key
	pg_class AS pgc
JOIN pg_namespace AS pgn ON
	pgn.oid = pgc.relnamespace
		attrelid, min(CASE attisdistkey WHEN 't' THEN attname ELSE NULL END) AS dist_key
		1) AS det ON
	det.attrelid = pgc.oid
	Schema_Name IN ('<schema_name>')
	AND Table_Name IN ('<table_name>')
	1, 2;

Verifying if any invalid UTF-8 characters were replaced

For Amazon Redshift, enter the following code:

-- Validate if any invalid characters are replaced with '?' during COPY

SELECT userid,
FROM   stl_replacements;

Identifying COPY errors

For Amazon Redshift, enter the following code:

	stl_load_errors err,
	svv_table_info ti
	starttime > '<YYYY-MM-DD>'
	AND ti.table_id = err.tbl
	AND ti."table" = '<Table_Name>'
	1, 2, 3 DESC;

Additional data validation checks

In addition to checking the row count for each table, you should perform tests on data quality to guarantee production data readiness:

  • During this activity, run tailored queries and validate them against Amazon Redshift tables and views. The recommendation is to run such checks against records that include NULL values as well as strings including trailing whitespaces.
  • Compute and compare statistics (min, max, average, sum, checksums) on numeric attributes against Netezza equivalents.


In this post, we detailed a project migration plan to migrate from Netezza to Amazon Redshift. We included examples of sizing the AWS SCT data extraction agents depending on the volume of data to migrate and the resources made available for the transfer. Validation of successful schema and data migration is vital, and we included several scripts to validate that the data model and data content meet expectations.

Special thanks go to AWS colleagues Arturo Bayo, Boopathi P, and Sunil Vora for their project delivery and support with this post.

About the Authors

Mattia Berlusconi is a Data & Analytics consultant with AWS Professional Services supporting enterprises in adopting innovative solutions for organizing and exploiting data to achieve their business objectives. He is specialized in building data platforms and managing database migrations.




Simon Dimaline has specialised in data warehousing and data modelling for more than 20 years. He currently works for the Data & Analytics practice within AWS Professional Services accelerating customers’ adoption of AWS analytics services.





Federating Amazon Redshift access from OneLogin

Post Syndicated from Veerendra Nayak original https://aws.amazon.com/blogs/big-data/federating-amazon-redshift-access-from-onelogin/

You can use federation to access AWS accounts using credentials from a corporate directory, utilizing open standards such as SAML, to exchange identity and security information between an identity provider (IdP) and an application.

With this integration, you manage user identities to AWS resources centrally in IdPs. This improves enterprise security and removes the need for separate database users and passwords.

In this post, we walk through the steps required to set up Amazon Redshift user federation from OneLogin. Amazon Redshift supports SAML 2.0, and can be easily configured to integrate with OneLogin. For information about integrating with other IdPs, see Federate Amazon Redshift access with Microsoft Azure AD single sign-on and Federate Amazon Redshift access with Okta as an identity provider, respectively.

Solution overview

Amazon Redshift federated login with OneLogin involves the following steps:

  1. Create a OneLogin SAML application, users, and roles.
  2. Create two AWS Identity and Access Management (IAM) roles to support OneLogin integration with Amazon Redshift:
    1. A role to establish the trust relationship between IdP and AWS.
    2. A role that defines Amazon Redshift access policies.
  3. Edit the OneLogin application configuration and parameters using the AWS roles created in the previous step.
  4. Configure JDBC and ODBC clients to connect to Amazon Redshift using corporate credentials

Setting up your OneLogin user

If you don’t have OneLogin set up, you can sign up for a 30-day free trial.

  1. Sign in to OneLogin using the following URL: https://<orgname>.onelogin.com/admin (<orgname> is the name used when setting up the OneLogin account).
  2. On the Users page (https://<orgname>.onelogin.com/users), choose New User.
  3. On the Applications page, choose Add app.
  4. Choose Amazon Redshift JDBC/ODBC.
  5. After the application is created, choose SSO from the navigation pane.
  6. From the More Actions drop-down menu, choose SAML Metadata.

Setting up IAM

In this step, you configure your IdP in IAM and create roles to support OneLogin integration with Amazon Redshift.

Configuring IdP in IAM

To configure your IdP, complete the following steps:

  1. On the IAM console, choose Identity providers.
  2. Choose Create Provider.
  3. For Provider Type, choose SAML.
  4. For Provider Name, enter OneloginRedshift.
  5. For Metadata Document, choose the file that you downloaded in the earlier step.
  6. Choose Next.
  7. Choose Create.

Creating your IAM role

In this step, you create a new IAM role that users federated from OneLogin can assume.

  1. On the IAM console, create an IAM policy with the following permissions. In this policy, we allow Amazon Redshift to query data, create users, and allow users to join groups. For this use case, the sales and marketing groups are already created in Redshift.
    	"Version": "2012-10-17",
    	"Statement": [{
        	"Effect": "Allow",
           	"Action": [
           	"Resource": [

  2. On the Roles page, choose Create role.
  3. For Role name, enter OneloginRedshiftCluster.
  4. For Role description, enter a description.
  5. For Trusted entities, choose Redshift.
  6. Choose Next: Permissions.
  7. Choose the policy you created earlier (OneloginCustomPolicy).
  8. Choose Create role.

In the next steps, we edit the trust relationships.

  1. On the Summary page for your role, choose Edit trust relationship.
  2. Add the following policy document:
      "Version": "2012-10-17",
      "Statement": [
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::<account>:saml-provider/OneloginRedshift"
          "Action": "sts:AssumeRoleWithSAML",
          "Condition": {
              "StringLike": {
                  "SAML:aud": "http://localhost:7890/redshift/"

Setting up your OneLogin application

In this step, you edit the OneLogin application configuration and parameters using the AWS roles created in the previous step.

  1. Go to your application in OneLogin and confirm Redshift Local Host URL is set to http://localhost:7890/redshift/.
  2. On the Parameters page, add the following fields:
Field NameValueFlags: Include in SAML Assertion
DbUserEmailDefault (checked)

Enter the role created for Amazon Redshift access and the IDP ARN separated by a comma:


For example, arn:aws:iam::4XXXXXXXX4:role/ OneloginRedshiftCluster,arn:aws:iam::4XXXXXXX X4:saml-provider/OneloginRedshift.

Default (checked)



EmailDefault (checked)

Choose your AD groups.

If no AD integration is in place, choose user roles with semicolon-delimited input. This is to handle users associated with multiple groups.

The following section shows how to create roles and attach them to users.

Associating user roles

If you don’t have an AD association in OneLogin and you need to authorize access using groups in Amazon Redshift, complete the following steps.

  1. On the OneLogin page, under Users, choose Roles.
  2. Choose New Role.
  3. Create new roles that correspond to the Amazon Redshift user groups. Make sure that the role names are lowercase.
  4. Add the Amazon Redshift JDBC ODBC application created earlier.
  5. Choose Save.

We associate users to the role we created earlier so we can map users to Amazon Redshift groups.

  1. We assign the Fred Taylor user to the marketing role and Joe Bloggs to the sales role.

These roles are used to assign the users to the appropriate groups when they log in. You can also add users automatically to roles by using rules.

  1. Go to user profile and check if the role is associated with the user.
    1. If it’s not selected, choose New Role and add the application.

In the next steps, we set up the JDBC and ODBC tools.

Setting up JDBC and ODBC connections

In this post, we use SQL Workbench to demonstrate the JDBC setup, but you can extend the solution to other JDBC-compliant tools.

  1. Download the Amazon Redshift driver and ensure that the driver version is 1.2.41 or above with SDK included.
  2. In SQL Workbench/J, on the Manage drivers page, create a new Amazon Redshift driver profile and point it to the file downloaded in the previous step.
  3. Create a connection to the Amazon Redshift cluster using the driver you downloaded.
  4. For URL, enter the URL in the following format: jdbc:redshift:iam://<clusterendpoint>:5439/dev.
  5. Leave Username and Password blank (they are federated from OneLogin).
  6. Select Save password.
  7. Choose Extended properties and add the following values:
    1. login_urlhttps://exampleinc.onelogin.com/trust/saml2/http-post/sso/613ac582-9999999999 (from OneLogin application setup)
    2. plugin_namecom.amazon.redshift.plugin.BrowserSamlCredentialsProvider
    3. idp_response_timeout15
  8. Choose Test or Connect to open the OneLogin page.
  9. Enter your corporate user name and password.

You should see the following message upon successful authentication: “Thank you for using Amazon Redshift! You can now close this window.”

  1. Navigate back to SQL Workbench and you should be connected to the Amazon Redshift cluster with the OneLogin user name and the role assigned to you in OneLogin.
  2. Verify the user name passed in via OneLogin by running the following SQL command:
    select current_user

You can now verify that the users have been associated with the correct groups. For our use case, Fred Taylor has access to the tables in the marketing schema only. The user Joe Bloggs has access to tables in the sales schema only. Using the Joe Bloggs user, you get the following results when trying to query data from each schema:

select productid from sales.monthly_sales


select * from marketing.employee

An error occurred when executing the SQL command:
select * from marketing.employee

[Amazon](500310) Invalid operation: permission denied for schema marketing;
 [SQL State=42501, DB Errorcode=500310]
1 statement failed.

For client tools that support ODBC, you can configure the ODBC driver to connect Redshift to integrate with OneLogin. In this post, we show ODBC connectivity using the command line tool isql and Python.

isql is an interactive ODBC test tool to test your DSNs for their connectivity to databases and run SQL statements when you’re connected to a database. It is installed with PSQL.

  1. Download and install the ODBC driver (use ODBC – macOS X driver version 1.4.16 or higher).
  2. On MacOS, the installation process installs the driver files in the following directories:

  3. Open the /usr/local/etc/odbc.ini directory and add Amazon Redshift DSN and Login_URL. See the following screenshot.
  4. After odbc.ini is set up, we connect using isql. On terminal, enter the following code:
    isql -v "Amazon Redshift ODBC DSN"

isql should open the browser window to ask for credentials (use your OneLogin credentials).

  1. We can also use Python3 to connect to Amazon Redshift using ODBC. See the following example code:
    import pyodbc
    cnxn = pyodbc.connect('DRIVER={/opt/amazon/redshift/lib/libamazonredshiftodbc.dylib};\
    Login_URL= https://exampleinc.onelogin.com/trust/saml2/http-post/sso/613ac582-9999999999;\
    cursor = cnxn.cursor()
    cursor.execute("SELECT current_date;")
    for row in cursor.fetchall():
    	print (row)


In this post, we demonstrated how to set up federated login to Amazon Redshift using OneLogin. We also showed how to pass along group membership within your IdP, enabling you to manage user access to Amazon Redshift resources from within your IdP.

If you have any questions or suggestions, please leave us a comment.

About the Authors

Veerendra Nayak is a Senior Database Solution Architect with Amazon Web Services.





Sam Selvan is a Senior Database Solution Architect with Amazon Web Services.



Store and Access Time Series Data at Any Scale with Amazon Timestream – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/store-and-access-time-series-data-at-any-scale-with-amazon-timestream-now-generally-available/

Time series are a very common data format that describes how things change over time. Some of the most common sources are industrial machines and IoT devices, IT infrastructure stacks (such as hardware, software, and networking components), and applications that share their results over time. Managing time series data efficiently is not easy because the data model doesn’t fit general-purpose databases.

For this reason, I am happy to share that Amazon Timestream is now generally available. Timestream is a fast, scalable, and serverless time series database service that makes it easy to collect, store, and process trillions of time series events per day up to 1,000 times faster and at as little as to 1/10th the cost of a relational database.

This is made possible by the way Timestream is managing data: recent data is kept in memory and historical data is moved to cost-optimized storage based on a retention policy you define. All data is always automatically replicated across multiple availability zones (AZ) in the same AWS region. New data is written to the memory store, where data is replicated across three AZs before returning success of the operation. Data replication is quorum based such that the loss of nodes, or an entire AZ, does not disrupt durability or availability. In addition, data in the memory store is continuously backed up to Amazon Simple Storage Service (S3) as an extra precaution.

Queries automatically access and combine recent and historical data across tiers without the need to specify the storage location, and support time series-specific functionalities to help you identify trends and patterns in data in near real time.

There are no upfront costs, you pay only for the data you write, store, or query. Based on the load, Timestream automatically scales up or down to adjust capacity, without the need to manage the underlying infrastructure.

Timestream integrates with popular services for data collection, visualization, and machine learning, making it easy to use with existing and new applications. For example, you can ingest data directly from AWS IoT Core, Amazon Kinesis Data Analytics for Apache Flink, AWS IoT Greengrass, and Amazon MSK. You can visualize data stored in Timestream from Amazon QuickSight, and use Amazon SageMaker to apply machine learning algorithms to time series data, for example for anomaly detection. You can use Timestream fine-grained AWS Identity and Access Management (IAM) permissions to easily ingest or query data from an AWS Lambda function. We are providing the tools to use Timestream with open source platforms such as Apache Kafka, Telegraf, Prometheus, and Grafana.

Using Amazon Timestream from the Console
In the Timestream console, I select Create database. I can choose to create a Standard database or a Sample database populated with sample data. I proceed with a standard database and I name it MyDatabase.

All Timestream data is encrypted by default. I use the default master key, but you can use a customer managed key that you created using AWS Key Management Service (KMS). In that way, you can control the rotation of the master key, and who has permissions to use or manage it.

I complete the creation of the database. Now my database is empty. I select Create table and name it MyTable.

Each table has its own data retention policy. First data is ingested in the memory store, where it can be stored from a minimum of one hour to a maximum of a year. After that, it is automatically moved to the magnetic store, where it can be kept up from a minimum of one day to a maximum of 200 years, after which it is deleted. In my case, I select 1 hour of memory store retention and 5 years of magnetic store retention.

When writing data in Timestream, you cannot insert data that is older than the retention period of the memory store. For example, in my case I will not be able to insert records older than 1 hour. Similarly, you cannot insert data with a future timestamp.

I complete the creation of the table. As you noticed, I was not asked for a data schema. Timestream will automatically infer that as data is ingested. Now, let’s put some data in the table!

Loading Data in Amazon Timestream
Each record in a Timestream table is a single data point in the time series and contains:

  • The measure name, type, and value. Each record can contain a single measure, but different measure names and types can be stored in the same table.
  • The timestamp of when the measure was collected, with nanosecond granularity.
  • Zero or more dimensions that describe the measure and can be used to filter or aggregate data. Records in a table can have different dimensions.

For example, let’s build a simple monitoring application collecting CPU, memory, swap, and disk usage from a server. Each server is identified by a hostname and has a location expressed as a country and a city.

In this case, the dimensions would be the same for all records:

  • country
  • city
  • hostname

Records in the table are going to measure different things. The measure names I use are:

  • cpu_utilization
  • memory_utilization
  • swap_utilization
  • disk_utilization

Measure type is DOUBLE for all of them.

For the monitoring application, I am using Python. To collect monitoring information I use the psutil module that I can install with:

pip3 install psutil

Here’s the code for the collect.py application:

import time
import boto3
import psutil

from botocore.config import Config

DATABASE_NAME = "MyDatabase"
TABLE_NAME = "MyTable"

CITY = "London"
HOSTNAME = "MyHostname" # You can make it dynamic using socket.gethostname()

INTERVAL = 1 # Seconds

def prepare_record(measure_name, measure_value):
    record = {
        'Time': str(current_time),
        'Dimensions': dimensions,
        'MeasureName': measure_name,
        'MeasureValue': str(measure_value),
        'MeasureValueType': 'DOUBLE'
    return record

def write_records(records):
        result = write_client.write_records(DatabaseName=DATABASE_NAME,
        status = result['ResponseMetadata']['HTTPStatusCode']
        print("Processed %d records. WriteRecords Status: %s" %
              (len(records), status))
    except Exception as err:
        print("Error:", err)

if __name__ == '__main__':

    session = boto3.Session()
    write_client = session.client('timestream-write', config=Config(
        read_timeout=20, max_pool_connections=5000, retries={'max_attempts': 10}))
    query_client = session.client('timestream-query')

    dimensions = [
        {'Name': 'country', 'Value': COUNTRY},
        {'Name': 'city', 'Value': CITY},
        {'Name': 'hostname', 'Value': HOSTNAME},

    records = []

    while True:

        current_time = int(time.time() * 1000)
        cpu_utilization = psutil.cpu_percent()
        memory_utilization = psutil.virtual_memory().percent
        swap_utilization = psutil.swap_memory().percent
        disk_utilization = psutil.disk_usage('/').percent

        records.append(prepare_record('cpu_utilization', cpu_utilization))
            'memory_utilization', memory_utilization))
        records.append(prepare_record('swap_utilization', swap_utilization))
        records.append(prepare_record('disk_utilization', disk_utilization))

        print("records {} - cpu {} - memory {} - swap {} - disk {}".format(
            len(records), cpu_utilization, memory_utilization,
            swap_utilization, disk_utilization))

        if len(records) == 100:
            records = []


I start the collect.py application. Every 100 records, data is written in the MyData table:

$ python3 collect.py
records 4 - cpu 31.6 - memory 65.3 - swap 73.8 - disk 5.7
records 8 - cpu 18.3 - memory 64.9 - swap 73.8 - disk 5.7
records 12 - cpu 15.1 - memory 64.8 - swap 73.8 - disk 5.7
. . .
records 96 - cpu 44.1 - memory 64.2 - swap 73.8 - disk 5.7
records 100 - cpu 46.8 - memory 64.1 - swap 73.8 - disk 5.7
Processed 100 records. WriteRecords Status: 200
records 4 - cpu 36.3 - memory 64.1 - swap 73.8 - disk 5.7
records 8 - cpu 31.7 - memory 64.1 - swap 73.8 - disk 5.7
records 12 - cpu 38.8 - memory 64.1 - swap 73.8 - disk 5.7
. . .

Now, in the Timestream console, I see the schema of the MyData table, automatically updated based on the data ingested:

Note that, since all measures in the table are of type DOUBLE, the measure_value::double column contains the value for all of them. If the measures were of different types (for example, INT or BIGINT) I would have more columns (such as measure_value::int and measure_value::bigint) .

In the console, I can also see a recap of which kind measures I have in the table, their corresponding data type, and the dimensions used for that specific measure:

Querying Data from the Console
I can query time series data using SQL. The memory store is optimized for fast point-in-time queries, while the magnetic store is optimized for fast analytical queries. However, queries automatically process data on all stores (memory and magnetic) without having to specify the data location in the query.

I am running queries straight from the console, but I can also use JDBC connectivity to access the query engine. I start with a basic query to see the most recent records in the table:


Let’s try something a little more complex. I want to see the average CPU utilization aggregated by hostname in 5 minutes intervals for the last two hours. I filter records based on the content of measure_name. I use the function bin() to round time to a multiple of an interval size, and the function ago() to compare timestamps:

SELECT hostname,
       bin(time, 5m) as binned_time,
       avg(measure_value::double) as avg_cpu_utilization
  FROM MyDatabase.MyTable
 WHERE measure_name = 'cpu_utilization'
   AND time > ago(2h)
 GROUP BY hostname, bin(time, 5m)

When collecting time series data you may miss some values. This is quite common especially for distributed architectures and IoT devices. Timestream has some interesting functions that you can use to fill in the missing values, for example using linear interpolation, or based on the last observation carried forward.

More generally, Timestream offers many functions that help you to use mathematical expressions, manipulate strings, arrays, and date/time values, use regular expressions, and work with aggregations/windows.

To experience what you can do with Timestream, you can create a sample database and add the two IoT and DevOps datasets that we provide. Then, in the console query interface, look at the sample queries to get a glimpse of some of the more advanced functionalities:

Using Amazon Timestream with Grafana
One of the most interesting aspects of Timestream is the integration with many platforms. For example, you can visualize your time series data and create alerts using Grafana 7.1 or higher. The Timestream plugin is part of the open source edition of Grafana.

I add a new GrafanaDemo table to my database, and use another sample application to continuously ingest data. The application simulates performance data collected from a microservice architecture running on thousands of hosts.

I install Grafana on an Amazon Elastic Compute Cloud (EC2) instance and add the Timestream plugin using the Grafana CLI.

$ grafana-cli plugins install grafana-timestream-datasource

I use SSH Port Forwarding to access the Grafana console from my laptop:

$ ssh -L 3000:<EC2-Public-DNS>:3000 -N -f [email protected]<EC2-Public-DNS>

In the Grafana console, I configure the plugin with the right AWS credentials, and the Timestream database and table. Now, I can select the sample dashboard, distributed as part of the Timestream plugin, using data from the GrafanaDemo table where performance data is continuously collected:

Available Now
Amazon Timestream is available today in US East (N. Virginia), Europe (Ireland), US West (Oregon), and US East (Ohio). You can use Timestream with the console, the AWS Command Line Interface (CLI), AWS SDKs, and AWS CloudFormation. With Timestream, you pay based on the number of writes, the data scanned by the queries, and the storage used. For more information, please see the pricing page.

You can find more sample applications in this repo. To learn more, please see the documentation. It’s never been easier to work with time series, including data ingestion, retention, access, and storage tiering. Let me know what you are going to build!


Federating single sign-on access to your Amazon Redshift cluster with PingIdentity

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/federating-single-sign-on-access-to-your-amazon-redshift-cluster-with-pingidentity/

Single sign-on (SSO) enables users to have a seamless user experience while accessing various applications in the organization. If you’re responsible for setting up security and database access privileges for users and tasked with enabling SSO for Amazon Redshift, you can set up SSO authentication using ADFS, PingIdentity, Okta, Azure AD or other SAML browser based Identity Providers.

With federation, you can centralize management and governance of authentication and permissions by managing users and groups within the enterprise identity provider (IdP) and use them to authenticate to Amazon Redshift. For more information about the federation workflow using IAM and an identity provider, see Federate Database User Authentication Easily with IAM and Amazon Redshift.

This post shows you how to set up PingOne as your IdP. I provide step-by-step guidance to set up a trial account at pingidentity.com, build users and groups within your organization’s directory, and enable federated SSO into Amazon Redshift to maintain group-level access controls for your data warehouse.

Solution overview

The steps in this post are structured into the following sections:

  • IdP (PingOne) groups configuration – Create groups and assign users to logical groups in PingOne.
  • IdP (PingOne) application configuration – Create PingOne application(s) and configure AWS Identity and Access Management (IAM) roles, and groups allowed to be passed to Amazon Redshift.
  • IAM SAML federation configuration – Setup a role that allows PingOne to access Amazon Redshift by establishing a trust relationship between PingOne IdP and AWS.
  • Amazon Redshift groups and privileges setup – Setup groups within the Amazon Redshift database to match the PingOne groups. You also authorize these groups to access certain schemas and tables.
  • Amazon Redshift server and client setup and test SSO – Finally, configure SQL client tools to use your enterprise credentials and sign in to Amazon Redshift.

The process flow for federated authentication is shown in the following diagram and steps:

  1. The user logs in using a JDBC/ODBC SQL client.
  2. The IdP authenticates using the corporate user name and password, and returns a SAML assertion.
  3. The client uses AWS SDK to call AWS Security Token Service (AWS STS) to assume a role with SAML.
  4. AWS STS returns temporary AWS credentials.
  5. The client uses the temporary AWS credentials to get temporary cluster credentials.
  6. The client connects to Amazon Redshift using the temporary credentials.

Setting up PingOne provider groups and users

Before you get started, sign up for a free trial of PingOne for Enterprise. You then create the users and groups, and assign the users to the groups they belong to and are authorized to access.

You create groups and users in the PingOne user directory. You can set up the groups according to the read/write access privileges or by business functions in your organization to control access to the database objects.

In this post, we set up groups based on ReadOnly and ReadWrite privileges across all functions.

  1. After you have a PingOne account, log in to the PingOne admin dashboard.
  2. Choose Setup from the menu bar.
  3. On the Identity Repository tab, choose Connect to an Identity Repository.
  4. For Select an Identity Repository, you will see options for PingOne Directory, Active Directory, PingFederate and others. Choose PingOne Directory and go to Next.

After you connect to the PingOne repository, you should see the status CONFIGURED.

You can now create your groups and assign users.

  1. Choose Users from the menu bar.
  2. On the User Directory tab, choose Groups.
  3. Choose Add Group.
  4. For Name, enter readonly.
  5. For Directly Applied Role, select No Access.
  6. Choose Save.
  7. Repeat these steps for your readwrite group.
  8. To create the users, choose Users from the menu bar.
  9. On the User Directory tab, choose Users.
  10. Choose Add Users.

For this post, we create two users, Bob and Rachel.

  1. Under Group Memberships, for Memberships, select the group to add your user to.

For this post, we add Bob to readonly and Rachel to readwrite.

  1. Choose Add.
  2. Choose Save.
  3. Repeat these steps to create both users.

Configuring your IdP (PingOne) application

The next step is to set up the applications in the IdP for Amazon Redshift. Because we decided to control access through two groups, we create two applications.

  1. On the PingOne dashboard, choose Applications from the menu bar.
  2. On the My Applications tab, choose SAML.
  3. Choose Add Application.
  4. Choose New SAML Application.
  5. For Application Name, enter AmazonRedshiftReadOnly.
  6. Choose Continue to Next Step.
  7. On the Application Configuration page, for Assertion Consumer Service (ACS), enter http://localhost:7890/redshift/.
  8. For Entity ID, enter urn:amazon:webservices.
  9. For Signing, select Sign Assertion.
  10. For Signing Algorithm, choose RSA_SHA256.
  11. Choose Continue to Next Step.
  12. On the SSO Attribute Mapping page, add the following application attributes:
Application AttributeIdentity BridgeAs Literal

arn:aws:iam::<AWSAccount>:role/pingreadonlyrole,arn:aws:iam:: <AWSAccount>:saml-provider/pingreadonlyprov

pingreadonlyrole is the name of the IAM role you create in the next step.

pingreadonlyprov is the Identity Provider name in IAM where the metadata is imported. You use this name in next step to create your Identity Provider and import the metadata downloaded from this PingOne application configuration.



Choose Advanced and for Function, choose ExtractByRegularExpression. For Expression, enter (readonly|readwrite).+

This regular expression is to remove the @directory value from the PingIdentiy group name to be in line with the Amazon Redshift DB group names and send only the relevant groups to the Application.

Refer to the PingIdentity documentation for more details on parsing the memberof attribute in PingOne.

  1. Choose Continue to Next Step.
  2. On the Group Access page, add the groups that this application can access.

This adds the users who are members of that group so they can SSO to the application.

  1. On the Review Setup page, for SAML Metadata, choose Download.
  2. Save the file as ping-saml-readonly.xml.

You use this file later to import the metadata to create the PingOne IdP.

  1. Record the URL for Initiate Single Sign-On (SSO).

You use this URL to set up the SQL client for federated SSO.

  1. Choose Finish.
  2. Repeat these steps to create the second application, AmazonRedshiftReadWrite, with the following changes:
    1. On the SSO Attribute Mapping page, use the IAM role name pingreadwriterole and IdP name pingreadwriteprov.
    2. Save the SAML metadata file as ping-saml-readwrite.xml.

You should now see the two application names on the My Applications tab.

Configuring IAM SAML federation

To set up your IAM SAML configuration, you create the IAM IdP and the roles and policies for the groups.

Setting up the IAM SAML IdP

You set up the IAM IdP and the roles used in the PingOnereadonly and PingOnereadwrite applications to establish a trust relationship between the IdP and AWS. You need to create two IAM IdPs, one for each application. Complete the following steps:

  1. On the IAM console, under Access management, choose Identity providers.
  2. Choose Create Provider.
  3. For Provider Type, choose SAML.
  4. For Provider name, enter pingreadonlyprov.
  5. For Metadata Document, choose the metadata XML file you downloaded from the AmazonRedshiftReadOnly application.
  6. Repeat these steps to create the provider pingreadwriteprov.
    1. Choose the metadata XML file you downloaded from the AmazonRedshiftReadWrite application.

You now have two IdP providers: pingreadonlyprov and pingreadwriteprov.

Creating the IAM role and policy for the groups

You control access privileges to database objects for specific user groups by using IAM roles. In this section, you create separate IAM roles with policies to map to each of the groups defined in PingOne. These roles allow the user to access Amazon Redshift through the IdP.

You use the same role names that you used to set up applications in PingOne: pingreadonlyrole and pingreadwriterole.

Before you create the role, create the policies with the appropriate joingroup privileges.

  1. On the IAM console, under Access Management, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the following code to create the two policies.
    1. Replace <cluster> with your cluster name and <dbname> with your database name.

The only difference between the two policies is the Action- redshift:JoinGroup section:

  • “JoinGroup”: pingreadonlypolicy allows users to join the readonly group
  • “JoinGroup”: pingreadwritepolicy allows users to join the readwrite group

The group membership lasts only for the duration of the user session, and there is no CreateGroup permission because you need to manually create groups and grant DB privileges in Amazon Redshift.

The following code is the pingreadonlypolicy policy:

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "GetClusterCredsStatement",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "arn:aws:redshift:*:*:dbuser: <dbname>/${redshift:DbUser}"
            "aws:userid":"*:${redshift:DbUser} "
            "Sid": "CreateClusterUserStatement",
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "arn:aws:redshift:*:*:dbuser: <dbname>/${redshift:DbUser}"
            "Sid": "RedshiftJoinGroupStatement",
            "Effect": "Allow",
            "Action": [
            "Resource": [

The following code is the pingreadwritepolicy policy:

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "GetClusterCredsStatement",
            "Effect": "Allow",
            "Action": [
            "Resource": [
            "aws:userid":"*:${redshift:DbUser} "
            "Sid": "CreateClusterUserStatement",
            "Effect": "Allow",
            "Action": [
            "Resource": [
            "Sid": "RedshiftJoinGroupStatement",
            "Effect": "Allow",
            "Action": [
            "Resource": [
  1. On the IAM console, choose Roles.
  2. Choose Create role.
  3. For Select type of trusted entity, choose SAML 2.0 federation.
  4. For SAML provider, choose the provider you created.
  5. Select Allow programmatic access only.
  6. For Attribute, choose SAML:aud.
  7. For Value, enter http://localhost:7890/redshift/.
  8. Select pingreadonlypolicy for the first role and pingreadwritepolicy for the second role.
  9. Enter a name and description for each role.

The following screenshot shows your new roles: pingreadonlyrole and pingreadwriterole.

Setting up your groups and privileges in Amazon Redshift

In this section, you create the database groups in Amazon Redshift. These group names should match the group names you used when you set up your PingOne groups. Then you assign privileges to the groups to access the database objects including schemas and tables. User assignment to groups is done only one time in PingOne; you don’t assign users to groups in Amazon Redshift.

  1. Log in to your Amazon Redshift cluster with an admin account using the admin database credentials.
  2. Use the following scripts to create groups that match the IdP group names and grant the appropriate permissions to tables and schemas:
    CREATE GROUP readonly;
    CREATE GROUP readwrite;
    GRANT SELECT on TABLES to GROUP readonly;
    GRANT USAGE on SCHEMA finance to GROUP readonly;
    GRANT SELECT on ALL TABLES in SCHEMA finance to GROUP readonly;
    GRANT ALL on TABLES to GROUP readwrite;
    GRANT USAGE on SCHEMA finance to GROUP readwrite;
    GRANT ALL on ALL TABLES in SCHEMA finance to GROUP readwrite;

Setting up your Amazon Redshift server and client and testing SSO

In these final steps, you set up your client tools to use your enterprise credentials and sign in to Amazon Redshift.

Configuring the JDBC SQL Client using SQL Workbench/J

If you haven’t installed the JDBC driver, you can download the Amazon Redshift JDBC driver from the console. You then set up a new connection to your cluster using your PingOne IdP credentials.

  1. Create two new connection profiles, Redshift-ReadOnly and Redshift-ReadWrite.
  2. For URL, enter jdbc:redshift:iam://<cluster endpoint>.

IAM authentication requires using the JDBC driver with the AWS SDK included or making sure the AWS SDK is within your Java classpath.

You don’t need to enter a user name or password in JDBC setting. PingIdentity prompts you to log in on the web browser.

  1. Choose Extended Properties to define the SSO parameters for loging_url and plugin_name.
  2. In the Edit extended properties section, enter the following properties and values:

The login_url is the URL from the PingOne AmazonRedshift applications you set up earlier. Choose the SSO URL from the RedshiftReadOnly application for the readonly connection and the SSO URL from RedshiftReadWrite application for the readwrite connection.

The configuration in your extended properties screen should look like the screenshot below:

  1. Choose OK.

Testing SSO authentication and access privileges

When you log in from the SQL client, you’re redirected to the browser to sign in with your PingOne user name and password.

Log in as user bob with the IdP password.

This user has access to SELECT all tables in the finance schema and not INSERT/UPDATE access. You can enter the following statements to test your access.

The following query shows the results from the finance.revenue table:

/* Finance ReadOnly Query */
select * from finance.revenue limit 10;

customer		salesamt
ABC Company	        12000
Tech Logistics		175400
XYZ Industry		24355
The tax experts        186577

When you run an INSERT statement, you get the message that you’re not authorized to insert data:

/* Finance ReadWrite Insert */
insert into finance.revenue
values (10001, 'ABC Company', 12000);

You should see the results below:

INSERT INTO finance.revenue not successful
An error occurred when executing the SQL command:
insert into finance.revenue
values(10001, 'ABC Company', 12000)

[Amazon]()500310)Invalid operation:permission denied for relation revenue;1 statement failed.
Execution time:0.05s

You can repeat these steps for the user rachel, who has access to read and write (INSERT) data into the finance schema.

Configuring the ODBC client

To configure your ODBC client, complete the following steps.

  1. Open the ODBC Data source administrator from your desktop.
  2. On the System DSN tab, choose Add.
  3. For Server, enter your Amazon Redshift ODBC endpoint.
  4. For Port, enter 5439.
  5. For Database, enter your database name.
  6. For Auth Type, choose Identity Provider: Browser SAML to use browser-based authentication.
  7. For Cluster ID, enter your cluster ID.
  8. For Preferred Role, enter your IAM role ARN.
  9. For Login URL, enter your PingOne login URL from the application configuration (https://sso.connect.PingOne.com/sso/sp/initsso?saasid=<saasid>&idpid=<idpid>).
  10. For Listen port, enter 7890 (default).
  11. For Timeout, enter 60.


In this blog post, I walked you through a step-by-step guide to configure and use PingOne as your IdP and enabled federated SSO to an Amazon Redshift cluster. You can follow these steps to setup federated SSO for your organization and manage access privileges based on read/write privileges or by business function and passing group membership defined in your PingOne IdP to your Amazon Redshift cluster.

About the Authors

Rajesh Francis is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build scalable Analytic solutions.






Best practices using AWS SCT and AWS Snowball to migrate from Teradata to Amazon Redshift

Post Syndicated from Ajinkya Puranik original https://aws.amazon.com/blogs/big-data/best-practices-using-aws-sct-and-aws-snowball-to-migrate-from-teradata-to-amazon-redshift/

This is a guest post from ZS. In their own words, “ZS is a professional services firm that works closely with companies to help develop and deliver products and solutions that drive customer value and company results. ZS engagements involve a blend of technology, consulting, analytics, and operations, and are targeted toward improving the commercial experience for clients.”

This blog is about the approaches evaluated and eventually chosen for ZS’s cloud transformation journey specifically for adoption of Amazon Redshift from prior Teradata based data warehousing solution.

ZS, a professional services firm that works side by side with companies to help develop and deliver products that drive customer value and company results. We leverage our extensive industry expertise, leading-edge analytics, technology and strategies to create solutions that work in the real world. With more than 35 years of experience and over 7,500 ZS employees in 24 offices worldwide, we are passionately committed to helping companies and their customers thrive.

ZS used Teradata as the primary data warehouse solution for several years. Part due to high ownership and operating cost, we started looking for an optimal solution which could provide scaling flexibility, lower maintenance liability and access accelerated innovation in the industry. This was achievable through solutions hosted on a cloud platform like AWS which ZS has already been using for numerous business workloads over the years.

Considerations for migration

Following were the three key areas which were critical for our Teradata to Amazon Redshift migration planning.

Table structures

The process included migrating the database schema first and then migrating the actual data from the databases. The schema on Amazon Redshift needed to be ready before loading the data from Amazon Simple Storage Service (Amazon S3).

AWS Schema Conversion Tool (SCT) helped in migrating table structures to Amazon Redshift, which converted the data types used for columns in Teradata tables into the corresponding Amazon Redshift data types. The AWS SCT tool also helped convert the table definition from Teradata to Amazon Redshift to include the appropriate keys, such as the Distribution Key/Sort Key. How to use the AWS SCT has been explained in the later sections of this blog

Database objects and data types

Teradata databases can hold a variety of database objects apart from tables like views, stored procedures, macros, User Defined Functions (UDF) and so on. The data types of the columns used in Teradata tables needed to be converted into the appropriate data types on Amazon Redshift. For other objects like views, stored procedures, the definitions from Teradata were exported and fresh objects were created in Amazon Redshift with appropriate changes in the new definitions. The AWS SCT can help in identifying the objects that need rework while migrating to Amazon Redshift.

Transferring data to AWS

Third and one of the major considerations was migrating the actual data to AWS. ZS’s use cases and isolation requirements were such that neither was Direct connect used in general nor were all AWS VPCs connected to corporate / on-premises network via VPN Tunnels. Data once exported out of Teradata gets uncompressed and expands approximately 4x resulting in requirement for data storage on local staging servers. Each ZS client workload had its respective warehouse on source and destination which also varied in size and had respective isolate change management timelines. Given these considerations we designed two use case specific approaches for transferring the exported data from the Teradata database to Amazon S3:

  • AWS Snowball – For databases larger than 4TB, we chose to transfer the data using AWS Snowball. Once the data was exported out of Teradata it was pushed to AWS Snowball periodically in batches. Resulting in optimal use of the storage space on the staging servers.
  • AWS CLI upload – For databases smaller than 4TB, data sets were exported from Teradata to staging servers and uploaded to Amazon S3 over the internet using AWS Command Line Interface (AWS CLI). These data sets were uploaded during non-business hours to minimize the impact on the ZS on-premises data center network bandwidth

The following diagram illustrates this architecture

Challenges and constraints

Exporting the data

The amount of data that had to be exported from the Teradata systems was 100+ TB (compressed). When exported, this would potentially expand to 500+TB. We needed a solution that could export this scale efficiently. Staging such large data volumes before migrating to AWS was a challenge due to limited on-premises SAN storage capacity. The mitigation chosen was to export in batches such that the exported data could be moved away from the staging server in a rolling fashion thus keeping space available throughout the migration. For certain datasets, due to volumes we further re-compressed the exported data before migration to Amazon S3.

Transferring the data

ZS had 150+ databases within our Teradata systems used across numerous ZS client initiatives. For certain projects, the data even had to be transferred to the client’s AWS account requiring respective unique processes while technology foundation was reusable. As alluded to earlier, due to varying dataset sizes per client workload, respective nuanced approaches were designed.

Initial approach for the solution

A cross functional team comprising of expertise across data warehousing, storage, network, cloud native technologies, business was formed at ZS which was also supported by AWS experts brought in via ZS’s AWS Partnership.

Primary focus at beginning was placed on finalizing data migration approaches. One such method that we tried was to use the AWS SCT to copy the schema onto Amazon Redshift and transfer the data to Amazon S3 using SCT Migration mode extract and upload. We also looked at file interface of AWS Snowball Edge to eliminate the need of having local storage for migration and directly exporting the Teradata exports on AWS Snowball Edge.

Approach constraints

While choosing a final approach, we came across the following challenges:

  1. Data export speeds were a major factor, considering the huge amount of data to migrate. We adopted the Teradata Parallel Transporter (TPT) approach because it showed better runtimes.
  2. Teradata holds up to 4X compressed data, which gets uncompressed post export. Holding such large datasets on a staging server was not feasible due to storage constraints.
  3. AWS Snowball Edge was evaluated instead of AWS Snowball to test the advantages of attaching it as a direct NFS to staging servers. However, since maximum file size supported by snowball edge NFS interface is 150 GB, we decided to continue with AWS Snowball.

TPT scripts method

Teradata Parallel Transporter (TPT) scripts were leveraged to export the data since it provided faster export speeds from Teradata servers compared to alternatives. We prepared the Teradata Parallel Transporter (TPT) scripts and launched these through Linux servers. Before starting the export, we had to ensure that enough free space was available on the server(s) to accommodate the export dumps.

The advantages of using TPT scripts to export data from Teradata tables were as follows:

  • Parallel processing to export data, which provided faster runtimes
  • Exporting varied data types into text format, which could be loaded into Amazon Redshift

Then the data was exported on the same servers where the TPT scripts were run. From here the data was copied either to the Amazon S3 bucket through the AWS CLI that was installed on the same server or to the Snowball device.

Final architecture

The hybrid cloud architecture we zeroed in on is depicted in picture below comprising of ZS’s on-premises data center hosting Teradata appliance, AWS destination environments and intermediary staging as well as shipping and data transfer networks. AWS SCT was leveraged for Schema migration and TPT exports for the data migration. The TPT export scripts were executed on the staging servers and the data was exported onto shared storage which was attached to staging servers. After the exports were completed the data was copied to AWS S3 using either AWS CLI for S3 or was pushed to AWS Snowball depending on the data size. The Snowball device was configured within the same network as the staging servers to ensure optimal transfer latency. Once data was copied completely onto AWS Snowball, it was shipped to AWS where data was transferred into the corresponding Amazon S3 bucket. On the AWS side, we had the S3 bucket for the corresponding Amazon Redshift cluster that held the data before loading into it.

Exporting the data

The TPT script is very effective when exporting huge amounts of data from the proprietary Teradata systems. You can prepare and deploy export scripts on a server within the same network as the Teradata appliance, which enables high export speeds.

The TPT export script is a combination of 1) Declaration section 2) Loop with built-in commands. Export dump with logs are generated as outputs.

Declaration section

The declaration section is where we initialize all the parameters, like the system identifier known as the tdpid, login user name, and delimiter that are used in the output files. See the following code that sets up shell variables:

 SourceTdpId=<cop alias entries from hosts file or IP>  
 SourceUserName=<user id having read access on the DB tables>  
 TargetTextDelimiter=^ (can be decided based on the column values)  

Loop with built-in commands:

The values for the required variables were passed from three input files:

  • <databasename>.<tablename>
  • Definition of the TPT export operator
  • Job variables file (this file gets removed at the end of export)

See the following shell script that uses shell and TPT utility commands:

  while read database table   
    echo "SourceTdpId = ""'"${SourceTdpId}"'" $'\n' ",""SourceLogonMech = ""'"${SourceLogonMech}"'" $'\n' ","   
       "SourceUserName = ""'"${SourceUserName}"'" $'\n' "," "SourceUserPassword = ""'"${SourceUserPassword}"'" $'\n' ","   
       "SourceWorkingDatabase = ""'"${SourceWorkingDatabase}"'" $'\n' "," "DDLPrivateLogName = ""'"${DDLPrivateLogName}"'" $'\n' ","   
       "ExportPrivateLogName = ""'"${ExportPrivateLogName}"'" $'\n' "," "TargetErrorList = ""[""'"${TargetErrorList}"'""]" $'\n' ","   
       "TargetFileName = ""'"${TargetFileName}".dat""'" $'\n' "," "TargetFormat = ""'"${TargetFormat}"'" $'\n' ","   
       "TargetTextDelimiter = ""'"${TargetTextDelimiter}"'" $'\n' "," "TargetOpenMode = ""'"${TargetOpenMode}"'" $'\n' ","   
       "SpoolMode = ""'"${SpoolMode}"'" $'\n' "," "SelectStmt = ""'""select * from ${database}""."${table}"'" $'\n' >> jobvar.txt   
    chmod 777 jobvar.txt   
    tbuild -j ${table} -f tpt2test_2.tpt -v jobvar.txt   
    rm -rf jobvar.txt   
    log_total_records "${TargetFileName}"   

Export dump and logs

The data exported from the Teradata system through the TPT scripts was placed on the staging server. To ensure the quality of the exported data, we verified that the record counts in the log file, created during the TPT export, matched with the table row counts.

Table row count in Teradata

TPT exported dataset row count

The TPT scripts generated one file for every Teradata table. The file format of these files was text with the .dat extension. See the following screenshot.

You can optimize data loading into Amazon Redshift tables by splitting the corresponding file (dataset) into subsets of equal sizes. The number of such subsets should ideally be equal to or a multiple of the number of slices for the Amazon Redshift node type configured in the cluster. We chose to split the TPT output files using the Linux split command on the TPT server:

‘split -C 20m --numeric-suffixes input_filename output_prefix’

For more information efficiently loading the Amazon Redshift tables, see Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift and Best Practices for Micro-Batch Loading on Amazon Redshift.

Transferring data to S3 buckets

ZS leveraged AWS account level isolation for many of our client solutions to align with respective compliance controls. AWS Snowball is associated with a single AWS account, and to achieve full client data isolation, separate devices were shipped for each large use case. As indicated above, we adopted two methods to transfer the data based on the export size for each client workload:

  • AWS CLI – Use when databases are smaller than 4TB.
  • Snowball – Use when databases are bigger than 4TB or when data needed to be loaded to a ZS owned Client Dedicated account.

Transferring data through the AWS CLI

Transferring the data via the AWS CLI includes the following steps:

  1. Install and configure the AWS CLI utility on ZS on-premises Linux (staging) server
  2. Export datasets out from Teradata on the staging server.
  3. Copy the exported datasets to Amazon S3 using the AWS CLI:

aws s3 cp filename.txt s3://aws-s3-bucket-name/foldername/

Transferring data through Snowball

To transfer the data with Snowball, complete the following steps:

  1. Create a Snowball job on the AWS Management Console and order the Snowball device.
  2. Configure the Snowball on ZS’s on-premises data center network and install the Snowball client on the staging server.
  3. Unlock the Snowball device by downloading the manifest file and an unlock code from the console, as shown in the following code:

snowball start -i XX.XX.XX.XX -m /home/abcd/XXXXXXXXX_manifest.bin -u XXXXXXXXXXX

  1. Use the Snowball CLI to list the S3 Bucket associated with Snowball.

snowball s3 ls

  1. Copy the files to Snowball:

snowball cp /location/of/the/exported/files s3://Bucket_name/Target/

Transferring the table structure to Amazon Redshift

There are a few differences in the table definition format between Amazon Redshift and Teradata. The AWS SCT tool helps convert the Teradata table structure into an appropriate Amazon Redshift table structure.

To transfer the Teradata table structure to Amazon Redshift, complete the following steps:

  1. Connect to the on-premises Teradata systems and the Amazon Redshift cluster endpoint.

  1. Select the specific table from Teradata and right-click the option Convert schema. This converts the table definition into the Amazon Redshift equivalent.

  1. In the Amazon Redshift section of the AWS SCT console, choose Apply to database when the table conversion is complete to create the table structure on Amazon Redshift.

Pushing the data to the tables

After you migrate the required data to the appropriate S3 bucket, convert the tables as per usability, and apply the tables to Amazon Redshift, you can push the data to these tables via the COPY command:

copy AXXXX_MAIN.table1  
 from 's3://aws-s3-bucket-name/AXXXX_MAIN.table1.dat'  
 iam_role 'arn:aws:iam::XXXXXXX:role/aws-iam-role '  
 delimiter '|'  
 region 'us-XXXX-1'; 

The naming convention we used for the exported datasets was <databasename>.<tablename>. The table structures (DDLs) were migrated through AWS SCT and the table names matched the dataset names. Therefore, when we created the COPY commands, we simply had to match the target table name in Amazon Redshift with that of the datasets on Amazon S3. For more information about this process, see Using the COPY command to load from Amazon S3.


In this blog, we intended to convey our journey and options evaluated before zeroing on one to transform on-premise Teradata data warehouse workloads onto Amazon Redshift at scale. Process built around multiple tools including AWS SCT, Teradata Parallel Transporter, and AWS Snowball facilitated our transformation

For more information about AWS SCT, see Introducing AWS Schema Conversion Tool Version 1.0.502. For more information about Snowball, see AWS Import/Export Snowball – Transfer 1 Petabyte Per Week Using Amazon-Owned Storage Appliances.

Disclaimer: The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Ajinkya Puranik is a Cloud Database Lead within Cloud Centre of Excellence at ZS Associates. He has years of experience managing, administrating, optimizing and adopting evolving data warehousing solutions. He played an instrumental role in ZS’s Teradata to Redshift transformation journey. His personal interests involve cricket and traveling.





Sushant Jadhav is a Senior Cloud Administrator within Cloud Center of Excellence at ZS Associates. He is a results-oriented professional with technology experience predominantly in the storage and backup industry. He has worked on many migration projects where he helped customers migrate from on-premises to AWS. Sushant enjoys working on all the AWS services and tries to bridge the gap between technology and business. He is always keen on learning new technologies and is always evolving in his role. Apart from work, he enjoys playing football.

How to delete user data in an AWS data lake

Post Syndicated from George Komninos original https://aws.amazon.com/blogs/big-data/how-to-delete-user-data-in-an-aws-data-lake/

General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure” or “right to be forgotten” which may require you to implement a solution to delete specific users’ personal data.

In the context of the AWS big data and analytics ecosystem, every architecture, regardless of the problem it targets, uses Amazon Simple Storage Service (Amazon S3) as the core storage service. Despite its versatility and feature completeness, Amazon S3 doesn’t come with an out-of-the-box way to map a user identifier to S3 keys of objects that contain user’s data.

This post walks you through a framework that helps you purge individual user data within your organization’s AWS hosted data lake, and an analytics solution that uses different AWS storage layers, along with sample code targeting Amazon S3.

Reference architecture

To address the challenge of implementing a data purge framework, we reduced the problem to the straightforward use case of deleting a user’s data from a platform that uses AWS for its data pipeline. The following diagram illustrates this use case.

We’re introducing the idea of building and maintaining an index metastore that keeps track of the location of each user’s records and allows us locate to them efficiently, reducing the search space.

You can use the following architecture diagram to delete a specific user’s data within your organization’s AWS data lake.

For this initial version, we created three user flows that map each task to a fitting AWS service:

Flow 1: Real-time metastore update

The S3 ObjectCreated or ObjectDelete events trigger an AWS Lambda function that parses the object and performs an add/update/delete operation to keep the metadata index up to date. You can implement a simple workflow for any other storage layer, such as Amazon Relational Database Service (RDS), Amazon Aurora, or Amazon Elasticsearch Service (ES). We use Amazon DynamoDB and Amazon RDS for PostgreSQL as the index metadata storage options, but our approach is flexible to any other technology.

Flow 2: Purge data

When a user asks for their data to be deleted, we trigger an AWS Step Functions state machine through Amazon CloudWatch to orchestrate the workflow. Its first step triggers a Lambda function that queries the metadata index to identify the storage layers that contain user records and generates a report that’s saved to an S3 report bucket. A Step Functions activity is created and picked up by a Lambda Node JS based worker that sends an email to the approver through Amazon Simple Email Service (SES) with approve and reject links.

The following diagram shows a graphical representation of the Step Function state machine as seen on the AWS Management Console.

The approver selects one of the two links, which then calls an Amazon API Gateway endpoint that invokes Step Functions to resume the workflow. If you choose the approve link, Step Functions triggers a Lambda function that takes the report stored in the bucket as input, deletes the objects or records from the storage layer, and updates the index metastore. When the purging job is complete, Amazon Simple Notification Service (SNS) sends a success or fail email to the user.

The following diagram represents the Step Functions flow on the console if the purge flow completed successfully.

For the complete code base, see step-function-definition.json in the GitHub repo.

Flow 3: Batch metastore update

This flow refers to the use case of an existing data lake for which index metastore needs to be created. You can orchestrate the flow through AWS Step Functions, which takes historical data as input and updates metastore through a batch job. Our current implementation doesn’t include a sample script for this user flow.

Our framework

We now walk you through the two use cases we followed for our implementation:

  • You have multiple user records stored in each Amazon S3 file
  • A user has records stored in homogenous AWS storage layers

Within these two approaches, we demonstrate alternatives that you can use to store your index metastore.

Indexing by S3 URI and row number

For this use case, we use a free tier RDS Postgres instance to store our index. We created a simple table with the following code:

				userid TEXT,
				s3path TEXT,
				recordline INTEGER

You can index on user_id to optimize query performance. On object upload, for each row, you need to insert into the user_objects table a row that indicates the user ID, the URI of the target Amazon S3 object, and the row that corresponds to the record. For instance, when uploading the following JSON input, enter the following code:

{"user_id":"UgMW8bLE0QMJDCkQ1Ax5Mg","body ":"…"}

We insert the tuples into user_objects in the Amazon S3 location s3://gdpr-demo/year=2018/month=2/day=26/input.json. See the following code:

(“V34qejxNsCbcgD8C0HVk-Q”, “s3://gdpr-demo/year=2018/month=2/day=26/input.json”, 0)
(“ofKDkJKXSKZXu5xJNGiiBQ”, “s3://gdpr-demo/year=2018/month=2/day=26/input.json”, 1)
(“UgMW8bLE0QMJDCkQ1Ax5Mg”, “s3://gdpr-demo/year=2018/month=2/day=26/input.json”, 2)

You can implement the index update operation by using a Lambda function triggered on any Amazon S3 ObjectCreated event.

When we get a delete request from a user, we need to query our index to get some information about where we have stored the data to delete. See the following code:

SELECT s3path,
                FROM user_objects
                WHERE userid = ‘V34qejxNsCbcgD8C0HVk-Q’
                GROUP BY;

The preceding example SQL query returns rows like the following:

(“s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json“, {2102,529})

The output indicates that lines 529 and 2102 of S3 object s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json contain the requested user’s data and need to be purged. We then need to download the object, remove those rows, and overwrite the object. For a Python implementation of the Lambda function that implements this functionality, see deleteUserRecords.py in the GitHub repo.

Having the record line available allows you to perform the deletion efficiently in byte format. For implementation simplicity, we purge the rows by replacing the deleted rows with an empty JSON object. You pay a slight storage overhead, but you don’t need to update subsequent row metadata in your index, which would be costly. To eliminate empty JSON objects, we can implement an offline vacuum and index update process.

Indexing by file name and grouping by index key

For this use case, we created a DynamoDB table to store our index. We chose DynamoDB because of its ease of use and scalability; you can use its on-demand pricing model so you don’t need to guess how many capacity units you might need. When files are uploaded to the data lake, a Lambda function parses the file name (for example, 1001-.csv) to identify the user identifier and populates the DynamoDB metadata table. Userid is the partition key, and each different storage layer has its own attribute. For example, if user 1001 had data in Amazon S3 and Amazon RDS, their records look like the following code:

{"userid:": 1001, "s3":{"s3://path1", "s3://path2"}, "RDS":{"db1.table1.column1"}}

For a sample Python implementation of this functionality, see update-dynamo-metadata.py in the GitHub repo.

On delete request, we query the metastore table, which is DynamoDB, and generate a purge report that contains details on what storage layers contain user records, and storage layer specifics that can speed up locating the records. We store the purge report to Amazon S3. For a sample Lambda function that implements this logic, see generate-purge-report.py in the GitHub repo.

After the purging is approved, we use the report as input to delete the required resources. For a sample Lambda function implementation, see gdpr-purge-data.py in the GitHub repo.

Implementation and technology alternatives

We explored and evaluated multiple implementation options, all of which present tradeoffs, such as implementation simplicity, efficiency, critical data compliance, and feature completeness:

  • Scan every record of the data file to create an index – Whenever a file is uploaded, we iterate through its records and generate tuples (userid, s3Uri, row_number) that are then inserted to our metadata storing layer. On delete request, we fetch the metadata records for requested user IDs, download the corresponding S3 objects, perform the delete in place, and re-upload the updated objects, overwriting the existing object. This is the most flexible approach because it supports a single object to store multiple users’ data, which is a very common practice. The flexibility comes at a cost because it requires downloading and re-uploading the object, which introduces a network bottleneck in delete operations. User activity datasets such as customer product reviews are a good fit for this approach, because it’s unexpected to have multiple records for the same user within each partition (such as a date partition), and it’s preferable to combine multiple users’ activity in a single file. It’s similar to what was described in the section “Indexing by S3 URI and row number” and sample code is available in the GitHub repo.
  • Store metadata as file name prefix – Adding the user ID as the prefix of the uploaded object under the different partitions that are defined based on query pattern enables you to reduce the required search operations on delete request. The metadata handling utility finds the user ID from the file name and maintains the index accordingly. This approach is efficient in locating the resources to purge but assumes a single user per object, and requires you to store user IDs within the filename, which might require InfoSec considerations. Clickstream data, where you would expect to have multiple click events for a single customer on a single date partition during a session, is a good fit. We covered this approach in the section “Indexing by file name and grouping by index key” and you can download the codebase from the GitHub repo.
  • Use a metadata file – Along with uploading a new object, we also upload a metadata file that’s picked up by an indexing utility to create and maintain the index up to date. On delete request, we query the index, which points us to the records to purge. A good fit for this approach is a use case that already involves uploading a metadata file whenever a new object is uploaded, such as uploading multimedia data, along with their metadata. Otherwise, uploading a metadata file on every object upload might introduce too much of an overhead.
  • Use the tagging feature of AWS services – Whenever a new file is uploaded to Amazon S3, we use the Put Object Tagging Amazon S3 operation to add a key-value pair for the user identifier. Whenever there is a user data delete request, it fetches objects with that tag and deletes them. This option is straightforward to implement using the existing Amazon S3 API and can therefore be a very initial version of your implementation. However, it involves significant limitations. It assumes a 1:1 cardinality between Amazon S3 objects and users (each object only contains data for a single user), searching objects based on a tag is limited and inefficient, and storing user identifiers as tags might not be compliant with your organization’s InfoSec policy.
  • Use Apache Hudi – Apache Hudi is becoming a very popular option to perform record-level data deletion on Amazon S3. Its current version is restricted to Amazon EMR, and you can use it if you start to build your data lake from scratch, because you need to store your as Hudi datasets. Hudi is a very active project and additional features and integrations with more AWS services are expected.

The key implementation decision of our approach is separating the storage layer we use for our data and the one we use for our metadata. As a result, our design is versatile and can be plugged in any existing data pipeline. Similar to deciding what storage layer to use for your data, there are many factors to consider when deciding how to store your index:

  • Concurrency of requests – If you don’t expect too many simultaneous inserts, even something as simple as Amazon S3 could be a starting point for your index. However, if you get multiple concurrent writes for multiple users, you need to look into a service that copes better with transactions.
  • Existing team knowledge and infrastructure – In this post, we demonstrated using DynamoDB and RDS Postgres for storing and querying the metadata index. If your team has no experience with either of those but are comfortable with Amazon ES, Amazon DocumentDB (with MongoDB compatibility), or any other storage layer, use those. Furthermore, if you’re already running (and paying for) a MySQL database that’s not used to capacity, you could use that for your index for no additional cost.
  • Size of index – The volume of your metadata is orders of magnitude lower than your actual data. However, if your dataset grows significantly, you might need to consider going for a scalable, distributed storage solution rather than, for instance, a relational database management system.


GDPR has transformed best practices and introduced several extra technical challenges in designing and implementing a data lake. The reference architecture and scripts in this post may help you delete data in a manner that’s compliant with GDPR.

Let us know your feedback in the comments and how you implemented this solution in your organization, so that others can learn from it.


About the Authors

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas to a production-ready data product. Before AWS, he spent 3 years at Alexa Information domain as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.





Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives. Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.

Using the Amazon Redshift Data API to interact with Amazon Redshift clusters

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/using-the-amazon-redshift-data-api-to-interact-with-amazon-redshift-clusters/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL, and your existing ETL, Business Intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics. As a data engineer or application developer, for some use cases, you want to interact with Amazon Redshift to load or query data with a simple API endpoint without having to manage persistent connections. Now, with the general availability of the Amazon Redshift Data API, you can interact with Amazon Redshift without having to configure JDBC or ODBC. This makes it easier and more secure to work with Amazon Redshift and opens up new use cases.

This introduction explains how to use the Amazon Redshift Data API from the AWS Command Line Interface (CLI) and Python. We also explain how to use AWS Secrets Manager to store and retrieve credentials for the Data API.

Introducing the Data API

The Amazon Redshift Data API enables you to painlessly access data from Amazon Redshift with all types of traditional, cloud-native, and containerized, serverless web services-based applications and event-driven applications. The following diagram illustrates this architecture.

The Amazon Redshift Data API simplifies data access, ingest, and egress from programming languages and platforms supported by the AWS SDK such as Python, Go, Java, Node.js, PHP, Ruby, and C++.

The Data API simplifies access to Amazon Redshift by eliminating the need for configuring drivers and managing database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later. Your query results are stored for 24 hours. The Data API federates AWS Identity and Access Management (IAM) credentials so you can use identity providers like Okta or Azure Active Directory or database credentials stored in Secrets Manager without passing database credentials in API calls.

For customers using AWS Lambda, the Data API provides a secure way to access your database without the additional overhead for Lambda functions to be launched in an Amazon VPC. Integration with the AWS SDK provides a programmatic interface to run SQL statements and retrieve results asynchronously.

Relevant use cases

The Amazon Redshift Data API is not a replacement for JDBC and ODBC drivers, and is suitable for use cases where you don’t need a persistent connection to a cluster. It’s applicable in the following use cases:

  • Integrating web services-based applications to access data from Amazon Redshift using an API to run SQL statements. For example, you can run SQL from JavaScript.
  • Running a long-running query without having to wait for it to complete.
  • Running your query one time and retrieving the results multiple times without having to run the query again.
  • Building your ETL pipelines with AWS Step Functions, Lambda, and stored procedures.
  • Having simplified access to Amazon Redshift from Amazon SageMaker and Jupyter notebooks.
  • Building event-driven applications with Amazon EventBridge and Lambda.
  • Scheduling SQL scripts to simplify data load, unload, and refreshing of materialized views.

Creating an Amazon Redshift cluster

If you haven’t already created a Redshift cluster, or want to create a new one and aren’t sure how, follow these steps. In this post, we create a table and load data using the COPY command. Make sure that the IAM role you attach to your cluster has AmazonS3ReadOnlyAccess permission.

Prerequisites for using the Data API

You must be authorized to access the Amazon Redshift Data API. Amazon Redshift provides the RedshiftDataFullAccess managed policy, which offers full access to Amazon Redshift Data APIs. This policy also allows access to Amazon Redshift clusters, Secrets Manager, and IAM API operations needed to authenticate and access an Amazon Redshift cluster by using temporary credentials. If you want to use temporary credentials with the managed policy RedshiftDataFullAccess, you have to create one with the user name in the database as redshift_data_api_user.

You can also create your own IAM policy that allows access to specific resources by starting with RedshiftDataFullAccess as a template. For details, refer to the Amazon Redshift Cluster management guide.

The Data API allows you to access your database either using your IAM credentials or to use secrets stored in Secrets Manager. In this post, we use Secrets Manager.

For instructions on using database credentials for the Data API, see How to rotate Amazon Redshift credentials in AWS Secrets Manager.

Using the Data API from the AWS CLI

You can use Data API from the AWS CLI to interact with the Amazon Redshift cluster. For instructions on configuring AWS CLI, see Setting up the Amazon Redshift CLI. The Amazon Redshift command line interface (aws redshift) is a part of AWS CLI that lets you manage Amazon Redshift clusters, such as creating, deleting, and resizing. The Data API now provides a command line interface to the AWS CLI (redshift-data) that allows you to interact with the databases in an Amazon Redshift cluster.

Before we get started, ensure that you have the updated AWS SDK configured.

You can invoke help using the following command:

aws redshift-data help

The following table shows you different commands available with the Amazon Redshift Data API CLI.

list-databasesLists the databases in a cluster.
list-schemasLists the schemas in a database. You can filter this by a matching schema pattern.
list-tablesLists the tables in a database. You can filter the tables list by a schema name pattern, a matching table name pattern, or a combination of both.
describe-tableDescribes the detailed information about a table including column metadata.
execute-statementRuns a SQL statement, which can be DML, DDL, COPY, or UNLOAD.



Cancels a running query. To be canceled, a query must be in running state.
describe-statementDescribes the details about a specific SQL statement run. The information includes when the query started, when it finished, the number of rows processed, and the SQL statement.
list-statementsLists the SQL statements. By default, only finished statements are shown.

Fetches the temporarily cached result of the query. The result set contains the complete result set and the column metadata. You can paginate through a set of records to retrieve the entire result as needed.


If you want to get help on a specific command, you have to run the following command:

aws redshift-data list-tables help 

Now we look at how you can use these commands. First, get the secret key ARN by navigating to your key in Secrets Manager.

Listing databases

Most organizations use a single database in their Amazon Redshift cluster. You can use the following command to list the databases you have in your cluster. This operation requires you to connect to a database and therefore requires database credentials:

aws redshift-data list-databases  --cluster-identifier <your-cluster-id>--secret-arn  <replace-with-your-secret ARN> --database <your-db-name> --region us-west-2

Listing schema

Similar to listing databases, you can list your schemas by using the list-schemas command:

aws redshift-data list-schemas  --cluster-identifier <your-cluster-id> --secret-arn <replace-with-your-secret ARN> --database <your-db-name> --region us-west-2 

You have several schemas that match demo (demo, demo2, demo3, and so on). You can optionally provide a pattern to filter your results matching to that pattern:

aws redshift-data list-schemas --database <your-db-name> --cluster-identifier <your-cluster-id> --secret-arn <replace-with-your-secret ARN> --region <your-region> --schema-pattern "demo%"

Listing tables

The Data API provides a simple command, list-tables, to list tables in your database. You might have thousands of tables in a schema; the Data API lets you paginate your result set or filter the table list by providing filter conditions.

You can search across your schema with table-pattern; for example, you can filter the table list by all tables across all your schemas in the database. See the following code:

aws redshift-data list-tables --database dev --cluster-identifier <your-cluster-id>--secret-arn <your-secret-arn> --database <your-db-name> --region <your-region> --table-pattern "ven%"

You can filter your tables list in a specific schema pattern:

aws redshift-data list-tables --database <your-db-name> --cluster-identifier <your-cluster-id> --secret-arn <your-secret-arn> --region <your-region> --table-pattern "ven%" --schema-pattern demo

Running SQL commands

You can run your SELECT, DML, DDL, COPY, or UNLOAD command for Amazon Redshift with the Data API. You can optionally specify a name for your statement. You can optionally specify if you want to send an event to EventBridge after the query runs. The query is asynchronous, and you get a query ID after running a query.

Creating a schema

Let’s now use the Data API to see how you can create a schema. The following command will let you create a schema in your database. You do not have to run this SQL if you have pre-created the schema.

aws redshift-data execute-statement \
     --database <your-db-name>  \
     --cluster-identifier <your-cluster-id> \
     --secret-arn <your-secret-arn> \
     --sql "CREATE SCHEMA demo;" \
     --region <your-region>

The following shows an example output. We will discuss later how you can check the status of a SQL that you executed with execute-statement

    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-09-11T16:06:28.876000-07:00",
    "Database": "dev",
    "Id": "a6e0072b-4641-4e67-9105-aceb7f57266a",
    "SecretArn": "<Your-ARN->"

Creating a table

You can use the following command to create a table with the CLI.

aws redshift-data execute-statement \
    --database <your-db-name> \
    --cluster-identifier <your-cluster-id> \
    --secret-arn <your-secret-arn>  \
    --sql "CREATE TABLE demo.green_201601(
  vendorid                VARCHAR(4), \
  pickup_datetime         TIMESTAMP, \
  dropoff_datetime        TIMESTAMP, \
  store_and_fwd_flag      VARCHAR(1), \
  ratecode                INT, \
  pickup_longitude        FLOAT4, \
  pickup_latitude         FLOAT4, \
  dropoff_longitude       FLOAT4, \
  dropoff_latitude        FLOAT4, \
  passenger_count         INT, \
  trip_distance           FLOAT4, \
  fare_amount             FLOAT4, \
  extra                   FLOAT4, \
  mta_tax                 FLOAT4, \
  tip_amount              FLOAT4, \
  tolls_amount            FLOAT4, \
  ehail_fee               FLOAT4, \
  improvement_surcharge   FLOAT4, \
  total_amount            FLOAT4, \
  payment_type            VARCHAR(4),\
  trip_type               VARCHAR(4));" \
    --region <your-region>  

Loading sample data

The COPY command lets you load bulk data into your table in Amazon Redshift. You can use the following command to load data into the table we created earlier.

aws redshift-data execute-statement \
    --database <your-db-name> \
    --cluster-identifier <your-cluster-id> \
    --secret-arn <your-secret-arn>  \
    --region <your-region>  
    --sql "COPY demo.green_201601 \
FROM 's3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01' \
IAM_ROLE 'arn:aws:iam::<Your_ACCOUNT>:role/<YourRole>' \
REGION 'us-west-2';" 

Retrieving Data

The following query uses the table we created earlier:

aws redshift-data execute-statement \
    --database <your-db-name> \
    --cluster-identifier <your-cluster-id> \
    --secret-arn <your-secret-arn>  \
    --region <your-region> \
    --sql "SELECT ratecode, COUNT(*) FROM demo.green_201601 WHERE \
trip_distance > 5 GROUP BY 1 ORDER BY 1;"

If you’re fetching a large amount of data, using UNLOAD is recommended. You can unload data into Amazon Simple Storage Service (Amazon S3) either using CSV or Parquet format. UNLOAD uses the MPP capabilities of your Amazon Redshift cluster and is faster than retrieving a large amount of data to the client side.

The following shows an example output:

    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-09-01T09:39:45.693000-07:00",
    "Database": "dev",
    "Id": "fc4111f0-0e01-456a-83cf-d5922a8b100a",
    "SecretArn": "<your-secret-arn>"

You can fetch results using the query ID that you receive as an output of execute-statement.

Checking the status of a statement

You can check the status of your statement by using describe-statement. The output for describe-statement provides additional details such as PID, query duration, number of rows in and size of the result set, and the query ID given by Amazon Redshift. See the following command:

 aws redshift-data describe-statement \
    --id 76f59b84-34a1-481b-a37d-a7b7e1ea57dc \
    --region <your-region> 

The following is an example output:

    "ClusterIdentifier": "redshift-cluster-1",
    "CreatedAt": "2020-08-31T15:03:08.852000-07:00",
    "Duration": 82642162,
    "Id": "76f59b84-34a1-481b-a37d-a7b7e1ea57dc",
    "QueryString": " SELECT ratecode, COUNT(*) FROM demo.green_201601 WHERE trip_distance > 5 GROUP BY 1 ORDER BY 1;",
    "RedshiftPid": 27815,
    "RedshiftQueryId": 709972,
    "ResultRows": 6,
    "ResultSize": 186,
    "SecretArn": "<your-secret-arn>",
    "Status": "FINISHED",
    "UpdatedAt": "2020-08-31T15:03:09.505000-07:00"

Cancelling a running statement

If your query is still running, you can use cancel-statement to cancel a SQL query. See the following command:

aws redshift-data cancel-statement --id 39a0de2f-e85e-45ff-a0d7-cd074c348120        --region  <your-region> 

Fetching results from your query

You can fetch the query results by using get-statement-result. The query result is stored for 24 hours. See the following command:

aws redshift-data get-statement-result     --id 7b61da88-1b11-4ade-956a-21085a29118d     --region <your-region> 

The output of the result contains metadata such as the number of records fetched, column metadata, and a token for pagination.

Exporting Data

Amazon Redshift allows you to export from database tables to a set of files in an Amazon S3 bucket using the UNLOAD command with a SELECT statement. You can unload data in either text or Apache Parquet format. The following command shows you an example of how you can use the data lake export with the Data API:

aws redshift-data execute-statement --database <db-name> --cluster-identifier <cluster-name> \
--secret-arn <your-secret-arn> \ 
 --region <your-region> \
--sql "unload ('select * from demo.green_201601') to '<your-S3-bucket>' iam_role '<your-iam-role>'; " 

Using the Data API from the AWS SDK

You can use the Amazon Redshift Data API in any of the programming languages supported by AWS SDK. For this post, we use the AWS SDK for Python (boto3) as an example to illustrate the capabilities of the Data API.

We first import the boto3 package and establish a session:

def get_client(service, endpoint=None, region="us-west-2"):
    import botocore.session as bc
    session = bc.get_session()

    s = boto3.Session(botocore_session=session, region_name=region)
    if endpoint:
        return s.client(service, endpoint_url=endpoint)
    return s.client(service)

Getting a client object

You can create a client object from the boto3.Session object and using RedshiftData:

rsd = get_client('redshift-data')

If you don’t want to create a session, your client is as simple as the following code:

import boto3

client = boto3.client('redshift-data')

Running a statement

The following example code uses the Secrets Manager key to run a statement. For this post, we use the table we created earlier. You can use DDL, DML, COPY, and UNLOAD as a parameter:

resp = rsd.execute_statement(
    ClusterIdentifier="<replace-with-your-cluster-name> ",
    Sql="SELECT ratecode, COUNT(*) totalrides FROM demo.green_201601 WHERE trip_distance > 5 GROUP BY 1 ORDER BY 1;"

As we discussed earlier, running a query is asynchronous; running a statement returns an ExecuteStatementOutput, which includes the statement ID.

If you want to publish an event to EventBridge when the statement is complete, you can use the additional parameter WithEvent set to true:

resp = rsd.execute_statement(

Using IAM credentials

Amazon Redshift provides users to get temporary database credentials using GetClusterCredentials. We recommend you scope the access to a specific cluster and DB user if you are granting your users to use temporary credentials. Here is an example code that uses getting temporary IAM credentials. As you can see in the code, we are using the redshift_data_api_user. The managed policy RedshiftDataFullAccess scopes to use temporary credentials only to redshift_data_api_user.

def query(sql, cluster="redshift-cluster-1",database="dev",dbuser="redshift_data_api_user"):
     resp = rsd.execute_statement(

Describing a statement

You can use describe_statement to find the status of the query and number of records retrieved. See the following code:

desc = rsd.describe_statement(Id=id)
if desc["Status"] == "FINISHED":            

Fetching results from your query

You can use get_statement_result to retrieve results for your query if your query is complete. See the following code:

if desc and desc["ResultRows"]  > 0:
    result = rsd.get_statement_result(Id=qid)

The get_statement_result command returns a JSON object that includes metadata for the result, the actual result set. You might need to process to format the result if you want to display in a user-friendly format.

Fetching and formatting results

For this post, we demonstrate how to format the results with the Pandas framework. The post_process function processes the metadata and results to populate a data frame. The query function retrieves the result from a database in an Amazon Redshift cluster. See the following code:

import pandas as pd

def post_process(meta, records):
    columns = [k["name"] for k in meta]
    rows = []
    for r in records:
        tmp = []
        for c in r:
    return pd.DataFrame(rows, columns=columns)

def query(sql, cluster="redshift-cluster-1", user="awsuser", database="dev"):
    resp = rsd.execute_statement(
    qid = resp["Id"]
    desc = None
    while True:
        desc = rsd.describe_statement(Id=qid)
        if desc["Status"] == "FINISHED":
    if desc and desc["ResultRows"]  > 0:
        result = rsd.get_statement_result(Id=qid)
        rows, meta = result["Records"], result["ColumnMetadata"]
        return post_process(meta, rows)

pf=query("select venueid,venuename from venue  limit 100;")

In this post, we demonstrated the use of the Data API with Python. However, you can use the Data API with other programming languages supported by the AWS SDK.

Best practices

We recommend the following best practices when using the Data API:

  • Federate your IAM credentials to the database to connect with Amazon Redshift. Amazon Redshift allows users to get temporary database credentials with GetClusterCredentials. We recommend scoping the access to a specific cluster and DB user if you’re granting your users temporary credentials. For more information, see Example policy for using GetClusterCredentials.
  • Use a custom policy to provide fine-grained access to the Data API in the production environment if you don’t want your users to use temporary credentials. You have to use Secrets Manager to manage your credentials in such use cases.
  • Ensure that the record size that you retrieve is smaller than 64 KB.
  • Don’t retrieve a large amount of data to your client and use the UNLOAD command to export the query results to Amazon S3. You’re limited to retrieving only 100 MB of data with the Data API.
  • Don’t forget to retrieve your results within 24 hours; results are stored only for 24 hours.

Customer Feedback

Datacoral is a fast growing startup that offers an AWS-native data integration solution for analytics. Datacoral integrates data from databases, APIs, events and files into Amazon Redshift while providing guarantees on data freshness and data accuracy to ensure meaningful analytics. Using the Redshift API, they are able to create a completely event-driven and serverless platform that makes data integration and loading easier for our mutual customers. Founder and CEO Raghu Murthy said, “As an Amazon Redshift Ready Advanced Technology Partner, we have worked with the Redshift team to integrate their Redshift API into our product. The Redshift API provides the asynchronous component needed in our platform to submit and respond to data pipeline queries running on Amazon Redshift. It is the last piece of the puzzle for us to offer our customers a fully event-driven and serverless platform that is robust, cost-effective, and scales automatically. We are thrilled to be part of the launch”

Zynga Inc. is an American game developer running social video game services founded in April 2007. Zynga uses Amazon Redshift as its central data warehouse for game events, user, and revenue data. The data in the Amazon Redshift data warehouse is used for analytics, BI reporting, and AI/ML across all games and departments. Zynga wants to replace any programmatic access clients connected to Amazon Redshift with the new Amazon Redshift Data API. Currently, Zynga’s services connect using a wide variety of clients and drivers, and they plan to consolidate all of them. This will remove the need for Amazon Redshift credentials and regular password rotations. Johan Eklund, Senior Software Engineer, Analytics Engineering team in Zynga, who participated in the beta testing said, “Data API would be an excellent option for our services that will use Amazon Redshift programmatically. The main improvement would be authentication with IAM roles without having to involve the JDBC/ODBC drivers since they are all AWS hosted. Our most common service client environments are PHP, Python, Go plus a few more”.


In this post, we introduced you to the newly launched Amazon Redshift Data API. We also demonstrated how to use the Data API from the Amazon Redshift CLI and Python using the AWS SDK. We also provided best practices for using the Data API. To learn more, read the Amazon Redshift cluster management guide.

About the Authors

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry and has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).




Martin Grund is a Principal Engineer working in the Amazon Redshift team on all topics related to data lake (e.g. Redshift Spectrum), AWS platform integration and security.





Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective database to empower customers with data-driven decision making.




Daisy Yanrui Zhang is a software Dev Engineer working in the Amazon Redshift team on database monitoring, serverless database and database user experience.

Fast and predictable performance with serverless compilation using Amazon Redshift

Post Syndicated from Kiran Chinta original https://aws.amazon.com/blogs/big-data/fast-and-predictable-performance-with-serverless-compilation-using-amazon-redshift/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Customers tell us that they want extremely fast query response times so they can make equally fast decisions.

This post presents the recently launched, massively scalable serverless compilation capability for Amazon Redshift, which can now concurrently compile query segments with additional compute resources at no extra cost. We also share how our customers have enjoyed faster performance (in several cases, twice as fast) because of this new capability.

Amazon Redshift query compilation

When a query is sent to Amazon Redshift, the query processing engine parses it into multiple segments and compiles these segments to produce optimized object files that are processed during query execution. When similar or same queries are sent to Amazon Redshift, the corresponding segments are present in the cluster code compilation cache. Query segments that use already compiled code in the cache run faster because there’s no overhead of query compilation.

You can also accelerate your workloads of one-time and first-time queries, which don’t have query segments compiled in the cache. Depending on the query’s complexity, Amazon Redshift usually compiles those queries within seconds. However, some mission-critical workloads require even faster response time. This is where the massively scalable serverless compilation capability in Amazon Redshift makes a big difference.

Amazon Redshift serverless query compilation

Amazon Redshift breaks down a query into a set of segments, and each segment is a set of operations, such as SCAN or BUILD HASH TABLE. With the launch of the massively scalable serverless compilation capability, Amazon Redshift can now compile the query segments faster and in parallel because the compilation isn’t limited by the specific cluster being used and its available CPU and memory resources.

The Amazon Redshift compilation capability is managed with an external resource that your Amazon Redshift cluster uses based on your workload. During query processing, Amazon Redshift generates query segments and sends the segments that aren’t present in the cluster’s local cache to the external compilation farm to be compiled with massive parallelism. At the time of running the query, the segments are quickly fetched from the compilation service and saved in the cluster’s local cache for future processing. This makes sure that one-time and first-time queries are processed with high performance in a transparent way, without any additional cost.

Design and usage

The massively scalable serverless compilation capabilities benefit you whenever you need query compilation, especially with complex and highly concurrent workloads. The following are some specific use cases where this capability helps:

  • Dashboard applications that require fast query performance experience lower query compilation time, leading to improved user experience.
  • Dynamic one-time queries with new query segments that aren’t present in the code cache can be processed faster.
  • Scheduled ETL or reporting jobs with a strict SLA benefit from lower query compilation times.
  • Highly complex and concurrent workloads run with high performance without impacting the overall cluster performance.
  • Clusters that are resized, upgraded, or paused and resumed use the external code cache. No warmup is needed.

The following diagram illustrates the architecture of the Amazon Redshift serverless compilation.

Compilation improvements

Although the serverless compilation has already been improving query performance significantly since its launch, the Amazon Redshift team is working to further improve its effectiveness and performance. More recently, we announced an unlimited cache size to store compiled objects and increase cache hits across the Amazon Redshift fleet from 99.60% to 99.95%.

The following graph shows the percent cache hit that’s improved beyond the local cache over the releases.

Faster performance

During a standard maintenance window, an Amazon Redshift patch flushes the compilation cache. Before we launched the new compilation capabilities, your cluster’s performance was impacted after being patched during maintenance periods. Now, that performance impact is almost unnoticeable with this feature.

Many Amazon Redshift customers are benefiting from these performance improvements and saving time and cost for their Amazon Redshift environments. In this section, we share the stories of two organizations.


Aptos is the largest provider of enterprise software focused exclusively on retail. They use Amazon Redshift to power the analytics solution for retail clients. Jonathan Strohl, a cloud engineer on the Aptos team, shared this anecdote with us:

“Prior to last week’s Redshift maintenance, we sent our clients the typical notification letting them know to expect performance delays the following morning due to the object cache being flushed during the maintenance. However, the morning after the maintenance, a couple of our clients emailed back asking whether the maintenance had actually occurred, because there had been no noticeable delay. The performance delays they had previously noticed were now eliminated due to the serverless compilation recently released by Amazon Redshift. This is the best result we could have hoped for—our clients were unable to tell that a cache-flushing maintenance had even occurred!”


Manthan delivers BI, analytics, and artificial intelligence solutions to more than 200 leading retailers across 22 countries. Vijay Chidambaram, Head of Cloud Engineering at Manthan, shared the following with us:

“The normal ETL runtimes are around 90–100 minutes. The ETL runtime would go to around 290 minutes post an upgrade without the serverless compilation feature. That value has come down to about 150 minutes, which is a 2X improvement. Across the clusters, there is no increase in the ETL wall clock runtime compared to normal runtimes on day two and beyond.”


Intentwise is an Amazon Advertising optimization platform that empowers brands, sellers, and agencies with insights, automation, and expertise. They use Amazon Redshift to power the analytics for their SaaS offering. Raghavendra, a Software Architect at Intentwise, shared the following with us:

“The new serverless compilation feature improves the query compilation time by 3x. This makes Amazon Redshift an even more powerful data warehouse for our analytical platform because it continues to innovate to offer better performance and lower costs, all with no efforts on our end.”


This post explained how the massively scalable serverless compilation capability for Amazon Redshift works and gave examples of the benefits you can expect from the performance improvements. The capability is free and automatically enabled on all new and existing Amazon Redshift clusters.

For more information about Amazon Redshift query planning and workflow, see Query planning and execution workflow. For more information about improving query performance, see Factors affecting query performance.

About the Authors

Kiran Chinta is a Senior Software Development Engineer at Amazon Redshift. He has been working on distributed databases for over 13 years and has focused on high availability, disaster recovery, SQL language features and performance features for on-prem and cloud databases. In his spare time, he enjoys reading and playing various sports.





Naresh Chainani is a Senior Software Development Manager at Amazon Redshift. He leads Query Processing, Query Performance, Distributed Systems and Workload Management with a strong team. Naresh is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.





Maor Kleider is a product and database engineering leader for Amazon Redshift. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.





Quan Li is a Senior Database Engineer at Amazon Redshift. His focus is enabling customers to deliver maximum business value. Quan is passionate about optimizing high performance analytical databases. During his spare time, he enjoys traveling and experiencing different types of cuisines with his family.

How Aruba Networks built a cost analysis solution using AWS Glue, Amazon Redshift, and Amazon QuickSight

Post Syndicated from Siddharth Thacker original https://aws.amazon.com/blogs/big-data/how-aruba-networks-built-a-cost-analysis-solution-using-aws-glue-amazon-redshift-and-amazon-quicksight/

This is a guest post co-written by Siddharth Thacker and Swatishree Sahu from Aruba Networks.

Aruba Networks is a Silicon Valley company based in Santa Clara that was founded in 2002 by Keerti Melkote and Pankaj Manglik. Aruba is the industry leader in wired, wireless, and network security solutions. Hewlett-Packard acquired Aruba in 2015, making it a wireless networking subsidiary with a wide range of next-generation network access solutions.

Aruba Networks provides cloud-based platform called Aruba Central for network management and AI Ops. Aruba cloud platform supports thousands of workloads to support customer facing production environment and also a separate development platform for Aruba engineering.

The motivation to build the solution presented in this post was to understand the unit economics of the AWS resources used by multiple product lines across different organization pillars. Aruba wanted a faster, effective, and reliable way to analyze cost and usage data and visualize that into a dashboard. This solution has helped Aruba in multiple ways, including:

  • Visibility into costs – Multiple Aruba teams can now analyze the cost of their application via data surfaced with this solution
  • Cost optimization – The solution helps teams identify new cost-optimization opportunities by making them aware of the higher-cost resources with low utilization so they can optimize accordingly
  • Cost management – The Cloud DevOps organization, the group who built this solution, can effectively plan at the application level and have a direct positive impact on gross margins
  • Cost savings – With daily cost data available, engineers can see the monetary impact of right-sizing compute and other AWS resources almost immediately
  • Big picture as well as granular – Users can visualize cost data from the top down and track cost at a business level and a specific resource level

Overview of the solution

This post describes how Aruba Networks automated the solution, from generating the AWS Cost & Usage Report (AWS CUR) to its final visualization on Amazon QuickSight. In this solution, they start by configuring the CUR on their primary payer account, which publishes the billing reports to an Amazon Simple Storage Service (Amazon S3) bucket. Then they use an AWS Glue crawler to define and catalog the CUR data. As the new CUR data is delivered daily, the data catalog is updated, and the data is loaded into an Amazon Redshift database using Amazon Redshift Spectrum and SQL. The reporting and visualization layer is built using QuickSight. Finally, the entire pipeline is automated by using AWS Data Pipeline.

The following diagram illustrates this architecture.

Aruba prefers the AWS CUR Report to AWS Cost Explorer because AWS Cost Explorer provides usage information at a high level, and not enough granularity for detailed operations, such as data transfer cost. AWS CUR provides the most detailed information available about your AWS costs and usage at an hourly granularity. This allows the Aruba team to drill down the costs by the hour or day, product or product resource, or custom tags, enabling them to achieve their goals.

Aruba implemented the solution with the following steps:

  1. Set up the CUR delivery to a primary S3 bucket from the billing dashboard.
  2. Use Amazon S3 replication to copy the primary payer S3 bucket to the analytics bucket. Having a separate analytics account helps prevent direct access to the primary account.
  3. Create and schedule the crawler to crawl the CUR data. This is required to make the metadata available in the Data Catalog and update it quickly when new data arrives.
  4. Create respective Amazon Redshift schema and tables.
  5. Orchestrate an ETL flow to load data to Amazon Redshift using Data Pipeline.
  6. Create and publish dashboards using QuickSight for executives and stakeholders.

Insights generated

The Aruba DevOps team built various reports that provide the cost classifications on AWS services, weekly cost by applications, cost by product, infrastructure, resource type, and much more using the detailed CUR data as shown by the following screenshot.

For example, using the following screenshot, Aruba can conveniently figure out that compute cost is the biggest contributor compared to other costs. To reduce the cost, they can consider using various cost-optimization methods like buying reserved instances, savings plans, or Spot Instances wherever applicable.

Similarly, the following screenshot highlights the cost doubled compared to the first week of April. This helps Aruba to identify anomalies quickly and make informed decisions.

Setting up the CUR delivery

For instructions on setting up a CUR, see Creating Cost and Usage Reports.

To reduce complexity in the workflow, Aruba chose to create resources in the same region with hourly granularity, mainly to see metrics more frequently.

To lower the storage costs for data files and maximize the effectiveness of querying data with serverless technologies like Amazon Athena, Amazon Redshift Spectrum, and Amazon S3 data lake, save the CUR in Parquet format. The following screenshot shows the configuration for delivery options.

The following table shows some example CUR data.


Replicating the CUR data to your analytics account

For security purposes, other teams aren’t allowed to access the primary (payer) account, and therefore can’t access CUR data generated from that account. Aruba replicated the data to their analytics account and build the cost analysis solution there. Other teams can access the cost data without getting access permission for the primary account. The data is replicated across accounts by adding an Amazon S3 replication rule in the bucket. For more information, see Adding a replication rule when the destination bucket is in a different AWS account.

Cataloging the data with a crawler and scheduling it to run daily

Because AWS delivers all daily reports in a report date range report-prefix/report-name/yyyymmdd-yyyymmdd folder, Aruba uses AWS Glue crawlers to crawl through the data and update the catalog.

AWS Glue is a fully managed ETL service that makes it easy to prepare and load the data for analytics. Once the AWS Glue is pointed to the data stored on AWS, it discovers the data and stores the associated metadata (such as table definition and schema) in the Data Catalog. After the data is cataloged, the data is immediately searchable, queryable, and available for ETL. For more information, see Populating the AWS Glue Data Catalog.

The following screenshot shows the crawler created on Amazon S3 location of the CUR data.

The following code is an example table definition populated by the crawler.:

  `identity_line_item_id` string, 
  `identity_time_interval` string, 
  `bill_invoice_id` string, 
  `resource_tags_user_infra_role` string)

  `year` string, 
  `month` string )

ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

Transforming and loading using Amazon Redshift

Next in the analytics service, Aruba chose Amazon Redshift over Athena. Aruba has a use case to integrate cost data together with other tables already present in Amazon Redshift and hence using the same service makes it easy to integrate with their existing data. To further filter and transform data at the same time, and simplify the multi-step ETL, Aruba chose Amazon Redshift Spectrum. It helps to efficiently query and load CUR data from Amazon S3. For more information, see Getting started with Amazon Redshift Spectrum.

Use the following query to create an external schema and map it to the AWS Glue database created earlier in the Data Catalog:

--Choose a schema name of your choice, cur_redshift_external_schema name is just an example--
 create external schema cur_redshift_spectrum_external_schema from data catalog database 
 'aruba_curr_db' iam_role 'arn:aws:iam::xxxxxxxxxxxxx:role/redshiftclusterrole' 
 create external database if not exists;

The table created in the Data Catalog appears under the Amazon Redshift Spectrum schema. The schema, table, and records created can be verified with the following SQL code:

SELECT Count(*) 
FROM   cur_redshift_spectrum_external_schema.<TABLE>; 

--Query the right partition, year=2020 and month=2 is used an example
SELECT Count(*) 
FROM   cur_redshift_spectrum_external_schema.<TABLE> 
WHERE  year=2020 
AND    month=2;

Next, transform and load the data into the Amazon Redshift table. Aruba started by creating an Amazon Redshift table to contain the data. The following SQL code can be used to create the production table with the desired columns:

CREATE TABLE redshift_schema.redshift_table 
     usage_start_date TIMESTAMP, 
     usage_end_date   TIMESTAMP, 
     service_region   VARCHAR (256), 
     service_az       VARCHAR (256), 
     aws_resource_id  VARCHAR (256), 
     usage_amount     FLOAT (17), 
     charge_currency  VARCHAR (256), 
     aws_product_name VARCHAR (256), 
     instance_family  VARCHAR (256), 
     instance_type    VARCHAR (256), 
     unblended_cost   FLOAT (17), 
     usage_cost       FLOAT (17)

CUR is dynamic in nature, which means that some columns may appear or disappear with each update. When creating the table, we take static columns only. For more information, see Line item details.

Next, insert and update to ingest the data from Amazon S3 to the Amazon Redshift table. Each CUR update is cumulative, which means that each version of the CUR includes all the line items and information from the previous version.

The reports generated throughout the month are estimated and subject to change during the rest of the month. AWS finalizes the report at the end of each month. Finalized reports have the calculations for the blended and unblended costs, and cover all the usage for the month. For this use case, Aruba updates the last 45 days of data to make sure the finalized cost is captured. The below sample query can be used to verify the updated data:

-- Create Table Statement
 INSERT INTO redshift_schema.redshift_table
             Usage_Cost ) 
 SELECT line_item_usage_start_date, 
       case when line_item_type='Usage' then line_item_unblended_cost
            else 0
            end as usage_cost 
 FROM   cur_redshift_external_schema.cur_parquet_parquet
 WHERE  line_item_usage_start_date >= date_add('day', -45, getdate()) 
       AND line_item_usage_start_date < date_add('day', 1, getdate()); 

Using Data Pipeline to orchestrate the ETL workflow

To automate this ETL workflow, Aruba chose Data Pipeline. Data Pipeline helps to reliably process and move data between different AWS compute and storage services, as well as on-premises data sources. With Data Pipeline, Aruba can regularly access their data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR. Although the detailed steps of setting up this pipeline are out of scope for this blog, there is a sample workflow definition JSON file, which can be imported after making the necessary changes.

Data Pipeline workflow

The following screenshot shows the multi-step ETL workflow using Data Pipeline. Data Pipeline is used to run the INSERT query daily, which inserts and updates the latest CUR data into our Amazon Redshift table from the external table.

In order to copy data to Amazon Redshift,  RedshiftDataNode and RedshiftCopyActivity can be used, and then scheduled to run periodically.

Sharing metrics and creating visuals with QuickSight

To share the cost and usage with other teams, Aruba choose QuickSight using Amazon Redshift as the data source. QuickSight is a native AWS service that seamlessly integrates with other AWS services such as Amazon Redshift, Athena, Amazon S3, and many other data sources.

As a fully managed service, QuickSight lets Aruba to easily create and publish interactive dashboards that include ML Insights. In addition to building powerful visualizations, QuickSight provides data preparation tools that makes it easy to filter and transform the data into the exact needed dataset. As a cloud-native service, dashboards can be accessed from any device and embedded into applications and portals, allowing other teams to monitor their resource usage easily. For more information about creating a dataset, see Creating a Dataset from a Database. Quicksight Visuals can then be created from this dataset.

The following screenshot shows a visual comparison of device cost and count to help find the cost per device. This visual helped Aruba quickly identify the cost per device increase in April and take necessary actions.

Similarly, the following visualization helped Aruba identify an increase in data transfer cost and helped them decide to invest in rearchitecting their application.

The following visualization classifies the cost spend per resource.


In this post, we discussed how Aruba Networks was able to successfully achieve the following:

  • Generate CUR and use AWS Glue to define data, catalog the data, and update the metadata
  • Use Amazon Redshift Spectrum to transform and load the data to Amazon Redshift tables
  • Query, visualize, and share the data stored using QuickSight
  • Automate and orchestrate the entire solution using Data Pipeline

Aruba use this solution to automatically generate a daily cost report and share it with their stakeholders, including executives and cloud operations team.


About the Authors

Siddharth Thacker works in Business & Finance Strategy in Cloud Software division at Aruba Networks. Siddharth has Master’s in Finance with experience in industries like banking, investment management, cloud software and focuses on business analytics, margin improvement and strategic partnerships at Aruba. In his spare time, he likes exploring outdoors and participate in team sports.

Swatishree Sahu is a Technical Data Analyst at Aruba Networks. She has lived and worked in India for 7 years as an SME for SOA-based integration tools before coming to US to pursue her master’s in Business Analytics from UT Dallas. Breaking down and analyzing data is her passion. She is a Star Wars geek, and in her free time, she loves gardening, painting, and traveling.

Ritesh Chaman is a Technical Account Manager at Amazon Web Services. With 10 years of experience in the IT industry, Ritesh has a strong background in Data Analytics, Data Management, and Big Data systems. In his spare time, he loves cooking (spicy Indian food), watching sci-fi movies, and playing sports.




Kunal Ghosh is a Solutions Architect at AWS. His passion is to build efficient and effective solutions on the cloud, especially involving Analytics, AI, Data Science, and Machine Learning. Besides family time, he likes reading and watching movies, and is a foodie.

Top 10 performance tuning techniques for Amazon Redshift

Post Syndicated from Matt Scaer original https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/

Customers use Amazon Redshift for everything from accelerating existing database environments, to ingesting weblogs for big data analytics. Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. Amazon Redshift provides an open standard JDBC/ODBC driver interface, which allows you to connect your existing business intelligence (BI) tools and reuse existing analytics queries.

Amazon Redshift can run any type of data model, from a production transaction system third-normal-form model to star and snowflake schemas, data vault, or simple flat tables.

This post takes you through the most common performance-related opportunities when adopting Amazon Redshift and gives you concrete guidance on how to optimize each one.

What’s new

This post refreshes the Top 10 post from early 2019. We’re pleased to share the advances we’ve made since then, and want to highlight a few key points.

Query throughput is more important than query concurrency.

Configuring concurrency, like memory management, can be relegated to Amazon Redshift’s internal ML models through Automatic WLM with Query Priorities. On production clusters across the fleet, we see the automated process assigning a much higher number of active statements for certain workloads, while a lower number for other types of use-cases. This is done to maximize throughput, a measure of how much work the Amazon Redshift cluster can do over a period of time. Examples are 300 queries a minute, or 1,500 SQL statements an hour. It’s recommended to focus on increasing throughput over concurrency, because throughput is the metric with much more direct impact on the cluster’s users.

In addition to the optimized Automatic WLM settings to maximize throughput, the concurrency scaling functionality in Amazon Redshift extends the throughput capability of the cluster to up to 10 times greater than what’s delivered with the original cluster. The tenfold increase is a current soft limit, you can reach out to your account team to increase it.

Investing in the Amazon Redshift driver.

AWS now recommends the Amazon Redshift JDBC or ODBC driver for improved performance. Each driver has optional configurations to further tune it for higher or lower number of statements, with either fewer or greater row counts in the result set.

Ease of use by automating all the common DBA tasks.

In 2018, the SET DW “backronym” summarized the key considerations to drive performance (sort key, encoding, table maintenance, distribution, and workload management). Since then, Amazon Redshift has added automation to inform 100% of SET DW, absorbed table maintenance into the service’s (and no longer the user’s) responsibility, and enhanced out-of-the-box performance with smarter default settings. Amazon Redshift Advisor continuously monitors the cluster for additional optimization opportunities, even if the mission of a table changes over time. AWS publishes the benchmark used to quantify Amazon Redshift performance, so anyone can reproduce the results.

Scaling compute separately from storage with RA3 nodes and Amazon Redshift Spectrum.

Although the convenient cluster building blocks of the Dense Compute and Dense Storage nodes continue to be available, you now have a variety of tools to further scale compute and storage separately. Amazon Redshift Managed Storage (the RA3 node family) allows for focusing on using the right amount of compute, without worrying about sizing for storage. Concurrency scaling lets you specify entire additional clusters of compute to be applied dynamically as-needed. Amazon Redshift Spectrum uses the functionally-infinite capacity of Amazon Simple Storage Service (Amazon S3) to support an on-demand compute layer up to 10 times the power of the main cluster, and is now bolstered with materialized view support.

Pause and resume feature to optimize cost of environments

All Amazon Redshift clusters can use the pause and resume feature. For clusters created using On Demand, the per-second grain billing is stopped when the cluster is paused. Reserved Instance clusters can use the pause and resume feature to define access times or freeze a dataset at a point in time.

Tip #1: Precomputing results with Amazon Redshift materializes views

Materialized views can significantly boost query performance for repeated and predictable analytical workloads such as dash-boarding, queries from BI tools, and extract, load, transform (ELT) data processing. Data engineers can easily create and maintain efficient data-processing pipelines with materialized views while seamlessly extending the performance benefits to data analysts and BI tools.

Materialized views are especially useful for queries that are predictable and repeated over and over. Instead of performing resource-intensive queries on large tables, applications can query the pre-computed data stored in the materialized view.

When the data in the base tables changes, you refresh the materialized view by issuing the Amazon Redshift SQL statement “refresh materialized view“. After issuing a refresh statement, your materialized view contains the same data as a regular view. Refreshes can be incremental or full refreshes (recompute). When possible, Amazon Redshift incrementally refreshes data that changed in the base tables since the materialized view was last refreshed.

To demonstrate how it works, we can create an example schema to store sales information, each sale transaction and details about the store where the sales took place.

To view the total amount of sales per city, we create a materialized view with the create materialized view SQL statement (city_sales) joining records from two tables and aggregating sales amount (sum(sales.amount)) per city (group by city):

  SELECT st.city, SUM(sa.amount) as total_sales
  FROM sales sa, store st
  WHERE sa.store_id = st.id
  GROUP BY st.city

Now we can query the materialized view just like a regular view or table and issue statements like “SELECT city, total_sales FROM city_sales” to get the following results. The join between the two tables and the aggregate (sum and group by) are already computed, resulting in significantly less data to scan.

When the data in the underlying base tables changes, the materialized view doesn’t automatically reflect those changes. You can refresh the data stored in the materialized view on demand with the latest changes from the base tables using the SQL refresh materialized view command. For example, see the following code:

!-- let's add a row in the sales base table

INSERT INTO sales (id, item, store_id, customer_id, amount) 
VALUES(8, 'Gaming PC Super ProXXL', 1, 1, 3000);

SELECT city, total_sales FROM city_sales WHERE city = 'Paris'

|city |total_sales|
|Paris|        690|

!-- the new sale is not taken into account !!
-- let's refresh the materialized view

SELECT city, total_sales FROM city_sales WHERE city = 'Paris'

|city |total_sales|
|Paris|       3690|

!-- now the view has the latest sales data

The full code for this use case is available as a very simple demo is available as a gist in GitHub.

You can also extend the benefits of materialized views to external data in your Amazon S3 data lake and federated data sources. With materialized views, you can easily store and manage the pre-computed results of a SELECT statement referencing both external tables and Amazon Redshift tables. Subsequent queries referencing the materialized views run much faster because they use the pre-computed results stored in Amazon Redshift, instead of accessing the external tables. This also helps you reduce the associated costs of repeatedly accessing the external data sources, because you can only access them when you explicitly refresh the materialized views.

Tip #2: Handling bursts of workload with concurrency scaling and elastic resize

The legacy, on-premises model requires you to estimate what the system will need 3-4 years in the future to make sure you’re leasing enough horsepower at the time of purchase. But the ability to resize a cluster allows for right-sizing your resources as you go. Amazon Redshift extends this ability with elastic resize and concurrency scaling.

Elastic resize lets you quickly increase or decrease the number of compute nodes, doubling or halving the original cluster’s node count, or even change the node type. You can expand the cluster to provide additional processing power to accommodate an expected increase in workload, such as Black Friday for internet shopping, or a championship game for a team’s web business. Choose classic resize when you’re resizing to a configuration that isn’t available through elastic resize. Classic resize is slower but allows you to change the node type or expand beyond the doubling or halving size limitations of an elastic resize. 

Elastic resize completes in minutes and doesn’t require a cluster restart. For anticipated workload spikes that occur on a predictable schedule, you can automate the resize operation using the elastic resize scheduler feature on the Amazon Redshift console, the AWS Command Line Interface (AWS CLI), or API.

Concurrency scaling allows your Amazon Redshift cluster to add capacity dynamically in response to the workload arriving at the cluster.

By default, concurrency scaling is disabled, and you can enable it for any workload management (WLM) queue to scale to a virtually unlimited number of concurrent queries, with consistently fast query performance. You can control the maximum number of concurrency scaling clusters allowed by setting the “max_concurrency_scaling_clusters” parameter value from 1 (default) to 10 (contact support to raise this soft limit). The free billing credits provided for concurrency scaling is often enough and the majority of customers using this feature don’t end up paying extra for it. For more information about the concurrency scaling billing model see Concurrency Scaling pricing.

You can monitor and control the concurrency scaling usage and cost by creating daily, weekly, or monthly usage limits and instruct Amazon Redshift to automatically take action (such as logging, alerting or disabling further usage) if those limits are reached. For more information, see Managing usage limits in Amazon Redshift.

Together, these options open up new ways to right-size the platform to meet demand. Before these options, you needed to size your WLM queue, or even an entire Amazon Redshift cluster, beforehand in anticipation of upcoming peaks.

Tip #3: Using the Amazon Redshift Advisor to minimize administrative work

Amazon Redshift Advisor offers recommendations specific to your Amazon Redshift cluster to help you improve its performance and decrease operating costs.

Advisor bases its recommendations on observations regarding performance statistics or operations data. Advisor develops observations by running tests on your clusters to determine if a test value is within a specified range. If the test result is outside of that range, Advisor generates an observation for your cluster. At the same time, Advisor creates a recommendation about how to bring the observed value back into the best-practice range. Advisor only displays recommendations that can have a significant impact on performance and operations. When Advisor determines that a recommendation has been addressed, it removes it from your recommendation list. In this section, we share some examples of Advisor recommendations:

Distribution key recommendation

Advisor analyzes your cluster’s workload to identify the most appropriate distribution key for the tables that can significantly benefit from a KEY distribution style. Advisor provides ALTER TABLE statements that alter the DISTSTYLE and DISTKEY of a table based on its analysis. To realize a significant performance benefit, make sure to implement all SQL statements within a recommendation group.

The following screenshot shows recommendations regarding distribution keys.

If you don’t see a recommendation, that doesn’t necessarily mean that the current distribution styles are the most appropriate. Advisor doesn’t provide recommendations when there isn’t enough data or the expected benefit of redistribution is small.

Sort key recommendation

Sorting a table on an appropriate sort key can accelerate query performance, especially queries with range-restricted predicates, by requiring fewer table blocks to be read from disk.

Advisor analyzes your cluster’s workload over several days to identify a beneficial sort key for your tables. See the following screenshot.

If you don’t see a recommendation for a table, that doesn’t necessarily mean that the current configuration is the best. Advisor doesn’t provide recommendations when there isn’t enough data or the expected benefit of sorting is small.

Table compression recommendation

Amazon Redshift is optimized to reduce your storage footprint and improve query performance by using compression encodings. When you don’t use compression, data consumes additional space and requires additional disk I/O. Applying compression to large uncompressed columns can have a big impact on your cluster.

The compression analysis in Advisor tracks uncompressed storage allocated to permanent user tables. It reviews storage metadata associated with large uncompressed columns that aren’t sort key columns.

The following screenshot shows an example of table compression recommendation.

Table statistics recommendation

Maintaining current statistics helps complex queries run in the shortest possible time. The Advisor analysis tracks tables whose statistics are out-of-date or missing. It reviews table access metadata associated with complex queries. If tables that are frequently accessed with complex patterns are missing statistics, Amazon Redshift Advisor creates a critical recommendation to run ANALYZE. If tables that are frequently accessed with complex patterns have out-of-date statistics, Advisor creates a suggested recommendation to run ANALYZE.

The following screenshot shows a table statistics recommendation.

Tip #4: Using Auto WLM with priorities to increase throughput

Auto WLM simplifies workload management and maximizes query throughput by using ML to dynamically manage memory and concurrency, which ensures optimal utilization of the cluster resources

Amazon Redshift runs queries using the queuing system (WLM). You can define up to eight queues to separate workloads from each other.

Amazon Redshift Advisor automatically analyzes the current WLM usage and can make recommendations to get more throughput from your cluster. Periodically reviewing the suggestions from Advisor helps you get the best performance.

Query priorities is a feature of Auto WLM that lets you assign priority ranks to different user groups or query groups, to ensure that higher priority workloads get more resources for consistent query performance, even during busy times. It is a good practice to set up query monitoring rules (QMR) to monitor and manage resource intensive or runaway queries. QMR also enables you to dynamically change a query’s priority based on its runtime performance and metrics-based rules you define.

For more information on migrating from manual to automatic WLM with query priorities, see Modifying the WLM configuration.

It’s recommended to take advantage of Amazon Redshift’s short query acceleration (SQA). SQA uses ML to run short-running jobs in their own queue. This keeps small jobs processing, rather than waiting behind longer-running SQL statements. SQA is enabled by default in the default parameter group and for all new parameter groups. You can enable and disable SQA via a check box on the Amazon Redshift console, or by using the Amazon Redshift CLI.

If you enable concurrency scaling, Amazon Redshift can automatically and quickly provision additional clusters should your workload begin to back up. This is an important consideration when deciding the cluster’s WLM configuration.

A common pattern is to optimize the WLM configuration to run most SQL statements without the assistance of supplemental memory, reserving additional processing power for short jobs. Some queueing is acceptable because additional clusters spin up if your needs suddenly expand. To enable concurrency scaling on a WLM queue, set the concurrency scaling mode value to AUTO. You can best inform your decisions by reviewing the concurrency scaling billing model. You can also monitor and control the concurrency scaling usage and cost by using the Amazon Redshift usage limit feature.

In some cases, unless you enable concurrency scaling for the queue, the user or query’s assigned queue may be busy, and you must wait for a queue slot to open. During this time, the system isn’t running the query at all. If this becomes a frequent problem, you may have to increase concurrency.

First, determine if any queries are queuing, using the queuing_queries.sql admin script. Review the maximum concurrency that your cluster needed in the past with wlm_apex.sql, or get an hour-by-hour historical analysis with wlm_apex_hourly.sql. Keep in mind that increasing concurrency allows more queries to run, but each query gets a smaller share of the memory. You may find that by increasing concurrency, some queries must use temporary disk storage to complete, which is also sub-optimal.

Tip #5: Taking advantage of Amazon Redshift data lake integration

Amazon Redshift is tightly integrated with other AWS-native services such as Amazon S3 which let’s the Amazon Redshift cluster interact with the data lake in several useful ways.

Amazon Redshift Spectrum lets you query data directly from files on Amazon S3 through an independent, elastically sized compute layer. Use these patterns independently or apply them together to offload work to the Amazon Redshift Spectrum compute layer, quickly create a transformed or aggregated dataset, or eliminate entire steps in a traditional ETL process.

  • Use the Amazon Redshift Spectrum compute layer to offload workloads from the main cluster, and apply more processing power to the specific SQL statement. Amazon Redshift Spectrum automatically assigns compute power up to approximately 10 times the processing power of the main cluster. This may be an effective way to quickly process large transform or aggregate jobs.
  • Skip the load in an ELT process and run the transform directly against data on Amazon S3. You can run transform logic against partitioned, columnar data on Amazon S3 with an INSERT … SELECT statement. It’s easier than going through the extra work of loading a staging dataset, joining it to other tables, and running a transform against it.
  • Use Amazon Redshift Spectrum to run queries as the data lands in Amazon S3, rather than adding a step to load the data onto the main cluster. This allows for real-time analytics.
  • Land the output of a staging or transformation cluster on Amazon S3 in a partitioned, columnar format. The main or reporting cluster can either query from that Amazon S3 dataset directly or load it via an INSERT … SELECT statement.

Within Amazon Redshift itself, you can export the data into the data lake with the UNLOAD command, or by writing to external tables. Both options export SQL statement output to Amazon S3 in a massively parallel fashion. You can do the following:

  • Using familiar CREATE EXTERNAL TABLE AS SELECT and INSERT INTO SQL commands, create and populate external tables on Amazon S3 for subsequent use by Amazon Redshift or other services participating in the data lake without the need to manually maintain partitions. Materialized views can also cover external tables, further enhancing the accessibility and utility of the data lake.
  • Using the UNLOAD command, Amazon Redshift can export SQL statement output to Amazon S3 in a massively parallel fashion. This technique greatly improves the export performance and lessens the impact of running the data through the leader node. You can compress the exported data on its way off the Amazon Redshift cluster. As the size of the output grows, so does the benefit of using this feature. For writing columnar data to the data lake, UNLOAD can write partition-aware Parquet data.

Tip #6: Improving the efficiency of temporary tables

Amazon Redshift provides temporary tables, which act like normal tables but have a lifetime of a single SQL session. The proper use of temporary tables can significantly improve performance of some ETL operations. Unlike regular permanent tables, data changes made to temporary tables don’t trigger automatic incremental backups to Amazon S3, and they don’t require synchronous block mirroring to store a redundant copy of data on a different compute node. Due to these reasons, data ingestion on temporary tables involves reduced overhead and performs much faster. For transient storage needs like staging tables, temporary tables are ideal.

You can create temporary tables using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. The CREATE TABLE statement gives you complete control over the definition of the temporary table. The SELECT … INTO and C(T)TAS commands use the input data to determine column names, sizes and data types, and use default storage properties. Consider default storage properties carefully, because they may cause problems. By default, for temporary tables, Amazon Redshift applies EVEN table distribution with no column encoding (such as RAW compression) for all columns. This data structure is sub-optimal for many types of queries.

If you employ the SELECT…INTO syntax, you can’t set the column encoding, column distribution, or sort keys. The CREATE TABLE AS (CTAS) syntax instead lets you specify a distribution style and sort keys, and Amazon Redshift automatically applies LZO encoding for everything other than sort keys, Booleans, reals, and doubles. You can exert additional control by using the CREATE TABLE syntax rather than CTAS.

If you create temporary tables, remember to convert all SELECT…INTO syntax into the CREATE statement. This ensures that your temporary tables have column encodings and don’t cause distribution errors within your workflow. For example, you may want to convert a statement using this syntax:

SELECT column_a, column_b INTO #my_temp_table FROM my_table;

You need to analyze the temporary table for optimal column encoding:

Master=# analyze compression #my_temp_table;
Table | Column | Encoding
#my_temp_table | columb_a | lzo
#my_temp_table | columb_b | bytedict
(2 rows)

You can then convert the SELECT INTO a statement to the following:


column_a varchar(128) encode lzo,
column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b) -- Assuming you are sorting or grouping by column_b

INSERT INTO my_temp_table SELECT column_a, column_b FROM my_table;


If you create a temporary staging table by using a CREATE TABLE LIKE statement, the staging table inherits the distribution key, sort keys, and column encodings from the parent target table. In this case, merge operations that join the staging and target tables on the same distribution key performs faster because the joining rows are collocated. To verify that the query uses a collocated join, run the query with EXPLAIN and check for DS_DIST_NONE on all the joins.

You may also want to analyze statistics on the temporary table, especially when you use it as a join table for subsequent queries. See the following code:

ANALYZE my_temp_table;

With this trick, you retain the functionality of temporary tables but control data placement on the cluster through distribution key assignment. You also take advantage of the columnar nature of Amazon Redshift by using column encoding.

Tip #7: Using QMR and Amazon CloudWatch metrics to drive additional performance improvements

In addition to the Amazon Redshift Advisor recommendations, you can get performance insights through other channels.

The Amazon Redshift cluster continuously and automatically collects query monitoring rules metrics, whether you institute any rules on the cluster or not. This convenient mechanism lets you view attributes like the following:

  • The CPU time for a SQL statement (query_cpu_time)
  • The amount of temporary space a job might ‘spill to disk’ (query_temp_blocks_to_disk)
  • The ratio of the highest number of blocks read over the average (io_skew)

It also makes Amazon Redshift Spectrum metrics available, such as the number of Amazon Redshift Spectrum rows and MBs scanned by a query (spectrum_scan_row_count and spectrum_scan_size_mb, respectively). The Amazon Redshift system view SVL_QUERY_METRICS_SUMMARY shows the maximum values of metrics for completed queries, and STL_QUERY_METRICS and STV_QUERY_METRICS carry the information at 1-second intervals for the completed and running queries respectively.

The Amazon Redshift CloudWatch metrics are data points for use with Amazon CloudWatch monitoring. These can be cluster-wide metrics, such as health status or read/write, IOPS, latency, or throughput. It also offers compute node–level data, such as network transmit/receive throughput and read/write latency. At the WLM queue grain, there are the number of queries completed per second, queue length, and others. CloudWatch facilitates monitoring concurrency scaling usage with the metrics ConcurrencyScalingSeconds and ConcurrencyScalingActiveClusters.

It’s recommended to consider the CloudWatch metrics (and the existing notification infrastructure built around them) before investing time in creating something new. Similarly, the QMR metrics cover most metric use cases and likely eliminate the need to write custom metrics.

Tip #8: Federated queries connect the OLAP, OLTP and data lake worlds

The new Federated Query feature in Amazon Redshift allows you to run analytics directly against live data residing on your OLTP source system databases and Amazon S3 data lake, without the overhead of performing ETL and ingesting source data into Amazon Redshift tables. This feature gives you a convenient and efficient option for providing realtime data visibility on operational reports, as an alternative to micro-ETL batch ingestion of realtime data into the data warehouse. By combining historical trend data from the data warehouse with live developing trends from the source systems, you can gather valuable insights to drive real-time business decision making.

For example, consider sales data residing in three different data stores:

  • Live sales order data stored on an Amazon RDS for PostgreSQL database (represented as “ext_postgres” in the following external schema)
  • Historical sales data warehoused in a local Amazon Redshift database (represented as “local_dwh”)
  • Archived, “cold” sales data older than 5 years stored on Amazon S3 (represented as “ext_spectrum”)

We can create a late binding view in Amazon Redshift that allows you to merge and query data from all three sources. See the following code:

CREATE VIEW store_sales_integrated AS 
SELECT * FROM ext_postgres.store_sales_live 
SELECT * FROM local_dwh.store_sales_current 
SELECT ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, 
ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, ss_quantity, 
ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_discount_amt, 
ss_ext_sales_price, ss_ext_wholesale_cost, ss_ext_list_price, ss_ext_tax, 
ss_coupon_amt, ss_net_paid, ss_net_paid_inc_tax, ss_net_profit 
FROM ext_spectrum.store_sales_historical 

Currently, direct federated querying is supported for data stored in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL databases, with support for other major RDS engines coming soon. You can also use the federated query feature to simplify the ETL and data-ingestion process. Instead of staging data on Amazon S3, and performing a COPY operation, federated queries allow you to ingest data directly into an Amazon Redshift table in one step, as part of a federated CTAS/INSERT SQL query.

For example, the following code shows an upsert/merge operation in which the COPY operation from Amazon S3 to Amazon Redshift is replaced with a federated query sourced directly from PostgreSQL:


CREATE TEMP TABLE staging (LIKE ods.store_sales);

-- replace the following COPY from S3: 
   /*COPY staging FROM 's3://yourETLbucket/daily_store_sales/' 
   IAM_ROLE 'arn:aws:iam::<account_id>:role/<s3_reader_role>' 
-- with this federated query to load staging data directly from PostgreSQL source
INSERT INTO staging SELECT * FROM pg.store_sales p
    WHERE p.last_updated_date > (SELECT MAX(last_updated_date) FROM ods.store_sales);

DELETE FROM ods.store_sales USING staging s WHERE ods.store_sales.id = s.id;

INSERT INTO ods.store_sales SELECT * FROM staging;

DROP TABLE staging;


For more information about setting up the preceding federated queries, see Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query. For additional tips and best practices on federated queries, see Best practices for Amazon Redshift Federated Query.

Tip #9: Maintaining efficient data loads

Amazon Redshift best practices suggest using the COPY command to perform data loads of file-based data. Single-row INSERTs are an anti-pattern. The COPY operation uses all the compute nodes in your cluster to load data in parallel, from sources such as Amazon S3, Amazon DynamoDB, Amazon EMR HDFS file systems, or any SSH connection.

When performing data loads, compress the data files whenever possible. For row-oriented (CSV) data, Amazon Redshift supports both GZIP and LZO compression. It’s more efficient to load a large number of small files than one large one, and the ideal file count is a multiple of the cluster’s total slice count. Columnar data, such as Parquet and ORC, is also supported. You can achieve best performance when the compressed files are between 1MB-1GB each.

The number of slices per node depends on the cluster’s node size (and potentially elastic resize history). By ensuring an equal number of files per slice, you know that the COPY command evenly uses cluster resources and complete as quickly as possible. Query for the cluster’s current slice count with SELECT COUNT(*) AS number_of_slices FROM stv_slices;.

Another script in the amazon-redshift-utils GitHub repo, CopyPerformance, calculates statistics for each load. Amazon Redshift Advisor also warns of missing compression or too few files based on the number of slices (see the following screenshot):

Conducting COPY operations efficiently reduces the time to results for downstream users, and minimizes the cluster resources utilized to perform the load.

Tip #10: Using the latest Amazon Redshift drivers from AWS

Because Amazon Redshift is based on PostgreSQL, we previously recommended using JDBC4 PostgreSQL driver version 8.4.703 and psql ODBC version 9.x drivers. If you’re currently using those drivers, we recommend moving to the new Amazon Redshift–specific drivers. For more information about drivers and configuring connections, see JDBC and ODBC drivers for Amazon Redshift in the Amazon Redshift Cluster Management Guide.

While rarely necessary, the Amazon Redshift drivers do permit some parameter tuning that may be useful in some circumstances. Downstream third-party applications often have their own best practices for driver tuning that may lead to additional performance gains.

For JDBC, consider the following:

  • To avoid client-side out-of-memory errors when retrieving large data sets using JDBC, you can enable your client to fetch data in batches by setting the JDBC fetch size parameter or BlockingRowsMode.
  • Amazon Redshift doesn’t recognize the JDBC maxRows parameter. Instead, specify a LIMIT clause to restrict the result set. You can also use an OFFSET clause to skip to a specific starting point in the result set.

For ODBC, consider the following:

  • A cursor is enabled on the cluster’s leader node when useDelareFecth is enabled. The cursor fetches up to fetchsize/cursorsize and then waits to fetch more rows when the application request more rows.
  • The CURSOR command is an explicit directive that the application uses to manipulate cursor behavior on the leader node. Unlike the JDBC driver, the ODBC driver doesn’t have a BlockingRowsMode mechanism.

It’s recommended that you do not undertake driver tuning unless you have a clear need. AWS Support is available to help on this topic as well.


Amazon Redshift is a powerful, fully managed data warehouse that can offer increased performance and lower cost in the cloud. As Amazon Redshift grows based on the feedback from its tens of thousands of active customers world-wide, it continues to become easier to use and extend its price-for-performance value proposition. Staying abreast of these improvements can help you get more value (with less effort) from this core AWS service.

We hope you learned a great deal about making the most of your Amazon Redshift account with the resources in this post.

If you have questions or suggestions, please leave a comment.


About the Authors

Matt Scaer is a Principal Data Warehousing Specialist Solution Architect, with over 20 years of data warehousing experience, with 11+ years at both AWS and Amazon.com.






Manish Vazirani is an Analytics Specialist Solutions Architect at Amazon Web Services.







Tarun Chaudhary is an Analytics Specialist Solutions Architect at AWS.

Configure and optimize performance of Amazon Athena federation with Amazon Redshift

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/configure-and-optimize-performance-of-amazon-athena-federation-with-amazon-redshift/

This post provides guidance on how to configure Amazon Athena federation with AWS Lambda and Amazon Redshift, while addressing performance considerations to ensure proper use.

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Amazon Redshift as your data warehouse, you may want to integrate the two for a lake house approach. Lake House is the ability to integrate Data Lake and Data warehouse seamlessly. When you need to query your data lake from your Amazon Redshift Data warehouse, you can use Amazon Redshift Spectrum, which works great in unifying your data lake and data warehouse. However, when you use Athena in the data lake and need to access data in Amazon Redshift for the following two scenarios which are commonly seen, there is no easy approach:

  • Team A has a data lake in Amazon S3 and uses Athena. They need access to the data in an Amazon Redshift cluster owned by Team B.
  • Analysts using Athena to query their data lake for analytics need agility and flexibility to access data in an Amazon Redshift data warehouse without moving the data to Amazon S3 Data Lake.

In these scenarios, Athena federation with Amazon Redshift allows you to seamlessly access the data in your Amazon Redshift data warehouse without having to wait to unload the data to the Amazon S3 data lake, which removes the overhead in managing such jobs.

In this post, you walk through a step-by-step configuration to set up Athena federation using Lambda to access data in Amazon Redshift. You also see a performance benchmark analysis of interactive and ad hoc TPC-DS queries, and learn some key performance considerations and best practices when using federation.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface. The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.

Lambda lets you run code without provisioning or managing servers. You can run code for virtually any type of application with zero administration and only pay for when the code is running.

Amazon Redshift is a petabyte-scale data warehouse designed from the ground up, natively for the cloud. Amazon Redshift is the most popular and fastest cloud data warehouse. It’s integrated with your data lake, offers performance up to three times faster than any other data warehouse, and costs up to 75% less than any other cloud data warehouse.

The following diagram depicts all the data source connectors available as of this writing in the AWS Serverless Application Repository.

The AWS Serverless Application Repository is a managed repository for serverless applications. It enables you to store and share reusable applications, and easily assemble and deploy serverless architectures in powerful new ways.

You can also create a custom connector for sources that aren’t in the AWS Serverless Application Repository.


Before you get started, create a secret for the Amazon Redshift login ID and password using AWS Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Choose credentials for your Amazon Redshift cluster, and set your user name and password.
  4. Choose the cluster you want to use.
  5. For Secret name, enter a name for your secret. Use the prefix AthenaJDBCFederation so it’s easy to find.
  6. Leave the remaining fields at their defaults and choose Next.
  7. Complete your secret creation.

Setting up your S3 bucket

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, use the name myworkspace0009/athenafederation.

Configuring Athena federation with Amazon Redshift

To configure Athena federation with Amazon Redshift, complete the following steps:

  1. On the AWS Serverless Application Repository, choose Available applications.
  2. In the search field, enter athena federation.

  1. Choose
  2. In the Application settings section, provide the following details:
  3. Application nameAthenaRedshiftConnector
  4. SecretNamePrefixAthenaJdbcFederation
  5. SpillBucketmyworkspace0009/athenafederation
  6. JDBCConnectorConfigRedshift://jdbc:Redshift://<YourAmazon Redshift1Hostname>:5439/<DBName>?user=sample2&password=sample2
  7. DisableSpillEncyption – False
  8. LambdaFunctionNamerstpcds30
  9. SecurityGroupID – Security group ID where Amazon Redshift is deployed
  10. SpillPrefix – Leave default
  11. Subnetids – Use the subnets where Amazon Redshift is running with comma separation
  12. Select the I acknowledge check box.
  13. Choose Deploy.

In the next steps, you configure an Amazon Virtual Private Cloud (Amazon VPC) endpoint for Amazon S3 to allow Lambda to write federated query results to Amazon S3.

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. Choose the VPC for your endpoint.

  1. Make any necessary security changes as per your security requirements.

  1. Choose Create endpoint.

Running federated queries with Athena

To start running federated queries, complete the following steps:

  1. On the Athena console, choose Workgroups.
  2. If you don’t see a workgroup called AmazonAthenaPreviewFunctionality, create one.

When this feature becomes generally available, you won’t need to use this workgroup name.

  1. Run your queries, using lambda:rstpcds30 to run against tables in Amazon Redshift.

Athena query performance comparison

Several customers have asked us for performance insights and prescriptive guidance on how queries in Athena compare against federated queries and how to use them. In this section, we use a TPC-DS 3 TB standard dataset and a select few queries that fall in the category of ad hoc and interactive. The comparison of their performance should give you an idea of what to expect when running federated queries against Amazon Redshift.

For the following tests, we used a 3 TB TPC-DS dataset in Amazon S3 data lake with Parquet compressed, partitioned and served by Athena, and the same 3 TB TPC-DS dataset on Amazon Redshift cluster running four RA3.4XL nodes.

The following table summarizes the dataset sizes:

DatasetTable Size (Records)
store_sales8.6 billion
customer30 million
customer_address15 million
customer_demographics1.92 million

We ran the following four tests:

  • T1 – Queries ran in Athena without federation. All table data is in Amazon S3.
  • T2 – Queries ran in Athena with federation to Amazon Redshift. All table data is in Amazon S3, except the store_sales fact table in Amazon Redshift.
  • T3 – Queries ran in Athena with federation to Amazon Redshift. All tables and data are in Redshift.
  • T4 – Queries ran in Amazon Redshift without federation. All tables and data are in Redshift.

The following graph represents the performance of some of the ad hoc and interactive TPC-DS queries.

In the preceding graph, all T3 queries timed out at 900 seconds, depicted by the pink reference line, due to the Lambda 900-second timeout limit. This is due to overhead from store_sales fact data that needed to be transferred back to Athena.

The following graph removes T3 from the visualization, which gives better visibility when comparing the other tests.

Notice the query performance between T1 and T2 that completed in almost the same time while T4 queries ran significantly faster.

Amazon Redshift beats the performance of Athena in providing extremely low latency and should be the tool of choice if you’re looking for very low SLAs for analytics queries that Athena can’t achieve.

The following graph shows the data scanned in Amazon S3 for T1 and T2, which outlines why there isn’t much difference in query performance when compared to federated queries.

For the T2 federated queries, a small amount of dimension data is filtered in Amazon Redshift and brought back to Athena, instead of scanning the entire dimension tables. This is a typical nature for several ad hoc and interactive queries.

The performance of these TPC-DS queries between T1 and T2 is comparable because very little data is transferred back to Athena. You can see a similar behavior in several ad hoc and interactive query use cases because they use limited dimensions and scan a small subset of dimension data. Due to the 900-second timeout for the Lambda instances that connect to Amazon Redshift, it’s advised to minimize the amount of data the query brings back. Although Athena uses multiple Lambda instances in parallel to run your federated query, it’s also important to make sure the Amazon Redshift WLM queue has enough slots to process it, thereby not leading to queue wait time. For example, in some of the preceding queries, 20 Lambda executions were connecting to Amazon Redshift concurrently.

Key performance best practice considerations

When considering Athena federation with Amazon Redshift, you could take into account the following best practices:

  • Athena federation works great for queries with predicate filtering because the predicates are pushed down to Amazon Redshift. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from Amazon Redshift to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from Redshift to your Amazon S3 data lake.
  • Star schema is a commonly used data model in Amazon Redshift. In the star schema model, unload your large fact tables into your data lake and leave the dimension tables in Amazon Redshift. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Amazon Redshift WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Amazon Redshift cluster to benefit from concurrent connections to queue up.


In this post, you learned how to configure and use Athena federation with Amazon Redshift using Lambda. Now you don’t need to wait for all the data in your Amazon Redshift data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries. You can use the best practice considerations outlined in the post to minimize the data transferred from Amazon Redshift for better performance. When queries are well written for federation, the performance penalties are negligible, as observed in the TPC-DS benchmark queries in this post. Happy query federating!


About the Author

Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.



Speed up data ingestion on Amazon Redshift with BryteFlow

Post Syndicated from Pradnya Bhandary original https://aws.amazon.com/blogs/big-data/speed-up-data-ingestion-on-amazon-redshift-with-bryteflow/

This is a guest post by Pradnya Bhandary, Co-Founder and CEO at Bryte Systems.

Data can be transformative for an organization. How and where you store your data for analysis and business intelligence is therefore an especially important decision that each organization needs to make. Should you choose an on-premises data warehouse solution or embrace the cloud?

On-premises data warehouses require servers to be in your data center, a team to manage them, and a large initial investment that tries to accommodate current and future data needs. Cloud data warehouses, on the other hand, are fully managed and can start with a small investment and grow as your business demands. You don’t need to provision for the future, but for present needs. This knowledge brings peace of mind that extra capacity can be added overnight, if needed.

One such cloud data warehouse is Amazon Redshift. Amazon Redshift is the most popular cloud data warehouse. It’s a fully managed, petabyte-scale cloud-based data warehouse product designed for large-scale dataset storage and analysis. It provides agility, flexibility, and cost-effectiveness.

To use Amazon Redshift effectively, getting your data efficiently from various data silos to the data warehouse is critical, because this determines how quickly you can access the data. Do you have to wait until the data is loaded and make do with stale data, or can you access your data across the organization in near-real time to derive fresh and rich data insights?

In this post, we explain how Origin Energy followed some of the best practices for data ingestion for Amazon Redshift and how they achieved faster ingestion rates into Amazon Redshift using the BryteFlow software.

Origin Energy

Origin Energy, a leading energy provider, was finding that its on-premises data warehouse was struggling to support growing data demands. They also needed to unlock siloed data from legacy databases like SAP, Oracle, SQL Server, MySQL, and more. Access to data was fragmented and time consuming.

Moving to a cloud enterprise analytics environment with centralized data access was the only viable option to support their data initiatives. For more information, see the Origin Energy case study.

To speed up data ingestion on Amazon Redshift, they followed data ingestion best practices.

Log-based CDC mechanism to get data to Amazon Redshift

When data is replicated from a source database to a target that could be another database, data warehouse, or cloud data storage object, changes to the data in the source need to be captured and replicated to the destination to keep data consistent and trustworthy. Change data capture (CDC) makes this possible. CDC captures the changes in the source data and updates only the data in the destination that has changed. This does away with the tedious task of bulk load updating and enables real-time data integration. It’s a continual, extremely reliable process and has no impact on source systems.

Parallel multi-threaded initial sync

When doing an initial ingest of data, especially of exceptionally large datasets to Amazon Redshift, parallel, multi-thread syncing to replicate data is extremely helpful in cutting down total data ingestion time. Data replication proceeds in parallel on multiple threads with optimized extraction and loading.

Parallel multi-threaded log-based capture and merge for Oracle

When source systems generate a large amount of incremental data, and transactional logs are written very often, the CDC mechanism or mining of logs for incremental data can lag behind if it can’t keep up with the source throughput. You can configure BryteFlow to have multiple parallel threads for mining. Furthermore, you can configure BryteFlow so that the logs can be mined on a completely different server and therefore there is zero load on the source and data replication is much faster with parallelism. You can make the transaction logs available on a shared mount on a remote server you spin up, and mine logs on this server. This puts zero load on the source systems and operational systems and they aren’t impacted even with mining huge volumes of data (or volumes of incremental data).

Best practices for loading data to Amazon Redshift

The following are the three best practices for loading data into Amazon Redshift.

Split large files into multiple files for high-performance loads

Amazon Redshift is a massively parallel processing (MPP) data warehouse, where several compute nodes work in parallel to ingest the data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. The following table shows the different node types and the number of slices.

Node TypeDefault Slices per Node
Dense Compute DC2
Dense Storage DS2
RA3 Nodes

After you extract data into files, you can compress the files and split a single file to multiple files according to the number of slices, so that files are loaded with the compute being distributed evenly across the slices on Amazon Redshift. The number of multiple files is a configurable parameter in BryteFlow that can be set depending on the Amazon Redshift node types. The COPY command that ingests data into Amazon Redshift is configured optimally for fast data loads.

Automatic creation of tables, default distribution keys, and distribution style

Data is distributed among the nodes on the basis of distribution style and distribution key of a particular table in Amazon Redshift. An even distribution of data enables Amazon Redshift to assign the workload evenly to slices and maximizes the benefit of parallel processing. This also helps during data ingestion. When ingesting data, BryteFlow automatically creates tables with the right DDL on Amazon Redshift. It also creates default distribution keys and distribution style, so that table ingestion is highly performant.

Optimum sort keys used for optimum loads support efficient columnar storage

Data in the Amazon Redshift data warehouse is stored in a columnar fashion, which drastically reduces the number of disk I/O requests and minimizes the amount of data loaded into the memory to run a query. Reduction in I/O speeds up queries, and loading less data means Amazon Redshift can perform more in-memory processing. Using the optimum table sort keys is the best practice for efficient loads.

Automatic sync on Amazon Redshift

CDC is an important element of syncing data with Amazon Redshift. BryteFlow automatically merges changes on Amazon Redshift, with type2 history (if configured) with high performance. This means that data is ready to use as soon as it is ingested, without running lengthy ETL processes on it. The following diagram illustrates the type2 history feature.

Metadata for every extract and load can be captured on Amazon Aurora

Details on each extract and load process (for example, tables names), the number of records affected, the start and end times, and various other details is critical operational metadata that’s very useful in determining performance, tuning, and triggering other ETL processes. If this operational metadata is maintained on Amazon Redshift with every extract and load, constant single row inserts and updates can hamper performance drastically because Amazon Redshift is a columnar database and not an OLTP system.

At BryteFlow, we found that the best practice is to keep the operational metadata in an Amazon Aurora database, which is OLTP in nature and can store this metadata with constant updates and inserts and low latency.

Build your data lake on Amazon S3 and automatically query using Amazon Redshift Spectrum

BryteFlow enables you to build a continually refreshing data lake at scale on Amazon Simple Storage Service (Amazon S3) with continual replication and transformation of data. You can configure BryteFlow to use Amazon EMR on the incremental data with data on Amazon S3 and automatically merge and transform your data with an intuitive GUI. The EMR cluster can scale up or down depending on your data needs. You can then query data automatically on Amazon S3 using Amazon Redshift Spectrum.

Offload data ingestion and data preparation on Amazon S3 and load to Amazon Redshift

BryteFlow Blend helps with real-time data preparation of data ingested by BryteFlow, using Apache Spark on Amazon EMR with an intuitive GUI. You can load data prepared on Amazon S3 to Amazon Redshift via BryteFlow Blend or make it accessible to Amazon Redshift via Amazon Redshift Spectrum. This helps reserve the computational resources of Amazon Redshift for the actual querying (queries run much faster) while the Amazon S3 data lake handles data integration. The following diagram illustrates the distributed data integration architecture.

BryteFlow software supports all of the preceding ingestion best practices. Origin Energy used the BryteFlow software to build their analytics platform on Amazon Redshift. Origin’s data access has improved from several days to mere hours. BryteFlow software has helped achieve accurate data replication on the AWS Cloud with low latency, facilitating faster time-to-market for new and highly personalized customer offerings and a significant reduction in data costs.


Amazon Redshift delivers fast performance at scale for the most demanding workloads. Ingesting and preparing your data to Amazon Redshift using the BryteFlow software makes this an extremely attractive value proposition. You can liberate your data across data silos and quickly unlock the value on Amazon Redshift.

To see how this works for your project, you can get a free trial from our website. We offer complete support on your free trial including screen sharing, online support, and consultation.

Alternatively, you can go to AWS Marketplace for a free trial of BryteFlow Standard Edition or BryteFlow Enterprise Edition. Contact us to let us know so we can assist you through the trial.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Pradnya Bhandary is the Co-Founder and CEO at Bryte Systems.





Stream, transform, and analyze XML data in real time with Amazon Kinesis, AWS Lambda, and Amazon Redshift

Post Syndicated from Sakti Mishra original https://aws.amazon.com/blogs/big-data/stream-transform-and-analyze-xml-data-in-real-time-with-amazon-kinesis-aws-lambda-and-amazon-redshift/

When we look at enterprise data warehousing systems, we receive data in various formats, such as XML, JSON, or CSV. Most third-party system integrations happen through SOAP or REST web services, where the input and output data format is either XML or JSON. When applications deal with CSV or JSON, it becomes fairly simple to parse because most programming languages and APIs have direct support for CSV or JSON. But for XML files, we need to consider a custom parser, because the format is custom and can be very complex.

When systems interact with each other and process data through different pipelines, they expect real-time processing or availability of data, so that business decisions can be instant and quick. In this post, we discuss a use case where XMLs are streamed through a real-time processing system and can go through a custom XML parser to flatten data for easier business analysis.

To demonstrate the implementation approach, we use AWS cloud services like Amazon Kinesis Data Streams as the message bus, Amazon Kinesis Data Firehose as the delivery stream with Amazon Redshift data warehouse as the target storage solution, and AWS Lambda as record transformer of Kinesis Data Firehose, which flattens the nested XML structure with custom parser script in Python.

AWS services overview

This solution uses AWS services for the following purposes:

  • Kinesis Data Streams is a massively scalable and durable real-time data streaming service. It can continuously capture gigabytes of data per second from hundreds of thousands of sources, such as website click-streams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more. We use Kinesis Data Streams because it’s a serverless solution that can scale based on usage.
  • Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics tools. It can capture, transform, and load streaming data into Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), and Splunk, enabling near-real-time analytics with existing business intelligence (BI) tools and dashboards you’re already using today. It’s a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security. In our use case, our target storage layer is Amazon Redshift, so Kinesis Data Firehose fits great to simplify the solution.
  • Lambda is an event-driven, serverless computing platform provided by AWS. It’s a computing service that runs code in response to events and automatically manages the computing resources required by that code. Lambda supports multiple programming languages, and for our use case, we use Python 3.8. Other options include Amazon Kinesis Data Analytics with Flink, Amazon EMR with Spark streaming, Kinesis Data Firehose, or a custom application based on Kinesis consumer library. We use Kinesis Data Firehose as the consumer in this use case, with AWS Lambda as the record transformer, because our target storage is Amazon Redshift, which is supported by Kinesis Data Firehose.
  • Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. For our use case, we use Amazon S3 as an intermediate storage before loading to the data warehousing system, so that it’s fault tolerant and provides better performance while loading to Amazon Redshift. By default, Kinesis Data Firehose requests an intermediate S3 bucket path when Amazon Redshift is the target.
  • Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing BI tools. In our use case, we use Amazon Redshift so that BI tools like Amazon QuickSight can easily connect to Amazon Redshift to build real-time dashboards.

Architecture overview

The following diagram illustrates the simple architecture that you can use to implement the solution.

The architecture includes the following components:

  • The Amazon Kinesis Producer Library (KPL) represents the system that pushes data to Kinesis Data Streams. It can be a simple Amazon Elastic Compute Cloud (Amazon EC2) machine or your local windows command line that executes the Kinesis Data Streams command line interface (CLI) to push messages. Alternatively, it can be a dynamic application that uses Kinesis Data Streams APIs or KPL to push messages dynamically. For our use case, we spin up an EC2 instance through AWS Cloud9 and use Kinesis Data Streams CLI commands to publish messages.
  • Kinesis Data Streams receives messages against a partition key from the publisher and waits for consumers to consume it. By default, the retention period of the messages in Kinesis Data Streams is 24 hours, but you can extend it to 7 days.
  • Kinesis Data Firehose takes a few actions:
    • Consumes data from Kinesis Data Streams and writes the same XML message into a backup S3 bucket.
    • Invokes a Lambda function that acts as a record transformer. Lambda receives input as XML, applies transformations to flatten it to be pipe-delimited content, and returns it to Kinesis Data Firehose.
    • Writes the pipe-delimited content to another S3 bucket, which acts as an intermediate storage bucket before writing into Amazon Redshift.
    • Invokes the Amazon Redshift COPY command, which takes pipe-delimited data from the intermediate S3 bucket and writes it into Amazon Redshift.
  • Data is inserted into the Amazon Redshift table, which you can query for data analysis and reporting.

Solution overview

To implement this solution, you complete the following steps:

  1. Set up the Kinesis data stream as the message bus.
  2. Set up KPL, which publishes sample XML message data to Kinesis Data Streams.
  3. Create an Amazon Redshift cluster, which acts as target storage for the Firehose delivery stream.
  4. Set up the delivery stream, which uses Lambda for record transformation and Amazon Redshift as target storage.
  5. Customize a Lambda function script that converts the nested XML string to a flat pipe-delimited stream.


Before beginning this tutorial, make sure you have permissions to create Kinesis data streams and publish messages to the streams.

Setting up your Kinesis data stream

You can use the AWS Management Console to create a data stream as a one-time activity. You can configure the cluster capacity as per your requirement, but start with the minimum and apply auto scaling as the data volume increases. Auto scaling is based on Amazon CloudWatch metrics. For more information, see Scale Amazon Kinesis Data Streams with AWS Application Auto Scaling.

Setting up KPL

For this use case, we use the AWS Cloud9 environment IDE, where through the Linux command line, we can execute Kinesis Data Streams CLI commands to publish sample XML messages. The following code shows an example XML of an employee record that has one-level nesting for the all_addresses attribute:

aws kinesis put-record --stream-name <Stream-Name> --data "<employees><employee><first_name>FName 1</first_name><last_name>LName 1</last_name><all_address><address><type>primary</type><street_address>Street Address 1</street_address><state>State 1</state><zip>11111</zip></address><address><type>secondary</type><street_address>Street Address 2</street_address><state>State 2</state><zip>11112</zip></address></all_address><phone>111-111-1111</phone></employee><employee><first_name>FName 2</first_name><last_name>LName 2</last_name><all_address><address><type>primary</type><street_address>Street Address 3</street_address><state>State 3</state><zip>11113</zip></address><address><type>secondary</type><street_address>Street Address 4</street_address><state>State 4</state><zip>11114</zip></address></all_address><phone>111-111-1112</phone></employee></employees>" —partition-key <partition-key-name>

You need to change the stream name, XML data, and partition key in the preceding code as per your use case. Also, instead of an AWS Cloud9 environment, you have additional ways to submit messages to the data stream:

  • Use an EC2 instance to execute the Kinesis Data Streams CLI command
  • Use KPL or Kinesis Data Streams APIs in any programming language to submit messages dynamically through your custom application

Creating an Amazon Redshift cluster

In this step, you create an Amazon Redshift cluster that has required permissions and ports open for Kinesis Data Firehose to write to it. For instructions, see Controlling Access with Amazon Kinesis Data Firehose.

Make sure the cluster has the required port and permissions so that Kinesis Firehose can push data into it. Also make sure the table schema you create matches your pipe-delimited format that Lambda creates as output and Kinesis Data Firehose uses it to write to Amazon Redshift.

Setting up the delivery stream

When you create your Kinesis Data Firehose delivery stream on the console, define the source as Kinesis Data Streams, the target as the Amazon Redshift cluster, and enable record transformation with Lambda.

To complete this step, you need to create an AWS Identity and Access Management (IAM) role with the following permissions for the delivery stream:

  • Read permissions from the data stream
  • Write permissions to the intermediate S3 bucket
  • Write permissions to the defined Amazon Redshift cluster

Define the following configurations for the delivery stream:

  • Enable the source record transformation, where you selected your Lambda function.

  • As an optional step, you can enable source record backup, which saves the source XML to the S3 bucket path you define.

  • Define the intermediate S3 bucket, which you use to store transformed pipe-delimited records and later use for the Amazon Redshift copy.

  • In your Amazon Redshift configurations, for COPY options, make sure to specify DELIMITER ‘|’, because the Lambda function output is pipe delimited and Kinesis Data Firehose uses that in the Amazon Redshift copy operation.

Customizing the Lambda function

This function is invoked through Kinesis Data Firehose when the record arrives in Kinesis Data Streams.

Make sure you increase the Lambda execution timeout to more than 1 minute. See the following code:

from __future__ import print_function

import base64
import json
import boto3
import os
import time
import csv 
import sys

from xml.etree.ElementTree import XML, fromstring
import xml.etree.ElementTree as ET

print('Loading function')

def lambda_handler(event, context):
    output = []

    for record in event['records']:
        payload = base64.b64decode(record['data'])
        parsedRecords = parseXML(payload)
        # Do custom processing on the payload here
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(parsedRecords)

    print('Successfully processed {} records.'.format(len(event['records'])))
    return {'records': output}
def parseXML(inputXML):
    xmlstring =  str(inputXML.decode('utf-8'))
    # create element tree object
    root = ET.fromstring(str(xmlstring))
    #print("Root Tag"+root.tag)
    # create empty list for items 
    xmlItems = ""
    # iterate over employee records
    for item in root.findall('employee'):
       #print("child tag name:"+item.tag+" - Child attribute")
       # Form pipe delimited string, by concatenating XML values
       record = item.find('first_name').text + "|" + item.find('last_name').text + "|" + item.find('phone').text
       primaryaddress = ""
       secondaryaddress = ""
       # Get primary address and secondary address separately to be concatenated to the original record in sequence
       for addressitem in item.find('all_address').findall('address'):
           if(addressitem.find('type').text == "primary"):
               primaryaddress = addressitem.find('street_address').text + "|" + addressitem.find('state').text + "|" + addressitem.find('zip').text
           elif(addressitem.find('type').text == "secondary"):
               secondaryaddress = addressitem.find('street_address').text + "|" + addressitem.find('state').text + "|" + addressitem.find('zip').text
       #print("Primary Address:"+primaryaddress)
       #print("Secondary Address:"+secondaryaddress)
       record += "|" + primaryaddress + "|" + secondaryaddress + "\n"
       xmlItems += record
    #print("Final Transformed Output:"+xmlItems)
    return xmlItems.encode('utf-8')

You can customize this example code to embed your own XML parser logic. Keep in mind that, while using the function, the request and response (synchronous calls) body payload size can be up to 6 MB, so it’s important to make sure the return value isn’t increased over that limit.

Your Amazon Redshift table (employees) has respective fields to capture the flattened pipe-delimited data. Your query might look like the following code to fetch and read the data:

SELECT first_name, last_name, phone, primary_address_street, primary_address_state, primary_address_zip, secondary_address_street, secondary_address_state, secondary_address_zip
FROM employees

The following screenshot shows the result of the query in the Amazon Redshift query editor.


While setting up this framework in your development environment, you can debug individual components of the architecture with the following guidelines:

  • Use the Kinesis Data Streams Monitoring tab to validate that it receives messages and read operations are happening through the consumer (Kinesis Data Firehose). You can also use Kinesis Data Streams CLI commands to read from the stream.
  • Use the Kinesis Data Firehose Monitoring tab to check if it receives messages from Kinesis Data Streams and can push them to Amazon Redshift. You can also check for errors on the Error logs tab or directly on the Amazon CloudWatch console.
  • Validate Lambda with a test execution to check that it can transform records to pipe-delimited formats and return to Amazon Data Firehose with the expected format (base64 encoded format).
  • Confirm that the S3 intermediate storage bucket has the transformed record and doesn’t write into failed processing or error record paths. Also, check if the transformed records are pipe delimited and match the schema of the target Amazon Redshift table.
  • Validate if the backup S3 bucket has the original XML format records. If Lambda or the delivery stream fails, you have an approach to manually reprocess it.
  • Make sure Amazon Redshift has the new data records reflecting through SQL SELECT queries and check the cluster’s health on the Monitoring


This post showed you how to integrate real-time streaming of XML messages and flatten them to store in a data warehousing system for real-time dashboards.

Although you followed individual steps for each service in your development environment, for a production setup, consider the following automation methods:

  • AWS CloudFormation allows you to embed infrastructure as code that can spin up all required resources for the project, and you can easily migrate or set up your application in production or other AWS accounts.
  • A custom monitoring dashboard can take input from each AWS service you use through its APIs and show the health of each service with the number of records being processed.

Let us know in the comments any thoughts of questions you have about applying this solution to your use cases.


About the Author

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives. Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.



Scale your cloud data warehouse and reduce costs with the new Amazon Redshift RA3 nodes with managed storage

Post Syndicated from Corina Radovanovich original https://aws.amazon.com/blogs/big-data/scale-your-cloud-data-warehouse-and-reduce-costs-with-the-new-amazon-redshift-ra3-nodes-with-managed-storage/

One of our favorite things about working on Amazon Redshift, the cloud data warehouse service at AWS, is the inspiring stories from customers about how they’re using data to gain business insights. Many of our recent engagements have been with customers upgrading to the new instance type, Amazon Redshift RA3 with managed storage. In this post, we share experiences from customers using Amazon Redshift for the first time, and existing customers upgrading from DS2 (Dense Storage 2) and DC2 (Dense Compute 2) instances to gain improvements in performance and storage capacity for the same or lower costs.

From startups to global quick service restaurants and major financial institutions, Amazon Redshift customers span across all industries and sizes, including many Fortune 500 companies. We’re proud that Amazon Redshift breaks down cost and accessibility barriers of a data warehouse, so startups and non-profits can realize the same benefits as established enterprises from running analytics at scale with Amazon Redshift.

The diverse customer base also allows the Amazon Redshift team to continue to innovate with new features and capabilities that deliver the best price performance for any use case. Use cases range from analytics that help Britain’s railways run smoothly, to providing insight into the behavior of millions of people learning a new language, playing online games, learning to code, and much more. As the world around us responds to changes in every aspect of personal and business life, Amazon Redshift is helping tens of thousands of customers respond with fast and powerful analytics.

More and more customers have gravitated to Amazon Redshift because of continued innovation, including the new generation of Amazon Redshift nodes, RA3 with managed storage. This latest generation of Amazon Redshift is unique because it introduces the ability to independently scale compute and storage with Redshift managed storage, (RMS). This enables you to scale cost-effectively because you can add more data without increasing compute cost, or add more compute without increasing storage costs. This makes RA3 a cost-effective option for both steady and diverse data warehouse workloads, gives you room to grow, and maximizes performance.

New customers like Poloniex and OpenVault benefit from the flexibility of Amazon Redshift RA3

Many customers are growing and looking for a cloud data warehouse that can scale with them, easily integrate with other AWS services, and deliver great value. For customers like Poloniex and OpenVault, who are just getting started with Amazon Redshift, we recommend using the new RA3 nodes with managed storage. New customers like RA3 because you can size your data warehouse for your core workload and easily scale for spikes in users and data to balance performance and costs. For example, you can use concurrency scaling to automatically scale out when the number of queries suddenly spike up, or use elastic resize to scale up and add nodes to make queries run faster. If you’re using clusters intermittently, you can pause and resume on a schedule or manually. You can further reduce costs on steady state clusters by investing in reserved instances with a 1- or 3-year commitment.

Poloniex, one of the longest standing cryptocurrency trading platforms in the world, distributing hundreds of billions of dollars in cryptoassets, uses AWS to gain insights into how users interact with their platform and how they can improve the customer experience in trading, lending, storing, and distribution. They evaluated multiple data warehousing options and chose to work with AWS to design a lake house approach by querying and joining data across their Amazon Redshift data warehouse and Amazon Simple Storage Service (Amazon S3) data lake with the Amazon Redshift Spectrum feature.

“When we were evaluating data warehouses, we chose Amazon Redshift over Snowflake because of the transparent and predictable pricing,” says Peter Jamieson, Director of Analytics and Data Science at Poloniex. “The scalability and flexibility have been enormously valuable as we scale our analytics capability with a lean team and infrastructure. We benefit from the separation of compute and storage in the Amazon Redshift RA3 nodes because we have workflows that create a significant spike in our compute needs, especially when aggregating historical transaction data.”

Organizations are often looking to share the data and insight gained through analytics with their end-users as part of their product or service. The Software as a Service (SaaS) model enables this, and we work closely with SaaS customers to understand the value their data provides so they can use Amazon Redshift to unlock additional business value. Amazon Redshift is well positioned to build a scalable, multi-tenant SaaS solution with features that deliver consistent performance with multiple tenants sharing the same Amazon Redshift cluster.

OpenVault, a full-service technology solutions and data analytics company, enables cable, fiber, and mobile operators around the world to unlock the power of the data in their network to optimize and monetize their businesses. They shared a similar story:

“Amazon Redshift powers analytics in our SaaS solutions to provide insight that can be used to anticipate residential and business broadband trends,” says Tony Costa, EVP and CTO at OpenVault. “This makes it possible to use fast-growing broadband data to make decisions that result in revenue growth, new revenue streams, reduced operational/capital expenses, and improved quality of service for broadband operators. We chose Amazon Redshift RA3 because it is a cost-effective analytics and managed storage solution. It empowers OpenVault’s data scientists and operator customers to perform near real-time analysis of billions of rows of records and seamlessly evolve with the growing analytics needs and ad-hoc inquiries of our customers.”

If you’re new to Amazon Redshift, many resources are available to help you ramp up, including AWS employees and partners. For more information, see Getting Started with Amazon Redshift and Request Support for your Amazon Redshift Proof-of-Concept.

Duolingo, Social Standards, Yelp, Codecademy, and Nielsen get better performance and double the storage capacity at the same price by moving from Amazon Redshift DS2 to RA3

For years, customers with large data storage needs chose Amazon Redshift DS2 (Dense Storage 2) for its price-performance value. Customers such as NTT Docomo and Amazon.com ran petabyte-scale workloads in a single cluster on DS2 node types. However, as data size kept increasing exponentially, the amount of data actively being queried continued to become a smaller fraction of the total data size. You had to either keep adding nodes to store more data in the data warehouse, or retire data to Amazon S3 in a data lake. This creates operational overhead. With Amazon Redshift RA3, after the data is ingested in the cluster, it’s automatically moved to managed storage. RA3 nodes keep track of the frequency of access for each data block and cache the hottest blocks. If the blocks aren’t cached, the large networking bandwidth and precise storing techniques return the data in sub-seconds.

For customers like Duolingo, Social Standards, Yelp, and Codecademy, who are among the tens and thousands of customers already using Amazon Redshift, it’s easy to upgrade to RA3.

Duolingo is the most popular language-learning platform and the most downloaded education app in the world, with more than 300 million users. The company’s mission is to make education free, fun, and accessible to all. They upgraded from Amazon Redshift DS2 instances to the largest instance of RA3 to support their growing data.

“We use Amazon Redshift to analyze the events from our app to gain insight into how users learn with Duolingo,” says Jonathan Burket, a Senior Software Engineer at Duolingo. “We load billions of events each day into Amazon Redshift, have hundreds of terabytes of data, and that is expected to double every year. While we store and process all of our data, most of the analysis only uses a subset of that data. The new Amazon Redshift RA3 instances with managed storage deliver two times the performance for most of our queries compared to our previous DS2 instance-based Amazon Redshift clusters. The Amazon Redshift managed storage automatically adapts to our usage patterns. This means we don’t need to manually maintain hot and cold data tiers, and we can keep our costs flat when we process more data.”

For more information about how Duolingo uses Amazon Redshift, watch the session from AWS re:Invent 2019, How to scale data analytics with Amazon Redshift.

Amazon Redshift is designed to handle these high volumes of data that collectively uncover trends and opportunities. At Social Standards, a fast growing market analytics firm, Amazon Redshift powers the analytics that helps enterprises gain insights into collective social intelligence. The comparative analytics platform transforms billions of social data points into benchmarked insights about the brands, products, features, and trends that consumers are talking about.

“At Social Standards, we are creating the next generation of consumer analytics tools to discover and deliver actionable business insights with complete and authentic analysis of social data for strategic decision making, product innovation, financial analytics, and much more,” says Vladimir Bogdanov, CTO at Social Standards. “We use Amazon Redshift for near real-time analysis and storage of massive amounts of data. Each month we add around 600 million new social interactions and 1.2 TB of new data. As we look forward and continue to introduce new ways to analyze the growing data, the new Amazon Redshift RA3 instances proved to be a game changer. We moved from the Amazon Redshift DS2 instance type to RA3 with a quick and easy upgrade, and were able to increase our storage capacity by eight times, increase performance by two times, and keep costs the same.”

These performance and cost benefits also attracted the popular online reviews and marketplace company, Yelp, to upgrade from DS2 to RA3. Yelp’s mission is to connect people with great local businesses, and data mining and efficient data analysis are important in order to build the best user experience.

“We continue to adopt new Amazon Redshift features and are thrilled with the new RA3 instance type,” says Steven Moy, a Software Engineer at Yelp. “We have observed a 1.9 times performance improvement over DS2 while keeping the same costs and providing scalable managed storage. This allows us to keep pace with explosive data growth and have the necessary fuel to train our machine learning systems.”

For more information about how Yelp uses Amazon Redshift, watch the session from AWS re:Invent 2019, What’s new with Amazon Redshift, featuring Yelp.

As current health conditions shine a spotlight on online learning, many organizations are scaling and using data to guide decision-making. Codecademy uses Amazon Redshift to store all the growing data generated through customers’ use of their web application, including high-volume events such as page visits and button clicks. Their data science team uses this data to develop various statistical models, and by analyzing these models, improve the app based on how customers use it.

“Codecademy is an education company committed to teaching modern skills within technology and code, as well as a catalyst in the shift toward online learning,” says Doug Grove, Director of Infrastructure and Platform at Codecademy. “We were leveraging DS2.xls for our Amazon Redshift cluster and moved to RA3.4xls for performance gains. Moving to the RA3s resulted in a two times performance increase and cut data loading times in half. The separation of compute and storage allows us to scale independently, and allows for easier cluster maintenance.”

For many customers that started using a data warehouse on-premises and migrated to AWS, the scale and value of cloud continue to pay off. Nielsen, the global measurement and data analytics company, provides the most complete and trusted view of consumers and markets worldwide with operations in over 100 countries. A recent upgrade from DS2 to RA3 was the next step in their analytics journey, and helped them save costs, increase performance, and prepare for continued growth.

“We migrated from an on-premises data warehouse to Amazon Redshift in 2017 to optimize costs and to scale our solution to meet the growing demand,” says Sri Subramanian, Senior Manager of Technology at Nielsen. “Our data warehouse workloads run 24/7 at a scale of 1 billion rows per day. We recently migrated our Amazon Redshift cluster from DS2.8x to the new RA3.4x instance type. We have seen a performance gain of up to 40–50% on most of our workloads at a similar price point. Since the RA3 instance types separate compute and storage, disk utilization is no longer a concern. The upgrade was straightforward, and we went from proof of concept to solving complex business challenges quickly.”

These performance gains and productivity improvements are consistent themes from the feedback we’re getting from customers moving from DS2 to RA3. For more information about upgrading your workloads, see Overview of RA3 node types.

Rail Delivery Group, FiNC, and Playrix move from Amazon Redshift DC2 to RA3 to scale compute and storage independently for improved query performance and lower costs

Customers often chose DC2 (Dense Compute 2) for its superior query performance and low price. However, as the data sizes grew, clusters became bigger without the need for additional compute power. Many customers like Rail Delivery Group, FiNC, and Playrix are finding that by upgrading to RA3, they can get significantly more storage space and the same superior performance without increasing costs. For some use cases that need a large amount of raw computational power at the cheapest price and don’t require over 1 TB of data, DC2 provides industry-beating performance. However, if data is likely to grow to over 1 TB compressed, choosing RA3 node types and sizing for compute requirements is a much simpler and cheaper solution in the long run.

One company that found their storage needs growing faster than compute is Rail Delivery Group, a non-profit organization that brings together the companies that run Britain’s railway. They use Amazon Redshift to analyze rail industry data such as timetables, ticket sales, and smartcard usage.

“Since we started using Amazon Redshift for analytics in 2017, we have grown from 1 node to 10 nodes,” says Toby Ayre, Head of Data & Analytics at Rail Delivery Group. “Our data storage needs grew much faster than compute needs, and we had to keep unloading the data out of the data warehouse to Amazon S3. Now, with RA3.4xl nodes with managed storage, we can size for query performance and not worry about storage needs. Since we upgraded from a 10 node DC2.large cluster to a two node RA3.4xl cluster, our queries typically run 30% faster.”

Optimizing costs while also preparing for future growth are consistent requirements for our customers. For FiNC Technologies, the developer of the number one healthcare and fitness app in Japan, data drives a cycle of continuous improvement and enables them to deliver on their mission to provide personalized AI for everyone’s wellness. The personalized diet tutor, private gym, and wellness tracker app helps users make informed decisions about their health and well-being based on real-time metrics about their behavior.

“At FiNC Technologies, we rely on Amazon Redshift to manage KPIs to continuously improve our web services and apps,” says Komiyama, Kohei, a Data Scientist in FiNC. “We upgraded to the Amazon Redshift RA3 from DC2 because our storage needs were growing faster than our compute. We found it easy to upgrade, and like that our new data warehouse scales storage capacity automatically without any manual effort. Since upgrading, we’ve reduced operational costs by 70%, and feel prepared for future data growth.”

While FiNC optimized for growing storage, Playrix, one of the leading mobile game developers in the world, optimized for compute. With over $1 billion annual revenue and more than 2,000 global employees, Playrix builds popular games like Township, Fishdom, Gardenscapes, Homescapes, Wildscapes, and Manor Matters. They use data to better understand the customer journey.

“We rely on data from multiple internal and external sources to gain insight into user acquisition and make marketing decisions,” says Mikhail Artyugin, Technical Director at Playrix. “We moved our Amazon Redshift data warehouse from 20 nodes of DC2.xlarge to three nodes of RA3.4xl to future proof our system. We’re thrilled with the increase in computing power that makes it faster to deliver insight on the marketing data, and we have almost infinite storage space with managed storage, all for a reasonable price. The friendly and productive collaboration with AWS enterprise support and product team was an extra bonus.”


The Amazon Redshift RA3 nodes with managed storage deliver value to new customers like Poloniex and OpenVault, and to existing customers upgrading from DC2 and DS2 instances like Duolingo, Social Standards, Yelp, Codecademy, Nielsen, Rail Delivery Group, FiNC, and Playrix.

If you’re new to Amazon Redshift, check out our RA3 recommendation tool available on the AWS Management Console when you create a cluster. If you’re already an Amazon Redshift customer and you haven’t tried out RA3 yet, it’s easy to upgrade in minutes with a cross instance restore or elastic resize. If you have existing Amazon Redshift DC2 or DS2 Reserved Instances, you can contact us to get support with the upgrade. For more information about recommended RA3 node types and cluster sizes when upgrading from DC2 and DS2, see Overview of RA3 node types.

New features and capabilities for Amazon Redshift are released rapidly, and RA3 is set up for the new scale of data because with AQUA (Advanced Query Accelerator) for Amazon Redshift, performance will continue to improve. You can sign up for the preview of this innovative new hardware-accelerated cache, and the clusters running on RA3 will automatically benefit from AQUA when it’s released. We continue to innovate based on what we hear from our customers, so keep an eye on What’s New in Amazon Redshift to learn about our new releases.


About the authors

Corina Radovanovich leads product marketing for cloud data warehousing at AWS. She’s worked in marketing and communications for the biggest tech companies worldwide and specializes in cloud data services.





Himanshu Raja is a Principal Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.


Optimize Python ETL by extending Pandas with AWS Data Wrangler

Post Syndicated from Satoshi Kuramitsu original https://aws.amazon.com/blogs/big-data/optimize-python-etl-by-extending-pandas-with-aws-data-wrangler/

Developing extract, transform, and load (ETL) data pipelines is one of the most time-consuming steps to keep data lakes, data warehouses, and databases up to date and ready to provide business insights. You can categorize these pipelines into distributed and non-distributed, and the choice of one or the other depends on the amount of data you need to process.

Apache Spark is widely used to build distributed pipelines, whereas Pandas is preferred for lightweight, non-distributed pipelines. With the second use case in mind, the AWS Professional Service team created AWS Data Wrangler, aiming to fill the integration gap between Pandas and several AWS services, such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, AWS Glue, Amazon Athena, Amazon Aurora, Amazon QuickSight, and Amazon CloudWatch Log Insights.

AWS Data Wrangler is an open-source Python library that enables you to focus on the transformation step of ETL by using familiar Pandas transformation commands and relying on abstracted functions to handle the extraction and load steps.

You can use AWS Data Wrangler in different environments on AWS and on premises (for more information, see Install). This post focuses on data preparation for a data science project on Jupyter. By the end of this walkthrough, you will be able to set up AWS Data Wrangler on your Amazon SageMaker notebook.

Use case overview

In the following walkthrough, you use data stored in the NOAA public S3 bucket. For more information, see NOAA Global Historical Climatology Network Daily. The objective is to convert 10 CSV files (approximately 240 MB total) to a partitioned Parquet dataset, store its related metadata into the AWS Glue Data Catalog, and query the data using Athena to create a data analysis.

Configuring Amazon S3

Your first step is to create an S3 bucket to store the Parquet dataset.

  1. On the Amazon S3 console, choose Create bucket.
  2. For Bucket name, enter a name for your bucket.

  1. Choose Create.

Creating a new database in the Data Catalog

The Data Catalog is an Apache Hive-compatible managed metadata storage that lets you store, annotate, and share metadata on AWS.

For this use case, you use it to store the metadata associated with your Parquet dataset. The Data Catalog is integrated with many analytics services, including Athena, Amazon Redshift Spectrum, and Amazon EMR (Apache Spark, Apache Hive, and Presto).

  1. On the AWS Glue console, choose Databases.
  2. Choose Add database.
  3. For Database name, enter awswrangler_test.
  4. Choose Create.

Launching an Amazon SageMaker notebook

An Amazon SageMaker notebook is a managed instance running the Jupyter Notebook app. For this use case, you use it to write and run your code.

  1. On the Amazon SageMaker console, choose Notebook instance.
  2. Choose Create a notebook instance.
  3. For Notebook instance name, enter a name.
  4. For IAM role, choose an existing AWS Identity and Access Management (IAM) role or create a role that allows you to run Amazon SageMaker and grants access to Amazon S3, Athena, and AWS Glue for the related resources.

  1. Wait for the notebook status to show as InService.
  2. Choose Open Jupyter from the notebook instance you created.

Exploring the data

This section walks you through several notebook paragraphs to expose how to install and use AWS Data Wrangler.

  1. On Jupyter console, under New, choose conda_python3.
  2. To install AWS Data Wrangler, enter the following code:
    !pip install awswrangler

  3. To avoid dependency conflicts, restart the notebook kernel by choosing kernel -> Restart.
  4. Import the library given the usual alias wr:
    import awswrangler as wr

  5. List all files in the NOAA public bucket from the decade of 1880:

The following screenshot shows the output.

  1. Load the whole decade (10 files) into a Pandas DataFrame using the Amazon S3 prefix s3://noaa-ghcn-pds/csv/188:
    col_names = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
    df = wr.s3.read_csv(
        parse_dates=["dt", "obs_time"]  # Hint to parse these columns as date instead of strings

    The following screenshot shows the output.

  1. Create a new column extracting the year from the dt column (the new column is useful for creating partitions in the Parquet dataset):
    df["year"] = df["dt"].dt.year

The following screenshot shows the output.

  1. Store the Pandas DataFrame in the S3 bucket you created in the beginning of this post (replace the [BUCKET] placeholder in the code with your bucket name):

The preceding code creates the table noaa in the awswrangler_test database in the Data Catalog.

  1. After processing this, you can confirm the Parquet files exist in Amazon S3 and the table noaa is in AWS Glue data catalog. See the following code:
    wr.catalog.table(database="awswrangler_test", table="noaa")

The following screenshot shows the output.

  1. Run a SQL query from Athena that filters only the US maximum temperature measurements of the last 3 years (1887–1889) and receive the result as a Pandas DataFrame:
    sql = """
        (value / 10.0) AS temperature  -- Converting tenths of degrees C to regular degrees C
    FROM noaa
    WHERE year BETWEEN 1887 AND 1889  -- Only last 3 years (PARTITION filter)
    AND substr(id, 1, 2)='US'  -- Only U.S. stations
    AND element='TMAX'  -- Only Maximum temperature elements
    AND q_flag is NULL  -- Only HIGH quality measurement
    df = wr.athena.read_sql_query(sql, database="awswrangler_test")

The following screenshot shows the output.

The following two queries illustrate how you can visualize the data.

  1. To plot the average maximum temperature measured in the tracked station, enter the following code:
    %matplotlib inline

The following screenshot shows the output.

  1. To plot a moving average of the previous metric with a 30-day window, enter the following code:
    %matplotlib inline
    df.groupby("dt").mean().rolling(window=30, center=True).mean().plot();

The following screenshot shows the output.

Cleaning up

To avoid incurring future charges, delete the resources from the following services:

  1. AWS Glue database
    • On the AWS Glue console, choose the database you created.
    • From the Actions drop-down menu, choose Delete database.
    • Choose Delete.
  2. Amazon SageMaker notebook
    • On the Amazon SageMaker console, choose the notebook instance you created.
    • From the Actions drop-down menu, choose Stop.
    • When the status shows as Stopped, choose Database.
    • Choose Delete.
  3. S3 bucket
    • On the Amazon S3 console, choose Buckets.
    • Choose the bucket you created.
    • Choose Empty and enter your bucket name.
    • Choose Confirm.
    • Choose Delete and enter your bucket name.
    • Choose Delete bucket.
  4. IAM Role
    • On the IAM console, choose Roles.
    • Choose the role you attached to Amazon SageMaker.
    • Choose Delete role.
    • Choose Yes.


Installing AWS Data Wrangler is a breeze. With a single command, you can connect ETL tasks to multiple data sources and different data services. The library is a work in progress, with new features and enhancements added regularly. For more tutorials, see the GitHub repo.


About the Authors

Satoshi Kuramitsu is a Solutions Architect in AWS. His favorite AWS services are AWS Glue, Amazon Kinesis, and Amazon S3.






Igor Tavares is a Data & Machine Learning Engineer in the AWS Professional Services team and the original creator of AWS Data Wrangler.




Stream Twitter data into Amazon Redshift using Amazon MSK and AWS Glue streaming ETL

Post Syndicated from Jobin George original https://aws.amazon.com/blogs/big-data/stream-twitter-data-into-amazon-redshift-using-amazon-msk-and-aws-glue-streaming-etl/

Real-time analytics provide a point-in-time view for a variety of use cases. At the heart of any real-time solution is streaming data processing, especially when dynamic new content is being continually regenerated. Organizations might start using streaming data for simple analytics from logs or basic arithmetic dashboards, but eventually develop applications to perform more sophisticated forms of analysis, including machine learning, and extract deeper insights.

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. In addition to capabilities such as a Data Catalog, automated schema discovery, automated code generation, and deduplication of data, AWS Glue is serverless, and you don’t have to provision resources while paying for what you use. AWS Glue recently released serverless streaming ETL, which makes it easy to set up continuous ingestion pipelines that stream data from various sources using Amazon Kinesis and Apache Kafka and load data to data lakes, data warehouses, and other data stores while cleaning and enriching the data as needed.

This post demonstrates how customers, system integrator (SI) partners, and developers can use the serverless streaming ETL capabilities of AWS Glue with Amazon Managed Streaming for Kafka (Amazon MSK) to stream data to a data warehouse such as Amazon Redshift. We also show you how to view Twitter streaming data on Amazon QuickSight via Amazon Redshift.


Before AWS Glue streaming ETL, you had to stitch multiple components together. For example, to stream real-time data from a social media feed, you needed to use either Amazon MSK or Kinesis to load data, using a combination of AWS Lambda and Amazon Simple Storage Service (Amazon S3) with multiple staging buckets. With AWS Glue streaming ETL, you can now simplify your pipeline with reduced touchpoints that better allow you to focus on business outcomes rather than pipeline management.

For more information about how AWS Glue streaming ETL integrates with Amazon Kinesis, see New – Serverless Streaming ETL with AWS Glue.

The following architecture shows an end-to-end implementation of a streaming solution using Amazon MSK, AWS Glue streaming ETL, Amazon Redshift, and QuickSight.

To illustrate how to set up this architecture, we’ll walk you through the following steps:

  1. Deploying an AWS CloudFormation template to launch a three-node MSK cluster and a Kafka client with an instance of Apache NiFi running on it
  2. Creating a Kafka topic and accessing the Apache NiFi UI
  3. Configuring data streams from Twitter to Kafka using Apache NiFi
  4. Creating an Amazon Redshift cluster and a table to persist streaming data
  5. Creating Amazon MSK and Amazon Redshift tables on AWS Glue Data Catalog tables
  6. Authoring an AWS Glue streaming ETL job to load data to Amazon Redshift
  7. Visualizing data from Amazon Redshift in QuickSight
  8. Cleaning up your resources

By default, the CloudFormation template launches the cluster on kafka.m5.large nodes and the client instance on m5.2xlarge, but you may choose to configure it with the instance type appropriate for your use case.


Make sure to complete the following steps as prerequisites:

Launching your CloudFormation stack

To create your resources for this use case, complete the following steps:

  1. Launch your stack in us-east-1:

  2. On the Quick create stack page, for Stack Name, enter Twitter-MSK-Glue-Redshift-Blog.
  3. For KeyName, choose KeyPair.
  4. For SSH location, enter your IP to log in to the Kafka client instance.

To find your IP, use checkip.amazonaws.com.

  1. Choose Create stack.

The stack creation can take up to 15 minutes to complete.

  1. When the stack creation is complete, on the Stack Outputs tab, record the values of the following:
    1. KafkaNiFiEC2Instance
    2. MSKSecurityGroupID
    3. PrivateSubnetOne
    4. RedshiftEndpoint

Creating a Kafka topic and accessing the NiFi UI

With the .pem key, you can now SSH to the NiFi node and do a local port forwarding to access the web interface on your local computer. We use Apache NiFi to easily configure and pull data from Twitter and publish it to Amazon MSK without dealing with coding. You may use any tool to poll data from Twitter and publish to Amazon MSK.

  1. To access the NiFi UI from your local system, use the following command to set up an SSH tunnel into the NiFi instance (replace KafkaNiFiEC2Instance with the information from the AWS CloudFormation output) running on port 8888:
ssh -i ~/Downloads/KeyPair.pem -L 8888:Localhost:8888 [email protected]: <KafkaNiFiEC2Instance> 

This command allows you to access NiFi via a browser running on your local system. Leave that terminal open to remain connected.

  1. To create a Kafka topic, enter the following code in the terminal (replace ZookeeperConnectString with your Amazon MSK cluster ZooKeeper URL):
/opt/kafka/bin/kafka-topics.sh --create --zookeeper <ZookeeperConnectString> --replication-factor 3 --partitions 1 --topic CovidTweets 

For instructions on finding your ZooKeeper URL, see Getting the Apache ZooKeeper Connection String for an Amazon MSK Cluster.

The following screenshot shows the resulting output.

  1. While the terminal is connected, launch your browser and use the following URL to access NiFi UI: http://localhost:8888/nifi/.

You should be able to view the NiFi cluster UI (see the following screenshot). You can see four processors already added on to the canvas.

NiFi supports user authentication via client certificates, username and password, Apache Knox, or OpenId Connect. For this post, we keep it open without security configuration, but make sure you have robust security in place for your NiFi instances when used for your own use cases. For more information, see Security Configuration.

Configuring data streams from Twitter to Amazon MSK using Apache NiFi

The following steps take you through connecting to Twitter and pulling data related to Twitter handles without coding.

  1. Choose the GetTwitter
  2. On the Properties tab, enter the following Twitter credentials:
    1. API key
    2. API secret key
    3. Access token
    4. Access token secret
  3. Choose Apply.

For security purposes, you can’t read the credentials entered. Additionally, in these configurations, you filter Tweets based on the term COVID19. You can add a comma-separated list if you want to customize these values.

  1. Choose the PublishKafka
  2. On the Properties tab, for Kafka Brokers, enter your comma-separated SSL Amazon MSK bootstrap URL (running on port 9094).

For instructions on finding the broker URL, see Getting the Bootstrap Brokers for an Amazon MSK cluster.

The Topic Name value is already set to CovidTweets, but you can change the topic name if you prefer a different name.

  1. Choose Apply.

You’re now ready to start the flow and stream data from the Twitter API into the MSK cluster. However, before you start streaming Twitter data, create the Data Catalog tables and author the AWS Glue streaming job.

Creating an Amazon Redshift cluster and target table

As part of the AWS CloudFormation deployment, you create a single-node Amazon Redshift cluster. To create the target table for storing relevant fields extracted from Tweets, connect to the cluster and complete the following steps:

  1. On the Amazon Redshift console, connect to the query editor.
  2. Enter the following credentials:
    1. Cluster – Choose the cluster with endpoint noted earlier
    2. Database namestreaming-data
    3. Database userawsuser
    4. Database passwordStr0ngPas$wd
  3. On the query editor page, enter the following DDL command to create a table named msk_tweets:
create table msk_tweets(created_at VARCHAR(max),id_str VARCHAR(100),text VARCHAR(max), source VARCHAR(max),user_location VARCHAR(1000),hashtags1 VARCHAR(1000),hashtags2 VARCHAR(1000), lang VARCHAR(max))

You need to modify these tables and fields per the use case.

Creating the Amazon MSK and Amazon Redshift Data Catalog tables in AWS Glue

This section walks you through creating your connections to Amazon MSK and Amazon Redshift, crawling Amazon Redshift, and creating Data Catalog tables to use as the target for the AWS Glue streaming ETL job.

Creating the Amazon MSK connection

To create your Amazon MSK connection, complete the following steps:

  1. On the AWS Glue console, choose Catalog.
  2. Choose Connection.
  3. Choose Add connection.
  4. On the Set up your connection’s properties page, for Connection name, enter MSK_Connection.
  5. For Connection type, choose Kafka.
  6. For Kafka bootstrap server URLs, enter your Amazon MSK SSL bootstrap URL running on port 9094.

For instructions on finding your broker URL, see Getting the Bootstrap Brokers for an Amazon MSK Cluster.

  1. Choose Next.

  1. On the Set up access to your data store page, for VPC, choose the VPC containing the name MSK-GLUE-VPC.
  2. For Subnet, choose the subnet containing the name MMPrivateSubnetOne.
  3. For Security groups, select the group with the prefix MSK-Glue-Redshift-MSKSecurityGroup.
  4. Choose Next.

  1. Review the information and choose Finish.

Creating the Amazon Redshift connection

You’re now ready to create the Amazon Redshift connection.

  1. On the AWS Glue Data Catalog Connection page, choose Add connection.
  2. For Connection name, enter Redshift_Connection.
  3. For Connection type, choose Amazon Redshift.
  4. Choose Next.
  5. On the next page, choose the cluster you created as part of the CloudFormation stack.
  6. Enter the following information:
    1. Database namestreaming-data
    2. Usernameawsuser
    3. Password – Str0ngPas$wd
  7. Choose Next.
  8. Review the details and choose Finish.

You can test the connection when creation is complete.

Crawling the database

You can now create an AWS Glue Data Catalog for your Amazon Redshift table by crawling the database using the connection you just created.

  1. On the AWS Glue Data Catalog Crawlers page, choose Add crawlers.
  2. For Crawler name, enter Redshift_Crawler.
  3. Choose Next.
  4. Choose Data stores.
  5. Chose Next.
  6. On the Add a data store page, for Choose a data store, choose JDBC.
  7. For Connection, choose Redshift_Connection.
  8. For Include path, enter streaming-data.
  9. Choose Next.

  1. On the page asking if you want to add additional data stores, choose No.
  2. Choose Next.
  3. On the Choose IAM role page, choose Glue_Service Role (you created this as part of the CloudFormation stack).
  4. Choose Next.
  5. For Frequency, choose Run on demand.
  6. Choose Next.
  7. On the next page, select an AWS Glue database and choose Next.

If you don’t have a database, choose Add database and create one.

  1. Review the information and choose Finish.

When you’re prompted to run the crawler, choose the prompt. It may take a few minutes for the crawler to finish, after which you can verify in the Data Catalog table section that you have a table called streaming_data_public_msk_tweets with Amazon Redshift classification.

Creating your table

You can now create a table for the Amazon MSK topic.

  1. On the Catalog page, choose Tables.
  2. Choose Add tables.
  3. Choose Add tables manually.
  4. For Table name, enter msk_covidtweets.
  5. Choose the database you created earlier.
  6. Choose Next.
  7. On the Add a data store page, for Select the type of source, select Kafka.
  8. For Topic name, enter CovidTweets.
  9. For Connection, enter MSK_Connection.
  10. Choose Next.

  1. On the next page, for Classification, choose JSON.
  2. Choose Next.
  3. On the Define schema page, choose Add column.
  4. Add the following commas, with Data type as string:
    1. id_str
    2. created_at
    3. source
    4. text
    5. location
    6. hashtags[0].text
    7. hashtags[1].text
    8. lang

  1. Choose Next.
  2. Review and choose Finish.

Authoring an AWS Glue streaming ETL job

In the following steps, you author a streaming ETL job.

  1. On the AWS Glue console, choose Jobs.
  2. Choose Add job.
  3. For Name, enter MSK-Glue-Redshift.
  4. For IAM role¸ choose Glue_Service_role.
  5. For Type, choose Spark Streaming.
  6. For This job runs, select A proposed script generated by AWS Glue.
  7. Leave other fields at their default.
  8. Choose Next.

  1. On the Choose a data source page, select msk_covidtweets.

  1. Choose Next.
  2. On the Choose a data target page, select streaming_data_public_msk_tweets.

  1. Choose Next.
  2. On the map source columns to target, verify that the columns are mapped correctly.
  3. Choose Save job and edit script.
  4. On the next page, verify that on the last line of the script, windowSize is set to 5 seconds.
  5. Choose Save.
  6. Choose Run job.

The AWS Glue streaming ETL job may take a few minutes to start running, after which the streaming from Amazon MSK starts.

While you’re waiting, you can start the NiFi Twitter flow to publish messages to Amazon MSK.

Starting the NiFi flow to stream Twitter data

To start your NiFi flow to stream data from Twitter and publish it to Amazon MSK, complete the following steps:

  1. Navigate back to the NiFi UI running on http://localhost:8888/nifi/.
  2. Choose the canvas (right-click) and choose Start.

After the NiFi flow starts, you can see that the Twitter data is flowing from GetTwitter and is pushed to the MSK cluster using the PublishKafka processor. When the publish is successful, the data is pending in the success queue and is truncated after 60 seconds.

After the flow begins, data is published to Amaon MSK. The AWS Glue streaming ETL job loads the data into the Amazon Redshift table msk_tweets. You may notice that the data is being queued up in the success connection of Publish_to_MSK, indicating that the data was successfully published to Amazon MSK.

Visualizing Twitter data from Amazon Redshift using QuickSight

This section reviews the steps to visualize the data from the Twitter feed.

  1. Create a new analysis in QuickSight.
  2. Create a new dataset with Amazon Redshift as the source.
  3. Choose msk_tweets as the Amazon Redshift table.
  4. Choose the Custom SQL
  5. Enter the following query:
select cast(created_at as timestamp) as create_timestamp, extract(minute from cast(created_at as timestamp)) as extracted_minutes,id_str,text,split_part(split_part(source,'>',2),'<',1) as formatted_source,user_location,hashtags1,hashtags2,lang from msk_tweets;
  1. Choose the Directly query your data option to query real-time data directly from the database.
  2. Choose Visualize.

You can select various visual types such as stacked area line chart, pie chart, hash cloud, and bar charts on QuickSight to build the following dashboard.

For instructions on building a QuickSight dashboard, see Tutorial: Create a Dashboard. For more information about improving dashboard performance, see Speed up your ELT and BI queries with Amazon Redshift materialized views.

Cleaning up

To clean up your resources, delete the AWS Glue database, tables, crawlers, and job, and service role.

Additionally, be sure to clean up all other AWS resources that you created using AWS CloudFormation. You can delete these resources on the AWS CloudFormation console or via the AWS Command Line Interface (AWS CLI) by deleting the stack named Twitter-MSK-Glue-Redshift-Blog.


In this post, we demonstrated a use case for building a serverless and cost-effective ETL pipeline for streaming, which allows you focus on the outcomes of your analytics. The CloudFormation template gives you an easy way to set up the process, which you can further modify to meet your specific use case needs. You can also modify your serverless AWS Glue ETL code with transformations and mappings to ensure that only valid data gets loaded to your data store. With this solution, you can use AWS Glue streaming as a mechanism to solve your streaming ETL use cases.

Please let us know if you have comments about this post!


About the Authors

Jobin George is a Sr. Partner Solutions Architect at AWS. He has more than a decade of experience with designing and implementing large scale Big Data and Analytics solutions. He provides technical guidance, design advice and thought leadership to some of the key AWS customers and Big Data partners.





Mahesh Goyal is a Data Architect in Big Data at AWS. He works with customers in their journey to the cloud with a focus on big data and data warehouses. In his spare time, Mahesh likes to listen to music and explore new food places with his family.






Dilip Rajan is a Partner Solutions Architect at AWS. His role is to help partners and customers design and build solutions at scale on AWS. Before AWS, he helped Amazon Fulfillment Operations migrate their Oracle Data Warehouse to Redshift while designing the next generation big data analytics platform using AWS technologies.

Manage and control your cost with Amazon Redshift Concurrency Scaling and Spectrum

Post Syndicated from Vince Marchillo original https://aws.amazon.com/blogs/big-data/manage-and-control-your-cost-with-amazon-redshift-concurrency-scaling-and-spectrum/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.

This post shares the simple steps you can take to use the new Amazon Redshift usage controls feature to monitor and control your usage and associated cost for Amazon Redshift Spectrum and Concurrency Scaling features. Redshift Spectrum enables you to power a lake house architecture to directly query and join data across your data warehouse and data lake, and Concurrency Scaling enables you to support thousands of concurrent users and queries with consistently fast query performance.

Why this feature is important

With tens of thousands of customers, the Amazon Redshift team has the benefit of observing the workloads and behavior across a large variety of customers, including the internal teams at Amazon. We observed that across both internal and external customers, Amazon Redshift is running demanding workloads, such as extract, transform, and load (ETL) pipelines that condense several massive datasets, many of which are hundreds of terabytes in size, into single consumable tables for reporting and analytic purposes. These jobs take advantage of Concurrency Scaling to automatically scale Amazon Redshift query processing to handle burst workloads, and Redshift Spectrum to perform analytics on the transformed datasets by joining them with external data stored in a variety of open data formats in the data lake backed by Amazon Simple Storage Service (Amazon S3). Both these features are charged based on the usage, making cost management important, especially during peak periods when there is more activity for an organization. Although a combination of Amazon CloudWatch alarms and workload management (WLM) query monitoring rules can help to keep track and monitor usage, customers have asked to control their cost at the Amazon Redshift cluster level based on usage.

Usage limits to control Concurrency Scaling and Redshift Spectrum costs

With the new usage controls feature, you can now monitor and control the usage and associated costs for Redshift Spectrum and Concurrency Scaling. You can create daily, weekly, and monthly usage limits, and define actions to take if those limits are reached to maintain predictable spending. Actions include logging usage stats as an event to a system table, generating Amazon Simple Notification Service (Amazon SNS) alerts, and disabling Redshift Spectrum or Concurrency Scaling based on your defined thresholds. The new usage controls feature allows you to continue reaping the benefits provided by both Concurrency Scaling and Redshift Spectrum with the peace of mind that you can stay within budget simply by configuring the appropriate thresholds.

Setting up and managing usage controls

You can configure Amazon Redshift usage control options on the Amazon Redshift console or by using the AWS Command Line Interface (AWS CLI) or API. You can choose to set up to four limits per feature, allowing for multiple levels of logging or notifications before you disable Redshift Spectrum or Concurrency Scaling. The usage limit settings available are largely the same for both Concurrency Scaling and Redshift Spectrum usage—the main difference is that Concurrency Scaling usage limits are based on time spent (hours and minutes), while Redshift Spectrum usage limits are based on terabytes of data scanned. The fields you can adjust and select include the following:

  • Time – The time range for which your usage limits should be applied. You can choose daily, weekly, or monthly.
  • Usage limit – For Concurrency Scaling, this allows you to enter an integer value for hours and minutes to limit the amount of time this feature can be used before the usage limit kicks in. For Redshift Spectrum, you enter an integer value for the total number of terabytes you want to allow to be scanned before the limits apply.
  • Action – The action you want to take when your usage control limit has been reached. You can choose from Log to system table, Alert, or Disable feature. The Alert and Disable feature actions trigger a CloudWatch metric alarm and you can optionally set to send Amazon SNS-based notifications.

The ability to configure up to four limits per feature, combined with the three available actions that you can take, provides accurate visibility into your current usage and a way to generate metrics of your usage patterns. You can use the Disable feature option to easily prevent going over budget, and the Alert and Log actions can provide valuable insights, such as how you are currently using Redshift Spectrum and Concurrency Scaling. For example, configuring a usage limit with a daily time period and an action to log to a system table allows you to easily generate metrics on which days you had higher utilization of Redshift Spectrum or Concurrency Scaling. These metrics can provide insights into high-traffic days and potential areas where your pipelines can be adjusted to better distribute your traffic. Another option is to configure a weekly alert limit at 25% of your desired monthly usage as a way to ensure that you are within your monthly expected budget.

Setting usage control limits on the Amazon Redshift console

To set usage limits for Concurrency Scaling and Redshift Spectrum using the new Amazon Redshift console, perform the following steps:

  1. On the Amazon Redshift console, choose
  2. Select your desired cluster.
  3. From the Actions drop-down menu, choose Configure usage limit.

  1. To configure usage limits for Concurrency Scaling, choose Configure usage limit in the Concurrency scaling usage limit
  2. To configure usage limits for Redshift Spectrum, choose Configure usage limit in the Redshift Spectrum usage limit

  1. In the Configure usage limit section, select or deselect Concurrency scaling and Redshift Spectrum.

Selecting one of those options brings up the corresponding configuration windows.

  1. Choose a Time period (Daily, Weekly, or Monthly) from the drop-down menu.
  2. Enter your desired Usage limit.
  3. From the Action drop-down menu, choose an action (Alert, Log to system table, or Disable feature).

  1. To configure additional usage limits, choose Add another limit and action.
  2. When you have configured all your desired usage limits, choose Configure to confirm your usage limit settings.

Your configurations are now visible in the usage limit dashboard.

Managing usage control limits via the Amazon Redshift console

You can edit and delete usage limits on the Amazon Redshift console. The Edit option allows you to add limits or modify existing limit settings, and the Delete option deletes all configured limits for the corresponding service. To manage your configurations, perform the following steps:

  1. On the Amazon Redshift console, choose
  2. Select your desired cluster.
  3. From the Actions drop-down menu, choose Configure usage limit.
  4. To edit your existing usage limit configurations, choose Edit in the corresponding service box.

The editing option lets you add a new usage limit, remove a usage limit, or modify an existing usage limit and corresponding action.

To modify the time period of an existing usage limit, you can remove and add it as a new usage limit.

To delete your Concurrency Scaling limits, choose Delete usage limit in the Concurrency scaling usage limit section.

To delete your Redshift Spectrum limits, choose Delete usage limit in the Redshift Spectrum usage limit section. Choosing Delete usage limit removes all limits configured for that service.

Setting usage control limits via the AWS CLI

You can also use the AWS CLI to add, edit, describe, or remove usage control configurations. The following examples outline the required CLI commands for each use case:

  • create-usage-limit – This command adds a new usage limit configuration for your Amazon Redshift cluster. The command should include the following parameters:
    • –cluster-identifier – The name of the cluster on which to apply the usage control.
    • –period – The time range for your usage limit. You can enter daily, weekly, or monthly for this parameter.
    • –feature-type – The service to which you want to apply this usage control. You can enter spectrum or concurrency-scaling for this parameter.
    • –limit-type – For Redshift Spectrum, this parameter should be set to data-scanned. For Concurrency Scaling, this should be set to time.
    • –amount – For Redshift Spectrum, this parameter should equal the total terabytes allowed to be scanned in increments of 1 TB. For Concurrency Scaling, this parameter should be set to the total minutes (on the console, you can do this in hh:mm) allowed before limits actions are applied.
    • –breach-action – The action to take when you reach your configured limit. Possible values are log, emit-metric, or disable. emit-metric sends metrics for CloudWatch.

See the following example code:

aws redshift create-usage-limit --cluster-identifier <yourClusterIdentifier> --period <daily|weekly|monthly> --feature-type <spectrum|concurrency-scaling> --limit-type <data-scanned|time> --amount <yourDesiredAmount> --breach-action <log|emit-metric|disable>
  • describe-usage-limits – This command returns a JSON response that lists the configured usage limits for the cluster you choose. The response includes all the configurable fields, such as the limit type and breach actions, and includes a usage limit ID, which is required for the modify and delete commands. The describe command should include the following parameter:
    • –cluster-identifier – The cluster identifier for which you want to obtain the configured usage limits.

See the following example code:

aws redshift describe-usage-limits --cluster-identifier <yourClusterIdentifier>
    "UsageLimits": [
            "LimitType": "data-scanned",
            "Period": "daily",
            "BreachAction": "log",
            "FeatureType": "spectrum",
            "UsageLimitId": "4257b96e-5b12-4348-adc2-4922d2ceddd2",
            "Amount": 1,
            "ClusterIdentifier": "cost-controls-demo"
  • modify-usage-limit – This command allows you to modify an existing usage limit configuration on your Amazon Redshift cluster. This command requires the UsageLimitID for the limit you want to modify, which you can obtain by running the describe-usage-limits The modify-usage-limit command should include the following parameters:
    • –usage-limit-id – The ID of the usage limit that you want to modify. You can obtain this by running the describe-usage-limits.
    • –amount – The new value for your limit threshold.
    • –breach-action – The new action to take if you reach your limit threshold.

See the following example code:

aws redshift modify-usage-limit --usage-limit-id "<yourUsageLimitID>" --amount <newAmount> --breach-action <newBreachAction>
  • delete-usage-limit – This command deletes a configured usage limit from your Amazon Redshift cluster. This command requires the UsageLimitID, which you can obtain by running the describe-usage-limits This command should have the following parameter:
    • –usage-limit-id – The ID of the limit that you want to delete.

See the following example code:

aws redshift delete-usage-limit --usage-limit-id "<yourUsageLimitID>"

For more information, see Managing usage limits in Amazon Redshift.


The Amazon Redshift usage controls provide you with an easy way to monitor, alert, and limit the cost you incur when using Concurrency Scaling and Redshift Spectrum features. With up to four limits configurable per feature and options to log events, trigger Amazon SNS notifications, or disable the features altogether from the Redshift console and AWS CLI, you have all the tools needed to make sure you stay within your budget.


About the Authors

Vince Marchillo is a Solutions Architect within Amazon’s Business Data Technologies organization. Vince guides customers to leverage scalable, secure, and easy-to-use data lake architecture powered by AWS services and other technologies.





Maor Kleider is a product and database engineering leader for Amazon Redshift. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.




Satish Sathiya is a Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

Federate access to your Amazon Redshift cluster with Active Directory Federation Services (AD FS): Part 2

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/federate-access-to-your-amazon-redshift-cluster-with-active-directory-federation-services-ad-fs-part-2/

In the first post of this series, Federating access to your Amazon Redshift cluster with Active Directory: Part 1, you set up Microsoft Active Directory Federation Services (AD FS) and Security Assertion Markup Language (SAML) based authentication and tested the SAML federation using a web browser.

In Part 2, you learn to set up an Amazon Redshift cluster and use federated authentication with AD FS to connect from a JDBC SQL client tool. You follow a detailed step-by-step process of using Amazon Redshift database authentication to simplify the credential management of database users and reuse what you may already have on premises. You can also integrate Amazon Redshift group-level privileges with federation.

You can use your corporate Active Directory (AD) and the SAML 2.0 AD FS identity provider to enable your users to easily access Amazon Redshift clusters using corporate user names without managing database users and passwords.

Overview of solution

This post walks you through setting up an Amazon Redshift cluster and federating database user authentication with AWS Identity and Access Management (IAM) and Amazon Redshift. You use the user you set up in your AD in Part 1 (Bob) to authenticate using AD FS and control access to database objects based on the group the user is assigned to.

You also learn how to set up SQL Client (SQL Workbench/J) and an Amazon Redshift JDBC driver to connect to a database cluster using the database user authentication method and the AD FS federated authentication method.

The following diagram illustrates the architecture of this solution.

The workflow contains the following steps:

  1. The client (SQL Workbench/J) makes a request to the Identity Provider (IdP).
  2. The IdP authenticates the user and sends back a SAML assertion.
  3. The client calls an AWS SAML endpoint to AssumeRoleWithSAML.
  4. AWS authenticates and returns temporary security credentials.
  5. The client uses temporary credentials to connect to the Amazon Redshift cluster endpoint.


The walkthrough includes the following steps:

  1. Set up your Amazon Redshift cluster.
  2. Download, set up, configure, and test different connection methods to Amazon Redshift using SQL Workbench/J and an Amazon Redshift JDBC driver.
  3. Configure your Amazon Redshift connection profile with SQL Workbench/J.


For this walkthrough, you should meet the following prerequisites:

  1. Complete all the steps from Federating access to your Amazon Redshift cluster with Active Directory: Part 1:
    • Set up domain controller and AD FS
    • Configure federation on AWS with the AD FS
    • Configure AWS as the relying party with AD FS
    • Test AWS SAML federation using your web browser
    • Copy the username and passwords for the AD users to use in this post
    • Copy the ARN for the two SAML roles you created to use in this post
    • Copy the AD domain name to use in this post
  2. Have a basic understanding of SQL with any database and experience using a SQL client, such as SQL Workbench/J
  3. Check with your AWS administrator to ensure you have access to the AWS Management Console with permissions to use Amazon Redshift and IAM

Setting up your Amazon Redshift cluster

To create your Amazon Redshift cluster, complete the following steps:

  1. On the console, open Amazon Redshift.
  2. Choose Create cluster.

  1. For Node type¸ choose dc2.large.
  2. For Nodes, enter 2.

  1. For Cluster identifier, enter adfs-redshift-cluster.
  2. For Database port, enter 5439.
  3. Enter a master user name and password of your choice.

  1. Keep everything else at their defaults and choose Create cluster.

Your cluster status initially shows the various Modifying stages.

When the cluster creation is complete, its status shows as Available.

  1. Choose the cluster.
  2. On the Properties tab, review the endpoint information and database configuration.

  1. Make sure that for this walkthrough, Publicly accessible is set to Yes.

For more information about VPC accessibility, see Managing clusters in a VPC.

For production setup of the Amazon Redshift cluster, you shouldn’t set the cluster to be publicly accessible; instead you make it private. For more information, see How can I access a private Amazon Redshift cluster from my local machine?

Configuring your cluster security group

You can create a new security group for your cluster or use the default security group (located in the Network and security section on the Properties tab).

  1. For the security group, choose Edit inbound settings.
  2. Add a rule for Amazon Redshift for the IP range on your computer and the public IP for the Windows 2016 domain controller, which you set up in Part 1.

  1. Choose Save.

Configuring and testing connection methods using SQL Workbench/J and Amazon Redshift JDBC driver

Before you get started, download the latest Amazon Redshift JDBC driver with AWS SDK. You use this for JDBC driver-based authentication in SQL Workbench/J.

  1. Download and install SQL Workbench/J. For instructions, see Connect to your cluster by using SQL Workbench/J.
  2. Open SQL Workbench/J and choose Manage Drivers.
  3. Choose Amazon Redshift.
  4. For Library, browse to the JDBC driver you downloaded.

  1. Press Ok.

Configuring your Amazon Redshift connection profile with SQL Workbench/J

To set up federated access, you take a two-step approach: connecting with the superuser and connecting using federated authentication.

Connecting using the superuser

For this walkthrough, you create database objects, groups, and users and assign proper privileges to the groups on the database objects they are allowed to access.

  1. Choose New Connection.
  2. Construct a JDBC connection URL for the database standard user:


You can find the value for redshiftclusterendpoint on the Amazon Redshift console, under Properties, Endpoint. The username and password are the values you provided when you created the cluster.

  1. Choose Test.

You should receive a successful connection message, as in the following screenshot.

  1. Connect as awsuser (a superuser).
  2. From SQL Workbench/J, enter the following commands to set up the following environment:
    • Create two database groups:
      CREATE GROUP sales;
      CREATE GROUP marketing;

    • Create two schemas:
      CREATE SCHEMA sales;
      CREATE SCHEMA marketing;

    • Create two tables in each schema:
      CREATE TABLE IF NOT EXISTS marketing.employee
      	n_empkey INTEGER   
      	,n_name CHAR(25)   
      	,n_regionkey INTEGER   
      	,n_comment VARCHAR(152)   
       SORTKEY (n_empkey);
      CREATE TABLE IF NOT EXISTS sales.employee_sales
      	n_empkey INTEGER   
      	,n_name CHAR(25)   
      	,n_regionkey INTEGER   
      	,n_comment VARCHAR(152)   
       SORTKEY (n_empkey);

    • Insert sample data into the two tables:
      INSERT INTO marketing.employee
      VALUES(1, 'Bob', 0, 'Marketing');
      INSERT INTO sales.employee_sales
      VALUES(1, 'John', 0, 'Sales');

    • Validate data is available in the tables:
Select * from marketing.employee;

Select * from sales.employee_sales;

You can now set up appropriate privileges for the sales and marketing groups. Groups are collections of users who are all granted privileges associated with the group. You can use groups to assign privileges by job function. For example, you can create different groups for sales, administration, and support and give the users in each group the appropriate access to the data they require for their work. You can grant or revoke privileges at the group level, and those changes apply to all members of the group, except for superusers.

  1. Enter the following SQL queries to grant access to all tables in the sales schema to the sales group and access to all tables in the marketing schema to the marketing group:
GRANT USAGE on SCHEMA sales to GROUP sales;

GRANT USAGE on SCHEMA marketing to GROUP marketing;
GRANT SELECT on ALL TABLES in SCHEMA marketing to GROUP marketing;

Connect using federated authentication

You can pass the URL information in the connection profile two different ways:

  • Specify the Amazon Redshift cluster endpoint with the port (for this use case, the driver doesn’t need to run the DescribeClusters API because everything is already available):

jdbc:redshift:iam//<<your redshift cluster endpoint>>/dev

  • Specify the cluster ID and Region (for this use case, the driver calls the DescribeClusters API and Get-Cluster-Credentials as additional steps):

jdbc:redshift:iam://<<cluster id:region>>/dev

The rest of the parameters are specified in the Extended Properties configuration.

For this walkthrough, you use the second method.

  1. Create a new connection profile in SQL Workbench/J.
  2. Choose Extended Properties.

  1. Provide the following values:
  2. idp_port443
  3. plugin_namecom.amazon.redshift.plugin.AdfsCredentialsProvider
  4. ssl_insecuretrue
  5. idp_host – The host name for the AD domain, for example, windows2016.adfsredshift.com
  6. preferred_role – The ARN of the SAML role you created in Part 1, for example, arn:aws:iam::<AWSAccountID>:role/ADFZ-Production (replace <AWSAccountID> with your AWS Account)
  7. AutoCreatetrue

Setting AutoCreate= True creates the AD user Bob in the Amazon Redshift database the first time without password settings. The authentication is temporary using the SAML configuration you completed in Part 1. You can also set the AutoCreate property with IdP configuration. For more information, see JDBC and ODBC Options for Creating Database User Credentials.

  1. Choose Test.

You should see a successful connection, and can start using SQL Workbench/J to query Amazon Redshift cluster as user Bob. The message indicates Connection to endpoint successful. For instructions on addressing connection errors, see Troubleshooting connection issues in Amazon Redshift.

  1. With this connection profile, run the following query to test SAML authentication and query privilege control:
select * from stv_sessions;

The following screenshot shows the output.

The output shows that the user [email protected] was authenticated using AD FS. The user also joined the marketing group as enforced by the AD FS DbGroups claim rule and the policy associated with the ADFZ-Production role, which the user assumes during this session.

  1. To run the query against the marketing schema, enter the following code:
select * from marketing.employee;

The following screenshot shows the output.

The output shows that AD user Bob is part of the AD group RSDB-marketing, which was mapped to the DB group marketing. This database group had select access to the schema marketing and all tables in that schema. Therefore, the user could successfully query the table through SQL Workbench/J.

  1. To run a query against the sales schema, enter the following code:
select * from sales.employee_sales;

The following screenshot shows the output.

The output shows that Bob is only part of the AD group RSDB-marketing. Due to the way the claim rule is set up, Bob doesn’t have access to the database group sales, and therefore the query returns with a permission denied error.

Cleaning up

To avoid incurring future charges, you can delete the resources by deleting the AWS CloudFormation stack. This cleans up all the resources from your AWS account that you set up in Part 1.

To delete the Amazon Redshift cluster you created, select the cluster and choose Delete from the Actions drop-down menu.

You’re prompted to take a final snapshot if you require the cluster for future testing or setup. Storing the snapshot incurs additional charges.


In the first of this two-part series, you simulated an on-premises AD and AD FS setup using a Windows 2016 Amazon Elastic Compute Cloud (Amazon EC2) instance, and tested authentication using a browser.

In this post, you set up an Amazon Redshift cluster and SQL Workbench/J client with an Amazon Redshift JDBC driver. You connected to the database cluster using a database user with administrator privileges to set up a test database environment, and used a federated user authentication with AD FS. In addition, you validated the configuration by running several queries to see how to control access to Amazon Redshift database objects by using groups and assigning users to specific groups using AD FS seamlessly.


About the Authors

Rajesh Francis is a Data Warehouse Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build scalable Analytic solutions.






Vaibhav Agrawal is an Analytics Specialist Solutions Architect at AWS. Throughout his career, he has focused on helping customers design and build well-architected analytics and decision support platforms.