Tag Archives: Amazon Redshift

Break data silos and stream your CDC data with Amazon Redshift streaming and Amazon MSK

Post Syndicated from Umesh Chaudhari original https://aws.amazon.com/blogs/big-data/break-data-silos-and-stream-your-cdc-data-with-amazon-redshift-streaming-and-amazon-msk/

Data loses value over time. We hear from our customers that they’d like to analyze their business transactions in real time. Traditionally, customers used batch-based approaches for data movement from operational systems to analytical systems. A batch load can run once or several times a day, which introduces latency in data movement and reduces the value of data for analytics. A Change Data Capture (CDC)-based approach has emerged as an alternative to batch-based approaches. A CDC-based approach captures the data changes and makes them available in data warehouses for further analytics in real time.

CDC tracks changes made in a source database, such as inserts, updates, and deletes, and continually applies those changes to a target database. When the CDC frequency is high, the source database is changing rapidly, and the target database (usually a data warehouse) needs to reflect those changes in near real time.

With the explosion of data, the number of data systems in organizations has grown. Data silos cause data to live in different sources, which makes it difficult to perform analytics.

To gain deeper and richer insights, you can bring all the changes from different data silos into one place, like a data warehouse. This post showcases how to use streaming ingestion to bring data to Amazon Redshift.

Redshift streaming ingestion provides low-latency, high-throughput data ingestion, which enables customers to derive insights in seconds instead of minutes. It’s simple to set up, and it directly ingests streaming data into your data warehouse from Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK) without the need to stage the data in Amazon Simple Storage Service (Amazon S3). You create materialized views using SQL statements, and materialized view refreshes can then ingest hundreds of megabytes of data per second.

Solution overview

In this post, we create low-latency data replication from Amazon Aurora MySQL to an Amazon Redshift data warehouse, using Redshift streaming ingestion from Amazon MSK. With Amazon MSK, we securely stream data with a fully managed, highly available Apache Kafka service. Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. We store CDC events in Amazon MSK for a set duration of time, which makes it possible to deliver the same events to additional destinations such as an Amazon S3 data lake.

We deploy the Debezium MySQL source Kafka connector on Amazon MSK Connect. Amazon MSK Connect makes it easy to deploy, monitor, and automatically scale connectors that move data between Apache Kafka clusters and external systems such as databases, file systems, and search indexes. Amazon MSK Connect is fully compatible with Apache Kafka Connect, which enables you to lift and shift your Apache Kafka Connect applications with zero code changes.

This solution uses Amazon Aurora MySQL to host the example database salesdb. Users of the database can perform row-level INSERT, UPDATE, and DELETE operations to produce change events in the example salesdb database. The Debezium MySQL source Kafka connector reads these change events and emits them to Kafka topics in Amazon MSK. Amazon Redshift then reads the messages from those Kafka topics using the Amazon Redshift streaming ingestion feature, stores them in materialized views, and processes them as they arrive.

You can see how CDC represents a create event by looking at this example. In our solution, we use the op field for processing; it’s a mandatory string that describes the type of operation that caused the connector to generate the event. In this example, c indicates that the operation created a row. Valid values for the op field are:

  • c = create
  • u = update
  • d = delete
  • r = read (applies only to snapshots)

The following diagram illustrates the solution architecture:

This image shows the architecture of the solution: the Debezium connector for MySQL, deployed on Amazon MSK Connect, reads change events from Amazon Aurora and ingests them into Amazon MSK, from where they are ingested further into an Amazon Redshift materialized view.

The solution workflow consists of the following steps:

  • Amazon Aurora MySQL has a binary log (binlog) that records all operations (INSERT, UPDATE, DELETE) in the order in which they are committed to the database.
  • Amazon MSK Connect runs the source Kafka connector, the Debezium connector for MySQL, which reads the binlog, produces change events for row-level INSERT, UPDATE, and DELETE operations, and emits the change events to Kafka topics in Amazon MSK.
  • An Amazon Redshift provisioned cluster is the stream consumer and can read messages from the Kafka topics in Amazon MSK.
  • A materialized view in Amazon Redshift is the landing area for data read from the stream, which is processed as it arrives.
  • When the materialized view is refreshed, Amazon Redshift compute nodes allocate a group of Kafka partitions to a compute slice.
  • Each slice consumes data from the allocated partitions until the view reaches parity with the last offset for the Kafka topic.
  • Subsequent materialized view refreshes read data from the last offset of the previous refresh until they reach parity with the topic data.
  • Inside Amazon Redshift, we create a stored procedure to process the CDC records and update the target table.

Prerequisites

This post assumes you have a running Amazon MSK Connect stack in your environment with the following components:

  • Aurora MySQL hosting a database. In this post, you use the example database salesdb.
  • The Debezium MySQL connector running on Amazon MSK Connect, which connects to Amazon MSK in your Amazon Virtual Private Cloud (Amazon VPC).
  • An Amazon MSK cluster

If you don’t have an Amazon MSK Connect stack, then follow the instructions in the MSK Connect lab setup and verify that your source connector replicates data changes to the Amazon MSK topics.

You should provision the Amazon Redshift cluster in the same VPC as the Amazon MSK cluster. If you haven’t deployed one, then follow the steps in the AWS documentation.

We use AWS Identity and Access Management (AWS IAM) authentication for communication between Amazon MSK and the Amazon Redshift cluster. Make sure you have created an AWS IAM role with a trust policy that allows your Amazon Redshift cluster to assume the role. For information about how to configure the trust policy for the AWS IAM role, see Authorizing Amazon Redshift to access other AWS services on your behalf. After it’s created, the role should have the following AWS IAM policy, which provides permission for communication with the Amazon MSK cluster.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MSKIAMpolicy",
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:ReadData",
                "kafka-cluster:DescribeTopic",
                "kafka-cluster:Connect"
            ],
            "Resource": [
                "arn:aws:kafka:*:0123456789:cluster/xxx/xxx",
                "arn:aws:kafka:*:0123456789:topic/*/*/*"
            ]
        },
        {
            "Sid": "MSKPolicy",
            "Effect": "Allow",
            "Action": [
                "kafka:GetBootstrapBrokers"
            ],
            "Resource": "arn:aws:kafka:*:0123456789:cluster/xxx/xxx"
        }
    ]
}

Replace the ARNs containing xxx in the preceding example policy with your Amazon MSK cluster’s ARN.

  • Also, verify that the Amazon Redshift cluster has access to the Amazon MSK cluster. In the Amazon Redshift cluster’s security group, add an inbound rule for the MSK security group allowing port 9098. To learn how to manage a Redshift cluster’s security group, refer to Managing VPC security groups for a cluster.

Image shows how to add the inbound rule allowing port 9098 for the MSK security group in the Amazon Redshift cluster’s security group

  • And, in the Amazon MSK cluster’s security group, add an inbound rule allowing port 9098 from the leader node IP address of your Amazon Redshift cluster, as shown in the following diagram. You can find the IP address of your Amazon Redshift cluster’s leader node on the Properties tab of the cluster in the AWS Management Console.

Image shows how to add the inbound rule allowing port 9098 for the leader IP address of your Amazon Redshift cluster in the Amazon MSK cluster’s security group

Walkthrough

Navigate to the Amazon Redshift service in the AWS Management Console, then set up Amazon Redshift streaming ingestion for Amazon MSK by performing the following steps:

  1. Set enable_case_sensitive_identifier to true – If you are using the default parameter group for your Amazon Redshift cluster, you won’t be able to change enable_case_sensitive_identifier. Create a new parameter group with enable_case_sensitive_identifier set to true and attach it to the Amazon Redshift cluster. After you modify parameter values, you must reboot any clusters that are associated with the modified parameter group. It may take a few minutes for the cluster to reboot.

This configuration value determines whether name identifiers of databases, tables, and columns are case sensitive. Once done, open a new Amazon Redshift Query Editor V2 session so that the configuration change is reflected, then follow the next steps.
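To confirm the setting once the cluster is back up, you can check it from a SQL session. The following is a minimal sketch; note that the SET form below only affects the current session and doesn’t replace the parameter group change described above.

SHOW enable_case_sensitive_identifier;

-- Session-level override (the cluster-level parameter group change is still
-- required for a permanent change)
SET enable_case_sensitive_identifier TO true;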

  2. Create an external schema that maps to the streaming data source.
CREATE EXTERNAL SCHEMA MySchema
FROM MSK
IAM_ROLE 'arn:aws:iam::YourRole:role/msk-redshift-streaming'
AUTHENTICATION IAM
CLUSTER_ARN 'arn:aws:kafka:us-east-1:2073196*****:cluster/MSKCluster-msk-connect-lab/849b47a0-65f2-439e-b181-1038ea9d4493-10'; -- Replace the cluster ARN with your own; this one is just an example

Once done, verify that you see the following tables created from the MSK topics:

image shows tables created from MSK Topics

  3. Create a materialized view that references the external schema.
CREATE MATERIALIZED VIEW customer_debezium AUTO REFRESH YES AS
SELECT
*,
json_parse(kafka_value) as payload
from
"dev"."myschema"."salesdb.salesdb.CUSTOMER"; -- Replace myschema with the name you gave your external schema in step 2

Now you can query the newly created materialized view customer_debezium using the following command.

SELECT * FROM "dev"."public"."customer_debezium" order by refresh_time desc;

Check that the materialized view is populated with the CDC records.
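Because payload is a SUPER column produced by json_parse, you can also drill into individual CDC fields directly with dot notation. The following is a minimal sketch; the field names inside payload assume the default Debezium envelope used throughout this post.

SELECT
    payload."op"::varchar AS operation,
    payload.ts_ms::bigint AS event_timestamp_ms,
    coalesce(payload.after."CUST_ID", payload.before."CUST_ID")::varchar AS customer_id
FROM "dev"."public"."customer_debezium"
ORDER BY refresh_time DESC
LIMIT 10;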

  4. REFRESH MATERIALIZED VIEW (optional). This step is optional because we already specified AUTO REFRESH YES while creating the materialized view (MV).
REFRESH MATERIALIZED VIEW "dev"."public"."customer_debezium";

NOTE: The materialized view above is auto-refreshed, which means that if you don’t see the records immediately, wait a few seconds and rerun the SELECT statement. Amazon Redshift streaming ingestion views also come with the option of a manual refresh, which allows you to refresh the object on demand and pull streaming data into it immediately. Then query the materialized view again:

SELECT * FROM "dev"."public"."customer_debezium" order by refresh_time desc;

images shows records from the customer_debezium MV

Process CDC records in Amazon Redshift

In the following steps, we create a staging table to hold the CDC data, a target table that holds the latest snapshot, and a stored procedure to process the CDC records and update the target table.

  1. Create staging table: The staging table is a temporary table that holds all of the data that will be used to make changes to the target table, including both updates and inserts.
CREATE TABLE public.customer_stg (
    customer_id character varying(256) ENCODE raw distkey,
    customer_name character varying(256) ENCODE lzo,
    market_segment character varying(256) ENCODE lzo,
    ts_ms bigint ENCODE az64,
    op character varying(2) ENCODE lzo,
    record_rank smallint ENCODE az64,
    refresh_time timestamp without time zone ENCODE az64
) DISTSTYLE KEY
SORTKEY (customer_id);
-- In this particular example, we used LZO encoding because it works well for CHAR and VARCHAR
-- columns that store very long character strings. You can use BYTEDICT as well if it matches
-- your use case.
  2. Create target table

We use customer_target table to load the processed CDC events.

CREATE TABLE public.customer_target (
    customer_id character varying(256) ENCODE raw distkey,
    customer_name character varying(256) ENCODE lzo,
    market_segment character varying(256) ENCODE lzo,
    refresh_time timestamp without time zone ENCODE az64
) DISTSTYLE KEY
SORTKEY (customer_id);
  3. Create the debezium_last_extract control table and insert a dummy value.

We need to store the timestamp of the last extracted CDC events. We use the debezium_last_extract table for this purpose. For the initial record, we insert a dummy value, which enables us to compare the current and next CDC processing timestamps.

CREATE TABLE public.debezium_last_extract (
process_name character varying(256) ENCODE lzo,
latest_refresh_time timestamp without time zone ENCODE az64
) DISTSTYLE AUTO;

Insert into public.debezium_last_extract VALUES ('customer','1983-01-01 00:00:00');

SELECT * FROM "dev"."public"."debezium_last_extract";
  4. Create stored procedure

This stored procedure processes the CDC records and updates the target table with the latest changes.

CREATE OR REPLACE PROCEDURE public.incremental_sync_customer()
LANGUAGE plpgsql
AS $$
DECLARE
    sql VARCHAR(MAX) := '';
    max_refresh_time TIMESTAMP;
    staged_record_count BIGINT := 0;
BEGIN

    -- Get the last loaded refresh_time from the control table
    sql := 'SELECT MAX(latest_refresh_time) FROM debezium_last_extract where process_name =''customer'';';
    EXECUTE sql INTO max_refresh_time;

    -- Truncate the staging table (replace customer_stg with your staging table name)
    EXECUTE 'TRUNCATE customer_stg;';

    -- Insert (and transform) the latest change records with a refresh time greater than the
    -- last loaded refresh time into the staging table
    EXECUTE 'INSERT INTO customer_stg ('||
    'select coalesce(payload.after."CUST_ID",payload.before."CUST_ID") ::varchar as customer_id,payload.after."NAME"::varchar as customer_name,payload.after."MKTSEGMENT" ::varchar as market_segment, payload.ts_ms::bigint,payload."op"::varchar, rank() over (partition by coalesce(payload.after."CUST_ID",payload.before."CUST_ID")::varchar order by payload.ts_ms::bigint desc) as record_rank, refresh_time from CUSTOMER_debezium where refresh_time > '''||max_refresh_time||''');';

    sql := 'SELECT COUNT(*) FROM customer_stg;';
    EXECUTE sql INTO staged_record_count;
    RAISE INFO 'Staged member records: %', staged_record_count;

    -- Delete records from the target table that also exist in the staging table
    -- (updated/deleted records); replace customer_target with your target table name
    EXECUTE 'DELETE FROM customer_target using customer_stg WHERE customer_target.customer_id = customer_stg.customer_id';

    -- Insert all records from the staging table into the target table
    EXECUTE 'INSERT INTO customer_target SELECT customer_id,customer_name, market_segment, refresh_time FROM customer_stg where record_rank =1 and op <> ''d''';

    -- Insert the max refresh time into the control table
    EXECUTE 'INSERT INTO debezium_last_extract SELECT ''customer'', max(refresh_time) from customer_target ';

END;
$$;

images shows stored procedure with name incremental_sync_customer created in above step

Test the solution

Update the example salesdb database hosted on Amazon Aurora

  1. This is your Amazon Aurora database, which we access from the Amazon Elastic Compute Cloud (Amazon EC2) instance named KafkaClientInstance.
  2. Replace the Amazon Aurora endpoint with the value of your Amazon Aurora endpoint, execute the following command, and then use salesdb.
mysql -f -u master -h mask-lab-salesdb.xxxx.us-east-1.rds.amazonaws.com --password=S3cretPwd99

image shows the details of the RDS for MySQL
  3. Perform an update, insert, and delete in any of the tables created. You can also update a record more than once to check the last updated record later in Amazon Redshift.

image shows the insert, updates and delete operations performed on RDS for MySQL

  4. Invoke the stored procedure incremental_sync_customer created in the preceding steps from Amazon Redshift Query Editor V2. You can run the procedure manually using the following command or schedule it.
call incremental_sync_customer();
  5. Check the target table for the latest changes. You’ll see that all the updates and deletes that you made in the source table appear at the top, ordered by refresh_time.
SELECT * FROM "dev"."public"."customer_target" order by refresh_time desc;

Image shows the records from the customer_target table in descending order

Extending the solution

In this solution, we showed CDC processing for the customer table. You can use the same approach to extend it to other tables in the example salesdb database or add more databases to the MSK Connect configuration property database.include.list.

Our proposed approach can work with any MySQL source supported by the Debezium MySQL source Kafka connector. Similarly, to extend this example to your workloads and use cases, you need to create the staging and target tables according to the schema of the source table. Then you need to update the coalesce(payload.after."CUST_ID",payload.before."CUST_ID")::varchar as customer_id statements with the column names and types in your source and target tables. As in the example in this post, we used LZO encoding, which works well for CHAR and VARCHAR columns that store very long character strings. You can use BYTEDICT as well if it matches your use case. Another consideration while creating target and staging tables is choosing a distribution style and key based on the data in the source database. Here we chose the KEY distribution style with customer_id, based on the source data and schema, following the best practices mentioned here.
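For example, extending the pipeline to another table in salesdb largely means repeating the materialized view pattern against that table’s topic and then adapting the staging table, target table, and stored procedure to its columns. The following sketch is hypothetical; the SALES_ORDER topic and view names are assumptions for illustration only.

CREATE MATERIALIZED VIEW sales_order_debezium AUTO REFRESH YES AS
SELECT
    *,
    json_parse(kafka_value) AS payload
FROM "dev"."myschema"."salesdb.salesdb.SALES_ORDER"; -- Replace myschema and the topic name to match your setup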

Cleaning up

  1. Delete all the Amazon Redshift clusters
  2. Delete Amazon MSK Cluster and MSK Connect Cluster
  3. If you don’t want to delete the Amazon Redshift clusters, you can manually drop the materialized view and tables created during this post using the following commands:
drop MATERIALIZED VIEW customer_debezium;
drop TABLE public.customer_stg;
drop TABLE public.customer_target;
drop TABLE public.debezium_last_extract;

Also, remove the inbound security rules added to your Amazon Redshift and Amazon MSK clusters, along with the AWS IAM roles created in the Prerequisites section.

Conclusion

In this post, we showed you how Amazon Redshift streaming ingestion provides high-throughput, low-latency ingestion of streaming data from Amazon Kinesis Data Streams and Amazon MSK into an Amazon Redshift materialized view. We increased speed and reduced the cost of streaming data into Amazon Redshift by eliminating the need for any intermediary services.

Furthermore, we showed how CDC data can be processed rapidly after generation, using a simple SQL interface that enables customers to perform near real-time analytics on a variety of data sources (e.g., Internet of Things [IoT] devices, system telemetry data, or clickstream data from a busy website or application).

As you explore the options to simplify and enable near real-time analytics for your CDC data, we hope this post provides you with valuable guidance. We welcome any thoughts or questions in the comments section.


About the Authors

Umesh Chaudhari is a Streaming Solutions Architect at AWS. He works with AWS customers to design and build real time data processing systems. He has 13 years of working experience in software engineering including architecting, designing, and developing data analytics systems.

Vishal Khatri is a Sr. Technical Account Manager and Analytics specialist at AWS. Vishal works with State and Local Government helping educate and share best practices with customers by leading and owning the development and delivery of technical content while designing end-to-end customer solutions.

Federate IAM-based single sign-on to Amazon Redshift role-based access control with Okta

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/federate-iam-based-single-sign-on-to-amazon-redshift-role-based-access-control-with-okta/

Amazon Redshift accelerates your time to insights with fast, easy, and secure cloud data warehousing at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries.

You can use your preferred SQL client to analyze your data in an Amazon Redshift data warehouse. Connect seamlessly by using identity provider (IdP) or single sign-on (SSO) credentials to reuse existing user credentials and avoid additional user setup and configuration. Using role-based access control (RBAC), you can simplify managing user privileges and database permissions in Amazon Redshift. You can also use Redshift database roles to define a set of elevated permissions, such as for a system monitor or database administrator.

Using AWS Identity and Access Management (IAM) with RBAC, organizations can simplify user management because you no longer need to create users and map them to database roles manually. You can define the mapped database roles as a principal tag for the IdP groups or IAM role, so that users who are members of those IdP groups are granted the corresponding Redshift database roles automatically.

Earlier in 2023, we launched support for Okta integration with Amazon Redshift Serverless using database roles. In this post, we focus on Okta as the IdP and provide step-by-step guidance to integrate a Redshift provisioned cluster with Okta using the Redshift Query Editor v2 and with SQL clients like SQL Workbench/J. You can use this mechanism with other IdP providers such as Azure Active Directory or Ping with any applications or tools using Amazon’s JDBC, ODBC, or Python driver.

Recently, we also announced Amazon Redshift integration with AWS IAM Identity Center, supporting trusted identity propagation, allowing you to use third-party identity providers (IdPs) such as Microsoft Entra ID (Azure AD), Okta, Ping, and OneLogin. This integration simplifies the authentication and authorization process for Amazon Redshift users using Query Editor V2 or Amazon QuickSight, making it easier for them to securely access your data warehouse. AWS IAM Identity Center offers automatic user and group provisioning from Okta by utilizing the System for Cross-domain Identity Management (SCIM) 2.0 protocol. This integration allows for seamless synchronization of information between the two services, ensuring accurate and up-to-date information in AWS IAM Identity Center. Refer to the blog post Integrate Okta with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On to learn more about setting up single sign-on (SSO) to Amazon Redshift using the integration with IdC and Okta as the identity provider.

If you are interested in using IAM-based single sign-on with Amazon Redshift database roles, then continue reading this post.

Solution overview

The following diagram illustrates the authentication flow of Okta with a Redshift provisioned cluster using federated IAM roles and automatic database role mapping.

Architecture Diagram

The workflow contains the following steps:

  1. Either the user chooses an IdP app in their browser, or the SQL client initiates a user authentication request to the IdP (Okta).
  2. Upon a successful authentication, Okta submits a request to the AWS federation endpoint with a SAML assertion containing the principal tags.
  3. The AWS federation endpoint validates the SAML assertion and invokes the AWS Security Token Service (AWS STS) API AssumeRoleWithSAML. The SAML assertion contains the IdP user and group information that is stored in the RedshiftDbUser and RedshiftDbRoles principal tags, respectively. Temporary IAM credentials are returned to the SQL client or, if using the Query Editor v2, the user’s browser is redirected to the Query Editor v2 console using the temporary IAM credentials.
  4. The temporary IAM credentials are used by the SQL client or Query Editor v2 to call the Redshift API GetClusterCredentialsWithIAM. This API uses the principal tags to determine the user and database roles that the user belongs to. An associated database user is created if the user is signing in for the first time and is granted the matching database roles automatically. A temporary password is returned to the SQL client.
  5. Using the database user and temporary password, the SQL client or Query Editor v2 connects to Amazon Redshift. Upon login, the user is authorized based on the Redshift database roles that were assigned in Step 4.

Prerequisites

You need the following prerequisites to set up this solution:

Connect with a Redshift provisioned cluster as a federated user using Query Editor v2

To connect using Query Editor v2, complete the following steps:

  1. Follow all the steps described in the sections Set up your Okta application and Set up AWS configuration in the following post.
  2. For the Amazon Redshift access IAM policy, replace the policy with the following JSON to use the GetClusterCredentialsWithIAM API:
    {
    	"Version": "2012-10-17",
    	"Statement": [
    					{
    						"Sid": "VisualEditor0",
    						"Effect": "Allow",
    						"Action": "redshift:GetClusterCredentialsWithIAM",
    						"Resource": "arn:aws:redshift:us-west-2:123456789012:dbname:redshift-cluster-1/dev"
    					}
    				]
    }

Now you’re ready to connect to your Redshift provisioned cluster using Query Editor v2 and federated login.

  1. Use the SSO URL from Okta and log in to your Okta account with your user credentials. For this demo, we log in with user Ethan.
  2. In Query Editor v2, choose your Redshift provisioned cluster (right-click) and choose Create connection.
  3. For Authentication, select Temporary credentials using your IAM identity.
  4. For Database, enter the database name you want to connect to.
  5. Choose Create connection.
  6. Run the following command to validate that you are logged in as a federated user and also to get the list of roles associated with that user for the current session:
SELECT current_user,* FROM pg_get_session_roles() eff_ro(name name, roleid integer);

Because Ethan is part of the sales group and has been granted permissions to access tables in the sales_schema, he should be able to access those tables without any issues. However, if he tries to access tables in the finance_schema, he would receive a permission denied error because Ethan is not part of the finance group in Okta.
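The permissions behind this behavior are ordinary Redshift RBAC grants. The following is a minimal sketch of how an administrator might set up a database role matching the Okta sales group; the role and schema names are assumptions based on this example.

CREATE ROLE sales;
GRANT USAGE ON SCHEMA sales_schema TO ROLE sales;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_schema TO ROLE sales;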

Okta-QEV2-Federation

Connect with a Redshift provisioned cluster as a federated user via a third-party client

To connect as a federated user via a third-party client, complete the following steps:

  1. Follow steps 1 and 2 described in the preceding section (Connect with a Redshift provisioned cluster as a federated user using Query Editor v2).
  2. Use the Redshift JDBC driver v2.1.0.18 and above because it supports authentication with IAM group federation. For the URL, enter jdbc:redshift:iam://<cluster endpoint>:<port>:<databasename>?groupfederation=true. For example, jdbc:redshift:iam://redshift-cluster-1.abdef0abc0ab.us-west-2.redshift.amazonaws.com:5439/dev?groupfederation=true

In the preceding URL, groupfederation is a mandatory parameter that allows you to authenticate with the IAM credentials for the Redshift provisioned cluster. Without the groupfederation parameter, it will not use Redshift database roles.

  3. For Username and Password, enter your Okta credentials.

SQL Workbench/J - Connection

  4. To set up extended properties, follow Steps 4–9 in the section Configure the SQL client (SQL Workbench/J) in the following post.

User Ethan will be able to access the sales_schema tables. If Ethan tries to access the tables in the finance_schema, he will get a permission denied error.

SQL Workbench/J Demo

Troubleshooting

If your connection didn’t work, consider the following:

  • Enable logging in the driver. For instructions, see Configure logging.
  • Make sure to use the latest Amazon Redshift JDBC driver version.
  • If you’re getting errors while setting up the application on Okta, make sure you have admin access.
  • If you can authenticate via the SQL client but get a permission issue or can’t see objects, grant the relevant permission to the role.

Clean up

When you’re done testing the solution, clean up the resources to avoid incurring future charges:

  1. Delete the Redshift provisioned cluster.
  2. Delete the IAM roles, IAM IdPs, and IAM policies.

Conclusion

In this post, we provided step-by-step instructions to integrate a Redshift provisioned cluster with Okta using the Redshift Query Editor v2 and SQL Workbench/J with the help of federated IAM roles and automatic database role mapping. You can use a similar setup with other SQL clients (such as DBeaver or DataGrip). We also showed how Okta group membership is mapped automatically with Redshift provisioned cluster roles to use role-based authentication seamlessly.

If you have any feedback or questions, please leave them in the comments.


About the Authors

Debu-PandaDebu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

How FanDuel adopted a modern Amazon Redshift architecture to serve critical business workloads

Post Syndicated from Sreenivasa Munagala original https://aws.amazon.com/blogs/big-data/how-fanduel-adopted-a-modern-amazon-redshift-architecture-to-serve-critical-business-workloads/

This post is co-written with Sreenivasa Munagala and Matt Grimm from FanDuel.

In this post, we share how FanDuel moved from a DC2 nodes architecture to a modern Amazon Redshift architecture, which includes Redshift provisioned clusters using RA3 instances, Amazon Redshift data sharing, and Amazon Redshift Serverless.

About FanDuel

Part of Flutter Entertainment, FanDuel Group is a gaming company that offers sportsbooks, daily fantasy sports, horse racing, and online casinos. The company operates sportsbooks in a number of US and Canadian states. FanDuel first carved out a niche in the US through daily fantasy sports, such as their most popular fantasy sport: NFL football.

As FanDuel’s business footprint grew, so too did the complexity of their analytical needs. More and more of FanDuel’s community of analysts and business users looked for comprehensive data solutions that centralized the data across the various arms of their business. Their individual, product-specific, and often on-premises data warehouses soon became obsolete. FanDuel’s data team solved the problem of creating a new massive data store for centralizing the data in one place, with one version of the truth. At the heart of this new Global Data Platform was Amazon Redshift, which fast became the trusted data store from which all analysis was derived. Users could now assess risk, profitability, and cross-sell opportunities not only for piecemeal divisions or products, but also globally for the business as a whole.

FanDuel’s journey on Amazon Redshift

FanDuel’s first Redshift cluster was launched using Dense Compute (DC2) nodes. This was chosen over Dense Storage (DS2) nodes in order to take advantage of the greater compute power for the complex queries in their organization. As FanDuel grew, so did their data workloads. This meant that there was a constant challenge to scale and overcome contention while providing the performance their user community needed for day-to-day decision-making. FanDuel met this challenge initially by continuously adding nodes and experimenting with workload management (WLM), but it became abundantly obvious that they needed to take a more significant step to meet the needs of their users.

By 2021, FanDuel’s workloads had almost tripled since they first started using Amazon Redshift in 2018, and they started evaluating Redshift RA3 nodes vs. DC2 nodes to take advantage of the storage and compute separation and deliver better performance at lower costs. FanDuel wanted to make the move primarily to separate storage and compute, and to evaluate data sharing in the hopes of bringing different compute to the data to alleviate user contention on their primary cluster. FanDuel decided to launch a new RA3 cluster when they were satisfied that the performance matched that of their existing DC2 architecture, providing them the ability to scale storage and compute independently.

In 2022, FanDuel shifted their focus to using data sharing. Data sharing allows you to share live data securely across Redshift data warehouses for read and write (in preview) purposes. This means that workloads can be isolated to individual clusters, allowing for a more streamlined schema design, WLM configuration, and right-sizing for cost optimization. The following diagram illustrates this architecture.

To achieve a data sharing architecture, the plan was to first spin up consumer clusters for development and testing environments for their data engineers that were moving key legacy code to dbt. FanDuel wanted their engineers to have access to production datasets to test their new models and match the results from their legacy SQL-based code sets. They also wanted to ensure that they had adequate compute to run many jobs concurrently. After they saw the benefits of data sharing, they spun up their first production consumer cluster in the spring of 2022 to handle other analytics use cases. This was sharing most of the schemas and their tables from the main producer cluster.
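At the SQL level, this kind of producer-to-consumer sharing is typically configured with a handful of commands. The following is a minimal sketch under assumed names; the datashare, schema, and namespace identifiers are illustrative rather than FanDuel’s actual configuration.

-- On the producer cluster
CREATE DATASHARE analytics_share;
ALTER DATASHARE analytics_share ADD SCHEMA sales;
ALTER DATASHARE analytics_share ADD ALL TABLES IN SCHEMA sales;
GRANT USAGE ON DATASHARE analytics_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer cluster
CREATE DATABASE sales_shared FROM DATASHARE analytics_share OF NAMESPACE '<producer-namespace-guid>';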

Benefits of moving to a data sharing architecture

FanDuel saw a lot of benefits from the data sharing architecture, where data engineers had access to real production data to test their jobs without impacting the producer’s performance. Since splitting the workloads through a data sharing architecture, FanDuel has doubled their query concurrency and reduced the query queuing, resulting in a better end-to-end query time. FanDuel received positive feedback on the new environment and soon reaped the rewards of increased engineering velocity and reduced performance issues in production after deployments. Their initial venture into the world of data sharing was definitely considered a success.

Given the successful rollout of their first consumer in a data sharing architecture, they looked for opportunities to meet other users’ needs with new targeted consumers. With the assistance of AWS, FanDuel initiated the development of a comprehensive strategy aimed at safeguarding their extract, load, and transform (ELT) jobs. This approach involved implementing workload isolation and allocating dedicated clusters for these workloads, designated as the producer cluster within the data sharing architecture. Simultaneously, they planned to migrate all other activities onto one or more consumer clusters, apart from the existing cluster utilized by their data engineering team.

They spun up a second consumer in the summer of 2022 with the hopes of moving some of their more resource-intensive analytical processes off the main cluster. In order to empower their analysts over time, they had allowed a pattern in which users other than data engineers could create and share their own objects.

As the calendar flipped from 2022 to 2023, several developments changed the landscape of architecture at FanDuel. For one, FanDuel launched their initial event-based streaming work for their sportsbook data, which allowed them to micro-batch data into Amazon Redshift at a much lower latency than their previous legacy batch approach. This allowed them to generate C-Suite revenue reports at a much earlier SLA, which was a big win for the data team, because this was never achieved before the Super Bowl.

FanDuel introduced a new internal KPI called Query Efficiency, a measure to capture the amount of time users spent waiting for their queries to run. As the workload started increasing exponentially, FanDuel also noticed an increase in this KPI, specifically for risk and trading workloads.

Working with AWS Enterprise Support and the Amazon Redshift service team, FanDuel soon realized that the risk and trading use case was a perfect opportunity to move to Amazon Redshift Serverless. Redshift Serverless offers scalability across dimensions such as data volume changes, concurrent users, and query complexity, enabling you to automatically scale compute up or down to manage demanding and unpredictable workloads. Because billing is only accrued while queries are run, it also means that you no longer need to cover costs for compute you’re not utilizing. Redshift Serverless also manages workload management (WLM) entirely, allowing you to focus only on the query monitoring rules (QMRs) you want and usage limits, further limiting the need for you to manage your data warehouses. This adoption also complemented data sharing, where Redshift Serverless endpoints can read and write (in preview) from provisioned clusters during peak hours, offering flexible compute scalability and workload isolation and avoiding impact on other mission-critical workloads. Seeing the benefits of what Redshift Serverless offers for their risk and trading workloads, they also moved some of their other workloads, like business intelligence (BI) dashboards and risk and trading (RT), to a Redshift Serverless environment.

Benefits of introducing Redshift Serverless in a data sharing architecture

Through a combination of data sharing and a serverless architecture, FanDuel could elastically scale their most critical workloads on demand. Redshift Serverless Automatic WLM allowed users to get started without the need to configure WLM. With the intelligent and automated scaling capabilities of Redshift Serverless, FanDuel could focus on their business objectives without worrying about the data warehouse capacity. This architecture alleviated the constraints of a single predefined Redshift provisioned cluster and reduced the need for FanDuel to manage data warehouse capacity and any WLM configuration.

In terms of cost, Redshift Serverless enabled FanDuel to elegantly handle the most demanding workloads with a pay-as-you-go model, paying only when the data warehouse is in use, along with complete separation of compute and storage.

Having now introduced workload isolation and Redshift Serverless, FanDuel is able to achieve a more granular understanding of each team’s compute requirements without the noise of ELT and contending workloads all in the same environment. This allowed comprehensive analytics workloads to be conducted on consumers with vastly minimized contention while also being serviced with the most cost-efficient configuration possible.

The following diagram illustrates the updated architecture.

Results

FanDuel’s re-architecting efforts for workload isolation with risk and trading (RT) workloads using Redshift data sharing and Redshift Serverless resulted in the most critical business SLAs finishing three times faster, along with an increase in average query efficiency of 55% for overall workloads. These SLA improvements have resulted in an overall tenfold saving in business cost, and they have been able to deliver business insights to other verticals such as product, commercial, and marketing much faster.

Conclusion

By harnessing the power of Redshift provisioned clusters and serverless endpoints with data sharing, FanDuel has been able to better scale and run analytical workloads without having to manage any data warehouse infrastructure. FanDuel is looking forward to future Amazon partnerships and is excited to embark on a journey of new innovation with Redshift Serverless and continued enhancements such as machine learning optimization and auto scaling.

If you’re new to Amazon Redshift, you can explore demos, other customer stories, and the latest features at Amazon Redshift. If you’re already using Amazon Redshift, reach out to your AWS account team for support, and learn more about what’s new with Amazon Redshift.


About the authors

Sreenivasa Munagala is a Principal Data Architect at FanDuel Group. He defines their Amazon Redshift optimization strategy and works with the data analytics team to provide solutions to their key business problems.

Matt Grimm is a Principal Data Architect at FanDuel Group, moving the company to an event-based, data-driven architecture using the integration of both streaming and batch data, while also supporting their Machine Learning Platform and development teams.

Luke Shearer is a Cloud Support Engineer at Amazon Web Services for the Data Insight Analytics profile, where he is engaged with AWS customers every day and is always working to identify the best solution for each customer.

Dhaval Shah is a Senior Customer Success Engineer at AWS and specializes in bringing the most complex and demanding data analytics workloads to Amazon Redshift. He has more than 20 years of experience in different databases and data warehousing technologies. He is passionate about efficient and scalable data analytics cloud solutions that drive business value for customers.

Ranjan Burman is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 17 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for cost, reliability, performance, and operational excellence at scale in their cloud journey. He has a keen interest in data analytics as well.

How HR&A uses Amazon Redshift spatial analytics on Amazon Redshift Serverless to measure digital equity in states across the US

Post Syndicated from Harman Singh Dhodi original https://aws.amazon.com/blogs/big-data/how-hra-uses-amazon-redshift-spatial-analytics-on-amazon-redshift-serverless-to-measure-digital-equity-in-states-across-the-us/

In our increasingly digital world, affordable access to high-speed broadband is a necessity to fully participate in our society, yet there are still millions of American households without internet access. HR&A Advisors, a multi-disciplinary consultancy with extensive work in the broadband and digital equity space, is helping its state, county, and municipal clients deliver affordable internet access by analyzing locally specific digital inclusion needs and building tailored digital equity plans.

The first step in this process is mapping the digital divide. Which households don’t have access to the internet at home? Where do they live? What are their specific needs?

Public data sources aren’t sufficient for building a true understanding of digital inclusion needs. To fill in the gaps in existing data, HR&A creates digital equity surveys to build a more complete picture before developing digital equity plans. HR&A has used Amazon Redshift Serverless and CARTO to process survey findings more efficiently and create custom interactive dashboards to facilitate understanding of the results. HR&A’s collaboration with Amazon Redshift and CARTO has resulted in a 75% reduction in overall deployment and dashboard management time and helped the team achieve the following technical goals:

  • Load survey results (CSV files) and geometry data (shape files) in a data warehouse
  • Perform geo-spatial transformations using extract, load, and transform (ELT) jobs to join geometry data with survey results within the data warehouse to allow for visualization of survey results on a map
  • Integrate with a business intelligence (BI) tool for advanced geo-spatial functions, visualizations, and mapping dashboards
  • Scale data warehouse capacity up or down to address workloads of varying complexity in a cost-efficient manner

In this post, we unpack how HR&A uses Amazon Redshift spatial analytics and CARTO for cost-effective geo-spatial measurement of digital inclusion and internet access across multiple US states.

Before we get to the architecture details, here is what HR&A and its client, Colorado’s Office of the Future of Work, has to say about the solution.

“Working with the team at HR&A Advisors, Colorado’s Digital Equity Team created a custom dashboard that allowed us to very effectively evaluate our reach while surveying historically marginalized populations across Colorado. This dynamic tool, powered by AWS and CARTO, provided robust visualizations of which regions and populations were interacting with our survey, enabling us to zoom in quickly and address gaps in coverage. Ensuring we were able to seek out data from those who are most impacted by the digital divide in Colorado has been vital to addressing digital inequities in our state.”

— Melanie Colletti, Digital Equity Manager at Colorado’s Office of the Future of Work

“AWS allows us to securely house all of our survey data in one place, quickly scrub and analyze it on Amazon Redshift, and mirror the results through integration with data visualization tools such as CARTO without the data ever leaving AWS. This frees up our local computer space, greatly automates the survey cleaning and analysis step, and allows our clients to easily access the data results. Following the proof of concept and development of first prototype, almost all of our state clients showed interest in using the same solution for their states.”

— Harman Singh Dhodi, Analyst at HR&A Advisors, Inc.

Storing and analyzing large survey datasets

HR&A used Redshift Serverless to store large amounts of digital inclusion data in one place and quickly transform and analyze it using CARTO’s analytical toolkit to extend the spatial capabilities of Amazon Redshift and integrate with CARTO’s data visualization tools—all without the data ever leaving the AWS environment. This cut down significantly on analytical turnaround times.

The CARTO Analytics Toolbox for Redshift is composed of a set of user-defined functions and procedures organized in a set of modules based on the functionality they offer.

The following figure shows the solution and workflow steps developed during the proof of concept with a virtual private cloud (VPC) on Amazon Redshift.

Figure 1: Workflow illustrating data ingesting, transformation, and visualization using Redshift and CARTO.

In the following sections, we discuss each phase in the workflow in more detail.

Data ingestion

HR&A receives survey data as wide CSV files with hundreds of columns in each file and related spatial data in hexadecimal Extended Well-Known Binary (EWKB) in the form of shape files. These files are stored in Amazon Simple Storage Service (Amazon S3).

The Redshift COPY command is used to ingest the spatial data from shape files into the native GEOMETRY data type supported in Amazon Redshift. A combination of Amazon Redshift Spectrum and COPY commands is used to ingest the survey data stored as CSV files. For the files with unknown structures, AWS Glue crawlers are used to extract metadata and create table definitions in the AWS Glue Data Catalog. These table definitions are used as the metadata repository for external tables in Amazon Redshift.

For files with known structures, a Redshift stored procedure is used, which takes the file location and table name as parameters and runs a COPY command to load the raw data into corresponding Redshift tables.
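As a rough illustration of this ingestion step, the COPY commands look something like the following sketch; the bucket, table names, and IAM role are assumptions, not HR&A’s actual objects.

-- Spatial data: load shape files into a table with a GEOMETRY column
COPY survey_geo.tract_boundaries
FROM 's3://example-bucket/shapefiles/tracts/'
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role'
FORMAT SHAPEFILE;

-- Survey results: load wide CSV files into a raw table
COPY survey_raw.responses
FROM 's3://example-bucket/survey/responses.csv'
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role'
CSV IGNOREHEADER 1;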

Data transformation

Multiple stored procedures are used to split the raw table data and load it into corresponding target tables while applying the user-defined transformations.

These transformation rules include transformation of GEOMETRY data using native Redshift geo-spatial functions, such as ST_Area and ST_Length, and CARTO’s advanced spatial functions, which are readily available in Amazon Redshift as part of the CARTO Analytics Toolbox for Redshift installation. Furthermore, all the data ingestion and transformation steps are automated using an AWS Lambda function that runs the Redshift queries when any dataset in Amazon S3 is updated.
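A simplified example of such a transformation, joining survey responses to geometries with native Redshift spatial functions, is shown below; the table and column names are assumptions for illustration.

INSERT INTO survey_mart.responses_by_tract
SELECT
    r.response_id,
    g.tract_id,
    ST_Area(g.geom) AS tract_area,
    r.zip_code
FROM survey_raw.responses r
JOIN survey_geo.tract_boundaries g
    ON ST_Contains(g.geom, ST_Point(r.longitude, r.latitude));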

Data visualization

The HR&A team used CARTO’s Redshift connector to connect to the Redshift Serverless endpoint and built dashboards using CARTO’s SQL interface and widgets to assist mapping while performing dynamic calculations of the map data as per client needs.

The following are sample screenshots of the dashboards that show survey responses by zip code. The counties that are in lighter shades represent limited survey responses and need to be included in the targeted data collection strategy.

The first image shows the dashboard without any active filters. The second image shows the map and charts filtered for respondents who took the survey in Spanish. The user can select and toggle between features by choosing the respective category in any of the bar charts.

Figure 2: Illustrative Digital Equity Survey Dashboard for the State of Colorado. (© HR&A Advisors)

Figure 3: Illustrative Digital Equity Survey Dashboard for the State of Colorado, filtered for respondents who took the survey in Spanish language. (© HR&A Advisors)

The result: A new standard for automatically updating digital inclusion dashboards

After developing the first interactive dashboard prototype with this methodology, five of HR&A’s state clients (CA, TX, NV, CO, and MA) showed interest in the solution. HR&A was able to implement it for each of them within 2 months—an incredibly quick turnaround for a custom, interactive digital inclusion dashboard.

HR&A also realized about a 75% reduction in overall deployment and dashboard management time, which meant the consulting team could redirect their focus from manually analyzing data to helping clients interpret and strategically plan around the results. Finally, the dashboard’s user-friendly interface made survey data more accessible to a wider range of stakeholders. This helped build a shared understanding when assessing gaps in each state’s digital inclusion landscape and allowed for a targeted data collection strategy from areas with limited survey responses, thereby supporting more productive collaboration overall.

Conclusion

In this post, we showed how HR&A was able to analyze geo-spatial data in large volumes using Amazon Redshift Serverless and CARTO.

With HR&A’s successful implementation, it’s evident that Redshift Serverless, with its flexibility and scalability, can be used as a catalyst for positive social change. As HR&A continues to pave the way for digital equity, their story stands as a testament to how AWS services and its partners can be used in addressing real-world challenges.

We encourage you to explore Redshift Serverless with CARTO for analyzing spatial data and let us know your experience in the comments.


About the authors

Harman Singh Dhodi is an Analyst at HR&A Advisors. Harman combines his passion for data analytics with sustainable infrastructure practices, social inclusion, economic viability, climate resiliency, and building stakeholder capacity. Harman’s work often focuses on translating complex datasets into visual stories and accessible tools that help empower communities to understand the challenges they’re facing and create solutions for a brighter future.

Kiran Kumar Tati is an Analytics Specialist Solutions Architect based out of Omaha, NE. He specializes in building end-to-end analytic solutions. He has more than 13 years of experience with designing and implementing large scale Big Data and Analytics solutions. In his spare time, he enjoys playing cricket and watching sports.

Sapna Maheshwari is a Sr. Solutions Architect at Amazon Web Services. She helps customers architect data analytics solutions at scale on AWS. Outside of work she enjoys traveling and trying new cuisines.

Washim Nawaz is an Analytics Specialist Solutions Architect at AWS. He has worked on building and tuning data warehouse and data lake solutions for over 15 years. He is passionate about helping customers modernize their data platforms with efficient, performant, and scalable analytic solutions. Outside of work, he enjoys watching games and traveling.

Integrate Okta with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On

Post Syndicated from Maneesh Sharma original https://aws.amazon.com/blogs/big-data/integrate-okta-with-amazon-redshift-query-editor-v2-using-aws-iam-identity-center-for-seamless-single-sign-on/

AWS IAM Identity Center (IdC) allows you to manage single sign-on (SSO) access to all your AWS accounts and applications from a single location. We are pleased to announce that Amazon Redshift now integrates with AWS IAM Identity Center, and supports trusted identity propagation, allowing you to use third-party Identity Providers (IdP) such as Microsoft Entra ID (Azure AD), Okta, Ping, and OneLogin. This integration simplifies the authentication and authorization process for Amazon Redshift users using Query Editor V2 or Amazon Quicksight, making it easier for them to securely access your data warehouse. Additionally, this integration positions Amazon Redshift as an IdC-managed application, enabling you to use database role-based access control on your data warehouse for enhanced security.

AWS IAM Identity Center offers automatic user and group provisioning from Okta to itself by utilizing the System for Cross-domain Identity Management (SCIM) 2.0 protocol. This integration allows for seamless synchronization of information between two services, ensuring accurate and up-to-date information in AWS IAM Identity Center.

In this post, we’ll outline a comprehensive guide for setting up SSO to Amazon Redshift using integration with IdC and Okta as the Identity Provider. This guide shows how you can SSO onto Amazon Redshift for Amazon Redshift Query Editor V2 (QEV2).

Solution overview

Using IAM IdC with Amazon Redshift can benefit your organization in the following ways:

  • Users can connect to Amazon Redshift without requiring an administrator to set up AWS IAM roles with complex permissions.
  • IAM IdC integration allows mapping of IdC groups with Amazon Redshift database roles. Administrators can then assign different privileges to different roles and assign these roles to different users, giving organizations granular control over user access.
  • IdC provides a central location for your users in AWS. You can create users and groups directly in IdC or connect your existing users and groups that you manage in a standards-based identity provider like Okta, Ping Identity, or Microsoft Entra ID (i.e., Azure Active Directory [AD]).
  • IdC directs authentication to your chosen source of truth for users and groups, and it maintains a directory of users and groups for access by Amazon Redshift.
  • You can share one IdC instance with multiple Amazon Redshift data warehouses with a simple auto-discovery and connect capability. This makes it fast to add clusters without the extra effort of configuring the IdC connection for each, and it ensures that all clusters and workgroups have a consistent view of users, their attributes, and groups. Note: Your organization’s IdC instance must be in the same region as the Amazon Redshift data warehouse you’re connecting to.
  • Because user identities are known and logged along with data access, it’s easier for you to meet compliance regulations by auditing user access in AWS CloudTrail.
    Architecture Diagram

Amazon Redshift Query Editor V2 workflow:

  1. You initiate the flow using the AWS access portal URL (available on the IdC dashboard). A browser pop-up takes you to the Okta login page, where you enter your Okta credentials. After successful authentication, you’re logged into the AWS Console as a federated user. Select your AWS account and choose the Amazon Redshift Query Editor V2 application. Once you federate to Query Editor V2, select the IdC authentication method.
  2. QEV2 invokes a browser flow where you re-authenticate, this time with your AWS IdC credentials. Because Okta is the IdP, you reuse the Okta credentials that are already cached in the browser. At this step, the federation flow with IdC begins, and at the end of the flow the session token and access token are available to the QEV2 console in the browser as cookies.
  3. Amazon Redshift retrieves your authorization details based on the session token and fetches your group membership.
  4. Upon successful authentication, you’re redirected back to QEV2, logged in as an IdC-authenticated user.

This solution covers the following steps:

  1. Integrate Okta with AWS IdC to sync users and groups.
  2. Set up IdC integration with Amazon Redshift.
  3. Assign users or groups from IdC to the Amazon Redshift application.
  4. Enable IdC integration for a new Amazon Redshift provisioned cluster or Amazon Redshift Serverless endpoint.
  5. Associate an IdC application with an existing provisioned or serverless data warehouse.
  6. Configure Amazon Redshift role-based access.
  7. Create a permission set.
  8. Assign the permission set to AWS accounts.
  9. Federate to Redshift Query Editor V2 using IdC.
  10. Troubleshoot common issues.

Prerequisites

Walkthrough

Integrate Okta with AWS IdC to sync user and groups

Enable group and user provisioning from Okta to AWS IdC by following the documentation.

If you see issues while syncing users and groups, refer to the considerations for using automatic provisioning.

Setting up IAM IdC integration with Amazon Redshift

An Amazon Redshift cluster administrator or Amazon Redshift Serverless administrator must perform the steps to configure Redshift as an IdC-enabled application. This enables Amazon Redshift to automatically discover and connect to the IdC to receive sign-in and user directory services.

After this, when the Amazon Redshift administrator creates a cluster or workgroup, they can enable the new data warehouse to use IdC and its identity-management capabilities. The point of enabling Amazon Redshift as an IdC-managed application is so you can control user and group permissions from within the IdC, or from a source third-party identity provider that’s integrated with it.

When a database user, such as an analyst or a data scientist, signs in to an Amazon Redshift database, Amazon Redshift checks their groups in IdC and maps them to roles in Amazon Redshift. In this manner, a group can map to an Amazon Redshift database role that allows read access to a set of tables.

The following steps show how to make Amazon Redshift an AWS-managed application with IdC:

  1. Select IAM Identity Center connection from Amazon Redshift console menu.
    Redshift Application Creation
  2. Choose Create application
    Redshift IdC Create Application
  3. The IAM Identity Center connection opens. Choose Next.
  4. In the IAM Identity Center integration setup section, provide the following:
    1. IAM Identity Center display name – Enter a unique name for Amazon Redshift’s IdC-managed application.
    2. Managed application name – You can enter the managed Amazon Redshift application name or use the assigned value as it is.
  5. In the Connection with third-party identity providers section, provide the following:
    1. Identity Provider Namespace – Specify the unique namespace for your organization. This is typically an abbreviated version of your organization’s name. It’s added as a prefix for your IdC-managed users and roles in the Amazon Redshift database.
  6. In IAM role for IAM Identity Center access – Select an IAM role to use. You can create a new IAM role if you don’t have an existing one. The specific policy permissions required are the following:
    • sso:DescribeApplication – Required to create an identity provider (IdP) entry in the catalog.
    • sso:DescribeInstance – Used to manually create IdP federated roles or users.
    • redshift:DescribeQev2IdcApplications – Used to detect capability for IDC authentication from Redshift Query Editor V2.

The following screenshot is from the IAM role:
IAM IdC Role

IAM IdC Role Trust Relationship

  1. We won’t enable Trusted identity propagation because we are not integrating with AWS Lake Formation in this post.
  2. Choose Next.
    Redshift IdC Connection
  3. In Configure client connections that use third-party IdPs section, choose Yes if you want to connect Amazon Redshift with a third-party application. Otherwise, choose No. For this post we chose No because we’ll be integrating only with Amazon Redshift Query Editor V2.
  4. Choose Next.
    Redshift IdC No Third Party App
  5. In the Review and create application section, review all the details you have entered before and choose Create application.
    Redshift IdC Application - Review and Create

After the Amazon Redshift administrator finishes the steps and saves the configuration, the IdC properties appear in the Redshift Console. Completing these tasks makes Redshift an IdC-enabled application.
Redshift IdC Application Created

After you select the managed application name, the properties in the console include the integration status, which shows Success when the IdC configuration is complete.
Amazon Redshift IdC Application Status

Assigning users or groups from IdC to Amazon Redshift application

In this step, users or groups synced to your IdC directory are available to assign to your application, and the Amazon Redshift administrator can decide which users or groups from IdC to include in the Amazon Redshift application.

For example, if you have 20 groups in your IdC and you don’t want to include all of them in the Amazon Redshift application, you can choose which IdC groups to include in the Amazon Redshift-enabled IdC application. Later, you create two Redshift database roles as part of the IdC integration in Amazon Redshift.

The following steps assign groups to Amazon Redshift-enabled IdC application:

  1. On the IAM Identity Center properties page in the Amazon Redshift console, select Assign under the Groups tab.
  2. If this is the first time you’re assigning groups, you’ll see a notification. Select Get started.
  3. Enter which groups you want to synchronize in the application. In this example, we chose the groups awssso-sales and awssso-finance.
  4. Choose Done.
    Redshift Application - Assigning User and Groups

Enabling IdC integration for a new Amazon Redshift provisioned cluster or Amazon Redshift Serverless

After completing the steps in the previous section (Setting up IAM Identity Center integration with Amazon Redshift), the Amazon Redshift database administrator needs to configure new Redshift resources to work with IdC to make sign-in and data access easier. This is performed as part of the steps to create a provisioned cluster or a Serverless workgroup. Anyone with permissions to create Amazon Redshift resources can perform these IdC integration tasks. When you create a provisioned cluster, you start by choosing Create cluster in the Amazon Redshift console.

  1. Choose Enable for the cluster (recommended) in the IAM Identity Center connection section of the create cluster steps.
  2. From the drop-down, choose the Amazon Redshift application that you created in the previous steps.

Note that when a new data warehouse is created, the IAM role specified for IdC integration is automatically attached to the provisioned cluster or Serverless Namespace. After you finish entering the required cluster metadata and create the resource, you can check the status for IdC integration in the properties.
Amazon Redshift - Create Cluster

Associating an IdC application with an existing provisioned cluster or Serverless endpoint

If you have an existing provisioned cluster or serverless workgroup that you would like to enable for IdC integration, you can do so by running a SQL command. Run the following command to enable the integration; a database administrator must run the query.

CREATE IDENTITY PROVIDER "<idc_application_name>" TYPE AWSIDC
NAMESPACE '<idc_namespace_name>'
APPLICATION_ARN '<idc_application_arn>'
IAM_ROLE '<IAM role for IAM Identity Center access>';

Example:

CREATE IDENTITY PROVIDER "redshift-idc-app" TYPE AWSIDC
NAMESPACE 'awsidc'
APPLICATION_ARN 'arn:aws:sso::123456789012:application/ssoins-12345f67fe123d4/apl-a0b0a12dc123b1a4'
IAM_ROLE 'arn:aws:iam::123456789012:role/EspressoRedshiftRole';

To alter the IdP, use the following command (this new set of parameter values completely replaces the current values):

ALTER IDENTITY PROVIDER "<idp_name>"
NAMESPACE '<idc_namespace_name>'|
IAM_ROLE default | 'arn:aws:iam::<AWS account-id-1>:role/<role-name>';

A few examples:

ALTER IDENTITY PROVIDER "redshift-idc-app"
NAMESPACE 'awsidctest';

ALTER IDENTITY PROVIDER "redshift-idc-app"
IAM_ROLE 'arn:aws:iam::123456789012:role/administratoraccess';

Note: If you update the idc-namespace value, all new clusters created afterward will use the updated namespace.

For existing clusters or serverless workgroups, you need to update the namespace manually on each Amazon Redshift cluster using the previous command. Also, all the database roles associated with the identity provider will be updated with the new namespace value.

You can disable or enable the identity provider using the following command:

ALTER IDENTITY PROVIDER <provider_name> disable|enable;

Example:

ALTER IDENTITY PROVIDER <provider_name> disable;

You can drop an existing identity provider using the following syntax; the optional CASCADE clause also deletes the users and roles attached to the identity provider.

DROP IDENTITY PROVIDER <provider_name> [ CASCADE ]
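
For example, the following statement drops the identity provider created in the earlier example and, because of CASCADE, also removes the users and roles attached to it (a sketch reusing the sample application name from above):

DROP IDENTITY PROVIDER "redshift-idc-app" CASCADE;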

Configure Amazon Redshift role-based access

In this step, we pre-create the database roles in Amazon Redshift based on the groups that you synced in IdC. Make sure the role name matches the IdC group name.

Amazon Redshift roles simplify managing privileges required for your end-users. In this post, we create two database roles, sales and finance, and grant them access to query tables with sales and finance data, respectively. You can download this sample SQL Notebook and import into Redshift Query Editor v2 to run all cells in the notebook used in this example. Alternatively, you can copy and enter the SQL into your SQL client.

Below is the syntax to create a role in Amazon Redshift:

create role "<IDC_Namespace>:<IdP groupname>";

For example:

create role "awsidc:awssso-sales";
create role "awsidc:awssso-finance";

Create the sales and finance database schema:

create schema sales_schema;
create schema finance_schema;

Creating the tables:

CREATE TABLE IF NOT EXISTS finance_schema.revenue
(
account INTEGER   ENCODE az64
,customer VARCHAR(20)   ENCODE lzo
,salesamt NUMERIC(18,0)   ENCODE az64
)
DISTSTYLE AUTO
;

insert into finance_schema.revenue values (10001, 'ABC Company', 12000);
insert into finance_schema.revenue values (10002, 'Tech Logistics', 175400);
insert into finance_schema.revenue values (10003, 'XYZ Industry', 24355);
insert into finance_schema.revenue values (10004, 'The tax experts', 186577);

CREATE TABLE IF NOT EXISTS sales_schema.store_sales
(
ID INTEGER   ENCODE az64,
Product varchar(20),
Sales_Amount INTEGER   ENCODE az64
)
DISTSTYLE AUTO
;

insert into sales_schema.store_sales values (1,'product1',1000);
insert into sales_schema.store_sales values (2,'product2',2000);
insert into sales_schema.store_sales values (3,'product3',3000);
insert into sales_schema.store_sales values (4,'product4',4000);

Below is the syntax to grant permissions to an Amazon Redshift database role:

GRANT { { SELECT | INSERT | UPDATE | DELETE | DROP | REFERENCES } [,...]| ALL [ PRIVILEGES ] } ON { [ TABLE ] table_name [, ...] | ALL TABLES IN SCHEMA schema_name [, ...] } TO ROLE "<IDC_Namespace>:<IdP groupname>";

Grant the relevant permissions to each role as per your requirements. In the following example, we grant usage and select permissions on sales_schema to the sales role, and usage and select permissions on finance_schema to the finance role.

For example:

grant usage on schema sales_schema to role "awsidc:awssso-sales";
grant select on all tables in schema sales_schema to role "awsidc:awssso-sales";

grant usage on schema finance_schema to role "awsidc:awssso-finance";
grant select on all tables in schema finance_schema to role "awsidc:awssso-finance";

Create a permission set

A permission set is a template that you create and maintain that defines a collection of one or more IAM policies. Permission sets simplify the assignment of AWS account access for users and groups in your organization. We’ll create a permission set to allow federated users to access Query Editor V2.

Follow these steps to create a permission set:

  1. Open the IAM Identity Center Console.
  2. In the navigation pane, under Multi-Account permissions, choose Permission sets.
  3. Choose Create permission set.
    IdC-Create Permission Set
  4. Choose Custom permission set and then choose Next.
  5. Under AWS managed policies, choose AmazonRedshiftQueryEditorV2ReadSharing.
  6. Under Customer managed policies, provide the policy name which you created in step 4 under section – Setting up IAM Identity Center integration with Amazon Redshift.
  7. Choose Next.
  8. Enter permission set name. For example, Amazon Redshift-Query-Editor-V2.
  9. Under Relay state – optional, set the default relay state to the Query Editor V2 URL, using the format: https://<region>.console.aws.amazon.com/sqlworkbench/home.
    For this post, we use: https://us-east-1.console.aws.amazon.com/sqlworkbench/home.
  10. Choose Next.
  11. On the Review and create screen, choose Create. The console displays the following message: The permission set Redshift-Query-Editor-V2 was successfully created.
    IdC- Review Create Permission

Assign permission set to AWS accounts

  1. Open the IAM Identity Center Console.
  2. In the navigation pane, under Multi-account permissions, choose AWS accounts.
  3. On the AWS accounts page, select one or more AWS accounts that you want to assign single sign-on access to.
  4. Choose Assign users or groups.
  5. On the Assign users and groups to AWS-account-name page, choose the groups that you want to assign the permission set to. Then, choose Next.
  6. On the Assign permission sets to AWS-account-name page, choose the permission set you created in the section Create a permission set. Then, choose Next.
  7. On the Review and submit assignments to AWS-account-name page, choose Submit. The console displays the following message: We reprovisioned your AWS account successfully and applied the updated permission set to the account.
    Idc-Review and Submit AssignmentsIdC-Assignment Permission Created

Federate to Amazon Redshift Query Editor V2 using IdC

Now you’re ready to connect to Amazon Redshift Query Editor V2 with a federated login using IdC authentication:

  1. Open the IAM Identity Center Console.
  2. Go to dashboard and select the AWS access portal URL.
  3. A browser pop-up triggers and takes you to the Okta Login page where you enter your Okta credentials.
  4. After successful authentication, you’ll be logged into the AWS console as a federated user.
  5. Select your AWS Account and choose the Amazon Redshift Query Editor V2 application.
  6. Once you federate to Query Editor V2, choose your Redshift instance (i.e., right-click) and choose Create connection.
  7. To authenticate using IdC, choose the authentication method IAM Identity Center.
  8. A pop-up appears, and because your Okta credentials are already cached, it reuses the same credentials and connects to Amazon Redshift Query Editor V2 using IdC authentication.

The following demonstration shows a federated user (Ethan) using the AWS access portal URL to access Amazon Redshift with IdC authentication. User Ethan can access the sales_schema tables. If Ethan tries to access the tables in finance_schema, the user gets a permission denied error.
QEV2-IdC-Demo

Troubleshooting

  • If you get the following error:
    ERROR: registered identity provider does not exist for "<idc-namespace:redshift_database_role_name"

This means that you are trying to create a role with the wrong namespace. Check the current namespace using the command select * from identity_providers;

  • If you get the following error:
    ERROR: (arn:aws:iam::123456789012:role/<iam_role_name>) Insufficient privileges to access AWS IdC. [ErrorId: 1-1234cf25-1f23ea1234c447fb12aa123e]

This means that the IAM role doesn’t have sufficient privileges to access the IdC. Your IAM role should contain a policy with the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "redshift:DescribeQev2IdcApplications",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "sso:DescribeApplication",
                "sso:DescribeInstance"
            ],
            "Resource": "arn:aws:sso::726034267621:application/ssoins-722324fe82a3f0b8/*"
        }
    ]
}
  • If you get the following error:
    FATAL: Invalid scope. User's credentials are not authorized to connect to Redshift. [SQL State=28000]

Please make sure that the user and group are added to the Amazon Redshift IdC application.
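
If the role mapping itself is in doubt, you can also inspect which database roles the IdC-mapped user has been granted. The following queries are a sketch only; the system view names are taken from the current Amazon Redshift role-based access control documentation, so verify them against your version:

-- Roles granted to users (IdC-mapped users carry the namespace prefix, for example awsidc:ethan)
select * from svv_user_grants;

-- Roles granted to other roles
select * from svv_role_grants;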

Clean up

Complete the following steps to clean up your resources:

  1. Delete the Okta applications that you created to integrate with IdC.
  2. Delete the IAM Identity Center configuration.
  3. Delete the Redshift application and the Redshift provisioned cluster that you created for testing.
  4. Delete the IAM role that you created for the IdC and Redshift integration.

Conclusion

In this post, we showed you a detailed walkthrough of how you can integrate Okta with the IdC and Amazon Redshift Query Editor version 2 to simplify your SSO setup. This integration allows you to use role-based access control with Amazon Redshift. We encourage you to try out this integration.

To learn more about IdC with Amazon Redshift, visit the documentation.


About the Authors

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Harshida Patel is a Principal Solutions Architect, Analytics with AWS.

Praveen Kumar Ramakrishnan is a Senior Software Engineer at AWS. He has nearly 20 years of experience spanning various domains including filesystems, storage virtualization and network security. At AWS, he focuses on enhancing the Redshift data security.

Karthik Ramanathan is a Sr. Software Engineer with AWS Redshift and is based in San Francisco. He brings close to two decades of development experience across the networking, data storage and IoT verticals prior to Redshift. When not at work, he is also a writer and loves to be in the water.

How Eightfold AI implemented metadata security in a multi-tenant data analytics environment with Amazon Redshift

Post Syndicated from Arun Sudhir original https://aws.amazon.com/blogs/big-data/how-eightfold-ai-implemented-metadata-security-in-a-multi-tenant-data-analytics-environment-with-amazon-redshift/

This is a guest post co-written with Arun Sudhir from Eightfold AI.

Eightfold is transforming the world of work by providing solutions that empower organizations to recruit and retain a diverse global workforce. Eightfold is a leader in AI products for enterprises to build on their talent’s existing skills. From Talent Acquisition to Talent Management and talent insights, Eightfold offers a single AI platform that does it all.

The Eightfold Talent Intelligence Platform powered by Amazon Redshift and Amazon QuickSight provides a full-fledged analytics platform for Eightfold’s customers. It delivers analytics and enhanced insights about the customer’s Talent Acquisition, Talent Management pipelines, and much more. Customers can also implement their own custom dashboards in QuickSight. As part of the Talent Intelligence Platform Eightfold also exposes a data hub where each customer can access their Amazon Redshift-based data warehouse and perform ad hoc queries as well as schedule queries for reporting and data export. Additionally, customers who have their own in-house analytics infrastructure can integrate their own analytics solutions with Eightfold Talent Intelligence Platform by directly connecting to the Redshift data warehouse provisioned for them. Doing this gives them access to their raw analytics data, which can then be integrated into their analytics infrastructure irrespective of the technology stack they use.

Eightfold provides this analytics experience to hundreds of customers today. Securing customer data is a top priority for Eightfold. The company requires the highest security standards when implementing a multi-tenant analytics platform on Amazon Redshift.

The Eightfold Talent Intelligence Platform integrates with Amazon Redshift metadata security to implement visibility of data catalog listing of names of databases, schemas, tables, views, stored procedures, and functions in Amazon Redshift.

In this post, we discuss how the Eightfold Talent Lake system team implemented the Amazon Redshift metadata security feature in their multi-tenant environment to enable access controls for the database catalog. By linking access to business-defined entitlements, they are able to enforce data access policies.

Amazon Redshift security controls address restricting data access to users who have been granted permission. This post discusses restricting the listing of data catalog metadata according to the granted permissions.

The Eightfold team needed to develop a multi-tenant application with the following features:

  • Enforce visibility of Amazon Redshift objects on a per-tenant basis, so that each tenant can only view and access their own schema
  • Implement tenant isolation and security so that tenants can only see and interact with their own data and objects

Metadata security in Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Many customers have implemented Amazon Redshift to support multi-tenant applications. One of the challenges with multi-tenant environments is that database objects are visible to all tenants even though tenants are only authorized to access certain objects. This visibility creates data privacy challenges because many customers want to hide objects that tenants can’t access.

The newly released metadata security feature in Amazon Redshift enables you to hide database objects from all other tenants and make objects only visible to tenants who are authorized to see and use them. Tenants can use SQL tools, dashboards, or reporting tools, and also query the database catalog, but they will only see appropriate objects for which they have permissions to see.

Solution overview

Exposing a Redshift endpoint to all of Eightfold’s customers as part of the Talent Lake endeavor involved several design choices that had to be carefully considered. Eightfold has a multi-tenant Redshift data warehouse with an individual schema per customer, which each customer could connect to using their own credentials to query their data. Data in each customer tenant can only be accessed by the credentials that have access to that customer’s schema. Each customer could access data under their analytics schema, which was named after the customer. For example, for a customer named A, the schema name would be A_analytics. The following diagram illustrates this architecture.
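
As an illustration only, with hypothetical names rather than Eightfold’s actual DDL, the per-tenant pattern described above looks roughly like the following:

-- One schema and one database user per customer tenant (placeholder credential for illustration)
create schema a_analytics;
create user a_user password 'ChangeMe123';
grant usage on schema a_analytics to a_user;
grant select on all tables in schema a_analytics to a_user;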

Although customer data was secured by restricting access to only the customer user, when customers used business intelligence (BI) tools like QuickSight, Microsoft Power BI, or Tableau to access their data, the initial connection showed all the customer schemas because it was performing a catalog query (which couldn’t be restricted). Therefore, Eightfold’s customers had concerns that other customers could discover that they were Eightfold’s customers by simply trying to connect to Talent Lake. This unrestricted database catalog access posed a privacy concern to several Eightfold customers. Although this could be avoided by provisioning one Redshift database per customer, that was a logistically difficult and expensive solution to implement.

The following screenshot shows what a connection from QuickSight to our data warehouse looked like without metadata security turned on. All other customer schemas were exposed even though the connection to QuickSight was made as customer_k_user.

Approach for implementing metadata access controls

To implement restricted catalog access, and ensure it worked with Talent Lake, we cloned our production data warehouse with all the schemas and enabled the metadata security flag in the Redshift data warehouse by connecting with SQL tools. After it was enabled, we tested the catalog queries by connecting to the data warehouse from BI tools like QuickSight, Microsoft Power BI, and Tableau and ensured that only the customer schemas show up as a result of the catalog query. We also tested by running catalog queries after connecting to the Redshift data warehouse from psql, to ensure that only the customer schema objects were surfaced—this validation matters because some tenants access the Redshift data warehouse directly.

The metadata security feature was tested by first turning on metadata security in our Redshift data warehouse by connecting using a SQL tool or Amazon Redshift Query Editor v2.0 and issuing the following command:

ALTER SYSTEM SET metadata_security = TRUE;

Note that the preceding command is set at the Redshift cluster level or Redshift Serverless endpoint level, which means it is applied to all databases and schemas in the cluster or endpoint.

In Eightfold’s scenario, data access controls are already in place for each of the tenants for their respective database objects.

After turning on the metadata security feature in Amazon Redshift, Eightfold was able to restrict database catalog access to only show individual customer schemas for each customer that was trying to connect to Amazon Redshift and further validated by issuing a catalog query to access schema objects as well.

We also tested by connecting via psql and trying out various catalog queries. All of them yielded only the relevant customer schema of the logged-in user as the result. The following are some examples:

analytics=> select * from pg_user;
usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig 
------------------------+----------+-------------+----------+-----------+----------+----------+-------------------------------------------
customer_k_user | 377 | f | f | f | ******** | | 
(1 row)

analytics=> select * from information_schema.schemata;
catalog_name | schema_name | schema_owner | default_character_set_catalog | default_character_set_schema | default_character_set_name | sql_path 
--------------+----------------------+------------------------+-------------------------------+------------------------------+----------------------------+----------
analytics | customer_k_analytics | customer_k_user | | | | 
(1 row)

The following screenshot shows the UI after metadata security was enabled: only customer_k_analytics is seen when connecting to the Redshift data warehouse as customer_k_user.

This ensured that individual customer privacy was protected and increased customer confidence in Eightfold’s Talent Lake.

Customer feedback

“Being an AI-first platform for customers to hire and develop people to their highest potential, data and analytics play a vital role in the value provided by the Eightfold platform to its customers. We rely on Amazon Redshift as a multi-tenant Data Warehouse that provides rich analytics with data privacy and security through customer data isolation by using schemas. In addition to the data being secure as always, we layered on Redshift’s new metadata access control to ensure customer schemas are not visible to other customers. This feature truly made Redshift the ideal choice for a multi-tenant, performant, and secure Data Warehouse and is something we are confident differentiates our offering to our customers.”

– Sivasankaran Chandrasekar, Vice President of Engineering, Data Platform at Eightfold AI

Conclusion

In this post, we demonstrated how the Eightfold Talent Intelligence Platform team implemented a multi-tenant environment for hundreds of customers, using the Amazon Redshift metadata security feature. For more information about metadata security, refer to the Amazon Redshift documentation.

Try out the metadata security feature for your future Amazon Redshift implementations, and feel free to leave a comment about your experience!


About the authors

Arun Sudhir is a Staff Software Engineer at Eightfold AI. He has more than 15 years of experience in design and development of backend software systems in companies like Microsoft and AWS, and has a deep knowledge of database engines like Amazon Aurora PostgreSQL and Amazon Redshift.

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build next-generation analytics solutions using AWS Analytics services.

Anjali Vijayakumar is a Senior Solutions Architect at AWS focusing on EdTech. She is passionate about helping customers build well-architected solutions in the cloud.

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/amazon-redshift-announcements-at-aws-reinvent-2023-to-enable-analytics-on-all-your-data/

In 2013, Amazon Web Services revolutionized the data warehousing industry by launching Amazon Redshift, the first fully managed, petabyte-scale, enterprise-grade cloud data warehouse. Amazon Redshift made it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This cloud service was a significant leap from the traditional data warehousing solutions, which were expensive, not elastic, and required significant expertise to tune and operate. Since then, customer demands for better scale, higher throughput, and agility in handling a wide variety of changing, but increasingly business-critical, analytics and machine learning use cases have exploded, and we have been keeping pace. Today, tens of thousands of customers use Amazon Redshift in AWS global infrastructure to collectively process exabytes of data daily, and they employ Amazon Redshift as a key component of their data architecture to drive use cases from typical dashboarding to self-service analytics, real-time analytics, machine learning, data sharing and monetization, and more.

The advancements to Amazon Redshift announced at AWS re:Invent 2023 further accelerate the modernization of your cloud analytics environments, in keeping with our core tenet of helping you achieve the best price-performance at any scale. These announcements drive forward the AWS Zero-ETL vision to unify all your data, enabling you to better maximize the value of your data with comprehensive analytics and ML capabilities, and innovate faster with secure data collaboration within and across organizations. From price-performance improvements to zero-ETL to generative AI capabilities, we have something for everyone. Let’s dive into the highlights.

Modernizing analytics for scale, performance, and reliability

“Our migration from legacy on-premises platform to Amazon Redshift allows us to ingest data 88% faster, query data 3x faster, and load daily data to the cloud 6x faster. Amazon Redshift enabled us to optimize performance, availability, and reliability—significantly easing operational complexity, while increasing the velocity of our end-users’ decision-making experience on the Fab floor.”

– Sunil Narayan, Sr Dir, Analytics at GlobalFoundries

Diligently driving the best price-performance at scale with new improvements

Since day 1, Amazon Redshift has been building innovative capabilities to help you get optimal performance while keeping costs low. Amazon Redshift continues to lead on the price-performance front, with up to 6x better price-performance than alternative cloud data warehouses, and for dashboarding applications with high concurrency and low latency. We closely analyze query patterns in the fleet and look for opportunities to drive customer-focused innovation. For example, earlier in the year, we announced speed-ups for string-based data processing of up to 63x compared to alternative compression encodings such as LZO (Lempel-Ziv-Oberhumer) or ZStandard. At AWS re:Invent 2023, we introduced more performance enhancements in query planning and execution, such as enhanced bloom filters, query rewrites, and support for write operations in auto scaling. For more information about performance improvement capabilities, refer to the list of announcements below.

Amazon Redshift Serverless is more intelligent than ever with new AI-driven scaling and optimizations

Speaking of price-performance, new next-generation AI-driven scaling and optimization capabilities in Amazon Redshift Serverless can deliver up to 10x better price-performance for variable workloads (based on internal testing), without manual intervention. Amazon Redshift Serverless, generally available since 2021, allows you to run and scale analytics without having to provision and manage the data warehouse. Since GA, Redshift Serverless has executed over a billion queries to power data insights for thousands of customers. With these new AI optimizations, Amazon Redshift Serverless scales proactively and automatically with workload changes across all key dimensions, such as data volume, concurrent users, and query complexity. You simply specify your desired price-performance target (optimize for cost, optimize for performance, or balanced) and serverless does the rest. Learn more about additional improvements in Redshift Serverless under the list of announcements below.

Multi-data warehouse writes through data sharing

Data sharing is a widely adopted feature in Amazon Redshift, with customers running tens of millions of queries on shared data daily. Customers share live, transactionally consistent data within and across organizations and Regions for read purposes without data copies or data movement. Customers are using data sharing to modernize their analytics architectures from monolithic architectures to multi-cluster, data mesh deployments that enable seamless and secure access across organizations to drive data collaboration and powerful insights. At AWS re:Invent 2023, we extended data sharing capabilities to launch multi-data warehouse writes in preview. You can now start writing to Redshift databases from other Redshift data warehouses in just a few clicks, further enabling data collaboration and flexible scaling of compute for ETL and data processing workloads by adding warehouses of different types and sizes based on price-performance needs. You also get greater transparency into compute usage because each warehouse is billed for its own compute, which helps you keep costs under control.

Multidimensional data layouts

Amazon Redshift offers industry leading predictive optimizations that continuously monitor your workloads and seamlessly accelerate performance and maximize concurrency by adjusting data layout and compute management as you use the data warehouse more. In addition to the powerful optimizations Redshift already offers, such as Automatic Table Sort, Automatic sort and distribution keys, we are introducing Multidimensional Data Layouts, a new powerful table sorting mechanism that improves performance of repetitive queries by automatically sorting data based on the incoming query filters (for example: Sales in a specific region). This method significantly accelerates the performance of table scans compared to traditional methods.
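
Multidimensional data layouts apply to tables that Amazon Redshift manages automatically; as a minimal, hypothetical sketch, a table opts into automatic distribution and sorting (and therefore becomes a candidate for these layouts where Redshift deems them beneficial) like this:

-- Hypothetical table; AUTO lets Redshift choose and adjust the physical layout
CREATE TABLE store_sales_by_region (
    region    VARCHAR(32),
    sale_date DATE,
    amount    NUMERIC(18,2)
)
DISTSTYLE AUTO
SORTKEY AUTO;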

Unifying all your data with zero-ETL approaches

“Using the Aurora MySQL zero-ETL integration, we experience near real-time data synchronization between Aurora MySQL databases and Amazon Redshift, making it possible to build an analysis environment in just three hours instead of the month of developer time it used to take before”

– MoneyForward

JOYME uses Amazon Redshift’s streaming ingestion and other Amazon services for risk control over users’ financial activity such as recharge, refund, and rewards.

“With Redshift, we are able to view risk counterparts and data in near real time—
instead of on an hourly basis. Redshift significantly improved our business ROI efficiency.”

– PengBo Yang, CTO, JOYME

Data pipelines can be challenging and costly to build and manage and can create hours-long delays to obtain transactional data for analytics. These delays can lead to missed business opportunities, especially when the insights derived from analytics on transactional data are relevant for only a limited amount of time. Amazon Redshift employs AWS’s zero-ETL approach that enables interoperability and integration between the data warehouse and operational databases and even your streaming data services, so that the data is easily and automatically ingested into the warehouse for you, or you can access the data in place, where it lives.

Zero-ETL integrations with operational databases

We delivered zero-ETL integration between Amazon Aurora MySQL and Amazon Redshift (general availability) this year, to enable near real-time analytics and machine learning (ML) using Amazon Redshift on petabytes of transactional data from Amazon Aurora. Within seconds of transactional data being written into Aurora, the data is available in Amazon Redshift, so you don’t have to build and maintain complex data pipelines to perform extract, transform, and load (ETL) operations. At AWS re:Invent, we extended zero-ETL integration to additional sources, specifically Aurora PostgreSQL, Amazon DynamoDB, and Amazon RDS for MySQL. Zero-ETL integration also enables you to load and analyze data from multiple operational database clusters in a new or existing Amazon Redshift instance to derive holistic insights across many applications.

Data lake querying with support for Apache Iceberg tables

Amazon Redshift allows customers to run a wide range of workloads on data warehouses and data lakes using its support for various open file and table formats. At AWS re:Invent, we announced the general availability of support for Apache Iceberg tables, so you can easily access your Apache Iceberg tables on your data lake from Amazon Redshift and join them with the data in your data warehouse when needed. Use one click to access your data lake tables using auto-mounted AWS Glue data catalogs on Amazon Redshift for a simplified experience. We have improved data lake query performance by integrating with AWS Glue statistics and introduced a preview of incremental refresh for materialized views on data lake data to accelerate repeated queries.
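
For example, with the auto-mounted AWS Glue Data Catalog (exposed in Amazon Redshift as awsdatacatalog), a data lake table can be queried and joined with warehouse data directly. The database, schema, and table names below are hypothetical:

-- Join a local warehouse table with an Iceberg table registered in the Glue Data Catalog
SELECT w.customer_id, w.lifetime_value, l.last_event_time
FROM public.customer_metrics w
JOIN awsdatacatalog.lake_db.clickstream_iceberg l
  ON w.customer_id = l.customer_id;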

Learn more about the zero-ETL integrations, data lake performance enhancements, and other announcements below.

Maximize value with comprehensive analytics and ML capabilities

“Amazon Redshift is one of the most important tools we had in growing Jobcase as a company.”

– Ajay Joshi, Distinguished Engineer, Jobcase

With all your data integrated and available, you can easily build and run near real-time analytics to AI/ML/Generative AI applications. Here’s a couple of highlights from this week and for the full list, see below.

Amazon Q Generative SQL capability

Query Editor, an out-of-the-box web-based SQL experience in Amazon Redshift is a popular tool for data exploration, visual analysis, and data collaboration. At AWS re:Invent, we introduced Amazon Q Generative SQL capabilities in Amazon Redshift Query Editor (preview), to simplify query authoring and increase your productivity by allowing you to express queries in natural language and receive SQL code recommendations. Generative SQL uses AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly allowing you to get insights faster in a conversational format without extensive knowledge of your organization’s complex database metadata.

Amazon Redshift ML large language model (LLM) integration

Amazon Redshift ML enables customers to create, train, and deploy machine learning models using familiar SQL commands. Customers use Redshift ML to run an average of over 10 billion predictions a day within their data warehouses. At AWS re:Invent, we announced support for LLMs as preview. Now, you can use pre-trained open source LLMs in Amazon SageMaker JumpStart as part of Redshift ML, allowing you to bring the power of LLMs to analytics. For example, you can make inferences on your product feedback data in Amazon Redshift, use LLMs to summarize feedback, perform entity extraction, sentiment analysis and product feedback classification.
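
The LLM support is in preview and its exact options may differ, but the general Redshift ML pattern of attaching a SageMaker endpoint as a SQL function looks roughly like the following; the endpoint name, IAM role, table, and function signature are hypothetical:

-- Register a SageMaker endpoint as a SQL function (bring-your-own-model pattern)
CREATE MODEL feedback_summarizer
FUNCTION summarize_feedback (VARCHAR)
RETURNS VARCHAR
SAGEMAKER 'jumpstart-llm-demo-endpoint'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole';

-- Then call it like any other SQL function
SELECT summarize_feedback(feedback_text) FROM product_feedback LIMIT 10;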

Innovate faster with secure data collaboration within and across the organizations

“Millions of companies use Stripe’s software and APIs to accept payments, send payouts, and manage their businesses online.  Access to their Stripe data via leading data warehouses like Amazon Redshift has been a top request from our customers. Our customers needed secure, fast, and integrated analytics at scale without building complex data pipelines or moving and copying data around. With Stripe Data Pipeline for Amazon Redshift, we’re helping our customers set up a direct and reliable data pipeline in a few clicks. Stripe Data Pipeline enables our customers to automatically share their complete, up-to-date Stripe data with their Amazon Redshift data warehouse, and take their business analytics and reporting to the next level.”

– Tony Petrossian, Head of Engineering, Revenue & Financial Management at Stripe

With Amazon Redshift, you can easily and securely share data and collaborate no matter where your teams or data are located, and you can have confidence that your data is secure no matter where you operate or how highly regulated your industry is. We have enabled fine-grained permissions and an easy authentication experience with single sign-on for your organizational identity, all provided at no additional cost to you.

Unified identity with IAM identity center integration

We announced Amazon Redshift integration with AWS IAM Identity Center to enable organizations to support trusted identity propagation between Amazon QuickSight, Amazon Redshift Query Editor, and Amazon Redshift. Customers can use their organization identities to access Amazon Redshift in a single sign-on experience using third-party identity providers (IdP), such as Microsoft Entra ID, Okta, Ping, and OneLogin, from Amazon QuickSight and Amazon Redshift Query Editor. Administrators can use third-party identity provider users and groups to manage fine-grained access to data across services and audit user-level access in AWS CloudTrail. With trusted identity propagation, a user’s identity is passed seamlessly between Amazon QuickSight and Amazon Redshift, reducing time to insights and enabling a friction-free analytics experience.

For the full set of announcements, see the following:

Learn more: https://aws.amazon.com/redshift


About the authors

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned Product Management and GTM leader, bringing over 20 years of experience in product vision, strategy and leadership roles in data products and platforms. Neeraja delivered products in analytics, databases, data Integration, application integration, AI/Machine Learning, large scale distributed systems across On-Premise and Cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica and Expedia.com.

Sunaina AbdulSalah leads product marketing for Amazon Redshift. She focuses on educating customers about the impact of data warehousing and analytics and sharing AWS customer stories. She has a deep background in marketing and GTM functions in the B2B technology and cloud computing domains. Outside of work, she spends time with her family and friends and enjoys traveling.

Amazon Redshift adds new AI capabilities, including Amazon Q, to boost efficiency and productivity

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-redshift-adds-new-ai-capabilities-to-boost-efficiency-and-productivity/

Amazon Redshift puts artificial intelligence (AI) at your service to optimize efficiencies and make you more productive with two new capabilities that we are launching in preview today.

First, Amazon Redshift Serverless becomes smarter. It scales capacity proactively and automatically along dimensions such as the complexity of your queries, their frequency, the size of the dataset, and so on to deliver tailored performance optimizations. This allows you to spend less time tuning your data warehouse instances and more time getting value from your data.

Second, Amazon Q generative SQL in Amazon Redshift Query Editor generates SQL recommendations from natural language prompts. This helps you to be more productive in extracting insights from your data.

Let’s start with Amazon Redshift Serverless
When you use Amazon Redshift Serverless, you can now opt in for a preview of AI-driven scaling and optimizations. When enabled, the system observes and learns from your usage patterns, such as the concurrent number of queries, their complexity, and the time it takes to run them. Then, it automatically optimizes your serverless endpoint to meet your price performance target. Based on AWS internal testing, this new capability may give you up to ten times better price performance for variable workloads without any manual intervention.

AI-driven scaling and optimizations eliminate the time and effort to manually resize your workgroup and plan background optimizations based on workload needs. It continually runs automatic optimizations when they are most valuable for better performance, avoiding performance cliffs and time-outs.

This new capability goes beyond the existing self-tuning capabilities of Amazon Redshift Serverless, such as machine learning (ML)-enhanced techniques to adjust your compute, modify the physical schema of the database, create or drop materialized views as needed (the one we manage automatically, not yours), and vacuum tables. This new capability brings more intelligence to decide how to adjust the compute, what background optimizations are required, and when to apply them, and it makes its decisions based on more dimensions. We also orchestrate ML-based optimizations for materialized views, table optimizations, and workload management when your queries need it.

During the preview, you must opt in to enable these AI-driven scaling and optimizations on your workgroups. You configure the system to balance the optimization for price or performance. There is only one slider to adjust in the console.

Redshift Serverless - AI-driven workgroups

As usual, you can track resource usage and associated changes through the console, Amazon CloudWatch metrics, and the system table SYS_SERVERLESS_USAGE.
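
For example, a quick look at recent serverless usage might be the following sketch; the column names are taken from the current documentation for SYS_SERVERLESS_USAGE, so adjust them for your version:

-- Review recent compute capacity and charged seconds for the workgroup
SELECT start_time, end_time, compute_capacity, charged_seconds
FROM sys_serverless_usage
ORDER BY start_time DESC
LIMIT 20;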

Now, let’s look at Amazon Q generative SQL in Amazon Redshift Query Editor
What if you could use generative AI to help analysts write effective SQL queries more rapidly? This is the new experience we introduce today in Amazon Redshift Query Editor, our web-based SQL editor.

You can now describe the information you want to extract from your data in natural language, and we generate the SQL query recommendations for you. Behind the scenes, Amazon Q generative SQL uses a large language model (LLM) and Amazon Bedrock to generate the SQL query. We use different techniques, such as prompt engineering and Retrieval Augmented Generation (RAG), to query the model based on your context: the database you’re connected to, the schema you’re working on, your query history, and optionally the query history of other users connected to the same endpoint. The system also remembers previous questions. You can ask it to refine a previously generated query.

The SQL generation model uses metadata specific to your data schema to generate relevant queries. For example, it uses the table and column names and the relationship between the tables in your database. In addition, your database administrator can authorize the model to use the query history of all users in your AWS account to generate even more relevant SQL statements. We don’t share your query history with other AWS accounts and we don’t train our generation models with any data coming from your AWS account. We maintain the high level of privacy and security that you expect from us.

Using generated SQL queries helps you to get started when discovering new schemas. It does the heavy lifting of discovering the column names and relationships between tables for you. Senior analysts also benefit from asking what they want in natural language and having the SQL statement automatically generated. They can review the queries and run them directly from their notebook.

Let’s explore a schema and extract information
For this demo, let’s pretend I am a data analyst at a company that sells concert tickets. The database schema and data are available for you to download. My manager asks me to analyze the ticket sales data to send a thank you note with discount coupons to the highest-spending customers in Seattle.

I connect to Amazon Redshift Query Editor and then to the analytics endpoint. I create a new tab for a notebook (SQL generation is available from notebooks only).

Instead of writing a SQL statement, I open the chat panel and type, “Find the top five users from Seattle who bought the most number of tickets in 2022.” I take the time to verify the generated SQL statement. It seems correct, so I decide to run it. I select Add to notebook and then Run. The query returns the list of the top five buyers in Seattle.

sql generation - top 5 users
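
For readers following along without the screenshots, the recommended statement might look roughly like the following. This is a hypothetical sketch against the sample ticket-sales schema, and the actual recommendation can differ:

-- Top five Seattle buyers by tickets purchased in 2022
SELECT u.userid, u.firstname, u.lastname, SUM(s.qtysold) AS total_tickets
FROM users u
JOIN sales s ON s.buyerid = u.userid
JOIN date d ON d.dateid = s.dateid
WHERE u.city = 'Seattle'
  AND d.year = 2022
GROUP BY u.userid, u.firstname, u.lastname
ORDER BY total_tickets DESC
LIMIT 5;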

I had no previous knowledge of the data schema, and I did not type a single line of SQL to find the information I needed.

But generative SQL is not limited to a single interaction. I can chat with it to dynamically refine the queries. Here is another example.

I ask “Which state has the most venues?” Generative SQL proposes the following query. The answer is New York, with 49 venues, if you’re curious.

generative sql chat 01

I changed my mind, and I want to know the top three cities with the most venues. I simply rephrase my question: “What about the top three venues?”

generative sql chat 02

I add the query to the notebook and run it. It returns the expected result.

generative sql chat 03

Best practices for prompting
Here are a couple of tips and tricks to get the best results out of your prompts.

Be specific – When asking questions in natural language, be as specific as possible to help the system understand exactly what you need. For example, instead of writing “find the top venues that sold the most tickets,” provide more details like “find the names of the top three venues that sold the most tickets in 2022.” Use consistent entity names like venue, ticket, and location instead of referring to the same entity in different ways, which can confuse the system.

Iterate – Break your complex requests into multiple simple statements that are easier for the system to interpret. Iteratively ask follow-up questions to get more detailed analysis from the system. For example, start by asking, “Which state has the most venues?” Then, based on the response, ask a follow-up question like “Which is the most popular venue from this state?”

Verify – Review the generated SQL before running it to ensure accuracy. If the generated SQL query has errors or does not match your intent, provide instructions to the system on how to correct it instead of rephrasing the entire request. For example, if the query is missing a filter clause on year, write “provide venues from year 2022.”

Availability and pricing
AI-driven scaling and optimizations are in preview in six AWS Regions: US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland, Stockholm). They come at no additional cost. You pay only for the compute capacity your data warehouse consumes when it is active. Pricing is per Redshift Processing Unit (RPU) per hour. The billing is per second of used capacity. The pricing page for Amazon Redshift has the details.

Amazon Q generative SQL for Amazon Redshift Query Editor is in preview in two AWS Regions today: US East (N. Virginia) and US West (Oregon). There is no charge during the preview period.

These are two examples of how AI helps to optimize performance and increase your productivity, either by automatically adjusting the price-performance ratio of your Amazon Redshift Serverless endpoints or by generating correct SQL statements from natural language prompts.

Previews are essential for us to capture your feedback before we make these capabilities available for all. Experiment with these today and let us know what you think on the re:Post forums or using the feedback button on the bottom left side of the console.

— seb

Announcing zero-ETL integrations with AWS Databases and Amazon Redshift

Post Syndicated from Jyoti Aggarwal original https://aws.amazon.com/blogs/big-data/announcing-zero-etl-integrations-with-aws-databases-and-amazon-redshift/

As customers become more data driven and use data as a source of competitive advantage, they want to easily run analytics on their data to better understand their core business drivers to grow sales, reduce costs, and optimize their businesses. To run analytics on their operational data, customers often build solutions that are a combination of a database, a data warehouse, and an extract, transform, and load (ETL) pipeline. ETL is the process data engineers use to combine data from different sources.

Through customer feedback, we learned that a lot of undifferentiated time and resources go toward building and managing ETL pipelines between transactional databases and data warehouses. At Amazon Web Services (AWS), our goal is to make it easier for our customers to connect to and use all of their data and to do it with the speed and agility they need. We think that by automating the undifferentiated parts, we can help our customers increase the pace of their data-driven innovation by breaking down data silos and simplifying data integration.

Bringing operational data closer to analytics workflows

Customers want flexible data architectures that let them integrate data across their organization to give them a better picture of their customers, streamline operations, and help teams make better, faster decisions. But integrating data isn’t easy. Today, building these pipelines and assembling the architecture to interconnect all the data sources and optimize analytics results is complex, requires highly skilled resources, and renders data that can be erroneous or is often inconsistent.

Amazon Redshift powers data driven decisions for tens of thousands of customers every day with a fully managed, artificial intelligence (AI)-powered cloud data warehouse that delivers the best price-performance for your analytics workloads.

Zero-ETL is a set of integrations that eliminates the need to build ETL data pipelines. Zero-ETL integrations with Amazon Redshift enable customers to access their data in place using federated queries or ingest it into Amazon Redshift with a fully managed solution from across their databases. With newer features, such as support for autocopy that simplifies and automates file ingestion from Amazon Simple Storage Service (Amazon S3), Redshift Streaming Ingestion capabilities to continuously ingest any amount of streaming data directly into the warehouse, and multi-cluster data sharing architectures that minimize data movement and even provide access to third-party data, Amazon Redshift enables data integration and quick access to data without building manual pipelines.

With all the data integrated and available, Amazon Redshift empowers every data user to run analytics and build AI, machine learning (ML), and generative AI applications. Developers can run Apache Spark applications directly on the data in their warehouse from AWS analytics services, such as Amazon EMR and AWS Glue. They can enrich their datasets by joining operational data replicated through zero-ETL integrations with other sources such as sales and marketing data from SaaS applications and can even create Amazon QuickSight dashboards on top of this data to track key metrics across sales, website analytics, operations, and more—all in one place.

Customers can also use Amazon Redshift data sharing to securely share this data with multiple consumer clusters used by different teams—both within and across AWS accounts—driving a unified view of business and facilitating self-service access to application data within team clusters while maintaining governance over sensitive operational data.
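
To make the data sharing flow concrete, the following is a minimal sketch; the share name, schema, namespace IDs, and the orders table are placeholders, and the exact objects you add to a share depend on your own data model.

-- On the producer warehouse: create a share and add objects to it
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public;

-- Grant the share to the consumer namespace
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-id>';

-- On the consumer warehouse: surface the share as a database and query it
CREATE DATABASE sales_share_db FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-id>';
SELECT COUNT(*) FROM sales_share_db.public.orders;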

Furthermore, customers can build machine learning models directly on their operational data in Amazon Redshift ML (native integration into Amazon SageMaker) without needing to build any data pipelines and use them to run billions of predictions with SQL commands. Or they can build complex transformations and aggregations on the integrated data using Amazon Redshift materialized views.
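
To illustrate the Redshift ML flow, here is a minimal sketch of training and invoking a model with SQL; the table, columns, model name, and S3 bucket are hypothetical.

-- Train a model directly on operational data (hypothetical table and columns)
CREATE MODEL customer_churn_model
FROM (SELECT age, tenure_months, monthly_spend, churned FROM customer_activity)
TARGET churned
FUNCTION predict_customer_churn
IAM_ROLE default
SETTINGS (S3_BUCKET 'amzn-s3-demo-bucket');

-- Once training completes, run predictions with plain SQL
SELECT customer_id, predict_customer_churn(age, tenure_months, monthly_spend) AS churn_prediction
FROM customer_activity;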

We’re excited to share four AWS database zero-ETL integrations with Amazon Redshift:

  • Amazon Aurora MySQL-Compatible Edition
  • Amazon Aurora PostgreSQL-Compatible Edition
  • Amazon RDS for MySQL
  • Amazon DynamoDB

By bringing different database services closer to analytics, AWS is streamlining access to data and enabling companies to accelerate innovation, create competitive advantage, and maximize the business value extracted from their data assets.

Amazon Aurora zero-ETL integration with Amazon Redshift

The Amazon Aurora zero-ETL integration with Amazon Redshift unifies transactional data from Amazon Aurora with near real-time analytics in Amazon Redshift. This eliminates the burden of building and maintaining custom ETL pipelines between the two systems. Unlike traditional siloed databases that force a tradeoff between performance and analytics, the zero-ETL integration replicates data from multiple Aurora clusters into the same Amazon Redshift warehouse. This enables holistic insights across applications without impacting production workloads. The entire system can be serverless and can auto-scale to handle fluctuations in data volume without infrastructure management.
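
On the Amazon Redshift side, the data replicated by a zero-ETL integration is exposed by creating a database from the integration. The following is a sketch; the integration ID is a placeholder copied from the integration details, and the schema and table names are hypothetical.

-- Create a destination database in Amazon Redshift from the integration
CREATE DATABASE aurora_zeroetl FROM INTEGRATION '<integration-id>';

-- Replicated tables can then be queried like any other table
SELECT order_status, COUNT(*) AS order_count
FROM aurora_zeroetl.salesdb.orders
GROUP BY order_status;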

Amazon Aurora MySQL zero-ETL integration with Amazon Redshift processes over 1 million transactions per minute (an equivalent of 17.5 million insert/update/delete row operations per minute) from multiple Aurora databases and makes them available in Amazon Redshift in less than 15 seconds (p50 latency lag). Figure 1 shows how the Aurora MySQL zero-ETL integration with Amazon Redshift works at a high level.

Figure 1: High level working of Aurora MySQL zero-ETL integration with Amazon Redshift

In their own words, see how one of our customers is using Aurora MySQL zero-ETL integration with Amazon Redshift.

In the retail industry, for example, Infosys wanted to gain faster insights about their business, such as best-selling products and high-revenue stores, based on transactions in a store management system. They used Amazon Aurora MySQL zero-ETL integration with Amazon Redshift to achieve this. With this integration, Infosys replicated Aurora data to Amazon Redshift and created Amazon QuickSight dashboards for product managers and channel leaders in just a few seconds, instead of several hours. Now, as part of Infosys Cobalt and Infosys Topaz blueprints, enterprises can have near real-time analytics on transactional data, which can help them make informed decisions related to store management.

– Sunil Senan, SVP and Global Head of Data, Analytics, and AI, Infosys

To learn more, see Aurora Docs, Amazon Redshift Docs, and the AWS News Blog.

Amazon RDS for MySQL zero-ETL integration with Amazon Redshift

The new Amazon RDS for MySQL integration with Amazon Redshift empowers customers to easily perform analytics on their RDS for MySQL data. With a few clicks, it seamlessly replicates RDS for MySQL data into Amazon Redshift, automatically handling initial data loads, ongoing change synchronization, and schema replication. This eliminates the complexity of traditional ETL jobs. The zero-ETL integration enables workload isolation for optimal performance; RDS for MySQL focuses on high-speed transactions while Amazon Redshift handles analytical workloads. Customers can also consolidate data from multiple sources into Amazon Redshift, such as Aurora MySQL-Compatible Edition and Aurora PostgreSQL-Compatible Edition. This unified view provides holistic insights across applications in one place, delivering significant cost and operational efficiencies.

Figure 2 shows how a customer can use the AWS Management Console for Amazon RDS to get started with creating a zero-ETL integration from RDS for MySQL, Aurora MySQL-Compatible Edition, and Aurora PostgreSQL-Compatible Edition to Amazon Redshift.

Figure 2: How to create a zero-ETL integration using Amazon RDS.

This integration is currently in public preview. Visit the getting started guide to learn more.

Amazon DynamoDB zero-ETL integration with Amazon Redshift

The Amazon DynamoDB zero-ETL integration with Amazon Redshift (limited preview) provides a fully managed solution for making data from DynamoDB available for analytics in Amazon Redshift. With minimal configuration, customers can replicate DynamoDB data into Amazon Redshift for analytics without consuming the DynamoDB Read Capacity Units (RCU). This zero-ETL integration unlocks powerful Amazon Redshift capabilities on DynamoDB data such as high-speed SQL queries, machine learning integrations, materialized views for fast aggregations, and secure data sharing.

This integration is currently in limited preview. Use this link to request access.

Integrated services bring us closer to zero-ETL

Our mission is to help customers get the most value from their data, and integrated services are key to this. That’s why we’re building towards a zero-ETL future today. By automating complex ETL processes, data engineers can redirect their focus to creating value from the data. With this modern approach to data management, organizations can accelerate their use of data to streamline operations and fuel business growth.


About the author

Jyoti Aggarwal is a Product Management lead for Amazon Redshift zero-ETL. She brings along an expertise in cloud compute and storage, data warehouse, and B2B/B2C customer experience.

Improve performance of workloads containing repetitive scan filters with multidimensional data layout sort keys in Amazon Redshift

Post Syndicated from Yanzhu Ji original https://aws.amazon.com/blogs/big-data/improve-performance-of-workloads-containing-repetitive-scan-filters-with-multidimensional-data-layout-sort-keys-in-amazon-redshift/

Amazon Redshift, the most widely used cloud data warehouse, has evolved significantly to meet the performance requirements of the most demanding workloads. This post covers one such new feature—the multidimensional data layout sort key.

Amazon Redshift now improves your query performance by supporting multidimensional data layout sort keys, which is a new type of sort key that sorts a table’s data by filter predicates instead of physical columns of the table. Multidimensional data layout sort keys will significantly improve the performance of table scans, especially when your query workload contains repetitive scan filters.

Amazon Redshift already provides the capability of automatic table optimization (ATO), which automatically optimizes the design of tables by applying sort and distribution keys without the need for administrator intervention. In this post, we introduce multidimensional data layout sort keys as an additional capability offered by ATO and fortified by Amazon Redshift’s sort key advisor algorithm.

Multidimensional data layout sort keys

When you define a table with the AUTO sort key, Amazon Redshift ATO will analyze your query history and automatically select either a single-column sort key or multidimensional data layout sort key for your table, based on which option is better for your workload. When multidimensional data layout is selected, Amazon Redshift will construct a multidimensional sort function that co-locates rows that are typically accessed by the same queries, and the sort function is subsequently used during query runs to skip data blocks and even skip scanning the individual predicate columns.
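
Automatic table optimization is opted into at table creation time. The following is a minimal sketch; the column definitions for the titles table are hypothetical, and only the SORTKEY AUTO clause matters here.

-- Let ATO pick the sort key (single-column or multidimensional data layout)
CREATE TABLE titles (
    title_id  BIGINT,
    season    VARCHAR(16),
    subregion VARCHAR(64),
    metric2   BIGINT
)
SORTKEY AUTO;

-- An existing table can also be switched to automatic sort key selection
ALTER TABLE titles ALTER SORTKEY AUTO;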

Consider the following user query, which is a dominant query pattern in the user’s workload:

SELECT season, sum(metric2) AS "__measure__0"
FROM titles
WHERE lower(subregion) like '%United States%'
GROUP BY 1
ORDER BY 1;

Amazon Redshift stores data for each column in 1 MB disk blocks and stores the minimum and maximum values in each block as part of the table’s metadata. If a query uses a range-restricted predicate, Amazon Redshift can use the minimum and maximum values to rapidly skip over large numbers of blocks during table scans. However, this query’s filter on the subregion column can’t be used to determine which blocks to skip based on minimum and maximum values, and as a result, Amazon Redshift scans all rows from the titles table:

SELECT table_name, input_rows, step_attribute
FROM sys_query_detail
WHERE query_id = 123456789;

When the user’s query was run with titles using a single-column sort key on subregion, the result of the preceding query is as follows:

  table_name | input_rows | step_attribute
-------------+------------+---------------
  titles     | 2164081640 | 
(1 rows)

This shows that the table scan read 2,164,081,640 rows.

To improve scans on the titles table, Amazon Redshift might automatically decide to use a multidimensional data layout sort key. All rows that satisfy the lower(subregion) like '%United States%' predicate would be co-located to a dedicated region of the table, and therefore Amazon Redshift will only scan data blocks that satisfy the predicate.

When the user’s query is run with titles using a multidimensional data layout sort key that includes lower(subregion) like '%United States%' as a predicate, the result of the sys_query_detail query is as follows:

  table_name | input_rows | step_attribute
-------------+------------+---------------
  titles     | 152324046  | multi-dimensional
(1 rows)

This shows that the table scan read 152,324,046 rows, which is only 7% of the original, and it used the multidimensional data layout sort key.

Note that this example uses a single query to showcase the multidimensional data layout feature, but Amazon Redshift will consider all the queries running against the table and can create multiple regions to satisfy the most commonly run predicates.

Let’s take another example, with more complex predicates and multiple queries this time.

Imagine having a table items (cost int, available int, demand int) with four rows as shown in the following example.

#id  cost  available  demand
1    4     3          3
2    2     23         6
3    5     4          5
4    1     1          2

Your dominant workload consists of two queries:

  • 70% of the queries follow this pattern:
    select * from items where cost > 3 and available < demand

  • 20% of the queries follow this pattern:
    select avg(cost) from items where available < demand

With traditional sorting techniques, you might choose to sort the table over the cost column, such that the evaluation of cost > 3 will benefit from the sort. So, the items table after sorting using a single cost column will look like the following.

#id  cost  available  demand
Region #1, with cost <= 3:
4    1     1          2
2    2     23         6
Region #2, with cost > 3:
1    4     3          3
3    5     4          5

By using this traditional sort, we can immediately exclude the first two rows (ID 4 and ID 2) in Region #1, because they don’t satisfy cost > 3.

On the other hand, with a multidimensional data layout sort key, the table will be sorted based on a combination of the two commonly occurring predicates in the user’s workload, which are cost > 3 and available < demand. As a result, the table’s rows are sorted into four regions.

#id  cost  available  demand
Region #1, with cost <= 3 and available < demand:
4    1     1          2
Region #2, with cost <= 3 and available >= demand:
2    2     23         6
Region #3, with cost > 3 and available < demand:
3    5     4          5
Region #4, with cost > 3 and available >= demand:
1    4     3          3

This concept is even more powerful when applied to entire blocks instead of single rows, when applied to complex predicates that use operators not suitable for traditional sorting techniques (such as like), and when applied to more than two predicates.

System tables

The following Amazon Redshift system tables show whether multidimensional data layouts are used on your tables and queries (example queries follow this list):

  • To determine if a particular table is using a multidimensional data layout sort key, you can check whether sortkey1 in svv_table_info is equal to AUTO(SORTKEY(padb_internal_mddl_key_col)).
  • To determine if a particular query uses multidimensional data layout to accelerate table scans, you can check step_attribute in the sys_query_detail view. The value will be equal to multi-dimensional if the table’s multidimensional data layout sort key was used during the scan.
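
The following queries put both checks together; they are a sketch that reuses the hypothetical titles table and the query ID from the earlier example.

-- Is the table using a multidimensional data layout sort key?
SELECT "table", sortkey1
FROM svv_table_info
WHERE "table" = 'titles';

-- Did a given query use it during the scan?
SELECT table_name, input_rows, step_attribute
FROM sys_query_detail
WHERE query_id = 123456789
  AND step_attribute = 'multi-dimensional';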

Performance benchmarks

We performed internal benchmark testing for multiple workloads with repetitive scan filters and saw that introducing multidimensional data layout sort keys produced the following results:

  • A 74% total runtime reduction compared to having no sort key.
  • A 40% total runtime reduction compared to having the best single-column sort key on each table.
  • An 80% reduction in total rows read from tables compared to having no sort key.
  • A 47% reduction in total rows read from tables compared to having the best single-column sort key on each table.

Feature comparison

With the introduction of multidimensional data layout sort keys, your tables can now be sorted by expressions based on the commonly occurring filter predicates in your workload. The following table provides a feature comparison for Amazon Redshift against two competitors.

Feature                                                             Amazon Redshift  Competitor A  Competitor B
Support for sorting on columns                                      Yes              Yes           Yes
Support for sorting by expression                                   Yes              Yes           No
Automatic column selection for sorting                              Yes              No            Yes
Automatic expressions selection for sorting                         Yes              No            No
Automatic selection between columns sorting or expressions sorting  Yes              No            No
Automatic use of sorting properties for expressions during scans    Yes              No            No

Considerations

Keep in mind the following when using a multidimensional data layout:

  • Multidimensional data layout is enabled when you set your table as SORTKEY AUTO.
  • Amazon Redshift Advisor will automatically choose either a single-column sort key or multidimensional data layout for the table by analyzing your historical workload.
  • Amazon Redshift ATO adjusts the multidimensional data layout sorting results based on the manner in which ongoing queries interact with the workload.
  • Amazon Redshift ATO maintains multidimensional data layout sort keys the same way as it currently does for existing sort keys. Refer to Working with automatic table optimization for more details on ATO.
  • Multidimensional data layout sort keys will work with both provisioned clusters and serverless workgroups.
  • Multidimensional data layout sort keys will work with your existing data as long as the AUTO SORTKEY is enabled on your table and a workload with repetitive scan filters is detected. The table will be reorganized based on the results of the multidimensional sort function.
  • To disable multidimensional data layout sort keys for a table, use alter table: ALTER TABLE table_name ALTER SORTKEY NONE. This disables the AUTO sort key feature on the table.
  • Multidimensional data layout sort keys are preserved when restoring or migrating your provisioned cluster to a serverless cluster or vice versa.

Conclusion

In this post, we showed that multidimensional data layout sort keys can significantly improve query runtime performance for workloads where dominant queries have repetitive scan filters.

To create a preview cluster from the Amazon Redshift console, navigate to the Clusters page and choose Create preview cluster. You can create a cluster in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm) Regions and test your workloads.

We would love to hear your feedback on this new feature and look forward to your comments on this post.


About the authors

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Jialin Ding is an Applied Scientist in the Learned Systems Group, specializing in applying machine learning and optimization techniques to improve the performance of data systems such as Amazon Redshift.

Large Language Models for sentiment analysis with Amazon Redshift ML (Preview)

Post Syndicated from Blessing Bamiduro original https://aws.amazon.com/blogs/big-data/large-language-models-for-sentiment-analysis-with-amazon-redshift-ml-preview/

Amazon Redshift ML empowers data analysts and database developers to integrate the capabilities of machine learning and artificial intelligence into their data warehouse. Amazon Redshift ML helps to simplify the creation, training, and application of machine learning models through familiar SQL commands.

You can further enhance Amazon Redshift’s inferencing capabilities by Bringing Your Own Models (BYOM). There are two types of BYOM: 1) remote BYOM for remote inferences, and 2) local BYOM for local inferences. With local BYOM, you utilize a model trained in Amazon SageMaker for in-database inference within Amazon Redshift by importing Amazon SageMaker Autopilot and Amazon SageMaker trained models into Amazon Redshift. Alternatively, with remote BYOM you can invoke remote custom ML models deployed in SageMaker. This enables you to use custom models in SageMaker for churn, XGBoost, linear regression, multi-class classification and now LLMs.

Amazon SageMaker JumpStart is a SageMaker feature that helps deploy pretrained, publicly available large language models (LLM) for a wide range of problem types, to help you get started with machine learning. You can access pretrained models and use them as-is or incrementally train and fine-tune these models with your own data.

Previously, Amazon Redshift ML exclusively supported BYOM models that accepted text or CSV as the data input and output format. Now, it has added support for the SUPER data type for both input and output. With this support, you can use LLMs in Amazon SageMaker JumpStart, which offers numerous proprietary and publicly available foundation models from various model providers.

LLMs have diverse use cases. Amazon Redshift ML supports available LLM models in SageMaker, including models for sentiment analysis. In sentiment analysis, the model analyzes product feedback or other strings of text and determines the sentiment. This capability is particularly valuable for understanding product reviews, feedback, and overall sentiment.

Overview of solution

In this post, we use Amazon Redshift ML for sentiment analysis on reviews stored in an Amazon Redshift table. The model takes the reviews as an input and returns a sentiment classification as the output. We use an out-of-the-box LLM in SageMaker JumpStart. The following picture shows the solution overview.

Walkthrough

Follow the steps below to perform sentiment analysis using Amazon Redshift’s integration with SageMaker JumpStart to invoke LLM models:

  1. Deploy LLM model using foundation models in SageMaker JumpStart and create an endpoint
  2. Using Amazon Redshift ML, create a model referencing the SageMaker JumpStart LLM endpoint
  3. Load a sample reviews dataset into your Amazon Redshift data warehouse
  4. Create a user-defined function (UDF) that engineers the prompt for sentiment analysis
  5. Make a remote inference to the LLM model to generate sentiment analysis for input dataset
  6. Analyze the output

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An Amazon Redshift Serverless preview workgroup or an Amazon Redshift provisioned preview cluster. Refer to creating a preview workgroup or creating a preview cluster documentation for steps.
  • For the preview, your Amazon Redshift data warehouse should be on the preview_2023 track in one of these Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), or Europe (Stockholm).

Solution Steps

Complete the following solution steps.

1. Deploy LLM Model using Foundation models in SageMaker JumpStart and create an endpoint

  1. Navigate to Foundation Models in Amazon SageMaker Jumpstart
  2. Search for the foundation model by typing Falcon 7B Instruct BF16 in the search box
  3. Choose View Model

  4. On the Model Details page, choose Open notebook in Studio

  5. When the Select domain and user profile dialog box opens, choose the profile you like from the drop-down menu and choose Open Studio

  6. When the notebook opens, a prompt Set up notebook environment pops open. Choose ml.g5.2xlarge or any other instance type recommended in the notebook and choose Select

  7. Scroll to Deploying Falcon model for inference section of the notebook and run the three cells in that section
  8. Once the third cell execution is complete, expand the Deployments section in the left pane and choose Endpoints to see the endpoint created. Note the endpoint name; it will be used in the next steps
  9. Select Finish.

2. Using Amazon Redshift ML, create a model referencing the SageMaker JumpStart LLM endpoint

Create a model using the Amazon Redshift ML bring your own model (BYOM) capability. After the model is created, you can use the output function to make remote inference to the LLM model. To create a model in Amazon Redshift for the LLM endpoint created previously, complete the following steps.

  1. Log in to your Amazon Redshift endpoint. You can use Query Editor V2 to log in
  2. Import this notebook into Query Editor V2. It has all the SQL statements used in this blog.
  3. Ensure you have the below IAM policy added to your IAM role. Replace <endpointname> with the SageMaker JumpStart endpoint name captured earlier
    {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Action": "sagemaker:InvokeEndpoint",
              "Effect": "Allow",
              "Resource": "arn:aws:sagemaker:<region>:<AccountNumber>:endpoint/<endpointname>"
          }
      ]
    }

  4. Create the model in Amazon Redshift using the CREATE MODEL statement given below. Replace <endpointname> with the endpoint name captured earlier. The input and output data type for the model needs to be SUPER.
    CREATE MODEL falcon_7b_instruct_llm_model
    FUNCTION falcon_7b_instruct_llm_model(super)
    RETURNS super
    SAGEMAKER '<endpointname>'
    IAM_ROLE default;

3. Load sample reviews data set into your Amazon Redshift data warehouse

In this blog post, we use a sample fictitious reviews dataset for the walkthrough.

  1. Login to Amazon Redshift using Query Editor V2
  2. Create the sample_reviews table using the below SQL statement. This table will store the sample reviews dataset.
    CREATE TABLE sample_reviews
    (
    review varchar(4000)
    );

  3. Download the sample file, upload it into your S3 bucket, and load data into the sample_reviews table using the below COPY command
    COPY sample_reviews
    FROM 's3://<<your_s3_bucket>>/sample_reviews.csv'
    IAM_ROLE DEFAULT
    csv
    DELIMITER ','
    IGNOREHEADER 1;

4. Create a UDF that engineers the prompt for sentiment analysis

The input to the LLM consists of two main parts – the prompt and the parameters.

The prompt is the guidance or set of instructions you want to give to the LLM. The prompt should be clear in order to provide proper context and direction for the LLM. Generative AI systems rely heavily on the prompts provided to determine how to generate a response. If the prompt does not provide enough context and guidance, it can lead to unhelpful responses. Prompt engineering helps avoid these pitfalls.

Finding the right words and structure for a prompt is challenging and often requires trial and error. Prompt engineering lets you experiment until you find prompts that reliably produce the desired output, and it helps shape the input to best leverage the capabilities of the generative AI model being used. Well-constructed prompts allow generative AI to provide more nuanced, high-quality, and helpful responses tailored to the specific needs of the user.

The parameters allow configuring and fine-tuning the model’s output. This includes settings like maximum length, randomness levels, stopping criteria, and more. Parameters give control over the properties and style of the generated text and are model specific.

The UDF below takes varchar data in your data warehouse and parses it into SUPER (JSON format) for the LLM. This flexibility allows you to store your data as varchar in your data warehouse without performing data type conversion to SUPER to use LLMs in Amazon Redshift ML, and it makes prompt engineering easy. If you want to try a different prompt, you can just replace the UDF.

The UDF given below has both the prompt and a parameter.

  • Prompt: “Classify the sentiment of this sentence as Positive, Negative, Neutral. Return only the sentiment nothing else” – This instructs the model to classify the review into 3 sentiment categories.
  • Parameter: “max_new_tokens”:1000 – This allows the model to return up to 1000 tokens.
CREATE FUNCTION udf_prompt_eng_sentiment_analysis (varchar)
  returns super
stable
as $$
  select json_parse(
  '{"inputs":"Classify the sentiment of this sentence as Positive, Negative, Neutral. Return only the sentiment nothing else.' || $1 || '","parameters":{"max_new_tokens":1000}}')
$$ language sql;
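
To inspect the payload the UDF builds before any call is made to the endpoint, you can run it on a literal string; the review text below is made up.

SELECT udf_prompt_eng_sentiment_analysis(
    'The delivery was quick and the product works great.'
) AS llm_input;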

5. Make a remote inference to the LLM model to generate sentiment analysis for input dataset

The output of this step is stored in a newly created table called sentiment_analysis_for_reviews. Run the below SQL statement to create the table with output from the LLM model.

CREATE table sentiment_analysis_for_reviews
as
(
    SELECT 
        review, 
        falcon_7b_instruct_llm_model
            (
                udf_prompt_eng_sentiment_analysis(review)
        ) as sentiment
    FROM sample_reviews
);

6. Analyze the output

The output of the LLM is of datatype SUPER. For the Falcon model, the output is available in the attribute named generated_text. Each LLM has its own output payload format. Please refer to the documentation for the LLM you would like to use for its output format.

Run the below query to extract the sentiment from the output of the LLM model. For each review, you can see its sentiment analysis.

SELECT review, sentiment[0]."generated_text" :: varchar as sentiment 
FROM sentiment_analysis_for_reviews;
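
As a follow-up, you can aggregate the extracted sentiments to get a quick distribution across all reviews; this sketch assumes the same extraction expression shown above.

SELECT sentiment[0]."generated_text"::varchar AS sentiment,
       COUNT(*) AS review_count
FROM sentiment_analysis_for_reviews
GROUP BY 1
ORDER BY review_count DESC;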

Cleaning up

To avoid incurring future charges, delete the resources.

  1. Delete the LLM endpoint in SageMaker Jumpstart
  2. Drop the tables, the model, and the UDF in Amazon Redshift using the below queries

    DROP MODEL falcon_7b_instruct_llm_model;
    DROP TABLE sample_reviews;
    DROP TABLE sentiment_analysis_for_reviews;
    DROP FUNCTION udf_prompt_eng_sentiment_analysis(varchar);

  3. If you have created an Amazon Redshift endpoint, delete the endpoint as well

Conclusion

In this post, we showed you how to perform sentiment analysis for data stored in Amazon Redshift using Falcon, a large language model (LLM) in SageMaker JumpStart, and Amazon Redshift ML. Falcon is used as an example; you can use other LLM models as well with Amazon Redshift ML. Sentiment analysis is just one of the many use cases that are possible with LLM support in Amazon Redshift ML. You can achieve other use cases such as data enrichment, content summarization, knowledge graph development, and more. LLM support broadens the analytical capabilities of Amazon Redshift ML as it continues to empower data analysts and developers to incorporate machine learning into their data warehouse workflow with streamlined processes driven by familiar SQL commands. The addition of the SUPER data type enhances Amazon Redshift ML capabilities, allowing smooth integration of large language models (LLMs) from SageMaker JumpStart for remote BYOM inferences.


About the Authors

Blessing Bamiduro is part of the Amazon Redshift Product Management team. She works with customers to help explore the use of Amazon Redshift ML in their data warehouse. In her spare time, Blessing loves travels and adventures.

Anusha Challa is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. She is passionate about data analytics and data science.

Enhance query performance using AWS Glue Data Catalog column-level statistics

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/enhance-query-performance-using-aws-glue-data-catalog-column-level-statistics/

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets. When talking with our customers, we learned that one of the challenging aspects of data lake performance is how to optimize these analytics queries to execute faster.

Data lake performance optimization is especially important for queries with multiple joins, and that is where cost-based optimizers help the most. In order for CBO to work, column statistics need to be collected and updated based on changes in the data. We’re launching the capability of generating column-level statistics, such as the number of distinct values, number of nulls, max, and min, on file formats such as Parquet, ORC, JSON, Amazon ION, CSV, and XML for AWS Glue tables. With this launch, customers now have an integrated end-to-end experience where statistics on Glue tables are collected and stored in the AWS Glue Data Catalog and made available to analytics services for improved query planning and execution.

Using these statistics, cost-based optimizers improve query run plans and boost the performance of queries run in Amazon Athena and Amazon Redshift Spectrum. For example, CBO can use column statistics such as the number of distinct values and the number of nulls to improve row prediction. Row prediction is the number of rows from a table that will be returned by a certain step during the query planning stage. The more accurate the row predictions are, the more efficient the query execution steps are. This leads to faster query execution and potentially reduced cost. Some of the specific optimizations that CBO can employ include join reordering and push-down of aggregations based on the statistics available for each table and column.
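
As an illustration, consider a hypothetical query of the kind used later in this post against the TPC-DS tables. With column statistics available, the optimizer can estimate how many rows the filters on date_dim and item return and reorder the joins so the smaller intermediate results are built first.

SELECT i.i_category,
       SUM(ss.ss_net_paid) AS total_net_paid
FROM store_sales ss
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
JOIN item i ON ss.ss_item_sk = i.i_item_sk
WHERE d.d_year = 2002
  AND i.i_category = 'Electronics'
GROUP BY i.i_category;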

For customers using a data mesh with AWS Lake Formation permissions, tables from different data producers are cataloged in the centralized governance accounts. As they generate statistics on tables in the centralized catalog and share those tables with consumers, queries on those tables in consumer accounts will see query performance improvements automatically. In this post, we demonstrate the capability of AWS Glue Data Catalog to generate column statistics for our sample tables.

Solution overview

To demonstrate the effectiveness of this capability, we employ the industry-standard TPC-DS 3 TB dataset stored in an Amazon Simple Storage Service (Amazon S3) public bucket. We’ll compare the query performance before and after generating column statistics for the tables by running queries in Amazon Athena and Amazon Redshift Spectrum. We provide the queries used in this post, and we encourage you to try out your own queries following the workflow illustrated below.

The workflow consists of the following high level steps:

  1. Cataloging the Amazon S3 Bucket: Utilize AWS Glue Crawler to crawl the designated Amazon S3 bucket, extracting metadata, and seamlessly storing it in the AWS Glue data catalog. We’ll query these tables using Amazon Athena and Amazon Redshift Spectrum.
  2. Generating column statistics: Employ the enhanced capabilities of AWS Glue Data Catalog to generate comprehensive column statistics for the crawled data, thereby providing valuable insights into the dataset.
  3. Querying with Amazon Athena and Amazon Redshift Spectrum: Evaluate the impact of column statistics on query performance by utilizing Amazon Athena and Amazon Redshift Spectrum to execute queries on the dataset.

The following diagram illustrates the solution architecture.

Walkthrough

To implement the solution, we complete the following steps:

  1. Set up resources with AWS CloudFormation.
  2. Run AWS Glue Crawler on Public Amazon S3 bucket to list the 3TB TPC-DS dataset.
  3. Run queries on Amazon Athena and Amazon Redshift and note down query duration
  4. Generate statistics for AWS Glue Data Catalog tables
  5. Run queries on Amazon Athena and Amazon Redshift and compare query duration with previous run
  6. Optional: Schedule AWS Glue column statistics jobs using AWS Lambda and the Amazon EventBridge Scheduler

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. The template generates the following resources:

  • An Amazon Virtual Private Cloud (Amazon VPC), public subnet, private subnets and route tables.
  • An Amazon Redshift Serverless workgroup and namespace.
  • An AWS Glue crawler to crawl the public Amazon S3 bucket and create a table for the Glue Data Catalog for TPC-DS dataset
  • AWS Glue catalog databases and tables
  • An Amazon S3 bucket to store Athena results.
  • AWS Identity and Access Management (AWS IAM) users and policies.
  • AWS Lambda and Amazon EventBridge Scheduler to schedule the AWS Glue column statistics runs

To launch the AWS CloudFormation stack, complete the following steps:

Note: The AWS Glue data catalog tables are generated using the public bucket s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/, hosted in the us-east-1 region. If you intend to deploy this AWS CloudFormation template in a different region, it is necessary to either copy the data to the corresponding region or share the data within your deployed region for it to be accessible from Amazon Redshift.

  1. Log in to the AWS Management Console as AWS Identity and Access Management (AWS IAM) administrator.
  2. Choose Launch Stack to deploy a AWS CloudFormation template.
  3. Choose Next.
  4. On the next page, keep all the options as default or make appropriate changes based on your requirements, then choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

This stack can take around 10 minutes to complete, after which you can view the deployed stack on the AWS CloudFormation console.

Run the AWS Glue Crawlers created by the AWS CloudFormation stack

To run your crawlers, complete the following steps:

  1. On the AWS Glue console, choose Crawlers under Data Catalog in the navigation pane.
  2. Locate and run the two crawlers tpcdsdb-without-stats and tpcdsdb-with-stats. It may take a few minutes to complete.

Once the crawlers complete successfully, they create two identical databases, tpcdsdbnostats and tpcdsdbwithstats. The tables in tpcdsdbnostats will have no statistics, and we’ll use them as a reference. We generate statistics on the tables in tpcdsdbwithstats. Verify that you have those two databases and the underlying tables from the AWS Glue console. The tpcdsdbnostats database will look like the following. At this time there are no statistics generated on these tables.

Run provided query using Amazon Athena on no-stats tables

To run your query in Amazon Athena on tables without statistics, complete the following steps:

  1. Download the Athena queries from here.
  2. On the Amazon Athena Console, choose the provided query one at a time for tables in database tpcdsdbnostats.
  3. Run the query and note down the Run time for each query.

Run provided query using Amazon Redshift Spectrum on no-stats tables

To run your query in Amazon Redshift, complete the following steps:

  1. Download the Amazon Redshift queries from here.
  2. On the Redshift query editor v2, execute the Redshift Query for tables without stats section from the downloaded query file.
  3. Run the queries and note down the execution time of each query.

Generate statistics on AWS Glue Catalog tables

To generate statistics on AWS Glue Catalog tables, complete the following steps:

  1. Navigate to the AWS Glue Console and choose the databases under Data Catalog.
  2. Choose the tpcdsdbwithstats database to list all the available tables.
  3. Select any of these tables (e.g., call_center).
  4. Go to the Column statistics – new tab and choose Generate statistics.
  5. Keep the default options: under Choose columns, keep Table (All columns); under Row sampling options, keep All rows; under IAM role, choose AWSGluestats-blog. Then select Generate statistics.

You’ll be able to see the status of the statistics generation run as shown in the following illustration:

After generating statistics on AWS Glue Catalog tables, you should be able to see detailed column statistics for that table:

Repeat steps 2–5 to generate statistics for all necessary tables, such as catalog_sales, catalog_returns, warehouse, item, date_dim, store_sales, customer, customer_address, web_sales, time_dim, ship_mode, web_site, and web_returns. Alternatively, you can follow the “Schedule AWS Glue statistics runs” section near the end of this blog to generate statistics for all tables. Once done, assess query performance for each query.

Run provided query using Athena Console on stats tables

  1. On the Amazon Athena console, execute the Athena Query for tables with stats section from the downloaded query file.
  2. Run the queries and note down the execution time of each query.

In our sample run of the queries on the tables, we observed the query execution times shown in the following table. We saw a clear improvement in query performance, ranging from 13% to 55%.

Athena query time improvement

TPC-DS 3T Queries  Without Glue stats (sec)  With Glue stats (sec)  Performance improvement (%)
Query 2            33.62                     15.17                  55%
Query 4            132.11                    72.94                  45%
Query 14           134.77                    91.48                  32%
Query 28           55.99                     39.36                  30%
Query 38           29.32                     25.58                  13%

Run the provided query using Amazon Redshift Spectrum on statistics tables

  1. On the Amazon Redshift query editor v2, execute the Redshift Query for tables with stats section from the downloaded query file.
  2. Run the queries and note down the execution time of each query.

In our sample run of the queries on the tables, we observed the query execution times shown in the following table. We saw a clear improvement in query performance, ranging from 13% to 89%.

Amazon Redshift Spectrum query time improvement

TPC-DS 3T Queries  Without Glue stats (sec)  With Glue stats (sec)  Performance improvement (%)
Query 40           124.156                   13.12                  89%
Query 60           29.52                     16.97                  42%
Query 66           18.914                    16.39                  13%
Query 95           308.806                   200                    35%
Query 99           20.064                    16                     20%

Schedule AWS Glue statistics runs

In this segment of the post, we’ll guide you through the steps of scheduling AWS Glue column statistics runs using AWS Lambda and the Amazon EventBridge Scheduler. To streamline this process, an AWS Lambda function and an Amazon EventBridge schedule were created as part of the CloudFormation stack deployment.

  1. AWS Lambda function setup:

To begin, we utilize an AWS Lambda function to trigger the execution of the AWS Glue column statistics job. The AWS Lambda function invokes the start_column_statistics_task_run API through the boto3 (AWS SDK for Python) library. This sets the groundwork for automating the column statistics update.

Let’s explore the AWS Lambda function:

    • Go to the AWS Lambda console.
    • Select Functions and locate the GlueTableStatisticsFunctionv1.
    • For a clearer understanding of the AWS Lambda function, we recommend reviewing the code in the Code section and examining the environment variables under Configuration.
  2. Amazon EventBridge Scheduler configuration

The next step involves scheduling the AWS Lambda function invocation using the Amazon EventBridge Scheduler. The scheduler is configured to trigger the AWS Lambda function daily at a specific time – in this case, 08:00 PM. This ensures that the AWS Glue column statistics job runs on a regular and predictable basis.

Now, let’s explore how you can update the schedule:

Cleaning up

To avoid unwanted charges to your AWS account, delete the AWS resources:

  1. Sign into the AWS CloudFormation console as the AWS IAM administrator used for creating the AWS CloudFormation stack.
  2. Delete the AWS CloudFormation stack you created.

Conclusion

In this post, we showed you how you can use the AWS Glue Data Catalog to generate column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings. Refer to the documentation for Glue Data Catalog statistics support across various AWS analytical services.

If you have questions or suggestions, submit them in the comments section.


About the Authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Navnit Shukla serves as an AWS Specialist Solution Architect with a focus on Analytics. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, Navnit Shukla is the accomplished author of the book titled Data Wrangling on AWS. He can be reached via LinkedIn.

Use custom domain names with Amazon Redshift

Post Syndicated from Raghu Kuppala original https://aws.amazon.com/blogs/big-data/use-custom-domain-names-with-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers.

Amazon Redshift now supports custom URLs or custom domain names for your data warehouse. You might want to use a custom domain name or CNAME (Canonical Name) for the following reasons:

  • A custom domain name is straightforward to recall and use.
  • Routing connections is less disruptive. The connections from the client are pointed to the DNS record and not the server name. This lets you easily route connections to new clusters in failover or disaster recovery scenarios.
  • You can now obfuscate your server names with a friendly custom domain name.
  • It helps you avoid application code or connectivity changes in case the underlying data warehouse is migrated to a different Region or the endpoint is changed.

In this post, we discuss how you can modify your data warehouse to use custom domain names and how to connect to a data warehouse that has been configured with a custom URL.

Prerequisites

To get started, you need a registered domain name. You can use Amazon Route 53 or a third-party domain registrar to register a domain.

You also need a validated Secure Sockets Layer (SSL) certificate for your custom endpoints. This is to verify ownership of the domain name and secure communication. You can use AWS Certificate Manager (ACM) to provision, manage, and deploy public SSL/TLS certificates. You need to use verify-full mode, which ensures that the connections are encrypted and verifies that the hostname of the server matches the hostname in the certificate.

Lastly, you need to attach the necessary permissions to the AWS Identity and Access Management (IAM) role that’s assigned to the relevant users and groups that will manage your Redshift data warehouse. These vary depending on whether you’re using Amazon Redshift provisioned or Amazon Redshift Serverless. The permissions needed for each action are listed below.

  • Create a custom domain for the data warehouse
    Redshift provisioned: redshift:CreateCustomDomainAssociation, acm:DescribeCertificate
    Redshift Serverless: redshiftServerless:CreateCustomDomainAssociation, acm:DescribeCertificate
  • Rename a cluster that has a custom domain name
    Redshift provisioned: acm:DescribeCertificate
    Redshift Serverless: Not needed
  • Change the certificate for an association
    Redshift provisioned: redshift:ModifyCustomDomainAssociation, acm:DescribeCertificate
    Redshift Serverless: redshiftServerless:UpdateCustomDomainAssociation, acm:DescribeCertificate
  • Delete a custom domain
    Redshift provisioned: redshift:DeleteCustomDomainAssociation
    Redshift Serverless: redshiftServerless:DeleteCustomDomainAssociation
  • Connect to the data warehouse using the custom domain name
    Redshift provisioned: redshift:DescribeCustomDomainAssociations
    Redshift Serverless: Not needed

The following screenshot shows an example of creating an IAM policy on the IAM console.

Creating DNS CNAME entry for custom domain name

The custom domain name typically includes the root domain and a subdomain, like mycluster.mycompany.com. You can either register a new root domain or use an existing one. For more information about registering a new domain with Route 53, refer to Registering a new domain.

After you set that up, you can add a DNS record that points your custom CNAME to the Redshift endpoint. You can find the data warehouse endpoint on the Amazon Redshift console on the cluster detail page.

The following screenshot illustrates locating a provisioned endpoint.

The following screenshot illustrates locating a serverless endpoint.

Now that you have created the CNAME entry, you can request a certificate from ACM. Complete the following steps:

  1. Open the ACM console and choose Request a certificate.
  2. For Fully qualified domain name, enter your custom domain name.
  3. Choose Request.
  4. Confirm that the request is validated by the owner of the domain by checking the status of the certificate.

The status should be Issued.

Now that you have created the CNAME record and certificate, you can create the custom domain URL for your Redshift cluster using the Amazon Redshift console.

Creating custom domain for a provisioned instance

To create a custom domain for a provisioned instance, complete the following steps:

  1. On the Amazon Redshift console, navigate to your provisioned instance detail page.
  2. On the Actions menu, choose Create custom domain name.
  3. For Custom domain name, enter the CNAME record for your Redshift provisioned cluster.
  4. For ACM certificate, choose the appropriate certificate.
  5. Choose Create.

You should now have a custom domain name associated to your provisioned data warehouse. The custom domain name and custom domain certificate ARN values should now be populated with your entries.

Note that sslmode=verify-full will only work for the new custom endpoint. You can’t use this mode with the default endpoint; you can connect to the default endpoint by using other SSL modes like sslmode=verify-ca.

Create a custom domain for a serverless instance

To create a custom domain for a serverless instance, complete the following steps:

  1. On the Amazon Redshift console, navigate to your serverless instance detail page.
  2. On the Actions menu, choose Create custom domain name.
  3. For Custom domain name, enter the CNAME record for your Redshift Serverless workgroup.
  4. For ACM certificate, choose the appropriate certificate.
  5. Choose Create.

You should now have a custom domain name associated to your serverless workgroup. The custom domain name and custom domain certificate ARN values should now be populated with your entries.

Note that, as with a provisioned instance, sslmode=verify-full will only work for the new custom endpoint. You can’t use this mode with the default endpoint; you can connect to the default endpoint by using other SSL modes like sslmode=verify-ca.

Connect using custom domain name

You can now connect to your cluster using the custom domain name. The JDBC URL would be similar to jdbc:redshift://prefix.rootdomain.com:5439/dev?sslmode=verify-full, where prefix.rootdomain.com is your custom domain name and dev is the default database. Use your preferred editor to connect to this URL using your user name and password.

Update the certificate association for your provisioned custom domain

To update the certificate association using the Amazon Redshift console, navigate to your provisioned cluster details page and on the Actions menu, choose Edit custom domain name. Update the domain name and ACM certificate, then choose Save changes.

To change the cluster’s ACM certificate associated to the custom domain using the AWS Command Line Interface (AWS CLI), use the following command:

aws redshift modify-custom-domain-association --cluster-identifier <clustername> --custom-domain-certificate-arn <newCertArn> --custom-domain-name <currentDomainNameOfCluster>

Update the certificate for your serverless custom domain

To update the certificate using the Amazon Redshift console, navigate to your serverless workgroup details page and on the Actions menu, choose Edit custom domain name. Update the domain name and ACM certificate, then choose Save changes.

To change the serverless workgroup’s ACM certificate associated to the custom domain using the AWS CLI, use the following command:

aws redshift-serverless update-custom-domain-association --region <aws-region> --custom-domain-name <currentCustomDomainName> --custom-domain-certificate-arn <NewCustomdomaincertarn> --workgroup-name <workgroupname>

Delete a custom provisioned domain

To delete your custom domain, navigate to the provisioned cluster details page. On the Actions menu, choose Delete custom domain name. Enter delete to confirm, then choose Delete.

 To use the AWS CLI, use the following code:

aws redshift delete-custom-domain-association --cluster-identifier <ClusterName> --region <ClusterRegion>  --custom-domain-name <currentDomainName>

Delete a custom serverless domain

To delete your custom domain, navigate to the serverless workgroup details page. On the Actions menu, choose Delete custom domain name. Enter delete to confirm, then choose Delete.

To use the AWS CLI, use the following code:

aws redshift-serverless delete-custom-domain-association --workgroup-name <workgroupname> --custom-domain-name <CurrentCustomDomainName>

Conclusion

In this post, we discussed the benefits of using custom domain names for your Redshift data warehouse and the steps needed to associate a custom domain name with the Redshift endpoint. For more information, refer to Using a custom domain name for client connections.


About the Authors

Raghu Kuppala is an Analytics Specialist Solutions Architect experienced working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Sam Selvan is a Principal Analytics Solution Architect with Amazon Web Services.

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Nikhitha Loyapally is a Senior Software Development Engineer for Amazon Redshift.

Implement data warehousing solution using dbt on Amazon Redshift

Post Syndicated from Seshadri Senthamaraikannan original https://aws.amazon.com/blogs/big-data/implement-data-warehousing-solution-using-dbt-on-amazon-redshift/

Amazon Redshift is a cloud data warehousing service that provides high-performance analytical processing based on a massively parallel processing (MPP) architecture. Building and maintaining data pipelines is a common challenge for all enterprises. Managing the SQL files, integrating cross-team work, incorporating all software engineering principles, and importing external utilities can be a time-consuming task that requires complex design and lots of preparation.

dbt (DataBuildTool) offers this mechanism by introducing a well-structured framework for data analysis, transformation and orchestration. It also applies general software engineering principles like integrating with git repositories, setting up DRYer code, adding functional test cases, and including external libraries. This mechanism allows developers to focus on preparing the SQL files per the business logic, and the rest is taken care of by dbt.

In this post, we look into an optimal and cost-effective way of incorporating dbt within Amazon Redshift. We use Amazon Elastic Container Registry (Amazon ECR) to store our dbt Docker images and AWS Fargate as an Amazon Elastic Container Service (Amazon ECS) task to run the job.

How does the dbt framework work with Amazon Redshift?

dbt has an Amazon Redshift adapter module named dbt-redshift that enables it to connect and work with Amazon Redshift. All the connection profiles are configured within the dbt profiles.yml file. In an optimal environment, we store the credentials in AWS Secrets Manager and retrieve them.

The following code shows the contents of profiles.yml:

SampleProject:
  target: dev
  outputs:
    dev:
      type: redshift
      host: "{{ env_var('DBT_HOST') }}"
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5439
      dbname: "{{ env_var('DBT_DB_NAME') }}"
      schema: dev
      threads: 4
      keepalives_idle: 240 # default 240 seconds
      connect_timeout: 10 # default 10 seconds
      sslmode: require
      ra3_node: true

The following diagram illustrates the key components of the dbt framework:

The primary components are as follows:

  • Models – These are written as a SELECT statement and saved as a .sql file. All the transformation queries can be written here and materialized as a table or a view. The table refresh can be full or incremental based on the configuration. For more information, refer to SQL models.
  • Snapshots – These implement type-2 slowly changing dimensions (SCDs) over mutable source tables. These SCDs identify how a row in a table changes over time.
  • Seeds – These are CSV files in your dbt project (typically in your seeds directory), which dbt can load into your data warehouse using the dbt seed command.
  • Tests – These are assertions you make about your models and other resources in your dbt project (such as sources, seeds, and snapshots). When you run dbt test, dbt will tell you if each test in your project passes or fails.
  • Macros – These are pieces of code that can be reused multiple times. They are analogous to “functions” in other programming languages, and are extremely useful if you find yourself repeating code across multiple models.

These components are stored as .sql files and are run by dbt CLI commands. During the run, dbt creates a Directed Acyclic Graph (DAG) based on the internal reference between the dbt components. It uses the DAG to orchestrate the run sequence accordingly.
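
To illustrate the models component, the following is a minimal sketch of an incremental model; the model name, columns, and the salesdb source are hypothetical and would need to be declared in your project's sources file.

-- models/staging/stg_orders.sql (hypothetical)
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

select
    order_id,
    customer_id,
    order_status,
    updated_at
from {{ source('salesdb', 'orders') }}

{% if is_incremental() %}
-- on incremental runs, only pick up rows changed since the last load
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}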

Multiple profiles can be created within the profiles.yml file, which dbt can use to target different Redshift environments while running. For more information, refer to Redshift set up.

Solution overview

The following diagram illustrates our solution architecture.

The workflow contains the following steps:

  1. The open source dbt-redshift connector is used to create our dbt project including all the necessary models, snapshots, tests, macros and profiles.
  2. A Docker image is created and pushed to the ECR repository.
  3. The Docker image is run by Fargate as an ECS task triggered via AWS Step Functions. All the Amazon Redshift credentials are stored in Secrets Manager, which is then used by the ECS task to connect with Amazon Redshift.
  4. During the run, dbt converts all the models, snapshots, tests and macros to Amazon Redshift compliant SQL statements and it orchestrates the run based on the internal data lineage graph maintained. These SQL commands are run directly on the Redshift cluster and therefore the workload is pushed to Amazon Redshift directly.
  5. When the run is complete, dbt will create a set of HTML and JSON files to host the dbt documentation, which describes the data catalog, compiled SQL statements, data lineage graph, and more.

Prerequisites

You should have the following prerequisites:

  • A good understanding of the dbt principles and implementation steps.
  • An AWS account with user role permission to access the AWS services used in this solution.
  • Security groups for Fargate to access the Redshift cluster and Secrets Manager from Amazon ECS.
  • A Redshift cluster. For creation instructions, refer to Create a cluster.
  • An ECR repository: For instructions, refer to Creating a private repository
  • A Secrets Manager secret containing all the credentials for connecting to Amazon Redshift. This includes the host, port, database name, user name, and password. For more information, refer to Create an AWS Secrets Manager database secret.
  • An Amazon Simple Storage (Amazon S3) bucket to host documentation files.

Create a dbt project

We are using dbt CLI so all commands are run in the command line. Therefore, install pip if not already installed. Refer to installation for more information.

To create a dbt project, complete the following steps:

  1. Install dependent dbt packages:
    pip install dbt-redshift
  2. Initialize a dbt project using the dbt init <project_name> command, which creates all the template folders automatically.
  3. Add all the required DBT artifacts.
    Refer to the dbt-redshift-etlpattern repo which includes a reference dbt project. For more information about building projects, refer to About dbt projects.

In the reference project, we have implemented the following features:

  • SCD type 1 using incremental models
  • SCD type 2 using snapshots
  • Seed look-up files
  • Macros for adding reusable code in the project
  • Tests for analyzing inbound data

The Python script is prepared to fetch the credentials required from Secrets Manager for accessing Amazon Redshift. Refer to the export_redshift_connection.py file.

  1. Prepare the run_dbt.sh script to run the dbt pipeline sequentially. This script is placed in the root folder of the dbt project as shown in sample repo.
-- Import the dependent external libraries
dbt deps --profiles-dir . --project-dir .

-- Create tables based on the seed files
dbt seed --profiles-dir . --project-dir .

-- Run all the model files
dbt run --profiles-dir . --project-dir .

-- Run all the snapshot files
dbt snapshot --profiles-dir . --project-dir .

-- Run all inbuilt and custom test cases prepared
dbt test --profiles-dir . --project-dir .

-- Generate dbt documentation files
dbt docs generate --profiles-dir . --project-dir .

--Copying dbt outputs to s3 bucket - for hosting
aws s3 cp --recursive --exclude="*" --include="*.json" --include="*.html" dbt/target/ s3://<bucketName>/REDSHIFT_POC/
  1. Create a Docker file in the parent directory of the dbt project folder. This step builds the image of the dbt project to be pushed to the ECR repository.
FROM python:3

ADD dbt_src /dbt_src

RUN pip install -U pip

# Install DBT libraries
RUN pip install --no-cache-dir dbt-core

RUN pip install --no-cache-dir dbt-redshift

RUN pip install --no-cache-dir boto3

RUN pip install --no-cache-dir awscli

WORKDIR /dbt_src

RUN chmod -R 755 .

ENTRYPOINT [ "/bin/sh", "-c" ]

CMD ["./run_dbt.sh"]

Upload the image to Amazon ECR and run it as an ECS task

To push the image to the ECR repository, complete the following steps:

  1. Retrieve an authentication token and authenticate your Docker client to your registry:
    aws ecr get-login-password --region <region_name> | docker login --username AWS --password-stdin <repository_name>

  2. Build your Docker image using the following command:
docker build -t <image tag> .
  1. After the build is complete, tag your image so you can push it to the repository:
docker tag <image tag>:latest <repository_name>:latest
  1. Run the following command to push the image to your newly created AWS repository:
docker push <repository_name>/<image tag>:latest
  1. On the Amazon ECS console, create a cluster with Fargate as an infrastructure option.
  2. Provide your VPC and subnets as required.
  3. After you create the cluster, create an ECS task and assign the created dbt image as the task definition family.
  4. In the networking section, choose your VPC, subnets, and security group to connect with Amazon Redshift, Amazon S3 and Secrets Manager.

This task will trigger the run_dbt.sh pipeline script and run all the dbt commands sequentially. When the script is complete, we can see the results in Amazon Redshift and the documentation files pushed to Amazon S3.

  1. You can host the documentation via Amazon S3 static website hosting. For more information, refer to Hosting a static website using Amazon S3.
  2. Finally, you can run this task in Step Functions as an ECS task to schedule the jobs as required. For more information, refer to Manage Amazon ECS or Fargate Tasks with Step Functions.

The dbt-redshift-etlpattern repo now has all the code samples required.

Cost for executing dbt jobs in AWS Fargate as an Amazon ECS task with minimal operational requirements would take around $1.5 (cost_link) per month.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the ECS Cluster you created.
  2. Delete the ECR repository you created for storing the image files.
  3. Delete the Redshift Cluster you created.
  4. Delete the Redshift Secrets stored in Secrets Manager.

Conclusion

This post covered the basic implementation of using dbt with Amazon Redshift in a cost-efficient way by using Fargate in Amazon ECS. We described the key infrastructure and configuration set-up with a sample project. This architecture can help you take advantage of the benefits of having a dbt framework to manage your data warehouse platform in Amazon Redshift.

For more information about dbt macros and models for Amazon Redshift internal operation and maintenance, refer to the following GitHub repo. In subsequent post, we’ll explore the traditional extract, transform, and load (ETL) patterns that you can implement using the dbt framework in Amazon Redshift. Test this solution in your account and provide feedback or suggestions in the comments.


About the Authors

Seshadri Senthamaraikannan is a data architect with AWS professional services team based in London, UK. He is well experienced and specialised in Data Analytics and works with customers focusing on building innovative and scalable solutions in AWS Cloud to meet their business goals. In his spare time, he enjoys spending time with his family and play sports.

Mohamed Hamdy is a Senior Big Data Architect with AWS Professional Services based in London, UK. He has over 15 years of experience architecting, leading, and building data warehouses and big data platforms. He helps customers develop big data and analytics solutions to accelerate their business outcomes through their cloud adoption journey. Outside of work, Mohamed likes travelling, running, swimming and playing squash.

Power enterprise-grade Data Vaults with Amazon Redshift – Part 1

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/power-enterprise-grade-data-vaults-with-amazon-redshift-part-1/

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x better price-performance than other cloud data warehouses.

As with all AWS services, Amazon Redshift is a customer-obsessed service that recognizes there isn’t a one-size-fits-all for customers when it comes to data models, which is why Amazon Redshift supports multiple data models such as Star Schemas, Snowflake Schemas and Data Vault. This post discusses best practices for designing enterprise-grade Data Vaults of varying scale using Amazon Redshift; the second post in this two-part series discusses the most pressing needs when designing an enterprise-grade Data Vault and how those needs are addressed by Amazon Redshift.

Whether it’s a desire to easily retain data lineage directly within the data warehouse, establish a source-system agnostic data model within the data warehouse, or more easily comply with GDPR regulations, customers that implement a Data Vault model will benefit from this post’s discussion of considerations, best practices, and Amazon Redshift features relevant to the building of enterprise-grade Data Vaults. Building a starter version of anything can often be straightforward, but building something with enterprise-grade scale, security, resiliency, and performance typically requires knowledge of and adherence to battle-tested best practices, and using the right tools and features in the right scenario.

Data Vault overview

Let’s first briefly review the core Data Vault premise and concepts. Data models provide a framework for how the data in a data warehouse should be organized into database tables. Amazon Redshift supports a number of data models, and some of the most popular data models include STAR schemas and Data Vault.

Data Vault is not only a modeling methodology, it’s also an opinionated framework that tells you how to solve certain problems in your data ecosystem. An opinionated framework provides a set of guidelines and conventions that developers are expected to follow, rather than leaving all decisions up to the developer. You can compare this with what big enterprise frameworks like Spring or Micronauts do when developing applications at enterprise scale. This is incredibly helpful especially on large data warehouse projects, because it structures your extract, load, and transform (ELT) pipeline and clearly tells you how to solve certain problems within the data and pipeline contexts. This also allows for a high degree of automation.

Data Vault 2.0 allows for the following:

  • Agile data warehouse development
  • Parallel data ingestion
  • A scalable approach to handle multiple data sources even on the same entity
  • A high level of automation
  • Historization
  • Full lineage support

However, Data Vault 2.0 also comes with costs, and there are use cases where it’s not a good fit, such as the following:

  • You only have a few data sources with no related or overlapping data (for example, a bank with a single core system)
  • You have simple reporting with infrequent changes
  • You have limited resources and knowledge of Data Vault

Data Vault typically organizes an organization’s data into a pipeline of four layers: staging, raw, business, and presentation. The staging layer represents data intake and light data transformations and enhancements that occur before the data comes to its more permanent resting place, the raw Data Vault (RDV).

The RDV holds the historized copy of all of the data from multiple source systems. It is referred to as raw because no filters or business transformations have occurred at this point except for storing the data in source system independent targets. The RDV organizes data into three key types of tables:

  • Hubs – This type of table represents a core business entity such as a customer. Each record in a hub table is married with metadata that identifies the record’s creation time, originating source system, and unique business key.
  • Links – This type of table defines a relationship between two or more hubs—for example, how the customer hub and order hub are to be joined.
  • Satellites – This type of table records the historized reference data about either hubs or links, such as product_info and customer_info

The RDV is used to feed data into the business Data Vault (BDV), which is responsible for reorganizing, denormalizing, and aggregating data for optimized consumption by the presentation mart. The presentation marts, also known as the data mart layer, further reorganizes the data for optimized consumption by downstream clients such as business dashboards. The presentation marts may, for example, reorganize data into a STAR schema.

For a more detailed overview of Data Vault along with a discussion of its applicability in the context of very interesting use cases, refer to the following:

How does Data Vault fit into a modern data architecture?

Currently, the lake house paradigm is becoming a major pattern in data warehouse design, even as part of a data mesh architecture. This follows the pattern of data lakes getting closer to what a data warehouse can do and vice versa. To compete with the flexibility of a data lake, Data Vault is a good choice. This way, the data warehouse doesn’t become a bottleneck and you can achieve similar agility, flexibility, scalability, and adaptability when ingestion and onboarding new data.

Platform flexibility

In this section, we discuss some recommended Redshift configurations for Data Vaults of varying scale. As mentioned earlier, the layers within a Data Vault platform are well known. We typically see a flow from the staging layer to the RDV, BDV, and finally presentation mart.

The Amazon Redshift is highly flexible in supporting both modest and large-scale Data Vaults, offering features like the following:

Modest vs. large-scale Data Vaults

Amazon Redshift is flexible in how you decide to structure these layers. For modest data vaults, a single Redshift warehouse with one database with multiple schemas will work just fine.

For large data vaults with more complex transformations, we would look at multiple warehouses, each with their own schema of mastered data representing one or more layer. The reason for using multiple warehouses is to take advantage of the Amazon Redshift architecture’s flexibility for implementing large-scale data vault implementations, such as using Redshift RA3 nodes and Redshift Serverless for separating the compute from the data storage layer and using Redshift data sharing to share the data between different Redshift warehouses. This enables you to scale the compute capacity independently at each layer depending on the processing complexity. The staging layer, for example, can be a layer within your data lake (Amazon S3 storage) or a schema within a Redshift database.

Using Amazon Aurora zero-ETL integrations with Amazon Redshift, you can create a data vault implementation with a staging layer in an Amazon Aurora database that will take care of real-time transaction processing and move the data to Amazon Redshift automatically for further processing in the Data Vault implementation without creating any complex ETL pipelines. This way, you can use Amazon Aurora for transactions and Amazon Redshift for analytics. Compute resources are isolated for the same data, and you’re using the most efficient tools to process it.

Large-scale Data Vaults

For larger Data Vaults platforms, concurrency and compute power become important to process both the loading of data and any business transformations. Amazon Redshift offers flexibility to increase compute capacity both horizontally via concurrency scaling and vertically via cluster resize and also via different architectures for each Data Vault layer.

Staging layer

You can create a data warehouse for the staging layer and perform hard business rules processing here, including calculation of hash keys, hash diffs, and addition of technical metadata columns. If data is not loaded 24/7, consider either pause/resume or a Redshift Serverless workgroup.

Raw Data Vault layer

For raw Data Vault (RDV), it’s recommended to either create one Redshift warehouse for the whole RDV or one Redshift warehouse for one or more subject areas within the RDV. For example, if the volume of data and number of normalized tables within the RDV for a particular subject area is large (either the raw data layer has so many tables that it runs out of maximum table limit on Amazon Redshift or the advantage of workload isolation within a single Redshift warehouse outweighs the overhead of performance and management), this subject area within the RDV can be run and mastered on its own Redshift warehouse.

The RDV is typically loaded 24/7 so a provisioned Redshift data warehouse may be most suitable here to take advantage of reserved instance pricing.

Business Data Vault layer

For creating a data warehouse for the business Data Vault (BDV) layer, this could be larger in size than the previous data warehouses due to the nature of the BDV processing, typically denormalization of data from a large number of source RDV tables.

Some customers run their BDV processing once a day, so a pause/resume window for Redshift provisioned cluster may be cost beneficial here. You can also run BDV processing on an Amazon Redshift Serverless warehouse which will automatically pause when workloads complete and resume when workloads start processing again.

Presentation Data Mart layer

For creating Redshift (provisioned or serverless) warehouses for one or more data marts, the schemas within these marts typically contain views or materialized views, so a Redshift data share will be set up between the data marts and the previous layers.

We need to ensure there is enough concurrency to cope with the increased read traffic at this level. This is achieved via multiple read only warehouses with a data share or the use of concurrency scaling to auto scale.

Example architectures

The following diagram illustrates an example platform for a modest Data Vault model.

The following diagram illustrates the architecture for a large-scale Data Vault model.

Data Vault data model guiding principles

In this section, we discuss some recommended design principles for joining and filtering table access within a Data Vault implementation. These guiding principles address different combinations of entity type access, but should be tested for suitability with each client’s particular use case.

Let’s begin with a brief refresher of table distribution styles in Amazon Redshift. There are four ways that a table’s data can be distributed among the different compute nodes in a Redshift cluster: ALL, KEY, EVEN, and AUTO.

The ALL distribution style ensures that a full copy of the table is maintained on each compute node to eliminate the need for inter-node network communication during workload runs. This distribution style is ideal for tables that are relatively small in size (such as fewer than 5 million rows) and don’t exhibit frequent changes.

The KEY distribution style uses a hash-based approach to persisting a table’s rows in the cluster. A distribution key column is defined to be one of the columns in the row, and the value of that column is hashed to determine on which compute node the row will be persisted. The current generation RA3 node type is built on the AWS Nitro System with managed storage that uses high performance SSDs for your hot data and Amazon S3 for your cold data, providing ease of use, cost-effective storage, and fast query performance. Managed storage means this mapping of row to compute node is more in terms of metadata and compute node ownership rather than the actual persistence. This distribution style is ideal for large tables that have well-known and frequent join patterns on the distribution key column.

The EVEN distribution style uses a round-robin approach to locating a table’s row. Simply put, table rows are cycled through the different compute nodes and when the last compute node in the cluster is reached, the cycle begins again with the next row being persisted to the first compute node in the cluster. This distribution style is ideal for large tables that exhibit frequent table scans.

Finally, the default table distribution style in Amazon Redshift is AUTO, which empowers Amazon Redshift to monitor how a table is used and change the table’s distribution style at any point in the table’s lifecycle for greater performance with workloads. However, you are also empowered to explicitly state a particular distribution style at any point in time if you have a good understanding of how the table will be used by workloads.

Hub and hub satellites

Hub and hub satellites are often joined together, so it’s best to co-locate these datasets based on the primary key of the hub, which will also be part of the compound key of each satellite. As mentioned earlier, for smaller volumes (typically fewer than 5–7 million rows) use the ALL distribution style and for larger volumes, use the KEY distribution style (with the _PK column as the distribution KEY column).

Link and link satellites

Link and link satellites are often joined together, so it’s best to co-locate these datasets based on the primary key of the link, which will also be part of the compound key of each link satellite. These typically involve larger data volumes, so look at a KEY distribution style (with the _PK column as the distribution KEY column).

Point in time and satellites

You may decide to denormalize key satellite attributes by adding them to point in time (PIT) tables with the goal of reducing or eliminating runtime joins. Because denormalization of data helps reduce or eliminate the need for runtime joins, denormalized PIT tables can be defined with an EVEN distribution style to optimize table scans.

However, if you decide not to denormalize, then smaller volumes should use the ALL distribution style and larger volumes should use the KEY distribution style (with the _PK column as the distribution KEY column). Also, be sure to define the business key column as a sort key on the PIT table for optimized filtering.

Bridge and link satellites

Similar to PIT tables, you may decide to denormalize key satellite attributes by adding them to bridge tables with the goal of reducing or eliminating runtime joins. Although denormalization of data helps reduce or eliminate the need for runtime joins, denormalized bridge tables are still typically larger in data volume and involved frequent joins, so the KEY distribution style (with the _PK column as the distribution KEY column) would be the recommended distribution style. Also, be sure to define the bridge of the dominant business key columns as sort keys for optimized filtering.

KPI and reporting

KPI and reporting tables are designed to meet the specific needs of each customer, so flexibility on their structure is key here. These are often standalone tables that exhibit multiple types of interactions, so the EVEN distribution style may be the best table distribution style to evenly spread the scan workloads.

Be sure to choose a sort key that is based on common WHERE clauses such as a date[time] element or a common business key. In addition, a time series table can be created for very large datasets that are always sliced on a time attribute to optimize workloads that typically interact with one slice of time. We discuss this subject in greater detail later in the post.

Non-functional design principles

In this section, we discuss potential additional data dimensions that are often created and married with business data to satisfy non-functional requirements. In the physical data model, these additional data dimensions take the form of technical columns added to each row to enable tracking of non-functional requirements. Many of these technical columns will be populated by the Data Vault framework. The following table lists some of the common technical columns, but you can extend the list as needed.

Column Name Applies to Table Description
LOAD_DTS All A timestamp recording of when this row was inserted. This is a primary key column for historized targets (links, satellites, reference), and a non-primary key column for transactional links and hubs.
BATCH_ID All A unique process ID identifying the run of the ETL code that populated the row.
JOB_NAME All The process name from the ETL framework. This may be a sub-process within a larger process.
SOURCE_SYSTEM_CD All The system from which this data was discovered.
HASH_DIFF Satellite A method in Data Vault of performing change data capture (CDC) changes.
RECORD_ID Satellite
Link
Reference
A unique identifier captured by the code framework for each row.
EFFECTIVE_DTS Link Business effective dates to record the business validity of the row. It’s set to the LOAD_DTS if no business date is present or needed.
DQ_AUDIT Satellite
Link
Reference
Warnings and errors found during staging for this row, tied to the RECORD_ID.

Advanced optimizations and guidelines

In this section, we discuss potential optimizations that can be deployed at the start or later on in the lifecycle of the Data Vault implementation.

Time series tables

Let’s begin with a brief refresher on time series tables as a pattern. Time series tables involve taking a large table and segmenting it into multiple identical tables that hold a time-bound portion of the rows in the original table. One common scenario is to divide a monolithic sales table into monthly or annual versions of the sales table (such as sales_jan,sales_feb, and so on). For example, let’s assume we want to maintain data for a rolling time period using a series of tables, as the following diagram illustrates.

With each new calendar quarter, we create a new table to hold the data for the new quarter and drop the oldest table in the series. Furthermore, if the table rows arrive in a naturally sorted order (such as sales date), then no work is needed to sort the table data, resulting in skipping the expensive VACUUM SORT operation on table.

Time series tables can help significantly optimize workloads that often need to scan these large tables but within a certain time range. Furthermore, by segmenting the data across tables that represent calendar quarters, we are able to drop aged data with a single DROP command. Had we tried to perform the same DELETE operation on a monolithic table design using the DELETE command, for example, it would have been a more expensive deletion operation that would have left the table in a suboptimal state requiring defragmentation and also saves to run a subsequent VACUUM process to reclaim space.

Should a workload ever need to query against the entire time range, you can use standard or materialized views using a UNION ALL operation within Amazon Redshift to easily stitch all the component tables back into the unified dataset. Materialized views can also be used to abstract the table segmentation from downstream users.

In the context of Data Vault, time series tables can be useful for archiving rows within satellites, PIT, and bridge tables that aren’t used often. Time series tables can then be used to distribute the remaining hot rows (rows that are either recently added or referenced often) with more aggressive table properties.

Conclusion

In this post, we discussed a number of areas ripe for optimization and automation to successfully implement a Data Vault 2.0 system at scale and the Amazon Redshift capabilities that you can use to satisfy the related requirements. There are many more Amazon Redshift capabilities and features that will surely come in handy, and we strongly encourage current and prospective customers to reach out to us or other AWS colleagues to delve deeper into Data Vault with Amazon Redshift.


About the Authors

Asser Moustafa is a Principal Analytics Specialist Solutions Architect at AWS based out of Dallas, Texas. He advises customers globally on their Amazon Redshift and data lake architectures, migrations, and visions—at all stages of the data ecosystem lifecycle—starting from the POC stage to actual production deployment and post-production growth.

Philipp Klose is a Global Solutions Architect at AWS based in Munich. He works with enterprise FSI customers and helps them solve business problems by architecting serverless platforms. In this free time, Philipp spends time with his family and enjoys every geek hobby possible.

Saman Irfan is a Specialist Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/power-enterprise-grade-data-vaults-with-amazon-redshift-part-2/

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x better price-performance than any other cloud data warehouses.

As with all AWS services, Amazon Redshift is a customer-obsessed service that recognizes there isn’t a one-size-fits-all for customers when it comes to data models, which is why Amazon Redshift supports multiple data models such as Star Schemas, Snowflake Schemas and Data Vault. This post discusses the most pressing needs when designing an enterprise-grade Data Vault and how those needs are addressed by Amazon Redshift in particular and AWS cloud in general. The first post in this two-part series discusses best practices for designing enterprise-grade data vaults of varying scale using Amazon Redshift.

Whether it’s a desire to easily retain data lineage directly within the data warehouse, establish a source-system agnostic data model within the data warehouse, or more easily comply with GDPR regulations, customers that implement a data vault model will benefit from this post’s discussion of considerations, best practices, and Amazon Redshift features as well as the AWS cloud capabilities relevant to the building of enterprise-grade data vaults. Building a starter version of anything can often be straightforward, but building something with enterprise-grade scale, security, resiliency, and performance typically requires knowledge and adherence to battle-tested best practices, and using the right tools and features in the right scenario.

Data Vault overview

For a brief review of the core Data Vault premise and concepts, refer to the first post in this series.

In the following sections, we discuss the most common areas of consideration that are critical for Data Vault implementations at scale: data protection, performance and elasticity, analytical functionality, cost and resource management, availability, and scalability. Although these areas can also be critical areas of consideration for any data warehouse data model, in our experience, these areas present their own flavor and special needs to achieve data vault implementations at scale.

Data protection

Security is always priority-one at AWS, and we see the same attention to security every day with our customers. Data security has many layers and facets, ranging from encryption at rest and in transit to fine-grained access controls and more. In this section, we explore the most common data security needs within the raw and business data vaults and the Amazon Redshift features that help achieve those needs.

Data encryption

Amazon Redshift encrypts data in transit by default. With the click of a button, you can configure Amazon Redshift to encrypt data at rest at any point in a data warehouse’s lifecycle, as shown in the following screenshot.

You can use either AWS Key Management Service (AWS KMS) or Hardware Security Module (HSM) to perform encryption of data at rest. If you use AWS KMS, you can either use an AWS managed key or customer managed key. For more information, refer to Amazon Redshift database encryption.

You can also modify cluster encryption after cluster creation, as shown in the following screenshot.

Moreover, Amazon Redshift Serverless is encrypted by default.

Fine-grained access controls

When it comes to achieving fine-grained access controls at scale, Data Vaults typically need to use both static and dynamic access controls. You can use static access controls to restrict access to databases, tables, rows, and columns to explicit users, groups, or roles. With dynamic access controls, you can mask part or all portions of a data item, such as a column based on a user’s role or some other functional analysis of a user’s privileges.

Amazon Redshift has long supported static access controls through the GRANT and REVOKE commands for databases, schemas, and tables, at row level and column level. Amazon Redshift also supports row-level security, where you can further restrict access to particular rows of the visible columns, as well as role-based access control, which helps simplify the management of security privileges in Amazon Redshift.

In the following example, we demonstrate how you can use GRANT and REVOKE statements to implement static access control in Amazon Redshift.

  1. First, create a table and populate it with credit card values:
    -- Create the credit cards table
    
    CREATE TABLE credit_cards 
    ( customer_id INT, 
    is_fraud BOOLEAN, 
    credit_card TEXT);
    
    --populate the table with sample values
    
    INSERT INTO credit_cards 
    VALUES
    (100,'n', '453299ABCDEF4842'),
    (100,'y', '471600ABCDEF5888'),
    (102,'n', '524311ABCDEF2649'),
    (102,'y', '601172ABCDEF4675'),
    (102,'n', '601137ABCDEF9710'),
    (103,'n', '373611ABCDEF6352');
    

  2. Create the user user1 and check permissions for user1 on the credit_cards table. We use SET SESSION AUTHORIZATION to switch to user1 in the current session:
       -- Create user
    
       CREATE USER user1 WITH PASSWORD '1234Test!';
    
       -- Check access permissions for user1 on credit_cards table
       SET SESSION AUTHORIZATION user1; 
       SELECT * FROM credit_cards; -- This will return permission defined error
    

  3. Grant SELECT access on the credit_cards table to user1:
    RESET SESSION AUTHORIZATION;
     
    GRANT SELECT ON credit_cards TO user1;
    

  4. Verify access permissions on the table credit_cards for user1:
    SET SESSION AUTHORIZATION user1;
    
    SELECT * FROM credit_cards; -- Query will return rows
    RESET SESSION AUTHORIZATION;

Data obfuscation

Static access controls are often useful to establish hard boundaries (guardrails) of the user communities that should be able to access certain datasets (for example, only those users that are part of the marketing user group should be allowed access to marketing data). However, what if access controls need to restrict only partial aspects of a field, not the entire field? Amazon Redshift supports partial, full, or custom data masking of a field through dynamic data masking. Dynamic data masking enables you to protect sensitive data in your data warehouse. You can manipulate how Amazon Redshift shows sensitive data to the user at query time without transforming it in the database by using masking policies.

In the following example, we achieve a full redaction of credit card numbers at runtime using a masking policy on the previously used credit_cards table.

  1. Create a masking policy that fully masks the credit card number:
    CREATE MASKING POLICY mask_credit_card_full 
    WITH (credit_card VARCHAR(256)) 
    USING ('000000XXXX0000'::TEXT);

  2. Attach mask_credit_card_full to the credit_cards table as the default policy. Note that all users will see this masking policy unless a higher priority masking policy is attached to them or their role.
    ATTACH MASKING POLICY mask_credit_card_full 
    ON credit_cards(credit_card) TO PUBLIC;

  3. Users will see credit card information being masked when running the following query
    SELECT * FROM credit_cards;

Centralized security policies

You can achieve a great deal of scale by combining static and dynamic access controls to span a broad swath of user communities, datasets, and access scenarios. However, what about datasets that are shared across multiple Redshift warehouses, as might be done between raw data vaults and business data vaults? How can scale be achieved with access controls for a dataset that resides on one Redshift warehouse but is authorized for use across multiple Redshift warehouses using Amazon Redshift data sharing?

The integration of Amazon Redshift with AWS Lake Formation enables centrally managed access and permissions for data sharing. Amazon Redshift data sharing policies are established in Lake Formation and will be honored by all of your Redshift warehouses.

Performance

It is not uncommon for sub-second SLAs to be associated with data vault queries, particularly when interacting with the business vault and the data marts sitting atop the business vault. Amazon Redshift delivers on that needed performance through a number of mechanisms such as caching, automated data model optimization, and automated query rewrites.

The following are common performance requirements for Data Vault implementations at scale:

  • Query and table optimization in support of high-performance query throughput
  • High concurrency
  • High-performance string-based data processing

Amazon Redshift features and capabilities for performance

In this section, we discuss Amazon Redshift features and capabilities that address those performance requirements.

Caching

Amazon Redshift uses multiple layers of caching to deliver subsecond response times for repeat queries. Through Amazon Redshift in-memory result set caching and compilation caching, workloads ranging from dashboarding to visualization to business intelligence (BI) that run repeat queries experience a significant performance boost.

With in-memory result set caching, queries that have a cached result set and no changes to the underlying data return immediately and typically within milliseconds.

The current generation RA3 node type is built on the AWS Nitro System with managed storage that uses high performance SSDs for your hot data and Amazon S3 for your cold data, providing ease of use, cost-effective storage, and fast query performance. In short, managed storage means fast retrieval for your most frequently accessed data and automated/managed identification of hot data by Amazon Redshift.

The large majority of queries in a typical production data warehouse are repeat queries, and data warehouses with data vault implementations observe the same pattern. The most optimal run profile for a repeat query is one that avoids costly query runtime interpretation, which is why queries in Amazon Redshift are compiled during the first run and the compiled code is cached in a global cache, providing repeat queries a significant performance boost.

Materialized views

Pre-computing the result set for repeat queries is a powerful mechanism for boosting performance. The fact that it automatically refreshes to reflect the latest changes in the underlying data is yet another powerful pattern for boosting performance. For example, consider the denormalization queries that might be run on the raw data vault to populate the business vault. It’s quite possible that some less-active source systems will have exhibited little to no changes in the raw data vault since the last run. Avoiding the hit of rerunning the business data vault population queries from scratch in those cases could be a tremendous boost to performance. Redshift materialized views provide that exact functionality by storing the precomputed result set of their backing query.

Queries that are similar to the materialized view’s backing query don’t have to rerun the same logic each time, because they can pull records from the existing result set. Developers and analysts can choose to create materialized views after analyzing their workloads to determine which queries would benefit. Materialized views also support automatic query rewriting to have Amazon Redshift rewrite queries to use materialized views, as well as auto refreshing materialized views, where Amazon Redshift can automatically refresh materialized views with up-to-date data from its base tables.

Alternatively, the automated materialized views (AutoMV) feature provides the same performance benefits of user-created materialized views without the maintenance overhead because Amazon Redshift automatically creates the materialized views based on observed query patterns. Amazon Redshift continually monitors the workload using machine learning and then creates new materialized views when they are beneficial. AutoMV balances the costs of creating and keeping materialized views up to date against expected benefits to query latency. The system also monitors previously created AutoMVs and drops them when they are no longer beneficial. AutoMV behavior and capabilities are the same as user-created materialized views. They are refreshed automatically and incrementally, using the same criteria and restrictions.

Also, whether the materialized views are user-created or auto-generated, Amazon Redshift automatically rewrites queries, without users to change queries, to use materialized views when there is enough of a similarity between the query and the materialized view’s backing query.

Concurrency scaling

Amazon Redshift automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries. Concurrency scaling resources are added to your Redshift warehouse transparently in seconds, as concurrency increases, to process read/write queries without wait time. When workload demand subsides, Amazon Redshift automatically shuts down concurrency scaling resources to save you cost. You can continue to use your existing applications and BI tools without any changes.

Because Data Vault allows for highly concurrent data processing and is primarily run within Amazon Redshift, concurrency scaling is the recommended way to handle concurrent transformation operations. You should avoid operations that aren’t supported by concurrency scaling.

Concurrent ingestion

One of the key attractions of Data Vault 2.0 is its ability to support high-volume concurrent ingestion from multiple source systems into the data warehouse. Amazon Redshift provides a number of options for concurrent ingestion, including batch and streaming.

For batch- and microbatch-based ingestion, we suggest using the COPY command in conjunction with CSV format. CSV is well supported by concurrency scaling. In case your data is already on Amazon S3 but in Bigdata formats like ORC or Parquet, always consider the trade-off of converting the data to CSV vs. non-concurrent ingestion. You can also use workload management to prioritize non-concurrent ingestion jobs to keep the throughput high.

For low-latency workloads, we suggest using the native Amazon Redshift streaming capability or the Amazon Redshift Zero ETL capability in conjunction with Amazon Aurora. By using Aurora as a staging layer for the raw data, you can handle small increments of data efficiently and with high concurrency, and then use this data inside your Redshift data warehouse without any extract, transform, and load (ETL) processes. For stream ingestion, we suggest using the native streaming feature (Amazon Redshift streaming ingestion) and have a dedicated stream for ingesting each table. This might require a stream processing solution upfront, which splits the input stream into the respective elements like the hub and the satellite record.

String-optimized compression

The Data Vault 2.0 methodology often involves time-sensitive lookup queries against potentially very large satellite tables (in terms of row count) that have low-cardinality hash/string indexes. Low-cardinality indexes and very large tables tend to work against time-sensitive queries. Amazon Redshift, however, provides a specialized compression method for low-cardinality string-based indexes called BYTEDICT. Using BYTEDICT creates a dictionary of the low-cardinality string indexes that allow Amazon Redshift to reads the rows even in a compressed state, thereby significantly improving performance. You can manually select the BYTEDICT compression method for a column, or alternatively rely on Amazon Redshift automated table optimization facilities to select it for you.

Support of transactional data lake frameworks

Data Vault 2.0 is an insert-only framework. Therefore, reorganizing data to save money is a challenge you may face. Amazon Redshift integrates seamlessly with S3 data lakes allowing you to perform data lake queries in your S3 using standard SQL as you would with native tables. This way, you can outsource less frequently used satellites to your S3 data lake, which is cheaper than keeping it as a native table.

Modern transactional lake formats like Apache Iceberg are also an excellent option to store this data. They ensure transactional safety and therefore ensure that your audit trail, which is a fundamental feature of Data Vault, doesn’t break.

We also see customers using these frameworks as a mechanism to implement incremental loads. Apache Iceberg lets you query for the last state for a given point in time. You can use this mechanism to optimize merge operations while still making the data accessible from within Amazon Redshift.

Amazon Redshift data sharing performance considerations

For large-scale Data Vault implementation, one of the preferred design principals is to have a separate Redshift data warehouse for each layer (staging, raw Data Vault, business Data Vault, and presentation data mart). These layers have separate Redshift provisioned or serverless warehouses according to their storage and compute requirements and use Amazon Redshift data sharing to share the data between these layers without physically moving the data.

Amazon Redshift data sharing enables you to seamlessly share live data across multiple Redshift warehouses without any data movement. Because the data sharing feature serves as the backbone in implementing large-scale Data Vaults, it’s important to understand the performance of Amazon Redshift in this scenario.

In a data sharing architecture, we have producer and consumer Redshift warehouses. The producer warehouse shares the data objects to one or more consumer warehouse for read purposes only without having to copy the data.

Producer/consumer Redshift cluster performance dependency

From a performance perspective, the producer (provisioned or serverless) warehouse is not responsible for query performance running on the consumer (provisioned or serverless) warehouse and has zero impact in terms of performance or activity on the producer Redshift warehouse. It depends on the consumer Redshift warehouse compute capacity. The producer warehouse is only responsible for the shared data.

Result set caching on the consumer Redshift cluster

Amazon Redshift uses result set caching to speed up the retrieval of data when it knows that the data in the underlying table has not changed. In a data sharing architecture, Amazon Redshift also uses result set caching on the consumer Redshift warehouse. This is quite helpful for repeatable queries that commonly occur in a data warehousing environment.

Best practices for materialized views in Data Vault with Amazon Redshift data sharing

In Data Vault implementation, the presentation data mart layer typically contains views or materialized views. There are two possible routes to create materialized views for the presentation data mart layer. First, create the materialized views on the producer Redshift warehouse (business data vault layer) and share materialized views with the consumer Redshift warehouse (dedicated data marts). Alternatively, share the table objects directly from the business data vault layer to the presentation data mart layer and build the materialized view on the shared objects directly on the consumer Redshift warehouse.

The second option is recommended in this case, because it gives us the flexibility of creating customized materialized views of data on each consumer according to the specific use case and simplifies the management because each data mart user can create and manage materialized views on their own Redshift warehouse rather than be dependent on the producer warehouse.

Table distribution implications in Amazon Redshift data sharing

Table distribution style and how data is distributed across Amazon Redshift plays a significant role in query performance. In Amazon Redshift data sharing, the data is distributed on the producer Redshift warehouse according to the distribution style defined for table. When we associate the data via a data share to the consumer Redshift warehouse, it maps to the same disk block layout. Also, a bigger consumer Redshift warehouse will result in better query performance for queries running on it.

Concurrency scaling

Concurrency scaling is also supported on both producer and consumer Redshift warehouses for read and write operations.

Cost and resource management

Given that multiple source systems and users will interact heavily with the data vault data warehouse, it’s a prudent best practice to enable usage and query limits to serve as guardrails against runaway queries and unapproved usage patterns. Furthermore, it often helps to have a systematic way for allocating service costs based on usage of the data vault to different source systems and user groups within your organization.

The following are common cost and resource management requirements for Data Vault implementations at scale:

  • Utilization limits and query resource guardrails
  • Advanced workload management
  • Chargeback capabilities

Amazon Redshift features and capabilities for cost and resource management

In this section, we discuss Amazon Redshift features and capabilities that address those cost and resource management requirements.

Utilization limits and query monitoring rules

Runaway queries and excessive auto scaling are likely to be the two most common runaway patterns observed with data vault implementations at scale.

A Redshift provisioned cluster supports usage limits for features such as Redshift Spectrum, concurrency scaling, and cross-Region data sharing. A concurrency scaling limit specifies the threshold of the total amount of time used by concurrency scaling in 1-minute increments. A limit can be specified for a daily, weekly, or monthly period (using UTC to determine the start and end of the period).

You can also define multiple usage limits for each feature. Each limit can have a different action, such as logging to system tables, alerting via Amazon CloudWatch alarms and optionally Amazon Simple Notification Service (Amazon SNS) subscriptions to that alarm (such as email or text), or disabling the feature outright until the next time period begins (such as the start of the month). When a usage limit threshold is reached, events are also logged to a system table.

Redshift provisioned clusters also support query monitoring rules to define metrics-based performance boundaries for workload management queues and the action that should be taken when a query goes beyond those boundaries. For example, for a queue dedicated to short-running queries, you might create a rule that cancels queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops.

Each query monitoring rule includes up to three conditions, or predicates, and one query action (such as stop, hop, or log). A predicate consists of a metric, a comparison condition (=, <, or >), and a value. If all of the predicates for any rule are met, that rule’s action is triggered. Amazon Redshift evaluates metrics every 10 seconds and if more than one rule is triggered during the same period, Amazon Redshift initiates the most severe action (stop, then hop, then log).

Redshift Serverless also supports usage limits where you can specify the base capacity according to your price-performance requirements. You can also set the maximum RPU (Redshift Processing Units) hours used per day, per week, or per month to keep the cost predictable and specify different actions, such as write to system table, receive an alert, or turn off user queries when the limit is reached. A cross-Region data sharing usage limit is also supported, which limits how much data transferred from the producer Region to the consumer Region that consumers can query.

You can also specify query limits in Redshift Serverless to stop poorly performing queries that exceed the threshold value.

Auto workload management

Not all queries have the same performance profile or priority, and data vault queries are no different. Amazon Redshift workload management (WLM) adapts in real time to the priority, resource allocation, and concurrency settings required to optimally run different data vault queries. These queries could consist of a high number of joins between the hubs, links, and satellites tables; large-scale scans of the satellite tables; or massive aggregations. Amazon Redshift WLM enables you to flexibly manage priorities within workloads so that, for example, short or fast-running queries won’t get stuck in queues behind long-running queries.

You can use automatic WLM to maximize system throughput and use resources effectively. You can enable Amazon Redshift to manage how resources are divided to run concurrent queries with automatic WLM. Automatic WLM manages the resources required to run queries. Amazon Redshift determines how many queries run concurrently and how much memory is allocated to each dispatched query.

Chargeback metadata

Amazon Redshift provides different pricing models to cater to different customer needs. On-demand pricing offers the greatest flexibility, whereas Reserved Instances provide significant discounts for predictable and steady usage scenarios. Redshift Serverless provides a pay-as-you-go model that is ideal for sporadic workloads.

However, with any of these pricing models, Amazon Redshift customers can attribute cost to different users. To start, Amazon Redshift provides itemized billing like many other AWS services in AWS Cost Explorer to attain the overall cost of using Amazon Redshift. Moreover, the cross-group collaboration (data sharing) capability of Amazon Redshift enables a more explicit and structured chargeback model to different teams.

Availability

In the modern data organization, data warehouses are no longer used purely to perform historical analysis in batches overnight with relatively forgiving SLAs, Recovery Time Objectives (RTOs), and Recovery Point Objectives (RPOs). They have become mission-critical systems in their own right that are used for both historical analysis and near-real-time data analysis. Data Vault systems at scale very much fit that mission-critical profile, which makes availability key.

The following are common availability requirements for Data Vault implementations at scale:

  • RTO of near-zero
  • RPO of near-zero
  • Automated failover
  • Advanced backup management
  • Commercial-grade SLA

Amazon Redshift features and capabilities for availability

In this section, we discuss the features and capabilities in Amazon Redshift that address those availability requirements.

Separation of storage and compute

AWS and Amazon Redshift are inherently resilient. With Amazon Redshift, there’s no additional cost for active-passive disaster recovery. Amazon Redshift replicates all of your data within your data warehouse when it is loaded and also continuously backs up your data to Amazon S3. Amazon Redshift always attempts to maintain at least three copies of your data (the original and replica on the compute nodes, and a backup in Amazon S3).

With separation of storage and compute and Amazon S3 as the persistence layer, you can achieve an RPO of near-zero, if not zero itself.

Cluster relocation to another Availability Zone

Amazon Redshift provisioned RA3 clusters support cluster relocation to another Availability Zone in events where cluster operation in the current Availability Zone is not optimal, without any data loss or changes to your application. Cluster relocation is available free of charge; however, relocation might not always be possible if there is a resource constraint in the target Availability Zone.

Multi-AZ deployment

For many customers, the cluster relocation feature is sufficient; however, enterprise data warehouse customers require a low RTO and higher availability to support their business continuity with minimal impact to applications.

Amazon Redshift supports Multi-AZ deployment for provisioned RA3 clusters.

A Redshift Multi-AZ deployment uses compute resources in multiple Availability Zones to scale data warehouse workload processing as well as provide an active-active failover posture. In situations where there is a high level of concurrency, Amazon Redshift will automatically use the resources in both Availability Zones to scale the workload for both read and write requests using active-active processing. In cases where there is a disruption to an entire Availability Zone, Amazon Redshift will continue to process user requests using the compute resources in the sister Availability Zone.

With features such as multi-AZ deployment, you can achieve a low RTO, should there ever be a disruption to the primary Redshift cluster or an entire Availability Zone.
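
As a hedged sketch (the cluster identifier is a placeholder, and the parameter names are our assumption of the current boto3 Redshift API, so check them against the API reference before use), Availability Zone relocation and Multi-AZ might be enabled on an existing RA3 cluster as follows:

import boto3

redshift = boto3.client("redshift")

# Placeholder cluster identifier; both settings assume an RA3 node type.
redshift.modify_cluster(
    ClusterIdentifier="dv-prod-cluster",
    AvailabilityZoneRelocation=True,  # assumed flag: allow relocation to another AZ
)
redshift.modify_cluster(
    ClusterIdentifier="dv-prod-cluster",
    MultiAZ=True,  # assumed flag: enable the Multi-AZ deployment
)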

Automated backup

Amazon Redshift automatically takes incremental snapshots that track changes to the data warehouse since the previous automated snapshot. Automated snapshots retain all of the data required to restore a data warehouse from a snapshot. You can create a snapshot schedule to control when automated snapshots are taken, or you can take a manual snapshot any time.

Automated snapshots can be taken as often as once every hour and retained for up to 35 days at no additional charge. Manual snapshots can be kept indefinitely at standard Amazon S3 rates. Furthermore, automated snapshots can be replicated to another Region and stored there as a disaster recovery site, also at no additional charge apart from cross-Region data transfer; manual snapshots can be replicated as well, with standard Amazon S3 rates and data transfer costs applying.
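
The following sketch (identifiers, schedule, Region, and retention values are placeholders) shows how a snapshot schedule and cross-Region snapshot copy might be configured with boto3:

import boto3

redshift = boto3.client("redshift")

# Take automated snapshots every 8 hours on a (placeholder) cluster
redshift.create_snapshot_schedule(
    ScheduleIdentifier="dv-8h-schedule",
    ScheduleDefinitions=["rate(8 hours)"],
)
redshift.modify_cluster_snapshot_schedule(
    ClusterIdentifier="dv-prod-cluster",
    ScheduleIdentifier="dv-8h-schedule",
)

# Replicate automated snapshots to a disaster recovery Region
redshift.enable_snapshot_copy(
    ClusterIdentifier="dv-prod-cluster",
    DestinationRegion="eu-west-1",
    RetentionPeriod=7,  # days to keep copied automated snapshots
)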

Amazon Redshift SLA

As a managed service, Amazon Redshift frees you from being the first and only line of defense against disruptions. AWS will use commercially reasonable efforts to make Amazon Redshift available with a monthly uptime percentage, during any monthly billing cycle, of at least 99.99% for each Multi-AZ Redshift cluster and at least 99.9% for each multi-node cluster. In the event that Amazon Redshift doesn’t meet the Service Commitment, you will be eligible to receive a Service Credit.

Scalability

One of the major motivations of organizations migrating to the cloud is improved and increased scalability. With Amazon Redshift, Data Vault systems will always have a number of scaling options available to them.

The following are common scalability requirements for Data Vault implementations at scale:

  • Automated and fast-initiating horizontal scaling
  • Robust and performant vertical scaling
  • Data reuse and sharing mechanisms

Amazon Redshift features and capabilities for scalability

In this section, we discuss the features and capabilities in Amazon Redshift that address those scalability requirements.

Horizontal and vertical scaling

Amazon Redshift uses concurrency scaling automatically to support virtually unlimited horizontal scaling of concurrent users and concurrent queries, with consistently fast query performance. Furthermore, concurrency scaling requires no downtime, supports read/write operations, and is typically the most impactful and used scaling option for customers during normal business operations to maintain consistent performance.

With an Amazon Redshift provisioned warehouse, as your data warehousing capacity and performance needs change or grow, you can vertically scale your cluster to make the best use of the computing and storage options that Amazon Redshift provides. Resizing your cluster by changing the node type or number of nodes can typically be achieved in 10–15 minutes. Vertical scaling typically occurs much less frequently in response to persistent and organic growth and is typically performed during a planned maintenance window when the short downtime doesn’t impact business operations.

Explicit horizontal or vertical resize and pause operations can be automated on a schedule (for example, development clusters can be automatically scaled down or paused for the weekends). Note that the storage of paused clusters remains accessible to clusters with which their data was shared.
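
For illustration, such a scheduled operation typically wraps a few API calls like the following (the cluster identifier, node type, and node count are placeholders); in practice you would trigger them from a Redshift scheduled action or an Amazon EventBridge rule:

import boto3

redshift = boto3.client("redshift")

# Elastic resize ahead of a known busy period
redshift.resize_cluster(
    ClusterIdentifier="dv-dev-cluster",
    NodeType="ra3.4xlarge",
    NumberOfNodes=4,
    Classic=False,  # elastic resize rather than classic resize
)

# Pause a development cluster for the weekend, then resume it on Monday
redshift.pause_cluster(ClusterIdentifier="dv-dev-cluster")
redshift.resume_cluster(ClusterIdentifier="dv-dev-cluster")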

For resource-intensive workloads that might benefit from a vertical scaling operation vs. concurrency scaling, there are also other best-practice options that avoid downtime, such as deploying the workload onto its own Redshift Serverless warehouse while using data sharing.

Redshift Serverless measures data warehouse capacity in RPUs, which are resources used to handle workloads. You can specify the base data warehouse capacity Amazon Redshift uses to serve queries (ranging from as little as 8 RPUs to as high as 512 RPUs) and change the base capacity at any time.
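
For example, raising the base capacity of a Redshift Serverless workgroup is a single API call (the workgroup name is a placeholder):

import boto3

serverless = boto3.client("redshift-serverless")

# Raise the base capacity of a (placeholder) workgroup to 64 RPUs
serverless.update_workgroup(
    workgroupName="dv-consumption-wg",
    baseCapacity=64,  # valid values run from 8 to 512 RPUs, in units of 8
)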

Data sharing

Amazon Redshift data sharing is a secure and straightforward way to share live data for read purposes across Redshift warehouses within the same or different accounts and Regions. This enables high-performance data access while preserving workload isolation. You can have separate Redshift warehouses, either provisioned or serverless, for different use cases according to your compute requirement and seamlessly share data between them.

Common use cases for data sharing include the following:

  • Setting up a central ETL warehouse that shares data with many BI warehouses to provide read workload isolation and chargeback
  • Offering data as a service and sharing data with external consumers
  • Sharing data with multiple business groups within an organization and collaborating on data to gain differentiated insights
  • Sharing data between development, test, and production environments
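
The following is a minimal sketch of the SQL involved, submitted here through the Redshift Data API; the share, schema, namespace GUIDs, and warehouse identifiers are placeholders.

import boto3

data_api = boto3.client("redshift-data")

# On the producer warehouse: create the share and add the Raw Data Vault schema
producer_sqls = [
    "CREATE DATASHARE dv_core_share",
    "ALTER DATASHARE dv_core_share ADD SCHEMA rdv",
    "ALTER DATASHARE dv_core_share ADD ALL TABLES IN SCHEMA rdv",
    "GRANT USAGE ON DATASHARE dv_core_share TO NAMESPACE '<consumer-namespace-guid>'",
]
data_api.batch_execute_statement(
    ClusterIdentifier="dv-producer-cluster",  # placeholder
    Database="dev",                           # placeholder
    DbUser="awsuser",                         # placeholder; a Secrets Manager ARN works too
    Sqls=producer_sqls,
)

# On the consumer warehouse: expose the share as a local database
data_api.execute_statement(
    WorkgroupName="dv-bi-consumer",  # placeholder serverless workgroup
    Database="dev",
    Sql="CREATE DATABASE dv_core FROM DATASHARE dv_core_share "
        "OF NAMESPACE '<producer-namespace-guid>'",
)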

Reference architecture

The diagram in this section shows one possible reference architecture of a Data Vault 2.0 system implemented with Amazon Redshift.

We suggest using three different Redshift warehouses to run a Data Vault 2.0 model in Amazon Redshift. The data between these data warehouses is shared via Amazon Redshift data sharing, which allows you to consume data from a consumer data warehouse even if the provider data warehouse is inactive.

  • Raw Data Vault – The RDV data warehouse hosts hubs, links, and satellite tables. For large models, you can additionally slice the RDV into additional data warehouses to better adapt the data warehouse sizing to your workload patterns. Data is ingested via the patterns described in the previous section as batch or high-velocity data.
  • Business Data Vault – The BDV data warehouse hosts bridge and point-in-time (PIT) tables. These tables are computed from the RDV tables using Amazon Redshift. Materialized views or automatic materialized views are straightforward mechanisms to create them.
  • Consumption cluster – This data warehouse contains query-optimized data formats and marts. Users interact with this layer.

If the workload pattern is unknown, we suggest starting with a Redshift Serverless warehouse and learning the workload pattern. You can easily migrate between a serverless and provisioned Redshift cluster at a later stage based on your processing requirements, as discussed in Part 1 of this series.

Best practices building a Data Vault warehouse on AWS

In this section, we cover how the AWS Cloud as a whole plays its role in building an enterprise-grade Data Vault warehouse on Amazon Redshift.

Education

Education is a fundamental success factor. Data Vault is more complex than traditional data modeling methodologies. Before you start the project, make sure everyone understands the principles of Data Vault. Amazon Redshift is designed to be very easy to use, but to ensure the most optimal Data Vault implementation on Amazon Redshift, gaining a good understanding of how Amazon Redshift works is recommended. Start with free resources like reaching out to your AWS account representative to schedule a free Amazon Redshift Immersion Day or train for the AWS Analytics specialty certification.

Automation

Automation is a major benefit of Data Vault. This will increase efficiency and consistency across your data landscape. Most customers focus on the following aspects when automating Data Vault:

  • Automated DDL and DML creation, including modeling tools especially for the raw data vault
  • Automated ingestion pipeline creation
  • Automated metadata and lineage support

Depending on your needs and skills, we typically see three different approaches:

  • DSL – A common approach is to generate data vault models and flows with a domain-specific language (DSL). Popular frameworks for building such DSLs are EMF with Xtext or MPS. This solution provides the most flexibility: you build your business vocabulary directly into the application and generate documentation and a business glossary along with the code. This approach requires the most skill and the biggest resource investment.
  • Modeling tool – You can build on an existing modeling language like UML 2.0. Many modeling tools come with code generators, so you don’t need to build your own tool, but these tools are often hard to integrate into modern DevOps pipelines. They also require UML 2.0 knowledge, which raises the bar for non-technical users.
  • Buy – There are a number of different third-party solutions that integrate well into Amazon Redshift and are available via AWS Marketplace.

Whichever of these approaches you choose, all three offer multiple benefits. For example, you can take repetitive tasks away from your development team and enforce modeling standards like data types, data quality rules, and naming conventions. To generate the code and deploy it, you can use AWS DevOps services. As part of this process, you save the generated metadata to the AWS Glue Data Catalog, which serves as a central technical metadata catalog. You then deploy the generated code to Amazon Redshift (SQL scripts) and to AWS Glue.
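
The following is a minimal sketch of this generation-and-deploy idea, not any specific vendor tool: a hub table DDL is rendered from a small metadata dictionary and submitted through the Redshift Data API. The schema, table, column, cluster, and user names are hypothetical placeholders.

import boto3

# Hypothetical metadata describing one Data Vault hub
hub = {"schema": "rdv", "name": "hub_client", "business_key": "client_bk"}

# Render the hub DDL from the metadata (template kept deliberately small)
ddl = f"""
CREATE TABLE IF NOT EXISTS {hub['schema']}.{hub['name']} (
    {hub['name']}_hk      VARCHAR(32)  NOT NULL,
    {hub['business_key']} VARCHAR(50)  NOT NULL,
    load_dts              TIMESTAMP    NOT NULL,
    record_source         VARCHAR(100) NOT NULL
)
DISTSTYLE KEY DISTKEY ({hub['name']}_hk)
SORTKEY (load_dts);
"""

# Submit the generated DDL to Amazon Redshift via the Redshift Data API
boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="dv-producer-cluster",  # placeholder
    Database="dev",                           # placeholder
    DbUser="awsuser",                         # placeholder; a Secrets Manager ARN works too
    Sql=ddl,
)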

AWS CloudFormation is designed for automation; it’s the AWS-native way of automating infrastructure creation and management. A major use case for infrastructure as code (IaC) is to create new ingestion pipelines for new data sources or add new entities to existing ones.

You can also use our new AI coding tool Amazon CodeWhisperer, which helps you quickly write secure code by generating whole line and full function code suggestions in your IDE in real time, based on your natural language comments and surrounding code. For example, CodeWhisperer can automatically take a prompt such as “get new files uploaded in the last 24 hours from the S3 bucket” and suggest appropriate code and unit tests. This can greatly reduce development effort in writing code, for example for ETL pipelines or generating SQL queries, and allow more time for implementing new ideas and writing differentiated code.

Operations

As previously mentioned, one of the benefits of Data Vault is the high level of automation which, in conjunction with serverless technologies, can lower the operating efforts. On the other hand, some industry products come with built-in schedulers or orchestration tools, which might increase operational complexity. By using AWS-native services, you’ll benefit from integrated monitoring options of all AWS services.

Conclusion

In this series, we discussed a number of crucial areas required for implementing a Data Vault 2.0 system at scale, and the Amazon Redshift capabilities and AWS ecosystem that you can use to satisfy those requirements. There are many more Amazon Redshift capabilities and features that will surely come in handy, and we strongly encourage current and prospective customers to reach out to us or other AWS colleagues to delve deeper into Data Vault with Amazon Redshift.


About the Authors

Asser Moustafa is a Principal Analytics Specialist Solutions Architect at AWS based out of Dallas, Texas. He advises customers globally on their Amazon Redshift and data lake architectures, migrations, and visions—at all stages of the data ecosystem lifecycle—starting from the POC stage to actual production deployment and post-production growth.

Philipp Klose is a Global Solutions Architect at AWS based in Munich. He works with enterprise FSI customers and helps them solve business problems by architecting serverless platforms. In his free time, Philipp spends time with his family and enjoys every geek hobby possible.

Saman Irfan is a Specialist Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

What’s cooking with Amazon Redshift at AWS re:Invent 2023

Post Syndicated from Sunaina Abdul Salah original https://aws.amazon.com/blogs/big-data/whats-cooking-with-amazon-redshift-at-aws-reinvent-2023/

AWS re:Invent is a powerhouse of a learning event and every time I have attended, I’ve been amazed at its scale and impact. There are keynotes packed with announcements from AWS leaders, training and certification opportunities, access to more than 2,000 technical sessions, an elaborate expo, executive summits, after-hours events, demos, and much more. The analytics team is waiting to engage with our customers and partners at the analytics kiosk in the expo hall.

For the latest and greatest with Amazon Redshift, our cloud data warehousing solution, I’ve curated a few must-attend sessions. The Amazon Redshift team will be there meeting with customers and discussing the newest announcements. And there are plenty of announcements you don’t want to miss! My personal favorite in this session list is the What’s new in Amazon Redshift session, led by Neeraja Rentachintala, Director of Product Management for Amazon Redshift, and Matt Sandler, Senior Director of Data & Analytics at McDonald’s, which gives a broad view of everything the team has been working on. But there are so many good ones on the list. Product and data engineers, data analysts, data scientists, data and technology leaders, and line of business owners can take their pick of sessions!

And here’s a tip—reserve your seat now so you can get into the session you want and avoid waiting in long lines. To access the session catalog and reserve your seat for one of our sessions, you must be registered for re:Invent. Register now!

The big stage

Adam Selipsky, Chief Executive Officer of Amazon Web Services – Keynote
Tuesday, 11/28 | 8:30 AM – 10:30 AM

Join Adam Selipsky, CEO of Amazon Web Services, as he shares his perspective on cloud transformation and highlights the latest innovations from data to infrastructure to AI and ML, helping you achieve your goals faster.

Swami Sivasubramanian, Vice President of Data and AI, AWS – Keynote
Wednesday, 11/29 | 8:30 AM – 10:30 AM

Swami’s session is all about data, so if that’s the thing that makes your world go round, then this is the big stage session for you. Generative AI is augmenting our productivity and creativity in new ways, while also being fueled by massive amounts of enterprise data and human intelligence, and Swami’s keynote will lay the groundwork for your understanding of AWS offerings in this realm.

Peter DeSantis, Senior Vice President of AWS Utility Computing – Keynote
Monday, 11/27 | 7:30 PM – 9:00 PM

Join Peter DeSantis, Senior Vice President of AWS Utility Computing, as he continues the Monday Night Live tradition of diving deep into the engineering that powers AWS services, and keep an eye out for any Amazon Redshift related announcements.

Innovation Talks

G2 Krishnamoorthy, Vice President of AWS Analytics
ANT219-INT | Data drives transformation: Data foundations with AWS analytics
Thursday, 11/30 | 2:00 PM – 3:00 PM

G2’s session discusses strategies for embedding analytics into your applications and ideas for building a data foundation that supports your business initiatives. With new capabilities for self-service and straightforward builder experiences, you can democratize data access for line of business users, analysts, scientists, and engineers. Hear also from Adidas, GlobalFoundries, and University of California, Irvine.

Sessions

We’ve got something for everyone, from deep learning sessions on building multi-cluster architectures and AWS’s zero-ETL approach to near-real-time analytics and ML in Amazon Redshift, powering self-service analytics for your organization and much more. Sessions can be big room breakout sessions, usually with a customer speaker, or more intimate and technical chalk talks, workshops, or builder sessions. Take a look, plan your week, and soak in the learning!

Monday, November 27

  • 1:00 PM – 2:00 PM | ANT211 (Level 200 Break out session) | Powering self-service & near real-time analytics with Amazon Redshift
  • 1:00 PM – 2:00 PM | ANT404 (Level 400 Chalk Talk) | Build a secure and highly available data strategy with Amazon Redshift

Tuesday, November 28

  • 11:30 AM – 12:30 PM | ANT325 (Level 300 Breakout session) | Amazon Redshift: A decade of innovation in cloud data warehousing
  • 11:30 AM – 12:30 PM | ANT328 (Level 300 Chalk talk) | Accessing open table formats for superior data lake analytics
  • 2:00 PM – 4:00 PM | ANT310 (Workshop) | Share data across Regions & organizations for near real-time insights
  • 2:30 PM – 3:30 PM | ANT322 (Level 300 Breakout session) | Modernize analytics by moving your data warehouse to Amazon Redshift
  • 2:30 PM – 3:30 PM | ANT335 (Level 300 Chalk talk) | Get the most out of your data warehousing workloads
  • 5:30 PM – 6:30 PM | ANT349 (Level 300 Chalk talk) | Advanced real-time analytics and ML in your data warehouse

Wednesday, November 29

  • 12:00 PM – 2:00 PM | ANT307 (Workshop) | Connect and analyze all your data with zero-ETL approaches
  • 4:00 PM – 5:00 PM | ANT308 (Builder session) | Build large-scale transactional data lakes with Apache Iceberg on AWS

Thursday, November 30

  • 12:30 PM – 1:30 PM | ANT326 (Level 300) | Set up a zero-ETL-based analytics architecture for your organizations
  • 2:00 PM – 3:00 PM | ANT203 (Level 200 breakout session) | What’s new in Amazon Redshift
  • 2:30 PM – 3:30 PM | ANT342 (Level 300 chalk talk) | Scaling analytics with data lakes and data warehouses

Analytics kiosk in the AWS Village

When you are onsite, stop by the Analytics kiosk in the AWS re:Invent expo hall. Connect with experts, meet with book authors on data warehousing and analytics (at the Meet the Authors event on November 29 and 30, 3:00 PM – 4:00 PM), win prizes, and learn all about the latest innovations from our AWS Analytics services.

AWS Analytics Superheroes

We are excited to introduce the 2023 AWS Analytics Superheroes at this year’s re:Invent conference! Are you lightning fast and ultra-adaptable like Wire Weaver? A shapeshifting guardian and protector of data like Data Lynx? Or a digitally clairvoyant master of data insights like Cloud Sight? Join us at the Analytics kiosk to learn more about which AWS Analytics Superhero you are and receive superhero swag!

Get ready for re:Invent

To recap, here is what you can do to prepare:

  • Register for re:Invent 2023. Reserve your seats for your favorite sessions.
  • Check back for new sessions related to keynote announcements; they open up in the catalog after each keynote ends. Those are the hot new launches!
  • Follow the AWS Databases & Analytics LinkedIn channel to stay up to date on everything re:Invent related.

See you in Vegas!


About the author

Sunaina Abdul Salah leads product marketing for Amazon Redshift. She focuses on educating customers about the impact of data warehousing and analytics and sharing AWS customer stories. She has a deep background in marketing and GTM functions in the B2B technology and cloud computing domains. Outside of work, she spends time with her family and friends and enjoys traveling.

How Wallapop improved performance of analytics workloads with Amazon Redshift Serverless and data sharing

Post Syndicated from Eduard Lopez original https://aws.amazon.com/blogs/big-data/how-wallapop-improved-performance-of-analytics-workloads-with-amazon-redshift-serverless-and-data-sharing/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze all your data at petabyte scale, using standard SQL and your existing business intelligence (BI) tools. Today, tens of thousands of customers run business-critical workloads on Amazon Redshift.

Amazon Redshift Serverless makes it effortless to run and scale analytics workloads without having to manage any data warehouse infrastructure.

Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use.

This is ideal when it’s difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. As your demand evolves with new workloads and more concurrent users, Redshift Serverless automatically provisions the right compute resources, and your data warehouse scales seamlessly and automatically.

Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Redshift data warehouse with another Redshift data warehouse (provisioned or serverless) across accounts and Regions without needing to copy, replicate, or move data from one data warehouse to another.

Amazon Redshift data sharing enables you to evolve your Amazon Redshift deployment architectures into a hub-and-spoke or data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, and onboard new use cases, all without the complexity of data movement and data copies.

In this post, we show how Wallapop adopted Redshift Serverless and data sharing to modernize their data warehouse architecture.

Wallapop’s initial data architecture platform

Wallapop is a Spanish ecommerce marketplace company focused on second-hand items, founded in 2013. Every day, they receive around 300,000 new items from buyers to be added to their catalog. The marketplace can be accessed via mobile app or website.

The average monthly traffic is around 15 million active users. Since its creation in 2013, it has reached more than 40 million downloads and more than 700 million products have been listed.

Amazon Redshift plays a central role in their data platform on AWS for ingestion, ETL (extract, transform, and load), machine learning (ML), and consumption workloads that run their insight consumption to drive decision-making.

The initial architecture is composed of one main Redshift provisioned cluster that handled all the workloads, as illustrated in the following diagram. Their cluster was deployed with 8 nodes ra3.4xlarge and concurrency scaling enabled.

Wallapop had three main areas to improve in their initial data architecture platform:

  • Workload isolation challenges with growing data volumes and new workloads running in parallel
  • Administrative burden on data engineering teams to manage the concurrent workloads, especially at peak times
  • Cost-performance ratio while scaling during peak periods

The areas of improvement mainly focused on performance of data consumption workloads along with the BI and analytics consumption tool, where high query concurrency was impacting the final analytics preparation and its insights consumption.

Solution overview

To improve their data platform architecture, Wallapop designed and built a new distributed approach with Amazon Redshift with the support of AWS.

The cluster size of the provisioned data warehouse didn’t change. What changed was limiting concurrency scaling usage to 1 hour, which falls within the free concurrency scaling credits earned for every 24 hours that the main cluster runs. The following diagram illustrates the target architecture.

Solution details

The new data platform architecture combines Redshift Serverless and provisioned data warehouses with Amazon Redshift data sharing, helping Wallapop improve their overall Amazon Redshift experience with improved ease of use, performance, and optimized costs.

Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). RPUs are resources used to handle workloads. You can adjust the base capacity setting from 8 RPUs to 512 RPUs in units of 8 (8, 16, 24, and so on).

The new architecture uses a Redshift provisioned cluster with RA3 nodes to run their constant and write workloads (data ingestion and transformation jobs). For cost-efficiency, Wallapop is also benefiting from Redshift reserved instances to optimize on costs for these known, predictable, and steady workloads. This cluster acts as the producer cluster in their distributed architecture using data sharing, meaning the data is ingested into the storage layer of Amazon Redshift—Redshift Managed Storage (RMS).

For the consumption part of the data platform architecture, the data is shared with different Redshift Serverless endpoints to meet the needs for different consumption workloads.

Data sharing provides workload isolation. With this architecture, Wallapop achieves better workload isolation and ensures that only the right data is shared with the different consumption applications. Additionally, this approach avoids data duplication on the consumer side, which optimizes costs and allows better governance processes, because they only have to manage a single version of the data warehouse data instead of different copies or versions of it.

Redshift Serverless is used for the consumer part of the data platform architecture to meet those predictable and unpredictable, non-steady, and often demanding analytics workloads, such as their CI/CD jobs and the BI and analytics consumption workloads coming from their data visualization application. Redshift Serverless also helps them achieve better workload isolation due to its managed auto scaling feature, which makes sure performance is consistently good for these unpredictable workloads, even at peak times. It also provides a better user experience for the Wallapop data platform team, thanks to the autonomics capabilities that Redshift Serverless provides.

The new solution combining Redshift Serverless and data sharing allowed Wallapop to achieve better performance, cost, and ease of use.

Eduard Lopez, Wallapop Data Engineering Manager, shared the improved experience of analytics users: “Our analyst users are telling us that now ‘Looker flies.’ Insights consumption went up as a result of it without increasing costs.”

Evaluation of outcome

Wallapop started this re-architecture effort by first testing the isolation of their BI consumption workload with Amazon Redshift data sharing and Redshift Serverless, with the support of AWS. The workload was tested using different base RPU configurations to measure the base capacity and resources in Redshift Serverless; the base RPU for Redshift Serverless can range from 8 to 512. Wallapop tested their BI workload with two configurations (32 base RPU and 64 base RPU) after enabling data sharing from their Redshift provisioned cluster to ensure the serverless endpoints had access to the necessary datasets.

Based on the results measured 1 week before testing, the main area for improvement was the queries that took longer than 10 seconds to complete (52%), represented by the yellow, orange, and red areas of the following chart, as well as the long-running queries represented by the red area (over 600 seconds, 9%).

The first test of this workload with Redshift Serverless using a 64 base RPU configuration immediately showed performance improvement results: the queries running longer than 10 seconds were reduced by 38% and the long-running queries (over 120 seconds) were almost completely eliminated.

Javier Carbajo, Wallapop Data Engineer, says, “Providing a service without downtime or loading times that are too long was one of our main requirements since we couldn’t have analysts or stakeholders without being able to consult the data.”

Following the first set of results, Wallapop also tested a Redshift Serverless configuration using 32 base RPU to compare the results and select the configuration that could offer them the best price-performance for this workload. With this configuration, the results were similar to the previous test run on Redshift Serverless with 64 base RPU (still showing significant performance improvement over the original results). Based on the tests, this configuration was selected for the new architecture.

Gergely Kajtár, Wallapop Data Engineer, says, “We noticed a significant increase in the daily workflows’ stability after the change to the new Redshift architecture.”

Following this first workload, Wallapop has continued expanding their Amazon Redshift distributed architecture with CI/CD workloads running on a separated Redshift Serverless endpoint using data sharing with their Redshift provisioned (RA3) cluster.

“With the new Redshift architecture, we have noticed remarkable improvements both in speed and stability. That has translated into an increase of 2 times in analytical queries, not only by analysts and data scientists but from other roles as well such as marketing, engineering, C-level, etc. That proves that investing in a scalable architecture like Redshift Serverless has a direct consequence on accelerating the adoption of data as decision-making driver in the organization.”

– Nicolás Herrero, Wallapop Director of Data & Analytics.

Conclusion

In this post, we showed how Wallapop modernized its data platform with Redshift Serverless and data sharing, and how this platform can help them scale in the future by adding new consumers when new needs or applications require access to the data.

If you’re new to Amazon Redshift, you can explore demos, other customer stories, and the latest features at Amazon Redshift. If you’re already using Amazon Redshift, reach out to your AWS account team for support, and learn more about what’s new with Amazon Redshift.


About the Authors

Eduard Lopez is the Data Engineer Manager at Wallapop. He is a software engineer with over 6 years of experience in data engineering, machine learning, and data science.

Daniel Martinez is a Solutions Architect in Iberia Digital Native Businesses (DNB), part of the worldwide commercial sales organization (WWCS) at AWS.

Jordi Montoliu is a Sr. Redshift Specialist in EMEA, part of the worldwide specialist organization (WWSO) at AWS.

Ziad Wali is an Acceleration Lab Solutions Architect at Amazon Web Services. He has over 10 years of experience in databases and data warehousing, where he enjoys building reliable, scalable, and efficient solutions. Outside of work, he enjoys sports and spending time in nature.

Semir Naffati is a Sr. Redshift Specialist Solutions Architect in EMEA, part of the worldwide specialist organization (WWSO) at AWS.

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

Post Syndicated from Anshuman Varshney original https://aws.amazon.com/blogs/big-data/how-gameskraft-uses-amazon-redshift-data-sharing-to-support-growing-analytics-workloads/

This post is co-written by Anshuman Varshney, Technical Lead at Gameskraft.

Gameskraft is one of India’s leading online gaming companies, offering gaming experiences across a variety of categories such as rummy, ludo, poker, and many more under the brands RummyCulture, Ludo Culture, Pocket52, and Playship. Gameskraft holds the Guinness World Record for organizing the world’s largest online rummy tournament, and is one of India’s first gaming companies to build an ISO certified platform.

Amazon Redshift is a fully managed data warehousing service that offers both provisioned and serverless options, making it more efficient to run and scale analytics without having to manage your data warehouse. Amazon Redshift enables you to use SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning (ML) to deliver the best price-performance at scale.

In this post, we show how Gameskraft used Amazon Redshift data sharing along with concurrency scaling and WLM optimization to support its growing analytics workloads.

Amazon Redshift use case

Gameskraft used Amazon Redshift RA3 instances with Redshift Managed Storage (RMS) for their data warehouse. The upstream data pipeline is a robust system that integrates various data sources, including Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK) for handling clickstream events, Amazon Relational Database Service (Amazon RDS) for delta transactions, and Amazon DynamoDB for delta game-related information. Additionally, data is extracted from vendor APIs that include data related to product, marketing, and customer experience. All of this diverse data is then consolidated into the Amazon Simple Storage Service (Amazon S3) data lake before being uploaded to the Redshift data warehouse. These upstream data sources constitute the data producer components.

Gameskraft used Amazon Redshift workload management (WLM) to manage priorities within workloads, with higher priority being assigned to the extract, transform, and load (ETL) queue that runs critical jobs for data producers. The downstream consumers consist of business intelligence (BI) tools, with multiple data science and data analytics teams having their own WLM queues with appropriate priority values.

As Gameskraft’s portfolio of gaming products increased, it led to an approximate five-times growth of dedicated data analytics and data science teams. Consequently, there was a fivefold rise in data integrations and a fivefold increase in ad hoc queries submitted to the Redshift cluster. These query patterns and concurrency were unpredictable in nature. Also, over time the number of BI dashboards (both scheduled and live) increased, which contributed to more queries being submitted to the Redshift cluster.

With this growing workload, Gameskraft was observing the following challenges:

  • Increase in critical ETL job runtime
  • Increase in query wait time in multiple queues
  • Impact of unpredictable ad hoc query workloads across other queues in the cluster

Gameskraft was looking for a solution that would help them mitigate all these challenges, and provide flexibility to scale ingestion and consumption workload processing independently. Gameskraft was also looking for a solution that would cater to their unpredictable future growth.

Solution overview

Gameskraft tackled these challenges in a phased manner using Amazon Redshift concurrency scaling, Amazon Redshift data sharing, Amazon Redshift Serverless, and Redshift provisioned clusters.

Amazon Redshift concurrency scaling lets you easily support thousands of concurrent users and concurrent queries, with consistently fast query performance. As concurrency increases, Amazon Redshift automatically adds query processing power in seconds to process queries without any delays. When the workload demand subsides, this extra processing power is automatically removed, so you pay only for the time when concurrency scaling clusters are in use. Amazon Redshift offers 1 hour of free concurrency scaling credits per active cluster per day, allowing you to accumulate 30 hours of free credits per month.

Gameskraft enabled concurrency scaling in selective WLM queues to ease query wait time in those queues during peak usage, which also helped reduce ETL query runtime. In the prior setup, we maintained specialized queues for ETL, reporting, data science, ad hoc queries, and BI tools. To prevent blockages for other processes, we imposed minimal query timeouts using query monitoring rules (QMR). However, both the ETL and BI tools queues were persistently occupied, impacting the performance of the remaining queues.

Concurrency scaling helped alleviate query wait time in the ad hoc query queue. Still, the challenge of downstream consumption workloads (like ad hoc queries) impacting ingestion persisted, and Gameskraft was looking for a solution to manage these workloads independently.

The following table summarizes the workload management configuration prior to the solution implementation.

Queue | Usage | Concurrency Scaling Mode | Concurrency on Main / Memory % | Query Monitoring Rules
etl | For ingestion from multiple data integrations | off | auto | Stop action on: Query runtime (seconds) > 2700
report | For scheduled reporting purposes | off | auto | Stop action on: Query runtime (seconds) > 600
datascience | For data science workloads | off | auto | Stop action on: Query runtime (seconds) > 300
readonly | For ad hoc and day-to-day analysis | auto | auto | Stop action on: Query runtime (seconds) > 120
bi_tool | For BI tools | auto | auto | Stop action on: Query runtime (seconds) > 300

To achieve flexibility in scaling, Gameskraft used Amazon Redshift data sharing. Amazon Redshift data sharing allows you to extend the ease of use, performance, and cost benefits offered by a single cluster to multi-cluster deployments while being able to share data. Data sharing enables instant, granular, and fast data access across Amazon Redshift data warehouses without the need to copy or move it. Data sharing provides live access to data so that users always observe the most up-to-date and consistent information as it’s updated in the data warehouse. You can securely share live data across provisioned clusters and serverless endpoints within an AWS account, across AWS accounts, and across AWS Regions.

Data sharing builds on Redshift Managed Storage (RMS), which underpins RA3 provisioned clusters and serverless workgroups, allowing multiple warehouses to query the same data with separate isolated compute. Queries accessing shared data run on the consumer cluster and read data from RMS directly without impacting the performance of the producer cluster. You can now rapidly onboard workloads with diverse data access patterns and SLA requirements and not be concerned about resource contention.

We chose to run all ETL workloads in the primary producer cluster to manage ETL independently. We used data sharing to share read-only access to data with a data science serverless workgroup, a BI provisioned cluster, an ad hoc query provisioned cluster, and a data integration serverless workgroup. Teams using these separate compute resources could then query the same data without copying the data between the producer and consumer. Additionally, we introduced concurrency scaling to the consumer queues, prioritizing BI tools, and extended the timeout for the remaining queues. These modifications notably enhanced overall efficiency and throughput.

The following table summarizes the new workload management configuration for the producer cluster.

Queue | Usage | Concurrency Scaling Mode | Concurrency on Main / Memory % | Query Monitoring Rules
etl | For ingestion from multiple data integrations | auto | auto | Stop action on: Query runtime (seconds) > 3600

The following table summarizes the new workload management configuration for the consumer cluster.

Queue | Usage | Concurrency Scaling Mode | Concurrency on Main / Memory % | Query Monitoring Rules
report | For scheduled reporting purposes | off | auto | Stop action on: Query runtime (seconds) > 1200; Query queue time (seconds) > 1800; Spectrum scan row count (rows) > 100000; Spectrum scan (MB) > 3072
datascience | For data science workloads | off | auto | Stop action on: Query runtime (seconds) > 600; Query queue time (seconds) > 1800; Spectrum scan row count (rows) > 100000; Spectrum scan (MB) > 3072
readonly | For ad hoc and day-to-day analysis | auto | auto | Stop action on: Query runtime (seconds) > 900; Query queue time (seconds) > 3600; Spectrum scan (MB) > 3072; Spectrum scan row count (rows) > 100000
bi_tool_live | For live BI tools | auto | auto | Stop action on: Query runtime (seconds) > 900; Query queue time (seconds) > 1800; Spectrum scan (MB) > 1024; Spectrum scan row count (rows) > 1000
bi_tool_schedule | For scheduled BI tools | auto | auto | Stop action on: Query runtime (seconds) > 1800; Query queue time (seconds) > 3600; Spectrum scan (MB) > 1024; Spectrum scan row count (rows) > 1000

Solution implementation

Gameskraft is dedicated to maintaining uninterrupted system operations, prioritizing seamless solutions over downtime. In pursuit of this principle, strategic measures were undertaken to ensure a smooth migration process towards enabling data sharing, which included the following steps:

  • Planning:
    • Replicating users and groups to the consumer, to mitigate potential access complications for analytics, data science, and BI teams.
    • Establishing a comprehensive setup within the consumers, encompassing essential components like external schemas for Amazon Redshift Spectrum.
    • Fine-tuning WLM configurations tailored to the consumer’s requirements.
  • Implementation:
    • Introducing insightful monitoring dashboards in Grafana for CPU utilization, read/write throughputs, IOPS, and latencies specific to the consumer cluster, enhancing oversight capabilities.
    • Changing all interleaved key tables on the producer cluster to compound sort key tables to seamlessly transition data.
    • Creating an external schema from the data share database on the consumer, mirroring that of the producer cluster with identical names. This approach minimizes the need for making query adjustments in multiple locations.
  • Testing:
    • Conducting an internal week-long regression testing and auditing process to meticulously validate all data points by running both the original workload and twice that workload.
  • Final changes:
    • Updating the DNS record for the cluster endpoint, which included mapping the consumer cluster’s endpoint to the same domain previously used for the producer cluster’s endpoint, to streamline connections and avoid making changes in multiple places.
    • Ensuring data security and access control by revoking group and user privileges from the producer cluster.

The following diagram illustrates the Gameskraft Amazon Redshift data sharing architecture.

This diagram illustrates the Gameskraft Amazon Redshift data sharing architecture

The following diagram illustrates the Amazon Redshift data sharing architecture with multiple consumer clusters.

This diagram illustrates the Amazon Redshift data sharing architecture with multiple consumer clusters.

With data sharing implementation, Gameskraft was able to isolate the producer and consumer workloads. Data sharing also provided the flexibility to independently scale the producer and consumer data warehouses.

The implementation of the overall solution helped Gameskraft support more frequent data refresh (43% reduction in overall job runtime) for its ETL workload, which runs on the producer cluster, along with capabilities to support a growing (five-times increase in users, BI workloads, and ad hoc queries) and unpredictable consumer workload.

The following dashboards show some of the critical ETL pipeline runtimes (before solution implementation and after solution implementation).

The first shows the delta P1/P2/P3 job runs before and after solution implementation (duration in minutes).

This screenshot shows the delta P1/P2/P3 job runs before and after solution implementation (duration in minutes).

The following shows the daily event ETL P1/P2/P3 job runs before and after solution implementation (duration in minutes).

This screenshot shows the daily event ETL P1/P2/P3 job runs before and after solution implementation (duration in minutes).

Key considerations

Gameskraft embraces a modern data architecture, with the data lake residing in Amazon S3. To grant seamless access to the data lake, we use the innovative capabilities of Redshift Spectrum, which is a bridge between the data warehouse (Amazon Redshift) and data lake (Amazon S3). It allows you to perform data transformations and analysis directly on data stored in Amazon S3, without having to duplicate the data into your Redshift cluster.

Gameskraft had a couple of key learnings while implementing this data sharing solution:

  • First, as of this writing, Amazon Redshift data sharing doesn’t support adding external schemas, tables, or late-binding views on external tables to the data share. To enable this, we have created an external schema as a pointer to the AWS Glue database. The same AWS Glue database is referenced in the external schema on the consumer side (a minimal sketch of this follows the list).
  • Second, Amazon Redshift doesn’t support sharing tables with interleaved sort keys and views that refer to tables with interleaved sort keys. Due to the presence of interleaved sort keys across numerous tables and views, a prerequisite for inclusion into the data share involves revising the sort key configuration to use compound sort keys.
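
The following is a minimal sketch of that external schema workaround, submitted through the Redshift Data API; the schema, AWS Glue database, IAM role, and warehouse identifiers are placeholders.

import boto3

# Create the same external schema on the producer and each consumer so that
# queries referencing data lake tables resolve identically on either side.
sql = (
    "CREATE EXTERNAL SCHEMA IF NOT EXISTS datalake "
    "FROM DATA CATALOG DATABASE 'edl_glue_db' "
    "IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-spectrum-role' "
    "CREATE EXTERNAL DATABASE IF NOT EXISTS"
)

data_api = boto3.client("redshift-data")
for cluster_id in ["producer-cluster", "consumer-cluster"]:  # placeholders
    data_api.execute_statement(
        ClusterIdentifier=cluster_id,
        Database="dev",      # placeholder
        DbUser="awsuser",    # placeholder
        Sql=sql,
    )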

Conclusion

In this post, we saw how Gameskraft used data sharing and concurrency scaling in Amazon Redshift with a producer and consumer cluster architecture to achieve the following:

  • Reduce query wait time for all queues in the producer and consumer
  • Scale the producer and consumer independently based on workload and queue requirements
  • Improve ETL pipeline performance and the data refresh cycle to support more frequent refreshes in the producer cluster
  • Onboard more queues and workloads (BI tools queue, data integration queue, data science queue, downstream team’s queue, ad hoc query queue) in the consumer without impacting the ETL pipeline in the producer cluster
  • Flexibility to use multiple consumers with a mix of provisioned Redshift cluster and Redshift Serverless

These Amazon Redshift features and architecture can help support a growing and unpredictable analytics workload.


About the Authors

Anshuman Varshney is Technical Lead at Gameskraft with a background in both backend and data engineering. He has a proven track record of leading and mentoring cross-functional teams to deliver high-performance, scalable solutions. Apart from work, he relishes moments with his family, indulges in cinematic experiences, and seizes every opportunity to explore new destinations through travel.

Prafulla Wani is an Amazon Redshift Specialist Solution Architect at AWS. He works with AWS customers on analytics architecture designs and Amazon Redshift proofs of concept. In his spare time, he plays chess with his son.

Saurov Nandy is a Solutions Architect at AWS. He works with AWS customers to design and implement solutions that solve complex business problems. In his spare time, he likes to explore new places and indulge in photography and video editing.

Shashank Tewari is a Senior Technical Account Manager at AWS. He helps AWS customers optimize their architectures to achieve performance, scale, and cost efficiencies. In his spare time, he likes to play video games with his kids. During vacations, he likes to trek on mountains and take up adventure sports.

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

Post Syndicated from Preshen Goobiah original https://aws.amazon.com/blogs/big-data/simplifying-data-processing-at-capitec-with-amazon-redshift-integration-for-apache-spark/

This post is cowritten with Preshen Goobiah and Johan Olivier from Capitec.

Apache Spark is a widely-used open source distributed processing system renowned for handling large-scale data workloads. It finds frequent application among Spark developers working with Amazon EMR, Amazon SageMaker, AWS Glue and custom Spark applications.

Amazon Redshift offers seamless integration with Apache Spark, allowing you to easily access your Redshift data on both Amazon Redshift provisioned clusters and Amazon Redshift Serverless. This integration expands the possibilities for AWS analytics and machine learning (ML) solutions, making the data warehouse accessible to a broader range of applications.

With the Amazon Redshift integration for Apache Spark, you can quickly get started and effortlessly develop Spark applications using popular languages like Java, Scala, Python, SQL, and R. Your applications can seamlessly read from and write to your Amazon Redshift data warehouse while maintaining optimal performance and transactional consistency. Additionally, you’ll benefit from performance improvements through pushdown optimizations, further enhancing the efficiency of your operations.

Capitec, South Africa’s biggest retail bank with over 21 million retail banking clients, aims to provide simple, affordable and accessible financial services in order to help South Africans bank better so that they can live better. In this post, we discuss the successful integration of the open source Amazon Redshift connector by Capitec’s shared services Feature Platform team. As a result of utilizing the Amazon Redshift integration for Apache Spark, developer productivity increased by a factor of 10, feature generation pipelines were streamlined, and data duplication was reduced to zero.

The business opportunity

There are 19 predictive models in scope for utilizing 93 features built with AWS Glue across Capitec’s Retail Credit divisions. Feature records are enriched with facts and dimensions stored in Amazon Redshift. Apache PySpark was selected to create features because it offers a fast, decentralized, and scalable mechanism to wrangle data from diverse sources.

These production features play a crucial role in enabling real-time fixed-term loan applications, credit card applications, batch monthly credit behavior monitoring, and batch daily salary identification within the business.

The data sourcing problem

To ensure the reliability of PySpark data pipelines, it’s essential to have consistent record-level data from both dimensional and fact tables stored in the Enterprise Data Warehouse (EDW). These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime.

During feature development, data engineers require a seamless interface to the EDW. This interface allows them to access and integrate the necessary data from the EDW into the data pipelines, enabling efficient development and testing of features.

Previous solution process

In the previous solution, product team data engineers spent 30 minutes per run to manually expose Redshift data to Spark. The steps included the following:

  1. Construct a predicated query in Python.
  2. Submit an UNLOAD query via the Amazon Redshift Data API.
  3. Catalog data in the AWS Glue Data Catalog via the AWS SDK for Pandas using sampling.

This approach posed issues for large datasets, required recurring maintenance from the platform team, and was complex to automate.

Current solution overview

Capitec was able to resolve these problems with the Amazon Redshift integration for Apache Spark within feature generation pipelines. The architecture is defined in the following diagram.

The workflow includes the following steps:

  1. Internal libraries are installed into the AWS Glue PySpark job via AWS CodeArtifact.
  2. An AWS Glue job retrieves Redshift cluster credentials from AWS Secrets Manager and sets up the Amazon Redshift connection (injects cluster credentials, unload locations, file formats) via the shared internal library. The Amazon Redshift integration for Apache Spark also supports using AWS Identity and Access Management (IAM) to retrieve credentials and connect to Amazon Redshift.
  3. The Spark query is translated to an Amazon Redshift optimized query and submitted to the EDW. This is accomplished by the Amazon Redshift integration for Apache Spark.
  4. The EDW dataset is unloaded into a temporary prefix in an Amazon Simple Storage Service (Amazon S3) bucket.
  5. The EDW dataset from the S3 bucket is loaded into Spark executors via the Amazon Redshift integration for Apache Spark.
  6. The EDL dataset is loaded into Spark executors via the AWS Glue Data Catalog.

These components work together to ensure that data engineers and production data pipelines have the necessary tools to implement the Amazon Redshift integration for Apache Spark, run queries, and facilitate the unloading of data from Amazon Redshift to the EDL.

Using the Amazon Redshift integration for Apache Spark in AWS Glue 4.0

In this section, we demonstrate the utility of the Amazon Redshift integration for Apache Spark by enriching a loan application table residing in the S3 data lake with client information from the Redshift data warehouse in PySpark.

The dimclient table in Amazon Redshift contains the following columns:

  • ClientKey – INT8
  • ClientAltKey – VARCHAR50
  • PartyIdentifierNumber – VARCHAR20
  • ClientCreateDate – DATE
  • IsCancelled – INT2
  • RowIsCurrent – INT2

The loanapplication table in the AWS Glue Data Catalog contains the following columns:

  • RecordID – BIGINT
  • LogDate – TIMESTAMP
  • PartyIdentifierNumber – STRING

The Redshift table is read via the Amazon Redshift integration for Apache Spark and cached. See the following code:

import json

import boto3
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the Spark session of the running Glue job
spark = SparkSession.builder.getOrCreate()

# Retrieve the Redshift credentials stored in AWS Secrets Manager
# (the secret is assumed to hold JSON with "username" and "password" keys)
secretsmanager_client = boto3.client('secretsmanager')
secret_manager_response = secretsmanager_client.get_secret_value(
    SecretId='string'  # name or ARN of the Redshift credentials secret
)
secret = json.loads(secret_manager_response['SecretString'])
username = secret['username']
password = secret['password']
url = "jdbc:redshift://redshifthost:5439/database?user=" + username + "&password=" + password

read_config = {
    "url": url,
    "tempdir": "s3://capitec-redshift-temp-bucket/<uuid>/",
    "unload_s3_format": "PARQUET"
}

d_client = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .options(**read_config)
    .option("query", f"select * from edw_core.dimclient")
    .load()
    .where((F.col("RowIsCurrent") == 1) & (F.col("isCancelled") == 0))
    .select(
        F.col("PartyIdentifierNumber"),
        F.col("ClientCreateDate")
    )
    .cache()
)

Loan application records are read in from the S3 data lake and enriched with the dimclient table on Amazon Redshift information:

import pyspark.sql.functions as F
from awsglue.context import GlueContext
from pyspark import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

push_down_predicate = (
    f"meta_extract_start_utc_ms between "
    f"'2023-07-12"
    f" 18:00:00.000000' and "
    f"'2023-07-13 06:00:00.000000'"
)

database_name="loan_application_system"
table_name="dbo_view_loan_applications"
catalog_id = "111122223333"  # placeholder: AWS account ID that owns the Glue Data Catalog

# Selecting only the following columns
initial_select_cols=[
            "RecordID",
            "LogDate",
            "PartyIdentifierNumber"
        ]

d_controller = (glue_ctx.create_dynamic_frame.from_catalog(catalog_id=catalog_id,
                                            database=database_name,
                                            table_name=table_name,
                                            push_down_predicate=push_down_predicate)
                .toDF()
                .select(*initial_select_cols)
                .withColumn("LogDate", F.date_format("LogDate", "yyyy-MM-dd").cast("string"))
                .dropDuplicates())

# Left Join on PartyIdentifierNumber and enriching the loan application record
d_controller_enriched = d_controller.join(d_client, on=["PartyIdentifierNumber"], how="left").cache()

As a result, the loan application record (from the S3 data lake) is enriched with the ClientCreateDate column (from Amazon Redshift).

How the Amazon Redshift integration for Apache Spark solves the data sourcing problem

The Amazon Redshift integration for Apache Spark effectively addresses the data sourcing problem through the following mechanisms:

  • Just-in-time reading – The Amazon Redshift integration for Apache Spark connector reads Redshift tables in a just-in-time manner, ensuring the consistency of data and schema. This is particularly valuable for Type 2 slowly changing dimension (SCD) and timespan accumulating snapshot facts. By combining these Redshift tables with the source system AWS Glue Data Catalog tables from the EDL within production PySpark pipelines, the connector enables seamless integration of data from multiple sources while maintaining data integrity.
  • Optimized Redshift queries – The Amazon Redshift integration for Apache Spark plays a crucial role in converting the Spark query plan into an optimized Redshift query. This conversion process simplifies the development experience for the product team by adhering to the data locality principle. The optimized queries use the capabilities and performance optimizations of Amazon Redshift, ensuring efficient data retrieval and processing from Amazon Redshift for the PySpark pipelines. This helps streamline the development process while enhancing the overall performance of the data sourcing operations.

Gaining the best performance

The Amazon Redshift integration for Apache Spark automatically applies predicate and query pushdown to optimize performance. You can gain further performance improvements by using Parquet, the default format this integration uses when unloading data to Amazon S3.
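As a rough sketch of what this pushdown looks like in practice (the fact table name, columns, and filter values below are hypothetical; read_config is the option map defined earlier), a filter and column selection applied to a Redshift-backed DataFrame are folded into the query that Redshift runs, so only qualifying rows and the selected columns are unloaded to the temporary S3 location:

# Sketch only: edw_core.factloanapplication, its columns, and the filter are
# illustrative. The filter and column selection are pushed down into the
# query that Redshift executes, so only matching rows and the selected
# columns land in the tempdir in Amazon S3.
d_fact = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .options(**read_config)
    .option("dbtable", "edw_core.factloanapplication")
    .load()
    .where(F.col("ApplicationDate") >= "2023-07-01")
    .select("PartyIdentifierNumber", "ApplicationDate", "LoanAmount")
)
d_fact.explain()  # inspect the plan to confirm how the read is executed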

For additional details and code samples, refer to New – Amazon Redshift Integration with Apache Spark.

Solution benefits

The adoption of the integration yielded several significant benefits for the team:

  • Enhanced developer productivity – The PySpark interface provided by the integration boosted developer productivity by a factor of 10, enabling smoother interaction with Amazon Redshift.
  • Elimination of data duplication – Duplicate and AWS Glue cataloged Redshift tables in the data lake were eliminated, resulting in a more streamlined data environment.
  • Reduced EDW load – The integration facilitated selective data unloading, minimizing the load on the EDW by extracting only the necessary data.

By using the Amazon Redshift integration for Apache Spark, Capitec has paved the way for improved data processing, increased productivity, and a more efficient feature engineering ecosystem.

Conclusion

In this post, we discussed how the Capitec team successfully implemented the Amazon Redshift integration for Apache Spark to simplify their feature computation workflows. They emphasized the importance of using decentralized and modular PySpark data pipelines for creating predictive model features.

Currently, the Amazon Redshift integration for Apache Spark is utilized by 7 production data pipelines and 20 development pipelines, showcasing its effectiveness within Capitec’s environment.

Moving forward, the shared services Feature Platform team at Capitec plans to expand the adoption of the Amazon Redshift integration for Apache Spark in different business areas, aiming to further enhance data processing capabilities and promote efficient feature engineering practices.

For additional information on using the Amazon Redshift integration for Apache Spark, refer to the following resources:


About the Authors

Preshen Goobiah is the Lead Machine Learning Engineer for the Feature Platform at Capitec. He is focused on designing and building Feature Store components for enterprise use. In his spare time, he enjoys reading and traveling.

Johan Olivier is a Senior Machine Learning Engineer for Capitec’s Model Platform. He is an entrepreneur and problem-solving enthusiast. He enjoys music and socializing in his spare time.

Sudipta Bagchi is a Senior Specialist Solutions Architect at Amazon Web Services. He has over 12 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket. Connect with him on LinkedIn.

Syed Humair is a Senior Analytics Specialist Solutions Architect at Amazon Web Services (AWS). He has over 17 years of experience in enterprise architecture focusing on Data and AI/ML, helping AWS customers globally to address their business and technical requirements. You can connect with him on LinkedIn.

Vuyisa Maswana is a Senior Solutions Architect at AWS, based in Cape Town. Vuyisa has a strong focus on helping customers build technical solutions to solve business problems. He has supported Capitec in their AWS journey since 2019.