Tag Archives: Amazon Redshift

AWS Week In Review – July 18, 2022

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-july-18-2022/

Last week, AWS Summit New York was held in person at the Javits Center with thousands of attendees and over 100 sponsors and partners. During the keynote, Martin Beeby, AWS Principal Developer Advocate, talked about how innovations in cloud infrastructure enable customers to adapt to challenges and seize new opportunities. It included Liz Fong-Jones’s great story of migrating to AWS Graviton at Honeycomb and Elliott Cordo’s story of improving pharmacy experiences at Capsule using AWS analytics and machine learning services.

Watch the full keynote video!

A Recap of AWS Summit NY Announcements
During the keynote, we announced the general availability of some new services:

Amazon Redshift Serverless – This serverless option lets you analyze data at any scale without having to manage data warehouse infrastructure. You can now create multiple serverless endpoints per AWS account and Region using namespaces and workgroups, and benefit from reduced serverless compute costs compared to the preview. To learn more, check out Danilo’s blog post, this demo video, and the latest episode of The Official AWS Podcast. We also introduced row-level security (RLS), which implements fine-grained access control to the rows in tables, and automated materialized views to lower query latency for repeatable workloads.
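
As a quick illustration, RLS is configured with SQL policy statements; the following is a minimal sketch (the table, column, role, and policy names are hypothetical) that restricts a role to rows where region = 'US':

create rls policy policy_us_rows
with (region varchar(20))
using (region = 'US');

attach rls policy policy_us_rows on sales to role analyst;
alter table sales row level security on;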

AWS Cloud WAN – This new network service makes it easy to build and operate wide area networks (WANs) that connect your data centers and branch offices, as well as multiple VPCs in multiple AWS Regions. To learn more, read Seb’s blog post.

Amazon DevOps Guru’s Log Anomaly Detection and Recommendations – This new feature identifies anomalies such as increased latency, error rates, and resource constraints within your app and then sends alerts with a description and actionable recommendations for remediation. To learn more, see Donnie’s blog post, his first as a News Blog writer.

Last Week’s Launches
Here are some other launches that caught my attention last week:

AWS AppConfig, a feature of AWS Systems Manager, makes it easy for customers to quickly and safely configure, validate, and deploy feature flags and application configuration. Now, we have announced AWS AppConfig Extensions, a new capability that allows customers to enhance and extend the capabilities of feature flags and dynamic runtime configuration data.

Available extensions at launch include AppConfig Notification extensions that push messages about configuration updates to Amazon EventBridge, Amazon SNS, or Amazon SQS, and a Jira extension that tracks feature flag changes in AppConfig as Atlassian Jira issues. To get started, read Announcing AWS AppConfig Extensions and AppConfig Extensions.

Amazon VPC Flow Logs for Transit Gateway is a new capability that allows customers to gain deeper visibility and insights into network traffic on AWS Transit Gateway. With this feature, Transit Gateway can export detailed information, such as source/destination IPs, ports, protocols, traffic counters, timestamps, and various metadata for all network flows traversing the Transit Gateway. To learn more, read Introducing VPC Flow Logs for AWS Transit Gateway and Logging network traffic using Transit Gateway Flow Logs.

AWS Lambda Powertools for TypeScript is an open-source developer library that can help you incorporate Well-Architected Serverless best practices focusing on three observability features: distributed tracing (Tracer), structured logging (Logger), and asynchronous business and application metrics (Metrics). Powertools is also available in the Python and Java programming languages. To learn more, see the blog post Simplifying serverless best practices with AWS Lambda Powertools for TypeScript. You can submit feedback, ideas, and issues directly on our GitHub project.

AWS re:Post is a vibrant Q&A community that helps you become even more successful on AWS. You can now add a profile picture or avatar to your account and add inline images such as diagrams or screenshots to support your questions or answers. Add your profile picture and start using inline images today!

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some news items, blog posts, and video series for you to check out:

In July 2021, we notified users about the end of support for Internet Explorer 11, which takes effect on July 31, 2022. The browser will no longer be supported in the AWS Management Console or in web-based services such as Amazon QuickSight, Amazon Chime, Amazon Honeycode, and some other AWS websites. After that date, we can no longer guarantee that features and webpages will function properly on IE 11. For more information, please visit AWS Supported Browsers.

In fall 2021, we began offering a free multi-factor authentication (MFA) security key to AWS account owners in the United States. Now eligible customers can order the free MFA security key through the ordering portal in the AWS Management Console. At this time, only U.S.-based AWS account root users who have spent more than $100 each month over the past 3 months are eligible to place an order. For more information, see our Free MFA Security Key page.

Amazon’s Machine Learning University expands with MLU Explains, a public website containing visual essays that incorporate fun animations and scrollytelling to explain machine learning concepts in an accessible manner. The following animation teaches the concepts of data splitting in machine learning using an example model that attempts to determine whether animals are cats or dogs. To learn more, read the Amazon Science blog post.

This is My Architecture is a video series that showcases innovative architectural solutions on the AWS Cloud by customers and partners. In June and July, over 15 episodes were added, including ones featuring GoDaddy, Riot Games, and Hudl. Each episode examines the most interesting and technically creative elements of each cloud architecture.

Upcoming AWS Events in August
Check your calendars and sign up for these AWS events:

AWS Summit – Registration is open for upcoming in-person AWS Summits that might be close to you in August: Sao Paulo (August 3–4), Anaheim (August 18), Taiwan (August 10–11), Chicago (August 28), and Canberra (August 31).

AWS Innovate – Data Edition – On August 23, learn how a modern data strategy can support your present and future use cases, including steps to build an end-to-end data solution to store and access, analyze and visualize, and even predict.

AWS Innovate – For Every Application Edition – On August 25, learn about a wide selection of AWS solutions across compute, storage, networking, hybrid, and edge infrastructure to help you scale application resources seamlessly and optimally.

Although these two Innovate events will be held in Asia Pacific and Japan time zones, you can view on-demand videos for two months following your registration.

If you’re interested in learning modern development practices live in New York City, I recommend joining AWS Solutions Day on August 10. I love that its advanced topics focus on building new web apps with Java, JavaScript, TypeScript, and GraphQL.

If you’re interested in learning AWS fundamentals and preparing for AWS Certifications, there are several virtual events in August, such as AWS Cloud Practitioner Essentials Day, AWS Technical Essentials Day, and Exam Readiness for AWS Certifications.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 2

Post Syndicated from Suresh Patnam original https://aws.amazon.com/blogs/big-data/part-2-migrate-a-large-data-warehouse-from-greenplum-to-amazon-redshift-using-aws-sct/

In this second post of a multi-part series, we share best practices for choosing the optimal Amazon Redshift cluster and data architecture, converting stored procedures, using compatible functions and queries widely needed for SQL conversions, and optimizing the length of data types for table columns. You can check out the first post of this series for guidance on planning, running, and validating a large-scale data warehouse migration from Greenplum to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT).

Choose your optimal Amazon Redshift cluster

Amazon Redshift has two types of clusters: provisioned and serverless. For provisioned clusters, you need to set them up with the required compute resources. Amazon Redshift Serverless can run high-performance analytics in the cloud at any scale. For more information, refer to Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure.

An Amazon Redshift cluster consists of nodes. Each cluster has a leader node and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query run plans. The leader node then coordinates the parallel run of these plans with the compute nodes and aggregates the intermediate results from these nodes. It then returns the results to the client applications.

When determining your type of cluster, consider the following:

  • Estimate the compressed size of your input data, the vCPUs required, and the expected performance. As of this writing, we recommend the Amazon Redshift RA3 instance with managed storage, which scales compute and storage independently for fast query performance.
  • Amazon Redshift provides an automated “Help me choose” option that recommends a cluster configuration based on the size of your data.
  • A main advantage of a cloud Amazon Redshift data warehouse is that you’re no longer locked into hardware purchases the way you are with old guard data warehouses. For faster innovation, you have the option to try different cluster configurations and choose the one that is optimal in terms of performance and cost.
  • At the time of development or pilot, you can usually start with a smaller number of nodes. As you move to production, you can adjust the number of nodes based on your usage pattern. When right-sizing your clusters, we recommend choosing reserved instances to cut down the cost even further. The public-facing utility Simple Replay can help you determine performance against different cluster types and sizes by replaying the customer workload. For provisioned clusters, if you’re planning to use the recommended RA3 instances, you can compare different node types to determine the right instance type.
  • Based on your workload pattern, Amazon Redshift supports resize, pause and stop, and concurrency scaling of the cluster. Amazon Redshift workload management (WLM) enables effective and flexible management of memory and query concurrency.

Create data extraction tasks with AWS SCT

With AWS SCT extraction agents, you can migrate your source tables in parallel. These extraction agents authenticate using a valid user on the data source, allowing you to adjust the resources available for that user during the extraction. AWS SCT agents process the data locally and upload it to Amazon Simple Storage Service (Amazon S3) through the network (via AWS Direct Connect). We recommend having a consistent network bandwidth between your Greenplum machine where the AWS SCT agent is installed and your AWS Region.

If you have tables with around 20 million rows or 1 TB in size, you can use the virtual partitioning feature of AWS SCT to extract data from those tables. This creates several sub-tasks and parallelizes the data extraction process for the table. Therefore, we recommend creating two groups of tasks for each schema that you migrate: one for small tables and one for large tables using virtual partitions.

For more information, refer to Creating, running, and monitoring an AWS SCT data extraction task.

Data architecture

To simplify and modernize your data architecture, consider the following:

  • Establish accountability and authority to enforce enterprise data standards and policies.
  • Formalize the data and analytics operating model between enterprise and business units and functions.
  • Simplify the data technology ecosystem through rationalization and modernization of data assets and tools or technology.
  • Develop organizational constructs that facilitate more robust integration of the business and delivery teams, and build data-oriented products and solutions to address the business problems and opportunities throughout the lifecycle.
  • Back up the data periodically so that if something goes wrong, you have the ability to replay it.
  • Ensure data quality management is included during planning, design, execution, and throughout implementation and maintenance to achieve the desired outcome.
  • Simple is the key to an easy, fast, intuitive, and low-cost solution. Simple scales much better than complex. Simple makes it possible to think big (Invent and Simplify is another Amazon leadership principle). Simplify the legacy process by migrating only the necessary data used in tables and schemas. For example, if you’re performing truncate and load for incremental data, identify a watermark and only process incremental data (see the sketch after this list).
  • You may have use cases that require record-level inserts, updates, and deletes for privacy regulations and simplified pipelines; simplified file management and near-real-time data access; or simplified change data capture (CDC) data pipeline development. We recommend using purpose-built tools based on your use case. AWS offers the options to use Apache Hudi with Amazon EMR and AWS Glue.
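
As a minimal sketch of the watermark approach mentioned above (the table and column names are hypothetical), an incremental load can replace a full truncate-and-load:

-- load only rows newer than the current high-water mark in the target
insert into orders_target
select s.*
from orders_staging s
where s.updated_at > (
	select coalesce(max(updated_at), '1900-01-01'::timestamp)
	from orders_target
);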

Migrate stored procedures

In this section, we share best practices for stored procedure migration from Greenplum to Amazon Redshift. Data processing pipelines with complex business logic often use stored procedures to perform the data transformation. We advise using big data processing services like AWS Glue or Amazon EMR to modernize your extract, transform, and load (ETL) jobs. For more information, check out Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift. For a time-sensitive migration to a cloud-native data warehouse like Amazon Redshift, redesigning and developing the entire pipeline in a cloud-native ETL tool might be time-consuming. Therefore, migrating the stored procedures from Greenplum to Amazon Redshift stored procedures can be the right choice.

For a successful migration, make sure to follow Amazon Redshift stored procedure best practices (a minimal example follows this list):

  • Specify the schema name when creating a stored procedure. This facilitates schema-level security, and you can manage access control with grants and revokes.
  • To prevent naming conflicts, we recommend naming procedures using the prefix sp_. Amazon Redshift reserves the sp_ prefix exclusively for stored procedures. By prefixing your procedure names with sp_, you ensure that your procedure name won’t conflict with any existing or future Amazon Redshift procedure names.
  • Qualify your database objects with the schema name in the stored procedure.
  • Follow the principle of least privilege and revoke unwanted access. Similarly, make sure the stored procedure run permission is not open to ALL.
  • The SECURITY attribute controls a procedure’s privileges to access database objects. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default. For more information, refer to Security and privileges for stored procedures.
  • Managing transactions is important when it comes to stored procedures. For more information, refer to Managing transactions.
  • TRUNCATE issues a commit implicitly inside a stored procedure. It interferes with the transaction block by committing the current transaction and creating a new one. Exercise caution while using TRUNCATE to ensure it never breaks the atomicity of the transaction. This also applies to COMMIT and ROLLBACK.
  • Adhere to cursor constraints and understand performance considerations while using cursors. You should use set-based SQL logic and temporary tables when processing large datasets.
  • Avoid hardcoding in stored procedures. Use dynamic SQL to construct SQL queries dynamically at runtime. Ensure appropriate logging and error handling of the dynamic SQL.
  • For exception handling, you can write RAISE statements as part of the stored procedure code. For example, you can raise an exception with a custom message or insert a record into a logging table. For unhandled exceptions like WHEN OTHERS, use built-in functions like SQLERRM or SQLSTATE to pass it on to the calling application or program. As of this writing, Amazon Redshift limits calling a stored procedure from the exception block.
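
The following is a minimal sketch applying several of these practices (a schema-qualified name with the sp_ prefix, an explicit SECURITY attribute, and exception handling); the schema, table, and procedure names are hypothetical:

create or replace procedure sales.sp_load_daily_sales(p_sale_date date)
as $$
begin
	-- qualify database objects with the schema name
	insert into sales.daily_sales
	select sale_date, item, sum(cost)
	from sales.transactions
	where sale_date = p_sale_date
	group by sale_date, item;
exception
	when others then
		-- pass the error on to the calling application
		raise exception 'sp_load_daily_sales failed: %', sqlerrm;
end;
$$ language plpgsql
security invoker;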

Sequences

You can use IDENTITY columns, system timestamps, or epoch time as an option to ensure uniqueness. The IDENTITY column or a timestamp-based solution might have sparse values, so if you need a continuous number sequence, you need to use dedicated number tables. You can also use the RANK() or ROW_NUMBER() window function over the entire set. Alternatively, get the high-water mark from the existing ID column of the table and increment the values while inserting records.
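
The following sketch shows two of these options, an IDENTITY column and incrementing from the high-water mark with ROW_NUMBER(); the table and column names are hypothetical:

-- option 1: an IDENTITY column generates surrogate keys
create table orders (
	order_id bigint identity(1, 1),
	order_ts timestamp
);

-- option 2: increment from the existing high-water mark (target table without IDENTITY)
insert into orders_target (order_id, order_ts)
select (select coalesce(max(order_id), 0) from orders_target)
	+ row_number() over (order by order_ts),
	order_ts
from orders_staging;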

Character datatype length

Greenplum char and varchar data type length is specified in terms of character length, including multi-byte ones. Amazon Redshift character types are defined in terms of bytes. For table columns using multi-byte character sets in Greenplum, the converted table column in Amazon Redshift should allocate adequate storage to the actual byte size of the source data.

An easy workaround is to set the Amazon Redshift character column length to four times the corresponding Greenplum column length.

A best practice is to use the smallest possible column size. Amazon Redshift doesn’t allocate storage space according to the length of the attribute; it allocates storage according to the real length of the stored string. However, at runtime, while processing queries, Amazon Redshift allocates memory according to the length of the attribute. Therefore, avoiding a blanket four-times-larger default helps from a performance perspective.

An efficient solution is to analyze production datasets and determine the maximum byte size length of the Greenplum character columns. Add a 20% buffer to support future incremental growth on the table.

To arrive at the actual byte size length of an existing column, run the Greenplum data structure character utility from the AWS Samples GitHub repo.
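
For a quick manual check on a single column, Greenplum’s PostgreSQL lineage means you can also measure byte lengths directly with octet_length(); the following sketch (with hypothetical table and column names) applies the 20% buffer described above:

-- run on Greenplum: widest value in bytes, plus a 20% growth buffer
select max(octet_length(customer_name)) as max_bytes,
	ceil(max(octet_length(customer_name)) * 1.2) as suggested_redshift_length
from customers;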

Numeric precision and scale

The Amazon Redshift numeric data type supports a maximum precision of 38, whereas in a Greenplum database, you can define a numeric column without any defined length.

Analyze your production datasets and determine numeric overflow candidates using the Greenplum data structure numeric utility from the AWS Samples GitHub repo. For numeric data, you have options to tackle this based on your use case. For numbers with a decimal part, you have the option to round the data based on the data type without any data loss in the whole number part. For future reference, you can keep a copy of the column in VARCHAR or store it in an S3 data lake. If only an extremely small percentage of the data overflows, clean up the source data for quality data migration.
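
For example, rounding the fractional part into a bounded DECIMAL while keeping a lossless VARCHAR copy might look like the following sketch (the table and column names are hypothetical):

insert into analytics.measurements (id, reading, reading_raw)
select
	id,
	cast(round(reading, 10) as decimal(38, 10)), -- whole number part preserved
	reading::varchar -- lossless copy kept for future reference
from staging_measurements;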

SQL queries and functions

While converting SQL scripts or stored procedures to Amazon Redshift, you might encounter unsupported functions, database objects, or code blocks for which you have to rewrite the query, create user-defined functions (UDFs), or redesign the logic. You can create a custom scalar UDF using either a SQL SELECT clause or a Python program. The new function is stored in the database and is available to any user with sufficient privileges to run it. You run a custom scalar UDF in much the same way as you run existing Amazon Redshift functions, which lets you match the functionality of legacy databases. The following are some examples of alternate query statements and ways to achieve specific aggregations that might be required during a code rewrite.
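
For instance, a legacy string function with no direct Amazon Redshift equivalent can be recreated as a Python scalar UDF; the following is a minimal sketch with a hypothetical function name:

create or replace function f_titlecase(s varchar)
returns varchar
stable
as $$
	return s.title() if s else s
$$ language plpythonu;

-- call it like any built-in function
select f_titlecase('amazon redshift');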

AGE

The Greenplum function AGE() returns an interval by subtracting from the current date. You can accomplish the same using a combination of the MONTHS_BETWEEN(), ADD_MONTHS(), DATEDIFF(), and TRUNC() functions based on your use case.

The following example Amazon Redshift query calculates the gap between the dates 2001-04-10 and 1957-06-13 in terms of years, months, and days. You can apply this to any date column in a table.

select
	trunc(trunc(months_between('2001-04-10'::date, '1957-06-13'::date)) / 12) years,
	mod(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4, 12) months,
	'2001-04-10'::date - add_months('1957-06-13'::date,
		trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4) days;

COUNT

If you have a use case to get a distinct aggregation in the COUNT() window function, you can accomplish the same using a combination of the DENSE_RANK() and MAX() window functions.

The following example Amazon Redshift query calculates the distinct item count for a given date of sale:

select
	sale_date,
	item,
	cost,
	max(densernk) over (partition by sale_date order by item rows between unbounded preceding and unbounded following) as distinct_itemcount
from
	(
	select
		*,dense_rank() over (partition by sale_date order by item) as densernk
	from
		testaggr)
order by
	sale_date,
	item,
	cost;

ORDER BY

Amazon Redshift aggregate window functions with an ORDER BY clause require a mandatory frame clause.

The following example Amazon Redshift query calculates the total cost by sale date, specifying the mandatory frame along with the ORDER BY clause in the window:

select
	*,
    sum(cost) over (partition by sale_date
order by
	item rows between unbounded preceding and unbounded following) as total_cost_by_date
from
	testaggr
order by
	sale_date,
	item,
	cost;

STRING_AGG

In Greenplum, STRING_AGG() is an aggregate function used to concatenate a list of strings. In Amazon Redshift, use the LISTAGG() function instead.

The following example Amazon Redshift query returns a semicolon-separated list of email addresses for each department:

select
	dept,
	listagg(email_address, ';')
		within group (order by email_address) as email_list
from
	employee_contact
group by
	dept
order by
	dept;

ARRAY_AGG

In Greenplum, ARRAY_AGG() is an aggregate function that takes a set of values as input and returns an array. In Amazon Redshift, use a combination of the LISTAGG() and SPLIT_TO_ARRAY() functions. The SPLIT_TO_ARRAY() function returns a SUPER datatype.

The following example Amazon Redshift query returns an array of email addresses for each department:

select
	dept,
	SPLIT_TO_ARRAY(email_list, ';') as email_array
from
	(
	select
		dept,
		listagg(email_address, ';')
			within group (order by email_address) as email_list
	from
		employee_contact
	group by
		dept
	order by
		dept);

To retrieve array elements from a SUPER expression, you can use the SUBARRAY() function:

select
	SUBARRAY(email_array, 0, 1) first_element,
	SUBARRAY(email_array, 1, 1) second_element,
	SUBARRAY(email_array, 0) all_element
from
	testarray
where
	dept = 'HR';

UNNEST

In Greenplum, you can use the UNNEST function to split an array and convert the array elements into a set of rows. In Amazon Redshift, you can use PartiQL syntax to iterate over SUPER arrays. For more information, refer to Querying semistructured data. The following example creates a test table with a SUPER column and iterates over its array elements:

create temp table unnesttest as
select
	json_parse('{"scalar_array": [0,1,2,3,4,5.5,6,7.9,8,9]}') as data;

select
	element
from
	unnesttest as un,
	un.data.scalar_array as element at index;

WHERE

You can’t use a window function in the WHERE clause of a query in Amazon Redshift. Instead, construct the query using the WITH clause and then refer to the calculated column in the WHERE clause.

The following example Amazon Redshift query returns the sale date, item, and cost from a table for the sales dates where the total sale is more than 100:

with aggrcost as (
select
	sale_date ,
	item,
	cost,
	sum(cost) over (partition by sale_date) as total_sale
from
	testaggr )
select
	*
from
	aggrcost
where
	total_sale > 100;

Refer to the following table for additional Greenplum date/time functions along with their Amazon Redshift equivalents to accelerate your code migration.

| # | Description | Greenplum | Amazon Redshift |
|---|---|---|---|
| 1 | Return the start time of the current transaction | now() | sysdate |
| 2 | Return the start timestamp of the current statement within a transaction block | clock_timestamp() | to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz |
| 3 | Return the start timestamp of the current transaction | transaction_timestamp() | to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz |
| 4 | Add x years and y months to date_time_column and return a timestamp | date_time_column + interval 'x years y months' | add_months(date_time_column, x*12 + y) |
| 5 | Get the total number of seconds between two timestamp fields | date_part('day', end_ts - start_ts) * 24 * 60 * 60 + date_part('hours', end_ts - start_ts) * 60 * 60 + date_part('minutes', end_ts - start_ts) * 60 + date_part('seconds', end_ts - start_ts) | datediff('seconds', start_ts, end_ts) |
| 6 | Get the total number of minutes between two timestamp fields | date_part('day', end_ts - start_ts) * 24 * 60 + date_part('hours', end_ts - start_ts) * 60 + date_part('minutes', end_ts - start_ts) | datediff('minutes', start_ts, end_ts) |
| 7 | Extract a date part literal from the difference of two timestamp fields | date_part('hour', end_ts - start_ts) | extract(hour from (date_time_column_2 - date_time_column_1)) |
| 8 | Return the ISO day of the week | date_part('isodow', date_time_column) | TO_CHAR(date_time_column, 'ID') |
| 9 | Return the ISO year from a date/time field | extract(isoyear from date_time_column) | TO_CHAR(date_time_column, 'IYYY') |
| 10 | Convert epoch seconds to the equivalent datetime | to_timestamp(epoch_seconds) | TIMESTAMP 'epoch' + number_of_seconds * interval '1 second' |

Amazon Redshift utility for troubleshooting or running diagnostics for the cluster

The Amazon Redshift Utilities GitHub repo contains a set of utilities (queries, views, and scripts) to accelerate troubleshooting or analysis on Amazon Redshift. They are not deployed by default onto Amazon Redshift clusters; the best practice is to deploy the needed views into an admin schema.
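
For example, you can create the admin schema once and then run the view definitions from the repo in it; the following sketch references the repo’s v_generate_tbl_ddl view:

create schema if not exists admin;
-- then run the needed view definitions from the repo against this schema,
-- for example admin/v_generate_tbl_ddl, which reconstructs the DDL of existing tables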

Conclusion

In this post, we covered prescriptive guidance around data types, functions, and stored procedures to accelerate the migration process from Greenplum to Amazon Redshift. Although this post describes modernizing and moving to a cloud warehouse, you should extend this transformation process toward a full-fledged modern data architecture. The AWS Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purpose-built data stores like Amazon S3, Amazon Redshift, Amazon Timestream, and others based on your use case.


About the Authors

Suresh Patnam is a Principal Solutions Architect at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations with a focus on big data, data lakes, and AI/ML. Suresh holds an MBA from Duke University’s Fuqua School of Business and an MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Arunabha Datta is a Sr. Data Architect at AWS Professional Services. He collaborates with customers and partners to architect and implement modern data architecture using AWS Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 1

Post Syndicated from Suresh Patnam original https://aws.amazon.com/blogs/big-data/part-1-migrate-a-large-data-warehouse-from-greenplum-to-amazon-redshift-using-aws-sct/

A data warehouse collects and consolidates data from various sources within your organization. It’s used as a centralized data repository for analytics and business intelligence.

When working with on-premises legacy data warehouses, scaling the size of your data warehouse or improving performance can mean purchasing new hardware or adding more powerful hardware. This is often expensive and time-consuming. Running your own on-premises data warehouse also requires hiring database managers and administrators to deal with outages, upgrades, and data access requests. As companies become more data-driven, reliable access to centralized data is increasingly important. As a result, there is a strong demand for data warehouses that are fast, accessible, and able to scale elastically with business needs. Cloud data warehouses like Amazon Redshift address these needs while eliminating the cost and risk of purchasing new hardware.

This multi-part series explains how to migrate an on-premises Greenplum data warehouse to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT). In this first post, we describe how to plan, run, and validate the large-scale data warehouse migration. It covers the solution overview, migration assessment, and guidance on technical and business validation. In the second post, we share best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns.

Solution overview

Amazon Redshift is an industry-leading cloud data warehouse. Amazon Redshift uses Structured Query Language (SQL) to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning to deliver the best price-performance at any scale.

AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database schema and most of the database code objects, SQL scripts, views, stored procedures, and functions to a format compatible with the target database. AWS SCT helps you modernize your applications simultaneously during database migration. When schema conversion is complete, AWS SCT can help migrate data from various data warehouses to Amazon Redshift using data extraction agents.

The following diagram illustrates our architecture for migrating data from Greenplum to Amazon Redshift using AWS SCT data extraction agents.

Perform a migration assessment

The initial data migration is the first milestone of the project. The main requirements for this phase are to minimize the impact on the data source and transfer the data as fast as possible. To do this, AWS offers several options, depending on the size of the database, network performance (AWS Direct Connect or AWS Snowball), and whether the migration is heterogeneous or not (AWS Database Migration Service (AWS DMS) or AWS SCT).

AWS provides a portfolio of cloud data migration services to provide the right solution for any data migration project. The level of connectivity is a significant factor in data migration, and AWS has offerings that can address your hybrid cloud storage, online data transfer, and offline data transfer needs.

Additionally, the AWS Snow Family makes it simple to get your data into and out of AWS via offline methods. Based on the size of the data, you can use AWS Snowmobile or AWS Snowball if you have petabytes to exabytes of data. To decide which transfer method is better for your use case, refer to Performance for AWS Snowball.

Perform schema conversion with AWS SCT

To convert your schema using AWS SCT, you must start a new AWS SCT project and connect your databases. Complete the following steps:

  1. Install AWS SCT.
  2. Open and initiate a new project.
  3. For Source database engine, choose Greenplum.
  4. For Target database engine, choose Amazon Redshift.
  5. Choose OK.
  6. Open your project and choose Connect to Greenplum.
  7. Enter the Greenplum database information.
  8. Choose Test connection.
  9. Choose OK after a successful connection test.
  10. Choose OK to complete the connection.
  11. Repeat similar steps to establish a connection to your Amazon Redshift cluster.

    By default, AWS SCT uses AWS Glue as the extract, transform, and load (ETL) solution for the migration. Before you continue, you must disable this setting.
  12. On the Settings menu, choose Project settings.
  13. Deselect Use AWS Glue.
  14. Choose OK.
  15. In the left pane, choose your schema (right-click) and choose Convert schema.
  16. When asked to replace objects, choose Yes.
  17. When asked to load statistics, choose Continue.

    By the end of this step, all Greenplum objects should be migrated to Amazon Redshift syntax. Some objects may be shown in red, meaning that AWS SCT couldn’t fully migrate these objects. You can view an assessment summary of the migration for more information.
  18. On the View menu, choose Assessment report view.

    In the bottom pane, you can see Greenplum DDL and Amazon Redshift DDL of the selected objects side by side for comparison.
  19. Choose the schema with a red icon, which indicates that it needs manual conversion. You’re presented with specific actions regarding the tables, constraints, or views that can’t be migrated to Amazon Redshift. You must investigate these issues and fix the errors manually with the required changes. Some examples are binary data in BLOB format, which AWS SCT automatically converts to the character varying data type but may highlight as an issue, and vendor-supplied procedures and functions that can’t be converted, for which AWS SCT reports an error.

    As a final step, you can validate that the tables exist in Amazon Redshift.
  20. Connect using the Amazon Redshift query editor v2 or another third-party tool or utility of your choice and check for all the tables with the following code:
    select t.table_schema, t.table_name,i.tbl_rows
    from svv_tables t left join svv_table_info i 
    on t.table_schema = i.schema and t.table_name = i.table
    where t.table_type = 'BASE TABLE' and t.table_schema='SCHEMA NAME'
    order by 1,2;

Migrate the data

To start your data migration using AWS SCT data extraction agents, complete the following steps:

  1. Configure the AWS SCT extractor properties file with corresponding Greenplum properties:
    port=8192
    vendor=GREENPLUM
    driver.jars="C:\DMS Workshop\JDBC\postgresql-42.2.9.jar"
    redshift.driver.jars=C:\DMS Workshop\JDBC\RedshiftJDBC42-no-awssdk-1.2.43.1067.jar
    working.folder=C:/Users/developer
    ssl.option=ON
    ssl.require.client.authentication=ON
    #extractor.start.fetch.size=20000
    #extractor.out.file.size=10485760
    ssl.truststore.path=C:/DMS Workshop/sctagent/truststore
    ssl.keystore.path=C:/DMS Workshop/sctagent/keystore
    ssl.keystore.pwd=ZcUnMLbWYe1c0Dp/m9jIe8IMGNtyUOv+AnKVsV9eZI+SuJcWI9Fd99Lskd7FFA==
    ssl.truststore.pwd=w23mFvm+SO46eR4dJ0Ly7n+XcvgUhhqrJNMPUyWdhFFKBxRU64/V2uioOul/aBcs8tMs=

    Now you configure the AWS SCT extractor to perform a one-time data move. You can use multiple extractors when dealing with a large volume of data.

  2. To register the extractor, on the View menu, choose Data migration view.
  3. Choose Register.
  4. Enter the information for your new agent.
  5. Test the connection and choose Register.

    Now you create a task for the extractor to extract data into the tables created on Amazon Redshift.
  6. Under your schema in the left pane, choose Tables (right-click) and choose Create Local task.
  7. For Task name, enter a name.
  8. Test the connection and choose OK.
  9. Choose Create.
  10. Run your task and monitor its progress.

You can choose each task to get a detailed breakdown of its activity. Make sure to examine errors during the extract, upload, and copy process.

You can monitor the status of the tasks, the percentage completed, and the tables that were loaded successfully. You must also verify the count of records loaded into the Amazon Redshift database.

Technical validation

After the initial extracted data is loaded to Amazon Redshift, you must perform data validation tests in parallel. The goal at this stage is to validate production workloads, comparing Greenplum and Amazon Redshift outputs from the same inputs.

Typical activities covered during this phase include the following:

  • Count the number of objects and rows on each table (see the example query after this list).
  • Compare the same random subset of data in both Greenplum and Amazon Redshift for all migrated tables, validating that the data is exactly the same row by row.
  • Check for incorrect column encodings.
  • Identify skewed table data.
  • Annotate queries not benefiting from sort keys.
  • Identify inappropriate join cardinality.
  • Identify tables with large VARCHAR columns.
  • Confirm that processes don’t crash when connected with the target environment.
  • Validate daily batch jobs (job duration, number of rows processed). To find the right techniques to perform most of those activities, refer to Top 10 Performance Tuning Techniques for Amazon Redshift.
  • Set up Amazon Redshift automated alerts with Amazon Redshift Advisor.
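
For example, a quick row-count check on the Amazon Redshift side can be driven from svv_table_info; compare the output against equivalent counts collected from Greenplum:

select "schema", "table", tbl_rows
from svv_table_info
order by "schema", "table";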

Business validation

After you successfully migrate the data and validate the data movement, the last remaining task is to involve the data warehouse users in the validation process. These users from different business units across the company access the data warehouse using various tools and methods: JDBC/ODBC clients, Python scripts, custom applications, and more. It’s central to the migration to make sure that every end-user has verified and adapted this process to work seamlessly with Amazon Redshift before the final cutover.

This business validation phase is key so all end-users are aligned and ready for the final cutover. Following Amazon Redshift best practices enables end-users to take full advantage of the capabilities of their new data warehouse. After you perform all the migration validation tasks and connect and test every ETL job, business process, external system, and user tool against Amazon Redshift, you can disconnect every process from the old data warehouse, which you can now safely power off and decommission.

Conclusion

In this post, we provided detailed steps to migrate from Greenplum to Amazon Redshift using AWS SCT. Although this post describes modernizing and moving to a cloud warehouse, you should extend this transformation process toward a full-fledged modern data architecture. The AWS Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purpose-built data stores like Amazon S3, Amazon Redshift, Amazon Timestream, and other data stores based on your use case.

Check out the second post in this series, where we cover prescriptive guidance around data types, functions, and stored procedures.


About the Authors

Suresh Patnam is a Principal Solutions Architect at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations with a focus on big data, data lakes, and AI/ML. Suresh holds an MBA from Duke University’s Fuqua School of Business and an MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Arunabha Datta is a Sr. Data Architect at Amazon Web Services (AWS). He collaborates with customers and partners to architect and implement modern data architecture using AWS Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.

Accelerate resize and encryption of Amazon Redshift clusters with Faster Classic Resize

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/accelerate-resize-and-encryption-of-amazon-redshift-clusters-with-faster-classic-resize/

Amazon Redshift has improved the performance of the classic resize feature and increased the flexibility of the cluster snapshot restore operation. You can use the classic resize operation to resize a cluster when you need to change the instance type or transition to a configuration that can’t be supported by elastic resize. Previously, this could take the cluster offline for many hours during the resize, but now the cluster can typically be available to process queries in minutes. Clusters can also be resized when restoring from a snapshot, although some restrictions may apply in those cases.

You can now also restore an encrypted cluster from an unencrypted snapshot or change the encryption key. Amazon Redshift uses AWS Key Management Service (AWS KMS) as an encryption option to provide an additional layer of data protection by securing your data from unauthorized access to the underlying storage. Now you can encrypt an unencrypted cluster with a KMS key faster by simply specifying a KMS key ID when modifying the cluster. You can also restore an AWS KMS-encrypted cluster from an unencrypted snapshot. You can access the feature via the AWS Management Console, SDK, or AWS Command Line Interface (AWS CLI). Please note that these features only apply to the clusters or target clusters with the RA3 node type.

In this post, we show you how the updated classic resize option works and also how it significantly improves the amount of time it takes to resize or encrypt your cluster with this enhancement. We also walk through the steps to resize your Amazon Redshift cluster using Faster Classic Resize.

Existing resize options

We’ve worked closely with our customers to learn how their needs evolve as their data scales or as their security and compliance requirements change. To address and meet your ever-growing demands, you often have to resize your Amazon Redshift cluster and choose an optimal instance type that delivers the best price/performance. As of this writing, there are three ways you can resize your clusters: elastic resize, classic resize, and the snapshot, restore, and resize method.

Among the three options, elastic resize is the fastest available resize mechanism because it works based on slice remapping instead of a full data copy. Classic resize is used primarily when the desired cluster resize is outside the slice ranges allowed by elastic resize, or when the encryption status needs to change. Let’s briefly discuss these scenarios before describing how the new migration process helps.

Current limitations

The current resize options have a few limitations of note.

  • Configuration changes – Elastic resize supports the following RA3 configuration changes by design. When you need to choose a target cluster outside the ranges mentioned in the following table, you should choose classic resize.
| Node Type | Growth Limit | Reduction Limit |
|---|---|---|
| ra3.16xlarge | 4x (from 4 to 16 nodes, for example) | To one-quarter of the number (from 16 to 4 nodes, for example) |
| ra3.4xlarge | 4x | To one-quarter of the number |
| ra3.xlplus | 2x (from 4 to 8 nodes, for example) | To one-quarter of the number |

Also, elastic resize can’t be performed if the current cluster is a single-node cluster or isn’t running on an EC2-VPC platform. These scenarios also drive customers to choose classic resize.

  • Encryption changes – You may need to encrypt your Amazon Redshift cluster based on security, compliance, and data consumption requirements. Currently, in order to modify encryption on an Amazon Redshift cluster, we use classic resize technology, which internally performs a deep copy operation of the entire dataset and rewrites the dataset with the desired encryption state. To avoid any changes during the deep copy operation, the source cluster is placed in read-only mode during the entire operation, which can take a few hours to days depending on the dataset size. Or, you might be locked out altogether if the data warehouse is down for a resize. As a result, the administrators or application owners can’t support Service Level Agreements (SLAs) that they have set with their business stakeholders.

Switching to the Faster Classic Resize approach can help speed up the migration process when turning on encryption. Encryption has been one of the requirements for cross-account, cross-Region data sharing, which isn’t available on unencrypted clusters, and for integrations with AWS Data Exchange for Amazon Redshift. Additionally, Amazon Redshift Serverless is encrypted by default, so to create a data share from a provisioned cluster to Redshift Serverless, the provisioned cluster should be encrypted as well. This is one more compelling requirement for Faster Classic Resize.

Faster Classic Resize

Faster Classic Resize works like elastic resize but performs the same functions as classic resize, thereby offering the best of both approaches. Unlike classic resize, which involves extracting tuples from the source cluster and inserting them on the target cluster, the Faster Classic Resize operation doesn’t extract tuples. Instead, it starts from a snapshot, and the data blocks are copied over to the target cluster.

The new Faster Classic Resize operation involves two stages:

  • Stage 1 (Critical path) – This first stage consists of migrating the data from the source cluster to the target cluster, during which the source cluster is in read-only mode. Typically, this stage is of very short duration. The cluster is then made available for reads and writes.
  • Stage 2 (Off critical path) – The second stage involves redistributing the data as per the previous data distribution style. This process runs in the background off the critical path of migration from the source to target cluster. The duration of this stage is dependent on the volume to distribute, cluster workload, and so on.

Let’s see how Faster Classic Resize works with configuration changes, encryption changes, and restoring an unencrypted snapshot into an encrypted cluster.

Prerequisites

Complete the following prerequisite steps:

  1. Take a snapshot from the current cluster or use an existing snapshot.
  2. Provide the AWS Identity and Access Management (IAM) role credentials that are required to run the AWS CLI. For more information, refer to Using identity-based policies (IAM policies) for Amazon Redshift.
  3. For encryption changes, create a KMS key if none exists. For instructions, refer to Creating keys.

Configuration changes

As of today, you can initiate Faster Classic Resize through the AWS CLI when the required cluster configuration change is beyond what is allowed by elastic resize. The feature will be added to the console in a future release. As of this writing, you can use it to change your cluster configuration from DC2, DS2, and RA3 node types to any RA3 node type. However, changing from RA3 to DC2 or DS2 isn’t supported yet.

We did a benchmark on Faster Classic Resize with different cluster combinations and volumes. The following table summarizes the results comparing critical paths in classic resize and Faster Classic Resize.

| Volume | Source Cluster | Target Cluster | Classic Resize Duration (min) | Faster Classic Resize Stage 1 Duration (min) | % Faster |
|---|---|---|---|---|---|
| 10 TB | ra3.4xlarge – 6 nodes | ra3.16xlarge – 8 nodes | 78 | 11 | 86% |
| 10 TB | ra3.16xlarge – 8 nodes | ra3.4xlarge – 2 nodes | 738 | 11 | 99% |
| 10 TB | dc2.8xlarge – 6 nodes | ra3.4xlarge – 2 nodes | 706 | 8 | 99% |
| 3 TB | ra3.4xlarge – 2 nodes | ra3.16xlarge – 4 nodes | 53 | 11 | 79% |
| 3 TB | ra3.16xlarge – 4 nodes | ra3.4xlarge – 2 nodes | 244 | 7 | 97% |
| 3 TB | dc2.8xlarge – 6 nodes | ra3.4xlarge – 2 nodes | 251 | 7 | 97% |

The Faster Classic Resize option consistently completed in significantly less time and made the cluster available for read and write operations in a short time. Classic resize took a longer time in all cases and kept the cluster in read-only mode, making it unavailable for writes. Also, the classic resize duration is comparatively longer when the target cluster configuration is smaller than the original cluster configuration.

Perform Faster Classic Resize

As of this writing, the feature isn’t available on the console. You can use either of the following two methods to resize your cluster using Faster Classic Resize via the AWS CLI.

  • Modify cluster method – Resize an existing cluster without changing the endpoint

The following are the steps involved:

    1. Take a snapshot of the current cluster prior to performing the resize operation.
    2. Determine the target cluster configuration and run the following command from the AWS CLI:
      aws redshift modify-cluster --region <CLUSTER REGION>
      --endpoint-url https://redshift.<CLUSTER REGION>.amazonaws.com/
      --cluster-identifier <CLUSTER NAME>
      --cluster-type multi-node
      --node-type <TARGET INSTANCE TYPE>
      --number-of-nodes <TARGET NUMBER OF NODES>

      For example:

      aws redshift modify-cluster --region us-east-1
      --endpoint-url https://redshift.us-east-1.amazonaws.com/
      --cluster-identifier my-cluster-identifier
      --cluster-type multi-node
      --node-type  ra3.16xlarge
      --number-of-nodes 12

  • Snapshot restore method – Restore an existing snapshot to the new cluster with the new cluster endpoint

The following are the steps involved:

    1. Identify the snapshot for restore and a unique name for the new cluster.
    2. Determine the target cluster configuration and run the following command from the AWS CLI:
      aws redshift restore-from-cluster-snapshot --region <CLUSTER REGION>
      --endpoint-url https://redshift.<CLUSTER REGION>.amazonaws.com/
      --snapshot-identifier <SNAPSHOT ID> 
      --cluster-identifier <CLUSTER NAME>
      --node-type <TARGET INSTANCE TYPE>
      --number-of-nodes <NUMBER>

      For example:

      aws redshift restore-from-cluster-snapshot --region us-east-1
      --endpoint-url https://redshift.us-east-1.amazonaws.com/
      --snapshot-identifier rs:sales-cluster-2022-05-26-16-19-36
      --cluster-identifier my-new-cluster-identifier
      --node-type ra3.16xlarge
      --number-of-nodes 12

Monitor the resize process

Monitor the progress through the cluster management console. You can also check the events generated by the resize process. The resize completion status is logged in events along with the duration it took for the resize. The following screenshot shows an example.
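
You can also retrieve these events with the AWS CLI; the following sketch uses an illustrative cluster identifier and lists events from the last 2 days (2,880 minutes):

aws redshift describe-events --region us-east-1
--source-type cluster
--source-identifier my-cluster-identifier
--duration 2880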

It’s important to note that you may observe longer query times during the second stage of Faster Classic Resize. During the first stage, the data for tables with the dist-key distribution style is transferred as dist-even. Background processes then run behind the scenes to redistribute that data to the original dist-key distribution style (stage 2). You can monitor the progress of these background processes by querying the stv_xrestore_alter_queue_state table. Tables with ALL, AUTO, or EVEN distribution styles don’t require redistribution post-resize, so they aren’t logged in the stv_xrestore_alter_queue_state table. The counts you observe in this table are for the tables with a distribution style of Key before the resize operation.

See the following example query:

select db_id, status, count(*) from stv_xrestore_alter_queue_state group by 1,2 order by 3 desc;

The following table shows that for 60 tables data redistribution is finished, for 323 tables data redistribution is pending, and for 1 table data redistribution is in progress.

We ran tests to assess time to complete the redistribution. For 10 TB of data, it took approximately 5 hours and 30 minutes on an idle cluster. For 3 TB, it took approximately 2 hours and 30 minutes on an idle cluster. The following is a summary of tests performed on larger volumes:

  • A snapshot with 100 TB where 70% of blocks needs redistribution would take 10–40 hours
  • A snapshot with 450 TB where 70% of blocks needs redistribution would take 2–8 days
  • A snapshot with 1600 TB where 70% of blocks needs redistribution would take 7–27 days

The actual time to complete redistribution is largely dependent on data volume, cluster idle cycles, target cluster size, data skewness, and more. Therefore, we recommend performing Faster Classic Resize when there is enough of an idle window (such as weekends) for the cluster to perform redistribution.

Encryption changes

You can encrypt your Amazon Redshift cluster from the console (the modify cluster method) or using the AWS CLI using the snapshot restore method. Amazon Redshift performs the encryption change using Faster Classic Resize. The operation only takes a few minutes to complete and your cluster is available for both read and write. With Faster Classic Resize, you can change an unencrypted cluster to an encrypted cluster or change the encryption key using the snapshot restore method.

For this post, we show how you can change the encryption using the Amazon Redshift console. To test the timings, we created multiple Amazon Redshift clusters using TPC-DS data. The Faster Classic Resize option consistently completed in significantly less time and made clusters available for read and write operations faster. Classic resize took a longer time in all cases and kept the cluster in read-only mode. The following table contains the summary of the results.

| Data Volume | Cluster | Encryption (Classic Resize) Duration (min) | Encryption (Faster Classic Resize) Duration (min) | % Faster |
|---|---|---|---|---|
| 10 TB | ra3.4xlarge – 2 nodes | 580 | 11 | 98% |
| 10 TB | ra3.xlplus – 2 nodes | 680 | 16 | 98% |
| 3 TB | ra3.4xlarge – 2 nodes | 527 | 9 | 98% |
| 3 TB | ra3.xlplus – 2 nodes | 570 | 10 | 98% |

Now, let’s perform the encryption change from an unencrypted cluster to an encrypted cluster using the console. Complete the following steps:

  1. On the Amazon Redshift console, navigate to your cluster.
  2. On the Properties tab, on the Edit drop-down menu, choose Edit encryption.
  3. For Encryption, select Use AWS Key Management Service (AWS KMS).
  4. For AWS KMS, select Default Redshift key.
  5. Choose Save changes.

You can monitor the progress of your encryption change on the Events tab. As shown in the following screenshot, the entire process to change the encryption took approximately 11 minutes.

Restore an unencrypted snapshot to an encrypted cluster

As of today, the Faster Classic Resize option to restore an unencrypted snapshot into an encrypted cluster or to change the encryption key is available only through the AWS CLI. When triggered, the restored cluster operates in read/write mode immediately. The encryption state change for restored blocks that are unencrypted operates in the background, and newly ingested blocks continue to be encrypted.

Restore the snapshot using the following command into a new cluster. (Replace the indicated parameter values; --encrypted and --kms-key-id are required).

aws redshift restore-from-cluster-snapshot 
--cluster-identifier <CLUSTER NAME>
--snapshot-identifier <SNAPSHOT ID> 
--region <AWS REGION> 
--encrypted
--kms-key-id <KMS KEY ID>
--cluster-subnet-group-name <SUBNET GROUP>

When to use which resize option

The following flow chart provides guidance on which resize option is recommended when changing your cluster encryption status or resizing to a new cluster configuration.

Summary

In this post, we talked about the improved performance of Amazon Redshift’s classic resize feature and how Faster Classic Resize significantly improves your ability to scale your Amazon Redshift clusters using the classic resize method. We also talked about when to use different resize operations based on your requirements. We demonstrated how it works from the console (for encryption changes) and using the AWS CLI. We also showed the results of our benchmark test and how it significantly improves the migration time for configuration changes and encryption changes for your Amazon Redshift cluster.

To learn more about resizing your clusters, refer to Resizing clusters in Amazon Redshift. If you have any feedback or questions, please leave them in the comments.


About the authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 17 years of experience in the data warehousing and analytical space.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Krishna Chaitanya Gudipati is a Senior Software Development Engineer at Amazon Redshift. He has been working on distributed systems for over 14 years and is passionate about building scalable and performant systems. In his spare time, he enjoys reading and exploring new places.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager. She has a rich experience of how the customer-facing Amazon Redshift features are built from planning to launching, and always treats customers’ requirements as first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Optimize your Amazon Redshift query performance with automated materialized views

Post Syndicated from Adam Gatt original https://aws.amazon.com/blogs/big-data/optimize-your-amazon-redshift-query-performance-with-automated-materialized-views/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it cost-effective to analyze your data using standard SQL and business intelligence tools. Amazon Redshift allows you to analyze structured and semi-structured data and seamlessly query data lakes and operational databases, using AWS-designed hardware and automated machine learning (ML)-based tuning to deliver top-tier price-performance at scale.

Although Amazon Redshift provides excellent price performance out of the box, it offers additional optimizations that can improve this performance and allow you to achieve even faster query response times from your data warehouse.

For example, you can physically tune tables in a data model to minimize the amount of data scanned and distributed within a cluster, which speeds up operations such as table joins and range-bound scans. Amazon Redshift now automates this tuning with the automatic table optimization (ATO) feature.

Another optimization for reducing query runtime is to precompute query results in the form of a materialized view. Materialized views store precomputed query results that future similar queries can use. This improves query performance because many computation steps can be skipped and the precomputed results returned directly. Unlike a simple cache, many materialized views can be incrementally refreshed when DML changes are applied on the underlying (base) tables and can be used by other similar queries, not just the query used to create the materialized view.

Amazon Redshift introduced materialized views in March 2020. In June 2020, support for external tables was added. With these releases, you could use materialized views on both local and external tables to deliver low-latency performance by using precomputed views in your queries. However, this approach required you to be aware of what materialized views were available on the cluster, and if they were up to date.

In November 2020, materialized view automatic refresh and query rewrite features were added. With materialized view-aware automatic rewriting, data analysts get the benefit of materialized views for their queries and dashboards without having to query the materialized view directly. The analyst may not even be aware the materialized views exist. The auto rewrite feature enables this by rewriting queries to use materialized views without the query needing to explicitly reference them. In addition, auto refresh keeps materialized views up to date when base table data is changed, and there are available cluster resources for the materialized view maintenance.

However, materialized views still have to be manually created, monitored, and maintained by data engineers or DBAs. To reduce this overhead, Amazon Redshift has introduced the Automated Materialized View (AutoMV) feature, which goes one step further and automatically creates materialized views for queries with common recurring joins and aggregations.

This post explains what materialized views are, how manual materialized views work and the benefits they provide, and what’s required to build and maintain manual materialized views to achieve performance improvements and optimization. Then we explain how this is greatly simplified with the new automated materialized view feature.

Manually create materialized views

A materialized view is a database object that stores precomputed query results in a materialized (persisted) dataset. Similar queries can use the precomputed results from the materialized view and skip the expensive tasks of reading the underlying tables and performing joins and aggregates, thereby improving the query performance.

For example, you can improve the performance of a dashboard by materializing the results of its queries into a materialized view or multiple materialized views. When the dashboard is opened or refreshed, it can use the precomputed results from the materialized view instead of rereading the base tables and reprocessing the queries. By creating a materialized view once and querying it multiple times, redundant processing can be avoided, improving query performance and freeing up resources for other processing on the database.

To demonstrate this, we use the following query, which returns daily sales totals. It joins two tables and aggregates at the day level.

SET enable_result_cache_for_session TO OFF;

SELECT o.o_orderdate AS order_date
      ,SUM(l.l_extendedprice) AS ext_price_total
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderdate >= '1997-01-01'
AND   o.o_orderdate < '1998-01-01'
GROUP BY o.o_orderdate
ORDER BY 1;

At the top of the query, we set enable_result_cache_for_session to OFF. This setting disables the results cache, so we can see the full processing runtime each time we run the query. Unlike a materialized view, the results cache is a simple cache that stores the results of a single query in memory. It can't be used by other similar queries, is not updated when the base tables are modified, and, because it isn't persisted, can be aged out of memory by more frequently used queries.

When we run this query on a 10-node ra3.4xlarge cluster with the TPC-H 3 TB dataset, it returns in approximately 20 seconds. If we need to run this query or similar queries more than once, we can create a materialized view with the CREATE MATERIALIZED VIEW command and query the materialized view object directly, which has the same structure as a table:

CREATE MATERIALIZED VIEW mv_daily_sales
AS
SELECT o.o_orderdate AS order_date
      ,SUM(l.l_extendedprice) AS ext_price_total
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderdate >= '1997-01-01'
AND   o.o_orderdate < '1998-01-01'
GROUP BY o.o_orderdate;

SELECT order_date
      ,ext_price_total
FROM   mv_daily_sales
ORDER BY 1;

Because the join and aggregations have been precomputed, it runs in approximately 900 milliseconds, a performance improvement of 96%.

As we have just shown, you can query the materialized view directly; however, Amazon Redshift can automatically rewrite a query to use one or more materialized views. The query rewrite feature transparently rewrites the query as it’s being run to retrieve precomputed results from a materialized view. This process is automatically triggered on eligible and up-to-date materialized views, if the query contains the same base tables and joins, and has similar aggregations as the materialized view.

For example, if we rerun the sales query, because it’s eligible for rewriting, it’s automatically rewritten to use the mv_daily_sales materialized view. We start with the original query:

SELECT o.o_orderdate AS order_date
      ,SUM(l.l_extendedprice) AS ext_price_total
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderdate >= '1997-01-01'
AND   o.o_orderdate < '1998-01-01'
GROUP BY o.o_orderdate
ORDER BY 1;

Internally, the query is rewritten to the following SQL and run. This process is completely transparent to the user.

SELECT order_date
      ,ext_price_total
FROM   mv_daily_sales
ORDER BY 1;

The rewriting can be confirmed by looking at the query’s explain plan:

EXPLAIN SELECT o.o_orderdate AS order_date
      ,SUM(l.l_extendedprice) AS ext_price_total
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderdate >= '1997-01-01'
AND   o.o_orderdate < '1998-01-01'
GROUP BY o.o_orderdate;

+------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                      |
+------------------------------------------------------------------------------------------------+
|XN HashAggregate  (cost=5.47..5.97 rows=200 width=31)                                           |
|  ->  XN Seq Scan on mv_tbl__mv_daily_sales__0 derived_table1  (cost=0.00..3.65 rows=365 width=31)|
+------------------------------------------------------------------------------------------------+

The plan shows the query has been rewritten and has retrieved the results from the mv_daily_sales materialized view, not the query’s base tables: orders and lineitem.

Other queries that use the same base tables and level of aggregation, or a level of aggregation derived from the materialized view’s level, are also rewritten. For example:

EXPLAIN SELECT date_trunc('month', o.o_orderdate) AS order_month
      ,SUM(l.l_extendedprice) AS ext_price_total
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderdate >= '1997-01-01'
AND   o.o_orderdate < '1998-01-01'
GROUP BY order_month;

+------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                      |
+------------------------------------------------------------------------------------------------+
|XN HashAggregate  (cost=7.30..10.04 rows=365 width=19)                                          |
|  ->  XN Seq Scan on mv_tbl__mv_daily_sales__0 derived_table1  (cost=0.00..5.47 rows=365 width=19)|
+------------------------------------------------------------------------------------------------+

If data in the orders or lineitem table changes, mv_daily_sales becomes stale; this means the materialized view isn’t reflecting the state of its base tables. If we update a row in lineitem and check the stv_mv_info system table, we can see the is_stale flag is set to t (true):

UPDATE lineitem
SET l_extendedprice = 5000
WHERE l_orderkey = 2362252519
AND l_linenumber = 1;

SELECT name
      ,is_stale
FROM stv_mv_info
WHERE name = 'mv_daily_sales';

+--------------+--------+
|name          |is_stale|
+--------------+--------+
|mv_daily_sales|t       |
+--------------+--------+

We can now manually refresh the materialized view using the REFRESH MATERIALIZED VIEW statement:

REFRESH MATERIALIZED VIEW mv_daily_sales;

SELECT name
      ,is_stale
FROM stv_mv_info
WHERE name = 'mv_daily_sales';

+--------------+--------+
|name          |is_stale|
+--------------+--------+
|mv_daily_sales|f       |
+--------------+--------+

There are two types of materialized view refresh: full and incremental. A full refresh reruns the underlying SQL statement and rebuilds the whole materialized view. An incremental refresh only updates specific rows affected by the source data change. To see if a materialized view is eligible for incremental refreshes, view the state column in the stv_mv_info system table. A state of 0 indicates the materialized view will be fully refreshed, and a state of 1 indicates the materialized view will be incrementally refreshed.

SELECT name
      ,state
FROM stv_mv_info
WHERE name = 'mv_daily_sales';

+--------------+--------+
|name          |state   |
+--------------+--------+
|mv_daily_sales|       1|
+--------------+--------+

You can schedule manual refreshes on the Amazon Redshift console if you need to refresh a materialized view at fixed periods, such as once per hour. For more information, refer to Scheduling a query on the Amazon Redshift console.

As well as the ability to do a manual refresh, Amazon Redshift can also automatically refresh materialized views. The auto refresh feature intelligently determines when to refresh the materialized view, and if you have multiple materialized views, which order to refresh them in. Amazon Redshift considers the benefit of refreshing a materialized view (how often the materialized view is used, what performance gain the materialized view provides) and the cost (resources required for the refresh, current system load, available system resources).

This intelligent refreshing has a number of benefits. Because not all materialized views are equally important, deciding when and in which order to refresh materialized views on a large system is a complex task for a DBA to solve. Also, the DBA needs to consider other workloads running on the system, and try to ensure the latency of critical workloads is not increased by the effect of refreshing materialized views. The auto refresh feature helps remove the need for a DBA to do these difficult and time-consuming tasks.

You can set a materialized view to be automatically refreshed in the CREATE MATERIALIZED VIEW statement with the AUTO REFRESH YES parameter:

CREATE MATERIALIZED VIEW mv_daily_sales
AUTO REFRESH YES
AS
SELECT ...

Now when the source data of the materialized view changes, the materialized view is automatically refreshed. We can view the status of the refresh in the svl_mv_refresh_status system table. For example:

UPDATE lineitem
SET l_extendedprice = 6000
WHERE l_orderkey = 2362252519
AND l_linenumber = 1;

SELECT mv_name
      ,starttime
      ,endtime
      ,status
      ,refresh_type
FROM svl_mv_refresh_status
WHERE mv_name = 'mv_daily_sales';

+--------------+--------------------------+--------------------------+---------------------------------------------+------------+
|mv_name       |starttime                 |endtime                   |status                                       |refresh_type|
+--------------+--------------------------+--------------------------+---------------------------------------------+------------+
|mv_daily_sales|2022-05-06 14:07:24.857074|2022-05-06 14:07:33.342346|Refresh successfully updated MV incrementally|Auto        |
+--------------+--------------------------+--------------------------+---------------------------------------------+------------+

To remove a materialized view, we use the DROP MATERIALIZED VIEW command:

DROP MATERIALIZED VIEW mv_daily_sales;

Now that you’ve seen what materialized views are, their benefits, and how they are created, used, and removed, let’s discuss the drawbacks. Designing and implementing a set of materialized views to help improve overall query performance on a database requires a skilled resource to perform several involved and time-consuming tasks:

  • Analyzing queries run on the system
  • Identifying which queries are run regularly and provide business benefit
  • Prioritizing the identified queries
  • Determining if the performance improvement is worth creating a materialized view and storing the dataset
  • Physically creating and refreshing the materialized views
  • Monitoring the usage of the materialized views
  • Dropping materialized views that are rarely or never used or can’t be refreshed due to the structure of base tables changing

Significant skill, effort, and time is required to design and create materialized views that provide an overall benefit. Also, ongoing monitoring is needed to identify poorly designed or underutilized materialized views that are occupying resources without providing gains.

Amazon Redshift now has a feature to automate this process, Automated Materialized Views (AutoMVs). We explain how AutoMVs work and how to use them on your cluster in the following sections.

Automatically create materialized views

When the AutoMV feature is enabled on an Amazon Redshift cluster (it’s enabled by default), Amazon Redshift monitors recently run queries and identifies any that could have their performance improved by a materialized view. Expensive parts of the query, such as aggregates and joins that can be persisted into materialized views and reused by future queries, are then extracted from the main query and any subqueries. The extracted query parts are then rewritten into create materialized view statements (candidate materialized views) and stored for further processing.

The candidate materialized views are not just one-to-one copies of queries; extra processing is applied to create generalized materialized views that can be used by queries similar to the original query. In the following example, the result set is limited by the filters o_orderpriority = '1-URGENT' and l_shipmode ='AIR'. Therefore, a materialized view built from this result set could only serve queries selecting that limited range of data.

SELECT o.o_orderdate
      ,SUM(l.l_extendedprice)
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderpriority = '1-URGENT'
AND   l.l_shipmode ='AIR'
GROUP BY o.o_orderdate;

Amazon Redshift uses many techniques to create generalized materialized views; one of these techniques is called predicate elevation. To apply predicate elevation to this query, the filtered columns o_orderpriority and l_shipmode are moved into the GROUP BY clause, thereby storing the full range of data in the materialized view, which allows similar queries to use the same materialized view. This approach is driven by dashboard-like workloads that often issue identical queries with different filter predicates.

SELECT o.o_orderdate
      ,o.o_orderpriority
      ,l.l_shipmode
      ,SUM(l.l_extendedprice)
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
GROUP BY o.o_orderdate
        ,o.o_orderpriority
        ,l.l_shipmode;

In the next processing step, ML algorithms are applied to calculate which of the candidate materialized views provides the best performance benefit and system-wide performance optimization. The algorithms follow similar logic to the auto refresh feature mentioned previously. For each candidate materialized view, Amazon Redshift calculates a benefit, which corresponds to the expected performance improvement should the materialized view be materialized and used in the workload. In addition, it calculates a cost corresponding to the system resources required to create and maintain the candidate. Existing manual materialized views are also considered; an AutoMV will not be created if a manual materialized view already exists that covers the same scope, and manual materialized views have auto refresh priority over AutoMVs.

The list of materialized views is then sorted in order of overall cost-benefit, taking into consideration workload management (WLM) query priorities, with materialized views related to queries on a higher priority queue ordered before materialized views related to queries on a lower priority queue. After the list of materialized views has been fully sorted, they’re automatically created and populated in the background in the prioritized order.

The created AutoMVs are then monitored by a background process that checks their activity, such as how often they have been queried and refreshed. If the process determines that an AutoMV is not being used or refreshed, for example due to the base table’s structure changing, it is dropped.

Example

To demonstrate this process in action, we use the following query taken from the 3 TB Cloud DW Benchmark, a performance testing benchmark derived from TPC-H. You can load the benchmark data into your cluster and follow along with the example.

SET enable_result_cache_for_session TO OFF;

SELECT /* TPC-H Q12 */
       l_shipmode
     , SUM(CASE
              WHEN o_orderpriority = '1-URGENT'
                 OR o_orderpriority = '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS high_line_count
     , SUM(CASE
              WHEN o_orderpriority <> '1-URGENT'
                 AND o_orderpriority <> '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS low_line_count
FROM orders
   , lineitem
WHERE o_orderkey = l_orderkey
AND l_shipmode IN ('MAIL', 'SHIP')
AND l_commitdate < l_receiptdate
AND l_shipdate < l_commitdate
AND l_receiptdate >= DATE '1994-01-01'
AND l_receiptdate < DATEADD(YEAR, 1, CAST('1994-01-01' AS DATE))
GROUP BY l_shipmode
ORDER BY l_shipmode;

We run the query three times and then wait for 30 minutes. On a 10-node ra3.4xlarge cluster, the query runs in approximately 8 seconds.

During the 30 minutes, Amazon Redshift assesses the benefit of materializing candidate AutoMVs. It computes a sorted list of candidate materialized views and creates the most beneficial ones with incremental refresh, auto refresh, and query rewrite enabled. When the query or similar queries run, they’re automatically and transparently rewritten to use one or more of the created AutoMVs.

From then on, when data in the base tables is modified and the AutoMV becomes stale, an incremental refresh automatically runs, inserting, updating, and deleting rows in the AutoMV to bring its data to the latest state.

Rerunning the query shows that it runs in approximately 800 milliseconds, a performance improvement of 90%. We can confirm the query is using the AutoMV by checking the explain plan:

EXPLAIN SELECT /* TPC-H Q12 */
       l_shipmode
     , SUM(CASE
              WHEN o_orderpriority = '1-URGENT'
                 OR o_orderpriority = '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS high_line_count
     , SUM(CASE
              WHEN o_orderpriority <> '1-URGENT'
                 AND o_orderpriority <> '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS low_line_count
FROM orders
   , lineitem
WHERE o_orderkey = l_orderkey
AND l_shipmode IN ('MAIL', 'SHIP')
AND l_commitdate < l_receiptdate
AND l_shipdate < l_commitdate
AND l_receiptdate >= DATE '1994-01-01'
AND l_receiptdate < DATEADD(YEAR, 1, CAST('1994-01-01' AS DATE))
GROUP BY l_shipmode
ORDER BY l_shipmode;

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|XN Merge  (cost=1000000000354.23..1000000000354.23 rows=1 width=30)                                                                                                  |
|  Merge Key: derived_table1.grvar_1                                                                                                                                  |
|  ->  XN Network  (cost=1000000000354.23..1000000000354.23 rows=1 width=30)                                                                                          |
|        Send to leader                                                                                                                                               |
|        ->  XN Sort  (cost=1000000000354.23..1000000000354.23 rows=1 width=30)                                                                                       |
|              Sort Key: derived_table1.grvar_1                                                                                                                       |
|              ->  XN HashAggregate  (cost=354.21..354.22 rows=1 width=30)                                                                                            |
|                    ->  XN Seq Scan on mv_tbl__auto_mv_2000__0 derived_table1  (cost=0.00..349.12 rows=679 width=30)                                                 |
|                          Filter: ((grvar_2 < '1995-01-01'::date) AND (grvar_2 >= '1994-01-01'::date) AND ((grvar_1 = 'SHIP'::bpchar) OR (grvar_1 = 'MAIL'::bpchar)))|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

To demonstrate how AutoMVs can also improve the performance of similar queries, we change some of the filters on the original query. In the following example, we change the filter on l_shipmode from IN ('MAIL', 'SHIP') to IN ('TRUCK', 'RAIL', 'AIR'), and change the filter on l_receiptdate to the first 6 months of the previous year. The query runs in approximately 900 milliseconds and, looking at the explain plan, we confirm it’s using the AutoMV:

EXPLAIN SELECT /* TPC-H Q12 modified */
       l_shipmode
     , SUM(CASE
              WHEN o_orderpriority = '1-URGENT'
                 OR o_orderpriority = '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS high_line_count
     , SUM(CASE
              WHEN o_orderpriority <> '1-URGENT'
                 AND o_orderpriority <> '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS low_line_count
FROM orders
   , lineitem
WHERE o_orderkey = l_orderkey
AND l_shipmode IN ('TRUCK', 'RAIL', 'AIR')
AND l_commitdate < l_receiptdate
AND l_shipdate < l_commitdate
AND l_receiptdate >= DATE '1993-01-01'
AND l_receiptdate < DATE '1993-07-01'
GROUP BY l_shipmode
ORDER BY l_shipmode;

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|XN Merge  (cost=1000000000396.30..1000000000396.31 rows=1 width=30)                                                                                                                                |
|  Merge Key: derived_table1.grvar_1                                                                                                                                                                |
|  ->  XN Network  (cost=1000000000396.30..1000000000396.31 rows=1 width=30)                                                                                                                        |
|        Send to leader                                                                                                                                                                             |
|        ->  XN Sort  (cost=1000000000396.30..1000000000396.31 rows=1 width=30)                                                                                                                     |
|              Sort Key: derived_table1.grvar_1                                                                                                                                                     |
|              ->  XN HashAggregate  (cost=396.29..396.29 rows=1 width=30)                                                                                                                          |
|                    ->  XN Seq Scan on mv_tbl__auto_mv_2000__0 derived_table1  (cost=0.00..392.76 rows=470 width=30)                                                                               |
|                          Filter: ((grvar_2 < '1993-07-01'::date) AND (grvar_2 >= '1993-01-01'::date) AND ((grvar_1 = 'AIR'::bpchar) OR (grvar_1 = 'RAIL'::bpchar) OR (grvar_1 = 'TRUCK'::bpchar)))|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The AutoMV feature is transparent to users and is fully system managed. Therefore, unlike manual materialized views, AutoMVs are not visible to users and can’t be queried directly. They also don’t appear in any system tables like stv_mv_info or svl_mv_refresh_status.

Finally, if the AutoMV hasn’t been used for some time by the workload, it’s automatically dropped and the storage released. When we rerun the query after this, the runtime returns to the original 8 seconds because the query is now using the base tables. This can be confirmed by examining the explain plan.

This example illustrates that the AutoMV feature reduces the effort and time required to create and maintain materialized views.

Performance tests and results

To see how well AutoMVs work in practice, we ran tests using the 1 TB and 3 TB versions of the Cloud DW benchmark derived from TPC-H. This test consists of a power run script with 22 queries that is run three times with the results cache off. The tests were run with two different clusters: 4-node ra3.4xlarge and 2-node ra3.16xlarge with a concurrency of 1 and 5.

The Cloud DW benchmark is derived from the TPC-H benchmark. It isn’t comparable to published TPC-H results, because the results of our tests don’t fully comply with the specification.

The following table shows our results.

Suite | Scale | Cluster | Concurrency | Number of Queries | Elapsed Secs (AutoMV Off) | Elapsed Secs (AutoMV On) | % Improvement
TPC-H | 1 TB | 4-node ra3.4xlarge | 1 | 66 | 1046 | 913 | 13%
TPC-H | 1 TB | 4-node ra3.4xlarge | 5 | 330 | 3592 | 3191 | 11%
TPC-H | 3 TB | 2-node ra3.16xlarge | 1 | 66 | 1707 | 1510 | 12%
TPC-H | 3 TB | 2-node ra3.16xlarge | 5 | 330 | 6971 | 5650 | 19%

The AutoMV feature improved query performance by up to 19% without any manual intervention.

Summary

In this post, we first presented manual materialized views, their various features, and how to take advantage of them. We then looked into the effort and time required to design, create, and maintain materialized views to provide performance improvements in a data warehouse.

Next, we discussed how AutoMVs help overcome these challenges and seamlessly provide performance improvements for SQL queries and dashboards. We went deeper into the details of how AutoMVs work and discussed how ML algorithms determine which materialized views to create based on the predicted performance improvement and overall benefit they will provide compared to the cost required to create and maintain them. Then we covered some of the internal processing logic such as how predicate elevation creates generalized materialized views that can be used by a range of queries, not just the original query that triggered the materialized view creation.

Finally, we showed the results of a performance test on an industry benchmark where the AutoMV feature improved performance by up to 19%.

As we have demonstrated, automated materialized views provide performance improvements to a data warehouse without requiring any manual effort or specialized expertise. They transparently work in the background, optimizing your workload performance and automatically adapting when your workloads change.

Automated materialized views are enabled by default. We encourage you to monitor any performance improvements they have on your current clusters. If you’re new to Amazon Redshift, try the Getting Started tutorial and use the free trial to create and provision your first cluster and experiment with the feature.


About the Authors

Adam Gatt is a Senior Specialist Solution Architect for Analytics at AWS. He has over 20 years of experience in data and data warehousing and helps customers build robust, scalable and high-performance analytics solutions in the cloud.

Rahul Chaturvedi is an Analytics Specialist Solutions Architect at AWS. Prior to this role, he was a Data Engineer at Amazon Advertising and Prime Video, where he helped build petabyte-scale data lakes for self-serve analytics.

Achieve fine-grained data security with row-level access control in Amazon Redshift

Post Syndicated from Harshida Patel original https://aws.amazon.com/blogs/big-data/achieve-fine-grained-data-security-with-row-level-access-control-in-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the challenges with security is that enterprises want to provide fine-grained access control at the row level for sensitive data. You can do this by creating views or using different databases and schemas for different users. However, this approach isn’t scalable and becomes complex to maintain over time. Customers have asked us to simplify the process of securing their data by providing the ability to control granular access.

Row-level security (RLS) in Amazon Redshift is built on the foundation of role-based access control (RBAC). RLS allows you to control which users or roles can access specific records of data within tables, based on security policies that are defined at the database object level. This new RLS capability in Amazon Redshift enables you to dynamically filter existing rows of data in a table. This is in addition to column-level access control, where you can grant users permissions to a subset of columns. Now you can combine column-level access control with RLS policies to further restrict access to particular rows of visible columns.

In this post, we explore the row-level security features of Amazon Redshift and how you can use roles to simplify managing the privileges required by your end users.

Customer feedback

TrustLogix is a Norwest Venture Partners-backed cloud security startup in the data security governance space. TrustLogix delivers powerful monitoring, observability, audit, and fine-grained data entitlement capabilities that empower Amazon Redshift clients to implement data-centric security for their digital transformation initiatives.

“We’re excited about this new and deeper level of integration with Amazon Redshift. Our joint customers in security-forward and highly regulated sectors including financial services, healthcare, and pharmaceutical need to have incredibly fine-grained control over which users are allowed to access what data, and under which specific contexts. The new row-level security capabilities will allow our customers to precisely dictate data access controls based on their business entitlements while abstracting them away from the technical complexities. The new Amazon Redshift RLS capability will enable our joint customers to model policies at the business level, deploy and enforce them via a security-as-code model, ensuring secure and consistent access to their sensitive data.”

-Ganesh Kirti, founder and CEO of TrustLogix.

Overview of row-level security in Amazon Redshift

Row-level security allows you to restrict some records to certain users or roles, depending on the content of those records. With RLS, you can define policies to enforce fine-grained row-level access control. When creating RLS policies, you can specify expressions that control whether Amazon Redshift returns any existing rows in a table in a query. With RLS policies limiting access, you don't have to add or externalize additional conditions in your queries. You can attach multiple policies to a table, and a single policy can be attached to multiple tables, making the relationship many-to-many. Once attached, an RLS policy applies to a relation and a set of users or roles for SELECT, UPDATE, and DELETE operations. All attached RLS policies must together evaluate to true for a record to be returned by a query. The RBAC built-in role, security admin (secadmin), is responsible for managing the policies.

The following diagram illustrates the workflow.

With RLS, you can do the following:

  • Restrict row access based on roles – The security admin creates policies that define whether a role can access specific records of data within a table.
  • Combine multiple policies per user or role – Multiple policies can be defined per user or role, and all policies are applied with AND syntax.
  • Enhance granular access control – RLS is built on role-based access control and can work alongside column-level access control.
  • No access if no policy applied – All data access is blocked when there is no applicable policy on an RLS-protected table.
  • Enable row-level and column-level security on the table – In the following example, the user house is part of the role staff. When house queries the table, only one record pertaining to house is returned; the rest of the records are filtered as per the RLS policy. The sensitive column is also restricted, so users from the role staff can’t see this column. User cuddy is part of the role manager. When cuddy queries the employees table, all records and columns are returned.

Row-level security relevant use cases

With row-level security, many use cases for fine-grained access controls become possible. The following are just some of the many application use cases:

  • A global company with data analysts across different countries or regions can enforce restriction of data access to analysts based on geo location due to data compliance requirements.
  • A sales department can create a policy that allows them to restrict the access to sales performance information specific to a particular salesperson or region.
  • A payroll department can create an RLS policy to restrict access to look at an individual’s payroll, but managers need payroll information on their direct reports. Managers don’t need to know the details of payroll information for other departments.
  • A hospital can create an RLS policy that allows doctors and nurses to view data rows for their patients only.
  • A bank can create a policy to restrict access to financial data rows based on an employee’s business division or role in the company.
  • A multi-tenant application can create a policy to enforce a logical separation of each tenant’s data rows from every other tenant’s rows.

In the following example use cases, we illustrate enforcing an RLS policy on a fictitious healthcare setup. We demonstrate RLS on the medicine_data table and patients table, based on a policy established for managers, doctors, and departments. We also cover using a custom session variable context to set an RLS policy for the multi-tenant table customer.

To download the script and set up the tables, choose rls_createtable.sql.

Example 1: Read and write access

To grant read and write access, complete the following steps:

  1. Define four RLS policies using the secadmin role:
    1. all_can_see – No restrictions to be imposed
    2. hide_confidential – Restricts records for non-confidential rows
    3. only_doctors_can_see – Restricts records such that only doctors can see data
    4. see_only_own_department – Restricts records to only see data for own department
      CREATE RLS POLICY all_can_see
      USING ( true );
      
      CREATE RLS POLICY hide_confidential
      WITH ( confidential BOOLEAN )
      USING ( confidential = false )
      ;
      
      -- Note: the employees table is used as a lookup in this policy
      
      CREATE RLS POLICY only_doctors_can_see
      USING (
          true = (
                  SELECT employee_is_doctor
                  FROM employees
                  WHERE employee_username = current_user
                  )
          )
      ;
      
      GRANT SELECT ON employees
      TO RLS POLICY only_doctors_can_see;
      
      CREATE RLS POLICY see_only_own_department
      WITH ( patient_dept_id INTEGER )
      USING (
          patient_dept_id IN (
                              SELECT department_id
                              FROM employees_departments
                              WHERE employee_username = current_user
                              )
          )
      ;
      
      GRANT SELECT ON employees_departments 
      TO RLS POLICY see_only_own_department;

  2. Create three roles for STAFF, MANAGER, and EXTERNAL:
    CREATE ROLE staff;
    CREATE ROLE manager;
    CREATE ROLE external;

  3. Now we define column-level access control for the roles and columns that are implementing the RLS policy:
    1. The MANAGER role can access all columns in the patients and medicine_data tables, including the confidential column that the RLS policies are defined on:
      --- manager can see full table patients and medicine data
      GRANT SELECT ON employees, employees_departments, patients, medicine_data TO ROLE manager, ROLE external;

    2. The STAFF role can access all columns except the confidential column:
      --- staff can see limited columns from medicine data
      GRANT SELECT (medicine_name, medicine_price) ON medicine_data 
      TO ROLE staff;
      
      --- staff can see limited columns from patients
      GRANT SELECT (patient_dept_id, patient_name, patient_birthday, patient_medicine, diagnosis) ON patients TO ROLE staff;

  4. Attach RLS policies to the roles we created:
    --- manager can see all medicine data
    ATTACH RLS POLICY all_can_see
    ON medicine_data
    TO ROLE manager;
    
    --- manager can see all patient data
    ATTACH RLS POLICY all_can_see
    ON patients
    TO ROLE manager;
    
    --- staff cannot see confidential medicine data
    ATTACH RLS POLICY hide_confidential
    ON medicine_data
    TO ROLE staff;
    
    --- staff cannot see confidential patient data
    ATTACH RLS POLICY hide_confidential
    ON patients
    TO ROLE staff;
    
    --- only doctors can see patient data
    ATTACH RLS POLICY only_doctors_can_see 
    ON patients
    TO PUBLIC;
    
    --- regular staff (doctors) can see data for patients in their department only
    ATTACH RLS POLICY see_only_own_department 
    ON patients
    TO ROLE staff;

  5. Enable RLS security on objects:
    ALTER TABLE medicine_data ROW LEVEL SECURITY on;
    ALTER TABLE patients ROW LEVEL SECURITY on;

  6. Create the users and grant them roles:
    CREATE USER house PASSWORD DISABLE;
    CREATE USER cuddy PASSWORD DISABLE;
    CREATE USER external PASSWORD DISABLE;
    
    GRANT ROLE staff TO house;
    GRANT ROLE manager TO cuddy;
    GRANT ROLE external TO external;

We can see RLS in action with a SELECT query:

--- As Cuddy, who is a doctor and a manager
SET SESSION AUTHORIZATION 'cuddy';

SELECT * FROM medicine_data;
--- policies applied: all_can_see

SELECT * FROM patients;
--- policies applied: all_can_see, only_doctors_can_see

As a super user and secadmin, you can query the svv_rls_applied_policy to audit and monitor the policies applied. We discuss system views for auditing and monitoring more later in this post.
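
For example, after running the queries above, a superuser or the secadmin can run a minimal audit query like the following (selecting everything avoids assuming specific column names):

-- requires superuser or secadmin privileges
SELECT * FROM svv_rls_applied_policy;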

--- As House, who is a doctor but not a manager - he is staff in department id 1

SET SESSION AUTHORIZATION 'house';

SELECT * FROM medicine_data;
--- column level access control applied 

SELECT current_user, medicine_name, medicine_price FROM medicine_data;
--- CLS + RLS policy = hide_confidential

SELECT * FROM patients;
--- column level access control applied

SELECT current_user, patient_dept_id, patient_name, patient_birthday, patient_medicine, diagnosis FROM patients;
--- CLS + RLS policies = hide_confidential, only_doctors_can_see, see_only_own_department

--- As External, who has no permission granted
SET SESSION AUTHORIZATION 'external';

SELECT * FROM medicine_data;
--- RLS policy applied: none - so no access

SELECT * FROM patients;
--- policies applied: none - so no access

With the UPDATE command, the user house can update only the patients records permitted by the RLS policies for department 1:

SET SESSION AUTHORIZATION 'house';
UPDATE patients
SET diagnosis = 'house updated diagnosis';

select current_user, * from patients; --house should only be able to query department 1 non-confidential records

To test DELETE, as the user house, let's delete records from the patients table. Only the two non-confidential records in patient_dept_id 1 should be deleted, as per the RLS policy:

SET SESSION AUTHORIZATION 'house';
delete  from patients;

Because both the records that house has access to are deleted from patients, selecting from the table will return no records.

When we switch to the user cuddy, who is a manager and a doctor, we have access to confidential records and can see three records:

SET SESSION AUTHORIZATION 'cuddy';
SELECT current_user, * from patients;

As a security admin, you can detach a policy from a table, user, or role. In this example, we detach the policy hide_confidential from the table patients from role staff:

DETACH RLS POLICY hide_confidential ON patients FROM ROLE staff;

When the user house queries the patients table, they should now have access to confidential records:

SET SESSION AUTHORIZATION 'house';

SELECT * from patients;

Using the security admin role, you can drop the policy hide_confidential:

DROP RLS POLICY IF EXISTS hide_confidential;

Because the hide_confidential RLS policy is still attached to the medicine_data table, you get a dependency error.

To remove this policy from all the tables, users, and roles, you can use the cascade option:

DROP RLS POLICY IF EXISTS hide_confidential cascade;

When user house queries the medicine_data table, no records are returned, because the medicine_data table has RLS on and no RLS policy is attached to the role staff for this table.

SET SESSION AUTHORIZATION 'house';
SELECT * from MEDICINE_DATA;

Let’s turn off row-level security on the table medicine_data using the security admin role:

ALTER TABLE MEDICINE_DATA ROW LEVEL SECURITY OFF;
SET SESSION AUTHORIZATION 'house';

SELECT * FROM MEDICINE_DATA;

Example 2: Session context variables

Some applications require connection pooling, and you can use application-based user authentication instead of separate database users for each user. The session context variables feature in Amazon Redshift enables you to pass the application user ID to the database for applying role-based security.

Amazon Redshift now allows you to set a customized session context variable using set_config, and an RLS policy can reference that variable to provide this granular access.

In this example, we illustrate the use case when you have the common table customer, where you’re getting data from several customers. The table has a column with c_customer_id to distinguish data for respective customers.

  1. Create the external user and grant the external role:
    CREATE USER external_user PASSWORD 'Testemp1';
    grant role EXTERNAL to external_user;

  2. Grant SELECT on the customer table to role external:
    grant usage on schema report to role EXTERNAL;
    GRANT select ON TABLE report.customer TO ROLE EXTERNAL;

  3. Turn on row-level security for the report.customer table:
    ALTER TABLE report.customer row level security on;

  4. Create a row-level security policy using the session context variable app.customer_id to enforce the policy to filter records for c_customer_id:
    CREATE RLS POLICY see_only_own_customer_rows
    WITH ( c_customer_id char(16) )
    USING ( c_customer_id = current_setting('app.customer_id', FALSE));
    ATTACH RLS POLICY see_only_own_customer_rows ON report.customer TO ROLE EXTERNAL;

Now we can observe RLS in action. When you query the customer table with the session context set to customer ID AAAAAAAAJNGEGCBA, the row-level policy is enforced and returns only the one customer row that matches the session variable value:

SET SESSION AUTHORIZATION 'external_user';

select set_config('app.customer_id', 'AAAAAAAAJNGEGCBA', FALSE);
select * from report.customer limit 10;
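
Switching the session variable switches which rows are visible. A short sketch, using a second, hypothetical customer ID:

-- 'BBBBBBBBXXXXXXXX' is a hypothetical customer ID for illustration
select set_config('app.customer_id', 'BBBBBBBBXXXXXXXX', FALSE);
-- only rows with c_customer_id = 'BBBBBBBBXXXXXXXX' are returned (if any exist)
select * from report.customer limit 10;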

Auditing and monitoring RLS policies

Amazon Redshift has added several new system views for monitoring row-level policies. The following table lists each system view, the users and roles that can access it, and its function.

System View | Users | Function
SVV_RLS_POLICY | sys:secadmin | View a list of all row-level security policies created
SVV_RLS_RELATION | sys:secadmin | View a list of all RLS-protected relations
SVV_RLS_ATTACHED_POLICY | sys:secadmin | View a list of all relations and users that have one or more row-level security policies attached on the currently connected database
SVV_RLS_APPLIED_POLICY | Superuser, sys:operator, or any user with the system permission ACCESS SYSTEM TABLE | Trace the application of RLS policies on queries that reference RLS-protected relations
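
For example, as the secadmin role you can list every policy created so far with a simple select over SVV_RLS_POLICY:

-- lists all RLS policies on the cluster
SELECT * FROM svv_rls_policy;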

Conclusion

In this post, we demonstrated how you can simplify the management of row-level security for fine-grained access control of your sensitive data, building on the foundation of role-based access control. For more information about RLS best practices, refer to the Amazon Redshift security overview. Try out RLS for your future Amazon Redshift implementations, and feel free to leave a comment about your use cases and experience.

Amazon Redshift Spectrum supports row-level, column-level, and cell-level access control for data stored in Amazon Simple Storage Service (Amazon S3) and managed by AWS Lake Formation. In a future post, we will show how you can implement row-level security for Redshift Spectrum tables using Lake Formation.


About the authors

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics, with AWS.

Milind Oke is a Senior Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over two decades and specializes in Amazon Redshift.

Abhilash Nagilla is a Specialist Solutions Architect, Analytics, with AWS.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager. She has rich experience of how the customer-facing Amazon Redshift features are built from planning to launching, and always treats customers’ requirements as first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Kiran Chinta is a Software Development Manager at Amazon Redshift. He leads a strong team in query processing, SQL language, data security, and performance. Kiran is passionate about delivering products that seamlessly integrate with customers’ business applications with the right ease of use and performance. In his spare time, he enjoys reading and playing tennis.

Debu Panda is a Senior Manager, Product Management, with AWS. He is an industry leader in analytics, application platforms, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases, and has presented at multiple conferences such as AWS re:Invent, Oracle OpenWorld, and JavaOne. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Amazon Redshift Serverless – Now Generally Available with New Capabilities

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-redshift-serverless-now-generally-available-with-new-capabilities/

Last year at re:Invent, we introduced the preview of Amazon Redshift Serverless, a serverless option of Amazon Redshift that lets you analyze data at any scale without having to manage data warehouse infrastructure. You just need to load and query your data, and you pay only for what you use. This allows more companies to build a modern data strategy, especially for use cases where analytics workloads are not running 24-7 and the data warehouse is not active all the time. It is also applicable to companies where the use of data expands within the organization and users in new departments want to run analytics without having to take ownership of data warehouse infrastructure.

Today, I am happy to share that Amazon Redshift Serverless is generally available and that we added many new capabilities. We are also reducing Amazon Redshift Serverless compute costs compared to the preview.

You can now create multiple serverless endpoints per AWS account and Region using namespaces and workgroups:

  • A namespace is a collection of database objects and users, such as database name and password, permissions, and encryption configuration. This is where your data is managed and where you can see how much storage is used.
  • A workgroup is a collection of compute resources, including network and security settings. Each workgroup has a serverless endpoint to which you can connect your applications. When configuring a workgroup, you can set up private or publicly accessible endpoints.

Each namespace can have only one workgroup associated with it. Conversely, each workgroup can be associated with only one namespace. You can have a namespace without any workgroup associated with it, for example, to use it only for sharing data with other namespaces in the same or another AWS account or Region.
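
If you prefer scripting over the console, the AWS CLI exposes the same two concepts. The following is a minimal sketch, assuming the hypothetical names my-namespace and my-workgroup (base capacity is measured in Redshift processing units, discussed later in this post):

# Create the namespace (storage, users, encryption settings)
aws redshift-serverless create-namespace \
    --namespace-name my-namespace

# Create the workgroup (compute) and associate it with the namespace
aws redshift-serverless create-workgroup \
    --workgroup-name my-workgroup \
    --namespace-name my-namespace \
    --base-capacity 32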

In your workgroup configuration, you can now use query monitoring rules to help keep your costs under control. Also, Amazon Redshift Serverless now scales data warehouse capacity more intelligently to deliver fast performance for demanding and unpredictable workloads.

Let’s see how this works with a quick demo. Then, I’ll show you what you can do with namespaces and workgroups.

Using Amazon Redshift Serverless
In the Amazon Redshift console, I select Redshift serverless in the navigation pane. To get started, I choose Use default settings to configure a namespace and a workgroup with the most common options. For example, I’ll be able to connect using my default VPC and default security group.

With the default settings, the only option left to configure is Permissions. Here, I can specify how Amazon Redshift can interact with other services such as S3, Amazon CloudWatch Logs, Amazon SageMaker, and AWS Glue. To load data later, I give Amazon Redshift access to an S3 bucket. I choose Manage IAM roles and then Create IAM role.

When creating the IAM role, I select the option to give access to specific S3 buckets and pick an S3 bucket in the same AWS Region. Then, I choose Create IAM role as default to complete the creation of the role and to automatically use it as the default role for the namespace.

I choose Save configuration, and after a few minutes the database is ready for use. In the Serverless dashboard, I choose Query data to open the Amazon Redshift query editor v2. There, I follow the instructions in the Amazon Redshift Database Developer Guide to load a sample database. If you want to do a quick test, a few sample databases (including the one I am using here) are already available in the sample_data_dev database. Note also that loading data into Amazon Redshift is not required for running queries: I can use data from an S3 data lake in my queries by creating an external schema and an external table.
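
For reference, here is what such an external schema and table could look like. This is a hedged sketch, assuming a hypothetical AWS Glue Data Catalog database named my_datalake_db and Parquet files under the placeholder S3 prefix:

-- Register a Glue Data Catalog database as an external schema
-- (my_datalake and my_datalake_db are hypothetical names)
CREATE EXTERNAL SCHEMA my_datalake
FROM DATA CATALOG
DATABASE 'my_datalake_db'
IAM_ROLE default;

-- Define an external table over Parquet files in Amazon S3
CREATE EXTERNAL TABLE my_datalake.page_views (
    view_time TIMESTAMP,
    user_id   INTEGER,
    page_url  VARCHAR(256)
)
STORED AS PARQUET
LOCATION 's3://MYBUCKET/page_views/';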

The sample database consists of seven tables and tracks sales activity for a fictional “TICKIT” website, where users buy and sell tickets for sporting events, shows, and concerts.

Sample database tables relations

To configure the database schema, I run a few SQL commands to create the users, venue, category, date, event, listing, and sales tables.

Then, I download the tickitdb.zip file that contains the sample data for the database tables. I unzip and load the files to a tickit folder in the same S3 bucket I used when configuring the IAM role.

Now, I can use the COPY command to load the data from the S3 bucket into my database. For example, to load data into the users table:

copy users from 's3://MYBUCKET/tickit/allusers_pipe.txt' iam_role default;

The file containing the data for the sales table uses tab-separated values:

copy sales from 's3://MYBUCKET/tickit/sales_tab.txt' iam_role default delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS';
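
If a COPY command fails or loads fewer rows than expected, the STL_LOAD_ERRORS system table is the first place to look; for example:

-- shows the most recent load errors, including the offending file and reason
select starttime, filename, line_number, err_reason
from stl_load_errors
order by starttime desc
limit 5;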

After I load data in all tables, I start running some queries. For example, the following query joins five tables to find the top five sellers for events based in California (note that the sample data is for the year 2008):

select sellerid, username, (firstname ||' '|| lastname) as sellername, venuestate, sum(qtysold)
from sales, date, users, event, venue
where sales.sellerid = users.userid
and sales.dateid = date.dateid
and sales.eventid = event.eventid
and event.venueid = venue.venueid
and year = 2008
and venuestate = 'CA'
group by sellerid, username, sellername, venuestate
order by 5 desc
limit 5;

Now that my database is ready, let’s see what I can do by configuring Amazon Redshift Serverless namespaces and workgroups.

Using and Configuring Namespaces
Namespaces are collections of database objects and data, together with their security configurations. In the navigation pane of the Amazon Redshift console, I choose Namespace configuration. In the list, I choose the default namespace that I just created.

In the Data backup tab, I can create or restore a snapshot or restore data from one of the recovery points that are automatically created every 30 minutes and kept for 24 hours. That can be useful to recover data in case of accidental writes or deletes.

Console screenshot.

In the Security and encryption tab, I can update permissions and encryption settings, including the AWS Key Management Service (AWS KMS) key used to encrypt and decrypt my resources. In this tab, I can also enable audit logging and export the user, connection, and user activity logs.

Console screenshot.

In the Datashares tab, I can create a datashare to share data with other namespaces and AWS accounts in the same or different Regions. In this tab, I can also create a database from a share I receive from other namespaces or AWS accounts, and I can see the subscriptions for datashares managed by AWS Data Exchange.

Console screenshot.

When I create a datashare, I can select which objects to include. For example, here I want to share only the date and event tables because they don’t contain sensitive data.

Console screenshot.
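
The SQL equivalent is a short sketch like the following (the datashare name and the consumer namespace ID are placeholders):

CREATE DATASHARE tickit_share;
ALTER DATASHARE tickit_share ADD SCHEMA public;
ALTER DATASHARE tickit_share ADD TABLE public.date;
ALTER DATASHARE tickit_share ADD TABLE public.event;
GRANT USAGE ON DATASHARE tickit_share TO NAMESPACE 'consumer-namespace-id';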

Using and Configuring Workgroups
Workgroups are collections of compute resources and their network and security settings. They provide the serverless endpoint for the namespace they are configured for. In the navigation pane of the Amazon Redshift console, I choose Workgroup configuration. In the list, I choose the default workgroup that I just created.

In the Data access tab, I can update the network and security settings (for example, change the VPC, the subnets, or the security group) or make the endpoint publicly accessible. In this tab, I can also enable Enhanced VPC routing to route network traffic between my serverless database and the data repositories I use (for example, the S3 buckets used to load or unload data) through a VPC instead of the internet. To access serverless endpoints that are in another VPC or subnet, I can create a VPC endpoint managed by Amazon Redshift.

Console screenshot.

In the Limits tab, I can configure the base capacity (expressed in Redshift processing units, or RPUs) used to process my queries. Amazon Redshift Serverless scales the capacity to deal with a higher number of users. Here I also have the option to increase the base capacity to speed up my queries or decrease it to reduce costs.

In this tab, I can also set Usage limits to configure daily, weekly, and monthly thresholds to keep my costs predictable. For example, I configured a daily limit of 200 RPU-hours, and a monthly limit of 2,000 RPU-hours for my compute resources. To control the data-transfer costs for cross-Region datashares, I configured a daily limit of 3 TB and a weekly limit of 10 TB. Finally, to limit the resources used by each query, I use Query limits to time out queries running for more than 60 seconds.

Console screenshot.

Availability and Pricing
Amazon Redshift Serverless is generally available today in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (London), and Europe (Stockholm) AWS Regions.

You can connect to a workgroup endpoint using your favorite client tools via JDBC/ODBC or with the Amazon Redshift query editor v2, a web-based SQL client application available on the Amazon Redshift console. When using web services-based applications (such as AWS Lambda functions or Amazon SageMaker notebooks), you can access your database and perform queries using the built-in Amazon Redshift Data API.

With Amazon Redshift Serverless, you pay only for the compute capacity your database consumes when active. The compute capacity scales up or down automatically based on your workload and shuts down during periods of inactivity to save time and costs. Your data is stored in managed storage, and you pay a GB-month rate.

To give you improved price performance and the flexibility to use Amazon Redshift Serverless for an even broader set of use cases, we are lowering the price from $0.5 to $0.375 per RPU-hour for the US East (N. Virginia) Region. Similarly, we are lowering the price in other Regions by an average of 25 percent from the preview price. For more information, see the Amazon Redshift pricing page.
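
As a worked example at the new rate, the 2,000 RPU-hour monthly limit configured earlier would cap compute charges at 2,000 × $0.375 = $750 per month in that Region, excluding managed storage.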

To help you get practice with your own use cases, we are also providing $300 in AWS credits for 90 days to try Amazon Redshift Serverless. These credits are used to cover your costs for compute, storage, and snapshot usage of Amazon Redshift Serverless only.

Get insights from your data in seconds with Amazon Redshift Serverless.

Danilo

Accelerate machine learning with AWS Data Exchange and Amazon Redshift ML

Post Syndicated from Yadgiri Pottabhathini original https://aws.amazon.com/blogs/big-data/accelerate-machine-learning-with-aws-data-exchange-and-amazon-redshift-ml/

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. Redshift ML allows you to use your data in Amazon Redshift with Amazon SageMaker, a fully managed ML service, without requiring you to become an expert in ML.

AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. AWS Data Exchange for Amazon Redshift enables you to access and query tables in Amazon Redshift without the need to extract, transform, and load (ETL) files.

As a subscriber, you can browse the AWS Data Exchange catalog, find data products relevant to your business with data stored in Amazon Redshift, and subscribe to the data from the providers without any further processing or need for an ETL process.

If the provider data is not already available in Amazon Redshift, many providers will add the data to Amazon Redshift upon request.

In this post, we show you the process of subscribing to datasets through AWS Data Exchange without ETL, running ML algorithms on an Amazon Redshift cluster, and performing inference locally in production.

Solution overview

The use case for the solution in this post is to predict ticket sales for worldwide events based on historical ticket sales data using a regression model. The data or ETL engineer can build the data pipeline by subscribing to the Worldwide Event Attendance product on AWS Data Exchange without ETL. You can then create the ML model in Redshift ML using the time series ticket sales data and predict future ticket sales.

To implement this solution, you complete the following high-level steps:

  1. Subscribe to datasets using AWS Data Exchange for Amazon Redshift.
  2. Connect to the datashare in Amazon Redshift.
  3. Create the ML model using the SQL notebook feature of the Amazon Redshift query editor V2.

The following diagram illustrates the solution architecture.

Prerequisites

Before starting this walkthrough, you must complete the following prerequisites:

  1. Make sure you have an existing Amazon Redshift cluster with the RA3 node type. If not, you can create a provisioned Amazon Redshift cluster.
  2. Make sure the Amazon Redshift cluster is encrypted, because the data provider's cluster is encrypted and Amazon Redshift data sharing requires homogeneous encryption configurations. For more details on homogeneous encryption, refer to Data sharing considerations in Amazon Redshift.
  3. Create an AWS Identity and Access Management (IAM) role with access to SageMaker and Amazon Simple Storage Service (Amazon S3) and attach it to the Amazon Redshift cluster. Refer to Cluster setup for using Amazon Redshift ML for more details.
  4. Create an S3 bucket for storing the training data and model output.

Please note that some of the AWS resources in this walkthrough incur charges. Remember to delete the resources when you're finished.

Subscribe to an AWS Data Exchange product with Amazon Redshift data

To subscribe to an AWS Data Exchange public dataset, complete the following steps:

  1. On the AWS Data Exchange console, choose Explore available data products.
  2. In the navigation pane, under Data available through, select Amazon Redshift to filter products with Amazon Redshift data.
  3. Choose Worldwide Event Attendance (Test Product).
  4. Choose Continue to subscribe.
  5. Confirm the catalog is subscribed by checking that it’s listed on the Subscriptions page.

Predict tickets sold using Redshift ML

To set up prediction using Redshift ML, complete the following steps:

  1. On the Amazon Redshift console, choose Datashares in the navigation pane.
  2. On the Subscriptions tab, confirm that the AWS Data Exchange datashare is available.
  3. In the navigation pane, choose Query editor v2.
  4. Connect to your Amazon Redshift cluster in the navigation pane.

Amazon Redshift provides a feature to create notebooks and run your queries; the notebook feature is currently in preview and available in all Regions. For the remaining part of this tutorial, we run queries in the notebook and create comments in the markdown cell for each step.

  1. Choose the plus sign and choose Notebook.
  2. Choose Add markdown.
  3. Enter Show available data share in the cell.
  4. Choose Add SQL.
  5. Enter and run the following command to see the available datashares for the cluster:
    SHOW datashares;

You should be able to see worldwide_event_test_data, as shown in the following screenshot.

  1. Note the producer_namespace and producer_account values from the output, which we use in the next step.
  2. Choose Add markdown and enter Create database from datashare with producer_namespace and producer_account.
  3. Choose Add SQL and enter the following code to create a database to access the datashare. Use the producer_namespace and producer_account values you copied earlier.
    CREATE DATABASE ml_blog_db FROM DATASHARE worldwide_event_test_data OF ACCOUNT 'producer_account' NAMESPACE 'producer_namespace';

  4. Choose Add markdown and enter Create new table to consolidate features.
  5. Choose Add SQL and enter the following code to create a new table called event consisting of the event sales by date and assign a running serial number to split into training and validation datasets:
    CREATE TABLE event AS
    	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday, ROW_NUMBER() OVER (ORDER BY RANDOM()) r
    	FROM "ml_blog_db"."public"."sales" s
    	INNER JOIN "ml_blog_db"."public"."event" e
    	ON s.eventid = e.eventid
    	INNER JOIN "ml_blog_db"."public"."date" d
    	ON s.dateid = d.dateid;

Event name, quantity sold, sale time, day, week, month, quarter, year, and holiday are columns in the dataset from AWS Data Exchange that are used as features in the ML model creation.

  1. Choose Add markdown and enter Split the dataset into training dataset and validation dataset.
  2. Choose Add SQL and enter the following code to split the dataset into training and validation datasets:
    CREATE TABLE training_data AS 
    	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday
    	FROM event
    	WHERE r >
    	(SELECT COUNT(1) * 0.2 FROM event);
    
    	CREATE TABLE validation_data AS 
    	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday
    	FROM event
    	WHERE r <=
    	(SELECT COUNT(1) * 0.2 FROM event);

  3. Choose Add markdown and enter Create ML model.
  4. Choose Add SQL and enter the following command to create the model. Replace the your_s3_bucket parameter with your bucket name.
    CREATE MODEL predict_ticket_sold
    	FROM training_data
    	TARGET qtysold
    	FUNCTION predict_ticket_sold
    	IAM_ROLE default
    	PROBLEM_TYPE regression
    	OBJECTIVE 'mse'
    	SETTINGS (s3_bucket 'your_s3_bucket',
    	s3_garbage_collect off,
    	max_runtime 5000);

Note: It can take up to two hours to create and train the model.
The following screenshot shows the example output from adding our markdown and SQL.

  1. Choose Add markdown and enter Show model creation status. Continue to the next step once the Model State has changed to Ready.
  2. Choose Add SQL and enter the following command to get the status of the model creation:
    SHOW MODEL predict_ticket_sold;

Move to the next step after the Model State has changed to READY.

  1. Choose Add markdown and enter Run the inference for eventname Jason Mraz.
  2. When the model is ready, you can use the SQL function to apply the ML model to your data. The following is sample SQL code to predict the tickets sold for a particular event using the predict_ticket_sold function created in the previous step:
    SELECT eventname,
    	predict_ticket_sold(
    	eventname, saletime, day, week, month, qtr, year, holiday ) AS predicted_qty_sold,
    	day, week, month
    	FROM event
    	WHERE eventname = 'Jason Mraz';

The following is the output received by applying the ML function predict_ticket_sold on the original dataset. The output of the ML function is captured in the field predicted_qty_sold, which is the predicted ticket sold quantity.
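
As a quick sanity check (a sketch, not part of the original walkthrough), you can also compare predicted and actual quantities side by side:

SELECT eventname,
       SUM(qtysold) AS actual_qty_sold,
       SUM(predict_ticket_sold(eventname, saletime, day, week, month, qtr, year, holiday)) AS predicted_qty_sold
FROM event
GROUP BY eventname
ORDER BY actual_qty_sold DESC
LIMIT 10;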

Share notebooks

To share the notebooks, complete the following steps:

  1. Create an IAM role with the managed policy AmazonRedshiftQueryEditorV2FullAccess attached to the role.
  2. Add a principal tag to the role with the tag name sqlworkbench-team.
  3. Set the value of this tag to the principal (user, group, or role) you’re granting access to.
  4. After you configure these permissions, navigate to the Amazon Redshift console and choose Query editor v2 in the navigation pane. If you haven’t used the query editor v2 before, please configure your account to use query editor v2.
  5. Choose Notebooks in the left pane and navigate to My notebooks.
  6. Right-click on the notebook you want to share and choose Share with my team.
  7. You can confirm that the notebook is shared by choosing Shared to my team and checking that the notebook is listed.

Summary

In this post, we showed you how to build an end-to-end pipeline by subscribing to a public dataset through AWS Data Exchange, simplifying data integration and processing, and then running prediction using Redshift ML on the data.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Yadgiri Pottabhathini is a Sr. Analytics Specialist Solutions Architect. His role is to assist customers in their cloud data warehouse journey and help them evaluate and align their data analytics business objectives with Amazon Redshift capabilities.

Ekta Ahuja is an Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Data warehouse and business intelligence technology consolidation using AWS

Post Syndicated from Bappaditya Datta original https://aws.amazon.com/blogs/architecture/data-warehouse-and-business-intelligence-technology-consolidation-using-aws/

Organizations have been using data warehouse and business intelligence (DWBI) workloads to support business decision making for many years. These workloads are brought to the Amazon Web Services (AWS) platform to take advantage of the benefits of the AWS Cloud. However, because these workloads are built using multiple vendors' tools and technologies, customers face the burden of administrative overhead.

This post provides architectural guidance to consolidate multiple DWBI technologies onto AWS managed services to help reduce administrative overhead, bring operational ease, and improve business efficiency. Two scenarios are explored:

  1. Upstream transactional databases are already on AWS
  2. Upstream transactional databases are present in an on-premises data center

Challenges faced by an organization

Organizations are engaged in managing multiple DWBI technologies due to acquisitions, mergers, and the lift-and-shift of workloads. These workloads use extract, transform, and load (ETL) tools to read relational data from upstream transactional databases, process it, and store it in a data warehouse. Thereafter, these workloads use business intelligence tools to generate valuable insight and present it to users in form of reports and dashboards.

These DWBI technologies are generally installed and maintained on their own servers. As Figure 1 demonstrates, this not only increases administrative overhead for the organization but also creates challenges in maintaining the team's overall knowledge.

DWBI workload with multiple tools

Figure 1. DWBI workload with multiple tools

Therefore, organizations are looking to consolidate technology usage and continue supporting important business functions.

Scenario 1

As we know, the three major functions of a DWBI workstream are:

  • ETL data using a tool
  • Store/manage the data in a data warehouse
  • Generate information from the data using business intelligence

Each of these functions can be performed efficiently using an AWS service. For example, AWS Glue can be used for ETL, Amazon Redshift for data warehouse, and Amazon QuickSight for business intelligence.

With the use of these AWS services, organizations will be able to consolidate their DWBI technology usage. They will also be able to adapt to these services quickly, as their engineering teams can apply their existing DWBI knowledge, for example, using SQL knowledge in AWS Glue jobs with SparkSQL, in Amazon Redshift queries, and in Amazon QuickSight dashboards.

Figure 2 demonstrates the architecture of Figure 1 redesigned using AWS services. In this architecture, ETL functions are consolidated in AWS Glue. An AWS Glue crawler is used to auto-catalogue the source and target table metadata; then, AWS Glue ETL jobs use these catalogues to read data from the source and write to the target (data warehouse). AWS Glue jobs also apply necessary transformations (such as join, filter, and aggregate) to the data before writing. Additionally, an AWS Glue trigger is used to schedule the job executions. Alternatively, Amazon Managed Workflows for Apache Airflow can be used to schedule jobs.

Consolidated workload with source on AWS

Figure 2. Consolidated workload with source on AWS

Similarly, the data warehousing function is consolidated with Amazon Redshift, which is used to store and organize enriched data and to enforce appropriate data access control for both workloads and users.

Lastly, business intelligence functions are consolidated using Amazon QuickSight. It is used to create the dashboards that source data from Amazon Redshift and apply complex business logic to produce the charts and graphs needed for business insights. It is also used to implement the necessary access restrictions to dashboards and data.

Scenario 2

In situations where the source databases are in an on-premises data center, the overall solution is similar to Scenario 1, with an additional step to move the data continually from the on-premises database to an Amazon Simple Storage Service (Amazon S3) bucket. The data movement can be efficiently handled by AWS Database Migration Service (AWS DMS).

To make the source database accessible to AWS DMS, a connection needs to be established between the AWS Cloud and the on-premises network. Based on performance and throughput needs, the organization can choose either AWS Direct Connect or AWS Site-to-Site VPN to securely move the data. For the purposes of this discussion, we are considering AWS Direct Connect.

In Figure 3, an AWS DMS task is used to perform a full load followed by change data capture to continuously move the data to an S3 bucket. In this scenario, AWS Glue is used to catalogue and read the data from the S3 bucket. The remaining portion of the dataflow is the same as the one described in Scenario 1.

Consolidated workload with source at datacenter

Figure 3. Consolidated workload with source at datacenter

Scaling

Both of the updated architectures provide necessary scaling:

  • The auto scaling feature can be used to scale AWS Glue ETL job resources up or down
  • Concurrency scaling feature can be used to support virtually unlimited concurrent users and queries in Amazon Redshift
  • Amazon QuickSight resources (web server, Amazon QuickSight engine, and SPICE) are auto scaled by design

Security, monitoring, and auditing

The updated architectures also provide the necessary security by using access control, data encryption at rest and in transit, monitoring, and auditing.

Additionally, both Amazon Redshift and Amazon QuickSight provide their own authentication and access controls, so a user can be either a local user or a federated one. With these authentication options, an organization can control access to data in Amazon Redshift as well as access to dashboards in Amazon QuickSight.
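
In Amazon Redshift, for example, this boils down to standard SQL grants (the user, password, and schema names below are hypothetical):

CREATE USER bi_analyst PASSWORD 'Str0ngPassw0rd';
GRANT USAGE ON SCHEMA sales_mart TO bi_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_mart TO bi_analyst;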

Conclusion

In this blog post, we discussed how AWS Glue, Amazon Redshift, and Amazon QuickSight can be used to consolidate DWBI technologies. We also discussed how this architecture can help an organization build a scalable, secure workload with auto scaling, access control, log monitoring, and activity auditing.

Ready to get started?

Use Amazon CloudWatch as a destination for Amazon Redshift audit logs

Post Syndicated from Nita Shah original https://aws.amazon.com/blogs/big-data/using-aws-cloudwatch-as-destination-for-amazon-redshift-audit-logs/

Amazon Redshift is a fast, scalable, secure, and fully-managed cloud data warehouse that makes it simple and cost-effective to analyze all of your data using standard SQL. Amazon Redshift has comprehensive security capabilities to satisfy the most demanding requirements. To help you to monitor the database for security and troubleshooting purposes, Amazon Redshift logs information about connections and user activities in your database. This process is called database auditing.

Amazon Redshift Audit Logging is good for troubleshooting, monitoring, and security purposes, making it possible to determine suspicious queries by checking the connection and user logs to see who is connecting to the database. It gives information such as the IP address of the user's computer, the type of authentication used by the user, and the timestamp of the request. Audit logs make it easy to identify who modified the data. Amazon Redshift logs all of the SQL operations, including connection attempts, queries, and changes to your data warehouse. These logs can be accessed via SQL queries against system tables, saved to a secure Amazon Simple Storage Service (Amazon S3) location, or exported to Amazon CloudWatch. You can view your Amazon Redshift cluster's operational metrics on the Amazon Redshift console, use CloudWatch, and query Amazon Redshift system tables directly from your cluster.

This post will walk you through the process of configuring CloudWatch as an audit log destination. It will also show that the latency of log delivery to either Amazon S3 or CloudWatch is reduced to just a few minutes using enhanced Amazon Redshift audit logging. You can enable audit logging to Amazon CloudWatch via the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the Amazon Redshift API.

Solution overview

Amazon Redshift logs information to two locations: system tables and log files.

  1. System tables: Amazon Redshift logs data to system tables automatically, and history data is available for two to five days based on log usage and available disk space. To extend the log data retention period in system tables, use the Amazon Redshift system object persistence utility from AWS Labs on GitHub. Analyzing logs through system tables requires Amazon Redshift database access and compute resources.
  2. Log files: Audit logging to CloudWatch or to Amazon S3 is an optional process. When you turn on logging on your cluster, you can choose to export audit logs to Amazon CloudWatch or Amazon S3. Once enabled, audit logging captures data from that point onward, and each logging update is a continuation of the previous one. Access to audit log files doesn’t require access to the Amazon Redshift database, and reviewing logs stored in Amazon S3 doesn’t require database computing resources. Audit log files are stored indefinitely in CloudWatch Logs or Amazon S3 by default.

Amazon Redshift logs information in the following log files:

  • Connection log – Provides information to monitor users connecting to the database and related connection information, such as the user's IP address.
  • User log – Logs information about changes to database user definitions.
  • User activity log – Tracks information about the types of queries that both the users and the system perform in the database. It's useful primarily for troubleshooting purposes.
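
The same connection information can also be queried directly from the system tables mentioned above; for example, the following sketch pulls recent authentication events from STL_CONNECTION_LOG:

SELECT recordtime, event, remotehost, username, dbname
FROM stl_connection_log
WHERE event = 'authenticated'
ORDER BY recordtime DESC
LIMIT 20;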

Benefits of enhanced audit logging

For a better customer experience, the existing architecture of the audit logging solution has been improved to make audit logging more consistent across AWS services. This new enhancement will reduce log export latency from hours to minutes with a fine grain of access control. Enhanced audit logging improves the robustness of the existing delivery mechanism, thus reducing the risk of data loss. Enhanced audit logging will let you export logs either to Amazon S3 or to CloudWatch.

The following section will show you how to configure audit logging using CloudWatch and its benefits.

Setting up CloudWatch as a log destination

Using CloudWatch to view logs is a recommended alternative to storing log files in Amazon S3. It's simple to configure and it may suit your monitoring requirements, especially if you already use it to monitor other services and applications.

To set up CloudWatch as your log destination, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
    This page lists the clusters in your account in the current Region. A subset of properties of each cluster is also displayed.
  2. Choose the cluster where you want to configure CloudWatch logs.

  3. On the Properties tab, choose to edit audit logging.
  4. Turn on Configure audit logging, and choose CloudWatch as the log export type.
  5. Choose Save changes.

Analyzing audit log in near real-time

To run SQL commands, we use the Amazon Redshift query editor v2, a web-based tool that you can use to explore, analyze, share, and collaborate on data stored in Amazon Redshift. However, you can use any client tool of your choice to run SQL queries.

Now we’ll run some simple SQLs and analyze the logs in CloudWatch in near real-time.

  1. Run test SQL statements to create and drop a user (for example, the statements shown after this list).
  2. On the AWS Management Console, choose CloudWatch under Services, and then select Log groups in the navigation pane.
  3. Select the userlog log group to see the user logs created in near real-time in CloudWatch for the test user that we just created and dropped.
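
The test statements from step 1 can be as simple as the following (the user name and password are arbitrary):

CREATE USER audit_test_user PASSWORD 'TestPassw0rd';
DROP USER audit_test_user;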

Benefits of using CloudWatch as a log destination

  • It’s easy to configure, as it doesn’t require you to modify bucket policies.
  • It’s easy to view logs and search through logs for specific errors, patterns, fields, etc.
  • You can have a centralized log solution across all AWS services.
  • No need to build a custom solution such as AWS Lambda or Amazon Athena to analyze the logs.
  • Logs will appear in near real-time.
  • It has improved log latency from hours to just minutes.
  • By default, log groups are encrypted in CloudWatch and you also have the option to use your own custom key.
  • Fine-grained configuration of which log types to export based on your specific auditing requirements.
  • It lets you export log groups’ logs to Amazon S3 if needed.

Setting up Amazon S3 as a log destination

Although using CloudWatch as a log destination is the recommended approach, you also have the option to use Amazon S3 as a log destination. When the log destination is set to an Amazon S3 location, enhanced audit logging checks for new logs every 15 minutes and exports them to Amazon S3. You can configure audit logging on Amazon S3 as a log destination from the console or through the AWS CLI.

Once you save the changes, the bucket policy is set as follows, using the Amazon Redshift service principal.

For additional details please refer to Amazon Redshift audit logging.

For enabling logging through the AWS CLI, refer to db-auditing-cli-api.

Cost

Exporting logs to Amazon S3 can be more cost-efficient. However, considering all of the benefits that CloudWatch provides for search, real-time access to data, and building dashboards from search results, it can better suit those who perform frequent log analysis.

Best practices

Amazon Redshift uses the AWS security frameworks to implement industry-leading security in the areas of authentication, access control, auditing, logging, compliance, data protection, and network security. For more information, refer to Security in Amazon Redshift.

Audit logging to CloudWatch or to Amazon S3 is an optional process, but to have the complete picture of your Amazon Redshift usage, we always recommend enabling audit logging, particularly in cases where there are compliance requirements.

Log data is stored indefinitely in CloudWatch Logs or Amazon S3 by default. This may incur high, unexpected costs. We recommend that you configure how long to store log data in a log group or Amazon S3 to balance costs with compliance retention requirements. Apply the right compression to reduce the log file size.

Conclusion

This post demonstrated how to get near real-time Amazon Redshift logs using CloudWatch as a log destination using enhanced audit logging. This new functionality helps make Amazon Redshift Audit logging easier than ever, without the need to implement a custom solution to analyze logs. We also demonstrated how the new enhanced audit logging reduces log latency significantly on Amazon S3 with fine-grained access control compared to the previous version of audit logging.

Unauthorized access is a serious problem for most systems. As an administrator, you can start exporting logs to prevent any future occurrence of things such as system failures, outages, corruption of information, and other security risks.


About the Authors

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Evgenii Rublev is a Software Development Engineer on the Amazon Redshift team. He has worked on building end-to-end applications for over 10 years. He is passionate about innovations in building high-availability and high-performance applications to drive a better customer experience. Outside of work, Evgenii enjoys spending time with his family, traveling, and reading books.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager, has rich experience in how customer-facing Amazon Redshift features are built from planning to launch, and always treats customers’ requirements as the first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Ryan Liddle is a Software Development Engineer on the Amazon Redshift team. His current focus is on delivering new features and behind the scenes improvements to best service Amazon Redshift customers. On the weekend he enjoys reading, exploring new running trails and discovering local restaurants.

Migrate from Snowflake to Amazon Redshift using AWS Glue Python shell

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/migrate-from-snowflake-to-amazon-redshift-using-aws-glue-python-shell/

As the most widely used cloud data warehouse, Amazon Redshift makes it simple and cost-effective to analyze your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to analyze exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics without having to manage the data warehouse infrastructure. It natively integrates with other AWS services, facilitating the process of building enterprise-grade analytics applications in a manner that is not only cost-effective, but also avoids point solutions.

We are continuously innovating and releasing new features of Amazon Redshift, enabling the implementation of a wide range of data use cases and meeting requirements with performance and scale. For example, Amazon Redshift Serverless allows you to run and scale analytics workloads without having to provision and manage data warehouse clusters. Other features that help power analytics at scale with Amazon Redshift include automatic concurrency scaling for read and write queries, automatic workload management (WLM) for concurrency scaling, automatic table optimization, the new RA3 instances with managed storage to scale cloud data warehouses and reduce costs, cross-Region data sharing, data exchange, and the SUPER data type to store semi-structured data or documents as values. For the latest feature releases for Amazon Redshift, see Amazon Redshift What’s New. In addition to improving performance and scale, you can also gain up to three times better price performance with Amazon Redshift than other cloud data warehouses.
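
As one illustration of the SUPER data type mentioned above, you can land JSON documents in a table and query them with PartiQL; the table and payload here are a minimal sketch:

CREATE TABLE raw_events (payload SUPER);

INSERT INTO raw_events
VALUES (JSON_PARSE('{"eventid": 101, "name": "concert", "tags": ["music", "live"]}'));

SELECT payload.eventid, payload.tags[0]
FROM raw_events;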

To take advantage of the performance, security, and scale of Amazon Redshift, customers are looking to migrate their data from their existing cloud warehouse in a way that is both cost optimized and performant. This post describes how to migrate a large volume of data from Snowflake to Amazon Redshift using AWS Glue Python shell in a manner that meets both these goals.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration, allowing you to analyze your data in minutes instead of weeks or months. AWS Glue supports the ability to use a Python shell job to run Python scripts as a shell, enabling you to author ETL processes in a familiar language. In addition, AWS Glue allows you to manage ETL jobs using AWS Glue workflows, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), and AWS Step Functions, automating and facilitating the orchestration of ETL steps.

Solution overview

The following architecture shows how an AWS Glue Python shell job migrates the data from Snowflake to Amazon Redshift in this solution.

Architecture

The solution comprises two stages:

  • Extract – The first part of the solution extracts data from Snowflake into an Amazon Simple Storage Service (Amazon S3) data lake
  • Load – The second part of the solution reads the data from the same S3 bucket and loads it into Amazon Redshift

For both stages, we connect the AWS Glue Python shell jobs to Snowflake and Amazon Redshift using database connectors for Python. The first AWS Glue Python shell job reads a SQL file from an S3 bucket to run the relevant COPY commands on the Snowflake database using Snowflake compute capacity and parallelism to migrate the data to Amazon S3. When this is complete, the second AWS Glue Python shell job reads another SQL file, and runs the corresponding COPY commands on the Amazon Redshift database using Redshift compute capacity and parallelism to load the data from the same S3 bucket.

Both jobs are orchestrated using AWS Glue workflows, as shown in the following screenshot. The workflow pushes data processing logic down to the respective data warehouses by running COPY commands on the databases themselves, minimizing the processing capacity required by AWS Glue to just the resources needed to run the Python scripts. The COPY commands load data in parallel both to and from Amazon S3, providing one of the fastest and most scalable mechanisms to transfer data from Snowflake to Amazon Redshift.

Because all heavy lifting around data processing is pushed down to the data warehouses, this solution is designed to provide a cost-optimized and highly performant mechanism to migrate a large volume of data from Snowflake to Amazon Redshift with ease.

Glue Workflow

The entire solution is packaged in an AWS CloudFormation template for simplicity of deployment and automatic provisioning of most of the required resources and permissions.

The high-level steps to implement the solution are as follows:

  1. Generate the Snowflake SQL file.
  2. Deploy the CloudFormation template to provision the required resources and permissions.
  3. Provide Snowflake access to the newly created S3 bucket.
  4. Run the AWS Glue workflow to migrate the data.

Prerequisites

Before you get started, you can optionally build the latest version of the Snowflake Connector for Python package locally and generate the wheel (.whl) package. For instructions, refer to How to build.

If you don’t provide the latest version of the package, the CloudFormation template uses a pre-built .whl file that may not be on the most current version of Snowflake Connector for Python.

By default, the CloudFormation template migrates data from all tables in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database, which is a sample dataset provided by Snowflake when an account is created. The following stored procedure is used to dynamically generate the Snowflake COPY commands required to migrate the dataset to Amazon S3. It accepts the database name, schema name, and stage name as the parameters.

CREATE OR REPLACE PROCEDURE generate_copy(db_name VARCHAR, schema_name VARCHAR, stage_name VARCHAR)
   returns varchar not null
   language javascript
   as
   $$
var return_value = "";
var sql_query = "select table_catalog, table_schema, lower(table_name) as table_name from " + DB_NAME + ".information_schema.tables where table_schema = '" + SCHEMA_NAME + "'" ;
   var sql_statement = snowflake.createStatement(
          {
          sqlText: sql_query
          }
       );
/* Creates result set */
var result_scan = sql_statement.execute();
while (result_scan.next())  {
       return_value += "\n";
       return_value += "COPY INTO @"
       return_value += STAGE_NAME
       return_value += "/"
       return_value += result_scan.getColumnValue(3);
       return_value += "/"
       return_value += "\n";
       return_value += "FROM ";
       return_value += result_scan.getColumnValue(1);
       return_value += "." + result_scan.getColumnValue(2);
       return_value += "." + result_scan.getColumnValue(3);
       return_value += "\n";
       return_value += "FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)";
       return_value += "\n";
       return_value += "OVERWRITE = TRUE;"
       return_value += "\n";
       }
return return_value;
$$
;
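
With the procedure in place, a call like the following generates the COPY commands for the sample dataset (assuming the external stage, created in a later step, is named unload_to_s3):

CALL generate_copy('SNOWFLAKE_SAMPLE_DATA', 'TPCH_SF1', 'unload_to_s3');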

Deploy the required resources and permissions using AWS CloudFormation

You can use the provided CloudFormation template to deploy this solution. This template automatically provisions an Amazon Redshift cluster with your desired configuration in a private subnet, maintaining a high standard of security.

  1. Sign in to the AWS Management Console, preferably as an admin user.
  2. Select your desired Region, preferably the same Region where your Snowflake instance is provisioned.
  3. Choose Launch Stack:
  4. Choose Next.
  5. For Stack name, enter a meaningful name for the stack, for example, blog-resources.

The Parameters section is divided into two subsections: Source Snowflake Infrastructure and Target Redshift Configuration.

  1. For Snowflake Unload SQL Script, keep the default, which is the S3 location (URI) of a SQL file that migrates the sample data in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database.
  2. For Data S3 Bucket, enter a prefix for the name of the S3 bucket that is automatically provisioned to stage the Snowflake data, for example, sf-migrated-data.
  3. For Snowflake Driver, if applicable, enter the S3 location (URI) of the .whl package built earlier as a prerequisite. By default, it uses a pre-built .whl file.
  4. For Snowflake Account Name, enter your Snowflake account name.

You can use the following query in Snowflake to return your Snowflake account name:

SELECT CURRENT_ACCOUNT();
  1. For Snowflake Username, enter your user name to connect to the Snowflake account.
  2. For Snowflake Password, enter the password for the preceding user.
  3. For Snowflake Warehouse Name, enter the warehouse name for running the SQL queries.

Make sure the aforementioned user has access to the warehouse.

  1. For Snowflake Database Name, enter the database name. The default is SNOWFLAKE_SAMPLE_DATA.
  2. For Snowflake Schema Name, enter the schema name. The default is TPCH_SF1.

CFN Param Snowflake

  1. For VPC CIDR Block, enter the desired CIDR block for the VPC of the Redshift cluster. The default is 10.0.0.0/16.
  2. For Subnet 1 CIDR Block, enter the CIDR block of the first subnet. The default is 10.0.0.0/24.
  3. For Subnet 2 CIDR Block, enter the CIDR block of the second subnet. The default is 10.0.1.0/24.
  4. For Redshift Load SQL Script, keep the default, which is the S3 location (URI) of a SQL file that migrates the sample data from Amazon S3 to Redshift.

The following database view in Redshift is used to dynamically generate Redshift COPY commands required to migrate the dataset from Amazon S3. It accepts the schema name as the filter criteria.

CREATE OR REPLACE VIEW v_generate_copy
AS
SELECT
    schemaname ,
    tablename  ,
    seq        ,
    ddl
FROM
    (
        SELECT
            table_id   ,
            schemaname ,
            tablename  ,
            seq        ,
            ddl
        FROM
            (
                --COPY TABLE
                SELECT
                    c.oid::bigint  as table_id   ,
                    n.nspname      AS schemaname ,
                    c.relname      AS tablename  ,
                    0              AS seq        ,
                    'COPY ' + n.nspname + '.' + c.relname + ' FROM ' AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r'
                --COPY TABLE continued                
                UNION                
                SELECT
                    c.oid::bigint as table_id   ,
                    n.nspname     AS schemaname ,
                    c.relname     AS tablename  ,
                    2             AS seq        ,
                    '''${' + '2}' + c.relname + '/'' iam_role ''${' + '1}'' gzip delimiter ''|'' EMPTYASNULL REGION ''us-east-1''' AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r'
                --END SEMICOLON                
                UNION                
                SELECT
                    c.oid::bigint as table_id  ,
                    n.nspname     AS schemaname,
                    c.relname     AS tablename ,
                    600000005     AS seq       ,
                    ';'           AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r' 
             )
        ORDER BY
            table_id  ,
            schemaname,
            tablename ,
            seq 
    );

SELECT ddl
FROM v_generate_copy
WHERE schemaname = 'tpch_sf1';
  1. For Redshift Database Name, enter your desired database name, for example, dev.
  2. For Number of Redshift Nodes, enter the desired compute nodes, for example, 2.
  3. For Redshift Node Type, choose the desired node type, for example, ra3.4xlarge.
  4. For Redshift Password, enter your desired password with the following constraints: it must be 8–64 characters in length, and contain at least one uppercase letter, one lowercase letter, and one number.
  5. For Redshift Port, enter the Amazon Redshift port number to connect to. The default port is 5439.

CFN Param Redshift 1 CFN Param Redshift 2

  1. Choose Next.
  2. Review and choose Create stack.

It takes around 5 minutes for the template to finish creating all resources and permissions. Most of the resources have the prefix of the stack name you specified for easy identification of the resources later. For more details on the deployed resources, see the appendix at the end of this post.

Create an IAM role and external Amazon S3 stage for Snowflake access to the data S3 bucket

In order for Snowflake to access the TargetDataS3Bucket created earlier by the CloudFormation template, you must create an AWS Identity and Access Management (IAM) role and an external Amazon S3 stage for Snowflake access to the S3 bucket. For instructions, refer to Configuring Secure Access to Amazon S3.

When you create an external stage in Snowflake, use the value for TargetDataS3Bucket on the Outputs tab of your deployed CloudFormation stack for the Amazon S3 URL of your stage.

CF Output

Make sure to name the external stage unload_to_s3 if you’re migrating the sample data using the default scripts provided in the CloudFormation template.

Convert Snowflake tables to Amazon Redshift

You can simply run the following DDL statements to create TPCH_SF1 schema objects in Amazon Redshift. You can also use AWS Schema Conversion Tool (AWS SCT) to convert Snowflake custom objects to Amazon Redshift. For instructions on converting your schema, refer to Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool.

CREATE SCHEMA TPCH_SF1;
SET SEARCH_PATH to TPCH_SF1;
CREATE TABLE customer (
  c_custkey int8 not null ,
  c_name varchar(25) not null,
  c_address varchar(40) not null,
  c_nationkey int4 not null,
  c_phone char(15) not null,
  c_acctbal numeric(12,2) not null,
  c_mktsegment char(10) not null,
  c_comment varchar(117) not null,
  Primary Key(C_CUSTKEY)
) ;

CREATE TABLE lineitem (
  l_orderkey int8 not null ,
  l_partkey int8 not null,
  l_suppkey int4 not null,
  l_linenumber int4 not null,
  l_quantity numeric(12,2) not null,
  l_extendedprice numeric(12,2) not null,
  l_discount numeric(12,2) not null,
  l_tax numeric(12,2) not null,
  l_returnflag char(1) not null,
  l_linestatus char(1) not null,
  l_shipdate date not null ,
  l_commitdate date not null,
  l_receiptdate date not null,
  l_shipinstruct char(25) not null,
  l_shipmode char(10) not null,
  l_comment varchar(44) not null,
  Primary Key(L_ORDERKEY, L_LINENUMBER)
)  ;

CREATE TABLE nation (
  n_nationkey int4 not null,
  n_name char(25) not null ,
  n_regionkey int4 not null,
  n_comment varchar(152) not null,
  Primary Key(N_NATIONKEY)                                
) ;

CREATE TABLE orders (
  o_orderkey int8 not null,
  o_custkey int8 not null,
  o_orderstatus char(1) not null,
  o_totalprice numeric(12,2) not null,
  o_orderdate date not null,
  o_orderpriority char(15) not null,
  o_clerk char(15) not null,
  o_shippriority int4 not null,
  o_comment varchar(79) not null,
  Primary Key(O_ORDERKEY)
) ;

CREATE TABLE part (
  p_partkey int8 not null ,
  p_name varchar(55) not null,
  p_mfgr char(25) not null,
  p_brand char(10) not null,
  p_type varchar(25) not null,
  p_size int4 not null,
  p_container char(10) not null,
  p_retailprice numeric(12,2) not null,
  p_comment varchar(23) not null,
  PRIMARY KEY (P_PARTKEY)
) ;

CREATE TABLE partsupp (
  ps_partkey int8 not null,
  ps_suppkey int4 not null,
  ps_availqty int4 not null,
  ps_supplycost numeric(12,2) not null,
  ps_comment varchar(199) not null,
  Primary Key(PS_PARTKEY, PS_SUPPKEY)
) ;

CREATE TABLE region (
  r_regionkey int4 not null,
  r_name char(25) not null ,
  r_comment varchar(152) not null,
  Primary Key(R_REGIONKEY)                             
) ;

CREATE TABLE supplier (
  s_suppkey int4 not null,
  s_name char(25) not null,
  s_address varchar(40) not null,
  s_nationkey int4 not null,
  s_phone char(15) not null,
  s_acctbal numeric(12,2) not null,
  s_comment varchar(101) not null,
  Primary Key(S_SUPPKEY)
);

Run an AWS Glue workflow for data migration

When you’re ready to start the data migration, complete the following steps:

  1. On the AWS Glue console, choose Workflows in the navigation pane.
  2. Select the workflow to run (<stack name>snowflake-to-redshift-migration).
  3. On the Actions menu, choose Run.
    Glue Workflow Run
  4. To check the status of the workflow, choose the workflow and on the History tab, select the Run ID and choose View run details.
    Glue Workflow Status
  5. When the workflow is complete, navigate to the Amazon Redshift console and launch the Amazon Redshift query editor v2 to verify the successful migration of data.
  6. Run the following query in Amazon Redshift to get row counts of all tables migrated from Snowflake to Amazon Redshift. Make sure to adjust the table_schema value accordingly if you’re not migrating the sample data.
SELECT tab.table_schema,
       tab.table_name,
       nvl(tinf.tbl_rows,0) tbl_rows,
       nvl(tinf.size,0) size
FROM svv_tables tab
LEFT JOIN svv_table_info tinf 
          on tab.table_schema = tinf.schema 
          and tab.table_name = tinf."table"
WHERE tab.table_type = 'BASE TABLE'
      and tab.table_schema in ('tpch_sf1')
ORDER BY tbl_rows;

Redshift Editor

  1. Run the following query in Snowflake to compare and validate the data:
USE DATABASE snowflake_sample_data;
SELECT  TABLE_CATALOG,
        TABLE_SCHEMA,
        TABLE_NAME,
        ROW_COUNT,
        BYTES AS SIZE,
        COMMENT
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'TPCH_SF1'
ORDER BY ROW_COUNT;

Snowflake Editor

Clean up

To avoid incurring future charges, delete the resources you created as part of the CloudFormation stack by navigating to the AWS CloudFormation console, selecting the stack blog-resources, and choosing Delete.

Conclusion

In this post, we discussed how to perform an efficient, fast, and cost-effective migration from Snowflake to Amazon Redshift. Migrations from one data warehouse environment to another can typically be very time-consuming and resource-intensive; this solution uses the power of cloud-based compute by pushing down the processing to the respective warehouses. Orchestrating this migration with the AWS Glue Python shell provides additional cost optimization.

With this solution, you can facilitate your migration from Snowflake to Amazon Redshift. If you’re interested in further exploring the potential of using Amazon Redshift, please reach out to your AWS Account Team for a proof of concept.

Appendix: Resources deployed by AWS CloudFormation

The CloudFormation stack deploys the following resources in your AWS account:

  • Networking resources – Amazon Virtual Private Cloud (Amazon VPC), subnets, ACL, and security group.
  • Amazon S3 bucket – This is referenced as TargetDataS3Bucket on the Outputs tab of the CloudFormation stack. This bucket holds the data being migrated from Snowflake to Amazon Redshift.
  • AWS Secrets Manager secrets – Two secrets in AWS Secrets Manager store credentials for Snowflake and Amazon Redshift.
  • VPC endpoints – Two VPC endpoints are deployed to establish private connections from VPC resources like AWS Glue to services that run outside of the VPC, such as Secrets Manager and Amazon S3.
  • IAM roles – IAM roles for AWS Glue, Lambda, and Amazon Redshift. If the CloudFormation template is to be deployed in a production environment, you need to adjust the IAM policies so they’re not as permissive as presented in this post (which were set for simplicity and demonstration). Particularly, AWS Glue and Amazon Redshift don’t require all the actions granted in the *FullAccess policies, which would be considered overly permissive.
  • Amazon Redshift cluster – An Amazon Redshift cluster is created in a private subnet, which isn’t publicly accessible.
  • AWS Glue connection – The connection for Amazon Redshift makes sure that the AWS Glue job runs within the same VPC as Amazon Redshift. This also ensures that AWS Glue can access the Amazon Redshift cluster in a private subnet.
  • AWS Glue jobs – Two AWS Glue Python shell jobs are created:
    • <stack name>-glue-snowflake-unload – The first job runs the SQL scripts in Snowflake to copy data from the source database to Amazon S3. The Python script is available in S3. The Snowflake job accepts two parameters:
      • SQLSCRIPT – The Amazon S3 location of the SQL script to run in Snowflake to migrate data to Amazon S3. This is referenced as the Snowflake Unload SQL Script parameter in the input section of the CloudFormation template.
      • SECRET – The Secrets Manager ARN that stores Snowflake connection details.
    • <stack name>-glue-redshift-load – The second job runs another SQL script in Amazon Redshift to copy data from Amazon S3 to the target Amazon Redshift database. The Python script link is available in S3. The Amazon Redshift job accepts three parameters:
      • SQLSCRIPT – The Amazon S3 location of the SQL script to run in Amazon Redshift to migrate data from Amazon S3. If you provide custom SQL script to migrate the Snowflake data to Amazon S3 (as mentioned in the prerequisites), the file location is referenced as LoadFileLocation on the Outputs tab of the CloudFormation stack.
      • SECRET – The Secrets Manager ARN that stores Amazon Redshift connection details.
      • PARAMS – This includes any additional parameters required for the SQL script, including the Amazon Redshift IAM role used in the COPY commands and the S3 bucket staging the Snowflake data. Multiple parameter values can be provided separated by a comma.
  • AWS Glue workflow – The orchestration of Snowflake and Amazon Redshift AWS Glue Python shell jobs is managed via an AWS Glue workflow. The workflow <stack name>snowflake-to-redshift-migration runs later for actual migration of data.

About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.

Create cross-account, custom Amazon Managed Grafana dashboards for Amazon Redshift

Post Syndicated from Tahir Aziz original https://aws.amazon.com/blogs/big-data/create-cross-account-custom-amazon-managed-grafana-dashboards-for-amazon-redshift/

Amazon Managed Grafana recently announced a new data source plugin for Amazon Redshift, enabling you to query, visualize, and alert on your Amazon Redshift data from Amazon Managed Grafana workspaces. With the new Amazon Redshift data source, you can now create dashboards and alerts in your Amazon Managed Grafana workspaces to analyze your structured and semi-structured data across data warehouses, operational databases, and data lakes. The Amazon Redshift plugin also comes with default out-of-the-box dashboards that make it simple to get started monitoring the health and performance of your Amazon Redshift clusters.

In this post, we present a step-by-step tutorial on using the Amazon Redshift data source plugin to visualize metrics from your Amazon Redshift clusters hosted in different AWS accounts using AWS Single Sign-On (AWS SSO), as well as how to create custom dashboards visualizing data from Amazon Redshift system tables in Amazon Managed Grafana.

Solution overview

Let’s look at the AWS services that we use in our tutorial:

Amazon Managed Grafana is a fully managed service for open-source Grafana developed in collaboration with Grafana Labs. Grafana is a popular open-source analytics platform that enables you to query, visualize, alert on, and understand your operational metrics. You can create, explore, and share observability dashboards with your team, and spend less time managing your Grafana infrastructure and more time improving the health, performance, and availability of your applications. Amazon Managed Grafana natively integrates with AWS services (like Amazon Redshift) so you can securely add, query, visualize, and analyze operational and performance data across multiple accounts and Regions for the underlying AWS service.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Today, tens of thousands of AWS customers from Fortune 500 companies, startups, and everything in between use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics jobs. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

AWS SSO is where you create or connect your workforce identities in AWS and manage access centrally across your AWS organization. You can choose to manage access just to your AWS accounts or cloud applications. You can create user identities directly in AWS SSO, or you can bring them from your Microsoft Active Directory or a standards-based identity provider, such as Okta Universal Directory or Azure AD. With AWS SSO, you get a unified administration experience to define, customize, and assign fine-grained access. Your workforce users get a user portal to access all their assigned AWS accounts, Amazon Elastic Compute Cloud (Amazon EC2) Windows instances, or cloud applications. AWS SSO can be flexibly configured to run alongside or replace AWS account access management via AWS Identity and Access Management (IAM).

The following diagram illustrates the solution architecture.

The solution includes the following components:

  • Captured metrics from the Amazon Redshift clusters in the development and production AWS accounts.
  • Amazon Managed Grafana, with the Amazon Redshift data source plugin added to it. Amazon Managed Grafana communicates with the Amazon Redshift cluster via the Amazon Redshift Data API (see the sketch after this list).
  • The Grafana web UI, with the Amazon Redshift dashboard using the Amazon Redshift cluster as the data source. The web UI communicates with Amazon Managed Grafana via an HTTP API.
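
Under the hood, queries from the plugin are submitted through the Data API rather than over a persistent database connection. The following is a minimal boto3 sketch of the kind of call involved; the cluster identifier and database name are placeholders, and the database user matches the one configured later in this post.

    import time

    import boto3

    client = boto3.client("redshift-data", region_name="us-east-1")

    # Submit a query using temporary credentials, as the data source plugin does.
    stmt = client.execute_statement(
        ClusterIdentifier="production-cluster",  # placeholder
        Database="dev",                          # placeholder
        DbUser="redshift_data_api_user",
        Sql="SELECT COUNT(*) FROM stv_sessions;",
    )

    # Poll until the statement finishes, then fetch the result set.
    while client.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)
    print(client.get_statement_result(Id=stmt["Id"])["Records"])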

We walk you through the following steps in this post:

  1. Create a user in AWS SSO for Amazon Managed Grafana workspace access.
  2. Configure an Amazon Managed Grafana workspace.
  3. Set up two Amazon Redshift clusters as the data sources in Grafana.
  4. Import the Amazon Redshift dashboard supplied with the data source.
  5. Create a custom Amazon Redshift dashboard to visualize metrics from the Amazon Redshift clusters.

Prerequisites

To follow along with this post, you should have the following prerequisites:

Set up AWS SSO

In this section, we set up AWS SSO and register users.

In addition to AWS SSO integration, Amazon Managed Grafana also supports direct SAML integration with SAML 2.0 identity providers.

  1. If you don’t have AWS SSO enabled, open the AWS SSO console and choose Enable AWS SSO.
  2. After AWS SSO is enabled, choose Users in the navigation pane.
  3. Choose Add user.
  4. Enter the user details and choose Next: Groups.
  5. Choose Add user.

Set up your Amazon Managed Grafana workspace

In this section, we demonstrate how to set up a Grafana workspace using Amazon Managed Grafana. We set up authentication using AWS SSO, register data sources, and add administrative users for the workspace.

  1. On the Amazon Managed Grafana console, choose Create workspace.
  2. For Workspace name, enter a suitable name.
  3. Choose Next.
  4. For Authentication access, select AWS Single Sign-On.
  5. For Permission type, select Service managed.
  6. Choose Next.
  7. Select Current account.
  8. For Data sources, select Amazon Redshift.
  9. Choose Next.
  10. Review the details and choose Create workspace.

    Now we assign a user to the workspace.
  11. On the Workspaces page, choose the workspace you created.
  12. Note the IAM role attached to your workspace.
  13. Choose Assign new user or group.
  14. Select the user to assign to the workspace.
  15. Choose Assign users and groups.

    For the purposes of this post, we need an admin user.
  16. To change the permissions of the user you just assigned, select the user name and choose Make admin.

For the cross-account setup, we use two Amazon Redshift clusters: production and development. In the next section, we configure IAM roles in both the production and development accounts so that the Grafana in the production account is able to connect to the Amazon Redshift clusters in the production account as well as in the development account.

Configure an IAM role for the development account

In this section, we set up the IAM role in the AWS account hosting the development environment. This role is assumed by the Amazon Managed Grafana service from the production AWS account to establish the connection between Amazon Managed Grafana and Amazon Redshift cluster in the development account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. Select Custom trust policy.
  4. Use the following policy code (update the account number for your production account and the Grafana service role attached to the workspace):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::<production-account-number>:role/service-role/AmazonGrafanaServiceRole-xxxxxxxxxx",
                    "Service": "grafana.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  5. Choose Next.
  6. Attach the managed IAM policy AmazonGrafanaRedshiftAccess to this role. For instructions, refer to Modifying a role permissions policy (console).
  7. Provide a role name, description, and tags (optional), and create the role.

Configure an IAM role for the production account

Next, we configure the IAM role created by the Amazon Managed Grafana service in order to establish the connection between Amazon Managed Grafana and the Amazon Redshift cluster in the production account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Search for the AmazonGrafanaServiceRole-xxxxxxx role attached to your Grafana workspace.
  3. Create an inline IAM policy and attach it to this role with the following policy code:
    {
    	"Version": "2012-10-17",
    	"Statement": [{
    		"Sid": "VisualEditor0",
    		"Effect": "Allow",
    		"Action": [
    			"sts:AssumeRole"
    		],
    		"Resource":"arn:aws:iam::<dev-account-number>:role/<DevAccountRoleName>"
    	}]
    }

  4. Provide a policy name and create the policy.

Import the default dashboard

In this section, we connect to the Amazon Redshift clusters in the production and development accounts from the Amazon Managed Grafana console and import the default dashboard.

  1. On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
  2. Choose the workspace you just created (authenticate and sign in if needed).
  3. In the navigation pane, choose Settings and on the Configuration menu, choose Data sources.
  4. Choose Add data source.
  5. Search for and choose Amazon Redshift.
  6. On the Settings tab, for Authentication provider, choose Workspace IAM role.
  7. For Default Region, choose us-east-1.
  8. Under Redshift Details, choose Temporary credentials.
  9. Enter the cluster identifier and database name for your Amazon Redshift cluster in the production account.
  10. For Database user, enter redshift_data_api_user.
  11. Choose Save & test.
    When the connection is successfully established, a message appears that the data source is working. You can now move on to the next step.
  12. Repeat these steps to add another data source to connect to the Amazon Redshift cluster in the development account.
  13. On the Settings tab, for Authentication provider, choose Workspace IAM role.
  14. Enter the ARN of the IAM role you created earlier (arn:aws:iam::dev-account-number:role/cross-account-role-name) as the role for the workspace to assume.
  15. For Default Region, choose us-east-1.
  16. Under Redshift Details, choose Temporary credentials.
  17. Enter the cluster identifier and database name for your Amazon Redshift cluster in the development account.
  18. For Database user, enter redshift_data_api_user.
  19. Choose Save & test.
    When the connection is successfully established, a message appears that the data source is working.
  20. On the Dashboards tab, choose Import next to Amazon Redshift.

On the dashboard page, you can switch the data source between your production and development clusters using a drop-down menu.

The default Amazon Redshift dashboard, as shown in the following screenshot, makes it easy to monitor the overall health of the cluster by showing different cluster metrics, like total storage capacity used, storage utilization per node, open and closed connections, WLM mode, AQUA status, and more.

Additionally, the default dashboard displays several table-level metrics such as size of the tables, total number of rows, unsorted rows percentage, and more, in the Schema Insights section.

Add a custom dashboard for Amazon Redshift

The Amazon Redshift data source plugin allows you to query and visualize Amazon Redshift data metrics from within Amazon Managed Grafana. It’s preconfigured with general metrics. To add a custom metric from the Amazon Redshift cluster, complete the following steps:

  1. On the Amazon Managed Grafana console, choose All workspaces in the navigation pane.
  2. Choose the Grafana workspace URL for the workspace you want to modify.
  3. Choose Sign in with AWS SSO and provide your credentials.
  4. On the Amazon Managed Grafana workspace page, choose the plus sign and on the Create menu, choose Dashboard.
  5. Choose Add a new panel.
  6. Add the following custom SQL to get the data from the Amazon Redshift cluster:
    SELECT p.usename,
           COUNT(*) AS num_query,
           SUM(DATEDIFF('second', starttime, endtime)) AS total_execution_seconds
    FROM stl_query s
    INNER JOIN pg_user p ON s.userid = p.usesysid
    WHERE starttime BETWEEN $__timeFrom() AND $__timeTo()
      AND s.userid > 1
    GROUP BY 1

    For this post, we use the default settings, but you can control and link the time range using the $__timeFrom() and $__timeTo() macros; they're bound to the time range control of your dashboard. For more information and details about the supported expressions, see Query Redshift data.

  7. To inspect the data, choose Query inspector to test the custom query outcome.
    Amazon Managed Grafana supports a variety of visualizations. For this post, we create a bar chart.
  8. On the Visualizations tab in the right pane, choose Bar chart.
  9. Enter a title and description for the custom chart, and leave all other properties as default.
    For more information about supported properties, see Visualizations.
  10. Choose Save.
  11. In the pop-up window, enter a dashboard name and choose Save.

    A new dashboard is created with a custom metric.
  12. To add more metrics, choose the Add panel icon, choose Add a new panel, and repeat the previous steps.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Managed Grafana workspace.
  2. If you created a new Amazon Redshift cluster for this demonstration, delete the cluster.

Conclusion

In this post, we demonstrated how to use AWS SSO and Amazon Managed Grafana to create an operational view to monitor the health and performance of Amazon Redshift clusters. We also showed how to extend the default dashboard by adding custom, insightful visualizations to your Grafana workspace.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Shawn Sachdev is a Sr. Analytics Specialist Solutions Architect at AWS. He works with customers and provides guidance to help them innovate and build well-architected and high-performance data warehouses and implement analytics at scale on the AWS platform. Before AWS, he worked in several analytics and system engineering roles. Outside of work, he loves watching sports, and is an avid foodie and craft beer enthusiast.

Ekta Ahuja is an Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

Resize Amazon Redshift from DC2 to RA3 with minimal or no downtime

Post Syndicated from Soujanya Konka original https://aws.amazon.com/blogs/big-data/resize-amazon-redshift-from-dc2-to-ra3-with-minimal-or-no-downtime/

Amazon Redshift is a popular cloud data warehouse that allows you to process exabytes of data across your data warehouse, operational database, and data lake using standard SQL. Amazon Redshift offers different node types like DC2 (dense compute) and RA3, which you can use for your different workloads and use cases. For more information about the benefits of migrating from DS2 to RA3, refer to Scale your cloud data warehouse and reduce costs with the new Amazon Redshift RA3 nodes with managed storage and Amazon Redshift Benchmarking: Comparison of RA3 vs. DS2 Instance Types.

Many customers use DC2 nodes for their compute-intensive workloads. As your workload grows, it's natural to separate compute from storage so each can be right-sized to your needs. RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. Amazon Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and Amazon S3 for longer-term durable storage. If the data in a node grows beyond the size of the large local SSDs, Amazon Redshift managed storage automatically offloads that data to Amazon S3. RA3 nodes keep track of the frequency of access for each data block and cache the hottest blocks. If the blocks aren't cached, the large networking bandwidth and precise storing techniques return the data in under a second. Also, if you're looking for features like cross-cluster data sharing and cross-Availability Zone cluster relocation, these are a few more reasons for migrating to RA3. Many customers on DC2 have benefitted from migrating to RA3 to serve their growing performance requirements and business use cases.

As a first step of the migration, we always recommend determining the correct load of your system and the number of RA3 nodes that will meet your workload and give you the best cost-performance benefit. For this evaluation, you can use the Simple Replay tool to conduct a what-if analysis and evaluate how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. To choose the right cluster type, you can compare different node types for your workload and choose the right configuration of RA3 with the Simple Replay utility.

Once you know the cluster type and node count, the next question is how to migrate your current workload to RA3 with minimal downtime and without disrupting your current workload. In this post, we describe an approach to do this.

Resizing an Amazon Redshift cluster

There are three ways to resize or migrate an Amazon Redshift cluster from DC2 to RA3:

  • Elastic resize – If it's available as an option, use elastic resize to change the node type, number of nodes, or both (a minimal sketch of invoking it programmatically follows this list). Note that when you only change the number of nodes, the queries are temporarily paused and connections are kept open. An elastic resize typically takes 10–15 minutes. During a resize operation, the cluster is read-only.
  • Classic resize – Use classic resize to change the node type, number of nodes, or both. Choose this option when you’re resizing to a configuration that isn’t available through elastic resize. A resize operation can take 2 hours or more, or last up to several days depending on your data size. During the resize operation, the source cluster is read-only.
  • Snapshot, restore, and resize – To keep your cluster available during a classic resize, make a copy of the existing cluster, then resize the new cluster. If data is written to the source cluster after a snapshot is taken, the data must be manually copied over after the migration is complete.
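
The following is a minimal boto3 sketch of requesting an elastic resize; the cluster identifier and target configuration are placeholders, and elastic resize succeeds only if the target configuration is available for your cluster.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Request an elastic resize (Classic=False) to a new node type and count.
    redshift.resize_cluster(
        ClusterIdentifier="my-dc2-cluster",  # placeholder
        NodeType="ra3.4xlarge",
        NumberOfNodes=4,
        Classic=False,
    )

    # Track progress; the cluster stays read-only while the resize runs.
    print(redshift.describe_resize(ClusterIdentifier="my-dc2-cluster")["Status"])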

Checkpoints for resize

When a cluster is resized using elastic resize with the same node type, the operation doesn't create a new cluster, so it completes quickly. In other cases, one or more of the following challenges can delay the resize:

  • Data volumes – The time required to complete a classic resize or a snapshot and restore operation might vary, depending on factors like the workload on the source cluster, the number and volume of tables being transformed, how evenly data is distributed across the compute nodes and slices, and the node configuration in the source and target clusters.
  • Snapshots – Automated snapshots are automatically deleted when their retention period expires, when you disable automated snapshots, or when you delete a cluster. If you want to keep an automated snapshot, you can copy it to a manual snapshot. You can take a manual snapshot of the cluster before the migration, which is used for resize operations, but it won't include data written to the source cluster after the snapshot was captured.
  • Cluster unavailable during resize – It’s critical to know roughly how long the resize will take. To do so, you can try creating a cluster from the snapshot in a test account. However, this only gives a ballpark idea because resize times can vary, especially if you intend to query your cluster during the resize. If the cluster is live almost all the time with minimal or zero non-business hours, a resize can be a challenge because the cluster can’t upsert live data and serve read requests on this data during this window.
  • Cluster endpoint retention – Elastic resize and cluster resize allow you to change the node type, number of nodes, or both, but the endpoint is retained. With snapshot resize, a new cluster endpoint is created, which may require a change in your application to replace the endpoint.
  • Reconciliation – Validate the target cluster data against the source to make sure the migration completed without data loss and to ensure data quality. Reconciliation at the table level isn't sufficient; you also need to ensure records have been copied from the source. You can run a matching record count check followed by data validation using checksums for accuracy of data.

Solution overview

The steps to prepare for migration are as follows:

  1. Take a snapshot of the existing production Amazon Redshift cluster running on DC2.
  2. Create another Amazon Simple Storage Service (Amazon S3) bucket, where AWS Glue writes the curated data in parallel.
  3. Use the snapshot to create an RA3 cluster (a minimal sketch of steps 1 and 3 follows this list).
  4. Configure AWS Database Migration Service (AWS DMS) to load data from the new S3 bucket to the RA3 cluster.
  5. After you confirm that the data is synced between the two clusters (DC and RA3) and all other downstream applications, stop the DC cluster and change the endpoint of your dependent downstream application to the newly created RA3 cluster.
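
The following is a minimal boto3 sketch of steps 1 and 3, taking a manual snapshot of the DC2 cluster and restoring it to an RA3 cluster; all identifiers and the target configuration are placeholders.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Step 1: take a manual snapshot of the production DC2 cluster.
    redshift.create_cluster_snapshot(
        SnapshotIdentifier="dc2-migration-snapshot",  # placeholder
        ClusterIdentifier="prod-dc2-cluster",         # placeholder
    )
    redshift.get_waiter("snapshot_available").wait(SnapshotIdentifier="dc2-migration-snapshot")

    # Step 3: restore the snapshot into a new RA3 cluster.
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier="prod-ra3-cluster",         # placeholder
        SnapshotIdentifier="dc2-migration-snapshot",
        NodeType="ra3.4xlarge",                       # target RA3 configuration
        NumberOfNodes=4,
    )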

Following is the current architecture depicting a live workload.

In this solution, data comes from three source systems and is written into a raw S3 bucket:

  • Change data capture (CDC) from an RDS instance via AWS DMS (1 in the preceding diagram)
  • Events captured via an external API (2)
  • CSV files from an external source copied to the raw bucket (3)

These sources don’t have a pattern or an interval of pushing new data.

Every few minutes, the ingested data is picked up by an S3 event trigger to run an AWS Glue workflow (4 in the preceding diagram). It provides an orchestration layer to manage and run jobs and crawlers. This workflow includes a crawler (5) that updates the metadata schema and partitions of the dataset to the AWS Glue Data Catalog. Then the crawler triggers an AWS Glue job that writes the curated data to the S3 curated bucket. From there, another AWS Glue job uploads data into Amazon Redshift (6).

In this scenario, if your workload is critical and you can’t afford a long downtime, then you need to plan your migration accordingly.

Dual write and transient data curation pipeline

As a first step of the migration, you need a parallel data curation pipeline alongside the existing AWS Glue job, which writes the data into the curated S3 bucket. Create another S3 bucket, name it migrated-curated-bucket, and modify the AWS Glue transform job, or replicate the transform job, so that data is also written to the new reserve S3 bucket in parallel.

In this scenario, live data ingestion occurs every 30 minutes. When an iteration of the extract, transform, and load (ETL) job is complete, this triggers a manual snapshot of the Amazon Redshift cluster. After the snapshot is captured, a new Amazon Redshift cluster is created using that snapshot. Cluster creation time can vary depending on the snapshot volume.

If snapshot creation takes more than 30 minutes, then the ETL job should be stopped, and resume after the snapshot creation is complete. For example, if the ETL job is triggered at 8:00 AM and finishes at 8:10 AM, then snapshot creation starts at 8:10 AM. If it finishes by 8:30 AM (the next ETL job will run at 8:30 AM as per the half-hour interval), then the ETL process continues according to the schedule. Otherwise, the job stops, and resumes after the snapshot completion.

Now we use the snapshot to launch a new RA3 Amazon Redshift cluster. The process doesn't pause the existing ETL pipeline; rather, it starts writing curated data in parallel to the reserve S3 bucket. The following diagram illustrates this updated workflow.

At this point, the existing cluster is still live and continues to process the live workload. Even if creation of the Amazon Redshift cluster takes time (owing to the huge volume of data), you should still be covered. The curated data in the S3 bucket acts as a staging reserve, and this data should be loaded into the RA3 cluster after the cluster is launched.

Backfill the new RA3 cluster with missing data

After the RA3 cluster has been launched, you need to play back the captured live data from the reserve S3 bucket to the newly created cluster. Playback covers only the period from the snapshot capture to the current timestamp. With this process, you bring the RA3 cluster in sync with the existing live DC2 cluster.

You need to configure an AWS DMS migration task with the reserve S3 bucket as the source endpoint and the newly created RA3 cluster as the target endpoint, as in the sketch below.
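
The following is a minimal boto3 sketch of creating such a task, assuming the S3 source endpoint, Amazon Redshift target endpoint, and replication instance already exist; all ARNs are placeholders.

    import json

    import boto3

    dms = boto3.client("dms", region_name="us-east-1")

    # Replicate everything; narrow the selection rules as needed.
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }

    dms.create_replication_task(
        ReplicationTaskIdentifier="reserve-bucket-to-ra3",
        SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:reserve-s3",   # placeholder
        TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:ra3-target",   # placeholder
        ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:migration",    # placeholder
        MigrationType="full-load-and-cdc",
        TableMappings=json.dumps(table_mappings),
    )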

AWS DMS captures ongoing changes from the source data store and applies them to the target. This process is called ongoing replication or change data capture (CDC). AWS DMS uses this process when replicating ongoing changes from a source data store, collecting changes from the database logs using the database engine's native API. The following diagram illustrates this workflow.

Reconciliation and cutover

Data reconciliation is the process of verification of data between source and target. In this process, target data is compared with source data to ensure that the data is transferred completely without any alterations. To ensure reliability in the pipeline and the data processed, you should create an end-to-end reconciliation report. This report verifies the percentage of matching tables, columns, and data records. It also identifies missing records, missing values, incorrect values, badly formatted values, and duplicated records.

You can define the reconciliation process to check whether both clusters are running in sync. For that, you can create simple Python scripts or shell scripts to query the source and target clusters, fetch the results, and compare them, as in the sketch below.
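
The following is a minimal sketch of such a check using the Amazon Redshift Data API via boto3; the cluster identifiers, database, user, and table list are placeholders, and a production report would also validate checksums, not just counts.

    import time

    import boto3

    client = boto3.client("redshift-data", region_name="us-east-1")

    def count_rows(cluster_id, table):
        # Run SELECT COUNT(*) on one cluster and return the count.
        stmt = client.execute_statement(
            ClusterIdentifier=cluster_id,
            Database="dev",   # placeholder
            DbUser="admin",   # placeholder
            Sql=f"SELECT COUNT(*) FROM {table};",
        )
        while client.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
            time.sleep(1)
        return client.get_statement_result(Id=stmt["Id"])["Records"][0][0]["longValue"]

    # Compare row counts table by table between the DC2 and RA3 clusters.
    for table in ["public.orders", "public.events"]:   # placeholder table list
        src = count_rows("prod-dc2-cluster", table)    # placeholder cluster IDs
        tgt = count_rows("prod-ra3-cluster", table)
        print(f"{table}: source={src} target={tgt} {'OK' if src == tgt else 'MISMATCH'}")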

Cutover is the final step of migration, and involves switching the existing cluster with the newly launched cluster. At this point, the clusters are running in parallel. Next, you validate that the downstream data consumption flows are up to date. Verify the reconciliation metrics from the DC2 and RA3 clusters such that table updates are in sync.

You can keep the dual write while you switch from the migration data pipeline. If you discover any issues after cutting over, you can switch back to the old data pipeline, which remains the source of truth until cutover. In this case, cutover involves updating the DC2 cluster endpoint to the new RA3 cluster endpoint in the application. Make sure to identify a relatively quiet window during the day to update the endpoint. To keep the same endpoint for your applications and users, you can rename the new RA3 cluster with the same name as the original DC2 cluster. To rename the cluster, modify it on the Amazon Redshift console or use the ModifyCluster API operation. For more information, see Renaming clusters or the ModifyCluster API operation in the Amazon Redshift API Reference.
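
The rename itself is a pair of ModifyCluster calls. The following is a minimal boto3 sketch, assuming the old DC2 cluster is renamed first so its name becomes free; all identifiers are placeholders.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Free up the original name by renaming the old DC2 cluster.
    redshift.modify_cluster(
        ClusterIdentifier="prod-cluster",             # placeholder original name
        NewClusterIdentifier="prod-cluster-dc2-old",
    )

    # Once the first rename completes, give the RA3 cluster the original name
    # so application endpoints keep working unchanged.
    redshift.modify_cluster(
        ClusterIdentifier="prod-ra3-cluster",         # placeholder
        NewClusterIdentifier="prod-cluster",
    )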

Up to this point, AWS DMS continues to update the RA3 cluster. After you cut over to RA3, the DC2 cluster is no longer live, and you can stop the AWS DMS replication task to RA3. Retain the last snapshot, then delete the reserve S3 bucket and the AWS DMS resources used for the RA3 load.

Conclusion

In this post, we presented an approach to migrate an existing Amazon Redshift cluster with minimal or no data loss, which also allows the cluster to serve both read and write operations during the resize window. Elastic resize is a quick way to resize your cluster while maintaining the same number of slices in the target cluster. Slice mapping reduces the time required to resize a cluster. If you choose a resize configuration that isn't available through elastic resize, you can choose classic resize or perform a snapshot, restore, and resize.

To learn more about what’s new with RA3 instances, refer to Amazon Redshift RA3 instances with managed storage. Amazon Redshift delivers better price performance and at the same time helps you keep your costs predictable. Amazon Redshift Serverless automatically provisions and scales the data warehouse capacity to deliver high performance for demanding and unpredictable workloads, and you pay only for the resources you use. This provides greater flexibility to choose either or both based on custom requirements. After you’ve made your choice, try the hands-on labs on Amazon Redshift.


About the Authors

Soujanya Konka is a Solutions Architect and Analytics specialist at AWS, focused on helping customers build their ideas on the cloud. She has expertise in the design and implementation of business information systems and data warehousing solutions. Before joining AWS, Soujanya had stints with companies such as HSBC and Cognizant.

Dipayan Sarkar is a Specialist Solutions Architect for Analytics at AWS, where he helps customers modernize their data platform using AWS analytics services. He works with customers to design and build analytics solutions that enable businesses to make data-driven decisions.

How GE Proficy Manufacturing Data Cloud replatformed to improve TCO, data SLA, and performance

Post Syndicated from Jyothin Madari original https://aws.amazon.com/blogs/big-data/how-ge-proficy-manufacturing-data-cloud-replatformed-to-improve-tco-data-sla-and-performance/

This post is co-authored by Jyothin Madari, Madhusudhan Muppagowni, and Ayush Srivastava from GE.

GE Proficy Manufacturing Data Cloud (MDC), part of GE Digital's Manufacturing Execution Systems (MES) suite of solutions, allows GE Digital's customers to quickly and easily increase the value they derive from the MES by reliably bringing enterprise-wide manufacturing data into the cloud and transforming it into a structured dataset for advanced analytics and deeper insights into their manufacturing processes.

In this post, we share how MDC modernized the hybrid cloud strategy by replatforming. This solution improved scalability, their data availability Service Level Agreement (SLA), and performance.

Challenge

MDC v1 was built on industrial use case-optimized Predix services such as Predix Columnar Store (Cassandra) and Predix Insights (Amazon EMR). MDC evolved in both features and the underlying platform over the past year with a goal to improve TCO, data SLA, and performance. MDC's customer base grew, and the number of customer sites increased to over 100 in the past couple of years. The increased number of sites needed more compute and storage capacity. This increased infrastructure and operational costs significantly, while increasing data latency and reducing the freshness of data delivered from the cloud.

How we started

MDC evaluated several vendors for their storage and compute capabilities using various measurements: security, performance, scalability, ease of management and operation, reduction of overall cost and increase in ROI, partnership, and migration help (technology assistance). The MDC team saw opportunities to improve the product by using native AWS services such as Amazon Redshift, AWS Glue, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which made the product more performant and scalable while reducing operation costs and making it future-ready for advanced analytics and new customer use cases.

The GE Digital team, comprised of domain experts, developers, and QA, worked shoulder to shoulder with the AWS ProServe team, comprised of Solution Architects, Data Architects, and Big Data Experts, in determining the key architectural changes required and solutions to implementation challenges.

Overview of solution

The following diagram illustrates the high-level architecture of the solution.

This is a broad overview, and the specifics of networking and security between components are out of scope for this post.

The solution includes the following main steps and components:

  1. CDC and log collector – Compressed CSV data is collected from over 100 manufacturing data sources (Proficy Plant Applications) and sunk into an Amazon Simple Storage Service (Amazon S3) bucket.
  2. S3 raw bucket – Our data lands in Amazon S3 without any transformation, but appropriately partitioned (tenant, site, date, and so on) for the ease of future processing.
  3. AWS Lambda – When a file lands in the S3 raw bucket, it triggers an S3 event notification, which invokes AWS Lambda. Lambda extracts metadata (bucket name, key name, date, and so on) from the event and saves it in Amazon DynamoDB (a minimal handler sketch follows this list).
  4. AWS Glue – Our goal is now to take CSV files, with varying schemas, and convert them into Apache Parquet format. An AWS Glue extract, transform, and load (ETL) job reads a list of files to be processed from the DynamoDB table and fetches them from the S3 raw bucket. We have preconfigured unified AVRO schemas in the AWS Glue Schema Registry for schema conversion. Converted data lands in the S3 raw Parquet bucket.
  5. S3 raw Parquet bucket – Data in this bucket is still raw and unmodified; only the format was changed. This intermediary storage is required due to schema and column order mismatch in CSV files.
  6. Amazon Redshift – The majority of transformations and data enrichment happens in this step. Amazon Redshift Spectrum consumes data from the S3 raw Parquet bucket and external PostgreSQL dimension tables (through a federated query). Transformations are performed via stored procedures, where we encapsulate logic for data transformation, data validation, and business-specific logic. The Amazon Redshift cluster is configured with concurrency scaling, auto workload management (WLM) with caching, and the latest RA3 instance types.
  7. MDC API – These custom-built, web-based, REST API microservices talk on the backend with Amazon Redshift and expose data to external users, business intelligence (BI) tools, and partners.
  8. Amazon Redshift data export and archival – On a scheduled basis, Amazon Redshift exports (UNLOAD command) contextualized and business-defined aggregated data. Exports are landed in the S3 bucket as Apache Parquet files.
  9. S3 Parquet export bucket – This bucket stores the exported data (hundreds of TBs) used by external users who need to run extensive, heavy analytics and AI or machine learning (ML) with various tools (such as Amazon EMR, Amazon Athena, Apache Spark, and Dremio).
  10. End-users – External users consume data from the API. The main use case here is reporting and visual analytics.
  11. Amazon MWAA – The orchestrator of the solution, Amazon MWAA is used for scheduling Amazon Redshift stored procedures, AWS Glue ETL jobs, and Amazon Redshift exports at regular intervals with error handling and retries built in.
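
The following is a minimal sketch, under assumed names, of the kind of Lambda handler described in step 3; the DynamoDB table name and attribute layout are placeholders.

    import urllib.parse

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("raw-file-metadata")   # placeholder table name

    def handler(event, context):
        # Record metadata for each file landing in the S3 raw bucket.
        for record in event["Records"]:
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            table.put_item(Item={
                "key_name": key,                               # placeholder partition key
                "bucket_name": record["s3"]["bucket"]["name"],
                "event_time": record["eventTime"],
                "size_bytes": record["s3"]["object"]["size"],
            })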

Bringing it all together

MDC replaced both Predix Columnar Store (Cassandra) and Predix Insights (Amazon EMR) with Amazon Redshift for both storage of the MDC data models and compute (ELT). Amazon MWAA is used to schedule the workloads that do the bulk of the ELT. Lambda, AWS Glue, and DynamoDB are used to normalize the schema differences between sites. It was important not to disrupt MDC customers while replatforming. To achieve this, MDC used a phased approach to migrate the data models to Amazon Redshift. They used federated queries to query existing PostgreSQL for dimensional data, which facilitated having some of the data models in Amazon Redshift, while the others were in Cassandra with no interruption to MDC customers. Redshift Spectrum facilitated querying the raw data in Amazon S3 directly both for ETL and data validation.

75% of the MDC team along with the AWS ProServe team and AWS Solution Architects collaborated with the GE Digital Security Team and Platform Team to implement the architecture with AWS native services. It took approximately 9 months to implement, secure, and performance tune the architecture and migrate data models in three phases. Each phase has gone through a GE Digital internal security review. Amazon Redshift Auto WLM, short query acceleration, and tuning the sort keys to optimize querying patterns improved the Proficy MDC API performance. Because the unload of the data from Amazon Redshift was fast, Proficy MDC is now able to export the data much more frequently to our end customers.

Conclusion

With replatforming, Proficy MDC was able to improve ETL performance by approximately 75%. Data latency and freshness improved by approximately 87%. The solution reduced the TCO of the platform by approximately 50%. Proficy MDC was also able to reduce infrastructure and operational costs. Improved performance and reduced latency have allowed us to speed up the next steps in our journey to modernize the enterprise data architecture and hybrid cloud data platform.


About the Authors

Jyothin Madari leads the Manufacturing Data Cloud (MDC) engineering team; part of the manufacturing suite of products at GE Digital. He has 18 years of experience, 4 of which is with GE Digital. Most recently he has been working on data migration projects with an aim to reduce costs and improve performance. He is an AWS Certified Cloud Practitioner, a keen learner and loves solving interesting problems. Connect with him on LinkedIn.

Madhusudhan (Madhu) Muppagowni is a Technical Architect and Principal Software Developer based in Silicon Valley, Bay Area, California.  He is passionate about Software Development and Architecture. He thrives on producing Well-Architected and Secure SaaS Products, Data Pipelines that can make a real impact.  He loves outdoors and an avid hiker and backpacker. Connect with him on LinkedIn.

Ayush Srivastava is a Senior Staff Engineer and Technical Anchor based in Hyderabad, India. He is passionate about Software Development and Architecture. He has Demonstrated track record of successfully technical anchoring small to large Secure SaaS Products, Data Pipelines from start to finish. He loves exploring different places and he says “I’m in love with cities I have never been to and people I have never met.” Connect with him on LinkedIn.

Karen Grygoryan is Data Architect with AWS ProServe. Connect with him on LinkedIn.

Gnanasekaran Kailasam is a Data Architect at AWS. He has worked with building data warehouses and big data solutions for over 16 years. He loves to learn new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS. Connect with him on LinkedIn.

Supercharging Dream11’s Data Highway with Amazon Redshift RA3 clusters

Post Syndicated from Dhanraj Gaikwad original https://aws.amazon.com/blogs/big-data/supercharging-dream11s-data-highway-with-amazon-redshift-ra3-clusters/

This is a guest post by Dhanraj Gaikwad, Principal Engineer on Dream11 Data Engineering team.

Dream11 is the world’s largest fantasy sports platform, with over 120 million users playing fantasy cricket, football, kabaddi, basketball, hockey, volleyball, handball, rugby, futsal, American football, and baseball. Dream11 is the flagship brand of Dream Sports, India’s leading Sports Technology company, and has partnerships with several national and international sports bodies and cricketers.

In this post, we look at how we supercharged our data highway, the backbone of our major analytics pipeline, by migrating our Amazon Redshift clusters to RA3 nodes. We also look at why we were excited about this migration, the challenges we faced during the migration and how we overcame them, as well as the benefits accrued from the migration.

Background

The Dream11 Data Engineering team runs the analytics pipelines (what we call our Data Highway) across Dream Sports. In near-real time, we analyze various aspects that directly impact the end-user experience, which can have a profound business impact for Dream11.

Initially, we were analyzing upwards of terabytes of data per day with Amazon Redshift clusters that ran mainly on dc2.8xlarge nodes. However, due to a rapid increase in our user participation over the last few years, we observed that our data volumes increased multi-fold. Because we were using dc2.8xlarge clusters, this meant adding more nodes of dc2.8xlarge instance types to the Amazon Redshift clusters. Not only was this increasing our costs, it also meant that we were adding additional compute power when what we really needed was more storage. Because we anticipated significant growth during the Indian Premier League (IPL) 2021, we actively explored various options using our AWS Enterprise Support team. Additionally, we were expecting more data volume over the next few years.

The solution

After discussions with AWS experts and the Amazon Redshift product team, we at Dream11 were recommended the most viable option of migrating our Amazon Redshift clusters from dc2.8xlarge to the newer RA3 nodes. The most obvious reason for this was the decoupling of storage from compute. As a result, we could use fewer nodes and move our storage to Amazon Redshift managed storage. This allowed us to respond to data volume growth in the coming years as well as reduce our costs.

To start off, we conducted a few elementary tests using an Amazon Redshift RA3 test cluster. After we were convinced that this wouldn’t require many changes in our Amazon Redshift queries, we decided to carry out a complete head-to-head performance test between the two clusters.

Validating the solution

Because the user traffic on the Dream11 app tends to spike during big ticket tournaments like the IPL, we wanted to ensure that the RA3 clusters could handle the same traffic that we usually experience during our peak. The AWS Enterprise Support team suggested using the Simple Replay tool, an open-source tool released by AWS that you can use to record and replay the queries from one Amazon Redshift cluster to another. This tool allows you to capture queries on a source Amazon Redshift cluster, and then replay the same queries on a destination Amazon Redshift cluster (or clusters). We decided to use this tool to capture our performance test queries on the existing dc2.8xlarge clusters and replay them on a test Amazon Redshift cluster composed of RA3 nodes. During this time of our experimentation, the newer version of the automated AWS CloudFormation-based toolset (now on GitHub) was not available.

Challenges faced

The first challenge came up when using the Simple Replay tool because there was no easy way to compare the performance of like-to-like queries on the two types of clusters. Although Amazon Redshift provides various statistics using meta-tables about individual queries and their performance, the Simple Replay tool adds additional comments in each Amazon Redshift query on the target cluster to make it easier to know if these queries were run by the Simple Replay tool. In addition, the Simple Replay tool drops comments from the queries on the source cluster.

Comparing each query's performance with the Amazon Redshift performance test suite would mean writing additional scripts for easy performance comparison. An alternative would have been to modify the Simple Replay tool code, because it's open source on GitHub. However, with IPL 2021 beginning in just a few days, we had to explore another option urgently.

After further discussions with the AWS Enterprise Support team, we decided to use two test clusters: one with the old dc2.8xlarge nodes, and another with the newer RA3 nodes. The idea was to use the Simple Replay tool to run the captured queries from our original cluster on both test clusters. This meant that the queries would be identical on both test clusters, making it easier to compare. Although this meant running an additional test cluster for a few days, we went ahead with this option. As a side note, the newer automated AWS CloudFormation-based toolset does exactly the same in an automated way.

After we were convinced that most of our Amazon Redshift queries performed satisfactorily, we noticed that certain queries were performing slower on the RA3-based cluster than on the dc2.8xlarge cluster. We narrowed down the problem to SQL queries with full table scans. We rectified it by following proper data modeling practices in the ETL workflow. Then we were ready to migrate to the newer RA3 nodes.

The migration to RA3

The migration from the old cluster to the new cluster was smoother than we thought. We used the elastic resize approach, which meant we only had a few minutes of Amazon Redshift downtime. We completed the migration successfully with a sufficient buffer timeline for more tests. Additional tests indicated that the new cluster performed how we wanted it to.

The trial by fire

The new cluster performed satisfactorily during our peak performance loads in the IPL as well as the following ICC T20 Cricket World Cup. We’re excited that the new RA3 node-based Amazon Redshift cluster can support our data volume growth needs without needing to increase the number of instance nodes.

We migrated from dc2 to RA3 in April 2021. The data volume has grown by 50% since then. If we had continued with dc2 instances, the cluster cost would have increased by 50%. However, because of the migration to RA3 instances, even with an increase in data volume by 50% since April 2021, the cluster cost has increased by 0.7%, which is attributed to an increase in storage cost.

Conclusion

Migrating to the newer RA3-based Amazon Redshift cluster helped us decouple our computing needs from our storage needs, and now we’re prepared for our expected data volume growth for the next few years. Moreover, we don’t need to add compute nodes if we only need storage, which is expected to bring down our costs in the long run. We did need to fine-tune some of our queries on the newer cluster. With the Simple Replay tool, we could do a direct comparison between the older and the newer cluster. You can also use the newer automated AWS CloudFormation-based toolset if you want to follow a similar approach.

We highly recommend RA3 instances. They give you the flexibility to size your RA3 cluster based on the amount of data stored without increasing your compute costs.


About the Authors

Dhanraj Gaikwad is a Principal Data Engineer at Dream11. Dhanraj has more than 15 years of experience in the field of data and analytics. In his current role, Dhanraj is responsible for building the data platform for Dream Sports and is specialized in data warehousing, including data modeling, building data pipelines, and query optimizations. He is passionate about solving large-scale data problems and taking unique approaches to deal with them.

Sanket Raut is a Principal Technical Account Manager at AWS based in Vasai, India. Sanket has more than 16 years of industry experience, including roles in cloud architecture, systems engineering, and software design. He currently focuses on enabling large startups to streamline their cloud operations and optimize their cloud spend. His area of interest is in serverless technologies.

AWS Week In Review – May 23, 2022

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-week-in-review-may-27-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

This is the right place to quickly learn about recent AWS news from last week, in just about five minutes or less. This week, I have collected a couple of news items that might be of interest to you, the IT professionals, developers, system administrators, or any type of builders that have their hands on the AWS console, the CLI, or that are writing code.

Last Week’s Launches
The launches that caught my attention last week are the following:

EC2 now supports NitroTPM and SecureBoot – A Trusted Platform Module is often a discrete chip in a computer where you can store secrets and release them to the operating system only when the system is in a known good state. You typically use TPM modules to store operating-system-level volume encryption keys, such as the ones used by BitLocker on Windows or LUKS. NitroTPM is a virtual TPM module, available on selected instance families, that allows workloads depending on TPM functionality to run on EC2 instances.

Amazon EC2 Auto Scaling now backfills predictive scaling forecasts so you can quickly validate forecast accuracy. Predictive scaling is a capability of Auto Scaling that allows you to scale your fleet in and out based on observed usage patterns. It uses AI/ML to predict when your fleet needs more or less capacity, allowing you to scale in advance of a scaling event and have the fleet prepared at peak times. The new backfill feature shows you how predictive scaling would have scaled your fleet during the last 14 days. This allows you to quickly decide whether the predictive scaling policy is accurate for your applications by comparing the demand and capacity forecasts against actual demand immediately after you create a predictive scaling policy.

AWS Backup adds support for two new managed file systems, Amazon FSx for OpenZFS and Amazon FSx for NetApp ONTAP. These additions help you meet your centralized data protection and regulatory compliance needs. You can now use AWS Backup's policy-based capabilities to centrally protect Amazon FSx for NetApp ONTAP or Amazon FSx for OpenZFS, along with the other AWS services for storage, database, and compute that AWS Backup supports.

AWS App Mesh now supports IPv6 – AWS App Mesh is a service mesh that provides application-level networking to make it easy for your services to communicate with each other across multiple types of compute infrastructure. The new support for IPv6 allows you to support workloads running in IPv6 networks and to invoke App Mesh APIs over IPv6. This helps you meet IPv6 compliance requirements, and removes the need for complex networking configuration to handle address translation between IPv4 and IPv6.

Amazon Chime SDK now supports video background replacement and blur on iOS and Android. When you want to integrate audio and video call capabilities in your mobile applications, the Chime SDK is the easiest way to get started. It provides an easy-to-use API that uses the scalable and robust Amazon Chime backend to power your communications. For example, Slack is using Chime as the backend for the communications in their apps. The Chime SDK client libraries for iOS and Android now include video background replacement and blur, which developers can use to reduce visual distractions and help increase visual privacy for mobile users on iOS and Android.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

Amazon Redshift: Ten years of continuous reinvention. This is an Amazon Redshift research paper that will be presented at a leading international forum for database researchers. The authors reflect on how far the first petabyte-scale cloud data warehouse has advanced since it was announced ten years ago.

Improve Your Security at the Edge with AWS IoT Services is a new blog post on the IoT channel. We understand the risks associated with operating at the edge and that you need additional capabilities to ensure that your data is protected. AWS IoT services can help you with end-to-end data protection, device security, and device identification to create the foundation of an expanded information security model and confidently operate at the edge.

AWS Open Source News and Updates – Ricardo Sueiras, my colleague from the AWS Developer Relations team, runs this newsletter. It brings you all the latest open-source projects, posts, and more. Read edition #113 here.

Upcoming AWS Events
CDK Day, on May 26, is a one-day, fully virtual event dedicated to the AWS Cloud Development Kit. With four versions of the CDK released (AWS, Terraform, CDK8s, and Projen), we thought the CDK deserves its own full-fledged conference. We will take one day and showcase the brightest and best of CDK from across the whole product family. Let's talk serverless, Kubernetes, and multi-cloud all on the same day! CDK Day will take place on May 26, 2022 and will be fully virtual, live-streamed to our YouTube channel. Book your ticket now, it's free.

The AWS Summit season is mostly over in Europe, but there are upcoming Summits in North America and the Asia Pacific Regions. Here are some virtual and in-person Summits that might be close to you:

More to come in July, August, and September.

You can register for re:MARS to get fresh ideas on topics such as machine learning, automation, robotics, and space. The conference will be in person in Las Vegas, June 21–24.

That’s all for this week. Check back next Monday for another Week in Review!

— seb

Use Amazon Redshift RA3 with managed storage in your modern data architecture

Post Syndicated from Bhanu Pittampally original https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

Over the years, Amazon Redshift has evolved a lot to meet our customer demands. Its journey started as a standalone data warehousing appliance that provided a low-cost, high-performance, cloud-based data warehouse. Support for Amazon Redshift Spectrum was later added to extend your data warehouse to data lakes, and the concurrency scaling feature was added to support burst activity and scale your data warehouse to support thousands of queries concurrently. In its latest offering, Amazon Redshift runs on a third-generation architecture where the storage and compute layers are decoupled and scaled independently of each other. This latest generation powers several modern data architecture patterns our customers are actively embracing to build flexible and scalable analytics platforms.

When spinning up a new instance of Amazon Redshift, you get to choose either Amazon Redshift Serverless, for when you need a data warehouse that can scale seamlessly and automatically as your demand evolves unpredictably, or you can choose an Amazon Redshift provisioned cluster for steady-state workloads and greater control over your Amazon Redshift cluster’s configuration.

An Amazon Redshift provisioned cluster is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs the Amazon Redshift engine and contains one or more databases. Creating an Amazon Redshift cluster is the first step in your process of building an Amazon Redshift data warehouse. While launching a provisioned cluster, one option that you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node.

In this post, we cover the current generation node RA3 architecture, different RA3 node types, important capabilities that are available only on RA3 node types, and how you can upgrade your current Amazon Redshift node types to RA3.

Amazon Redshift RA3 nodes

RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. RA3 is the latest generation of node types for Amazon Redshift. With RA3, you choose the number of nodes based on your performance requirements and only pay for the managed storage that you use. The RA3 architecture gives you the ability to size your cluster based on the amount of data you process daily or the amount of data that you want to store in your warehouse; there is no need to account for both storage and processing needs together.

Other node types that we previously offered include the following:

  • Dense compute – DC2 nodes enable you to have compute-intensive data warehouses with local SSD storage included. You choose the number of nodes you need based on data size and performance requirements.
  • Dense storage (deprecated) – DS2 nodes enable you to create large data warehouses using hard disk drives (HDDs). If you’re using the DS2 node type, we strongly recommend that you upgrade to RA3 to get twice as much storage and improved performance for the same on-demand cost.

When you choose an RA3 node size and the number of nodes, you can provision compute independent of storage. RA3 nodes are built on the AWS Nitro System and feature high bandwidth networking and large high-performance SSDs as local caches. RA3 nodes use your workload patterns and advanced data management techniques to deliver the performance of local SSD while scaling storage automatically to Amazon Simple Storage Service (Amazon S3).

RA3 node types come in three different sizes to accommodate your analytical workloads. You can quickly start experimenting with the RA3 node type by creating a single-node ra3.xlplus cluster and exploring the various features that are available (a minimal provisioning sketch follows the table). If you're running a medium-sized data warehouse, you can size your cluster with ra3.4xlarge nodes. For large data warehouses, you can start with ra3.16xlarge. The following table gives more information about RA3 node types and their specifications as of this writing.

Node Type       vCPU   RAM (GiB)   Default Slices Per Node   Managed Storage Quota Per Node   Node Range with Create Cluster   Total Managed Storage Capacity
ra3.xlplus      4      32          2                         32 TB                            1–16                             1024 TB
ra3.4xlarge     12     96          4                         128 TB                           2–32                             8192 TB
ra3.16xlarge    48     384         16                        128 TB                           2–128                            16384 TB

Amazon Redshift with managed storage

Amazon Redshift with a managed storage architecture (RMS) still boasts the same resiliency and industry-leading hardware. With managed storage, Amazon Redshift uses intelligent data prefetching and data eviction based on the temperature of your data to determine where blocks are stored. The most frequently used blocks (hot data) are cached locally on SSD, and infrequently used blocks (cold data) are stored on the RMS layer backed by Amazon S3. The following diagram depicts the leader node, compute nodes, and Amazon Redshift managed storage.

In the following sections, we discuss the capabilities that Amazon Redshift RA3 with managed storage can provide.

Independently scale compute and storage

As an organization grows, its data grows too, often reaching petabytes, and the amount of data you ingest into your Amazon Redshift data warehouse grows with it. You may be looking for ways to cost-effectively analyze all your data while retaining control over choosing the right compute or storage resource at the right time. For cost-conscious customers, the RA3 platform provides the option to scale and pay for compute and storage resources separately.

With RA3 instances with managed storage, you choose the number of nodes based on your performance requirements and pay only for the managed storage that you use. This gives you the flexibility to size your RA3 cluster based on the amount of data you process daily without increasing your storage costs: you pay per hour for compute, and you can scale your data warehouse storage capacity without adding any compute resources, paying only for the storage you use.

Another benefit of RMS is that Amazon Redshift manages which data should be stored locally for fastest access, and data that is slightly colder is still kept within fast-access reach.

Advanced hardware

RA3 instances use high-bandwidth networking built on the AWS Nitro System to further reduce the time taken for data to be offloaded to and retrieved from Amazon S3. Managed storage uses high-performance SSDs for your hot data and Amazon S3 for your cold data, providing ease of use, cost-effective storage, and fast query performance.

Additional security options

Amazon Redshift managed VPC endpoints enable you to set up a private connection to securely access your Amazon Redshift cluster within your virtual private cloud (VPC) from client applications in another VPC within the same AWS account, another AWS account, or a subnet without using public IPs and without requiring the traffic to traverse across the internet.

The following scenarios describe common reasons to allow access to a cluster using an Amazon Redshift managed VPC endpoint:

  • AWS account A wants to allow a VPC in AWS account B to have access to a cluster
  • AWS account A wants to allow a VPC that is also in AWS account A to have access to a cluster
  • AWS account A wants to allow a different subnet in the cluster’s VPC within AWS account A to have access to a cluster

For information about access options to another VPC, refer to Enable private access to Amazon Redshift from your client applications in another VPC.

Further optimize your workload

In this section, we discuss two ways to further optimize your workload.

AQUA

AQUA (Advanced Query Accelerator) is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run up to 10 times faster than other enterprise cloud data warehouses by automatically boosting certain types of queries. AQUA is available with the ra3.16xlarge, ra3.4xlarge, or ra3.xlplus nodes at no additional charge and with no code changes.

AQUA is an analytics query accelerator for Amazon Redshift that uses custom-designed hardware to speed up queries that scan large datasets. It automatically boosts performance for queries that require extensive scans, filters, and aggregations over subsets of your data.

For more information about using AQUA, refer to How to evaluate the benefits of AQUA for your Amazon Redshift workloads.

Concurrency scaling for write operations

With RA3 nodes, you can take advantage of concurrency scaling for write operations, such as extract, transform, and load (ETL) statements. Concurrency scaling for write operations is especially useful when you want to maintain consistent response times when your cluster receives a large number of requests. It improves throughput for write operations contending for resources on the main cluster.

Concurrency scaling supports COPY, INSERT, DELETE, and UPDATE statements. In some cases, you might follow DDL statements, such as CREATE, with write statements in the same commit block. In these cases, the write statements are not sent to the concurrency scaling cluster.

When you accrue credit for concurrency scaling, this credit accrual applies to both read and write operations.
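
To see when concurrency scaling was active and how much usage accrued against your credits, you can query the system views Amazon Redshift provides. A minimal sketch, assuming the SVCS_CONCURRENCY_SCALING_USAGE view is available on your cluster version:

-- Review concurrency scaling usage periods, most recent first
select *
from svcs_concurrency_scaling_usage
order by start_time desc;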

Increased agility to scale compute resources

Elastic resize allows you to scale your Amazon Redshift cluster up and down in minutes to get the performance you need, when you need it. However, there are limits on the nodes that you can add to a cluster. With some RA3 node types, you can increase the number of nodes up to four times the existing count. All RA3 node types support a decrease in the number of nodes to a quarter of the existing count. The following table lists growth and reduction limits for each RA3 node type.

Node Type Growth Limit Reduction Limit
ra3.xlplus 2 times (from 4 to 8 nodes, for example) To a quarter of the number
ra3.4xlarge 4 times (from 4 to 16 nodes, for example) To a quarter of the number (from 16 to 4 nodes, for example)
ra3.16xlarge 4 times (from 4 to 16 nodes, for example) To a quarter of the number (from 16 to 4 nodes, for example)

RA3 node types also restore from snapshots faster because storage and compute are separated.

Improved resiliency

Amazon Redshift employs extensive fault detection and auto remediation techniques in order to maximize the availability of a cluster. With the RA3 architecture, you can enable cluster relocation, which provides additional resiliency by relocating a cluster across Availability Zones without losing any data (RPO is zero) or having to change your client applications. The cluster's endpoint remains the same after the relocation occurs, so applications can continue operating without modification. If the existing cluster fails, a new cluster is created on demand in another Availability Zone, which avoids the cost of a standby replica cluster.

Accelerate data democratization

In this section, we share two techniques to accelerate data democratization.

Data sharing

Data sharing provides instant, granular, and high-performance access to data without copies or data movement. You can query live data continually across consumers on different RA3 clusters in the same AWS account, in a different AWS account, or in a different AWS Region. Data is shared securely, enabling governed collaboration. You can grant access at different levels of granularity, including schemas, databases, tables, views, and user-defined functions.

This opens up various new use cases where you may have one ETL cluster that is producing data and have multiple consumers such as ad-hoc querying, dashboarding, and data science clusters to view the same data. This also enables bi-directional collaboration where groups such as marketing and finance can share data with one another. Queries accessing shared data use the compute resources of the consumer Amazon Redshift cluster and don’t impact the performance of the producer cluster.
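
To illustrate the flow, the following is a minimal sketch of sharing a table from a producer cluster and querying it from a consumer cluster; the datashare, schema, and table names are illustrative, and the namespace GUIDs are those of your own clusters:

-- On the producer cluster
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA sales;
ALTER DATASHARE sales_share ADD TABLE sales.daily_revenue;
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer cluster
CREATE DATABASE sales_share_db FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-guid>';
SELECT * FROM sales_share_db.sales.daily_revenue LIMIT 10;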

For more information about data sharing, refer to Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation.

AWS Data Exchange for Amazon Redshift

AWS Data Exchange for Amazon Redshift enables you to find and subscribe to third-party data in AWS Data Exchange that you can query in an Amazon Redshift data warehouse in minutes. You can also license your data in Amazon Redshift through AWS Data Exchange. Access is automatically granted when a customer subscribes to your data and is automatically revoked when their subscription ends. Invoices are automatically generated, and payments are automatically collected and disbursed through AWS. This feature empowers you to quickly query, analyze, and build applications with third-party data.

For details on how to publish a data product and subscribe to a data product using AWS Data Exchange for Amazon Redshift, refer to New – AWS Data Exchange for Amazon Redshift.

Cross-database queries for Amazon Redshift

Amazon Redshift supports the ability to query across databases in a Redshift cluster. With cross-database queries, you can seamlessly query data from any database in the cluster, regardless of which database you are connected to. Cross-database queries can eliminate data copies and simplify your data organization to support multiple business groups on the same cluster.

Cross-database queries help, for example, when data is organized across multiple databases in a Redshift cluster to support multi-tenant configurations: different business groups and teams that own and manage datasets in their own database in the same data warehouse need to collaborate with other groups. You might want to perform common ETL staging and processing while your raw data is spread across multiple databases. Organizing data in multiple Redshift databases is also a common scenario when migrating from traditional data warehouse systems. With cross-database queries, you can access data from any database on the Redshift cluster without having to connect to that specific database, and you can join datasets from multiple databases in a single query.
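
As a quick sketch (the database, schema, and table names are illustrative), three-part notation lets you reference objects in another database on the same cluster:

-- Connected to the dev database; join a local table with one in the marketing database
SELECT o.order_id, o.order_total, c.segment
FROM public.orders o
JOIN marketing.public.customers c
  ON o.customer_id = c.customer_id;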

You can read more about cross-database queries here.

Upgrade to RA3

You can upgrade to RA3 instances within minutes no matter the size of your current Amazon Redshift clusters. Simply take a snapshot of your cluster and restore it to a new RA3 cluster. For more information, refer to Upgrading to RA3 node types.

You can also simplify your migration efforts with Amazon Redshift Simple Replay. For more information, refer to Simplify Amazon Redshift RA3 migration evaluation with Simple Replay utility.

Summary

In this post, we talked about the RA3 node types, the benefits of Amazon Redshift managed storage, and the additional capabilities you get by using Amazon Redshift RA3 with managed storage. Migrating to RA3 node types isn't a complicated effort, so you can get started today.


About the Authors

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytic solutions. His background is in data warehouses—architecture, development, and administration. He has been in the data and analytics field for over 13 years.

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Ingest Stripe data in a fast and reliable way using Stripe Data Pipeline for Amazon Redshift

Post Syndicated from Jessica Ho original https://aws.amazon.com/blogs/big-data/ingest-stripe-data-in-a-fast-and-reliable-way-using-stripe-data-pipeline-for-amazon-redshift/

Enterprises typically host a myriad of business applications for varying data needs. As companies grow, so does the demand for insights from a complete set of business data. Having data from various applications that store data in disparate silos can delay the decision-making process. However, building and maintaining an API integration or a third-party extract, transform, and load (ETL) pipeline to move data into a destination data store can be time-consuming and expensive.

Today we’re delighted to introduce Stripe Data Pipeline for Amazon Redshift to help you access your Stripe data and extract insight securely and easily from Amazon Redshift. This data, including billing, issuing, and payment records, can be shared in a consistent and automated fashion. You can integrate your Stripe data with data from other sources in your Amazon Redshift clusters to create a single source of truth.

In this post, we discuss the benefits of Stripe Data Pipeline and some of its use cases.

Solution overview

Amazon Redshift is a fast, fully managed, petabyte-scale cloud data warehousing service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence (BI) tools. It’s optimized for datasets ranging from a few hundred gigabytes to petabytes or more. This columnar data warehouse provides provisioned as well as serverless deployment options and uses an industry-standard SQL interface to analyze structured and semi-structured data with fast query performance.

Stripe Data Pipeline is powered by Amazon Redshift’s latest RA3 instances, which provide cross-account data sharing capability. RA3 takes a performant, cost-effective approach to address rapidly growing data volume by decoupling data processing from managed storage. You can then scale compute and storage independently and only pay for what you use. Data sharing provides read access directly to data stored across Amazon Redshift clusters without data movement. This capability removes the complexity and delays that are often associated with managing large distributed datasets across multiple accounts.

The solution provides the following core features and benefits:

  • Scalable and managed data pipeline – You don't need to build, maintain, and scale custom ETL jobs. You can set up Stripe Data Pipeline in minutes, and it scales automatically to handle increased business activity and data volume.
  • Up-to-date financial data – You automatically receive and refresh a complete set of your Stripe data and reports in Amazon Redshift on a low-latency schedule. Stripe Data Pipeline is built into Stripe and always provides accurate data.
  • Security and compliance – Data is shared directly from Stripe with your Amazon Redshift cluster, and confidentiality of the data is protected in transit and at rest. Amazon Redshift offers comprehensive security controls and monitoring via native integration with AWS CloudTrail and Amazon CloudWatch (for more information, see Logging Amazon Redshift API calls with AWS CloudTrail and Monitoring Amazon Redshift using CloudWatch metrics, respectively). You can define and audit who has access to what and ensure the compliance requirements are met.
  • Extensibility – Once the data is accessible in AWS, you benefit from the breadth of native integrations Amazon Redshift supports. You can join datasets from other data stores in operational databases, build reports and dashboards with BI tools, or identify patterns and generate prediction using Amazon Redshift ML.

The following architecture diagram provides a quick overview of how data sharing works and how other AWS services can be used together. We dive deeper into different use cases in the following sections.

Accept datashares from Stripe

You can configure the solution in a few steps with no code necessary.

Once Stripe creates a datashare from the producer cluster and authorizes your AWS account, you can view this datashare on your Amazon Redshift console. You then associate it with specific clusters, or all clusters, in your AWS account as the consumer. Clusters are identified by their namespaces, which are globally unique identifiers. Next, you create a database from the datashare in order to start querying data.
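
If you prefer SQL over the console for these steps, the flow looks similar to the following sketch; the datashare name, account ID, and namespace GUID are placeholders for the values involved in your share:

-- List datashares that are available to your cluster
SHOW DATASHARES;

-- Create a local database from the Stripe datashare
CREATE DATABASE stripe FROM DATASHARE <stripe_datashare_name>
OF ACCOUNT '<stripe-account-id>' NAMESPACE '<producer-namespace-guid>';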

Query data from the consumer Amazon Redshift cluster

You can now access your Stripe data and schema directly from Amazon Redshift’s web-based query editor. This direct connection enables teams to pull accurate analysis of various functions of the business. For example:

  • Finance – “How does my cash flow change based on seasonality?”
  • Sales – “How many customers do we have in the US?”
  • Product – “How many active users do we have on each subscription plan?”
  • Sales operations – “Which customers haven’t paid their invoices?”

The following screenshot shows an example in which the query editor displays the number of charges blocked per Stripe’s connected account.
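
The exact tables depend on the schema Stripe shares with you, but a query along the following lines produces that kind of result (the table and column names here are illustrative, not Stripe's actual schema):

-- Count blocked charges per connected account (illustrative schema)
SELECT connected_account_id,
       COUNT(*) AS blocked_charges
FROM stripe.charges
WHERE outcome_type = 'blocked'
GROUP BY connected_account_id
ORDER BY blocked_charges DESC;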

Use federated queries

The modern data architecture of Amazon Redshift enables you to store data in purpose-built data stores based on specific use cases, and allows querying external databases on Amazon Relational Database Service (Amazon RDS) or datasets in an Amazon Simple Storage Service (Amazon S3) data lake without moving these datasets to Amazon Redshift clusters. You can dive deeper by incorporating data from Amazon RDS, or from an S3 data lake through Amazon Redshift Spectrum. This capability provides a native integration without requiring additional ETL jobs.

The following syntax allows you to create an external schema from an Amazon Aurora MySQL-Compatible Edition database to an Amazon Redshift cluster. Amazon Redshift assumes an AWS Identity and Access Management (IAM) role and uses AWS Secrets Manager to access external data stores. For more information and examples with other supported data stores, refer to Querying data with federated queries in Amazon Redshift.

CREATE EXTERNAL SCHEMA auroram
FROM MYSQL
DATABASE 'example_database' SCHEMA 'example_schema' -- schema is optional
URI 'hostname'
IAM_ROLE 'iam_role_arn'
SECRET_ARN 'aws_secrets_manager_arn';

Coming back to Stripe Data Pipeline, you can now combine the data from an Aurora table and create further analysis. For example, you can correlate customer acquisition trends with sales campaigns by region, so you can gauge campaign effectiveness and make adjustments to marketing strategies.
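
For instance, a sketch of such an analysis might join the federated auroram schema created above with Stripe data; the campaign table and the Stripe column names are illustrative:

-- Correlate campaign reach with new Stripe customers by region (illustrative)
SELECT m.region,
       m.campaign_name,
       COUNT(DISTINCT s.customer_id) AS new_customers
FROM auroram.campaigns m
JOIN stripe.customers s
  ON s.signup_region = m.region
  AND s.created >= m.start_date
GROUP BY m.region, m.campaign_name;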

Create visualizations and dashboards

Now that your complete set of business data is accessible from Amazon Redshift, you can start to explore the data and create visualizations. Amazon QuickSight is a serverless BI service that allows you to easily connect to a data source, create analyses, publish dashboards, and share between teams. QuickSight seamlessly integrates with AWS services such as Amazon Redshift, Amazon S3, and many more.

The following screenshot illustrates how straightforward it is to connect an Amazon Redshift instance to QuickSight as a new data source.

The following screenshot is of a sample QuickSight dashboard pulling data from Amazon Redshift.

Key considerations

When using Stripe Data Pipeline, consider the following:

  • Instance type – This solution is available for all RA3 node types. If you run an existing DS2 or DC2 cluster, there are multiple options to migrate to RA3, including elastic resize, snapshot and restore, and classic resize. For more information, including an upgrade sizing reference between different node types, refer to Upgrading to RA3 node types.
  • RI migration – If you have Amazon Redshift Reserved Instances (RIs), you can use the RI migration feature to migrate the DS2 RI clusters to equivalent RA3 RI clusters as part of a cross-instance resize or cross-instance snapshot restore operation. The RA3 RI covering the new cluster will be the same cost and on the same calendar terms as the original DS2 RI for supported configurations.
  • Encryption – The consumer cluster must be encrypted as part of the enhanced security control for cross-account sharing. You can enable encryption at cluster creation time, or modify an unencrypted cluster with either AWS Key Management Service (AWS KMS) or AWS CloudHSM.
  • Federated queries – This capability works with external DB instances, including Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Aurora MySQL-Compatible Edition. You should also ensure that you have an Amazon Redshift cluster with a cluster maintenance version that supports federated queries.

Conclusion

In this post, we introduced Stripe Data Pipeline for Amazon Redshift and discussed options to further integrate with AWS services. Stripe Data Pipeline removes the need to build custom API integration or adopt a third-party ETL pipeline, making data accessible with a few clicks and with no code required. Businesses can automatically receive up-to-date data from Stripe in their data warehouse on AWS, reduce data silos, and extract deep insights to address business needs.

Check out Stripe Data Pipeline for more information about the solution and how to get started.


About the Authors

Jessica Ho is a Sr. Partner Solutions Architect at AWS supporting ISV partners who build business applications. She is passionate about creating differentiated solutions that promote cloud adoption. Outside of work, she enjoys spoiling her garden into a mini jungle.

Alexander Mahabir is a Sr. Partner Solutions Architect at AWS based in the D.C. metropolitan area. Alex has over 16 years of experience building cloud and on-premises solutions for small, medium, and large enterprises. Alex currently works with ISV partners in the Digital Customer Experience segment.

Use a linear learner algorithm in Amazon Redshift ML to solve regression and classification problems

Post Syndicated from Phil Bates original https://aws.amazon.com/blogs/big-data/use-a-linear-learner-algorithm-in-amazon-redshift-ml-to-solve-regression-and-classification-problems/

Amazon Redshift is the fastest, most widely used, fully managed, petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Amazon Redshift ML, powered by Amazon SageMaker, makes it easy for SQL users such as data analysts, data scientists, and database developers to create, train, and deploy machine learning (ML) models using familiar SQL commands and then use these models to make predictions on new data for use cases such as churn prediction, customer lifetime value prediction, and product recommendations. Redshift ML makes the model available as a SQL function within the Amazon Redshift data warehouse so you can easily use it in queries and reports. Customers across all verticals are using Redshift ML to derive better insights from their data. For example, Jobcase uses Redshift ML to recommend job searches at scale. Magellan RX Management uses Redshift ML to predict drug therapeutic use conditions.

Amazon Redshift supports supervised learning, including regression, binary classification, and multi-class classification, as well as unsupervised learning using K-Means. You can optionally specify XGBoost, MLP, and now linear learner model types, which are supervised learning algorithms used for solving either classification or regression problems. The linear learner algorithm provides a significant increase in speed over traditional hyperparameter optimization techniques. Amazon Redshift also supports bring-your-own-model to invoke remote SageMaker endpoints.

In this post, we show you how to use Redshift ML to solve regression and classification problems using the SageMaker linear learner algorithm, which explores different training objectives and chooses the best solution from a validation set.

Solution overview

We first solve a linear regression problem, followed by a multi-class classification problem.

The following table shows some common use cases and algorithms used.

Use Case Algorithm / Problem Type
Customer churn prediction Classification
Predict if a sales lead will close Classification
Fraud detection Classification
Price and revenue prediction Linear regression
Customer lifetime value prediction Linear regression

To use the linear learner algorithm, you need to provide inputs or columns representing dimensional values and also the label or target, which is the value you’re trying to predict. The linear learner algorithm trains many models in parallel, and automatically determines the most optimized model.

Prerequisites

To get started, we need an Amazon Redshift cluster or an Amazon Redshift Serverless endpoint and an AWS Identity and Access Management (IAM) role attached that provides access to SageMaker and permissions to an Amazon Simple Storage Service (Amazon S3) bucket.

For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

To create a simple cluster with a default IAM role, see Use the default IAM role in Amazon Redshift to simplify accessing other AWS services.

Use case 1: Linear regression

In this use case, we analyze the Abalone dataset and determine the relationship between an abalone's physical measurements and its age. Traditionally, the age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a time-consuming task. We want to predict the age using the physical measurements, which are easier to obtain. The age of an abalone is (number of rings + 1.5) years.

Prepare the data

Load the Abalone dataset into Amazon Redshift using the following SQL. You can use the Amazon Redshift query editor v2 or your preferred SQL tool to run these commands.

To create the table, use the following commands:

create table abalone_dataset
( 
id INT IDENTITY(1,1),
Sex     CHAR(1),
Length float,
Diameter float,
Height float,
Whole  float,
Shucked  float,
Viscera  float,
Shell float,
Rings  integer   
); 

To load data into Amazon Redshift, use the following COPY command:

COPY abalone_dataset 
from 's3://redshift-ml-multiclass/abalone.csv' 
REGION 'us-east-1'
IAM_ROLE default
CSV IGNOREHEADER 1
NULL AS 'NULL'; 
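
A quick row count confirms the load; the full UCI Abalone dataset contains 4,177 rows:

select count(*) from abalone_dataset;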

To train the model, we use 80% of the data in the abalone table, and then test the model's accuracy by checking whether it correctly predicts the rings label attribute on the remaining 20%. Run the following commands to create the training and validation tables:

create table abalone_training as
  select * from abalone_dataset where mod(id,10) < 8 ; 

create table abalone_validation as
  select * from abalone_dataset where mod(id,10) >= 8; 

Create a model in Redshift ML

To create a model in Amazon Redshift, use the following command:

CREATE MODEL model_abalone_ring_prediction
FROM (
 SELECT   Sex ,
Length ,
Diameter ,
Height ,
Whole  ,
Shucked  ,
Viscera  ,
Shell,
Rings  as target_label
FROM abalone_training
) TARGET target_label
FUNCTION f_abalone_ring_prediction
IAM_ROLE default
MODEL_TYPE LINEAR_LEARNER 
PROBLEM_TYPE REGRESSION
OBJECTIVE 'MSE'
SETTINGS (
  S3_BUCKET '<your-s3-bucket>',
  MAX_RUNTIME 15000
  ); 

We define the following parameters in the CREATE MODEL statement:

  • Problem type – We use the linear learner problem type, which is newly added to extend upon typical linear models by training many models in parallel, in a computationally efficient manner.
  • Objective – We specified MSE (mean square error) as our objective, which is a common metric for evaluation of regression problems.
  • Max runtime – This parameter denotes how long the model training can run. Specifying a larger value may help create a better tuned model. The default value for this parameter is 5400 (90 minutes). For this example, we set it to 15000.

The preceding statement takes a few seconds to complete. It initiates an Amazon SageMaker Autopilot process in the background to automatically build, train, and tune the best ML model for the input data. It then uses Amazon SageMaker Neo to deploy that model locally in the Amazon Redshift cluster or Amazon Redshift Serverless as a user-defined function (UDF). You can use the SHOW MODEL command in Amazon Redshift to track the progress of your model creation, which should be in the READY state within the max_runtime parameter you defined while creating the model.

To check the status of the model, use the following command:

show model model_abalone_ring_prediction;

The following is the tabular outcome for the preceding command after model training was done. It took approximately 120 minutes to train the model.

Key Value
Model Name model_abalone_ring_prediction
Schema Name public
Owner awsuser
Creation Time Tue, 10.05.2022 19:42:33
Model State READY
validation:mse 4.082088
Estimated Cost 5.423719
. .
TRAINING DATA: .
Query SELECT SEX , LENGTH , DIAMETER , HEIGHT , WHOLE , SHUCKED , VISCERA , SHELL, RINGS AS TARGET_LABEL
. FROM ABALONE_TRAINING
Target Column TARGET_LABEL
. .
PARAMETERS: .
Model Type linear_learner
Problem Type Regression
Objective MSE
AutoML Job Name redshiftml-20220510194233380173
Function Name f_abalone_ring_prediction
Function Parameters sex length diameter height whole shucked viscera shell
Function Parameter Types bpchar float8 float8 float8 float8 float8 float8 float8
IAM Role default-aws-iam-role
S3 Bucket poc-generic-bkt
Max Runtime 15000

Model validation

We notice from the preceding table that the MSE for the training data is 4.08. Now let's run the prediction query and validate the accuracy of the model on the validation dataset:

select
ROUND(AVG(POWER(( tgt_label - predicted ),2)),2) mse
, ROUND(SQRT(AVG(POWER(( tgt_label - predicted ),2))),2) rmse
from 
(
SELECT   Sex ,
Length ,
Diameter ,
Height ,
Whole  ,
Shucked  ,
Viscera  ,
Shell,
Rings as tgt_label,
f_abalone_ring_prediction(Sex ,Length ,Diameter ,Height ,Whole  ,Shucked  ,Viscera  ,Shell) as predicted,
case when tgt_label = predicted then 1
      else 0 end as match,
  case when tgt_label  <> predicted then 1
    else 0 end as nonmatch
FROM abalone_validation
) t1;

The following is the outcome from the query:

mse rmse
5.08 2.25

The MSE value from the preceding query results indicates that our model's predictions are reasonably close to the actual values in our validation dataset.
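
With the model in the READY state, you can also invoke the prediction function directly for quick spot checks. The following sketch passes in the measurements of the first record of the Abalone dataset (its actual ring count is 15):

select f_abalone_ring_prediction('M', 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15) as predicted_rings;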

Notice that Redshift ML identified the right combination of features to come up with a usable prediction model. We can further check each attribute's contribution and weight in model decision-making using the following command:

select explain_model ('model_abalone_ring_prediction');

The following is the outcome, where each attribute weightage is representative of its role in model decision-making:

{"explanations":
    {"kernel_shap":
      {"label0":
        {"expected_value":10.06938362121582,
          "global_shap_values":
          {
            "diameter":0.6897614190439705,
            "height":0.38391156156643987,
            "length":0.29646334630067408,
            "sex":0.5516722137411109,
            "shell":1.5679368167990147,
            "shucked":2.222549468867254,
            "viscera":0.2879883139361144,
            "whole":0.8085603201751219
            }
        }
       }
      }
      ,"version":"1.0"
    }

Use case 2: Multi-class classification

For this use case, we use the Covertype dataset (copyright Jock A. Blackard and Colorado State University), which contains information collected by the US Geological Survey and the US Forest Service about wilderness areas in northern Colorado. This has been downloaded to an S3 bucket to make it simple to create the model. You may want to download the dataset description. This dataset contains various measurements such as elevation, distance to waters and roadways, as well as the wilderness area designation and the soil type. Our ML task is to create a model to predict the cover type for a given area.

Prepare the data

To prepare the data for this model, you need to create and populate the table public.covertype_data in Amazon Redshift using the Covertype dataset. You may use the following SQL in Amazon Redshift query editor v2 or your preferred SQL tool:

CREATE TABLE public.covertype_data (
    elevation bigint ENCODE az64,
    aspect bigint ENCODE az64,
    slope bigint ENCODE az64,
    horizontal_distance_to_hydrology bigint ENCODE az64,
    vertical_distance_to_hydrology bigint ENCODE az64,
    horizontal_distance_to_roadways bigint ENCODE az64,
    hillshade_9am bigint ENCODE az64,
    hillshade_noon bigint ENCODE az64,
    hillshade_3pm bigint ENCODE az64,
    horizontal_distance_to_fire_points bigint ENCODE az64,
    wilderness_area1 bigint ENCODE az64,
    wilderness_area2 bigint ENCODE az64,
    wilderness_area3 bigint ENCODE az64,
    wilderness_area4 bigint ENCODE az64,
    soil_type1 bigint ENCODE az64,
    soil_type2 bigint ENCODE az64,
    soil_type3 bigint ENCODE az64,
    soil_type4 bigint ENCODE az64,
    soil_type5 bigint ENCODE az64,
    soil_type6 bigint ENCODE az64,
    soil_type7 bigint ENCODE az64,
    soil_type8 bigint ENCODE az64,
    soil_type9 bigint ENCODE az64,
    soil_type10 bigint ENCODE az64,
    soil_type11 bigint ENCODE az64,
    soil_type12 bigint ENCODE az64,
    soil_type13 bigint ENCODE az64,
    soil_type14 bigint ENCODE az64,
    soil_type15 bigint ENCODE az64,
    soil_type16 bigint ENCODE az64,
    soil_type17 bigint ENCODE az64,
    soil_type18 bigint ENCODE az64,
    soil_type19 bigint ENCODE az64,
    soil_type20 bigint ENCODE az64,
    soil_type21 bigint ENCODE az64,
    soil_type22 bigint ENCODE az64,
    soil_type23 bigint ENCODE az64,
    soil_type24 bigint ENCODE az64,
    soil_type25 bigint ENCODE az64,
    soil_type26 bigint ENCODE az64,
    soil_type27 bigint ENCODE az64,
    soil_type28 bigint ENCODE az64,
    soil_type29 bigint ENCODE az64,
    soil_type30 bigint ENCODE az64,
    soil_type31 bigint ENCODE az64,
    soil_type32 bigint ENCODE az64,
    soil_type33 bigint ENCODE az64,
    soil_type34 bigint ENCODE az64,
    soil_type35 bigint ENCODE az64,
    soil_type36 bigint ENCODE az64,
    soil_type37 bigint ENCODE az64,
    soil_type38 bigint ENCODE az64,
    soil_type39 bigint ENCODE az64,
    soil_type40 bigint ENCODE az64,
    cover_type bigint ENCODE az64 
)
DISTSTYLE AUTO;	


Copy public.covertype_data
From 's3://redshift-ml-multiclass/covtype.data.gz'
iam_role default 
gzip
delimiter ','
region 'us-east-1';

Now that our dataset is loaded, we run the following SQL statements to split the data into three sets for training (80%), validation (10%), and prediction (10%). Note that Redshift ML Autopilot automatically splits the data into training and validation, but by splitting it here, you’re able to verify the accuracy of your model.

To prepare the dataset, assign random values to split the data:

create table  public.covertype_data_prep  as
select a.*,
cast (random() * 100 as int) as data_group_id
from public.covertype_data a;

Use the following code for the training set:

Create table public.covertype_training as
Select * from public.covertype_data_prep
Where data_group_id < 80;

Use the following code for the validation set:

Create table public.covertype_validation as
Select * from public.covertype_data_prep
Where data_group_id between 80 and 89;

Use the following code for the test set:

Create table public.covertype_test as
Select * from public.covertype_data_prep
Where data_group_id > 89; 
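
Before creating the model, you can optionally verify that the three tables hold roughly 80%, 10%, and 10% of the rows:

select
  (select count(*) from public.covertype_training)   as training_rows,
  (select count(*) from public.covertype_validation) as validation_rows,
  (select count(*) from public.covertype_test)       as test_rows;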

Now that we have our datasets, it’s time to create the model.

Create a model in Redshift ML using linear learner

Run the following SQL command to create your model—note our target is cover_type and we use all the inputs from our training set:

CREATE  MODEL forest_cover_type_model
FROM (select Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
  Cover_type from public.covertype_training)
TARGET cover_type
FUNCTION predict_cover_type
IAM_ROLE default
MODEL_TYPE LINEAR_LEARNER              
PROBLEM_TYPE MULTICLASS_CLASSIFICATION 
 OBJECTIVE 'Accuracy'
SETTINGS (
  S3_BUCKET '<your-amazon-s3-bucket-name>',
  S3_GARBAGE_COLLECT OFF,
  MAX_RUNTIME 9600
);

You can use the SHOW MODEL command to view the status of the model.
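
For example, run the following command and look for the READY state and the reported accuracy:

show model forest_cover_type_model;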

You can see that the model has an accuracy score of .730279 and is in the READY state. Now let’s run a query to do some validation of our own.

Model validation

Run the following SQL query against the validation table, using the function created by our model:

select 
 cast(sum(t1.match)as decimal(7,2)) as predicted_matches
,cast(sum(t1.nonmatch) as decimal(7,2)) as predicted_non_matches
,cast(sum(t1.match + t1.nonmatch) as decimal(7,2))  as total_predictions
,predicted_matches / total_predictions as pct_accuracy
from 
(select 
Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
   Cover_type as actual_cover_type,
   predict_cover_type( Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40) as predicted_cover_type,
case when actual_cover_type = predicted_cover_type then 1
      else 0 end as match,
  case when actual_cover_type <>  predicted_cover_type then 1
    else 0 end as nonmatch

from public.covertype_validation
) t1;

You can see that our accuracy is very close to our score from the SHOW MODEL output.

Run a prediction query

Let’s run a prediction query in Amazon Redshift ML using our function against our test dataset to see the most common class of cover type for the Neota Wilderness Area. We can identify this area by checking wilderness_area2 for a value of 1.

The dataset includes the following wilderness areas:

  1. Rawah Wilderness Area
  2. Neota Wilderness Area
  3. Comanche Peak Wilderness Area
  4. Cache la Poudre Wilderness Area

The cover types are in seven different classes:

  1. Spruce/Fir
  2. Lodgepole Pine
  3. Ponderosa Pine
  4. Cottonwood/Willow
  5. Aspen
  6. Douglas Fir
  7. Krummholz

There are also 40 different soil type definitions, which you can see in the dataset description, with a value of 0 or 1 to note if it’s applicable for a particular row. The following are a few example soil types:

  1. Cathedral family – Rock outcrop complex, extremely stony
  2. Vanet-Ratake families – Rocky outcrop complex, very stony
  3. Haploborolis family – Rock outcrop complex, rubbly
  4. Ratake family – Rock outcrop complex, rubbly
  5. Vanet family – Rock outcrop complex, rubbly
  6. Vanet-Wetmore families – Rock outcrop complex, stony

The following query returns the count of each predicted cover type in the Neota Wilderness Area:

select t1.predicted_cover_type, count(*)
from 
(     
select 
   Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
   predict_cover_type( Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10,
   Soil_Type11,  
   Soil_Type12, 
   Soil_Type13,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40) as predicted_cover_type 

from public.covertype_test
where wilderness_area2 = 1)
t1
group by 1;

Our model predicts that the majority of the cover in this area is Spruce/Fir.

You can experiment with various combinations, such as determining which soil types are most likely to occur in a predicted cover type.

Conclusion

Redshift ML makes it easy for users of all levels to create, train, and tune models using a SQL interface. In this post, we walked you through how to use the linear learner algorithm to create regression and multi-class classification models. You can then use those models to make predictions using simple SQL commands and gain valuable insights.

To learn more about Redshift ML, visit Amazon Redshift ML.


About the Authors

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Sohaib Katariwala is an Analytics Specialist Solutions Architect at AWS. He has 12+ years of experience helping organizations derive insights from their data.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Jiayuan Chen is a Senior Software Development Engineer at AWS. He is passionate about designing and building data-intensive applications, and has been working in the areas of data lake, query engine, ingestion, and analytics. He keeps up with latest technologies and innovates things that spark joy.

Debu Panda is a Senior Manager, Product Management at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Federate single sign-on access to Amazon Redshift query editor v2 with Okta

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/federate-single-sign-on-access-to-amazon-redshift-query-editor-v2-with-okta/

Amazon Redshift query editor v2 is a web-based SQL client application that you can use to author and run queries on your Amazon Redshift data warehouse. You can visualize query results with charts and collaborate by sharing queries with members of your team. You can use query editor v2 to create databases, schemas, tables, and load data from Amazon Simple Storage Service (Amazon S3) using the COPY command or by using a wizard. You can browse multiple databases and run queries on your Amazon Redshift data warehouse or data lake, or run federated queries to operational databases such as Amazon Aurora. Because it’s a managed SQL editor in your browser and it’s integrated with your single sign-on (SSO) provider, Amazon Redshift query editor v2 reduces the number of steps to the first query so you gain insights faster.

Amazon Redshift query editor v2 integration with your identity provider (IdP) automatically redirects the user’s browser to the query editor v2 console instead of the Amazon Redshift console. This enables your users to easily access Amazon Redshift clusters through query editor v2 using federated credentials without managing database users and passwords.

In this post, we focus on Okta as the IdP and illustrate how to set up your Okta application and AWS Identity and Access Management (IAM) permissions. We also demonstrate how you can limit the access for your users to use only query editor v2 without giving them access to perform any admin functions on the AWS Management Console.

Solution overview

The high-level steps in this post are as follows:

  1. Set up Okta, which contains your users organized into logical groups, and the AWS Account Federation application.
  2. Set up two IAM roles: one that establishes a trust relationship between your IdP and AWS, and a second role that Okta uses to access Amazon Redshift.
  3. Complete Okta advanced configuration:
    1. Finalize the Okta configuration by inputting the roles that you just created.
    2. Define a default relay state to direct users to Amazon Redshift query editor v2 after successful SAML authentication.
    3. Configure the SAML PrincipalTagAttribute. This element allows you to pass attributes as session tags in the SAML assertion. For more information about session tags, see Passing session tags in AWS STS.
  4. Set up Amazon Redshift database groups:
    1. Create groups within the Amazon Redshift database to match the Okta groups.
    2. Authorize these groups to access certain schemas and tables.
    3. Access Amazon Redshift query editor v2 using your enterprise credentials and query your Amazon Redshift database.
  5. Sign in to your Okta account and access the application assigned to you. The application directs you to Amazon Redshift query editor v2 using federated login.
  6. Access and query your Amazon Redshift database.

Prerequisites

This post assumes that you have the following prerequisites:

Set up Okta

First, we set up the Okta application and create users and groups. Complete the following steps:

  1. Log in to your Okta admin console using the URL https://<prefix>-admin.okta.com/admin/dashboard, where <prefix> is specific to your account and was created at account setup.
  2. On the admin console, choose Admin.
  3. Under Directory in the navigation pane, choose People.
  4. To add users, choose Add person.
    The following screenshot shows the users that we created.
  5. To add groups into Okta, choose Groups in the navigation pane, then choose Add group.
    The following screenshot shows two groups that we created. We added Jane to analyst_users and Mike to bi_users.
  6. Under Applications in the navigation pane, choose Applications and choose Browse App Catalog.
  7. Search for AWS Account Federation and choose Add.
  8. After you add the application, choose AWS Account Federation.
  9. Leave the values in General Settings at their default and choose Next.
  10. Under Sign-On Options, select SAML 2.0.
  11. Choose the Identity Provider metadata link to download the metadata file in .xml format.

Configure IAM roles

Next, you set up an IAM role that establishes a trust relationship between the IdP and AWS. You also create an IAM role that Okta uses to access Amazon Redshift query editor v2.

  1. On the IAM console, under Access management in the navigation pane, choose Identity providers.
  2. Choose Add provider.
  3. For Provider type, select SAML.
  4. For Provider name, enter a name.
  5. Choose Choose file and upload the metadata file you downloaded.
  6. Choose Add provider.

    Now you create the IAM SAML 2.0 federation role.
  7. On the IAM console, choose Roles in the navigation pane.
  8. Choose Create role.
  9. For Trusted entity type, select SAML 2.0 federation.
  10. For SAML 2.0-based provider, choose the IdP you created in the previous step.
  11. Select Allow programmatic and AWS Management Console access.
  12. Choose Next and then choose Create Policy.

Amazon Redshift query editor v2 provides multiple managed policies to access the query editor. For a list of managed policies, refer to Configuring your AWS account. The managed policy enables you to limit the access for your users to use only query editor v2 without giving them access to perform any admin functions on the console. For this post, we use the AmazonRedshiftQueryEditorV2ReadSharing managed policy and create a custom policy.

In the following code, provide your Region, account, and cluster parameters to grant access to Amazon Redshift to get cluster credentials, create users, and allow users to join groups:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RedshiftClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials",
                "redshift:CreateClusterUser",
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<cluster>,                 
                "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${aws:PrincipalTag/RedshiftDbUser}",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/analyst_users",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/bi_users",
                "arn:aws:redshift:<region>:<account>:dbname:<cluster>/${redshift:DBName}"
            ]
        }
    ]
}

The group membership lasts only for the duration of the user session. Additionally, there is no CreateGroup permission because groups need to be manually created and granted DB privileges.

  1. Attach the policy you created to the role.
    The following screenshot shows the summary page for the role.
  2. Modify the trust relationships for your role and add sts:TagSession permission.
    When using session tags, trust policies for all roles connected to the IdP passing tags must have the sts:TagSession permission. For roles without this permission in the trust policy, the AssumeRole operation fails.
  3. Choose Update policy.

Set up Okta advanced configuration

In this section, you finalize the Okta configuration by adding the IAM roles that you just created. You set up SAML PrincipalTag attributes such as RedshiftDbUser and RedshiftDbGroups, which are passed in the SAML assertion to federate the access to Amazon Redshift query editor v2. You also define a default relay state, which is the URL that users are directed to after a successful authentication through SAML.

  1. In your Okta account, open the AWS Account Federation app.
  2. On the Sign On tab, set Default Relay State to the query editor URL, using the format https://<region>.console.aws.amazon.com/sqlworkbench/home. For this post, we use https://eu-west-1.console.aws.amazon.com/sqlworkbench/home.
  3. Choose Attributes and set up the following attribute mappings:
    1. Set the DB user using PrincipalTag:RedshiftDbUser. This uses the user name in the directory. This is a required tag and defines the database user that is used by query editor v2.
    2. Set the DB groups using PrincipalTag:RedshiftDbGroups. This uses the Okta groups to fill the principal tags. Its value must be a colon-separated list.
    3. Set the transitive keys using TransitiveTagKeys. This prevents users from changing the session tags in case of role chaining.

These tags are forwarded to the redshift:GetClusterCredentials API to get credentials for your cluster. The following table summarizes their attribute configuration.

Name Name Format Value
https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbUser Unspecified user.username
https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbGroups Unspecified String.join(":", isMemberOfGroupName("analyst_users") ? 'analyst_users' : '', isMemberOfGroupName("bi_users") ? 'bi_users' : '')
https://aws.amazon.com/SAML/Attributes/TransitiveTagKeys Unspecified Arrays.flatten("RedshiftDbUser", "RedshiftDbGroups")

  1. Under Advanced Sign-on Settings, select Use Group Mapping.
  2. Enter the IdP and IAM role ARNs, which are globally unique, and make sure that Okta is directed to your AWS account.
  3. Authorize users to use the AWS Account Federation application by selecting their respective groups or individual user accounts. In this example, we authorized users by group.

Set up Amazon Redshift database groups

Next, you set up groups in the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables.

  1. Log in to your Amazon Redshift cluster with an admin account.
  2. Create groups that match the IdP group names, and grant the appropriate permissions to tables and schemas:
CREATE GROUP analyst_users;
CREATE GROUP bi_users;

ALTER DEFAULT PRIVILEGES IN SCHEMA sales_analysis
GRANT SELECT on TABLES to GROUP analyst_users;
GRANT USAGE on SCHEMA sales_analysis to GROUP analyst_users;
GRANT SELECT on ALL TABLES in SCHEMA sales_analysis to GROUP analyst_users;

ALTER DEFAULT PRIVILEGES IN SCHEMA sales_bi
GRANT SELECT on TABLES to GROUP bi_users;
GRANT USAGE on SCHEMA sales_bi to GROUP bi_users;
GRANT SELECT on ALL TABLES in SCHEMA sales_bi to GROUP bi_users;

In Okta, you created the user Jane and assigned Jane to the analyst_users group.

In the Amazon Redshift database, you created two database groups: analyst_users and bi_users.

When user Jane logs in via federated authentication to Amazon Redshift using query editor v2, the user is created if it doesn’t already exist and the analyst_users database group is assumed. The user Jane can query tables only in sales_analysis schema.

Because user Jane isn’t part of the bi_users group, when they try to access the sales_bi schema, they get a permission denied error.

The following diagram illustrates this configuration.

Access Amazon Redshift query editor v2

Now you’re ready to connect to your Amazon Redshift cluster with Amazon Redshift query editor v2 using federated login. Log in to your Okta account with your user credentials, and under My Apps, choose Amazon Redshift Query Editor V2.

You’re redirected to the Amazon Redshift query editor v2 URL that you specified as the default relay state.

Connect to an Amazon Redshift database and run queries

Now let’s set up the connection to your Amazon Redshift cluster.

  1. In the query editor, choose your cluster (right-click) and choose Create connection.
  2. For Database, enter a name.
  3. For Authentication, select Federated user.
    The user name is prepopulated with your federated login information.
  4. Choose Create connection.

When you’re connected to your Amazon Redshift database, you can verify the connection details, as shown in the following screenshot. Notice the session-level group association as per the group assignment in your Okta application configuration. In this case, user Jane is assigned to the analyst_users group.

This user has access to SELECT all tables in the sales_analysis schema and no access to the sales_bi schema. You can run the following statements to test your access.
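
For example (the sales_bi table name here is illustrative):

-- Succeeds: the analyst_users group has SELECT on this schema
select * from sales_analysis.store_sales_us limit 10;

-- Fails with a permission denied error: Jane is not in bi_users
select * from sales_bi.store_sales_summary limit 10;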

The following screenshot shows the results from a query to the sales_analysis.store_sales_us table.

When user Jane tries to access the tables in the sales_bi schema, they get a permission denied error.

Summary

In this post, we demonstrated how to federate SSO access to Amazon Redshift query editor v2 using Okta as your IdP. We showed how to set up Okta, set different PrincipalTag attributes for query editor v2, and pass group memberships defined in your Okta IdP to your Amazon Redshift cluster. We showed how to log in to Amazon Redshift query editor v2 using federated login and validate the configuration by running a few queries. This solution allows you to control access to Amazon Redshift database objects, and your users can easily access Amazon Redshift clusters through query editor v2 using federated credentials without managing database users and passwords.

If you have any feedback or questions, please leave them in the comments.


About the Authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 16 years of experience in the data warehousing and analytical space.

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data and analytics for over 14 years. His LinkedIn profile can be found here.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Yanis Telaoumaten is a Software Development Engineer at AWS. His passions are building reliable software and creating tools to allow other engineers to work more efficiently. In past years, he has worked on the identity, security, and reliability of Amazon Redshift services.