Tag Archives: Amazon Redshift

Improved speed and scalability in Amazon Redshift

Post Syndicated from Naresh Chainani original https://aws.amazon.com/blogs/big-data/improved-speed-and-scalability-in-amazon-redshift/

Since Amazon Redshift launched in 2012, its focus has always been on providing you with the best possible performance, at scale, and at the lowest possible cost. For most organizations across the globe, the amount of data going into their data warehouse is growing exponentially, and the number of people who want insights from that data increases daily. Because of this, Amazon Redshift is continually innovating to handle this growing volume of data and demand for insights.

Amazon Redshift delivers fast performance, at scale, for the most demanding workloads. Getting there was not easy, and it takes consistent investment across a variety of technical focus areas to make this happen. This post breaks down what it takes to build the world’s fastest cloud data warehouse.

Amazon Redshift’s performance investments are based on a broad spectrum of customer needs and use extensive fleet telemetry data from tens of thousands of customers to guide development work. Your needs may vary based on a variety of factors, including the size of your data warehouse, the number of concurrent users, the skillset of those users, and workload characteristics such as frequency of usage and query latency requirements.

Based on this, Amazon Redshift has performance investments anchored in four key areas:

  • Out-of-the-box performance – Amazon Redshift is more than twice as fast out of the box as it was 6 months ago, and keeps getting faster without any additional manual optimization and tuning
  • Automatic optimizations – Amazon Redshift is self-learning, self-optimizing, and constantly adapts to your actual workload to deliver the best possible performance
  • Scalability – Amazon Redshift can boost throughput by more than 35 times to support increases in concurrent users, and scales linearly for a wide range of workloads
  • Price-performance – Amazon Redshift provides predictable performance at significantly lower total cost than other data warehouse solutions by optimizing resource utilization

Out-of-the-box performance

Amazon Redshift has always enabled you to manually tune workload performance based on statically partitioning memory and specifying the number of concurrent queries. Amazon Redshift also uses advanced machine learning (ML) algorithms to tune configuration settings automatically for optimal data warehouse performance.

These algorithms are extremely effective because they are trained with real-world telemetry data generated from processing over two exabytes of data a day. Additionally, because Amazon Redshift has more customers than any other cloud data warehouse, the telemetry data makes the configuration recommendations from the ML algorithms even more accurate.

In addition, Amazon Redshift Advisor guides optimization by recommending sort keys, distribution keys, and more. However, if you want to override the learning, self-tuning behavior of Amazon Redshift, you still have fine-grained control.

The out-of-the-box performance of Amazon Redshift is continually improving. In November 2019, our Cloud Data Warehouse benchmark[1] showed that the out-of-the-box performance of Amazon Redshift was twice as fast as 6 months ago. This ongoing improvement in performance is the culmination of many technical innovations. This post presents three improvements that have had the most impact.

Improved compression

Amazon Redshift uses AZ64, a novel compression encoding that delivers high compression ratios and significantly improved query performance. With AZ64, you no longer need to trade off storage size against performance. AZ64 compresses data, on average, 35% more than the popular, high-performance LZO algorithm, and processes the data 40% faster. It achieves this by efficiently compressing small groups of data values and using single instruction, multiple data (SIMD) CPU vector instructions to process AZ64-encoded columns in parallel. To benefit from this, you simply select the data type for each column, and Amazon Redshift chooses the compression encoding method. Five months after launch, AZ64 has become the fourth most popular encoding option in Amazon Redshift, applied to millions of columns.
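
As an illustration (the table and column names here are hypothetical, not from the post), you can either let Amazon Redshift pick encodings based on the declared data types or request AZ64 explicitly in the DDL:

-- Let Amazon Redshift choose the compression encoding from the column data types
CREATE TABLE events (
    event_id BIGINT,
    event_ts TIMESTAMP,
    quantity INTEGER
);

-- Or request AZ64 explicitly on numeric, date, and time columns
CREATE TABLE events_az64 (
    event_id BIGINT ENCODE az64,
    event_ts TIMESTAMP ENCODE az64,
    quantity INTEGER ENCODE az64
);

If you want to see what a loaded table would gain from different encodings, ANALYZE COMPRESSION reports suggested encodings per column.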

Efficient large-scale join operations

When complex queries join large tables, massive amounts of data transfers over the network for the join processing on the Amazon Redshift compute nodes. This used to create network bottlenecks that impacted query performance.

Amazon Redshift now uses distributed bloom filters to enable high-performance joins for such workloads. Distributed bloom filters efficiently filter rows at the source that do not match the join relation, which greatly reduces the amount of data transferred over the network. Across the fleet, we see millions of selective bloom filters deployed each day, with some of them filtering more than one billion rows. This removes the network from being the bottleneck and improves overall query performance. Amazon Redshift automatically determines what queries are suitable for this optimization and adapts the optimization at runtime. In addition, Amazon Redshift uses CPU cache-optimized processing to make query execution more efficient for large-scale joins and aggregations. Together, these features improve performance for over 70% of the queries that Amazon Redshift processes daily.

Enhanced query planning

Query planning determines the quickest way to process a given query by evaluating the costs associated with various query plans. Costs include many factors: for example, the number of I/O operations required, the amount of disk buffer space, the time to read from disk, the parallelism available for the query, and the freshness and relevance of table statistics such as the number of rows and the number of distinct values in a column.

The Amazon Redshift query planner factors in the capabilities of modern hardware, including the network stack, to take full advantage of the performance that the hardware offers. Its statistics collection process is automated, and it computes statistics using algorithms like HyperLogLog (HLL), which improves the quality of the statistics and therefore enables the cost-based planner to make better choices.
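
To see which plan the cost-based planner actually picked for a query, you can prefix it with EXPLAIN; the following is a minimal sketch using a hypothetical sales table:

-- Show the chosen plan, including join strategy, distribution steps, and estimated costs
EXPLAIN
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id;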

Automatic optimizations

Amazon Redshift uses ML techniques and advanced graph algorithms to continuously improve performance. Different workloads have different data access patterns, and Amazon Redshift observes the workload to learn and adapt. It automatically adjusts data layout, distribution keys, and query plans to provide optimal performance for a given workload.

The automatic optimizations in Amazon Redshift make intelligent decisions, such as how to distribute data across nodes, which datasets are frequently queried together and should be co-located, how to sort data, and how many parallel and concurrent queries should run on the system based on the complexity of the query. Amazon Redshift automatically adjusts these configurations to optimize throughput and performance as you use the data warehouse. To make sure these optimizations do not interrupt your workload, Amazon Redshift runs them incrementally and only during the periods when clusters have low utilization.

For maintenance operations, Amazon Redshift reduces the amount of compute resources required by operating only on frequently accessed tables and portions within those tables. Amazon Redshift prioritizes which portions of the table to operate on by analyzing query patterns. When relevant, it provides prescriptive guidance through recommendations in Amazon Redshift Advisor. You can then evaluate and apply those recommendations as needed.
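
If you act on an Advisor recommendation yourself, or hand a table back to the automation, the DDL looks roughly like the following sketch (the table and column names are hypothetical, and exact support for these clauses depends on your Amazon Redshift version):

-- Create a table and let Amazon Redshift manage its distribution style
CREATE TABLE page_views (
    view_id BIGINT,
    user_id BIGINT,
    viewed_at TIMESTAMP
) DISTSTYLE AUTO;

-- Apply a recommended distribution key to an existing table, or return it to AUTO
ALTER TABLE page_views ALTER DISTKEY user_id;
ALTER TABLE page_views ALTER DISTSTYLE AUTO;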

AWS also constantly evaluates and assesses how effective its ML-based self-learning systems are in delivering fast query performance when compared to traditional methods such as expert human tuning.

Amazon Redshift has been adding automatic optimizations for years. The following timeline shows some of the automatic optimizations delivered over the last 12 months. 

For more information about specific optimizations, see the following posts about automatic VACUUM DELETE, automatic ANALYZE, distribution key recommendations, sort key recommendations, automatic table sort, and automatic distribution style.

Scalability

Amazon Redshift can boost throughput by more than 35 times to support increases in concurrent users, and scales linearly for simple and mixed workloads.

Scaling to support a growing number of users

As the number of users accessing the data warehouse grows, you should not experience delays in your query responses. Most of the time, your workloads are not predictable and can vary throughout the day. Traditional data warehouses are typically provisioned for peak usage to avoid delays and missed SLAs, and you end up paying for resources that you are not fully using.

Concurrency Scaling in Amazon Redshift allows the data warehouse to handle spikes in workloads while maintaining consistent SLAs by elastically scaling the underlying resources as needed. Amazon Redshift continuously monitors the designated workload. If the queries start to get backlogged because of bursts of user activity, Amazon Redshift automatically adds transient cluster capacity and routes the requests to these new clusters. This transient capacity is available in a few seconds, so your queries continue to be served with low latency. Amazon Redshift removes the additional transient capacity automatically when activity reduces on the cluster.

You can choose whether to elastically scale for certain workloads, and by how much, with a simple one-step configuration. For every 24 hours that the Amazon Redshift main cluster is in use, you accrue a 1-hour credit. This makes Concurrency Scaling free for more than 97% of use cases.

With the ability to automatically add and remove additional capacity, Amazon Redshift data warehouses can improve overall throughput by over 35 times. The following experiment demonstrates how far you can dynamically allocate more compute power to satisfy the demands of concurrent users. First, take a baseline measurement using the Cloud Data Warehouse benchmark and five concurrent users. You can then enable Concurrency Scaling and add more users with each iteration. As soon as Amazon Redshift detects queuing, it allocates additional scaling clusters automatically. Ultimately, this experiment ran over 200 concurrent queries on Amazon Redshift and generated more than 35 times greater throughput. This many concurrently executing queries represents a concurrent user population of several thousand, which demonstrates how you can support virtually unlimited concurrent users on your Amazon Redshift data warehouses.

The scaling for concurrent users is also linear. You get consistent increases in performance with every extra dollar spent on the Concurrency Scaling clusters. This helps to keep data warehouse costs predictable as business needs grow. With Concurrency Scaling, AWS can perform benchmark tests with tens of thousands of queries per hour, with hundreds of queries running concurrently and providing linear scale. This represents a real-world workload in enterprises with thousands of concurrent users connecting to the data warehouse.
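
To see when Concurrency Scaling clusters were active and how much usage you accumulated, you can query the system views; the following is a hedged sketch (the view name is taken from the Amazon Redshift system-table documentation and should be verified for your cluster version):

-- Usage periods and accumulated seconds on Concurrency Scaling clusters
SELECT *
FROM svcs_concurrency_scaling_usage
ORDER BY start_time DESC;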

Scaling while running multiple mixed workloads

As data warehouses grow over time, the number and complexity of the workloads that run on the data warehouse also increase. For example, if you migrate from an on-premises data warehouse to Amazon Redshift, you might first run traditional analytics workloads, and eventually bring more operational and real-time data into the cluster to build new use cases and applications. To scale any data warehouse effectively, you must be able to prioritize and manage multiple types of workloads concurrently. Automatic workload management (WLM) and query priorities are two recent capabilities added to Amazon Redshift that enable you to do just that.

Automatic WLM makes sure that you use cluster resources efficiently, even with dynamic and unpredictable workloads. With automatic WLM, Amazon Redshift uses ML to classify incoming queries based on their memory and processing requirements and routes them to appropriate internal queues to start executing in parallel. Amazon Redshift dynamically adjusts the number of queries to execute in parallel to optimize overall throughput and performance. When queries that require large amounts of resources are in the system (for example, hash joins between large tables), the concurrency is lower. When you submit lighter queries (such as inserts, deletes, or simple aggregations), concurrency is higher. There is a feedback loop to continuously monitor system utilization and regulate admission into the cluster.

However, not all queries may be equally important to you; the performance of one workload or set of users might be more important than others. Query priorities in Amazon Redshift address this. You can give higher-priority workloads preferential treatment, including more resources during busy times, for consistent query performance. Amazon Redshift workload management uses intelligent algorithms to make sure that lower-priority queries continue to make progress and don’t stall.
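
To inspect how automatic WLM has configured your queues, you can query the WLM system tables; a minimal sketch follows (the table name matches the Amazon Redshift documentation, while the service-class range used for automatic WLM is an assumption to verify):

-- Current WLM queue (service class) configuration; automatic WLM queues typically use service classes 100 and above
SELECT *
FROM stv_wlm_service_class_config
WHERE service_class >= 100;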

You can combine Amazon Redshift Concurrency Scaling and automatic WLM with query priorities to solve complex data warehouse use cases. For example, the following table summarizes an Amazon Redshift configuration that effectively mixes ETL with analytics workloads.

WLM queue | Queue priority | Concurrency Scaling | Notes
ETL | High | Off | When ETL runs, it gets the highest priority
BI queries | Normal | On | When the BI workload suddenly increases, Concurrency Scaling adds capacity to maintain user SLAs
One-time or exploratory queries | Low | Off | Cluster offers analytic access for casual users and data scientists when resources are available

For this use case, and many more, you can maintain SLAs, achieve efficiencies with your cluster utilization, and get sufficient flexibility to invest according to business priorities.

Price performance

You can measure price performance by calculating both the total cost of the computing service consumed and the total amount of computing work performed. Maximum performance for minimum cost gives you the best price performance.

As your data warehouses grow, Amazon Redshift gets more efficient. It moderates cost increases and keeps costs predictable, even as your needs grow. This sets Amazon Redshift apart from others in the market whose prices climb much more steeply as the number of users grows.

The investments in automatic optimizations, out-of-the-box performance, and scale all contribute to the unbeatable price performance that Amazon Redshift offers. When you compare typical customer quotes and investments, you find that Amazon Redshift costs 50%–75% less than other cloud data warehouses.

Measuring performance

AWS measures performance, throughput, and price-performance on a nightly basis. AWS also runs larger and more comprehensive benchmarks regularly to make sure the tests extend beyond your current needs. For benchmark results to be useful, they need to be well defined and easily reproducible. AWS uses the Cloud DW benchmark, which is based on current TPC-DS and TPC-H benchmarks without any query or data modifications and is compliant with TPC rules and requirements.

It’s important that anyone can reproduce these benchmarks; you can download the benchmark codes and scripts from GitHub and the accompanying dataset from a public Amazon S3 bucket.

Summary

Amazon Redshift is self-learning, self-optimizing, and consistently uses telemetry of the actual workload to deliver the best possible performance. Amazon Redshift is more than twice as fast out of the box as it was 6 months ago, and keeps getting faster without any manual optimization and tuning. Amazon Redshift can boost throughput by more than 35 times to support increases in concurrent users, and scales linearly for simple and mixed workloads.

In addition to software improvements, AWS continues to build data warehouses on the best hardware available. The new RA3 node type with managed storage features high-bandwidth networking and sizable high-performance SSDs as local caches. RA3 nodes use your workload patterns and advanced data management techniques, such as automatic fine-grained data eviction and intelligent data pre-fetching, to deliver the performance of local SSD while scaling storage automatically to Amazon S3. AQUA (Advanced Query Accelerator), a hardware-based performance improvement now in preview, brings even more dramatic gains and drives costs down with a new distributed and hardware-accelerated cache.

These performance improvements are the cumulative result of years of strategic and sustained product investment and technical innovation across multiple areas such as automatic optimizations, out-of-the-box performance, and scalability. Additionally, price-performance remains a priority so you receive the best value.

Each dataset and workload has unique characteristics, and a proof of concept is the best way to understand how Amazon Redshift performs in your unique situation. When running your own proof of concept, it’s important that you focus on the right metrics: query throughput (number of queries per hour) and price-performance for your workload.[1]
You can make a data-driven decision by requesting assistance with a proof of concept or working with a system integration and consulting partner. It’s also important to consider not only how a data warehouse performs with your current needs, but also its future performance with increasingly complex workloads, datasets, and users.

To stay up-to-date with the latest developments in Amazon Redshift, subscribe to the What’s New in Amazon Redshift RSS feed.

———
[1] The TPC Benchmark, TPC-DS and TPC-H are trademarks of the Transaction Processing Performance Council www.tpc.org


About the Author

Naresh Chainani is a Senior Software Development Manager at Amazon Redshift where he leads Query Processing, Query Performance, Distributed Systems and Workload Management with a strong team. Naresh is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.

 

 

Berni Schiefer is a Senior Development Manager for EMEA at Amazon Web Services, leading the Amazon Redshift development team in Berlin, Germany. The Berlin team focuses on Redshift Performance and Scalability, SQL Query Compilation, Spatial Data support, and Redshift Spectrum. Previously, Berni was an IBM Fellow working in the area of Private Cloud, Db2, Db2 Warehouse, BigSQL, with a focus on SQL-based engines, query optimization and performance.

 

 

Neeraja Rentachintala is a seasoned Product Management and GTM leader at Amazon Web Services, bringing over 20 years of experience in Product Vision, Strategy and Leadership roles in industry-leading data products and platforms. During her career, she delivered products in Analytics, Big data, Databases, Data and Application Integration, AI/ML serving Fortune 500 enterprise and ventures including MapR (acquired by HPE), Microsoft, Oracle, Informatica and Expedia.com. Currently Neeraja is a Principal Product Manager with Amazon Web Services building Amazon Redshift – the world’s most popular, highest performance and most scalable cloud data warehouse. Neeraja earned a Bachelor of Technology in Electronics and Communication Engineering from the National Institute of Technology in India and various business program certifications from the University of Washington, MIT Sloan School of Management and Stanford University.

Amazon Redshift update – ra3.4xlarge instances

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-redshift-update-ra3-4xlarge-instances/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built their workloads using it. We are always listening to your feedback and, in December last year, we announced our third-generation RA3 node type, which provides the ability to scale compute and storage separately. Previous-generation DS2 and DC2 nodes had a fixed amount of storage and required adding more nodes to your cluster to increase storage capacity. The new RA3 nodes let you determine how much compute capacity you need to support your workload and then scale the amount of storage based on your needs. The first member of the RA3 family was the ra3.16xlarge, which many customers told us was fantastic, but more capacity than their workloads needed.

Today we are adding a new smaller member to the RA3 family: the ra3.4xlarge.

The RA3 node type is based on AWS Nitro and includes support for Redshift managed storage. Redshift managed storage automatically manages data placement across tiers of storage and caches the hottest data in high-performance SSD storage while automatically offloading colder data to Amazon Simple Storage Service (S3). Redshift managed storage uses advanced techniques such as block temperature, data block age, and workload patterns to optimize performance.

RA3 nodes with managed storage are a great fit for analytics workloads that require massive storage capacity, such as operational analytics, where the subset of data that is most important evolves constantly over time. In the past, fixed storage limits created pressure to offload or archive old data to other storage, which made it difficult to maintain the operational analytics dataset and to query the larger historical dataset when needed.

The new ra3.4xlarge node provides 12 vCPUs, 96 GiB of RAM, and addresses up to 64 TB of managed storage. A cluster can contain up to 32 of these instances, for a total storage of 2048 TB (that’s 2 petabytes!).

The differences between ra3.16xlarge and ra3.4xlarge nodes are summarized in the table below.

Instance | vCPU | Memory | Addressable storage | I/O | Price (US East (N. Virginia))
ra3.4xlarge | 12 | 96 GiB | 64 TB RMS | 2 GB/sec | $3.26 per hour
ra3.16xlarge | 48 | 384 GiB | 64 TB RMS | 8 GB/sec | $13.04 per hour

To create a new cluster, I use the Amazon Redshift console or the AWS Command Line Interface (CLI). In the console, I choose Create cluster and select ra3.4xlarge instances.

If you have a DS2 or DC2 instance-based cluster, you can create a new RA3 cluster to evaluate the new instance type with managed storage. Use a recent snapshot of your DS2 or DC2 cluster to create a new cluster based on ra3.4xlarge instances, and keep the two clusters running in parallel to evaluate the compute needs of your application.

You can resize your RA3 cluster at any time by using elastic resize to add or remove compute capacity. If elastic resize is not available for your chosen configuration, you can perform a classic resize instead.

RA3 instances are now available in 14 AWS Regions: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Canada (Central), and South America (São Paulo).

Prices vary from one Region to another, starting at $3.26/hr/node in US East (N. Virginia). Check the Amazon Redshift pricing page for details.

— seb

Speeding up Etleap models at AXS with Amazon Redshift materialized views

Post Syndicated from Christian Romming original https://aws.amazon.com/blogs/big-data/speeding-up-etleap-models-at-axs-with-amazon-redshift-materialized-views/

The materialized views feature in Amazon Redshift is now generally available and has been benefiting customers and partners in preview since December 2019. One customer, AXS, is a leading ticketing, data, and marketing solutions provider for live entertainment venues in the US, UK, Europe, and Japan. Etleap, an Amazon Redshift partner, is an extract, transform, load, and transform (ETLT) service built for AWS. AXS uses Etleap to ingest data into Amazon Redshift from a variety of sources, including file servers, Amazon S3, relational databases, and applications. These ingestion pipelines parse, structure, and load data into Amazon Redshift tables with appropriate column types and sort and distribution keys.

Improving dashboard performance with Etleap models

To analyze data, AXS typically runs queries against large tables that originate from multiple sources. One of the ways that AXS uses Amazon Redshift is to power interactive dashboards. To achieve fast dashboard load times, AXS pre-computes partial answers to the queries the dashboards use. These partial answers are orders of magnitude smaller, in terms of the number of rows, than the tables on which they are based. By querying Amazon Redshift tables that hold the pre-computed partial answers, dashboards load much faster than they would if they queried the base tables directly.

Etleap supports creating and managing such pre-computations through a feature called models. A model consists of a SELECT query and triggers for when it should be updated. An example of a trigger is a change to a base table, that is, a table used by the SELECT statement that defines the model. This way, the model can remain consistent with its base tables.

The following screenshot shows an Etleap model with two base table dependencies.

Etleap represents their models as tables in Amazon Redshift. To create the model table, Etleap wraps the SELECT statement in a CREATE TABLE AS (CTAS) query. When an update is triggered, for example, due to base table inserts, updates, or deletes, Etleap recomputes the model table through the following code:

CREATE TABLE model_temporary AS SELECT …
DROP TABLE model;
ALTER TABLE model_temporary RENAME TO model;

Analyzing CTAS performance as data grows

AXS manages a large number of Etleap models. For one particular model, the CTAS query takes over 6 minutes, on average. This query performs an aggregation on a join of three different tables, including an event table that is constantly ingesting new data and contains over a billion rows. The following graph shows that the CTAS query time increases as the event table increases in number of rows.

There are two key problems with the query taking longer:

  • There’s a longer delay before the updated model is available to analysts
  • The model update consumes more Amazon Redshift cluster resources

To address this, AXS would have to resort to workarounds that are either inconvenient or costly, such as archiving older data from the event table or expanding the Amazon Redshift cluster to increase available resources.

Comparing CTAS to materialized views

Etleap decided to run an experiment to verify that Amazon Redshift’s materialized views feature is an improvement over the CTAS approach for this AXS model. First, they built the materialized view by wrapping the SELECT statement in a CREATE MATERIALIZED VIEW AS query. For updates, instead of recreating the materialized view every time that data in a base table changes, a REFRESH MATERIALIZED VIEW query is sufficient. The expectation was that using materialized views would be significantly faster than the CTAS-based procedure. The following graph compares query times of CTAS to materialized view refresh.

Running REFRESH MATERIALIZED VIEW was 7.9 times faster than the CTAS approach—it took 49 seconds instead of 371 seconds on average at the current scale. Additionally, the update time was roughly proportional to the number of rows that were added to the base table since the last update, rather than the total size of the base table. In this use case, this number is 3.8 million, which corresponds to the approximate number of events ingested per day.

This is great news. The solution solves the previous problems because the delay caused by the model update stays constant as new data comes in, and so do the resources that Amazon Redshift consumes (assuming the growth of the base table is constant). In other words, using materialized views eliminates the need for workarounds, such as archiving or cluster expansion, as the dataset grows. It also simplifies the refresh procedure for model updates by reducing the number of SQL statements from three (CREATE, DROP, and RENAME) to one (REFRESH).
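
As a sketch of the difference (the model name, base tables, and columns below are hypothetical, not AXS’s actual model definition), the three-statement rebuild collapses to a one-time CREATE plus a single statement per update:

-- One-time setup: define the model as a materialized view instead of a CTAS table
CREATE MATERIALIZED VIEW model AS
SELECT e.event_date, e.venue_id, COUNT(*) AS ticket_count, SUM(e.amount) AS revenue
FROM events e
JOIN venues v ON v.id = e.venue_id
GROUP BY e.event_date, e.venue_id;

-- On each trigger, bring the model up to date; Amazon Redshift applies only base table deltas when it can
REFRESH MATERIALIZED VIEW model;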

Achieving fast refresh performance with materialized views

Amazon Redshift can refresh a materialized view efficiently and incrementally. It keeps track of the last transaction in the base tables up to which the materialized view was previously refreshed. During subsequent refreshes, Amazon Redshift processes only the newly inserted, updated, or deleted tuples in the base tables, referred to as a delta, to bring the materialized view up-to-date with its base tables. In other words, Amazon Redshift can incrementally maintain the materialized view by reading only base table deltas, which leads to faster refresh times.

For AXS, Amazon Redshift analyzed their materialized view definitions, which join multiple tables and apply filters and aggregates, to figure out how to incrementally maintain their specific materialized view. Each time AXS refreshes the materialized view, Amazon Redshift quickly determines if a refresh is needed, and if so, incrementally maintains the materialized view. As records are ingested into the base table, the materialized view refresh times remain much faster and grow very slowly, because each refresh reads a delta that is small and roughly the same size as the other deltas. In comparison, the refresh times using CTAS are much slower because each refresh reads all the base tables. Moreover, the refresh times using CTAS grow much faster because the amount of data that each refresh reads grows with the ingest rate.

You are in full control of when to refresh your materialized views. For example, AXS refreshes their materialized views based on triggers defined in Etleap. As a result, transactions that are run on base tables do not incur additional cost to maintain dependent materialized views. Decoupling the base tables’ updates from the materialized view’s refresh gives AXS an easy way to insulate their dashboard users and offers them a well-defined snapshot to query, while ingesting new data into base tables. When AXS vets the next batch of base table data via their ETL pipelines, they can refresh their materialized views to offer the next snapshot of dashboard results.
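
Because refreshes are under your control, it can also help to check whether a materialized view has fallen behind its base tables before triggering one; the following is a hedged sketch using the materialized-view system table (the table name and its columns are recalled from the Amazon Redshift documentation, so verify them before relying on this):

-- List materialized views and whether they are stale relative to their base tables
SELECT *
FROM stv_mv_info;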

In addition to efficiently maintaining their materialized views, AXS also benefits from the simplicity of Amazon Redshift storing each materialized view as a plain table. Queries on the materialized view perform with the same world-class speed that Amazon Redshift runs any query. You can organize a materialized view like other tables, which means that you can exploit distribution key and sort columns to further improve query performance. Finally, when you need to process many queries at peak times, Amazon Redshift’s concurrency scaling kicks in automatically to elastically scale query processing capacity.

Conclusion

Now that the materialized views feature is generally available, Etleap gives you the option of using materialized views rather than tables when creating models. You can use models more actively as part of your ETLT strategies, and also choose more frequent update schedules for your models, due to the performance benefits of incremental refreshes.

For more information about Amazon Redshift materialized views, see Materialize your Amazon Redshift Views to Speed Up Query Execution and Creating Materialized Views in Amazon Redshift.

 


About the Author

Christian Romming is the founder and CEO of Etleap.  Etleap is a managed ETL solution for AWS that doesn’t require extensive engineering work to set up, maintain, and scale.

 

 

 

 

Prasad Varakur is a Database, Big Data & Distributed Systems enthusiast, and Product Manager at Amazon Web Services. Prior to this, he developed database and storage engines at SAP/Sybase, Couchbase, Huawei, Novell, EMC, and Veritas. He holds 11 patents in database systems and distributed computing, and his thesis contributed foundational work on parametric query optimization. He holds a Master’s degree in Computer Science from IIT Kanpur.

 

Vuk Ercegovac is a principal engineer for Redshift at AWS.

 

 

 

Halodoc: Building the Future of Tele-Health One Microservice at a Time

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/halodoc-building-the-future-of-tele-health-one-microservice-at-a-time/

Halodoc, a Jakarta-based healthtech platform, uses tele-health and artificial intelligence to connect patients, doctors, and pharmacies. Join builder Adrian De Luca for this special edition of This is My Architecture as he dives deep into the solutions architecture of this Indonesian healthtech platform that provides healthcare services in one of the most challenging traffic environments in the world.

Explore how the company evolved its monolithic backend into decoupled microservices with Amazon EC2 and Amazon Simple Queue Service (SQS), adopted serverless to cost effectively support new user functionality with AWS Lambda, and manages the high volume and velocity of data with Amazon DynamoDB, Amazon Relational Database Service (RDS), and Amazon Redshift.

For more content like this, subscribe to our YouTube channels This is My Architecture, This is My Code, and This is My Model, or visit the This is My Architecture AWS website, which has search functionality and the ability to filter by industry, language, and service.

Materialize your Amazon Redshift Views to Speed Up Query Execution

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/materialize-your-amazon-redshift-views-to-speed-up-query-execution/

At AWS, we take pride in building state-of-the-art virtualization technologies to simplify the management of and access to cloud services such as networks, computing resources, and object storage.

In a relational database management system (RDBMS), a view is virtualization applied to tables: it is a virtual table representing the result of a database query. Views are frequently used when designing a schema to present a subset of the data, to present summarized data (such as aggregated or transformed data), or to simplify data access across multiple tables. When using a data warehouse such as Amazon Redshift, a view simplifies access to aggregated data from multiple tables for business intelligence (BI) tools such as Amazon QuickSight or Tableau.

Views provide ease of use and flexibility, but they do not speed up data access. The database system must evaluate the underlying query representing the view each time your application accesses the view. When performance is key, data engineers use CREATE TABLE AS (CTAS) as an alternative. A CTAS is a table defined by a query. The query is executed at table creation time and your applications can use it like a normal table, with the downside that the CTAS data set is not refreshed when the underlying data is updated. Furthermore, the CTAS definition is not stored in the database system. It is not possible to know whether a table was created by a CTAS or not, making it difficult to track which CTAS needs to be refreshed and which is current.

Today, we are introducing materialized views for Amazon Redshift. A materialized view (MV) is a database object containing the data of a query. A materialized view is like a cache for your view. Instead of building and computing the data set at run-time, the materialized view pre-computes, stores and optimizes data access at the time you create it. Data are ready and available to your queries just like regular table data.

Using materialized views in your analytics queries can speed up the query execution time by orders of magnitude because the query defining the materialized view is already executed and the data is already available to the database system.

Materialized views are especially useful for queries that are predictable and repeated over and over. Instead of performing resource-intensive queries on large tables, applications can query the pre-computed data stored in the materialized view.

When the data in the base tables changes, you refresh the materialized view by issuing the Amazon Redshift SQL statement REFRESH MATERIALIZED VIEW. After issuing a refresh statement, your materialized view contains the same data as would have been returned by a regular view. Refreshes can be incremental or full (recompute). When possible, Redshift incrementally refreshes data that changed in the base tables since the materialized view was last refreshed.

Let’s see how it works. I create a sample schema to store sales information: each sales transaction and details about the store where the sale took place.
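
The post doesn’t show the DDL for this schema, but based on the columns referenced later (the store’s id and city, and the sale’s id, item, store_id, customer_id, and amount), it might look roughly like this:

-- Stores: one row per physical store
CREATE TABLE store (
    id   INT,
    city VARCHAR(64)
);

-- Sales: one row per sales transaction, referencing the store
CREATE TABLE sales (
    id          INT,
    item        VARCHAR(128),
    store_id    INT,
    customer_id INT,
    amount      DECIMAL(10,2)
);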

To view the total amount of sales per city, I create a materialized view with the CREATE MATERIALIZED VIEW SQL statement. I connect to the Amazon Redshift console, select the Query Editor, and type the following statement to create a materialized view (city_sales) joining records from two tables and aggregating sales amount (sum(sales.amount)) per city (group by city):

CREATE MATERIALIZED VIEW city_sales AS (
  SELECT st.city, SUM(sa.amount) as total_sales
  FROM sales sa, store st
  WHERE sa.store_id = st.id
  GROUP BY st.city
);

The resulting schema is below:

Now I can query the materialized view just like a regular view or table and issue statements like “SELECT city, total_sales FROM city_sales” to get the results below. The join between the two tables and the aggregate (sum and group by) are already computed, resulting in significantly less data to scan.

When the data in the underlying base tables changes, the materialized view does not automatically reflect those changes. The data stored in the materialized view can be refreshed on demand with the latest changes from the base tables using the REFRESH MATERIALIZED VIEW SQL command. Let’s see a practical example:

-- let's add a row in the sales base table
INSERT INTO sales (id, item, store_id, customer_id, amount)
VALUES(8, 'Gaming PC Super ProXXL', 1, 1, 3000);

SELECT city, total_sales FROM city_sales WHERE city = 'Paris'

city |total_sales|
-----|-----------|
Paris|        690|

-- the new sale is not taken into account yet

-- let's refresh the materialized view
REFRESH MATERIALIZED VIEW city_sales;

SELECT city, total_sales FROM city_sales WHERE city = 'Paris'

city |total_sales|
-----|-----------|
Paris|       3690|

-- now the view has the latest sales data

The full code for this very simple demo is available as a gist.

You can start to use materialized views today in all AWS Regions.

There is nothing to change in your existing clusters to start using materialized views; you can create them today at no additional cost.

Happy building!

Lower your costs with the new pause and resume actions on Amazon Redshift

Post Syndicated from Sain Das original https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/

Today’s analytics workloads typically require a data warehouse to be available 24 hours a day, 7 days a week. However, there may be times when you need an Amazon Redshift cluster only for a short duration at frequent (or infrequent) intervals. For example, you may run a periodic ETL job, or use a cluster for testing and development and not use it during off-hours or weekends. In these cases, you may want an easy way to keep the data warehouse up and running only part of the time. Previously, you could accomplish this by making a backup, terminating the cluster, and restoring the cluster from the snapshot. The pause and resume actions on Amazon Redshift are a much simpler way to suspend billing, and are designed for use when your Amazon Redshift cluster is out of operation for hours at a time, especially on a regularly scheduled basis.

Pausing a cluster suspends compute and retains the underlying data structures and data so you can resume the cluster at a later point in time. You can configure this through the Amazon Redshift console or the AWS CLI.

When the cluster is paused, the data warehouse’s storage incurs charges. On-demand compute billing is suspended and resumed on a per-second basis. Paused clusters still appear as an entry in the console. You can also automate the pause and resume actions by using a schedule that matches your operational needs.

Using the actions via the Amazon Redshift console

To use the pause and resume actions on the Amazon Redshift console, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your desired cluster.
  3. Choose Actions.
  4. Choose Pause.
  5. To determine when to pause the cluster, choose from the following three options:
    • To pause the cluster immediately, select Pause now.
    • To pause the cluster at a later point, select Pause later.
    • To pause and resume the cluster according to a set schedule, select Pause and resume on schedule.
  6. For this walkthrough, select Pause now.
  7. Choose Pause now.

The cluster is now in Modifying status. It can take up to a few minutes for the cluster to change to a Paused state, but the cost accrual for compute resources is suspended immediately.

The following screenshot shows a view of the cluster status.

Amazon Redshift processes any outstanding queries before it pauses the cluster. When the cluster is paused, you can still view it on the Amazon Redshift console, and the Resume action is available.

To resume the cluster, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your desired cluster.
  3. Choose Actions.
  4. Choose Resume.
  5. Choose when to resume the cluster. The options are the same as those for the pause action.
  6. For this walkthrough, select Resume now.
  7. Choose Resume now.

The cluster moves to Modifying status. Depending upon the size of the cluster, it can take several minutes to resume a cluster before queries can be processed. Billing only resumes when the cluster is available.

The following screenshot shows the view of the cluster status.

Using the actions via CLI

The following two commands pause and resume the cluster:

  • pause-cluster
  • resume-cluster

To pause a given cluster, enter the following code:

aws redshift pause-cluster --cluster-identifier <insert cluster identifier here>

To resume a paused cluster, enter the following code:

aws redshift resume-cluster --cluster-identifier <insert cluster identifier here>

Scheduling pause and resume actions

You can schedule pause and resume actions for specific times of the day and week. For example, this walkthrough pauses a cluster at 8:00 p.m. on Friday and resumes it at 7:00 a.m. on Monday. You can configure this via the Amazon Redshift console or APIs.

Scheduling via the Amazon Redshift console

To schedule pause and resume actions on the Amazon Redshift console, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your desired cluster.
  3. Choose Actions.
  4. Choose Pause.
  5. Select Pause and resume on schedule.
  6. For Schedule name, enter a name for this schedule.
  7. Optionally, for Starts on and Ends on, enter the dates and times to start and end the schedule.
  8. For Pause every and Resume every, choose the time and day to pause and resume.
  9. Choose Schedule recurring pause and resume.

You can review existing pause and resume schedules on the Schedules tab. See the following screenshot.

Scheduling via CLI

The following CLI commands allow you to create, modify, and delete scheduled pause and resume tasks.

To create a scheduled action to occur one time, enter the following code:

aws redshift create-scheduled-action --scheduled-action-name test-resume --schedule "at(2020-02-21T02:00:00)" --target-action "{\"ResumeCluster\":{\"ClusterIdentifier\":\"redshift-cluster-1\"}}" --iam-role arn:aws:iam::<Account ID>:role/<Redshift Role>

To create a recurring scheduled action, enter the following code:

aws redshift create-scheduled-action --scheduled-action-name "scheduled-pause-repetitive" --target-action "{\"PauseCluster\":{\"ClusterIdentifier\":\"redshift-cluster-1\"}}" --schedule "cron(30 20 * * ? *)" --iam-role "arn:aws:iam::<Account ID>:role/<Redshift Role>"

The preceding code example pauses a cluster daily at 8:30 p.m. UTC (the cron expression for scheduled actions is evaluated in UTC).

To modify an existing scheduled action, enter the following code:

aws redshift modify-scheduled-action --scheduled-action-name "scheduled-pause-repetitive" --schedule "cron(30 * * * ? *)"

The preceding code example modifies the scheduled-pause-repetitive schedule to run every hour at 30 minutes past the hour.

To delete a scheduled action, enter the following code:

aws redshift delete-scheduled-action --scheduled-action-name "scheduled-pause-repetitive"

Summary

The pause and resume actions on Amazon Redshift allow you to easily pause and resume clusters that may not need to be in operation at all times. You can create a schedule to initiate the pause and resume actions at specific times, or you can manually initiate a pause and later a resume. Flexible on-demand pricing and per-second billing give you greater control over the costs of your Amazon Redshift compute clusters while keeping your data simple to manage. You can run your data warehouse at the lowest possible cost without having to purchase a fixed amount of resources up front.

 


About the Author

Sain Das is a data warehouse specialist solutions architect with AWS.

 

 

 

Integrate Power BI with Amazon Redshift for insights and analytics

Post Syndicated from Vu Le original https://aws.amazon.com/blogs/big-data/integrate-power-bi-with-amazon-redshift-for-insights-and-analytics/

Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Microsoft Power BI is a business analytics service that delivers insights to enable fast, informed decisions. With Power BI, you can perform ad hoc query analysis, visualize data, and create user-friendly dashboards.

This post demonstrates how to integrate Power BI with Amazon Redshift to deliver powerful visualization and insights.

Solution architecture

This post provides code artifacts to help you create a big data environment on AWS from scratch. By using the AWS CloudFormation template and code examples provided, you can automatically provision a new Amazon Redshift data warehouse in under an hour without requiring deep technical expertise.

The post also demonstrates how to configure integration for the most common deployment scenarios. For example, how do you connect Power BI to AWS services using ODBC/JDBC drivers? How do you connect to AWS services that are deployed behind a private network? What credentials do you use to connect to AWS services? This post addresses and answers these questions in the subsequent sections.

The following diagram shows the solution architecture deployed to AWS. All components inside the AWS Cloud boundary are deployed automatically using an AWS CloudFormation template to allow you to reproduce this solution quickly using your AWS account.

When deployed, the solution contains the following components:

  • Networking infrastructure that includes VPC, public and private subnets, security groups, internet gateway, NAT Gateway, and route tables
  • Linux EC2 instance provisioned in a public subnet to generate sample data
  • Windows Server EC2 instance to host Power BI Desktop
  • Windows Server EC2 instance to act as an on-premises data gateway that handles the communication between Power BI and Amazon Redshift
  • An Amazon Redshift cluster deployed in a private subnet
  • IAM user and roles with permissions to access Amazon S3 and Amazon Redshift

Prerequisites

To complete the steps in this post, you need the following prerequisites:

  • AWS account – You need an account to follow the instructions and test it with minimal cost.
    • If you are creating your account for the first time, choose the us-east-1 region.
    • Create a key pair for the selected Region. For more information, see Amazon EC2 Key Pairs.
  • Power BI tenant – You can test all the described Power BI functionalities with minimum to no cost with the following:
    • Power BI Pro license
    • Access to your Power BI admin portal

Creating and configuring your development environment

Before you can create Power BI visualizations in AWS, you need to load a fully working development environment with sample data. This section contains instructions to create and configure that environment from scratch. After completing all the deployment steps in this section, you have an AWS infrastructure with all the integration hooks between AWS and Power BI fully configured.

To create this environment, execute the following high-level tasks:

  1. Run an AWS CloudFormation template to provision the initial development environment.
  2. SSH into an Amazon EC2 Linux instance to generate a sample dataset.
  3. Configure the data warehouse by creating and loading data into Amazon Redshift tables.
  4. Install and configure the Power BI Desktop.
  5. Configure a data gateway in Power BI.
  6. Install the Power BI mobile app so you can consume the visuals from your phone.

To make the deployment quick and easy, this post automates much of the deployment steps through the use of an AWS CloudFormation template. For the tasks that could not be automated, the post provides detailed instructions along with actual code examples. You can find the template and relevant code in the GitHub repo. Remember to clone this GitHub repo to a local working folder because you need to reference these artifacts as you walk through the deployment steps in this section.

Provisioning a development environment

In this step, you run an AWS CloudFormation template to provision the initial infrastructure. Complete the following steps:

  1. On the AWS CloudFormation console, under Region, choose US East (N. Virginia). You can choose other Regions, but you need to change the EC2 AMI IDs in the AWS CloudFormation template parameters to match the chosen Region.
  2. Choose Create stack.
  3. Under Specify template, select Upload a template file.
  4. Click on Choose file.
  5. Select ProvisionDevEnv.yaml file from your working folder.
  6. Choose Next.
  7. Under Specify stack details, provide the following information:
    • Stack name – Enter a name for your CloudFormation stack.
    • AmazonLinuxImageID – Use the default value if you are running in US East (N. Virginia). Otherwise, provide the Amazon Linux AMI ID for your chosen Region. An easy way to find the AMI ID is to go to the Amazon EC2 console and launch a new instance. The Amazon Linux AMI ID should be listed on the first page. Use Amazon Linux, not Amazon Linux 2.
    • EC2KeyPair – Enter the name of your key pair for your Region.
    • MyIPAddress – Enter your public IP address in CIDR format, for example, 205.251.192.0/32. To find your public IP address, search online for what is my ip address.
    • RedshiftMasterUserPassword – Enter a password for your Amazon Redshift cluster.
    • RedshiftMasterUsername – Use the default username or provide your own.
    • WindowsServerImageID – Use the default value if you are running in US East (N. Virginia). Otherwise, provide the Microsoft Windows Server 2019 Base AMI ID for your Region.
    • For all other options, keep the default values.
  8. Choose Next.
  9. On the Configure stack options page, choose Next.
  10. On the Review page, select the “I acknowledge…” check box.
  11. Choose Create stack.

Wait for the CloudFormation stack creation to complete, which can take approximately 10 minutes. When the CloudFormation stack is complete, choose Stack Outputs. In this section, you can find all the keys and values of the resources that you need to reference later in the deployment process.

Generating a sample dataset

Now that you have provisioned the initial infrastructure, let’s generate the sample data to be used in the Power BI visualizations. For more information about how to generate the sample dataset, see GenerateSampleDataset.txt in your working folder.

Replace the placeholders in this file with the following corresponding values from the AWS CloudFormation Stack Outputs section:

  • [KeyPairFileName] – Replace with the value of the EC2KeyPair template input parameter
  • [EC2IPAddress] – Replace with the public IP address of the EC2 instance listed under the key DataGeneratorPublicIP
  • [S3BucketName] – Replace with the actual S3 bucket name

After you generate the dataset and copy it to your S3 bucket, terminate the EC2 instance with the name Data Generator. You don’t need it anymore.

Configuring an Amazon Redshift cluster

In this step, you run a series of SQL commands to create tables in your Amazon Redshift cluster. After that, you load the data you generated earlier from Amazon S3 into the Amazon Redshift tables. Complete the following steps:

  1. On the Amazon Redshift console, choose Query Editor.
  2. Connect to the cluster by providing the database name, database user, and password that were provided as input parameters to the CloudFormation stack.
  3. From the schema drop-down menu, choose public.
  4. Open the file CreateRedshiftTables.txt in your working folder and run each SQL statement, one at a time, in the Query Editor.
  5. Run the COPY commands in the file CopyFromS3ToRedshift.txt, one at a time, to load sample data into the tables you just created (an illustrative example follows this list). Replace the following placeholders with the actual values from the AWS CloudFormation Stack Outputs section:
    • [S3BucketName] – Replace with actual S3 bucket name.
    • [RedshiftAccessRoleName] – Replace with the IAM role name.
  6. Confirm that you see the following tables listed under the public schema. Do a select count(*) on each table to see how many rows you have for the following tables:
    • customer: 1,500,000
    • lineitem: 59,986,052
    • nation: 25
    • orders: 15,000,000
    • part: 2,000,000
    • partsupp: 8,000,000
    • region: 5
    • supplier: 100,000
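
For reference, a representative COPY statement and row-count check might look like the following; the S3 prefix, delimiter, and other options here are illustrative assumptions, not the exact contents of CopyFromS3ToRedshift.txt:

-- Load one table from S3; replace the placeholders with your stack output values
COPY customer
FROM 's3://[S3BucketName]/customer/'
IAM_ROLE 'arn:aws:iam::<Account ID>:role/[RedshiftAccessRoleName]'
DELIMITER '|';

-- Verify the load against the expected counts above (1,500,000 rows for customer)
SELECT COUNT(*) FROM customer;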

Installing and configuring Power BI Desktop

In this step, you connect using Remote Desktop Protocol (RDP) into the Windows Server jump box and install Power BI Desktop, which you use later to create visualizations from the data that you pulled from Amazon Redshift. For more information, see Connecting to Your Windows Instance. Complete the following steps:

  • From your local machine, RDP to the Windows Server Jump Box. You can get the public IP address of the jump box from the CloudFormation Stack Outputs tab or from your EC2 Console.
  • Remember to use the EC2KeyPair you specified in the CloudFormation template to decrypt the administrator password by using the EC2 console:
    • Open the Amazon EC2 console, and then choose Instances.
    • Select the instance of your choice, choose Actions, and then choose Get Windows Password.
      Note: It can take a few minutes for this option to be available after first launching a new instance.
    • Click on Choose File, select your key pair file, and then choose Open. Alternatively, you can paste the contents of your key pair into the text box.
    • Choose Decrypt Password.
  • The first thing you’ll want to do after you’ve logged into the jump box is to turn off IE Enhanced Security Configuration. Otherwise, you will have problems logging into Power BI Service later.
    • Inside the Jump Box, go to Start > Server Manager > Local Server
    • Click the On link to the right of IE Enhanced Security Configuration
    • Click Off under the Administrators section and Click OK
  • Download and install Power BI Desktop 64-bit on the Jump Box. Remember to download just PBIDesktopSetup_x64.exe, which is the 64-bit version of Power BI Desktop. After you download the executable, run it to install Power BI Desktop. Accept all the default settings.

Installing and configuring a Microsoft on-premises data gateway

In this step, you install a Microsoft on-premises data gateway to enable Power BI to communicate with data stores that are not accessible from the public internet, such as your Amazon Redshift cluster, which you deployed inside a private subnet. Install the Microsoft on-premises data gateway on the data gateway instance. You can perform the following tests at minimum or no cost. Complete the following steps:

  • From the Windows Server Jump Box, you will RDP to the data gateway using its private IP address, which you can get from the EC2 Console.
  • Use the EC2KeyPair to decrypt the administrator password for the data gateway following the same instructions as outlined in the previous steps.
  • Turn off IE Enhanced Security Configuration.
    • Inside the data gateway, go to Start > Server Manager > Local Server
    • Click the On link to the right of IE Enhanced Security Configuration
    • Click Off under the Administrators section
    • Click OK
  • Download and install Microsoft On-premises Data Gateway
    • On the screen that asks you to choose the type of gateway you need, choose On-premises data gateway (recommended)
    • Accept the default values and click Install
    • When the installer asks you to sign in, enter the email address associated with the admin account for the Power BI Pro tenant.
    • Click Sign in
    • After sign-in, you will be asked to register the gateway. If you are asked to register a new gateway or migrate, restore, or take over an existing gateway, choose Register a new gateway.
    • Give your gateway a name and provide a recovery key
    • Click Configure
    • You should see a green checkmark indicating the gateway is online and ready to be used.

Configuring the data gateway in Power BI

In this step, you log in to your Power BI tenant as an administrator to configure how Power BI connects to the AWS data sources. Complete the following steps:

  1. On your browser (from any machine), launch https://powerbi.com.
  2. Log in with an account that has admin privileges. Use the same account that you used for the data gateway.
  3. Choose the gear icon.
  4. Choose Manage gateways.
  5. Choose the gateway you installed earlier.
  6. Select the check box Allow user’s custom data connectors to refresh through this gateway cluster.
  7. Choose Apply.
  8. Choose ADD DATA SOURCE.
  9. Create a data source for Amazon Redshift with the following parameters:
    • Data Source Name – Amazon Redshift Private VPC.
    • Data Source Type – Amazon Redshift.
    • Server – Replace with the value of the [RedshiftClusterEndpoint] key from the AWS CloudFormation Stack Outputs
    • Database – Enter the name of the Amazon Redshift database, which you provided earlier as part of the CloudFormation stack parameters. The default is dev.
    • Username – Enter your database username, which you provided earlier as part of the CloudFormation stack parameters.
    • Password – Enter your database password, which you provided earlier as part of the CloudFormation stack parameters.
  10. Choose Add.

Installing the Power BI mobile app

In this step, you install the Power BI app on your mobile phone so you can interact with Power BI reports and dashboards later from your phone. Complete the following steps:

  1. Go to either the Apple App Store or Google Play Store and search for Microsoft Power BI.
  2. Install the app.
  3. Sign in to Power BI with the same account you’ve been using.

Visualizing Redshift Data using Power BI Desktop

In this section, you connect Power BI Desktop to Amazon Redshift, create a report, publish the report to the Power BI service, and finally consume the report from your phone.

  1. From Jump Box, launch Power BI Desktop
  2. If you haven’t already, sign in to the Service with your Power BI credentials
  3. Select Home > Get Data > More > Database > Amazon Redshift
  4. Click Connect
  5. On the next screen, provide the following values:
    • Server – copy the value of the key [RedshiftClusterEndpoint], which is found in the CloudFormation Stack Outputs tab
    • Database – dev (or whatever name you gave for the database)
    • Data Connectivity Mode – DirectQuery
  6. If this is the first time you’re connecting to this cluster, then you’ll need to type the Redshift credentials you provided to the CloudFormation Stack earlier. Type in your Redshift username and password in the popup window and click on Connect.
  7. Select the orders, lineitem, and part tables from the Navigator window and then click Load.
  8. Once the data has finished loading, you will need to define table relationships in the in-memory model.
    • In Power BI Desktop, change to the Model view by clicking on the “table relationship” icon on the left.
    • Create relationships between the tables by dragging and dropping the following columns on each other.
      • orders.o_orderkey = lineitem.l_orderkey
      • part.p_partkey = lineitem.l_partkey
  9. Now, you are ready to create some charts. Change to Report view and add the following visualizations to the report.
    • Date Slicer
      • Visualization type – Slicer
      • Field – orders.o_orderdate
    • Sales by Date by Manufacturer
      • Visualization type – Line Chart
      • Axis – orders.o_orderdate
      • Legend – part.p_mfgr
      • Values – lineitem.l_extendedprice
    • Order Count
      • Visualization type – Card
      • Fields – Count of orders.o_orderkey
    • Line Item Count
      • Visualization type – Card
      • Fields – Count of lineitem.l_linenumber
  10. With a little formatting, your report shows Sales by Date and Manufacturer and should look similar to the following:

  11. Save the report and give it a name.

Publishing the report to Power BI

After you create a report in Power BI Desktop, you need to publish the dataset and report to Power BI to share it with others or consume it from a mobile device. Complete the following steps:

  1. In Power BI Desktop, choose Publish
  2. Choose My workspace.
  3. Choose Select.

It should not take long for the publishing to complete because no data was imported.

Configuring the gateway connection

You now need to tell Power BI to map the data source inside this report to the Amazon Redshift data source that you registered with the data gateway. Complete the following steps:

  1. From any machine, open a browser and launch https://powerbi.com.
  2. Log in to Power BI.
  3. Choose My workspace.
  4. Choose Datasets.
  5. Choose the … icon next to the dataset you just published.
  6. Choose Settings.
  7. Expand the Gateway connection section.
  8. For Use a data gateway, choose On.
  9. Under Maps to, choose the Amazon Redshift data source you created earlier.
  10. Choose Apply.

Consuming the Power BI reports and dashboard

Now that you have published the report to Power BI successfully, you can interact with it on https://powerbi.com.

  1. Launch https://powerbi.com.
  2. Log in to Power BI.
  3. Choose My workspace.
  4. Choose Reports.
  5. Choose the report you just published.

Your report should look similar to the following screenshot.  It shows the same Sales by Date and Manufacturer graph as before.

You can interact with the reports and dashboards on the Power BI mobile app the same way you can on the Power BI website. The following screenshot shows how the Sales by Date and Manufacturer report looks on the Power BI app for iPhone.

 

When you view and interact with the report, whether from the website or your mobile app, you’re always viewing the latest data because you connect directly to Amazon Redshift.

Conclusion

This post showed how to use Power BI to query data in Amazon Redshift to generate reports, visualization, and dashboards, and described a solution architecture that you can deploy in the AWS Cloud. The post also demonstrated how easy it is to set up and connect Power BI to an Amazon Redshift cluster in a VPC without public internet access and how to push Power BI dashboards to mobile devices.

Special acknowledgement goes to AWS colleagues Juan Yu, Sophia Jung and Joe Harris for their valuable comments and suggestions.

If you have any questions or suggestions, leave your feedback in the comment section. If you need any further assistance to optimize your Amazon Redshift implementation, contact your AWS account team or a trusted AWS partner.

About the authors

Vu Le is a Senior Data Architect, Strategic Accounts Team, AWS Professional Services.

Po Hong, PhD, is a Senior Data Architect, Data & Analytics Global Specialty Practice, AWS Professional Services.

Analyze your Amazon S3 spend using AWS Glue and Amazon Redshift

Post Syndicated from Shayon Sanyal original https://aws.amazon.com/blogs/big-data/analyze-your-amazon-s3-spend-using-aws-glue-and-amazon-redshift/

The AWS Cost & Usage Report (CUR) tracks your AWS usage and provides estimated charges associated with that usage. You can configure this report to present the data at hourly or daily intervals, and it is updated at least one time per day until it is finalized at the end of the billing period. The Cost & Usage Report is delivered automatically to an Amazon S3 bucket that you specify, and you can download it from there directly. You can also integrate the report into Amazon Redshift, query it with Amazon Athena, or upload it to Amazon QuickSight. For more information, see Query and Visualize AWS Cost and Usage Data Using Amazon Athena and Amazon QuickSight.

This post presents a solution that uses AWS Glue Data Catalog and Amazon Redshift to analyze S3 usage and spend by combining the AWS CUR, S3 inventory reports, and S3 server access logs.

Prerequisites

Before you begin, complete the following prerequisites:

  • You need an S3 bucket for your S3 inventory and server access log data files. For more information, see Create a Bucket and What is Amazon S3?
  • You must have the appropriate IAM permissions for Amazon Redshift to be able to access the S3 buckets. For this post, attach two non-restrictive managed policies (AmazonS3FullAccess and AWSGlueConsoleFullAccess) to the IAM role, but restrict your access accordingly for your own scenarios.

Amazon S3 inventory

Amazon S3 inventory is one of the tools S3 provides to help manage your storage. You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC), or Apache Parquet output files that list your objects and their corresponding metadata on a daily or weekly basis for a given S3 bucket.

Amazon S3 server access logs

Server access logging provides detailed records for the requests you make to a bucket. Server access logs are useful for many applications, for example in security and access audits. They can also help you learn about your customer base and understand your S3 bill.

AWS Glue

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue consists of a central metadata repository known as the Data Catalog, a crawler to populate the Data Catalog with tables, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage. This post uses AWS Glue to catalog S3 inventory data and server access logs, which makes it available for you to query with Amazon Redshift Spectrum.

Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can use Amazon Redshift to efficiently query and retrieve structured and semi-structured data from files in S3 without having to load the data into Amazon Redshift native tables. You can create Amazon Redshift external tables by defining the structure for files and registering them as tables in the AWS Glue Data Catalog.

Setting up S3 inventory reports for analysis

This post uses the Parquet file format for its inventory reports and delivers the files daily to S3 buckets. You can select both the frequency of delivery and the output file format under Advanced settings, as shown in the following screenshot:

For more information about configuring your S3 inventory, see How Do I Configure Amazon S3 Inventory?

The following diagram shows the data flow for this solution:

The following steps summarize the data flow in the preceding diagram:

  • S3 Inventory Reports are delivered to an S3 bucket that you configure.
  • The AWS Glue crawler then crawls this S3 bucket and populates the metadata in the AWS Glue Data Catalog.
  • The AWS Glue Data Catalog is then accessible through an external schema in Redshift.
  • The S3 Inventory Reports (available in the AWS Glue Data Catalog) and the Cost and Usage Reports (available in another S3 bucket) are now ready to be joined and queried for analysis.

The inventory reports are delivered to an S3 bucket. The following screenshot shows the S3 bucket structure for the S3 inventory reports:

There is a data folder in this bucket. This folder contains the Parquet data you want to analyze. The following screenshot shows the content of the folder.

Because these are daily files, there is one file per day.

Configuring an AWS Glue crawler

You can use an AWS Glue crawler to discover this dataset in your S3 bucket and create the table schemas in the Data Catalog. After you create these tables, you can query them directly from Amazon Redshift.

To configure your crawler to read S3 inventory files from your S3 bucket, complete the following steps:

  1. Choose a crawler name.
  2. Choose S3 as the data store and specify the S3 path up to the data folder.
  3. Choose an IAM role that can read data from S3 – for this post, a role with the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies attached.
  4. Set a frequency schedule for the crawler to run.
  5. Configure the crawler’s output by selecting a database and adding a prefix (if any).

This post uses the database s3spendanalysis.

The following screenshot shows the completed crawler configuration.

Run this crawler to add tables to your AWS Glue Data Catalog. After the crawler completes successfully, go to the Tables section on the AWS Glue console to verify the table details and table metadata, as shown in the following screenshot:

Creating an external schema

Before you can query the S3 inventory reports, you need to create an external schema (and subsequently, external tables) in Amazon Redshift. An Amazon Redshift external schema references an external database in an external data catalog. Because you are using an AWS Glue Data Catalog as your external catalog, after you create an external schema in Amazon Redshift, you can see all the external tables in your Data Catalog in Amazon Redshift. To create the external schema, enter the following code:

create external schema spectrum_schema from data catalog
database 's3spendanalysis'
iam_role 'arn:aws:iam::<AWS_IAM_ROLE>';

Querying the table

On the Amazon Redshift dashboard, under Query editor, you can see the data table. You can also query the svv_external_schemas system table to verify that your external schema has been created successfully. See the following screenshot.

You can now query the S3 inventory reports directly from Amazon Redshift without having to move the data into Amazon Redshift first. The following screenshot shows how to do this using the Query Editor in the Amazon Redshift console:
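
If you prefer to run these checks as SQL rather than through the console, the following is a minimal sketch. The table name data comes from the crawled data folder, and the inventory columns assume the default S3 inventory fields, so adjust both to match your Data Catalog:

-- Confirm the external schema is registered
select schemaname, databasename from svv_external_schemas;

-- Sample the S3 inventory report through Redshift Spectrum
select bucket, key, size, storage_class
from spectrum_schema.data
limit 10;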

Setting up S3 server access logs for analysis

The following diagram shows the data flow for this solution.

The following steps summarize the data flow in the preceding diagram:

  • S3 Server Access Logs are delivered to an S3 bucket that you configure.
  • These server access logs are then directly accessible to be queried from Amazon Redshift (note that we’ll be using CREATE EXTERNAL TABLE in Redshift Spectrum for this purpose, explained below).
  • The S3 Server Access Logs and the Cost and Usage Reports (available in another S3 bucket) are now ready to be joined and queried for analysis.

The S3 server access logs are delivered to an S3 bucket. For more information about setting up server access logging, see Amazon S3 Server Access Logging.

The following screenshot shows the S3 bucket structure for the server access logs.

The server access log files consist of a sequence of new-line delimited log records. Each log record represents one request and consists of space-delimited fields. The following code is an example log record:

b8ad5f5cfd3c09418536b47b157851fb7bea4a00486471093a7d765e35a4f8ef s3spendanalysisblog [23/Sep/2018:22:10:52 +0000] 72.21.196.65 arn:aws:iam::<AWS Account #>:user/shayons D5633DAD1063C5CA REST.GET.LIFECYCLE - "GET /s3spendanalysisblog?lifecycle= HTTP/1.1" 404 NoSuchLifecycleConfiguration 332 - 105 - "-" "S3Console/0.4, aws-internal/3 aws-sdk-java/1.11.408 Linux/4.9.119-0.1.ac.277.71.329.metal1.x86_64 OpenJDK_64-Bit_Server_VM/25.181-b13 java/1.8.0_181" -

Creating an external table

You can define the S3 server access logs as an external table. Because you already have an external schema, create an external table using the following code. This post uses RegEx SerDe to create a table that allows you to correctly parse all the fields present in the S3 server access logs. See the following code:

CREATE EXTERNAL TABLE spectrum_schema.s3accesslogs(
BucketOwner                   varchar(256), 
Bucket                        varchar(256), 
RequestDateTime               varchar(256), 
RemoteIP                      varchar(256), 
Requester                     varchar(256), 
RequestID                     varchar(256), 
Operation                     varchar(256), 
Key                           varchar(256), 
RequestURI_operation          varchar(256),
RequestURI_key                varchar(256),
RequestURI_httpProtoversion   varchar(256),
HTTPstatus                    varchar(256), 
ErrorCode                     varchar(256), 
BytesSent                     varchar(256), 
ObjectSize                    varchar(256), 
TotalTime                     varchar(256), 
TurnAroundTime                varchar(256), 
Referrer                      varchar(256), 
UserAgent                     varchar(256), 
VersionId                     varchar(256))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*) ([^ ]*) ([^ ]*)\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)'
  )
STORED AS TEXTFILE
LOCATION
  's3://s3spendanalysisblog/accesslogs/';

Validating the data

You can validate the external table data in Amazon Redshift. The following screenshot shows how to do this using the Query Editor in the Amazon Redshift console:
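
A quick validation query might look like the following sketch, which counts requests per operation and HTTP status (the column names come from the external table definition above):

select operation, httpstatus, count(*) as request_count
from spectrum_schema.s3accesslogs
group by operation, httpstatus
order by request_count desc;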

You are now ready to analyze the data.

Analyzing the data using Amazon Redshift

In this post, you have a CUR file per day in your S3 bucket. The files themselves are organized in a monthly hierarchy. See the following screenshot.

Each day’s file consists of the following files for CUR data:

  • myCURReport-1.csv.gz – A zipped file of the data itself
  • myCURReport-Manifest.json – A JSON file that contains the metadata for the file
  • myCURReport-RedshiftCommands.sql – Amazon Redshift table creation scripts and a COPY command to create the CUR table from a Redshift manifest file
  • myCURReport-RedshiftManifest.json – The Amazon Redshift manifest file to create the CUR table

Using Amazon Redshift is one of the many ways to carry out this analysis. Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. Amazon Redshift gives you fast querying capabilities over structured data using familiar SQL-based clients and BI tools using standard ODBC and JDBC connections. Queries are distributed and parallelized across multiple physical resources.

You are now ready to run SQL queries with the Amazon Redshift SQL Query Editor. This post also uses the psql client tool, a terminal-based front end for PostgreSQL, to query the data in the cluster.

To query the data, complete the following steps:

  1. Create a custom schema to contain your tables for analysis. See the following code:
    create schema if not exists redshift_schema;

    You should create your table in a schema other than public to control user access to database objects.

  2. Create a CUR table for the latest month in Amazon Redshift using the CUR SQL file in S3. See the following code:
    create table redshift_schema.AWSBilling201910 (
    identity_LineItemId VARCHAR(256),
    identity_TimeInterval VARCHAR(100),
    bill_InvoiceId VARCHAR(100),
    bill_BillingEntity VARCHAR(10),
    bill_BillType VARCHAR(100),
    bill_PayerAccountId VARCHAR(100),
    bill_BillingPeriodStartDate TIMESTAMPTZ,
    bill_BillingPeriodEndDate TIMESTAMPTZ,
    lineItem_UsageAccountId VARCHAR(100),
    lineItem_LineItemType VARCHAR(100),
    lineItem_UsageStartDate TIMESTAMPTZ,
    lineItem_UsageEndDate TIMESTAMPTZ,
    lineItem_ProductCode VARCHAR(100),
    lineItem_UsageType VARCHAR(100),
    lineItem_Operation VARCHAR(100),
    lineItem_AvailabilityZone VARCHAR(100),
    lineItem_ResourceId VARCHAR(256),
    lineItem_UsageAmount DECIMAL(11,2),
    lineItem_NormalizationFactor VARCHAR(10),
    lineItem_NormalizedUsageAmount DECIMAL(11,2),
    lineItem_CurrencyCode VARCHAR(10),
    lineItem_UnblendedRate DECIMAL(11,2),
    lineItem_UnblendedCost DECIMAL(11,2),
    lineItem_BlendedRate DECIMAL(11,2),
    lineItem_BlendedCost DECIMAL(11,2),
    lineItem_LineItemDescription VARCHAR(100),
    lineItem_TaxType VARCHAR(100),
    lineItem_LegalEntity VARCHAR(100),
    product_ProductName VARCHAR(100),
    product_alarmType VARCHAR(100),
    product_automaticLabel VARCHAR(100),
    product_availability VARCHAR(100),
    product_availabilityZone VARCHAR(100),
    product_clockSpeed VARCHAR(100),
    product_currentGeneration VARCHAR(100),
    product_databaseEngine VARCHAR(100),
    product_dedicatedEbsThroughput VARCHAR(100),
    product_deploymentOption VARCHAR(100),
    product_durability VARCHAR(100),
    product_ecu VARCHAR(100),
    product_edition VARCHAR(100),
    product_engineCode VARCHAR(100),
    product_enhancedNetworkingSupported VARCHAR(100),
    product_eventType VARCHAR(100),
    product_feeCode VARCHAR(100),
    product_feeDescription VARCHAR(100),
    product_fromLocation VARCHAR(100),
    product_fromLocationType VARCHAR(100),
    product_gpu VARCHAR(100),
    product_gpuMemory VARCHAR(100),
    product_group VARCHAR(100),
    product_groupDescription VARCHAR(100),
    product_instanceFamily VARCHAR(100),
    product_instanceType VARCHAR(100),
    product_instanceTypeFamily VARCHAR(100),
    product_io VARCHAR(100),
    product_labelingTaskType VARCHAR(100),
    product_licenseModel VARCHAR(100),
    product_location VARCHAR(100),
    product_locationType VARCHAR(100),
    product_maxThroughputvolume VARCHAR(100),
    product_maxVolumeSize VARCHAR(100),
    product_memory VARCHAR(100),
    product_messageDeliveryFrequency VARCHAR(100),
    product_messageDeliveryOrder VARCHAR(100),
    product_minVolumeSize VARCHAR(100),
    product_networkPerformance VARCHAR(100),
    product_normalizationSizeFactor VARCHAR(100),
    product_operation VARCHAR(100),
    product_physicalCpu VARCHAR(100),
    product_physicalGpu VARCHAR(100),
    product_physicalProcessor VARCHAR(100),
    product_processorArchitecture VARCHAR(100),
    product_processorFeatures VARCHAR(100),
    product_productFamily VARCHAR(100),
    product_protocol VARCHAR(100),
    product_queueType VARCHAR(100),
    product_region VARCHAR(100),
    product_servicecode VARCHAR(100),
    product_servicename VARCHAR(100),
    product_sku VARCHAR(100),
    product_storage VARCHAR(100),
    product_storageClass VARCHAR(100),
    product_storageMedia VARCHAR(100),
    product_subscriptionType VARCHAR(100),
    product_toLocation VARCHAR(100),
    product_toLocationType VARCHAR(100),
    product_transferType VARCHAR(100),
    product_usageFamily VARCHAR(100),
    product_usagetype VARCHAR(100),
    product_vcpu VARCHAR(100),
    product_version VARCHAR(100),
    product_volumeType VARCHAR(100),
    product_workforceType VARCHAR(100),
    pricing_RateId VARCHAR(100),
    pricing_publicOnDemandCost DECIMAL(11,2),
    pricing_publicOnDemandRate DECIMAL(11,2),
    pricing_term VARCHAR(100),
    pricing_unit VARCHAR(100),
    reservation_AmortizedUpfrontCostForUsage DECIMAL(11,2),
    reservation_AmortizedUpfrontFeeForBillingPeriod DECIMAL(11,2),
    reservation_EffectiveCost DECIMAL(11,2),
    reservation_EndTime TIMESTAMPTZ,
    reservation_ModificationStatus VARCHAR(100),
    reservation_NormalizedUnitsPerReservation BIGINT,
    reservation_RecurringFeeForUsage DECIMAL(11,2),
    reservation_StartTime TIMESTAMPTZ,
    reservation_SubscriptionId VARCHAR(100),
    reservation_TotalReservedNormalizedUnits BIGINT,
    reservation_TotalReservedUnits BIGINT,
    reservation_UnitsPerReservation BIGINT,
    reservation_UnusedAmortizedUpfrontFeeForBillingPeriod DECIMAL(11,2),
    reservation_UnusedNormalizedUnitQuantity BIGINT,
    reservation_UnusedQuantity BIGINT,
    reservation_UnusedRecurringFee DECIMAL(11,2),
    reservation_UpfrontValue BIGINT
    );

  3. Load the data into Amazon Redshift for the latest month, using the provided CUR Manifest file. See the following code:
    copy AWSBilling201910 from 's3://ss-cur//myCURReport/20191001-20191101/fd76beee-0709-42d5-bcb2-bb45f8ba1aae/myCURReport-RedshiftManifest.json'
    credentials 'arn:aws:iam::<AWS_IAM_ROLE>'
    GZIP CSV IGNOREHEADER 1 TIMEFORMAT 'auto' manifest;

  4. Validate the data loaded in the Amazon Redshift table. See the following code:
    select * from AWSBilling201910
    where lineItem_ProductCode = 'AmazonS3'
    and lineItem_ResourceId = 's3spendanalysisblog' limit 10;

    The following screenshot shows that data has been loaded correctly in the Amazon Redshift table:

Managing database security

You can manage database security in Amazon Redshift by controlling which users have access to which database objects. To make sure your objects are secure, create two groups: FINANCE and ADMIN, with two users in FINANCE and one user in ADMIN. Complete the following steps:

  1. Create the groups where the user accounts are assigned. The following code creates two different user groups:
    create group finance;
    create group admin;

    To view all user groups, query the PG_GROUP system catalog table (you should see finance and admin here):

    select * from pg_group;

  2. Create three database users with different privileges and add them to the groups. See the following code:
    create user finance1 password 'finance1Pass'
    in group finance;
    
    create user finance2 password 'finance2Pass'
    in group finance;
    
    create user admin1 password 'admin1Pass'
    in group admin;

    Validate the users have been successfully created. To view a list of users, query the PG_USER catalog table:
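
    -- finance1, finance2, and admin1 should appear in the output
    select * from pg_user;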

  3. Grant SELECT privileges to the FINANCE group and ALL privileges to the ADMIN group for your table AWSBilling201910 in redshift_schema. See the following code:
    grant select on table redshift_schema.AWSBilling201910 to group finance; 
    grant all on table redshift_schema.AWSBilling201910 to group admin;

    You can verify that you enforced database security correctly. For example, the user finance1 tried to rename the table AWSBilling201910 in redshift_schema but received a permission denied error due to restricted access. The following screenshot shows this scenario and the resulting error message:
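
    The following sketch reproduces that scenario (the new table name is arbitrary); the exact error text can vary, but the statement is rejected because the finance group only has SELECT on the table:

    -- Run as the finance1 user; SELECT succeeds, but renaming the table is denied
    alter table redshift_schema.AWSBilling201910 rename to AWSBilling201910_archive;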

Example S3 inventory analysis

S3 charges are split per bucket. The following query identifies the data storage and transfer costs for each separate S3 bucket:

SELECT
  "lineitem_productcode",
  "lineitem_usagetype",
  "lineitem_resourceid",
  b."storage_class",
  SUM(CASE
    WHEN "lineitem_usagetype" like '%Byte%' THEN "lineitem_usageamount"/1024
    ELSE "lineitem_usageamount"
  END) as "Usage",
  CASE
    WHEN "lineitem_usagetype" like '%Byte%' THEN 'TBs'
    ELSE 'Requests'
  END as "Usage Units",
  sum("lineitem_blendedcost") as cost
from awsbilling201902 a
  join spectrum_schema.data b
    on a.lineItem_ResourceId = b.bucket
where "product_productname" = 'Amazon Simple Storage Service'
group by
  "lineitem_productcode",
  "lineitem_usagetype",
  "lineitem_resourceid",
  b."storage_class"
order by
  sum("lineitem_blendedcost") desc;

The following screenshot shows the results of executing the above query:

Costs are split by type of storage (for example, Glacier versus standard storage).

The following query identifies S3 data transfer costs (intra-region and inter-region) by S3 storage class (usage amount, unblended cost, blended cost):

SELECT
 lineitem_productcode
 ,product_fromlocation
 ,product_tolocation,
  b.storage_class
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  join spectrum_schema.data b
ON
    a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
 AND a.product_productfamily = 'Data Transfer'
GROUP BY
 1,2,3,4
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

The following query identifies S3 fee, API request, and storage charges:

SELECT
 lineitem_productcode
 ,product_productfamily
 ,b.storage_class
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  join spectrum_schema.data b
ON
   a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
  and a.product_productfamily <> 'Data Transfer'
GROUP BY
 1,2,3
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

Server access logs sample analysis queries

S3 server access logs let you split charges per operation type. The following query identifies the data storage and transfer costs for each separate HTTP operation:

SELECT
  "lineitem_productcode",
  "lineitem_usagetype",
  "lineitem_resourceid",
  b."operation",
  b."httpstatus",
  b."bytessent",
  SUM(CASE
      WHEN "lineitem_usagetype" like '%Byte%'
        THEN "lineitem_usageamount" / 1024
      ELSE "lineitem_usageamount"
      END) as "Usage",
  CASE
  WHEN "lineitem_usagetype" like '%Byte%'
    THEN 'TBs'
  ELSE 'Requests'
  END  as "Usage Units",
  sum("lineitem_blendedcost") as cost
from awsbilling201902 a
  join spectrum_schema.s3accesslogs b
    on a.lineItem_ResourceId = b.bucket
where "product_productname" = 'Amazon Simple Storage Service'
group by
  1, 2, 3, 4, 5, 6
order by
  sum("lineitem_blendedcost") desc;

The following screenshot shows the result of executing the above query:

The following query identifies S3 data transfer costs (intra-region and inter-region) by S3 operation and HTTP status (usage amount, unblended cost, blended cost):

SELECT
 lineitem_productcode
 ,product_fromlocation
 ,product_tolocation
 ,b.operation
 ,b.httpstatus
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  JOIN spectrum_schema.s3accesslogs b
ON
   a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
 AND a.product_productfamily = 'Data Transfer'
GROUP BY
 1,2,3,4,5
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

The following query identifies S3 fee, API request, and storage charges:

SELECT
 lineitem_productcode
 ,product_productfamily
 ,b.operation
 ,b.httpstatus
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  JOIN spectrum_schema.s3accesslogs b
ON
   a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
  and a.product_productfamily <> 'Data Transfer'
GROUP BY
 1,2,3,4
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

Overall data flow diagram

The following diagram shows the complete data flow for this solution.

Conclusion

AWS Glue provides an easy and convenient way to discover data stored in your S3 buckets automatically in a cloud-native, secure, and efficient way. This post demonstrated how to use AWS Glue and Amazon Redshift to analyze your S3 spend using Cost and Usage Reports. You also learned best practices for managing database security in Amazon Redshift through users and groups. Using this framework, you can start analyzing your S3 bucket spend with a few clicks in a matter of minutes on the AWS Management Console!

If you have questions or suggestions, please leave your thoughts in the comments section below.

About the Author

Shayon Sanyal is a Data Architect, Data Lake for Global Financial Services at AWS.

Amazon Redshift at re:Invent 2019

Post Syndicated from Corina Radovanovich original https://aws.amazon.com/blogs/big-data/amazon-redshift-at-reinvent-2019/

The annual AWS re:Invent learning conference is an exciting time full of new product and program launches. At the first re:Invent conference in 2012, AWS announced Amazon Redshift. Since then, tens of thousands of customers have started using Amazon Redshift as their cloud data warehouse. In 2019, AWS shared several significant launches and dozens of sessions. This post presents highlights of what happened with Amazon Redshift at re:Invent 2019.

Andy Jassy’s AWS re:Invent 2019 keynote

When Andy Jassy takes the stage to talk about what’s new at AWS, he launches the new Amazon Redshift node type, RA3, with managed storage; the new Federated Query feature (preview); Export to Data Lake; and the Advanced Query Accelerator (AQUA) for Amazon Redshift (preview). Watch AWS re:Invent 2019 – Keynote with Andy Jassy on YouTube, or jump ahead for the Amazon Redshift announcements.

Deep dive and best practices for Amazon Redshift

Every year the Amazon Redshift deep dive session rates highly, and people continue to watch and re-watch it after the event. This year was no different. Specialist Solution Architects Harshida Patel and Tony Gibbs take an in-depth look at best practices for data warehousing with Amazon Redshift. It’s a must-see for existing Amazon Redshift users. Watch AWS re:Invent 2019: Deep dive and best practices for Amazon Redshift (ANT418) on YouTube.

What’s new with Amazon Redshift, featuring Yelp and Workday

With over 200 new features and capabilities launched in the last 18 months, there’s a lot to cover in a session about what’s new with Amazon Redshift. Join one of the Product Managers driving RA3 and managed storage, Himanshu Raja, to catch up on the recent performance, concurrency, elasticity, and manageability enhancements behind Amazon Redshift’s record price-to-performance ratio. You also get more insight into the architectural evolution of Amazon Redshift with RA3 and managed storage, and how it uses machine learning to create a self-optimizing data warehouse. In the second half of the session, you hear from Steven Moy, a software engineer at Yelp, about how Amazon Redshift’s latest features have helped Yelp achieve optimization and scale for an organization with an enormous amount of data and sophisticated analytics. Watch AWS re:Invent 2019: What’s new with Amazon Redshift, featuring Yelp (ANT320-R1) on YouTube.

The session repeated with Michalis Petropoulos, Director of Engineering, and Erol Guney of Workday. Watch the full session to get a slightly different take on what’s new, or jump to the customer presentation to hear from Erol Guney, Architect, Data Platform, at Workday about how Amazon Redshift empowers their Data as a Service product team to focus on architecture goals and business logic.

Migrate your data warehouse to the cloud in record time, featuring Nielsen and Fannie Mae

In this session, you learn about important concepts and tips for migrating your legacy on-premises data warehouse to the cloud. You hear from Tejas Desai, VP of Technology at Nielsen, about their migration journey and benefits. Watch AWS re:Invent 2019: Migrate your data warehouse to the cloud, featuring Nielsen (ANT334-R1) on YouTube.

The repeat of this session features Amy Tseng from Fannie Mae. If you don’t want to listen to Tony’s overview again, skip ahead to learn how Fannie Mae embraced a data lake architecture with Amazon Redshift for analytics to save costs, maximize performance, and scale. Amy’s presentation was a crowd favorite, with some of the most positive customer feedback and a wealth of great information about how Fannie Mae managed their migration.

How to scale data analytics with Amazon Redshift, featuring Duolingo and Warner Brothers Interactive Entertainment

Data is growing fast, and so is the value that business users need to gain from business data. When AWS first announced Amazon Redshift in 2012, it could handle up to 640 TB of compressed data. It can now scale to 8 PB of compressed data. Learn more about Amazon Redshift’s unique ability to deliver top performance at the lowest and most predictable cost from Vinay Shukla, Principal Product Manager. This is an especially important session if you want to learn more about the newest Amazon Redshift node type RA3. You also hear from Jonathan Burket of Duolingo about their experience in the preview of RA3 nodes and how Duolingo uses Amazon Redshift. Duolingo is a wildly popular language-learning platform and the most downloaded education app in the world, with over 300 million users. Enabling data-driven decisions with A/B tests and ad hoc analysis has been a driver of their success. Watch AWS re:Invent 2019: How to scale data analytics with Amazon Redshift (ANT335-R2) on YouTube.

The repeat session features Redshift Product Manager Maor Kleider with an in-depth case study from Matt Howell, Executive Director, Analytics, and Kurt Larson, Technical Director, Analytics, at Warner Brothers Interactive Entertainment. Watch the full session for another perspective about how to scale with the latest Amazon Redshift features, with unique insights about analytics across Amazon Redshift and your data lake. You can also jump to the customer presentation. Not only is this session packed with interesting insights about how data analytics drives the success of games like Batman and Mortal Kombat, it also has an action-packed trailer about all the awesome Warner Brothers games.

If you prefer to see a session without the announcements from the keynote and with demos, watch Debu Panda showcase the new Amazon Redshift console and share practical tips about using Amazon Redshift.

Amazon Redshift reimagined: RA3 and AQUA

This embargoed session is the first opportunity to learn more about AQUA for Amazon Redshift and how it improves query performance by up to 10 times. Britt Johnston, Director of Product Management, kicks off with an intro into the next generation of Amazon Redshift, and Senior Principal Engineer Andy Caldwell jumps in to share the origin and vision of the exciting new technology. The enthusiasm Andy feels about sharing AQUA with customers for the first time is palpable. Watch AWS re:Invent 2019: [NEW LAUNCH!] Amazon Redshift reimagined: RA3 and AQUA (ANT230) on YouTube.

State-of-the-art cloud data warehousing, featuring Asurion and Comcast

This session serves as a great introduction to cloud data warehousing at AWS, with insightful presentations from a different customer in each delivery. You can hear from Asurion about how they use data analytics to serve over 300 million people with excellent customer satisfaction scores. You learn about how to use AWS services with Amazon Redshift and why Asurion believes in their data lake-based architecture. Watch AWS re:Invent 2019: State-of-the-art cloud data warehousing, featuring Asurion (ANT213-R1) on YouTube.

In the repeat session, Rajat Garg, Senior Principal Architect from Comcast, talks about moving to Amazon Redshift from a legacy on-premises Oracle Exadata environment. He shares their strategy, approach, and performance improvements.

What’s next and more information

In addition to these sessions at re:Invent, there are also hands-on workshops, intimate builder roundtables, and interactive chalk talks that weren’t recorded.

Keep exploring the following links for more information about new releases:

We hope to see you in Las Vegas for re:Invent 2020, or at one of the hundreds of other AWS virtual and in-person events running around the world. For more information, see AWS Events and Webinars.


About the authors

Corina Radovanovich leads product marketing for cloud data warehousing at AWS.

Maximize data ingestion and reporting performance on Amazon Redshift

Post Syndicated from Vasu Kiran Gorti original https://aws.amazon.com/blogs/big-data/maximize-data-ingestion-and-reporting-performance-on-amazon-redshift/

This is a guest post from ZS. In their own words, “ZS is a professional services firm that works closely with companies to help develop and deliver products and solutions that drive customer value and company results. ZS engagements involve a blend of technology, consulting, analytics, and operations, and are targeted toward improving the commercial experience for clients.”

ZS was involved in setting up and operating a MicroStrategy-based BI application that sources 700 GB of data from Amazon Redshift as a data warehouse in an Amazon-hosted backend architecture. ZS sourced healthcare data from various pharma data vendors, delivered through sources such as Amazon S3 buckets and FTP locations, into the data lake. They processed this data using transient Amazon EMR clusters and stored it on Amazon S3 for reporting consumption. The reporting-specific data is moved to Amazon Redshift using COPY commands, and MicroStrategy uses it to refresh front-end dashboards.

ZS has strict, client-set SLAs to meet with the available Amazon Redshift infrastructure. We carried out experiments to identify an approach to handle large data volumes using the available small Amazon Redshift cluster.

This post provides an approach for loading a large volume of data from S3 to Amazon Redshift and applies efficient distribution techniques for enhanced performance of reporting queries on a relatively small Amazon Redshift cluster.

Data processing methodology

ZS infrastructure is hosted on AWS, where they store and process pharma industry data from various vendors using AWS services before reporting the data on a MicroStrategy BI reporting tool. The following diagram shows the overall data flow from flat files to reports shown on MicroStrategy for end-users.

Step 1: Pharma data is sourced from various vendors and different systems, such as FTP locations, individual systems, and Amazon S3 buckets.

Step 2: Cost-effective transient clusters are spun up as needed to provide the compute power to run PySpark code.

Step 3: After processing, data is stored in Amazon S3 buckets for consumption by downstream applications.

Step 4: 700 GB of data is then ingested into Amazon Redshift for MSTR consumption.

Step 5: This data is read from Amazon Redshift and the insights are displayed to the end-users in the form of reports on MicroStrategy.

Dataset under consideration

In this specific scenario, ZS was working with data from the pharma domain. The following table demonstrates the data’s typical structure—it has several doctor, patient, treatment-pertinent IDs, and healthcare metrics.

Table 1
Column Name | EMR datatype | Amazon Redshift datatype
Time ID | integer | int
Geography ID | integer | int
Product ID | integer | int
Market ID | integer | int
Doctor ID | integer | int
Doctor Attribute 1 ID | integer | int
Doctor Attribute 2 ID | integer | int
Doctor Attribute 3 ID | integer | int
Doctor Attribute 4 ID | integer | int
Doctor Rank | integer | int
Metric 1 | double | decimal(18,6)
Metric 2 | double | decimal(18,6)
Metric 3 | double | decimal(18,6)
Metric 4 | double | decimal(18,6)
Metric 5 | double | decimal(18,6)
Metric 6 | double | decimal(18,6)
Metric 7 | double | decimal(18,6)
Metric 8 | double | decimal(18,6)
Metric 9 | double | decimal(18,6)
Metric 10 | double | decimal(18,6)
Metric 11 | double | decimal(18,6)
Metric 12 | double | decimal(18,6)
Metric 13 | double | decimal(18,6)
Metric 14 | double | decimal(18,6)
Metric 15 | double | decimal(18,6)
Metric 16 | double | decimal(18,6)
Metric 17 | double | decimal(18,6)
Metric 18 | double | decimal(18,6)
Metric 19 | double | decimal(18,6)
Metric 20 | double | decimal(18,6)
Metric 21 | double | decimal(18,6)
Metric 22 | double | decimal(18,6)
Metric 23 | double | decimal(18,6)
Data Snapshot Date | timestamp | timestamp
Data Refresh Date | timestamp | timestamp
Data Refresh ID | string | varchar

Each table has approximately 35–40 columns and holds approximately 200–250M rows of data. ZS used 40 such tables; they sourced the data in these tables from various healthcare data vendors and processed them as per reporting needs.

The total dataset is approximately 2 TB in size in CSV format and approximately 700 GB in Parquet format.

Challenges and constraints

The five-step process for data refresh and insight generation outlined previously takes place over the weekends within a stipulated time frame. In the default, unoptimized state, the data load from S3 to Amazon Redshift and the MicroStrategy refresh (Step 4 in the previous diagram) took almost 13–14 hours on a 2-node ds2.8xlarge cluster, which put the overall weekend run SLA (2.5 hours: 1 hour for the load and 1.5 hours for the MicroStrategy refresh) at risk.

The following diagram outlines the three constraints that ZS had to solve for to meet client needs:

Weekly time-based SLA – Load within 1 hour and fetch data on MSTR within 1.5 hours

The client IT and Business teams set a strict SLA to load 700 GB of Parquet data (equivalent to 2 TB CSV) onto Amazon Redshift and refresh the reports on the MicroStrategy BI tool. In this scenario, the client team had moved from another vendor to AWS, and the overall client expectation was to reduce costs without a significant performance dip.

Fixed cluster size – Pre-decided 2-node ds2.8xlarge cluster

The client IT teams determined the cluster size and configuration, and took into consideration the cost, data volumes, and load patterns. These were fixed and not adjustable: a 2-node ds2.8xlarge cluster. ZS carried out PoCs to optimize the environment subject to these constraints.

High data volume – Truncate and load 700 GB of data in Parquet format

The data that ZS used was pertinent to the pharma domain. The dataset under consideration in this scenario was 700 GB in Parquet format. In this specific use case, every refresh updated even historical data, so a lot of data could not simply be appended. Therefore, we followed a truncate-and-load process.

Iterative optimization

With constraints over time, data volume, and cluster size, ZS performed various experiments to optimize Amazon Redshift data load and read time—two key aspects to gauge performance. ZS created an iterative framework that helps do the following:

  • Decide the file format
  • Define optimal data distribution through distribution and sort keys
  • Identify the techniques to parallelize the data-loading process

The following diagram shows the key steps you can follow to get the best data load and read performance on any Amazon Redshift cluster.

Data load optimization

We identified and optimized four key factors impacting data load performance: file formats, file size at source, concurrency, and column encoding.

File formats

Many projects usually load data in CSV format from S3 to Amazon Redshift. ZS had data available in Parquet format with snappy compression as an output of Spark processes. (Spark processes work best with this combination.)

To identify an efficient format for Amazon Redshift, we compared Parquet with commonly used CSV and GZIP formats. We loaded a table from S3, with 200M rows of data generated through the Spark process, which equates to 41 GB in CSV, 11 GB in Parquet, and 10 GB in GZIP, and compared load time and CPU utilization. The following diagram shows load time versus CPU utilization for the same data stored in different file formats.

For the dataset and constraints we were working with, loading the Parquet files required lower CPU utilization and less I/O than CSV and GZIP, and the files occupied a smaller footprint on S3 than the bulky CSV format. Lower CPU utilization allowed more parallel loads in this scenario, thereby reducing the overall runtime required to load the Parquet files.
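
As a rough sketch of the load itself, a COPY of Parquet data from S3 looks like the following; the table name, bucket path, and IAM role are placeholders rather than the actual objects used here:

-- Load a set of Parquet files from S3 into an existing Amazon Redshift table
copy reporting_schema.doctor_metrics
from 's3://your-bucket/processed/doctor_metrics/'
iam_role 'arn:aws:iam::<account-id>:role/<redshift-load-role>'
format as parquet;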

File size at source

The next aspect was to choose the block size in which the Parquet files were broken down and stored on S3. A block size of 128 MB is commonly used for Spark jobs and considered to be optimal for data processing. However, Amazon Redshift works best with larger files.

We loaded 10 GB of Parquet data broken down into smaller equisized files of 250 MB, 750 MB, 1 GB, 1.5 GB, and 3 GB block sizes and noted the performance in each case. The following graph shows the different load times.

Data load time improved gradually as the block size increased, with the best load time at 1 GB. Beyond the 1 GB mark, performance dipped: Amazon Redshift took more time to process the larger files.

These numbers are specific to the kind of data we were working with. The recommendations can vary as the form and shape of data changes.

As a best practice, identify the block size and have the number of files as a multiple of the number of slices of the Amazon Redshift cluster. This makes sure that each slice does an equal amount of work and there are no idle slices, thereby increasing efficiency and improving performance. For more information, see Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift.
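
You can check how many slices your cluster has with a quick query against the STV_SLICES system view and then split your source files into a multiple of that number:

-- Total number of slices across the cluster
select count(*) as slice_count from stv_slices;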

Concurrency

The COPY command is relatively low on memory. The more parallel the loads, the better the performance. ZS loaded a table approximately 7.3 GB multiple times with separate concurrency settings. We measured the throughput in terms of the average time taken per GB to move files to Amazon Redshift with 1 to 20 concurrent loads. The following table summarizes the results.

Test | Number of tables loaded in parallel (concurrency) | Total data loaded (GB)
Test 1 | 1 | 7.3
Test 2 | 5 | 36.5
Test 3 | 10 | 73
Test 4 | 15 | 109.5
Test 5 | 20 | 146

The following diagram shows the time taken to load 1 GB of data and the CPU utilization for various concurrency settings.

For the dataset and constraints we were working with, a concurrency of 10 gave the best throughput, with a CPU availability buffer of about 25%. Depending on the nature of the data and volume fluctuations with every release, you can opt for a different buffer.
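
In practice, higher concurrency means running separate COPY commands for different tables at the same time from separate sessions. The following sketch shows one way to review those loads afterward using the STL_LOAD_COMMITS system table; the file names returned are whatever your own COPY commands processed:

-- Review recent COPY activity to see how many files loaded and when
select query, trim(filename) as file_name, lines_scanned, curtime
from stl_load_commits
order by curtime desc
limit 20;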

Column encoding

To identify the column encoding and compression on Amazon Redshift that gives the best performance and occupies the lowest storage footprint, ZS compared ZSTD (which the ANALYZE COMPRESSION command recommended), LZO, and no encoding on Amazon Redshift tables for load performance. The following diagram shows the time taken to load the same data volume into tables with no encoding, ZSTD, and LZO applied to the columns.

For the dataset and constraints we were working with, ZSTD encoding on columns offered a high compression ratio (approximately 3x compared to no compression) and gave the best data copy performance and the lowest storage footprint on Amazon Redshift for our use case. You can see varied results depending on the datatype and cardinality of the data.

Note: This solution was implemented prior to the release of AZ64 encoding and therefore does not consider its impact. You can use the approach described in this post to evaluate AZ64 alongside the other compression encodings Amazon Redshift supports.
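
As a sketch of how to apply this, you can ask Amazon Redshift to recommend encodings for an existing table and then declare ZSTD explicitly in the DDL; the table and column names here are hypothetical:

-- Get encoding recommendations for an existing table
analyze compression reporting_schema.doctor_metrics;

-- Declare ZSTD explicitly when creating a table
create table reporting_schema.doctor_metrics_zstd (
  doctor_id int encode zstd,
  time_id int encode zstd,
  metric_1 decimal(18,6) encode zstd
);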

Data read optimization

ZS also improved the data read performance by MicroStrategy from Amazon Redshift by using distribution and sorting keys and SQL optimization (minimizing filters on MicroStrategy auto-generated SQL queries).

SQL queries

MicroStrategy is a business intelligence tool that reads data from a database by intelligently building its own SQL. We compared the performance of MSTR SQLs (typical data warehouse queries with SELECT, GROUP BY, and temporary tables) with and without filters and observed that, for our dataset, the query ran in almost the same time and used the same resources in both cases, but the unfiltered query returned four times more data. The following table summarizes the results.

Filters? | Rows fetched | Query runtime | CPU utilization
Y | 500K | 5.1 (mins) | 15%
N | 2M | 5.1 (mins) | 15%

Tweaking the SQL to use minimal or no filters can fetch significantly more rows with only a small increase in processing time compared to reading data with more filters. If you need the entire dataset, it is better to fetch the complete data with one query than to run several SQLs with separate filters in parallel.

Distribution and sort keys

The following table is an analysis of a table load and read performance with and without distribution and sort keys for our use case.

Dist. Style | Dist Key | Sort Keys | Load time (mins) | Query run time (mins) | Query CPU utilization
None | - | - | 16.7 | 59.5 | 59%
Key | Most suitable distribution key depending on the query that gives even distribution | 6 columns that appear in the where clause, in the order of the group by clause in the SQL | 17.5 | 36.1 | 32%

The table without a distribution key loaded a little faster. However, the differences in data read time and CPU utilization between the two queries are significant. This means that more parallel loads could run efficiently when the keys were set appropriately, because the overall CPU utilization dropped significantly. The key takeaway is to use a distribution style (auto, even, or manual) to optimize data reads from Amazon Redshift and allow more parallel processing. ZS used a manual distribution style and chose distribution and sort keys based on an extensive understanding of the data and refresh cadence.
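
A minimal sketch of such a table definition follows; the choice of doctor_id as the distribution key and the sort key order are illustrative, not the actual keys ZS used:

-- Manual distribution on a frequently joined column, sorted by columns used in filters
create table reporting_schema.doctor_metrics_dist (
  time_id int,
  doctor_id int,
  product_id int,
  metric_1 decimal(18,6)
)
diststyle key
distkey (doctor_id)
compound sortkey (time_id, product_id);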

Next steps to get the best output from your Amazon Redshift instance

As an outcome of the multiple POC results, we identified the most suitable file formats, compression techniques, re-partitioned Parquet file block sizes, and distribution and interleaved sorting logic that offer the best performance for our dataset for reporting on MicroStrategy with Amazon Redshift as the database. This helped us identify the best data load and read combinations to load 700 GB of Parquet data (equivalent to 2 TB of CSV data) within the client-set 2.5 hour SLA using the available fixed 2 node ds2.8xlarge cluster.

The following diagram shows the iterative process you can follow to identify the best data load and read techniques suited for an Amazon Redshift cluster configuration.

The following are a few key takeaways:

  • Parquet and Amazon Redshift worked well together. The data in Parquet format had low CPU utilization and I/O requirements, which allowed more parallel loads.
  • ZSTD encoding worked best for this specific dataset because it also compresses numeric columns well.
  • Sort keys and distribution keys on tables can bring down read time by approximately 80% compared to tables with no distribution and sorting logic applied.
  • Filtering data on Amazon Redshift does not work the same way as in typical databases. You can improve data filtering performance with the appropriate sort keys.
  • File size at the source and concurrency are interrelated and you should choose them accordingly. Larger blocks (maximum 1 GB, for this specific dataset) are loaded faster onto Amazon Redshift.

About the Authors

Vasu Kiran Gorti is a result-oriented professional with technology and functional/domain experience predominantly in sales and marketing consulting. Essaying a role of Associate Consultant at ZS Associates, he has experience in working with life sciences and healthcare clients in alignment to their business objectives and expectations. He specializes in MicroStrategy, Business Intelligence, Analytics, and reporting. Proactive and innovative, Vasu enjoys taking up stimulating initiatives that bridge the gap between technology and business. He’s a permanent beta—always learning, improvising and evolving.

Ajit Pathak is a Technology Consultant at ZS Associates and leads BI and data management projects for Pharmaceutical companies. His areas of interest and expertise include MSTR, Redshift, and AWS suite. He loves designing complex applications and recommending best architectural practices that lead to efficient and concise dashboards. A qualified technology consultant, Ajit has focused on driving informed business decisions through clear data communication. When he is not focusing on areas of data visualizations and applications, Ajit loves to read, play badminton, and participate in debates ranging from politics to sports.

Working with nested data types using Amazon Redshift Spectrum

Post Syndicated from Juan Yu original https://aws.amazon.com/blogs/big-data/working-with-nested-data-types-using-amazon-redshift-spectrum/

Redshift Spectrum is a feature of Amazon Redshift that allows you to query data stored on Amazon S3 directly and supports nested data types. This post discusses which use cases can benefit from nested data types, how to use Amazon Redshift Spectrum with nested data types to achieve excellent performance and storage efficiency, and some of the limitations of nested data types.

This post uses a data set generated with dummy data. You can view its table schema. If you’d like to try the dataset, deploy a Redshift cluster, execute the DDLs there, and use the example queries from this post or build your own.

Data modeling

In many scenarios, data is generated in a hierarchy. For example, assume a customer bought several items. For analytic purposes, there are various data modeling approaches to save storage or speed up data processing. One popular approach to achieve storage efficiency is the dimensional model.

The following table shows dummy customer data.

username | name | sex | address | mail | birthdate
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10
shepherdlisa | Mark Lee | M | 754 Michelle Gateway Port Johnstad, ME 35695 | [email protected] | 11/10/32
palmerpaul | Jennifer Marshall | F | 869 Harrell Forges Apt. 111 East Monica, MO 01243 | [email protected] | 3/11/07
brettmcgee | Travis Wilson | M | 535 Lisa Flat East Andrew, ID 43332 | [email protected] | 3/22/10
torresdiana | Ashley Hoffman | F | 7815 Lauren Ranch Ambertown, FL 93225 | [email protected] | 5/14/60

The following table contains dummy order data, which is linked to the customer table via a foreign key username.

username | transaction_date | shipping_date | items | price
erin15 | 10/11/19 | 10/13/19 | 10 | 4794
erin15 | 10/11/19 | 10/12/19 | 7 | 1697
erin15 | 10/7/19 | 10/9/19 | 2 | 15
erin15 | 10/6/19 | 10/10/19 | 5 | 1744
erin15 | 10/5/19 | 10/10/19 | 7 | 6346

In the dimensional model, each customer’s information is stored only one time. There is no duplicated data, even though a customer could order multiple items at various times.

The dimensional model is optimal for storage. However, it can be challenging to process data efficiently. To get a full picture of your data, you need to join the two tables together to restore the hierarchy.

For example, to find out how many items customer Mark Lee bought and his total spending in the last three months, the query needs to join the customers and orders table. See the following code:

Select c.username, o.transaction_date, o.shipping_date, sum(items), sum(price) 
from customers c inner join orders o on (c.username = o.username) 
where c.name = 'Mark Lee' 
and transaction_date > DATEADD(month, -3, GETDATE())
group by 1,2,3;

When there are millions of customers who might buy multiple items in each transaction, the join can be very expensive. A fast-growing dataset can be so large that you need to store it in a distributed system. To perform the join, you need to shuffle data through the network, and the cost becomes even more significant.

As storage becomes cheaper and cheaper, people are starting to use a flattened model. In this model, data is pre-joined to gain processing efficiency. The following table shows that the customer and order information is stored in one record and ready to be analyzed.

username | name | sex | address | mail | birthdate | transaction_date | shipping_date | items | price
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 9/14/19 | 10/12/19 | 2 | 1237
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 9/16/19 | 10/9/19 | 8 | 4824
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 9/17/19 | 10/10/19 | 9 | 4392
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 9/17/19 | 10/9/19 | 3 | 1079
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 9/25/19 | 10/7/19 | 1 | 208
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 10/2/19 | 10/5/19 | 10 | 3689
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 10/5/19 | 10/10/19 | 7 | 6346
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 10/6/19 | 10/10/19 | 5 | 1744
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 10/7/19 | 10/9/19 | 2 | 15
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 10/11/19 | 10/13/19 | 10 | 4794
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 10/11/19 | 10/12/19 | 7 | 1697
palmerpaul | Jennifer Marshall | F | 869 Harrell Forges Apt. 111 East Monica, MO 01243 | [email protected] | 3/11/07 | 9/14/19 | 9/22/19 | 6 | 4642
palmerpaul | Jennifer Marshall | F | 869 Harrell Forges Apt. 111 East Monica, MO 01243 | [email protected] | 3/11/07 | 9/17/19 | 9/21/19 | 1 | 527
palmerpaul | Jennifer Marshall | F | 869 Harrell Forges Apt. 111 East Monica, MO 01243 | [email protected] | 3/11/07 | 10/9/19 | 10/12/19 | 5 | 408
torresdiana | Ashley Hoffman | F | 7815 Lauren Ranch Ambertown, FL 93225 | [email protected] | 5/14/60 | 9/17/19 | 9/28/19 | 9 | 5452

This model also works well on a distributed system. Because each row contains complete information, you can process it on any node, and don’t need to shuffle data. You can also use the columnar format to store data, which allows the query engine to read only the needed columns instead of the whole row. This technique improves analytics performance and is storage efficient.

Both models have their pros and cons. The dimensional model trades compute power for storage efficiency, and the flattened model trades storage for processing efficiency.

Some new data types are available that achieve the best of both. Instead of putting child records into another table, you can nest them into the parent record and get the full information without performing a join. It effectively denormalizes the data without duplicating the parent record.

The following diagram illustrates this workflow.

You can apply this model to a schemaful hierarchy dataset. Continuing with the customer and order example, although a customer might buy multiple items, each order item contains the same type of information, such as product ID, price, and vendor.

The hierarchy is clear and consistent. You can map data to a nested structured schema, which you can store and access efficiently via SQL language.

The following table is a nested data presentation of the previous example.

username | name | sex | address | mail | birthdate | transaction_date | shipping_date | items | price
erin15 | Sarah Newman | F | 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 | [email protected] | 4/24/10 | 9/14/19 | 10/12/19 | 2 | 1237
 | | | | | | 9/16/19 | 10/9/19 | 8 | 4824
 | | | | | | 9/17/19 | 10/10/19 | 9 | 4392
 | | | | | | 9/17/19 | 10/9/19 | 3 | 1079
 | | | | | | 9/25/19 | 10/7/19 | 1 | 208
 | | | | | | 10/2/19 | 10/5/19 | 10 | 3689
 | | | | | | 10/5/19 | 10/10/19 | 7 | 6346
 | | | | | | 10/6/19 | 10/10/19 | 5 | 1744
 | | | | | | 10/7/19 | 10/9/19 | 2 | 15
 | | | | | | 10/11/19 | 10/13/19 | 10 | 4794
 | | | | | | 10/11/19 | 10/12/19 | 7 | 1697
palmerpaul | Jennifer Marshall | F | 869 Harrell Forges Apt. 111 East Monica, MO 01243 | [email protected] | 3/11/07 | 9/14/19 | 9/22/19 | 6 | 4642
 | | | | | | 9/17/19 | 9/21/19 | 1 | 527
 | | | | | | 10/9/19 | 10/12/19 | 5 | 408
torresdiana | Ashley Hoffman | F | 7815 Lauren Ranch Ambertown, FL 93225 | [email protected] | 5/14/60 | 9/17/19 | 9/28/19 | 9 | 5452

The following graph compares the storage usage for the three models (all in Parquet format).

The graph shows that the nested structure is as storage efficient as the dimensional model.

Using nested data types

Nested data types are structured data types for some common data patterns, and they support structs, arrays, and maps.

A struct is similar to a relational table. It groups object properties together. For example, if a customer profile contains their name, address, email, and birthdate, it appears as the following schema:

customer struct<
              username:string,
              name:string,
              sex:string,
              address:string,
              mail:string,
              birthdate:string>

The data appears as the following code:

    "customer": {
        "username": "kevin35",
        "name": "Nancy Alvarez",
        "sex": "F",
        "address": "05472 Kathleen Turnpike\nNew Ashley, NV 84430",
        "mail": "[email protected]",
        "birthdate": "1961-02-05"
    }

An array stores one-to-many relationships. For example, a customer may have multiple shipping addresses or phone numbers. If a customer has several phone numbers, it appears as the following schema:

Phonenumbers array<string>

The data appears as the following code:

['555-5555', '555-1234']

A map is a collection of key-value pairs. You can consider it as a list of struct<key, value> elements. For example, if a customer has particular reward preferences, it appears as the following schema:

preference map<string,boolean> 

The data appears as the following code:

{one_day_delivery=true, 
 coupon=false, 
 free_shipping=true}

Nested data could have another nested data type as a member. The most common one is an array of structs. For example, an order containing multiple items could appear as the following schema:

orders array<
            struct<
              product_id:string,
              price:int,
              onsale:boolean,
              tax:int,
              weight:int,
              others:int,
              vendor:string>
            > 

You can create a complex object by combining them. For example, a customer’s online transaction appears as the following schema:

customer struct<
              username:string,
              name:string,
              sex:string,
              address:string,
              mail:string,
              birthdate:string> , 
  shipping_address array<
                      struct<
                        name:string,
                        street_address:string,
                        city:string,
                        postcode:string>
                      > , 
  creditcard string , 
  transaction_date string , 
  shipping_date string , 
  membership string , 
  preference map<string,boolean> , 
  orders array<
            struct<
              product_id:string,
              price:int,
              onsale:boolean,
              tax:int,
              weight:int,
              others:int,
              vendor:string>
            > , 
  platform string , 
  comments string 
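
For illustration only, the following is a minimal sketch of how part of such a schema might be declared as a Redshift Spectrum external table. The schema name, table name, S3 location, and varchar sizes are assumptions, and only a subset of the fields above is shown:

create external table spectrum_demo.customer_order_nested (
  customer struct<username:varchar(64), name:varchar(128), sex:varchar(8),
                  address:varchar(256), mail:varchar(128), birthdate:varchar(16)>,
  transaction_date varchar(16),
  shipping_date varchar(16),
  orders array<struct<product_id:varchar(64), price:int, onsale:boolean,
                      tax:int, weight:int, others:int, vendor:varchar(128)>>
)
stored as parquet
location 's3://your-bucket/customer_order_nested/';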

Popular query engines such as Hive, Spark, Presto, and Redshift Spectrum support nested data types. The SQL syntax those engines support can be different. To make it straightforward and consistent, all query examples in this post use Amazon Redshift Spectrum. For more information, see Tutorial: Querying Nested Data with Amazon Redshift Spectrum.

Use cases for nested data types

Nested data types have many benefits: they simplify your ETL and data modeling, and help you achieve good performance. The following are some common use cases that can benefit from nested data types.

Parent-child relationship

Nested data types keep the parent-child (summary-details) relationship by storing them collocated. This often matches how you want to analyze the data. For example, to analyze customers’ purchasing habits, you may need to find the following:

  • Customers who purchase often but buy only a few items each time. They likely want an annual membership that covers the shipping cost.
  • Customers who purchase less frequently but buy many items in one transaction. They likely expect a free shipping benefit or discount.

You need support information from the orders data, such as how many items, on average, a customer buys per transaction.

To find a list of customers who order online at least once per week, with fewer than four items each time, use the following code:

with purchases as (
select co.customer.username as customer, co.transaction_date as transaction_date, co.customer.address as address,
  (select count(*) from co.orders) as total_items, 
  (select sum(case when onsale = true then 1 else 0 end) from co.orders) as items_onsale
from demo.customer_order_nested_parq co )
select customer, count(transaction_date) as tran_cnt, avg(total_items) 
from purchases 
where total_items <= 3 and items_onsale > 0 
      and transaction_date >= '2019-09-01' 
group by 1 having tran_cnt >= 4;

With the nested order details, per item information is already grouped by customer per transaction. Children aggregation is straightforward; you can aggregate order details to categorize a customer. If you use a denormalized table, you have to do GROUP BY two times. The query could also take longer. See the following code:

with purchases as (
select cc_username as customer, transaction_date, cc_address as address,
  count(*) as total_items, 
  sum(case when co_onsale = true then 1 else 0 end) as items_onsale
from demo.customer_order_flatten_parq 
group by 1,2,3)
select customer, count(transaction_date) as tran_cnt, avg(total_items) 
from purchases 
where total_items <= 3 and items_onsale > 0 
      and transaction_date > '2019-09-01'
group by 1 having tran_cnt >= 4;

To find customers who order only once per quarter with at least 10 items and high total spending, use the following code:

with purchases as (
select co.customer.username as customer, co.transaction_date as transaction_date, 
  (select count(*) from co.orders) as total_items, 
  (select sum(price) from co.orders) as total_spending
from demo.customer_order_nested_parq co )
select customer, count(transaction_date) as tran_cnt, avg(total_spending) from purchases 
where total_items >= 10 and total_spending > 5000 and transaction_date > '2019-07-01' and transaction_date < '2019-09-30'
group by 1 having tran_cnt < 2
order by 3 desc;

Another benefit of using nested data types for parent-child data analysis is reduced resource usage. If there are one million customer transactions, there could be over five times as many item orders. For example, to find how many goods ship to Michigan each day, use the following code:

select co.shipping_date, sum(coo.weight)
from demo.customer_order_nested_parq co, co.orders coo
where co.customer.address like '%MI 012__'
group by 1
order by 1;

Assuming that 3% of customers ship orders to Michigan, after filtering the customer data, there could be approximately 3% of matching transactions. You only need to process 150 thousand item orders instead of 5 million. This greatly reduces the data to process and the resources to use when compared to a flattened model.

For the parent-child use case, nested data types provide straightforward aggregation on children, more efficient filtering, group by and windowing, and storage savings.

Many-to-many relationship

Customers could buy many items from various vendors, and a vendor could sell a product to many customers. This is a many-to-many relationship.

In a dimensional model, you need three tables: a customers table, an orders table, and a transactions table. To find the top vendors who have the most customers, you need to join the three tables. See the following code:

select vendor, transaction_date, count(distinct cc.username)
from customers cc,
     transactions tt,
     orders oo
where cc.username = tt.username
and oo.transaction_id = tt.transaction_id
and tt.transaction_date >= '2019-01-01' 
group by 1,2
order by 3 desc;

With nested data types, the query is similar to the one using the dimensional model. However, because the orders data is collocated with the customer transactions, you can join them on the fly without paying the shuffle cost. See the following code:

select coo.vendor, co.transaction_date, count(distinct co.customer.username)
from demo.customer_order_nested_parq co, 
co.orders coo
where co.transaction_date > '2019-01-01'
group by 1,2
order by 3 desc;

As another example, your vendor, Smith PLC, had a big sale event on October 10, 2019. You want to find out which customers bought your product during this sale and the top customers who spent the most. To do so, use the following code:

select co.customer.username, count(coo.product_id), sum(coo.price)
from demo.customer_order_nested_parq co, co.orders coo
where co.transaction_date = '2019-10-10'
and (select count(*) from co.orders 
     where vendor = 'Smith PLC' and onsale = true) > 0
group by 1
order by 3 desc;

Compared to the dimensional model query, the nested model is two-to-three times faster. This is on a relatively small dataset with only a few million rows. For a larger dataset, the performance improvement is even greater, and with less resource usage.

Sparse and frequently changed data

Assume that you want to reward customers who order from your online store. For each transaction, the customer can choose one or more rewards, such as free shipping, one-day delivery, a discount, or a coupon. Depending on how effective a reward is, you have to frequently modify the reward types, add new ones, or remove ones that aren’t popular.

If you store the data in a flattened model, there are two common options to track this data. The first method is creating a table with one column for each type of reward. You have to think of all possible rewards at the outset and create those columns. This could lead to a wide table and very sparse data. Alternatively, you can modify your table schema when you want to add or remove a reward type. That adds more maintenance work, and you may lose historical data. The following table demonstrates this method (all transaction_id values in the table examples below are fake).

transaction_id | free_shipping | one_day_delivery | discount | coupon
pklein35966659391853535 | FALSE | TRUE | | TRUE
rebeccawiliams228880139768961 | | FALSE | | TRUE
brooke39180013629693040 | TRUE | FALSE | TRUE | TRUE
jchapman4283556333561927 | FALSE | TRUE | FALSE | FALSE
mariamartin3515336516983566 | FALSE | FALSE | TRUE |

The second option is storing one reward per row. This avoids the wide table issue and the burden of constantly updating the schema. The approach is suitable if you only need to analyze a single reward. However, if you want to see whether there is any correlation between rewards, such as whether more customers prefer free shipping and one-day delivery over a discount and a coupon, this option is more complicated (see the sketch after the following table). This model also needs more storage. The following table demonstrates this method.

transaction_id | reward_type | value
pklein35966659391853535 | free_shipping | FALSE
pklein35966659391853535 | one_day_delivery | TRUE
pklein35966659391853535 | coupon | TRUE
rebeccawiliams228880139768961 | one_day_delivery | FALSE
rebeccawiliams228880139768961 | coupon | TRUE
brooke39180013629693040 | free_shipping | TRUE
brooke39180013629693040 | one_day_delivery | FALSE
brooke39180013629693040 | discount | TRUE
brooke39180013629693040 | coupon | TRUE
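
For example, with this layout, correlating even two rewards requires a self-join per reward. The following is a hedged sketch against a hypothetical demo.transaction_rewards table with the columns shown above:

-- Count transactions where the customer chose both free shipping and one-day delivery.
select count(distinct fs.transaction_id)
from demo.transaction_rewards fs
join demo.transaction_rewards od
  on fs.transaction_id = od.transaction_id
where fs.reward_type = 'free_shipping' and fs.value = true
  and od.reward_type = 'one_day_delivery' and od.value = true;

Each additional reward in the correlation adds another join, which is why this model becomes unwieldy for multi-reward analysis.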

A compromise is to use a JSON string to store selected rewards together in one column, which avoids schema change. See the following code:

preference varchar(65535)	

The following table shows how the data is stored in JSON string:

transaction_id | preference
pklein35966659391853535 | {"coupon":true, "free_shipping":false, "one_day_delivery":true}
rebeccawiliams228880139768961 | {"coupon":true, "one_day_delivery":false}
brooke39180013629693040 | {"coupon":true, "discount":true, "free_shipping":true, "one_day_delivery":false}
jchapman4283556333561927 | {"coupon":false, "discount":false, "free_shipping":false, "one_day_delivery":true}
mariamartin3515336516983566 | {"discount":true, "free_shipping":false, "one_day_delivery":false}

You can analyze it by using a JSON function to extract the reward data. See the following code:

select correlation, count(username) from (
select username,
(case when 
    (json_extract_path_text(preference,'free_shipping')  = 'true' and  
     json_extract_path_text(preference,'one_day_delivery')  = 'true') 
     then 1
 when 
    (json_extract_path_text(preference,'discount') = 'true' and  
     json_extract_path_text(preference,'coupon')  = 'true') 
     then 2
else 0 
 end) as correlation
from demo.transactions
  ) t
group by 1;

This solution is acceptable, but you could be more storage efficient and more performant by using the nested data type map. See the following code:

preference map<string, boolean>

The following table shows how the data is stored in map:

transaction_idpreference
pklein35966659391853535{coupon=true, free_shipping=false,one_day_delivery=true}
rebeccawiliams228880139768961{coupon=true, one_day_delivery=false}
brooke39180013629693040{coupon=true, discount=true, free_shipping=true,one_day_delivery=false}
jchapman4283556333561927{coupon=false, discount=false, free_shipping=false, one_day_delivery=true}
mariamartin3515336516983566{discount=true, free_shipping=false,one_day_delivery=false}

 

You can analyze a single reward or multiple rewards using SQL. For example, to find how many customers prefer free shipping, use the following code:

select count(distinct co.customer.username)
from demo.customer_order_nested_parq co, co.preference cm
where cm.key = 'free_shipping' and cm.value = true;

To find how many customers prefer free shipping and one-day delivery more than a coupon or discount, use the following code:

with customer_rewards as (
select co.customer.username as customer, 
 (select count(*) from co.preference cm 
where cm.key = 'free_shipping' and cm.value = true) as shipping_pref,
 (select count(*) from co.preference cm 
where cm.key = 'one_day_delivery' and cm.value = true) as delivery_pref,
 (select count(*) from co.preference cm 
where cm.key = 'coupon' and cm.value = true) as coupon_pref,
 (select count(*) from co.preference cm 
where cm.key = 'discount' and cm.value = true) as discount_pref
from demo.customer_order_nested_parq co)
select case when shipping_pref > 0 and delivery_pref > 0 then 1
            when coupon_pref > 0 and discount_pref > 0 then 2
            else 0
       end as correlation, count(customer)
from customer_rewards
group by 1;

The map type allows you to add any key-value pair. You can add a new reward type at any time without a schema change, and you can analyze the new reward right away.

The main advantage of the map type is that it supports flexible schema and eliminates the need to update the schema frequently. However, there is not much performance benefit. If performance is your top priority, a flattened table is recommended. You can also flatten the most-often accessed columns, and use map for the less frequently accessed columns.

Limitations of nested data types

Although nested data types are useful in many use cases, they have the following limitations:

  • There is a hard limit on the size of the nested children data.
  • Data is effectively append-only; updating it is difficult and slow because you need to rewrite the entire nested object even if you want to modify one child attribute.
  • Processing is split at the parent record level. You may run into problems if the children data is heavily skewed.
  • The query engine may not support all types of analytics on nested data.
  • For more information, see Amazon Redshift Spectrum Nested Data Limitations.

Summary

This post discussed the benefits of nested data types and use cases in which nested data types can help improve storage efficiency, performance, or simplify analysis. There are many more use cases in which nested data types can be an ideal solution. Try it out and share your experiences!

 


About the Author

 Juan Yu is a Data Warehouse Specialist Solutions Architect at AWS.

 

 

 

 

ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 2

Post Syndicated from Asim Kumar Sasmal original https://aws.amazon.com/blogs/big-data/etl-and-elt-design-patterns-for-lake-house-architecture-using-amazon-redshift-part-2/

Part 1 of this multi-post series, ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 1, discussed common customer use cases and design best practices for building ELT and ETL data processing pipelines for data lake architecture using Amazon Redshift Spectrum, Concurrency Scaling, and recent support for data lake export.

This post shows you how to get started with a step-by-step walkthrough of a few ETL and ELT design patterns of Amazon Redshift using AWS sample datasets.

Prerequisites

Before getting started, make sure that you meet the following prerequisites:

  1. This post uses two publicly available AWS sample datasets from the US-West-2 (Oregon) Region. Use the US-West-2 (Oregon) Region for your test run to reduce cross-region network latency and cost due to the data movement.
  2. You have an AWS account in the same Region.
  3. You have the AdministratorAccess policy granted to your AWS account (for production, you should restrict this further).
  4. You have an existing Amazon S3 bucket named eltblogpost in your data lake to store unloaded data from Amazon Redshift. Because bucket names are unique across AWS accounts, replace eltblogpost with your unique bucket name as applicable in the sample code provided.
  5. You have AWS CLI installed and configured to use with your AWS account.
  6. You have an IAM policy named redshift-elt-test-s3-policy with the following read and write permissions for the Amazon S3 bucket named eltblogpost:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads",
                    "s3:ListMultipartUploadParts",
                    "s3:AbortMultipartUpload",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::eltblogpost",
                    "arn:aws:s3:::eltblogpost/*"
                ],
                "Effect": "Allow"
            }
        ]
    }

  7. You have an IAM policy named redshift-elt-test-sampledata-s3-read-policy with read only permissions for the Amazon S3 bucket named awssampledbuswest2, hosting the sample data used for this walkthrough.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*"
                ],
                "Resource": [
                    "arn:aws:s3:::awssampledbuswest2",
                    "arn:aws:s3:::awssampledbuswest2/*"
                ]
            }
        ]
    }

  8. You have an IAM role named redshift-elt-test-role that has a trust relationship with redshift.amazonaws.com and glue.amazonaws.com and the following IAM policies (for production, you should restrict this further as needed):
    • redshift-elt-test-s3-policy
    • redshift-elt-test-sampledata-s3-read-policy
    • AWSGlueServiceRole
    • AWSGlueConsoleFullAccess
  9. Make a note of the ARN for redshift-elt-test-role IAM role.
  10. You have an existing Amazon Redshift cluster with the following parameters:
    • Cluster name as rseltblogpost.
    • Database name as rselttest.
    • Four dc2.large nodes.
    • An associated IAM role named redshift-elt-test-role.
    • A publicly available endpoint.
    • A cluster parameter group named eltblogpost-parameter-group, which you use to change the Concurrency Scaling settings.
    • Cluster workload management set to manual.
  11. You have SQL Workbench/J (or another tool of your choice) and can connect successfully to the cluster.
  12. You have an EC2 instance in the same Region with PostgreSQL client CLI (psql) and can connect successfully to the cluster.
  13. You have an AWS Glue catalog database named eltblogpost as the metadata catalog for Amazon Athena and Redshift Spectrum queries.

Loading data to Amazon Redshift local storage

This post uses the Star Schema Benchmark (SSB) dataset. It is provided publicly in an S3 bucket, which any authenticated AWS user with access to Amazon S3 can access.

To load data to Amazon Redshift local storage, complete the following steps:

  1. Connect to the cluster from the SQL Workbench/J.
  2. Execute the CREATE TABLE statements from the Github repo from your SQL Workbench/J to create tables from the SSB dataset. The following diagram shows the list of tables.
  3. Execute the COPY statements from the Github repo. This step loads data into the tables you created using the sample data available in s3://awssampledbuswest2/ssbgz/. Remember to replace your IAM role ARN noted previously.
  4. To verify that each table loaded correctly, run the following commands:
    select count(*) from LINEORDER; 
    select count(*) from PART;
    select count(*) from CUSTOMER;
    select count(*) from SUPPLIER;
    select count(*) from DWDATE;

    The following results table shows the number of rows for each table in the SSB dataset:

    Table Name    Record Count
    LINEORDER     600,037,902
    PART            1,400,000
    CUSTOMER        3,000,000
    SUPPLIER        1,000,000
    DWDATE              2,556

In addition to the record counts, you can also check for a few sample records from each table.
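
For example, a minimal sketch of such a spot check, along with a look at the load error system table, might be:

-- Inspect a few rows from one of the loaded tables.
select * from lineorder limit 5;

-- Check whether any recent COPY command rejected rows.
select starttime, filename, err_reason
from stl_load_errors
order by starttime desc
limit 5;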

Performing ELT and ETL using Amazon Redshift and unload to S3

The following are the high-level steps for this walkthrough:

  1. You are looking to pre-aggregate some commonly asked measures by your end-users on the point of sales (POS) data you loaded in Amazon Redshift local storage.
  2. You then want to unload the aggregated data from Amazon Redshift to your data lake (S3) in an open, analytics-optimized, and compressed Parquet file format. You also want to consider optimized partitioning for the unloaded data in your data lake to help with end-user query performance and eventually lower cost.
  3. You want to query the unloaded data from your data lake with Redshift Spectrum. You also want to share the data with other AWS services such as Athena with its pay-per-use and serverless ad hoc and on-demand query model, AWS Glue and Amazon EMR for performing ETL operations on the unloaded data and data integration with your other datasets (such as ERP, finance, or third-party data) stored in your data lake, and Amazon SageMaker for machine learning.

Complete the following steps:

  1. To compute the necessary pre-aggregates, execute the following three ELT queries available on the Github repo from your SQL Workbench/J:
    • ELT Query 1 – This query summarizes the revenue by manufacturer, category, and brand per month per year per supplier region.
    • ELT Query 2 – This query summarizes the revenue by brand per month per year by supplier region and city.
    • ELT Query 3 – This query drills down in time by customer city, supplier city, month, and year.
  2. To unload the aggregated data to S3 with Parquet file format and proper partitioning to help with the access patterns of the unloaded data in the data lake, execute the three UNLOAD queries available on the Github repo from your SQL Workbench/J. To use Redshift Spectrum for querying the unloaded data, you need the following:
    • An Amazon Redshift cluster and a SQL client (SQL Workbench/J or another tool of your choice) that can connect to your cluster and execute SQL commands. The cluster and the data files in S3 must be in the same Region.
    • An external schema in Amazon Redshift that references a database in the external data catalog and provides the IAM role ARN that authorizes your cluster to access S3 on your behalf. It is a best practice to have an external data catalog in AWS Glue. You are now ready to create an AWS Glue crawler.
  3. From AWS CLI, run the following code (replace <Your AWS Account>):
    aws glue create-crawler --cli-input-json file://mycrawler.json --region us-west-2
    Where the file mycrawler.json contains:
    {
        "Name": "eltblogpost_redshift_spectrum_etl_elt_glue_crawler",
        "Role": "arn:aws:iam::<Your AWS Account>:role/redshift-elt-test-role",
        "DatabaseName": "eltblogpost",
        "Description": "",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://eltblogpost/unload_parquet/monthly_revenue_by_region_manufacturer_category_brand"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/monthly_revenue_by_region_city_brand"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/yearly_revenue_by_city"
                }
            ]
        }
    }

    You can also schedule the crawler to run periodically based on your use case. For example, you can schedule the crawler every 35 minutes to keep the AWS Glue catalog tables up to date with the data being unloaded every 30 minutes. However, this post does not configure any scheduling.

  4. After you create the AWS Glue crawler, run it manually from AWS CLI with the following command:
    aws glue start-crawler --name "eltblogpost_redshift_spectrum_etl_elt_glue_crawler" --region us-west-2
  5. When the AWS Glue crawler run is complete, go to the AWS Glue console to see the following three AWS Glue catalog tables under the database eltblogpost:
    • monthly_revenue_by_region_manufacturer_category_brand
    • monthly_revenue_by_region_city_brand
    • yearly_revenue_by_city
  6. Now that you have an external data catalog in AWS Glue named eltblogpost, create an external schema named spectrum_eltblogpost in the persistent cluster using the following SQL from your SQL Workbench/J (replace <Your AWS Account>):
    create external schema spectrum_eltblogpost 
    from data catalog 
    database 'eltblogpost' 
    iam_role 'arn:aws:iam::<Your AWS Account>:role/redshift-elt-test-role'
    create external database if not exists;

    Using Spectrum, you can now query the three AWS Glue catalog tables you set up earlier.

  7. Go to SQL Workbench/J and run the following sample queries:
    • Top 10 brands by category and manufacturer contributing to revenue for the region AFRICA for the year of 1992 and month of March:
      SELECT brand, category, manufacturer, revenue 
      from "spectrum_eltblogpost"."monthly_revenue_by_region_manufacturer_category_brand"
      where year = '1992'
      and month = 'March' 
      and supplier_region = 'AFRICA'
      order by revenue desc
      limit 10;
      
      brand | category | manufacturer | revenue
      ----------+----------+--------------+-----------
      MFGR#1313 | MFGR#13 | MFGR#1 | 5170356068
      MFGR#5325 | MFGR#53 | MFGR#5 | 5106463527
      MFGR#3428 | MFGR#34 | MFGR#3 | 5055551376
      MFGR#2425 | MFGR#24 | MFGR#2 | 5046250790
      MFGR#4126 | MFGR#41 | MFGR#4 | 5037843130
      MFGR#219 | MFGR#21 | MFGR#2 | 5018018040
      MFGR#159 | MFGR#15 | MFGR#1 | 5009626205
      MFGR#5112 | MFGR#51 | MFGR#5 | 4994133558
      MFGR#5534 | MFGR#55 | MFGR#5 | 4984369900
      MFGR#5332 | MFGR#53 | MFGR#5 | 4980619214

    • Monthly revenue for the region AMERICA for the year 1992 across all brands:
      SELECT month, sum(revenue) revenue
      FROM "spectrum_eltblogpost"."monthly_revenue_by_region_city_brand"
      where year = '1992'
      and supplier_region = 'AMERICA'
      group by month;
      
      month | revenue
      ----------+--------------
      April | 4347703599195
      January | 4482598782080
      September | 4332911671240
      December | 4489411782480
      May | 4479764212732
      August | 4485519151803
      October | 4493509053843
      June | 4339267242387
      March | 4477659286311
      February | 4197523905580
      November | 4337368695526
      July | 4492092583189

    • Yearly revenue for supplier city ETHIOPIA 4 for the years 1992–1995 and month of December:
      SELECT year, supplier_city, sum(revenue) revenue
      FROM "spectrum_eltblogpost"."yearly_revenue_by_city"
      where supplier_city in ('ETHIOPIA 4')
      and year between '1992' and '1995'
      and month = 'December'
      group by year, supplier_city
      order by year, supplier_city;
      
      year | supplier_city | revenue
      -----+---------------+------------
      1992 | ETHIOPIA 4 | 91006583025
      1993 | ETHIOPIA 4 | 90617597590
      1994 | ETHIOPIA 4 | 92015649529
      1995 | ETHIOPIA 4 | 89732644163

When the data is in S3 and cataloged in the AWS Glue catalog, you can query the same catalog tables using Athena, AWS Glue, Amazon EMR, Amazon SageMaker, Amazon QuickSight, and many more AWS services that have seamless integration with S3.

Accelerating ELT and ETL using Redshift Spectrum and unload to S3

Assume that you need to pre-aggregate a set of commonly requested metrics from your end-users on a large dataset stored in the data lake (S3) cold storage using familiar SQL and unload the aggregated metrics in your data lake for downstream consumption.

The following are the high-level steps for this walkthrough:

  1. This is a batch workload that requires standard SQL joins and aggregations on a fairly large volume of relational and structured data. You want to use the power of Redshift Spectrum to perform the required SQL transformations on the data stored in S3 and unload the transformed results back to S3.
  2. You want to query the unloaded data from your data lake using Redshift Spectrum if you have an existing cluster, Athena with its pay-per-use and serverless ad hoc and on-demand query model, AWS Glue and Amazon EMR for performing ETL operations on the unloaded data and data integration with your other datasets in your data lake, and Amazon SageMaker for machine learning.

Because Redshift Spectrum allows you to query the data directly from your data lake without needing to load into Amazon Redshift local storage, you can spin up a short-lived cluster to perform ELT at a massive scale using Redshift Spectrum and terminate the cluster when the work is complete. You can automate spinning up and terminating the short-lived cluster using AWS CloudFormation. That way, you only pay for the few minutes or hours of use. A short-lived cluster can also avoid overloading the current persistent cluster serving interactive queries from live users. For this post, use your existing cluster rseltblogpost.

This post uses a publicly available sample dataset named tickit provided by AWS, which any authenticated AWS user with access to S3 can access:

  • Sales – s3://awssampledbuswest2/tickit/spectrum/sales/
  • Event – s3://awssampledbuswest2/tickit/allevents_pipe.txt
  • Date – s3://awssampledbuswest2/tickit/date2008_pipe.txt
  • Users – s3://awssampledbuswest2/tickit/allusers_pipe.txt

For performance reasons, it is a Redshift Spectrum best practice to load the dimension tables into the local storage of your short-lived cluster and use an external table for the fact table Sales.
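
The actual DDL is in the GitHub repo; as a hedged sketch, the external fact table definition might look similar to the following (the column definitions are assumptions based on the standard tickit schema):

create external table spectrum_eltblogpost.sales(
  salesid integer,
  listid integer,
  sellerid integer,
  buyerid integer,
  eventid integer,
  dateid smallint,
  qtysold smallint,
  pricepaid decimal(8,2),
  commission decimal(8,2),
  saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://awssampledbuswest2/tickit/spectrum/sales/';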

Complete the following steps:

  1. Connect to the cluster from the SQL Workbench/J. To use Redshift Spectrum for querying data from the data lake (S3), you need to have the following:
    • An Amazon Redshift cluster and a SQL client (SQL Workbench/J or another tool of your choice) that can connect to your cluster and execute SQL commands. The cluster and the data files in S3 must be in the same Region.
    • An external schema in Amazon Redshift that references a database in the external data catalog and provides the IAM role ARN that authorizes your cluster to access S3 on your behalf. It is a best practice to have an external data catalog in AWS Glue.
    • An AWS Glue catalog database named eltblogpost that you already created.
    • An external schema in your Redshift cluster named spectrum_eltblogpost that you already created.
  2. Execute the SQL available on the Github repo to create an external table named sales in the same external schema named spectrum_eltblogpost. As shown in the previous section, you can also use an AWS Glue crawler to create the external table.
  3. Execute the SQLs available on the Github repo to create the dimension tables to load the data into Amazon Redshift local storage for Redshift Spectrum performance best practices.
  4. Execute the COPY statements available on the Github repo to load the dimension tables using the sample data available in s3://awssampledbuswest2/tickit/. Replace the IAM role ARN with the IAM role ARN you noted earlier, which is associated with your cluster.
  5. To verify that each table has the correct record count, execute the following commands:
    select count(*) from date;
    select count(*) from users;
    select count(*) from event;
    select count(*) from spectrum_eltblogpost.sales;

    The following results table shows the number of rows for each table in the tickit dataset:

    Table Name                    Record Count
    DATE                           365
    USERS                       49,990
    EVENT                        8,798
    spectrum_eltblogpost.sales 172,456

    In addition to the record counts, you can also check for a few sample records from each table.

  6. To compute the necessary pre-aggregates, execute the following three ELT queries available on the Github repo from your SQL Workbench/J:
    • ELT Query 1 – Total quantity sold on a given calendar date.
    • ELT Query 2 – Total quantity sold to each buyer.
    • ELT Query 3 – Events in the 99.9 percentile in terms of all-time gross sales.
  7. To unload the aggregated data to S3 with Parquet file format and proper partitioning to help with the access patterns of the unloaded data in the data lake, execute the three UNLOAD queries available on the Github repo from your SQL Workbench/J.
  8. To use Redshift Spectrum for querying the unloaded data, you can either create a new AWS Glue crawler or modify the previous crawler named eltblogpost_redshift_spectrum_etl_elt_glue_crawler. Update the existing crawler using the following code from AWS CLI (replace <Your AWS Account>):
    aws glue update-crawler --cli-input-json file://mycrawler.json --region us-west-2
    Where the file mycrawler.json contains:
    {
        "Name": "eltblogpost_redshift_spectrum_etl_elt_glue_crawler",
        "Role": "arn:aws:iam::<Your AWS Account>:role/redshift-elt-test-role",
        "DatabaseName": "eltblogpost",
        "Description": "",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://eltblogpost/unload_parquet/monthly_revenue_by_region_manufacturer_category_brand"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/monthly_revenue_by_region_city_brand"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/yearly_revenue_by_city"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/total_quantity_sold_by_date"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/total_quantity_sold_by_buyer_by_date"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/total_price_by_eventname"
                }
            ]
        }
    }

  9. After you create the crawler successfully, run it manually from AWS CLI with the following command:
    aws glue start-crawler --name "eltblogpost_redshift_spectrum_etl_elt_glue_crawler" --region us-west-2
  10. When the crawler run is complete, go to the AWS Glue console. The following additional catalog tables are in the catalog database eltblogpost:
    • total_quantity_sold_by_date
    • total_quantity_sold_by_buyer_by_date
    • total_price_by_eventname
  11. Using Spectrum, you can now query the three preceding catalog tables. Go to SQL Workbench/J and run the following sample queries:
      • Top 10 days for quantities sold in February and March 2008:
        SELECT caldate, total_quantity
        FROM "spectrum_eltblogpost"."total_quantity_sold_by_date"
        where caldate between '2008-02-01' and '2008-03-30'
        order by total_quantity desc
        limit 10;
        
        caldate | total_quantity
        -----------+---------------
        2008-02-20 | 1170
        2008-02-25 | 1146
        2008-02-19 | 1145
        2008-02-24 | 1141
        2008-03-26 | 1138
        2008-03-22 | 1136
        2008-03-17 | 1129
        2008-03-08 | 1129
        2008-02-16 | 1127
        2008-03-23 | 1121

      • Top 10 buyers for quantities sold in February and March 2008:
        SELECT firstname,lastname,total_quantity
        FROM "spectrum_eltblogpost"."total_quantity_sold_by_buyer_by_date"
        where caldate between '2008-02-01' and '2008-03-31'
        order by total_quantity desc
        limit 10;
        
        firstname | lastname | total_quantity
        ----------+------------+---------------
        Laurel | Clay | 9
        Carolyn | Valentine | 8
        Amelia | Osborne | 8
        Kai | Gill | 8
        Gannon | Summers | 8
        Ignacia | Nichols | 8
        Ahmed | Mcclain | 8
        Amanda | Mccullough | 8
        Blair | Medina | 8
        Hadley | Bennett | 8

      • Top 10 event names for total price:
        SELECT eventname, total_price
        FROM "spectrum_eltblogpost"."total_price_by_eventname"
        order by total_price desc
        limit 10;
        
        eventname | total_price
        ---------------------+------------
        Adriana Lecouvreur | 51846.00
        Janet Jackson | 51049.00
        Phantom of the Opera | 50301.00
        The Little Mermaid | 49956.00
        Citizen Cope | 49823.00
        Sevendust | 48020.00
        Electra | 47883.00
        Mary Poppins | 46780.00
        Live | 46661.00

After the data is in S3 and cataloged in the AWS Glue catalog, you can query the same catalog tables using Amazon Athena, AWS Glue, Amazon EMR, Amazon SageMaker, Amazon QuickSight, and many more AWS services that have seamless integration with S3.

Scaling ELT and unload running in parallel using Concurrency Scaling

Assume that you have a mixed workload under concurrency when the UNLOAD queries and the ELT jobs run in parallel in your cluster with Concurrency Scaling turned on. When Concurrency Scaling is enabled, Amazon Redshift automatically adds additional cluster capacity when you need to process an increase in concurrent read queries including UNLOAD queries. By default, Concurrency Scaling mode is turned off for your cluster. In this post, you enable the Concurrency Scaling mode for your cluster.

Complete the following steps:

  1. Go to your cluster parameter group named eltblogpost-parameter-group and complete the following:
    • Update max_concurrency_scaling_clusters to 5.
    • Create a new queue named Queue 1 with Concurrency Scaling mode set to Auto and a query group named unload_query for the UNLOAD jobs in the next steps.
  2. After you make these changes, reboot your cluster for the changes to take effect.
  3. For this post, use psql client to connect to your cluster rseltblogpost from the EC2 instance that you set up earlier.
  4. Open an SSH session to your EC2 instance and copy the nine files as shown below from the concurrency folder in the Github repo to /home/ec2-user/eltblogpost/ in your EC2 instance.
  5. Review the concurrency-elt-unload.sh script that runs the following eight jobs in parallel:
    • An ELT script for SSB dataset, which kicks off one query at a time.
    • An ELT script for the tickit dataset, which kicks off one query at a time.
    • Three unload queries for the SSB datasets kicked off in parallel.
    • Three unload queries for the tickit datasets kicked off in parallel.
  6. Run the concurrency-elt-unload.sh script. While the script is running, you see sample output showing the progress of each job. The following are the response times taken by the script:
    real 2m40.245s
    user 0m0.104s
    sys 0m0.000s
  7. Run the following query to validate that some of the UNLOAD queries ran in the Concurrency Scaling clusters (look for "which_cluster = Concurrency Scaling" in the query output below):
    SELECT query,
    Substring(querytxt,1,90) query_text,
    starttime starttime_utc,
    (endtime-starttime)/(1000*1000) elapsed_time_secs,
    case when aborted= 0 then 'complete' else 'error' end status,
    case when concurrency_scaling_status = 1 then 'Concurrency Scaling' else 'Main' end which_cluster
    FROM stl_query
    WHERE database = 'rselttest'
    AND starttime between '2019-10-20 22:53:00' and '2019-10-20 22:56:00'
    AND userid=100
    AND querytxt NOT LIKE 'padb_fetch_sample%'
    AND (querytxt LIKE 'create%' or querytxt LIKE 'UNLOAD%')
    ORDER BY query DESC;

    See the following output from the query: 

  8. Comment out the following SET statement in the six UNLOAD query files (ssb-unload<1-3>.sql and tickit-unload<1-3>.sql) to force all six UNLOAD queries to run in the main cluster:
    set query_group to 'unload_query';

    In other words, disable Concurrency Scaling mode for the UNLOAD queries.

  9. Run the concurrency-elt-unload.sh script. While the script is running, you see sample output showing the progress of each job. The following are the response times taken by the script:
    real 3m40.328s
    user 0m0.104s
    sys 0m0.000s

    The following shows the Workload Management settings for the Redshift cluster: 

  10. Run the following query to validate that all the queries ran in the main cluster (look for "which_cluster = Main" in the query output below):
    SELECT query,
    Substring(querytxt,1,90) query_text,
    starttime starttime_utc,
    (endtime-starttime)/(1000*1000) elapsed_time_secs,
    case when aborted= 0 then 'complete' else 'error' end status,
    case when concurrency_scaling_status = 1 then 'Concurrency Scaling' else 'Main' end which_cluster
    FROM stl_query
    WHERE database = 'rselttest'
    AND starttime between '2019-10-20 23:19:00' and '2019-10-20 23:24:00'
    AND userid=100
    AND querytxt NOT LIKE 'padb_fetch_sample%'
    AND (querytxt LIKE 'create%' or querytxt LIKE 'UNLOAD%')
    ORDER BY query DESC;

    See the following output from the query: With Concurrency Scaling, the end-to-end runtime improved by 60 seconds; the run without Concurrency Scaling took about 37.5% longer (3 minutes 40 seconds versus 2 minutes 40 seconds).

Summary

This post provided a step-by-step walkthrough of a few straightforward examples of the common ELT and ETL design patterns of Amazon Redshift using some key Amazon Redshift features such as Amazon Redshift Spectrum, Concurrency Scaling, and the recent support for data lake export.

As always, AWS welcomes feedback. Please submit thoughts or questions in the comments.

 


About the Authors

Asim Kumar Sasmal is a senior data architect – IoT in the Global Specialty Practice of AWS Professional Services. He helps AWS customers around the globe design and build data-driven solutions by providing expert technical consulting, best practices guidance, and implementation services on the AWS platform. He is passionate about working backwards from the customer ask, helping customers think big, and diving deep to solve real business problems by leveraging the power of the AWS platform.

 

Maor Kleider is a principal product manager for Amazon Redshift, a fast, simple, and cost-effective data warehouse. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases, and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.

ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 1

Post Syndicated from Asim Kumar Sasmal original https://aws.amazon.com/blogs/big-data/etl-and-elt-design-patterns-for-lake-house-architecture-using-amazon-redshift-part-1/

Part 1 of this multi-post series discusses design best practices for building scalable ETL (extract, transform, load) and ELT (extract, load, transform) data processing pipelines using both primary and short-lived Amazon Redshift clusters. You also learn about related use cases for some key Amazon Redshift features such as Amazon Redshift Spectrum, Concurrency Scaling, and recent support for data lake export.

Part 2 of this series, ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 2, shows a step-by-step walkthrough to get started using Amazon Redshift for your ETL and ELT use cases.

ETL and ELT

There are two common design patterns when moving data from source systems to a data warehouse. The primary difference between the two patterns is the point in the data-processing pipeline at which transformations happen. This also determines the set of tools used to ingest and transform the data, along with the underlying data structures, queries, and optimization engines used to analyze the data. The first pattern is ETL, which transforms the data before it is loaded into the data warehouse. The second pattern is ELT, which loads the data into the data warehouse and uses the familiar SQL semantics and power of the Massively Parallel Processing (MPP) architecture to perform the transformations within the data warehouse.

In the following diagram, the first represents ETL, in which data transformation is performed outside of the data warehouse with tools such as Apache Spark or Apache Hive on Amazon EMR or AWS Glue. This pattern allows you to select your preferred tools for data transformations. The second diagram is ELT, in which the data transformation engine is built into the data warehouse for relational and SQL workloads. This pattern is powerful because it uses the highly optimized and scalable data storage and compute power of MPP architecture.

Redshift Spectrum

Amazon Redshift is a fully managed data warehouse service on AWS. It uses a distributed, MPP, and shared nothing architecture. Redshift Spectrum is a native feature of Amazon Redshift that enables you to run the familiar SQL of Amazon Redshift with the BI application and SQL client tools you currently use against all your data stored in open file formats in your data lake (Amazon S3).

A common pattern you may follow is to run queries that span both the frequently accessed hot data stored locally in Amazon Redshift and the warm or cold data stored cost-effectively in Amazon S3, using views with no schema binding for external tables. This enables you to independently scale your compute resources and storage across your cluster and S3 for various use cases.
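
As a hedged illustration of that pattern, a late-binding view might union a local hot table with an external cold table; the schema and table names here are assumptions:

-- A view with no schema binding can reference an external (S3) table.
create view analytics.sales_all as
select salesid, saletime, pricepaid from public.sales_hot
union all
select salesid, saletime, pricepaid from spectrum_sales_archive.sales_cold
with no schema binding;

Because the view is not bound to a schema, you can refresh or replace the underlying external table without recreating the view.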

Redshift Spectrum supports a variety of structured and semi-structured file formats, such as Apache Parquet, Avro, CSV, ORC, and JSON. Because the data stored in S3 is in open file formats, the same data can serve as your single source of truth, and other services such as Amazon Athena, Amazon EMR, and Amazon SageMaker can access it directly from your S3 data lake.

For more information, see Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required.

Concurrency Scaling

Using Concurrency Scaling, Amazon Redshift automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries. Concurrency Scaling resources are added to your Amazon Redshift cluster transparently in seconds, as concurrency increases, to serve sudden spikes in concurrent requests with fast performance without wait time. When the workload demand subsides, Amazon Redshift automatically shuts down Concurrency Scaling resources to save you cost.

The following diagram shows how the Concurrency Scaling works at a high-level:

For more information, see New – Concurrency Scaling for Amazon Redshift – Peak Performance at All Times.

Data lake export

Amazon Redshift now supports unloading the result of a query to your data lake on S3 in Apache Parquet, an efficient open columnar storage format for analytics. The Parquet format is up to two times faster to unload and consumes up to six times less storage in S3, compared to text formats. You can also specify one or more partition columns, so that unloaded data is automatically partitioned into folders in your S3 bucket to improve query performance and lower the cost for downstream consumption of the unloaded data. For example, you can choose to unload your marketing data and partition it by year, month, and day columns. This enables your queries to take advantage of partition pruning and skip scanning of non-relevant partitions when filtered by the partitioned columns, thereby improving query performance and lowering cost. For more information, see UNLOAD.
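
As a hedged sketch of this feature (the bucket, IAM role, and table names are assumptions), an UNLOAD to partitioned Parquet might look like the following:

unload ('select event_id, revenue, year, month from marketing_events')
to 's3://your-bucket/unload_parquet/marketing_events/'
iam_role 'arn:aws:iam::<Your AWS Account>:role/your-redshift-role'
format as parquet
partition by (year, month);

Queries that filter on year and month can then prune the partitions they don't need.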

Use cases

You may be using Amazon Redshift either partially or fully as part of your data management and data integration needs. You likely transitioned from an ETL to an ELT approach with the advent of MPP databases because your workloads are primarily relational, use familiar SQL syntax, and benefit from the massive scalability of the MPP architecture.

This section presents common use cases for ELT and ETL for designing data processing pipelines using Amazon Redshift.

ELT

Consider a batch data processing workload that requires standard SQL joins and aggregations on a modest amount of relational and structured data. You initially selected a Hadoop-based solution to accomplish your SQL needs. However, over time, as data continued to grow, your system didn’t scale well. You now find it difficult to meet your required performance SLA goals and face ever-increasing hardware and maintenance costs. Relational MPP databases bring an advantage in terms of performance and cost, and lower the technical barriers to processing data by using familiar SQL.

Amazon Redshift offers significant benefits here: its massively scalable, fully managed compute can process structured and semi-structured data directly from your data lake in S3.

The following diagram shows how Redshift Spectrum allows you to simplify and accelerate your data processing pipeline from a four-step to a one-step process with the CTAS (Create Table As) command.

The preceding architecture enables seamless interoperability between your Amazon Redshift data warehouse and your existing data lake on S3, which hosts other enterprise datasets such as ERP, finance, and third-party data, for a variety of data integration use cases.

The following diagram shows the seamless interoperability between your Amazon Redshift and your data lake on S3:

When you use an ELT pattern, you can also reuse your existing ELT-optimized SQL workload while migrating from your on-premises data warehouse to Amazon Redshift. This eliminates the need to rewrite relational and complex SQL workloads into a new compute framework from scratch. With Amazon Redshift, you can load, transform, and enrich your data efficiently using familiar SQL with advanced and robust SQL support, simplicity, and seamless integration with your existing SQL tools. You also get the monitoring capabilities that Amazon Redshift provides for your clusters.

ETL

You have a requirement to unload a subset of the data from Amazon Redshift back to your data lake (S3) in an open and analytics-optimized columnar file format (Parquet). You then want to query the unloaded datasets from the data lake using Redshift Spectrum and other AWS services such as Athena for ad hoc and on-demand analysis, AWS Glue and Amazon EMR for ETL, and Amazon SageMaker for machine learning.

You have a requirement to share a single version of a set of curated metrics (computed in Amazon Redshift) across multiple business processes from the data lake. You can use ELT in Amazon Redshift to compute these metrics and then use the unload operation with optimized file format and partitioning to unload the computed metrics in the data lake.

You also have a requirement to pre-aggregate a set of commonly requested metrics from your end-users on a large dataset stored in the data lake (S3) cold storage using familiar SQL and unload the aggregated metrics in your data lake for downstream consumption. In other words, consider a batch workload that requires standard SQL joins and aggregations on a fairly large volume of relational and structured cold data stored in S3 for a short duration of time. You can use the power of Redshift Spectrum by spinning up one or many short-lived Amazon Redshift clusters that can perform the required SQL transformations on the data stored in S3, unload the transformed results back to S3 in an optimized file format, and terminate the unneeded Amazon Redshift clusters at the end of the processing. This way, you only pay for the duration in which your Amazon Redshift clusters serve your workloads.

As shown in the following diagram, once the transformed results are unloaded to S3, you can query the unloaded data from your data lake using Redshift Spectrum (if you have an existing Amazon Redshift cluster), Athena with its serverless, pay-per-use model for ad hoc and on-demand analysis, AWS Glue and Amazon EMR for ETL operations and integration with your other datasets (such as ERP, finance, and third-party data) stored in your data lake, and Amazon SageMaker for machine learning.

You can also scale the unloading operation by using the Concurrency Scaling feature of Amazon Redshift. This provides a scalable and serverless option to bulk export data in an open and analytics-optimized file format using familiar SQL.

Best practices

The following recommended practices can help you to optimize your ELT and ETL workload using Amazon Redshift.

Analyze requirements to decide ELT versus ETL

The MPP architecture of Amazon Redshift and its Spectrum feature are efficient and designed for high-volume, relational, SQL-based ELT workloads (joins, aggregations) at massive scale. A common practice when designing an efficient ELT solution using Amazon Redshift is to spend sufficient time analyzing the following:

  • Type of data from source systems (structured, semi-structured, and unstructured)
  • Nature of the transformations required (usually encompassing cleansing, enrichment, harmonization, transformations, and aggregations)
  • Row-by-row, cursor-based processing needs versus batch SQL
  • Performance SLA and scalability requirements considering the data volume growth over time
  • Cost of the solution

This helps you assess whether the workload is relational and suitable for SQL at MPP scale.

Key considerations for ELT

For both ELT and ETL, it is important to build a good physical data model for better performance across all tables, including staging tables, with proper data types and distribution methods. A dimensional data model (star schema) with fewer joins works best for MPP architectures, including ELT-based SQL workloads. Where feasible, use a TEMPORARY table for intermediate staging tables in the ELT process for better write performance, because temporary tables only write a single copy.

A common rule of thumb for ELT workloads is to avoid row-by-row, cursor-based processing (a commonly overlooked finding for stored procedures). This is sub-optimal because such processing needs to happen on the leader node of an MPP database like Amazon Redshift. Instead, the recommendation for such a workload is to look for an alternative distributed processing programming framework, such as Apache Spark.

Hundreds to thousands of single-record inserts, updates, and deletes for highly transactional needs are not efficient on an MPP architecture. Instead, stage those records and apply them to the table as a batch operation, using either a bulk UPDATE or a DELETE/INSERT.
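
A minimal sketch of this staging pattern, with hypothetical table names, might look like the following:

-- Apply accumulated changes as one batch instead of row-by-row statements
BEGIN;
DELETE FROM orders
USING orders_staging
WHERE orders.order_id = orders_staging.order_id;
INSERT INTO orders
SELECT * FROM orders_staging;
COMMIT;
TRUNCATE orders_staging;  -- TRUNCATE commits implicitly in Amazon Redshift, so run it after the batch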

With the external table capability of Redshift Spectrum, you can optimize your transformation logic into a single SQL statement, as opposed to first loading data into staging tables in Amazon Redshift local storage and then running the transformations on those staging tables.
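
For example, a one-step transformation over an external table might be sketched as follows (schema and table names are hypothetical):

CREATE TABLE public.daily_sales_agg AS
SELECT product_id, sale_date, SUM(amount) AS total_amount
FROM spectrum_schema.raw_sales          -- external table read directly from S3 by Redshift Spectrum
WHERE sale_date >= '2019-01-01'
GROUP BY product_id, sale_date;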

Key considerations for data lake export

When you unload data from Amazon Redshift to your data lake in S3, pay attention to data skew or processing skew in your Amazon Redshift tables. The UNLOAD command uses the parallelism of the slices in your cluster. Hence, if there is a data skew at rest or processing skew at runtime, unloaded files on S3 may have different file sizes, which impacts your UNLOAD command response time and query response time downstream for the unloaded data in your data lake.

You should also cap the maximum file size at approximately 100 MB or less in the UNLOAD command for better downstream consumption performance. Similarly, for S3 partitioning, a rule of thumb is to keep the number of partitions per table on S3 to a couple of hundred by choosing low-cardinality partitioning columns (year, quarter, month, and day are good choices) in the UNLOAD command. This avoids creating too many partitions, which in turn creates a large volume of metadata in the AWS Glue Data Catalog, leading to high query times via Athena and Redshift Spectrum.
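
A sketch that combines both recommendations (bucket, table, and role names are illustrative) might look like this:

UNLOAD ('SELECT * FROM public.daily_metrics')
TO 's3://MY-BUCKET/DataLake/Metrics/'
FORMAT AS PARQUET
PARTITION BY (year, month)          -- low-cardinality partition columns
MAXFILESIZE 100 MB                  -- cap output file size for downstream consumers
IAM_ROLE 'arn:aws:iam::123412341234:role/myRedshiftRole';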

To get the best throughput and performance under concurrency for multiple UNLOAD commands running in parallel, create a separate queue for unload queries with Concurrency Scaling turned on. This lets Amazon Redshift burst additional Concurrency Scaling clusters as required.

Key considerations for Redshift Spectrum for ELT

To get the best performance from Redshift Spectrum, pay attention to the maximum pushdown operations possible, such as S3 scan, projection, filtering, and aggregation, in your query plans for a performance boost. This is because you want to utilize the powerful infrastructure underneath that supports Redshift Spectrum. Using predicate pushdown also avoids consuming resources in the Amazon Redshift cluster.

In addition, avoid complex operations like DISTINCT or ORDER BY on more than one column and replace them with GROUP BY as applicable. Amazon Redshift can push down a single column DISTINCT as a GROUP BY to the Spectrum compute layer with a query rewrite capability underneath, whereas multi-column DISTINCT or ORDER BY operations need to happen inside Amazon Redshift cluster.
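
For example, a multi-column deduplication written with GROUP BY (table and column names are illustrative) keeps the heavy lifting in the Spectrum layer:

-- Instead of: SELECT DISTINCT customer_id, region FROM spectrum_schema.cold_sales;
SELECT customer_id, region
FROM spectrum_schema.cold_sales
GROUP BY customer_id, region;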

The Amazon Redshift optimizer can use external table statistics to generate better execution plans. Without statistics, the plan is generated from heuristics that assume the S3 table is relatively large. It is recommended to set the table statistics (numRows) manually for S3 external tables.
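
For example, assuming a hypothetical external table and row count, the numRows property can be set as follows:

ALTER TABLE spectrum_schema.cold_sales
SET TABLE PROPERTIES ('numRows' = '170000');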

For more information on Amazon Redshift Spectrum best practices, see Twelve Best Practices for Amazon Redshift Spectrum and How to enable cross-account Amazon Redshift COPY and Redshift Spectrum query for AWS KMS–encrypted data in Amazon S3.

Summary

This post discussed the common use cases and design best practices for building ELT and ETL data processing pipelines for data lake architecture using a few key features of Amazon Redshift: Redshift Spectrum, Concurrency Scaling, and the recently released support for data lake export with partitioning.

Part 2 of this series, ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 2, shows you how to get started with a step-by-step walkthrough of a few simple examples using AWS sample datasets.

As always, AWS welcomes feedback. Please submit thoughts or questions in the comments.

 


About the Authors

Asim Kumar Sasmal is a senior data architect – IoT in the Global Specialty Practice of AWS Professional Services. He helps AWS customers around the globe design and build data-driven solutions by providing expert technical consulting, best practices guidance, and implementation services on the AWS platform. He is passionate about working backwards from the customer ask, helping them think big, and diving deep to solve real business problems by leveraging the power of the AWS platform.

 

Maor Kleider is a principal product manager for Amazon Redshift, a fast, simple and cost-effective data warehouse. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.

New for Amazon Redshift – Data Lake Export and Federated Query

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-redshift-data-lake-export-and-federated-queries/

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications. Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze data using standard SQL and existing Business Intelligence (BI) tools.

To get information from unstructured data that would not fit in a data warehouse, you can build a data lake. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. With a data lake built on Amazon Simple Storage Service (S3), you can easily run big data analytics and use machine learning to gain insights from your semi-structured (such as JSON, XML) and unstructured datasets.

Today, we are launching two new features to help you improve the way you manage your data warehouse and integrate with a data lake:

  • Data Lake Export to unload data from a Redshift cluster to S3 in Apache Parquet format, an efficient open columnar storage format optimized for analytics.
  • Federated Query to be able, from a Redshift cluster, to query across data stored in the cluster, in your S3 data lake, and in one or more Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Aurora PostgreSQL databases.

This architectural diagram gives a quick summary of how these features work and how they can be used together with other AWS services.

Let’s take a closer look at the interactions shown in the diagram, starting with how you can use these features and the advantages they provide.

Using Redshift Data Lake Export

You can now unload the result of a Redshift query to your S3 data lake in Apache Parquet format. The Parquet format is up to 2x faster to unload and consumes up to 6x less storage in S3, compared to text formats. This enables you to save data transformation and enrichment you have done in Redshift into your S3 data lake in an open format.

You can then analyze the data in your data lake with Redshift Spectrum, a feature of Redshift that allows you to query data directly from files on S3. Or you can use different tools such as Amazon Athena, Amazon EMR, or Amazon SageMaker.

To try this new feature, I create a new cluster from the Redshift console, and follow this tutorial to load sample data that keeps track of sales of musical events across different venues. I want to correlate this data with social media comments on the events stored in my data lake. To understand their relevance, each event should have a way of comparing its relative sales to other events.

Let’s build a query in Redshift to export the data to S3. My data is stored across multiple tables. I need to create a query that gives me a single view of what is going on with sales. I want to join the content of the sales and date tables, adding information on the gross sales for an event (total_price in the query) and the percentile of each event’s all-time gross sales compared to all other events.

To export the result of the query to S3 in Parquet format, I use the following SQL command:

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                       GROUP BY eventid)) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/Sales/'
FORMAT AS PARQUET
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

To give Redshift write access to my S3 bucket, I am using an AWS Identity and Access Management (IAM) role. I can see the result of the UNLOAD command using the AWS Command Line Interface (CLI). As expected, the output of the query is exported using the Parquet columnar data format:

$ aws s3 ls s3://MY-BUCKET/DataLake/Sales/
2019-11-25 14:26:56 1638550 0000_part_00.parquet
2019-11-25 14:26:56 1635489 0001_part_00.parquet
2019-11-25 14:26:56 1624418 0002_part_00.parquet
2019-11-25 14:26:56 1646179 0003_part_00.parquet

To optimize access to data, I can specify one or more partition columns so that unloaded data is automatically partitioned into folders in my S3 bucket. For example, I can unload sales data partitioned by year, month, and day. This enables my queries to take advantage of partition pruning and skip scanning irrelevant partitions, improving query performance and minimizing cost.

To use partitioning, I need to add to the previous SQL command the PARTITION BY option, followed by the columns I want to use to partition the data in different directories. In my case, I want to partition the output based on the year and the calendar date (caldate in the query) of the sales.

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                       GROUP BY eventid)) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/SalesPartitioned/'
FORMAT AS PARQUET
PARTITION BY (year, caldate)
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

This time, the output of the query is stored in multiple partitions. For example, here’s the content of a folder for a specific year and date:

$ aws s3 ls s3://MY-BUCKET/DataLake/SalesPartitioned/year=2008/caldate=2008-07-20/
2019-11-25 14:36:17 11940 0000_part_00.parquet
2019-11-25 14:36:17 11052 0001_part_00.parquet
2019-11-25 14:36:17 11138 0002_part_00.parquet
2019-11-25 14:36:18 12582 0003_part_00.parquet

Optionally, I can use AWS Glue to set up a Crawler that (on demand or on a schedule) looks for data in my S3 bucket to update the Glue Data Catalog. When the Data Catalog is updated, I can easily query the data using Redshift Spectrum, Athena, or EMR.
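
If I wanted to script that step instead of using the console, a sketch with the AWS CLI (crawler, role, and database names are placeholders) might look like this:

aws glue create-crawler \
  --name sales-partitioned-crawler \
  --role MyGlueServiceRole \
  --database-name datalake \
  --targets '{"S3Targets":[{"Path":"s3://MY-BUCKET/DataLake/SalesPartitioned/"}]}'

aws glue start-crawler --name sales-partitioned-crawler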

The sales data is now ready to be processed together with the unstructured and semi-structured (JSON, XML, Parquet) data in my data lake. For example, I can now use Apache Spark with EMR, or any SageMaker built-in algorithm, to access the data and get new insights.

Using Redshift Federated Query
You can now also access data in RDS and Aurora PostgreSQL stores directly from your Redshift data warehouse. In this way, you can access data as soon as it is available. Straight from Redshift, you can now perform queries processing data in your data warehouse, transactional databases, and data lake, without requiring ETL jobs to transfer data to the data warehouse.

Redshift leverages its advanced optimization capabilities to push down and distribute a significant portion of the computation directly into the transactional databases, minimizing the amount of data moving over the network.

Using this syntax, you can add an external schema from an RDS or Aurora PostgreSQL database to a Redshift cluster:

CREATE EXTERNAL SCHEMA IF NOT EXISTS online_system
FROM POSTGRES
DATABASE 'online_sales_db' SCHEMA 'online_system'
URI 'my-hostname' PORT 5432
IAM_ROLE 'iam-role-arn'
SECRET_ARN 'ssm-secret-arn';

Schema and port are optional here. Schema defaults to public if left unspecified, and the default port for PostgreSQL databases is 5432. Redshift uses AWS Secrets Manager to manage the credentials for connecting to the external databases.

With this command, all tables in the external schema are available and can be used by Redshift for any complex SQL query processing data in the cluster or, using Redshift Spectrum, in your S3 data lake.

Coming back to the sales data example I used before, I can now correlate the trends of my historical data of musical events with real-time sales. In this way, I can understand if an event is performing as expected or not, and calibrate my marketing activities without delays.

For example, after I define the online commerce database as the online_system external schema in my Redshift cluster, I can compare previous sales with what is in the online commerce system with this simple query:

SELECT eventid,
       sum(pricepaid) total_price,
       sum(online_pricepaid) online_total_price
  FROM sales, online_system.current_sales
 WHERE eventid = online_eventid
 GROUP BY eventid;

Redshift doesn’t import the database or schema catalog in its entirety. When a query is run, it localizes the metadata for the Aurora and RDS tables (and views) that are part of the query. This localized metadata is then used for query compilation and plan generation.

Available Now
Amazon Redshift data lake export is a new tool to improve your data processing pipeline and is supported with Redshift release version 1.0.10480 or later. Refer to the AWS Region Table for Redshift availability, and check the version of your clusters.

The new federation capability in Amazon Redshift is released as a public preview and allows you to bring together data stored in Redshift, S3, and one or more RDS and Aurora PostgreSQL databases. When creating a cluster in the Amazon Redshift management console, you can pick one of three maintenance tracks: Current, Trailing, or Preview. Within the Preview track, choose preview_features to participate in the Federated Query public preview. For example:

These features simplify data processing and analytics, giving you more tools to react quickly, and a single point of view for your data. Let me know what you are going to use them for!

Danilo

Amazon Redshift Update – Next-Generation Compute Instances and Managed, Analytics-Optimized Storage

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-redshift-update-next-generation-compute-instances-and-managed-analytics-optimized-storage/

We launched Amazon Redshift back in 2012 (Amazon Redshift – The New AWS Data Warehouse). With tens of thousands of customers, it is now the world’s most popular data warehouse. Our customers enjoy consistently fast performance, support for complex queries, and transactional capabilities, all with industry-leading price-performance.

The original Redshift model establishes a fairly rigid coupling between compute power and storage capacity. You create a cluster with a specific number of instances, and are committed to (and occasionally limited by) the amount of local storage that is provided with each instance. You can access additional compute power with on-demand Concurrency Scaling, and you can use Elastic Resize to scale your clusters up and down in minutes, giving you the ability to adapt to changing compute and storage needs.

We think we can do even better! Today we are launching the next generation of Nitro-powered compute instances for Redshift, backed by a new managed storage model that gives you the power to separately optimize your compute power and your storage. This launch takes advantage of some architectural improvements including high-bandwidth networking, managed storage that uses local SSD-based storage backed by Amazon Simple Storage Service (S3), and multiple, advanced data management techniques to optimize data motion to and from S3.

Together, these capabilities allow Redshift to deliver 3x the performance of any other cloud data warehouse service, and most existing Amazon Redshift customers using Dense Storage (DS2) instances will get up to 2x better performance and 2x more storage at the same cost.

Among many other use cases, this new combo is a great fit for operational analytics, where much of the workload is focused on a small (and often recent) subset of the data in the data warehouse. In the past, customers would unload older data to other types of storage in order to stay within storage limits, leading to additional operational complexity and making queries on historical data much harder.

Next-Generation Compute Instances
The new RA3 instances are designed to work hand-in-glove with the new managed storage model. The ra3.16xlarge instances have 48 vCPUs, 384 GiB of Memory, and up to 64 TB of storage. I can create clusters with 2 to 128 instances, giving me over 8 PB of compressed storage:

I can also create a new RA3-powered cluster from a snapshot of an existing cluster, or I can use Classic resize to upgrade my cluster to use the new instance type.

If you have an existing snapshot or a cluster, you can use the Amazon Redshift console to get a recommended RA3 configuration when you restore or resize. You can also get recommendations from the DescribeNodeConfigurationOptions function or the describe-node-configuration-options command.
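
For example, a sketch of the CLI call (with a hypothetical snapshot identifier) might look like this:

aws redshift describe-node-configuration-options \
  --action-type restore-cluster \
  --snapshot-identifier my-ds2-cluster-snapshot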

Managed, Analytics-Optimized Storage
The new managed storage is equally exciting. There’s a cache of large-capacity, high-performance SSD-based storage on each instance, backed by S3, for scale, performance, and durability. The storage system uses multiple cues, including data block temperature, data block age, and workload patterns, to manage the cache for high performance. Data is automatically placed into the appropriate tier, and you need not do anything special to benefit from the caching or the other optimizations. You pay the same low price for SSD and S3 storage, and you can scale the storage capacity of your data warehouse without adding and paying for additional instances.

Price & Availability
You can start using RA3 instances together with managed storage in all AWS Regions where Redshift is available.

Jeff;

 

Using Spatial Data with Amazon Redshift

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/using-spatial-data-with-amazon-redshift/

Today, Amazon Redshift announced support for a new native data type called GEOMETRY. This new type enables ingestion, storage, and queries against two-dimensional geographic data, together with the ability to apply spatial functions to that data. Geographic data (also known as georeferenced data) refers to data that has some association with a location relative to Earth. Coordinates, elevation, addresses, city names, zip (or postal) codes, administrative and socioeconomic boundaries are all examples of geographic data.

The GEOMETRY type enables us to easily work with coordinates such as latitude and longitude in our table columns, which can then be converted or combined with other types of geographic data using spatial functions. The type is abstract, meaning it cannot be directly instantiated, and polymorphic. The actual types supported for this data (and which will be used in table columns) are points, linestrings, polygons, multipoints, multilinestrings, multipolygons, and geometry collections. In addition to creating GEOMETRY-typed data columns in tables, the new support also enables ingestion of geographic data from delimited text files using the existing COPY command. The data in the files is expected to be in hexadecimal Extended Well-Known Binary (EWKB) format, which is a standard for representing geographic data.

To show the new type in action, I imagined a scenario where I am working as a personal tour coordinator based in Berlin, Germany, and my client has supplied me with a list of attractions that they want to visit. My task is to locate accommodation for this client that is reasonably central to the set of attractions, and within a certain budget. Geographic data is ideal for solving this scenario. Firstly, the set of points representing the attractions combine to form one or more polygons that I can use to restrict my search for accommodation. In a single query, I can then join the data representing those polygons with data representing a set of accommodations to arrive at the results. This spatial query is actually quite expensive in CPU terms, yet Redshift is able to execute it in less than one second.

Sample Scenario Data
To show my scenario in action I needed to first source various geographic data related to Berlin. Firstly I obtained the addresses, and latitude/longitude coordinates, of a variety of attractions in the city using several ‘top X things to see’ travel websites. For accommodation I used Airbnb data, licensed under the Creative Commons 1.0 Universal “Public Domain Dedication” from http://insideairbnb.com/get-the-data.html. I then added to this zip code data for the city, licensed under Creative Commons Attribution 3.0 Germany (CC BY 3.0 DE). The provider for this data is Amt für Statistik Berlin-Brandenburg.

Any good tour coordinator would of course have a web site or application with an interactive map so as to be able to show clients the locations of the accommodation that matched their criteria. In real life, I’m not a tour coordinator (outside of my family!) so for this post I’m going to focus solely on the back-end processes – the loading of the data, and the eventual query to satisfy our client’s request using the Redshift console.

Creating a Redshift Cluster
My first task is to load the various sample data sources into database tables in a Redshift cluster. To do this I go to the Redshift console dashboard and select Create cluster. This starts a wizard that walks me through the process of setting up a new cluster, starting with the type and number of nodes that I want to create.

In Cluster details I fill out a name for my new cluster, set a password for the master user, and select an AWS Identity and Access Management (IAM) role that will give permission for Redshift to access one of my buckets in Amazon Simple Storage Service (S3) in read-only mode when I come to load my sample data later. The new cluster will be created in my default Amazon Virtual Private Cloud for the region, and I also opted to use the defaults for node types and number of nodes. You can read more about available options for creating clusters in the Management Guide. Finally I click Create cluster to start the process, which will take just a few minutes.

Loading the Sample Data
With the cluster ready to use I can load the sample data into my database, so I head to the Query editor and using the pop-up, connect to my default database for the cluster.

My sample data will be sourced from delimited text files that I’ve uploaded as private objects to an S3 bucket and loaded into three tables. The first, accommodations, will hold the Airbnb data. The second, zipcodes, will hold the zip or postal codes for the city. The final table, attractions, will hold the coordinates of the city attractions that my client can choose from. To create and load the accommodations data I paste the following statements into tabs in the query editor, one at a time, and run them. Note that schemas in databases have access control semantics and the public prefix shown on the table names below simply means I am referencing the public schema, accessible to all users, for the database in use.

To create the accommodations table I use:

CREATE TABLE public.accommodations (
  id INTEGER PRIMARY KEY,
  shape GEOMETRY,
  name VARCHAR(100),
  host_name VARCHAR(100),
  neighbourhood_group VARCHAR(100),
  neighbourhood VARCHAR(100),
  room_type VARCHAR(100),
  price SMALLINT,
  minimum_nights SMALLINT,
  number_of_reviews SMALLINT,
  last_review DATE,
  reviews_per_month NUMERIC(8,2),
  calculated_host_listings_count SMALLINT,
  availability_365 SMALLINT
);

To load the data from S3:

COPY public.accommodations
FROM 's3://my-bucket-name/redshift-gis/accommodations.csv'
DELIMITER ';'
IGNOREHEADER 1
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftDemoRole';

Next, I repeat the process for the zipcodes table.

CREATE TABLE public.zipcode (
  ogc_field INTEGER,
  wkb_geometry GEOMETRY,
  gml_id VARCHAR,
  spatial_name VARCHAR,
  spatial_alias VARCHAR,
  spatial_type VARCHAR
);
COPY public.zipcode
FROM 's3://my-bucket-name/redshift-gis/zipcode.csv'
DELIMITER ';'
IGNOREHEADER 1
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole';

And finally I create the attractions table and load data into it.

CREATE TABLE public.berlin_attractions (
  name VARCHAR,
  address VARCHAR,
  lat DOUBLE PRECISION,
  lon DOUBLE PRECISION,
  gps_lat VARCHAR,
  gps_lon VARCHAR
);
COPY public.berlin_attractions
FROM 's3://my-bucket-name/redshift-gis/berlin-attraction-coordinates.txt'
DELIMITER '|'
IGNOREHEADER 1
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole';

Finding Somewhere to Stay!
With the data loaded, I can now put on my travel coordinator hat and select some properties for my client to consider for their stay in Berlin! Remember, in the real world this would likely be surfaced from a web or other application for the client. I’m simply going to make use of the query editor again.

My client has decided they want a trip to focus on the city museums and they have a budget of 200 EUR per night for accommodation. Opening a new tab in the editor, I paste in and run the following query.

WITH museums(name,loc) AS
(SELECT name, ST_SetSRID(ST_Point(lon,lat),4326) FROM public.berlin_attractions
WHERE name LIKE '%Museum%')
SELECT a.name,a.price,avg(ST_DistanceSphere(m.loc,a.shape)) AS avg_distance
FROM museums m,public.accommodations a
WHERE a.price <= 200 GROUP BY a.name,a.price ORDER BY avg_distance
LIMIT 10;

The query finds the accommodation(s) that are “best located” to visit all the museums, and whose price is within the client’s budget. Here “best located” is defined as having the smallest average distance from all the selected museums. In the query you can see some of the available spatial functions, ST_SetSRID and ST_Point, operating on the latitude and longitude GEOMETRY columns for the attractions, and ST_DistanceSphere to determine distance.

This yields the following results.

Wrap a web or native application front-end around this and we have a new geographic data-based application that we can use to delight clients who have an idea of what they want to see in the city and also want convenient and in-budget accommodation best placed to enable that!

Let’s also consider another scenario. Imagine I have a client who wants to stay in the center of Berlin but isn’t sure what attractions or accommodations are present in the central district, and has a budget of 150 EUR per night. How can we answer that question? First we need the coordinates of what we might consider to be the center of Berlin – latitude 52.516667, longitude 13.388889. Using the zipcode table we can convert this coordinate location to a polygon enclosing that region of the city. Our query must then get all attractions within that polygon, plus all accommodations (within budget), ordered by average distance from the attractions. Here’s the query:

WITH center(geom) AS
(SELECT wkb_geometry FROM zipcode
  WHERE ST_Within(ST_SetSRID(ST_Point(13.388889, 52.516667), 4326), wkb_geometry)),
pois(name,loc) AS
(SELECT name, ST_SetSRID(ST_Point(lon,lat),4326) FROM public.berlin_attractions,center
  WHERE ST_Within(ST_SetSRID(ST_Point(lon,lat),4326), center.geom))
SELECT a.name,a.price,avg(ST_DistanceSphere(p.loc,a.shape))
  AS avg_distance, LISTAGG(p.name, ';') as pois
FROM pois p,public.accommodations a
WHERE a.price <= 150 GROUP BY a.name,a.price ORDER BY avg_distance
LIMIT 10;

When I run this in the query editor, I get the following results. You can see the list of attractions in the area represented by the zipcode in the pois column.

So there we have some scenarios for making use of geographic data in Amazon Redshift using the new GEOMETRY type and associated spatial functions, and I’m sure there are many more! The new type and functions are available now in all AWS Regions to all customers at no additional cost.

— Steve

Simplify management of Amazon Redshift clusters with the Redshift console

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/simplify-management-of-amazon-redshift-clusters-with-the-redshift-console/

Amazon Redshift is the most popular and the fastest cloud data warehouse. It includes a console for administrators to create, configure, and manage Amazon Redshift clusters. The new Amazon Redshift console modernizes the user interface and adds several features to improve managing your clusters and workloads running on clusters.

The new Amazon Redshift console provides the following benefits:

  • Visibility to the health and performance of clusters from a unified dashboard.
  • Simplified management of clusters by streamlining several screens and flows, and reducing the number of clicks for several everyday operations.
  • Improved mean-time-to-diagnose query performance issues by adding capabilities to monitor user queries and correlate with cluster performance metrics.

This post discusses how you can use the new console to create your first cluster, and manage and monitor your clusters in your AWS account.

Prerequisites

For the best experience in using the new Amazon Redshift console, make sure you meet the following prerequisites.

If you are using the AmazonRedshiftFullAccess managed policy, you don’t need to make any changes to your permissions. AmazonRedshiftFullAccess grants full access to all Amazon Redshift resources for the AWS account.

If you are using custom policies, either attach AmazonRedshiftFullAccess or add the following code to your IAM user’s policy:

{  
   "Version":"2012-10-17",
   "Statement":[  
      {  
         "Action":[  
            "cloudwatch:ListMetrics",
            "cloudwatch:GetMetricWidgetImage",
            "cloudwatch:GetMetricData",
            "tag:GetResources",
            "tag:UntagResources",
            "tag:GetTagValues",
            "tag:GetTagKeys",
            "tag:TagResources",
            "iam:ListRoles"
         ],
         "Effect":"Allow",
         "Resource":"*"
      }
   ]
}

You also have to upgrade your clusters to the latest maintenance patch.

New console launch page

When you log in to the new console and you don’t have a cluster, you see the launch page. You can navigate to your dashboard, clusters, queries, Query Editor, configuration changes, and advisor using the left navigation menu.

You are redirected to the Dashboard page if you have at least one cluster.

The following screenshot shows your Amazon Redshift console and navigation menu.

To expand the menu, choose the expand icon at the top of the menu. See the following screenshot.

Setting up a new Amazon Redshift data warehouse

To use Amazon Redshift, you have to first create a new Amazon Redshift cluster. You can launch a cluster by providing a few parameters such as node type, number of nodes, and the master user password. The rest of the parameters can be default values. You can also get an estimated cost for your cluster, and calculate the best configuration based on the size of your data.
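
For example, a minimal sketch of launching a cluster from the AWS CLI (identifiers and password are placeholders) might look like this:

aws redshift create-cluster \
  --cluster-identifier my-first-cluster \
  --node-type dc2.large \
  --number-of-nodes 2 \
  --master-username awsuser \
  --master-user-password 'Choose4StrongPassword'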

Now that you created an Amazon Redshift cluster, you’re ready to create some tables and load data using the Query Editor.

Query Editor

The Query Editor allows you to author and execute queries in your cluster. To connect to your cluster, from Editor, choose Connect to database.

You have to provide your cluster URL and database credentials to connect to your database. See the following screenshot.

You can run SQL commands to create tables, load data, run queries, and perform visual analysis in the Query Editor.

You can switch to a new database connection by clicking on the “Change connection” button as highlighted below.

Monitoring dashboard

More than 15,000 customers use Amazon Redshift to power their analytical workloads to enable modern analytics use cases such as Business Intelligence, predictive analytics, and real-time streaming analytics. As an Amazon Redshift administrator or developer, you want your users, such as data analysts or BI professionals, to get optimal performance for their workloads. Being proactive, you might set up Amazon CloudWatch alarms on different Amazon Redshift metrics.

With the existing console, you can navigate to different pages to get a glance of the health and performance of Amazon Redshift clusters. The new dashboard provides unified visibility to the health and performance of your clusters.

The dashboard enables Amazon Redshift administrators and operators to gain visibility into the following:

  • The number of clusters, number of nodes, availability, and cluster health status.
  • CloudWatch alarms for your clusters.
  • Critical performance metrics for the top five clusters, such as the number of queries, database connections, and CPU utilization.
  • Performance workloads such as query throughput, average query length, average run time, and average wait time for your clusters.
  • Consolidated view of all events for your clusters.

You can filter data in each widget by a specific filter or modify the range to perform a trend analysis for a metric.

The dashboard also allows you to isolate issues in your clusters so you can focus on the problem. For example, if you have a CloudWatch alarm for an Amazon Redshift metric such as DB Connections, as the preceding screenshot shows, you can drill down to view the alarm details, or drill down to the cluster to see details such as running queries and loads, and cluster and database metrics.

Cluster management

The Clusters page allows you to view node details and key performance metrics for the clusters, tags, and notifications. You can customize this page to include your preferred metric and areas of focus. In this post, the page includes a maintenance track and release status. Through the Actions menu, you can perform frequent operations such as modify, resize, manage tags, reboot, create a snapshot or configure a cross-region snapshot, or delete your cluster.

The following screenshot shows the Clusters page and options on the Actions menu.

Cluster details

If you are diagnosing a performance problem, you can drill down to view cluster details. The cluster details page shows the details for a specific cluster, organized in the following categories:

  • Query monitoring
  • Cluster performance
  • Maintenance
  • Backup
  • Properties

The Actions menu allows you to perform everyday operations such as resize, reboot, or change configuration. You can also modify the password for your master user for the database, create a snapshot, or restore a table in your database.

The following screenshot shows the details page for a cluster and the available menus and options.

Query and Load monitoring

You can monitor the current workload for your WLM queues and view critical metrics such as active queries and query throughput.

The following screenshot shows the Query monitoring section and the time-series view of the breakdown of the workload.

Cluster and database monitoring

You can view cluster and database performance metrics for your cluster, such as CPU utilization, percentage of disk space used, and database connections using the Cluster performance tab. You can also change the time period and the granularity of the data.

The following screenshot shows the Cluster performance section.

You can change the default metric view to the metrics of your choice by using the config button. You can display 6, 9, or 20 metrics on this page; keeping the default set of six metrics is recommended for optimal page performance.

Cluster maintenance

The Maintenance and monitoring tab allows you to view your CloudWatch alarms, events, and maintenance level. You can also enable auditing, create CloudWatch alarms, and create maintenance windows for your cluster.

The following screenshot shows the Maintenance and monitoring section.

Backup and restore

Amazon Redshift automatically takes incremental snapshots (backups) of your data every eight hours or 5 GB per node of data change. Alternatively, you can create a snapshot schedule to control when to create snapshots. You can create a new cluster by restoring from an existing snapshot.
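
If you prefer to script a snapshot schedule, a sketch with the AWS CLI (schedule and cluster identifiers are placeholders) might look like this:

aws redshift create-snapshot-schedule \
  --schedule-identifier every-12-hours \
  --schedule-definitions "rate(12 hours)"

aws redshift modify-cluster-snapshot-schedule \
  --cluster-identifier my-cluster \
  --schedule-identifier every-12-hours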

You can view, search your snapshots, enable automatic snapshots, take a manual snapshot, and configure cross-region snapshots from this page.

You can also create a new cluster from an existing snapshot. With this release, Amazon Redshift allows you to restore snapshots to create a cluster with different node types than the original version. You can either migrate your Amazon Redshift data warehouse to a more powerful node type or clone to a smaller node type for development and testing.

You can view snapshots for your clusters by navigating the snapshots submenu from the cluster menu. You can search, view, and delete snapshots generated by existing and deleted clusters. You can restore a cluster from a snapshot, including a snapshot generated from a deleted cluster.

The following screenshot shows the Backup section and available actions in the Snapshots section.

Properties

Amazon Redshift clusters provide several customization options. The properties tab allows you to view and edit standard configurations such as IAM roles, cluster configuration details, database configurations, and network configurations. You can also manage tags for your cluster.

You need the JDBC and ODBC URLs to connect to a cluster from different ETL, Business Intelligence, and developer tools. This tab also provides connection details such as the JDBC and ODBC URLs and the IP addresses of the nodes in the cluster.

The following screenshot shows the Properties section.

Conclusion

This post showed you the new features in the Amazon Redshift management console, such as the monitoring dashboard and the updated flows to create, manage, and monitor Amazon Redshift clusters. The Clusters page logically organizes information for better navigation and simplifies everyday operations. Switch today to the new Redshift console and simplify management of your Amazon Redshift clusters.

In a later post, you can learn how the Amazon Redshift console improves query monitoring and helps you diagnose query performance.

 


About the Authors

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry and has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

 

 

 

 

Raja Bhogi is an engineering manager at AWS. He is responsible for building delightful and easy-to-use web experiences for analytics and blockchain products. His work includes launching web experiences for new analytics products, and working on new feature launches for existing products. He is passionate about web technologies, performance insights, and tuning. He is a thrill seeker and enjoys everything from roller coasters to bungy jumping.

 

 

 

 

Orchestrate Amazon Redshift-Based ETL workflows with AWS Step Functions and AWS Glue

Post Syndicated from Ben Romano original https://aws.amazon.com/blogs/big-data/orchestrate-amazon-redshift-based-etl-workflows-with-aws-step-functions-and-aws-glue/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud that offers fast query performance using the same SQL-based tools and business intelligence applications that you use today. Many customers also like to use Amazon Redshift as an extract, transform, and load (ETL) engine to use existing SQL developer skillsets, to quickly migrate pre-existing SQL-based ETL scripts, and—because Amazon Redshift is fully ACID-compliant—as an efficient mechanism to merge change data from source data systems.

In this post, I show how to use AWS Step Functions and AWS Glue Python Shell to orchestrate tasks for those Amazon Redshift-based ETL workflows in a completely serverless fashion. AWS Glue Python Shell is a Python runtime environment for running small to medium-sized ETL tasks, such as submitting SQL queries and waiting for a response. Step Functions lets you coordinate multiple AWS services into workflows so you can easily run and monitor a series of ETL tasks. Both AWS Glue Python Shell and Step Functions are serverless, allowing you to automatically run and scale them in response to events you define, rather than requiring you to provision, scale, and manage servers.

While many traditional SQL-based workflows use internal database constructs like triggers and stored procedures, separating workflow orchestration, task, and compute engine components into standalone services allows you to develop, optimize, and even reuse each component independently. So, while this post uses Amazon Redshift as an example, my aim is to more generally show you how to orchestrate any SQL-based ETL.

Prerequisites

If you want to follow along with the examples in this post using your own AWS account, you need a Virtual Private Cloud (VPC) with at least two private subnets that have routes to an S3 VPC endpoint.

If you don’t have a VPC, or are unsure if yours meets these requirements, I provide an AWS CloudFormation template stack you can launch by selecting the following button. Provide a stack name on the first page and leave the default settings for everything else. Wait for the stack to display Create Complete (this should only take a few minutes) before moving on to the other sections.

Scenario

For the examples in this post, I use the Amazon Customer Reviews Dataset to build an ETL workflow that completes the following two tasks, which together represent a simple ETL process:

  • Task 1: Move a copy of the dataset containing reviews from the year 2015 and later from S3 to an Amazon Redshift table.
  • Task 2: Generate a set of output files to another Amazon S3 location which identifies the “most helpful” reviews by market and product category, allowing an analytics team to glean information about high quality reviews.

This dataset is publicly available via an Amazon Simple Storage Service (Amazon S3) bucket. Complete the following tasks to get set up.

Solution overview

The following diagram highlights the solution architecture from end to end:

The steps in this process are as follows:

  1. The state machine launches a series of runs of an AWS Glue Python Shell job (more on how and why I use a single job later in this post!) with parameters for retrieving database connection information from AWS Secrets Manager and an .sql file from S3.
  2. Each run of the AWS Glue Python Shell job uses the database connection information to connect to the Amazon Redshift cluster and submit the queries contained in the .sql file.
    1. For Task 1: The cluster utilizes Amazon Redshift Spectrum to read data from S3 and load it into an Amazon Redshift table. Amazon Redshift Spectrum is commonly used as a means of loading data into Amazon Redshift. (See Step 7 of Twelve Best Practices for Amazon Redshift Spectrum for more information.)
    2. For Task 2: The cluster executes an aggregation query and exports the results to another Amazon S3 location via UNLOAD.
  3. The state machine may send a notification to an Amazon Simple Notification Service (SNS) topic in the case of pipeline failure.
  4. Users can query the data from the cluster and/or retrieve report output files directly from S3.

I include an AWS CloudFormation template to jumpstart the ETL environment so that I can focus this post on the steps dedicated to building the task and orchestration components. The template launches the following resources:

  • Amazon Redshift Cluster
  • Secrets Manager secret for storing Amazon Redshift cluster information and credentials
  • S3 Bucket preloaded with Python scripts and .sql files
  • Identity and Access Management (IAM) Role for AWS Glue Python Shell jobs

See the following resources for how to complete these steps manually:

Be sure to select at least two private subnets and the corresponding VPC, as shown in the following screenshot. If you are using the VPC template from above, the VPC appears as 10.71.0.0/16 and the subnet names are A private and B private.

The stack should take 10-15 minutes to launch. Once it displays Create Complete, you can move on to the next section. Be sure to take note of the Resources tab in the AWS CloudFormation console, shown in the following screenshot, as I refer to these resources throughout the post.

Building with AWS Glue Python Shell

Begin by navigating to AWS Glue in the AWS Management Console.

Making a connection

The Amazon Redshift cluster resides in a VPC, so you first need to create a connection using AWS Glue. Connections contain the properties, including VPC networking information, needed to access your data stores. You eventually attach this connection to your AWS Glue Python Shell job so that it can reach your Amazon Redshift cluster.

Select Connections from the menu bar, and then select Add connection. Give your connection a name like blog_rs_connection,  select Amazon Redshift as the Connection type, and then select Next, as shown in the following screenshot.

Under Cluster, enter the name of the cluster that the AWS CloudFormation template launched, i.e., blogstack-redshiftcluster-####. Because the Python code I provide for this blog already handles credential retrieval, the rest of the values around database information you enter here are largely placeholders. The key information you are associating with the connection is networking-related.

Please note that you are not able to test the connection without the correct cluster information.  If you are interested in doing so, note that Database name and Username are auto-populated after selecting the correct cluster, as shown in the following screenshot. Follow the instructions here to retrieve the password information from Secrets Manager to copy into the Password field.

ETL code review

Take a look at the two main Python scripts used in this example:

Pygresql_redshift_common.py is a set of functions that can retrieve cluster connection information and credentials from Secrets Manager, make a connection to the cluster, and submit queries, respectively. By retrieving cluster information at runtime via a passed parameter, these functions allow the job to connect to any cluster to which it has access. You can package these functions into a library by following the instructions to create a Python .egg file (already completed as a part of the AWS CloudFormation template launch). Note that AWS Glue Python Shell supports several Python libraries natively.

import pg
import boto3
import base64
from botocore.exceptions import ClientError
import json

#uses the Secrets Manager secret name to return connection and credential information
def connection_info(db):

	session = boto3.session.Session()
	client = session.client(
		service_name='secretsmanager'
	)

	get_secret_value_response = client.get_secret_value(SecretId=db)

	if 'SecretString' in get_secret_value_response:
		secret = json.loads(get_secret_value_response['SecretString'])
	else:
		secret = json.loads(base64.b64decode(get_secret_value_response['SecretBinary']))
		
	return secret


#creates a connection to the cluster
def get_connection(db,db_creds):

	con_params = connection_info(db_creds)
	
	rs_conn_string = "host=%s port=%s dbname=%s user=%s password=%s" % (con_params['host'], con_params['port'], db, con_params['username'], con_params['password'])
	rs_conn = pg.connect(dbname=rs_conn_string)
	rs_conn.query("set statement_timeout = 1200000")
	
	return rs_conn


#submits a query to the cluster
def query(con,statement):
    res = con.query(statement)
    return res

The AWS Glue Python Shell job runs rs_query.py when called. It starts by parsing job arguments that are passed at invocation. It uses some of those arguments to retrieve a .sql file from S3, then connects and submits the statements within the file to the cluster using the functions from pygresql_redshift_common.py. So, in addition to connecting to any cluster using the Python library you just packaged, it can also retrieve and run any SQL statement. This means you can manage a single AWS Glue Python Shell job for all of your Amazon Redshift-based ETL by simply passing in parameters on where it should connect and what it should submit to complete each task in your pipeline.

from redshift_module import pygresql_redshift_common as rs_common
import sys
from awsglue.utils import getResolvedOptions
import boto3

#get job args
args = getResolvedOptions(sys.argv,['db','db_creds','bucket','file'])
db = args['db']
db_creds = args['db_creds']
bucket = args['bucket']
file = args['file']

#get sql statements
s3 = boto3.client('s3') 
sqls = s3.get_object(Bucket=bucket, Key=file)['Body'].read().decode('utf-8')
sqls = sqls.split(';')

#get database connection
print('connecting...')
con = rs_common.get_connection(db,db_creds)

#run each sql statement
print("connected...running query...")
results = []
for sql in sqls[:-1]:
    sql = sql + ';'
    result = rs_common.query(con, sql)
    print(result)
    results.append(result)

print(results)

Creating the Glue Python Shell Job

Next, put that code into action:

  1. Navigate to Jobs on the left menu of the AWS Glue console page and from there, select Add job.
  2. Give the job a name like blog_rs_query.
  3. For the IAM role, select the same GlueExecutionRole you previously noted from the Resources section of the AWS CloudFormation console.
  4. For Type, select Python shell, leave Python version as the default of Python 3, and for This job runs select An existing script that you provide.
  5. For S3 path where the script is stored, navigate to the script bucket created by the AWS CloudFormation template (look for ScriptBucket in the Resources), then select the python/py file.
  6. Expand the Security configuration, script libraries, and job parameters section to add the Python .egg file with the Amazon Redshift connection library to the Python library path. It is also located in the script bucket under python/redshift_module-0.1-py3.6.egg.

When all is said and done everything should look as it does in the following screenshot:

Choose Next. Add the connection you created by choosing Select to move it under Required connections. (Recall from the Making a connection section that this gives the job the ability to interact with your VPC.) Choose Save job and edit script to finish, as shown in the following screenshot.
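
Once the job and its connection exist, you could also trigger a one-off run from the AWS CLI to verify the argument plumbing before orchestrating it with Step Functions; a sketch (with placeholder secret, bucket, and key names) might look like this:

aws glue start-job-run \
  --job-name blog_rs_query \
  --arguments '{"--db":"dev","--db_creds":"my-redshift-secret","--bucket":"my-script-bucket","--file":"sql/create_tables.sql"}'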

Test driving the Python Shell job

After creating the job, you are taken to the AWS Glue Python Shell IDE. If everything went well, you should see the rs_query.py code. Right now, the Amazon Redshift cluster is sitting there empty, so use the Python code to run the following SQL statements and populate it with tables.

  1. Create an external database (amzreviews).
  2. Create an external table (reviews) from which Amazon Redshift Spectrum can read the source data in S3 (the public reviews dataset). The table is partitioned by product_category because the source files are organized by category, but in general you should partition on frequently filtered columns (see #4).
  3. Add partitions to the external table.
  4. Create an internal table (reviews) local to the Amazon Redshift cluster. product_id works well as a DISTKEY because it has high cardinality and even distribution, and it is most likely (although not explicitly part of this blog’s scenario) a column that will be used to join with other tables. I choose review_date as a SORTKEY to efficiently filter out review data that is not part of my target query (after 2015). Learn more about how to best choose DISTKEY/SORTKEY, as well as additional table design parameters for optimizing performance, by reading the Designing Tables documentation.
    CREATE EXTERNAL SCHEMA amzreviews 
    from data catalog
    database 'amzreviews'
    iam_role 'rolearn'
    CREATE EXTERNAL database IF NOT EXISTS;
    
    
    
    CREATE EXTERNAL TABLE amzreviews.reviews(
      marketplace varchar(10), 
      customer_id varchar(15), 
      review_id varchar(15), 
      product_id varchar(25), 
      product_parent varchar(15), 
      product_title varchar(50), 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine varchar(5), 
      verified_purchase varchar(5), 
      review_headline varchar(25), 
      review_body varchar(1024), 
      review_date date, 
      year int)
    PARTITIONED BY ( 
      product_category varchar(25))
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://amazon-reviews-pds/parquet/';
      
      
      
    ALTER TABLE amzreviews.reviews ADD
    partition(product_category='Apparel') 
    location 's3://amazon-reviews-pds/parquet/product_category=Apparel/'
    partition(product_category='Automotive') 
    location 's3://amazon-reviews-pds/parquet/product_category=Automotive/'
    partition(product_category='Baby') 
    location 's3://amazon-reviews-pds/parquet/product_category=Baby/'
    partition(product_category='Beauty') 
    location 's3://amazon-reviews-pds/parquet/product_category=Beauty/'
    partition(product_category='Books') 
    location 's3://amazon-reviews-pds/parquet/product_category=Books/'
    partition(product_category='Camera') 
    location 's3://amazon-reviews-pds/parquet/product_category=Camera/'
    partition(product_category='Grocery') 
    location 's3://amazon-reviews-pds/parquet/product_category=Grocery/'
    partition(product_category='Furniture') 
    location 's3://amazon-reviews-pds/parquet/product_category=Furniture/'
    partition(product_category='Watches') 
    location 's3://amazon-reviews-pds/parquet/product_category=Watches/'
    partition(product_category='Lawn_and_Garden') 
    location 's3://amazon-reviews-pds/parquet/product_category=Lawn_and_Garden/';
    
    
    CREATE TABLE reviews(
      marketplace varchar(10),
      customer_id varchar(15), 
      review_id varchar(15), 
      product_id varchar(25) DISTKEY, 
      product_parent varchar(15), 
      product_title varchar(50), 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine varchar(5), 
      verified_purchase varchar(5), 
      review_date date, 
      year int,
      product_category varchar(25))
    SORTKEY (review_date);

Run this first job manually so you can see where all of the elements I’ve discussed come into play. Select Run Job at the top of the IDE screen. Expand the Security configuration, script libraries, and job parameters section. This is where you add the parameters as key-value pairs, as shown in the following screenshot.

Key         Value
--db        reviews
--db_creds  reviewssecret
--bucket    <name of s3 script bucket>
--file      sql/reviewsschema.sql

Select Run job to start it. The job should take a few seconds to complete. You can look for log outputs below the code in the IDE to watch job progress.
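
You can also start the same job programmatically instead of from the console. The following is a minimal boto3 sketch, assuming the job name blog_rs_query from this walkthrough and a placeholder script bucket name:

import boto3

glue = boto3.client('glue')

# Start the shared Python Shell job with the same key-value parameters
# you would otherwise enter in the console.
response = glue.start_job_run(
    JobName='blog_rs_query',
    Arguments={
        '--db': 'reviews',
        '--db_creds': 'reviewssecret',
        '--bucket': 'name-of-s3-script-bucket',  # replace with your script bucket
        '--file': 'sql/reviewsschema.sql'
    }
)
print(response['JobRunId'])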

Once the job completes, navigate to Databases in the AWS Glue console and look for the amzreviews database and reviews table, as shown in the following screenshot. If they are there, then everything worked as planned! You can also connect to your Amazon Redshift cluster using the Redshift Query Editor or with your own SQL client tool and look for the local reviews table.

Step Functions Orchestration

Now that you’ve had a chance to run a job manually, it’s time to move onto something more programmatic that is orchestrated by Step Functions.

Launch Template

I provide a third AWS CloudFormation template for kickstarting this process as well. It creates a Step Functions state machine that calls two instances of the AWS Glue Python Shell job you just created to complete the two tasks I outlined at the beginning of this post.

For BucketName, paste the name of the script bucket created in the second AWS CloudFormation stack. For GlueJobName, type in the name of the job you just created. Leave the other information as default, as shown in the following screenshot. Launch the stack and wait for it to display Create Complete—this should take only a couple of minutes—before moving on to the next section.

Working with the Step Functions State Machine

State Machines are made up of a series of steps, allowing you to stitch together services into robust ETL workflows. You can monitor each step of execution as it happens, which means you can identify and fix problems in your ETL workflow quickly, and even automatically.

Take a look at the state machine you just launched to get a better idea. Navigate to Step Functions in the AWS Console and look for a state machine with a name like GlueJobStateMachine-######. Choose Edit to view the state machine configuration, as shown in the following screenshot.

It should look as it does in the following screenshot:

As you can see, state machines are created using JSON templates made up of task definitions and workflow logic. You can run parallel tasks, catch errors, and even pause workflows and wait for manual callback to continue. The example I provide contains two tasks for running the SQL statements that complete the goals I outlined at the beginning of the post:

  1. Load data from S3 using Redshift Spectrum
  2. Transform and write data back to S3

Each task contains basic error handling which, if caught, routes the workflow to an error notification task. This example is a simple one to show you how to build a basic workflow, but you can refer to the Step Functions documentation for an example of more complex workflows to help build a robust ETL pipeline. Step Functions also supports reusing modular components with Nested Workflows.
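
To make the JSON structure concrete, the following is a hedged sketch of how a similar definition could be created from Python with boto3. The state names, job names, SQL file keys, topic ARN, and role ARN are illustrative placeholders, and the template actually deployed by the CloudFormation stack may differ in its details.

import json
import boto3

# Two sequential Glue Python Shell tasks; a failure in either one is
# caught and routed to an SNS notification state.
definition = {
    "StartAt": "ReadFilterJob",
    "States": {
        "ReadFilterJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "blog_rs_query",
                "Arguments": {
                    "--db": "reviews",
                    "--db_creds": "reviewssecret",
                    "--bucket": "name-of-s3-script-bucket",
                    "--file": "sql/insert.sql"
                }
            },
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "WriteJob"
        },
        "WriteJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "blog_rs_query",
                "Arguments": {
                    "--db": "reviews",
                    "--db_creds": "reviewssecret",
                    "--bucket": "name-of-s3-script-bucket",
                    "--file": "sql/unload.sql"
                }
            },
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-errors",
                "Message": "An Amazon Redshift ETL step failed."
            },
            "End": True
        }
    }
}

sfn = boto3.client('stepfunctions')
sfn.create_state_machine(
    name='GlueJobStateMachine-example',
    definition=json.dumps(definition),
    roleArn='arn:aws:iam::123456789012:role/StepFunctionsExecutionRole'
)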

SQL Review

The state machine will retrieve and run the following SQL statements:

INSERT INTO reviews
SELECT marketplace, customer_id, review_id, product_id, product_parent, product_title, star_rating, helpful_votes, total_votes, vine, verified_purchase, review_date, year, product_category
FROM amzreviews.reviews
WHERE year > 2015;

As I mentioned previously, Amazon Redshift Spectrum is a great way to run ETL using an INSERT INTO statement. This example is a simple load of the data as it is in S3, but keep in mind you can add more complex SQL statements to transform your data prior to loading.

UNLOAD ('SELECT marketplace, product_category, product_title, review_id, helpful_votes, AVG(star_rating) as average_stars FROM reviews GROUP BY marketplace, product_category, product_title, review_id, helpful_votes ORDER BY helpful_votes DESC, average_stars DESC')
TO 's3://bucket/testunload/'
iam_role 'rolearn';

This statement groups reviews by product, ordered by number of helpful votes, and writes to Amazon S3 using UNLOAD.

State Machine execution

Now that everything is in order, start an execution. From the state machine’s main page, select Start an Execution.

Leave the defaults as they are and select Start to begin execution. Once execution begins you are taken to a visual workflow interface where you can follow the execution progress, as shown in the following screenshot.

Each of the queries takes a few minutes to run. In the meantime, you can watch the Amazon Redshift query logs to track the query progress in real time. These can be found by navigating to Amazon Redshift in the AWS Console, selecting your Amazon Redshift cluster, and then selecting the Queries tab, as shown in the following screenshot.

Once you see COMPLETED for both queries, navigate back to the state machine execution. You should see success for each of the states, as shown in the following screenshot.

Next, navigate to the data bucket in the S3 AWS Console page (refer to the DataBucket in the CloudFormation Resources tab). If all went as planned, you should see a folder named testunload in the bucket with the unloaded data, as shown in the following screenshot.

Inject Failure into Step Functions State Machine

Next, test the error handling component of the state machine by intentionally causing an error. An easy way to do this is to edit the state machine and misspell the name of the Secrets Manager secret in the ReadFilterJob task, as shown in the following screenshot.

If you want the error output sent to you, optionally subscribe to the error notification SNS Topic. Start another state machine execution as you did previously. This time the workflow should take the path toward the NotifyFailure task, as shown in the following screenshot. If you subscribed to the SNS Topic associated with it, you should receive a message shortly thereafter.

The state machine logs will show the error in more detail, as shown in the following screenshot.

Conclusion

In this post I demonstrated how you can orchestrate Amazon Redshift-based ETL using serverless AWS Step Functions and AWS Glue Python Shell jobs. As I mentioned in the introduction, the concepts can also be more generally applied to other SQL-based ETL, so use them to start building your own SQL-based ETL pipelines today!

About the Author

Ben Romano is a Data Lab solution architect at AWS. Ben helps our customers architect and build data and analytics prototypes in just four days in the AWS Data Lab.

Protect and Audit PII data in Amazon Redshift with DataSunrise Security

Post Syndicated from Saunak Chandra original https://aws.amazon.com/blogs/big-data/protect-and-audit-pii-data-in-amazon-redshift-with-datasunrise-security/

DataSunrise, in their own words: DataSunrise is a database security software company that offers a breadth of security solutions, including data masking (dynamic and static masking), activity monitoring, database firewalls, and sensitive data discovery for various databases. The goal is to protect databases against external and internal threats and vulnerabilities. Customers often choose DataSunrise Database Security because it gives them unified control and a single-user experience when protecting different database engines that run on AWS, including Amazon Redshift, Amazon Aurora, all Amazon RDS database engines, Amazon DynamoDB, and Amazon Athena, among others. DataSunrise Security Suite is a set of tools that can protect and audit PII data in Amazon Redshift.

DataSunrise offers passive security with data auditing in addition to active data and database security. Active security is based on predefined security policies, such as preventing unauthorized access to sensitive data, blocking suspicious SQL queries, preventing SQL-injection attacks, or dynamically masking and obfuscating data in real time. DataSunrise comes with high availability, failover, and automatic scaling.

This post focuses on active security for Amazon Redshift, in particular DataSunrise’s capabilities for masking and access control of personally identifiable information (PII), which you can back with DataSunrise’s passive security offerings such as auditing access of sensitive information. This post discusses DataSunrise security for Amazon Redshift, how it works, and how to get started.

Why you need active security for Amazon Redshift

Amazon Redshift is a massively parallel processing (MPP), fully managed petabyte-scale data warehouse (DW) solution with over 15,000 deployments worldwide. Amazon Redshift provides a database encryption mechanism to protect sensitive data, such as payment information and health insurance. For more information, see Amazon Redshift Database Encryption.

Many organizations store sensitive data, commonly classified as personally identifiable information (PII) or sensitive personal information (SPI). You may need solutions to manage access control to such sensitive information, and want to manage it efficiently and flexibly, preferably using a management tool. DataSunrise is a centralized management tool that masks that data. It resolves the PII and SPI data access control requirement by allowing you to enforce masking policies against all queries against your Amazon Redshift data warehouse.

What DataSunrise masking does

DataSunrise enables masking for queries against Amazon Redshift by acting as a proxy layer between your applications and the Amazon Redshift backend. Data flow, bindings, and so on remain transparent, while your end users receive masked or obfuscated data that allows them to do their jobs but prevents the risk of unintentionally revealing PII data.

DataSunrise can exempt users who are authorized to access this information by composing policies and choosing from predefined policy templates that would allow those users to bypass masking constraints when needed.

How it works

DataSunrise operates as a proxy between users or applications that connect to the database and the database server. DataSunrise intercepts the traffic for in-depth analysis and filtering. It applies data masking and access control policies to enforce active security policies against your PII data. When the database firewall is enabled and a security policy violation is detected, DataSunrise can block the malicious SQL query and notify administrators via SMTP or SNMP. With real-time alerts, you can maintain continuous database security and streamline compliance.

DataSunrise operates as a proxy

Getting started with DataSunrise

You can deploy DataSunrise on a Windows or Linux instance in Amazon EC2. You can download a fully prepared DataSunrise AMI from AWS Marketplace to protect your Amazon Redshift cluster. DataSunrise Database and Data Security are available for both Windows and Linux platforms.

After deploying DataSunrise, you can configure security policies for Amazon Redshift and create data masking and access control security rules. After you configure and activate the security policies, DataSunrise enacts those policies against the user and application traffic that would connect to the database through DataSunrise’s proxy.

DataSunrise customers need to configure the Amazon Redshift cluster security group inbound rule to allow DataSunrise IP. For more information, see Amazon Redshift Cluster Security Group. Additionally, you can include the DataSunrise security group in the cluster security group when it runs on the same AWS VPC. Users can execute queries only through connecting to the DataSunrise endpoint and not to the Amazon Redshift cluster endpoint. All DB users and groups are imported from Amazon Redshift into DataSunrise for authentication and authorization to Amazon Redshift objects.
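
From a client’s perspective, only the connection host (and possibly the port) changes. The following minimal sketch illustrates the idea with the PyGreSQL library; the proxy endpoint, database, and credentials are hypothetical placeholders:

import pg  # PyGreSQL classic interface

# Connect to the DataSunrise proxy endpoint instead of the
# Amazon Redshift cluster endpoint; everything else stays the same.
con = pg.connect(
    dbname='dev',
    host='datasunrise-proxy.example.internal',
    port=5439,
    user='analyst',
    passwd='********'
)
print(con.query('SELECT current_user').getresult())
con.close()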

Creating a dynamic data masking rule

Masking obfuscates part of a column value or the entire value. When a column is masked, its values are replaced with fake values, either by replacing some of the original characters with fake ones or by applying masking functions. DataSunrise has many built-in masking functions for credit card numbers, email addresses, and so on. Masking protects sensitive or personally identifiable data such as credit card numbers. It is not the same as encryption or hashing, which apply an algorithm to a value to convert it into another value.

You can create dynamic masking rules using object-based filters in DataSunrise’s console. DataSunrise identifies the protected objects during application calls and enforces those security rules against targeted operations, schemas, or objects in general within your Amazon Redshift cluster. Security administrators can enable those rules granularly based on the object level and caller identity. They can allow exemptions when needed by identifying authorized callers.

To perform dynamic masking in DataSunrise, you need to create data masking rules as part of defining such security policies.

Complete the following steps to create those masking policies:

  1. In the DataSunrise console, choose Masking > Dynamic Masking Rules.
  2. Choose Add Rule. Add required information.

Create Dynamic Data Masking Rule

  3. In the Masking Settings section, click Select, navigate to a table in a schema, and check the columns you want to mask. See the following screenshot of the Check Columns page:

Redshift columns to enable dynamic masking

Click Done after you decide which columns to protect, then choose the masking method and any relevant settings so that the masked output still supports business use.

In Add Rule, Filter Sessions, you can choose which users, applications, hosts, and more are affected by this rule.

Creating a static data masking rule

You can mask data permanently with static masking, as opposed to dynamic data masking. Static masking stores the masked objects permanently in a separate schema or database: DataSunrise copies each selected table into that schema or database, so static masking may require additional storage space. Some of these tables have columns whose masked content is stored on disk. The replicated schema is fully functional, and you can run user queries against it. The source tables remain untouched, so their unmasked data can still be viewed. If the original data changes, you must repeat the static masking procedure; in that case, first truncate the tables that contain previously masked data.

  1. From the menu, choose Masking > Static Masking.
  2. In New Static Masking Task, in Source and Target Instances, choose the source database, schema, and corresponding target destination. See the following screenshot of the New Static Masking Task page:
  3. In Select Source Tables to Transfer and Columns to Mask, choose the objects to which you wish to apply masking. The following screenshot shows the list of available tables:

DataSunrise also enables you to reschedule recurring static masking jobs so you can refresh your masked records based on your source or production data.

Static data masking under DataSunrise applies to Amazon Redshift local tables. In addition to local tables, Amazon Redshift allows querying external tables in Amazon S3; however, DataSunrise does not support static masking on data stored in Amazon S3 accessed in Amazon Redshift via external tables. For more information, see Using Amazon Redshift Spectrum to Query External Data.

Creating a security/access control rule

While data masking can help in many cases to give your Amazon Redshift users the appropriate access, you may need further enforcement of access control to filter out any operations that might violate your security strategy. DataSunrise can import database users’ and groups’ metadata from Amazon Redshift, which the DataSunrise administrator can use to configure the security profile. If you already have a set of users defined in your existing Amazon Redshift database, you don’t need to re-create them for DataSunrise. DataSunrise uses this list of users only to configure rules. By default, DataSunrise does not modify any traffic related to the authentication of database users.

  1. In the DataSunrise console, from the menu, choose Security > Rules. The following screenshot shows the Security Rules page:
  2. Choose Add Rule. The following screenshot shows the details you can enter to compose this new rule:

DataSunrise also allows you to compose your rule to restrict or allow certain users, applications, hosts, and so on from performing activities that you consider prohibited against particular objects or areas within your Amazon Redshift cluster.

The following screenshot shows the Filter Sessions page:

DataSunrise enables you to create rules for specific objects, query groups, query types, and SQL injection activities, and trigger actions when authorization errors occur.

Static masking doesn’t have a direct impact on performance, but if a customer uses the DataSunrise custom function, it could impact performance because custom functions execute on the DataSunrise server.

Using DataSunrise audit and compliance policies

From the console, choose Compliance > Add Compliance.

In the Compliance orchestrator page, you can initiate a scan of your Amazon Redshift cluster to identify all PII data or sensitive information in general, per your compliance standards. DataSunrise comes bundled with internal scans for HIPAA, GDPR, and other compliance standards, but you can create or amend any of those libraries to accommodate any special requirements that your security strategy mandates. The following screenshot shows the Add Compliance page:

After completing the scan, DataSunrise guides you through the process of composing rules for sensitive information within your Amazon Redshift cluster.

You can also create your audit rules manually. The following screenshot shows the New Audit Rule page:

You can set audit rules for any restriction to make sure that transactional trails are only collected when necessary. You can target objects starting from the entire database down to a single column in your Amazon Redshift cluster. See the following screenshot:

Conclusion

DataSunrise’s masking feature allows for descriptive specifications of access control to sensitive columns, in addition to the built-in encryption provided by the Amazon Redshift cluster. Its proxy enables more fine-grained access control, auditing, and masking capabilities to better monitor, protect, and comply with regulatory standards that address the ever-increasing needs of securing and protecting data. DataSunrise’s integration with Amazon Redshift addresses those concerns by simplifying and automating the security rules and their application. Keep your data safe and protected at all times!

To get started with DataSunrise with Amazon Redshift, visit DataSunrise in AWS Marketplace.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

About the Authors


Saunak Chandra is a senior partner solutions architect for Redshift at AWS.
Saunak likes to experiment with new products in the technology space, alongside his day-to-day work. He loves exploring nature in the Pacific Northwest. A short hike or bike ride on the trails is his favorite weekend morning routine. He also likes to do yoga when he finds time away from his kid.

Radik Chumaren is an engineering leader at DataSunrise. Radik specializes in heterogeneous database environments, with a focus on building database security software in the cloud. He enjoys reading and playing soccer.

Automate Amazon Redshift cluster creation using AWS CloudFormation

Post Syndicated from Sudhir Gupta original https://aws.amazon.com/blogs/big-data/automate-amazon-redshift-cluster-creation-using-aws-cloudformation/

In this post, I explain how to automate the deployment of an Amazon Redshift cluster in an AWS account. AWS best practices for security and high availability drive the cluster’s configuration, and you can create it quickly by using AWS CloudFormation. I walk you through a set of sample CloudFormation templates, which you can customize as per your needs.

Amazon Redshift is a fast, scalable, fully managed, ACID and ANSI SQL-compliant cloud data warehouse service. You can set up and deploy a new data warehouse in minutes, and run queries across petabytes of structured data stored in Amazon Redshift. Amazon Redshift Spectrum extends your data warehousing capability to data lakes built on Amazon S3, allowing you to query exabytes of structured and semi-structured data in its native format, without requiring you to load the data. Amazon Redshift delivers faster performance than other data warehouse databases by using machine learning, massively parallel query execution, and columnar storage on high-performance disk. You can configure Amazon Redshift to scale up and down in minutes, as well as expand compute power automatically to ensure unlimited concurrency.

As you begin your journey with Amazon Redshift and set up AWS resources based on the recommended best practices of AWS Well-Architected Framework, you can use the CloudFormation templates provided here. With the modular approach, you can choose to build AWS infrastructure from scratch, or you can deploy Amazon Redshift into an existing virtual private cloud (VPC).

Benefits of using CloudFormation templates

With an AWS CloudFormation template, you can condense hundreds of manual procedures into a few steps listed in a text file. The declarative code in the file captures the intended state of the resources to create, and you can choose to automate the creation of hundreds of AWS resources. This template becomes the single source of truth for your infrastructure.

A CloudFormation template acts as an accelerator. It helps you automate the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

Architecture overview

The following architecture diagram and summary describe the solution that this post uses.

Figure 1: Architecture diagram

The sample CloudFormation templates provision the network infrastructure and all the components shown in the architecture diagram.

I broke the CloudFormation templates into the following three stacks:

  1. A CloudFormation template to set up a VPC, subnets, route tables, internet gateway, NAT gateway, Amazon S3 gateway endpoint, and other networking components.
  2. A CloudFormation template to set up an Amazon Linux bastion host in an Auto Scaling group to connect to the Amazon Redshift cluster.
  3. A CloudFormation template to set up an Amazon Redshift cluster, CloudWatch alarms, AWS Glue Data Catalog, and an Amazon Redshift IAM role for Amazon Redshift Spectrum and ETL jobs.

I integrated the stacks using exported output values. Using three different CloudFormation stacks instead of one nested stack gives you additional flexibility. For example, you can choose to deploy the VPC and bastion host CloudFormation stacks one time and the Amazon Redshift cluster CloudFormation stack multiple times in an AWS Region.
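
If you script the deployment rather than using the console, you can pass the name of an already-created stack as a parameter to the next one. The following is a hedged boto3 sketch; the template URLs, stack names, and parameter keys are illustrative placeholders that depend on the actual templates:

import boto3

cfn = boto3.client('cloudformation')
waiter = cfn.get_waiter('stack_create_complete')

# Deploy the VPC stack first, then reference its stack name when
# creating the bastion host and Amazon Redshift stacks.
cfn.create_stack(StackName='rsVPC',
                 TemplateURL='https://s3.amazonaws.com/my-bucket/vpc.yaml')
waiter.wait(StackName='rsVPC')

cfn.create_stack(StackName='rsBastion',
                 TemplateURL='https://s3.amazonaws.com/my-bucket/bastion.yaml',
                 Parameters=[{'ParameterKey': 'ParentVPCStack', 'ParameterValue': 'rsVPC'}],
                 Capabilities=['CAPABILITY_IAM'])
waiter.wait(StackName='rsBastion')

cfn.create_stack(StackName='rsdb',
                 TemplateURL='https://s3.amazonaws.com/my-bucket/redshift.yaml',
                 Parameters=[
                     {'ParameterKey': 'ParentVPCStack', 'ParameterValue': 'rsVPC'},
                     {'ParameterKey': 'ParentBastionStack', 'ParameterValue': 'rsBastion'}],
                 Capabilities=['CAPABILITY_IAM'])
waiter.wait(StackName='rsdb')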

Best practices

The architecture built by these CloudFormation templates supports AWS best practices for high availability and security.

The VPC CloudFormation template takes care of the following:

  1. Configures three Availability Zones for high availability and disaster recovery. It geographically distributes the zones within a Region for best insulation and stability in the event of a natural disaster.
  2. Provisions one public subnet and one private subnet for each zone. I recommend using public subnets for external-facing resources and private subnets for internal resources to reduce the risk of data exfiltration.
  3. Creates and associates network ACLs with default rules to the private and public subnets. AWS recommends using network ACLs as firewalls to control inbound and outbound traffic at the subnet level. These network ACLs provide individual controls that you can customize as a second layer of defense.
  4. Creates and associates independent routing tables for each of the private subnets, which you can configure to control the flow of traffic within and outside the VPC. The public subnets share a single routing table because they all use the same internet gateway as the sole route to communicate with the internet.
  5. Creates a NAT gateway in each of the three public subnets for high availability. NAT gateways offer significant advantages over NAT instances in terms of deployment, availability, and maintenance. NAT gateways allow instances in a private subnet to connect to the internet or other AWS services even as they prevent the internet from initiating a connection with those instances.
  6. Creates a VPC endpoint for Amazon S3. Amazon Redshift and other AWS resources running in a private subnet of a VPC can connect privately to access S3 buckets. For example, data loading from S3 and unloading data to S3 happen over a private, secure, and reliable connection.

The Amazon Linux bastion host CloudFormation template takes care of the following:

  1. Creates an Auto Scaling group spread across the three public subnets set up by the VPC CloudFormation template. The Auto Scaling group keeps the Amazon Linux bastion host available in one of the three Availability Zones.
  2. Sets up an Elastic IP address and associates it with the Amazon Linux bastion host. An Elastic IP address makes the host easier to remember and easier to allow through on-premises firewalls. If your system terminates an instance and the Auto Scaling group launches a new instance in its place, the existing Elastic IP address re-associates with the new instance automatically. This lets you use the same trusted Elastic IP address at all times.
  3. Sets up an Amazon EC2 security group and associates with the Amazon Linux bastion host. This allows you to lock down access to the bastion hosts, only allowing inbound traffic from known CIDR scopes and ports.
  4. Creates an Amazon CloudWatch Logs log group to hold the Amazon Linux bastion host’s shell history logs and sets up a CloudWatch metric to track SSH command counts. This helps with security audits by allowing you to check who accesses the bastion host and when.
  5. Creates a CloudWatch alarm to monitor the CPU on the bastion host and send an Amazon SNS notification when anything triggers the alarm.

The Amazon Redshift cluster template takes care of the following:

  1. Creates an Amazon Redshift cluster subnet group that spans multiple Availability Zones, so that you can create different clusters in different zones to minimize the impact of the failure of one zone.
  2. Configures database auditing and stores audit logs in an S3 bucket. It also restricts access to that bucket to the Amazon Redshift logging service and configures lifecycle rules to archive logs older than 14 days to Amazon S3 Glacier.
  3. Creates an IAM role with a policy to grant the minimum permissions required to use Amazon Redshift Spectrum to access S3, CloudWatch Logs, AWS Glue, and Amazon Athena. It then associates this IAM role with Amazon Redshift.
  4. Creates an EC2 security group and associates it with the Amazon Redshift cluster. This allows you to lock down access to the Amazon Redshift cluster to known CIDR scopes and ports.
  5. Creates an Amazon Redshift cluster parameter group with the following configuration and associates it with the Amazon Redshift cluster. These parameters are only a general guide. Review and customize them to suit your needs.

Parameter: enable_user_activity_logging
Value: TRUE
Description: This enables the user activity log. For more information, see Database Audit Logging.

Parameter: require_ssl
Value: TRUE
Description: This enables SSL connections to the Amazon Redshift cluster.

Parameter: wlm_json_configuration
Value:
[ {
  "query_group" : [ ],
  "query_group_wild_card" : 0,
  "user_group" : [ ],
  "user_group_wild_card" : 0,
  "concurrency_scaling" : "auto",
  "rules" : [ {
    "rule_name" : "DiskSpilling",
    "predicate" : [ {
      "metric_name" : "query_temp_blocks_to_disk",
      "operator" : ">",
      "value" : 100000
    } ],
    "action" : "log",
    "value" : ""
  }, {
    "rule_name" : "RowJoining",
    "predicate" : [ {
      "metric_name" : "join_row_count",
      "operator" : ">",
      "value" : 1000000000
    } ],
    "action" : "log",
    "value" : ""
  } ],
  "auto_wlm" : true
}, {
  "short_query_queue" : true
} ]
Description: This creates a custom workload management (WLM) queue with the following configuration:
  • Auto WLM: Amazon Redshift manages query concurrency and memory allocation automatically, as per workload.
  • Short Query Acceleration (SQA): Amazon Redshift executes short-running queries in a dedicated space so that SQA queries aren’t forced to wait in queues behind longer queries.
  • Concurrency Scaling is enabled for the queries routed to this WLM queue.
  • Two WLM query monitoring rules (QMR): log queries when the temporary disk space used to write intermediate results exceeds 100 GB, and log queries when the number of rows processed in a join step exceeds one billion rows.
You can also create different rules based on your needs and choose different actions (abort, hop, or log).

Parameter: max_concurrency_scaling_clusters
Value: 1 (or what you chose)
Description: Sets the maximum number of concurrency scaling clusters allowed when concurrency scaling is enabled.

Parameter: auto_analyze
Value: TRUE
Description: If true, Amazon Redshift continuously monitors your database and automatically performs analyze operations in the background.

Parameter: statement_timeout
Value: 43200000
Description: Terminates any statement that takes more than the specified number of milliseconds. The statement_timeout value is the maximum amount of time a query can run before Amazon Redshift terminates it.
  6. Configures the Amazon Redshift cluster to listen on a non-default Amazon Redshift port, according to security best practices.
  7. Creates the Amazon Redshift cluster in the private subnets according to AWS security best practices. To access the Amazon Redshift cluster, use the Amazon Linux bastion host that the Linux bastion host CloudFormation template sets up.
  8. Creates a cluster with a minimum of two nodes, unless you choose 1 for the NumberOfNodes input parameter. AWS recommends using at least two nodes per cluster for production. For more information, see the Availability and Durability section of the Amazon Redshift FAQ.
  9. Enables encryption at rest for the Amazon Redshift cluster by using the Amazon Redshift managed KMS key or a user-specified KMS key. To use a user-specified KMS key that you have not created yet, first create the KMS key. For more information, see Creating KMS Keys.
  10. Configures automated snapshot retention to 35 days for production environments and 8 days for non-production environments. This allows you to recover your production database to any point in time in the last 35 days, or the last 8 days for a non-production database.
  11. Takes a final snapshot of the Amazon Redshift database automatically when you delete the Amazon Redshift cluster by using the Delete stack option. This prevents data loss from accidental deletion of your CloudFormation stack.
  12. Creates an AWS Glue Data Catalog as a metadata store for your AWS data lake.
  13. Configures CloudWatch alarms for key CloudWatch metrics, such as PercentageDiskSpaceUsed and CPUUtilization, for the Amazon Redshift cluster, and sends an SNS notification when one of these conditions triggers the alarm.
  14. Provides the option to restore the Amazon Redshift cluster from a previously taken snapshot.
  15. Attaches common tags to the Amazon Redshift clusters and other resources. AWS recommends assigning tags to your cloud infrastructure resources to manage resource access control, cost tracking, automation, and organization.

Prerequisites

Before setting up the CloudFormation stacks, note the following prerequisites.

  1. You must have an AWS account and an IAM user with sufficient permissions to interact with the AWS Management Console and the services listed in the preceding Architecture overview section. Your IAM permissions must also include access to create IAM roles and policies created by the AWS CloudFormation template.
  2. The VPC CloudFormation stack requires three Availability Zones to set up the public and private subnets. Make sure to select an AWS Region that has at least three Availability Zones.
  3. Create an EC2 key pair in the EC2 console in the AWS Region where you are planning to set up the CloudFormation stacks. Make sure that you save the private key, as this is the only time you can do this. You use this EC2 key pair as an input parameter during setup for the Amazon Linux bastion host CloudFormation stack.
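
If you prefer to create the key pair from code instead of the console, the following is a minimal boto3 sketch; the key name and Region are illustrative:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')  # use the Region you deploy to

# Create the key pair and save the private key locally. The key material
# is returned only at creation time, so store it securely.
key = ec2.create_key_pair(KeyName='bastion-key')
with open('bastion-key.pem', 'w') as f:
    f.write(key['KeyMaterial'])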

Set up the resources using AWS CloudFormation

I provide these CloudFormation templates as a general guide. Review and customize them to suit your needs. Some of the resources deployed by these stacks incur costs as long as they remain in use.

Set up the VPC, subnets, and other networking components

This CloudFormation template creates a VPC, subnets, route tables, an internet gateway, a NAT gateway, an Amazon S3 gateway endpoint, and other networking components. Follow the steps below to create these resources in your AWS account.

  1. Log in to the AWS Management Console.
  2. In the top navigation ribbon, choose the AWS Region in which to create the stack, and choose Next. This CloudFormation stack requires three Availability Zones for setting up the public and private subnets. Select an AWS Region that has at least three Availability Zones.
  3. Choose the following Launch Stack button. This button automatically launches the AWS CloudFormation service in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console as required.
  4. The CloudFormation stack requires a few parameters, as shown in the following screenshot.
    • Stack name: Enter a meaningful name for the stack, for example, rsVPC
    • ClassB 2nd Octet: Specify the second octet of the IPv4 CIDR block for the VPC (10.XXX.0.0/16). You can specify any number between and including 0–255; for example, specify 33 to create a VPC with IPv4 CIDR block 10.33.0.0/16. To learn more about VPC and subnet sizing for IPv4, see VPC and Subnet Sizing for IPv4.

      Figure 2: VPC Stack, in the CloudFormation Console

  5. After entering all the parameter values, choose Next.
  6. On the next screen, enter any required tags, an IAM role, or any advanced options, and then choose Next.
  7. Review the details on the final screen, and choose Create.

Stack creation takes a few minutes. Check the AWS CloudFormation Resources section to see the physical IDs of the various components this stack sets up.

After this, you must set up the Amazon Linux bastion host, which you use to log in to the Amazon Redshift cluster.

Set up the Amazon Linux bastion host

This CloudFormation template creates an Amazon Linux bastion host in an Auto Scaling group. Follow the steps below to create the bastion host in the VPC.

  1. In the top navigation ribbon, choose the AWS Region in which to create the stack, and choose Next.
  2. Choose the following Launch Stack button. This button automatically launches the AWS CloudFormation service in your AWS account with a template to launch.
  3. The CloudFormation stack requires a few parameters, as shown in the following screenshots.
    • Stack name: Enter a meaningful name for the stack, for example, rsBastion.
    • Parent VPC Stack: Enter the CloudFormation stack name for the VPC stack that you set up in the previous step. Find this value in the CloudFormation console, for example, rsVPC.
    • Allowed Bastion External Access CIDR: Enter the allowed CIDR block in the x.x.x.x/x format for external SSH access to the bastion host.
    • Key Pair Name: Select the key pair name that you set up in the Prerequisites section.
    • Bastion Instance Type: Select the Amazon EC2 instance type for the bastion instance.
    • LogsRetentionInDays: Specify the number of days to retain CloudWatch log events for the bastion host.
    • SNS Notification Email: Enter the email notification list used to configure an SNS topic for sending CloudWatch alarm notifications.
    • Bastion Tenancy: Select the VPC tenancy in which you launched the bastion host.
    • Enable Banner: Select to display a banner when connecting through SSH to the bastion.
    • Bastion Banner: Use Default or provide an S3 location for the file containing the banner text that the host displays upon login.
    • Enable TCP Forwarding: Select True or False to enable or disable TCP forwarding. Setting this value to True enables TCP forwarding (SSH tunneling). This can be useful, but it also presents a security risk, so I recommend that you keep the default (disabled) setting unless required.
    • Enable X11 Forwarding: Select True or False to enable or disable X11 forwarding. Setting this value to True enables X Windows over SSH. X11 forwarding can be useful, but it is also a security risk, so I recommend that you keep the default (disabled) setting unless required.
    • Custom Bootstrap Script: Optional. Specify a custom bootstrap script S3 location for running during bastion host setup.
    • AMI override: Optional. Specify an AWS Region-specific image for the instance.

      Figure 3: Bastion Stack, in the CloudFormation Console

  4. After entering all the parameter values, choose Next.
  5. On the next screen, enter any required tags, an IAM role, or any advanced options, and then choose Next.
  6. Review the details on the final screen, select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create.

Stack creation takes a few minutes. Check the AWS CloudFormation Resources section to see the physical IDs of the various components set up by this stack.

You are now ready to set up the Amazon Redshift cluster.

Set up the Amazon Redshift cluster

This CloudFormation template sets up an Amazon Redshift cluster, CloudWatch alarms, an AWS Glue Data Catalog, an Amazon Redshift IAM role, and the required configuration. Follow the steps below to create these resources in the VPC:

  1. Choose the AWS Region where you want to create the stack on the top right of the screen, and then choose Next.
  2. Choose the following Launch Stack button. This button automatically launches the AWS CloudFormation service in your AWS account with a template.
  3. The CloudFormation stack requires a few parameters, as shown in the following screenshots:
    • Stack name: Enter a meaningful name for the stack, for example, rsdb
    • Environment: Select the environment stage (Development, Test, Pre-prod, Production) of the Amazon Redshift cluster. If you specify the Production option for this parameter, it sets snapshot retention to 35 days, sets the enable_user_activity_logging parameter to true, and creates CloudWatch alarms for high CPU utilization and high disk space usage. Setting Development, Test, or Pre-prod for this parameter sets snapshot retention to 8 days, sets the enable_user_activity_logging parameter to false, and creates CloudWatch alarms only for high disk space usage.
    • Parent VPC stack: Provide the stack name of the parent VPC stack. Find this value in the CloudFormation console.
    • Parent bastion stack (Optional): Provide the stack name of the parent Amazon Linux bastion host stack. Find this value in the CloudFormation console.
    • Node type for Redshift cluster: Enter the type of the node for your Amazon Redshift cluster, for example, dc2.large.
    • Number of nodes in Redshift cluster: Enter the number of compute nodes for the Amazon Redshift cluster, for example, 2.
    • Redshift cluster port: Enter the TCP/IP port for the Amazon Redshift cluster, for example, 8200.
    • Redshift database name:  Enter a database name, for example, rsdev01.
    • Redshift master user name: Enter a database master user name, for example, rsadmin.
    • Redshift master user password: Enter an alphanumeric password for the master user. The password must contain 8–64 printable ASCII characters, excluding ' (single quote), " (double quote), \, /, and @. It must contain at least one uppercase letter, one lowercase letter, and one number. For example, Welcome123.
    • Enable Redshift logging to S3: If you choose true for this parameter, the stack enables database auditing for the newly created S3 bucket.
    • Max. number of concurrent clusters: Enter any number between 1 and 10 for concurrency scaling. To configure more than 10, you must request a limit increase by submitting an Amazon Redshift Limit Increase Form.
    • Encryption at rest: If you choose true for this parameter, the database encrypts your data with the KMS key.
    • KMS key ID: If you leave this empty, then the cluster uses the default Amazon Redshift KMS to encrypt the Amazon Redshift database. If you enter a user-created KMS key, then the cluster uses your user-defined KMS key to encrypt the Amazon Redshift database.
    • Redshift snapshot identifier: Enter a snapshot identifier only if you want to restore from a snapshot. Leave it blank for a new cluster.
    • AWS account ID of the Redshift snapshot: Enter the AWS account number that created the snapshot. Leave it blank if the snapshot comes from the current AWS account or if you don’t want to restore from a previously taken snapshot.
    • Redshift maintenance window: Enter a maintenance window for your Amazon Redshift cluster. For more information, see Amazon Redshift maintenance window. For example, sat:05:00-sat:05:30.
    • S3 bucket for Redshift IAM role: Enter the existing S3 bucket. The stack automatically creates an IAM role and associates it with the Amazon Redshift cluster with GET and LIST access to this bucket.
    • AWS Glue Data Catalog database name: Leave this field empty if you don’t want to create an AWS Glue Data Catalog. If you do want an associated AWS Glue Data Catalog database, enter a name for it, for example, dev-catalog-01. For a list of the AWS Regions in which AWS Glue is available, check the regional-product-services map.
    • Email address for SNS notification: Enter the email notification list that you used to configure an SNS topic for sending CloudWatch alarms. SNS sends a subscription confirmation email to the recipient. The recipient must choose the Confirm subscription link in this email to set up notifications.
    • Unique friendly name: This tag designates a unique, friendly name to append as a NAME tag into all AWS resources that this stack manages.
    • Designate business owner’s email: This tag designates the business owner’s email address associated with the given AWS resource. The stack sends outage or maintenance notifications to this address.
    • Functional tier: This tag designates the specific version of the application.
    • Project cost center: This tag designates the cost center associated with the project of the given AWS resource.
    • Confidentiality classifier: This tag designates the confidentiality classification of the data associated with the resource.
    • Compliance classifier: This tag specifies the Compliance level for the AWS resource.


      Figure 4: Amazon Redshift Stack, in the CloudFormation Console

  4. After entering the parameter values, choose Next.
  5. On the next screen, enter any required tags, an IAM role, or any advanced options, and then choose Next.
  6. Review the details on the final screen, select I acknowledge that AWS CloudFormation might create IAM resources, and choose Create.

Stack creation takes a few minutes. Check the AWS CloudFormation Resources section to see the physical IDs of the various components set up by these stacks.

With setup complete, log in to the Amazon Redshift cluster and run some basic commands to test it.

Log in to the Amazon Redshift cluster using the Amazon Linux bastion host

The following instructions assume that you use a Linux computer and use an SSH client to connect to the bastion host. For more information about how to connect using various clients, see Connect to Your Linux Instance.

  1. Move the private key of the EC2 key pair (that you saved in the Prerequisites section) to a location on your SSH Client, where you are connecting to the Amazon Linux bastion host.
  2. Change the permissions of the private key using the following command, so that it’s not publicly viewable.
    chmod 400 <private key file name, e.g., bastion-key.pem>
  3. In the CloudFormation console, select the Amazon Linux bastion host stack. Choose Outputs and make a note of the SSHCommand parameter value, which you use to apply SSH to the Amazon Linux bastion host.
  4. On the SSH client, change the directory to the location where you saved the EC2 private key, and then copy and paste the SSHCommand value from the previous step.
  5. On the CloudFormation Dashboard, select the Amazon Redshift cluster stack. Choose Outputs and note the PSQLCommandLine parameter value, which you use to log in to the Amazon Redshift database using psql client.
  6. The EC2 Auto Scaling launch configuration already set up PostgreSQL binaries on the Amazon Linux bastion host. Copy and paste the PSQLCommandLine value at the command prompt of the bastion host.
    psql -h ClusterEndpointAddress -p AmazonRedshiftClusterPort -U Username -d DatabaseName
    When prompted, enter the database user password.
  7. Run some basic commands, as shown in the following screenshot:
    select current_database();
    select current_user;

    Figure 5: Successful connection to Amazon Redshift

Next steps

Before you use the Amazon Redshift cluster to set up your application-related database objects, consider creating the following:

  • An application schema
  • A user with full access to create and modify objects in the application schema
  • A user with read/write access to the application schema
  • A user with read-only access to the application schema

Use the master user that you set up with the Amazon Redshift cluster only for administering the Amazon Redshift cluster. To create and modify application-related database objects, use the user with full access to the application schema. Your application should use the read/write user for storing, updating, deleting, and retrieving data. Any reporting or read-only application should use the read-only user. Granting the minimum privileges required to perform operations is a database security best practice.
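
The following is a minimal sketch of one way to create such a schema and users from Python, reusing the PyGreSQL client used elsewhere in this archive. All object names, passwords, and the cluster endpoint are illustrative placeholders; store real credentials in AWS Secrets Manager and adjust the grants to your own security requirements.

import pg

# Connect as the master user (placeholders shown; in practice, read the
# endpoint and credentials from AWS Secrets Manager).
con = pg.connect(dbname='rsdev01', host='your-cluster-endpoint', port=8200,
                 user='rsadmin', passwd='********')

statements = [
    "CREATE SCHEMA app;",
    # Full access to create and modify objects in the application schema
    "CREATE USER app_owner PASSWORD 'Str0ngPassw0rd1';",
    "GRANT ALL ON SCHEMA app TO app_owner;",
    # Read/write user for the application
    "CREATE USER app_rw PASSWORD 'Str0ngPassw0rd2';",
    "GRANT USAGE ON SCHEMA app TO app_rw;",
    "ALTER DEFAULT PRIVILEGES FOR USER app_owner IN SCHEMA app "
    "GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO app_rw;",
    # Read-only user for reporting
    "CREATE USER app_ro PASSWORD 'Str0ngPassw0rd3';",
    "GRANT USAGE ON SCHEMA app TO app_ro;",
    "ALTER DEFAULT PRIVILEGES FOR USER app_owner IN SCHEMA app "
    "GRANT SELECT ON TABLES TO app_ro;",
]
for statement in statements:
    con.query(statement)
con.close()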

Review AWS CloudTrail, AWS Config, and Amazon GuardDuty and configure them for your AWS account, according to AWS security best practices. Together, these services help you monitor activity in your AWS account; assess, audit, and evaluate the configurations of your AWS resources; monitor malicious or unauthorized behavior; and detect security threats against your resources.

Delete CloudFormation stacks

Some of the AWS resources deployed by the CloudFormation stacks in this post incur a cost as long as you continue to use them.

You can delete the CloudFormation stack to delete all AWS resources created by the stack. To clean up all your stacks, use the CloudFormation console to remove the three stacks that you created in reverse order.

To delete a stack:

  1. On the Stacks page in the CloudFormation console, select the stack to delete. The stack must be currently running.
  2. In the stack details pane, choose Delete.
  3. Select Delete stack when prompted.
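
Alternatively, the following is a minimal boto3 sketch for the same cleanup, deleting the stacks in reverse order (the stack names match the examples used earlier; replace them with your own):

import boto3

cfn = boto3.client('cloudformation')
waiter = cfn.get_waiter('stack_delete_complete')

# Delete in reverse order of creation so that exported values are no
# longer referenced when the stack that exports them is removed.
for stack in ['rsdb', 'rsBastion', 'rsVPC']:
    cfn.delete_stack(StackName=stack)
    waiter.wait(StackName=stack)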

After stack deletion begins, you cannot stop it. The stack proceeds to the DELETE_IN_PROGRESS state. After the stack deletion completes, the stack changes to the DELETE_COMPLETE state. The AWS CloudFormation console does not display stacks in the DELETE_COMPLETE state by default. To display deleted stacks, you must change the stack view filter, as described in Viewing Deleted Stacks on the AWS CloudFormation Console.

If the delete fails, the stack enters the DELETE_FAILED state. For solutions, see Delete Stack Fails.

Summary

In this post, I showed you how to automate creation of an Amazon Redshift cluster and required AWS infrastructure based on AWS security and high availability best practices using AWS CloudFormation. I hope you find the sample CloudFormation templates helpful and encourage you to modify them to support your business needs.

If you have any comments or questions about this post, I encourage you to use the comments section.

About the Author

Sudhir Gupta is a senior partner solutions architect at Amazon Web Services. He works with AWS consulting and technology partners to provide guidance and technical assistance on data warehouse and data lake projects, helping them to improve the value of their solutions when using AWS.