Tag Archives: Amazon Redshift

Create cross-account, custom Amazon Managed Grafana dashboards for Amazon Redshift

Post Syndicated from Tahir Aziz original https://aws.amazon.com/blogs/big-data/create-cross-account-custom-amazon-managed-grafana-dashboards-for-amazon-redshift/

Amazon Managed Grafana recently announced a new data source plugin for Amazon Redshift, enabling you to query, visualize, and alert on your Amazon Redshift data from Amazon Managed Grafana workspaces. With the new Amazon Redshift data source, you can now create dashboards and alerts in your Amazon Managed Grafana workspaces to analyze your structured and semi-structured data across data warehouses, operational databases, and data lakes. The Amazon Redshift plugin also comes with default out-of-the-box dashboards that make it simple to get started monitoring the health and performance of your Amazon Redshift clusters.

In this post, we present a step-by-step tutorial on using the Amazon Redshift data source plugin to visualize metrics from Amazon Redshift clusters hosted in different AWS accounts using AWS Single Sign-On (AWS SSO). We also show how to create custom dashboards that visualize data from Amazon Redshift system tables in Amazon Managed Grafana.

Solution overview

Let’s look at the AWS services that we use in our tutorial:

Amazon Managed Grafana is a fully managed service for open-source Grafana developed in collaboration with Grafana Labs. Grafana is a popular open-source analytics platform that enables you to query, visualize, alert on, and understand your operational metrics. You can create, explore, and share observability dashboards with your team, and spend less time managing your Grafana infrastructure and more time improving the health, performance, and availability of your applications. Amazon Managed Grafana natively integrates with AWS services (like Amazon Redshift) so you can securely add, query, visualize, and analyze operational and performance data across multiple accounts and Regions for the underlying AWS service.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Today, tens of thousands of AWS customers from Fortune 500 companies, startups, and everything in between use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics jobs. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

AWS SSO is where you create or connect your workforce identities in AWS and manage access centrally across your AWS organization. You can choose to manage access just to your AWS accounts or cloud applications. You can create user identities directly in AWS SSO, or you can bring them from your Microsoft Active Directory or a standards-based identity provider, such as Okta Universal Directory or Azure AD. With AWS SSO, you get a unified administration experience to define, customize, and assign fine-grained access. Your workforce users get a user portal to access all their assigned AWS accounts, Amazon Elastic Compute Cloud (Amazon EC2) Windows instances, or cloud applications. AWS SSO can be flexibly configured to run alongside or replace AWS account access management via AWS Identity and Access Management (IAM).

The following diagram illustrates the solution architecture.

The solution includes the following components:

  • Captured metrics from the Amazon Redshift clusters in the development and production AWS accounts.
  • Amazon Managed Grafana, with the Amazon Redshift data source plugin added to it. Amazon Managed Grafana communicates with the Amazon Redshift cluster via the Amazon Redshift Data API (a minimal example of this call is sketched after this list).
  • The Grafana web UI, with the Amazon Redshift dashboard using the Amazon Redshift cluster as the data source. The web UI communicates with Amazon Managed Grafana via an HTTP API.
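Because Amazon Managed Grafana talks to the cluster through the Amazon Redshift Data API rather than a persistent connection, it can help to see what an equivalent call looks like. The following is a minimal boto3 sketch; the cluster identifier, database, and user are placeholders and not part of the original walkthrough.

    import time
    import boto3

    # Hypothetical cluster and database names; replace with your own.
    client = boto3.client("redshift-data", region_name="us-east-1")

    resp = client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="dev",
        DbUser="redshift_data_api_user",  # temporary credentials via GetClusterCredentials
        Sql="SELECT COUNT(*) FROM stv_sessions;",
    )

    # The Data API is asynchronous: poll until the statement finishes.
    while True:
        status = client.describe_statement(Id=resp["Id"])
        if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)

    if status["Status"] == "FINISHED":
        result = client.get_statement_result(Id=resp["Id"])
        print(result["Records"])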

We walk you through the following steps in this post:

  1. Create a user in AWS SSO for Amazon Managed Grafana workspace access.
  2. Configure an Amazon Managed Grafana workspace.
  3. Set up two Amazon Redshift clusters as the data sources in Grafana.
  4. Import the Amazon Redshift dashboard supplied with the data source.
  5. Create a custom Amazon Redshift dashboard to visualize metrics from the Amazon Redshift clusters.

Prerequisites

To follow along with this post, you should have the following prerequisites:

Set up AWS SSO

In this section, we set up AWS SSO and register users.

In addition to AWS SSO integration, Amazon Managed Grafana also supports direct SAML integration with SAML 2.0 identity providers.

  1. If you don’t have AWS SSO enabled, open the AWS SSO console and choose Enable AWS SSO.
  2. After AWS SSO is enabled, choose Users in the navigation pane.
  3. Choose Add user.
  4. Enter the user details and choose Next: Groups.
  5. Choose Add user.

Set up your Amazon Grafana workspace

In this section, we demonstrate how to set up a Grafana workspace using Amazon Managed Grafana. We set up authentication using AWS SSO, register data sources, and add administrative users for the workspace.

  1. On the Amazon Managed Grafana console, choose Create workspace.
  2. For Workspace name, enter a suitable name.
  3. Choose Next.
  4. For Authentication access, select AWS Single Sign-On.
  5. For Permission type, select Service managed.
  6. Choose Next.
  7. Select Current account.
  8. For Data sources, select Amazon Redshift.
  9. Choose Next.
  10. Review the details and choose Create workspace. (A programmatic sketch of the workspace creation follows this list.)

    Now we assign a user to the workspace.
  11. On the Workspaces page, choose the workspace you created.
  12. Note the IAM role attached to your workspace.
  13. Choose Assign new user or group.
  14. Select the user to assign to the workspace.
  15. Choose Assign users and groups.

    For the purposes of this post, we need an admin user.
  16. To change the permissions of the user you just assigned, select the user name and choose Make admin.
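If you prefer to script the workspace creation from steps 1–10, the Amazon Managed Grafana API exposes the same options. The following boto3 sketch assumes AWS SSO authentication, a service-managed permission type, and a placeholder workspace name; verify the parameter values against the current API reference before using it.

    import boto3

    grafana = boto3.client("grafana", region_name="us-east-1")

    # Mirrors the console choices above; values are illustrative.
    workspace = grafana.create_workspace(
        workspaceName="redshift-monitoring",      # placeholder name
        accountAccessType="CURRENT_ACCOUNT",
        authenticationProviders=["AWS_SSO"],
        permissionType="SERVICE_MANAGED",
        workspaceDataSources=["REDSHIFT"],
    )

    # The response contains the workspace description, including its ID and endpoint.
    print(workspace["workspace"]["id"], workspace["workspace"]["endpoint"])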

For the cross-account setup, we use two Amazon Redshift clusters: production and development. In the next section, we configure IAM roles in both the production and development accounts so that the Grafana in the production account is able to connect to the Amazon Redshift clusters in the production account as well as in the development account.

Configure an IAM role for the development account

In this section, we set up the IAM role in the AWS account hosting the development environment. This role is assumed by the Amazon Managed Grafana service from the production AWS account to establish the connection between Amazon Managed Grafana and Amazon Redshift cluster in the development account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. Select Custom trust policy.
  4. Use the following policy code (update the account number for your production account and the Grafana service role attached to the workspace):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::<production-account-number>:role/service-role/AmazonGrafanaServiceRole-xxxxxxxxxx",
                    "Service": "grafana.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  5. Choose Next.
  6. Attach the managed IAM policy AmazonGrafanaRedshiftAccess to this role. For instructions, refer to Modifying a role permissions policy (console).
  7. Provide a role name, description, and tags (optional), and create the role.
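The same role can also be created programmatically. The following boto3 sketch reuses the trust policy shown above; the role name is hypothetical, the account number and Grafana service role name are placeholders, and you should verify the exact managed policy ARN in the IAM console.

    import json
    import boto3

    iam = boto3.client("iam")

    # Trust policy from the step above; account number and role name are placeholders.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<production-account-number>:role/service-role/AmazonGrafanaServiceRole-xxxxxxxxxx",
                "Service": "grafana.amazonaws.com",
            },
            "Action": "sts:AssumeRole",
        }],
    }

    iam.create_role(
        RoleName="grafana-redshift-cross-account-role",  # placeholder name
        AssumeRolePolicyDocument=json.dumps(trust_policy),
        Description="Allows Amazon Managed Grafana in the production account to query Redshift",
    )

    # Attach the AmazonGrafanaRedshiftAccess managed policy; confirm the exact ARN in the IAM console.
    iam.attach_role_policy(
        RoleName="grafana-redshift-cross-account-role",
        PolicyArn="arn:aws:iam::aws:policy/service-role/AmazonGrafanaRedshiftAccess",
    )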

Configure an IAM role for the production account

Next, we configure the IAM role created by the Amazon Managed Grafana service in order to establish the connection between Amazon Managed Grafana and the Amazon Redshift cluster in the production account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Search for the AmazonGrafanaServiceRole-xxxxxxx role attached to your Grafana workspace.
  3. Create an inline IAM policy and attach it to this role with the following policy code:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "sts:AssumeRole"
                ],
                "Resource": "arn:aws:iam::<dev-account-number>:role/<DevAccountRoleName>"
            }
        ]
    }

  4. Provide a policy name and create the policy.
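Similarly, you can attach the inline policy to the Grafana workspace role with boto3. In the following sketch, the workspace role name, the inline policy name, and the development account details are placeholders.

    import json
    import boto3

    iam = boto3.client("iam")

    assume_role_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": ["sts:AssumeRole"],
            "Resource": "arn:aws:iam::<dev-account-number>:role/<DevAccountRoleName>",
        }],
    }

    # Inline policy on the workspace service role created by Amazon Managed Grafana.
    iam.put_role_policy(
        RoleName="AmazonGrafanaServiceRole-xxxxxxxxxx",   # your workspace role name
        PolicyName="grafana-cross-account-assume-role",   # placeholder policy name
        PolicyDocument=json.dumps(assume_role_policy),
    )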

Import the default dashboard

In this section, we connect to the Amazon Redshift clusters in the production and development accounts from the Amazon Managed Grafana console and import the default dashboard.

  1. On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
  2. Choose the workspace you just created (authenticate and sign in if needed).
  3. In the navigation pane, choose Settings and on the Configuration menu, choose Data sources.
  4. Choose Add data source.
  5. Search for and choose Amazon Redshift.
  6. On the Settings tab, for Authentication provider, choose Workspace IAM role.
  7. For Default Region, choose us-east-1.
  8. Under Redshift Details, choose Temporary credentials.
  9. Enter the cluster identifier and database name for your Amazon Redshift cluster in the production account.
  10. For Database user, enter redshift_data_api_user.
  11. Choose Save & test.
    When the connection is successfully established, a message appears that the data source is working. You can now move on to the next step.
  12. Repeat these steps to add another data source to connect to the Amazon Redshift cluster in the development account.
  13. On the Settings tab, for Authentication provider, choose Workspace IAM role.
  14. For the IAM role, enter the ARN of the cross-account role you created earlier in the development account (arn:aws:iam::<dev-account-number>:role/<cross-account-role-name>).
  15. For Default Region, choose us-east-1.
  16. Under Redshift Details, choose Temporary credentials.
  17. Enter the cluster identifier and database name for your Amazon Redshift cluster in the development account.
  18. For Database user, enter redshift_data_api_user.
  19. Choose Save & test.
    When the connection is successfully established, a message appears that the data source is working.
  20. On the Dashboards tab, choose Import next to Amazon Redshift.

On the dashboard page, you can switch between your production and development clusters using the data source drop-down menu.

The default Amazon Redshift dashboard, as shown in the following screenshot, makes it easy to monitor the overall health of the cluster by showing different cluster metrics, like total storage capacity used, storage utilization per node, open and closed connections, WLM mode, AQUA status, and more.

Additionally, the default dashboard displays several table-level metrics such as size of the tables, total number of rows, unsorted rows percentage, and more, in the Schema Insights section.

Add a custom dashboard for Amazon Redshift

The Amazon Redshift data source plugin allows you to query and visualize Amazon Redshift data metrics from within Amazon Managed Grafana. It’s preconfigured with general metrics. To add a custom metric from the Amazon Redshift cluster, complete the following steps:

  1. On the Amazon Managed Grafana console, choose All workspaces in the navigation pane.
  2. Choose the Grafana workspace URL for the workspace you want to modify.
  3. Choose Sign in with AWS SSO and provide your credentials.
  4. On the Amazon Managed Grafana workspace page, choose the plus sign and on the Create menu, choose Dashboard.
  5. Choose Add a new panel.
  6. Add the following custom SQL to get the data from the Amazon Redshift cluster:
    select
        p.usename,
        count(*) as num_query,
        sum(datediff('second', starttime, endtime)) as total_execution_seconds
    from stl_query s
    inner join pg_user p on s.userid = p.usesysid
    where starttime between $__timeFrom() and $__timeTo()
        and s.userid > 1
    group by 1

    For this post, we use the default settings, but you can control and link the time range using the $__timeFrom() and $__timeTo() macros; they’re bound with the time range control of your dashboard. For more information and details about the supported expressions, see Query Redshift data.

  7. To inspect the data, choose Query inspector to test the custom query outcome.
    Amazon Managed Grafana supports a variety of visualizations. For this post, we create a bar chart.
  8. On the Visualizations tab in the right pane, choose Bar chart.
  9. Enter a title and description for the custom chart, and leave all other properties as default.
    For more information about supported properties, see Visualizations.
  10. Choose Save.
  11. In the pop-up window, enter a dashboard name and choose Save.

    A new dashboard is created with a custom metric.
  12. To add more metrics, choose the Add panel icon, choose Add a new panel, and repeat the previous steps.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Managed Grafana workspace.
  2. If you created a new Amazon Redshift cluster for this demonstration, delete the cluster.

Conclusion

In this post, we demonstrated how to use AWS SSO and Amazon Managed Grafana to create an operational view to monitor the health and performance of Amazon Redshift clusters. We also showed how to extend the default dashboard by adding custom, insightful dashboards to your Grafana workspace.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Shawn Sachdev is a Sr. Analytics Specialist Solutions Architect at AWS. He works with customers and provides guidance to help them innovate and build well-architected and high-performance data warehouses and implement analytics at scale on the AWS platform. Before AWS, he worked in several analytics and system engineering roles. Outside of work, he loves watching sports, and is an avid foodie and craft beer enthusiast.

Ekta Ahuja is an Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

Resize Amazon Redshift from DC2 to RA3 with minimal or no downtime

Post Syndicated from Soujanya Konka original https://aws.amazon.com/blogs/big-data/resize-amazon-redshift-from-dc2-to-ra3-with-minimal-or-no-downtime/

Amazon Redshift is a popular cloud data warehouse that allows you to process exabytes of data across your data warehouse, operational database, and data lake using standard SQL. Amazon Redshift offers different node types like DC2 (dense compute) and RA3, which you can use for your different workloads and use cases. For more information about the benefits of migrating from DS2 to RA3, refer to Scale your cloud data warehouse and reduce costs with the new Amazon Redshift RA3 nodes with managed storage and Amazon Redshift Benchmarking: Comparison of RA3 vs. DS2 Instance Types.

Many customers use DC2 nodes for their compute-intensive workloads. It’s natural to scale with your growing workload, namely separating compute from storage so they’re right-sized as per your needs. RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. Amazon Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and Amazon S3 for longer-term durable storage. If the data in a node grows beyond the size of the large local SSDs, Amazon Redshift managed storage automatically offloads that data to Amazon S3. RA3 nodes keep track of the frequency of access for each data block and cache the hottest blocks. If the blocks aren’t cached, high-bandwidth networking and optimized data retrieval return the data in sub-seconds. Also, if you’re looking for features like cross-cluster data sharing and cross-Availability Zone cluster relocation, these are a few more reasons for migrating to RA3. Many customers on DC2 have benefited from migrating to RA3 to serve their growing performance requirements and business use cases.

As a first step of the migration, we always recommend finding the correct load of your system and determining the number of RA3 nodes that will meet your workload and give you the best cost-performance benefit. For this evaluation, you can use the Simple Replay tool to conduct a what-if analysis and evaluate how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. To choose the right cluster type, you can compare different node types for your workload and choose the right configuration of RA3 with the Simple Replay utility.

Once you know the cluster type and nodes, the next question is how to migrate your current workload to RA3 with minimum downtime or without disrupting your current workload. In this post, we describe an approach to do this with minimum downtime.

Resizing an Amazon Redshift cluster

There are three ways to resize or migrate an Amazon Redshift cluster from DC2 to RA3:

  • Elastic resize – If it’s available as an option, use elastic resize to change the node type, number of nodes, or both. Note that when you only change the number of nodes, queries are temporarily paused and connections are kept open. An elastic resize typically takes 10–15 minutes. During a resize operation, the cluster is read-only.
  • Classic resize – Use classic resize to change the node type, number of nodes, or both. Choose this option when you’re resizing to a configuration that isn’t available through elastic resize. A resize operation can take 2 hours or more, or last up to several days depending on your data size. During the resize operation, the source cluster is read-only.
  • Snapshot, restore, and resize – To keep your cluster available during a classic resize, make a copy of the existing cluster, then resize the new cluster. If data is written to the source cluster after a snapshot is taken, the data must be manually copied over after the migration is complete.
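As an illustration of the first two options, the Amazon Redshift API exposes a single ResizeCluster operation whose Classic flag switches between elastic and classic resize. The following boto3 sketch uses placeholder identifiers and node counts; size the target cluster for your own workload.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Elastic resize to RA3 (Classic=False). Identifiers and sizes are placeholders.
    redshift.resize_cluster(
        ClusterIdentifier="my-dc2-cluster",
        NodeType="ra3.4xlarge",
        NumberOfNodes=4,
        Classic=False,
    )

    # For configurations not supported by elastic resize, request a classic resize instead.
    # redshift.resize_cluster(
    #     ClusterIdentifier="my-dc2-cluster",
    #     NodeType="ra3.4xlarge",
    #     NumberOfNodes=4,
    #     Classic=True,
    # )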

Checkpoints for resize

When a cluster is resized using elastic resize with the same node type, the operation doesn’t create a new cluster, so it completes quickly. For the other resize paths, one or more of the following challenges can delay the operation:

  • Data volumes – The time required to complete a classic resize or a snapshot and restore operation might vary, depending on factors like the workload on the source cluster, the number and volume of tables being transformed, how evenly data is distributed across the compute nodes and slices, and the node configuration in the source and target clusters.
  • Snapshots – Automated snapshots are automatically deleted when their retention period expires, when you disable automated snapshots, or when you delete a cluster. If you want to keep an automated snapshot, you can copy it to a manual snapshot. You can take a manual snapshot of the cluster before the migration, which is used for resize operations, but it may not include live data from the time the snapshot was captured.
  • Cluster unavailable during resize – It’s critical to know roughly how long the resize will take. To do so, you can try creating a cluster from the snapshot in a test account. However, this only gives a ballpark idea because resize times can vary, especially if you intend to query your cluster during the resize. If the cluster is live almost all the time with minimal or zero non-business hours, a resize can be a challenge because the cluster can’t upsert live data and serve read requests on this data during this window.
  • Cluster endpoint retention – Elastic resize and classic resize allow you to change the node type, number of nodes, or both, and the cluster endpoint is retained. With snapshot, restore, and resize, a new cluster endpoint is created, which may require a change in your application to replace the endpoint.
  • Reconciliation – Validate the target cluster data against the source to make sure the migration completed without data loss and to ensure data quality. Reconciliation at the table level isn’t sufficient; you also need to ensure records have been copied from the source. You can run a matching record count check followed by checksum-based data validation for accuracy.

Solution overview

The steps to prepare for migration are as follows:

  1. Take a snapshot of the existing production Amazon Redshift cluster running on DC2.
  2. Create another Amazon Simple Storage Service (Amazon S3) bucket, where AWS Glue writes the curated data in parallel.
  3. Use the snapshot to create an RA3 cluster.
  4. Configure AWS Database Migration Service (AWS DMS) to load data from the new S3 bucket to the newly created RA3 cluster.
  5. After you confirm that the data is synced between the two clusters (DC2 and RA3) and all other downstream applications, stop the DC2 cluster and change the endpoint of your dependent downstream application to the newly created RA3 cluster.

The following diagram depicts the current architecture with a live workload.

In this solution, data comes from three source systems and is written into a raw S3 bucket:

  • Change data capture (CDC) from an RDS instance via AWS DMS (1 in the preceding diagram)
  • Events captured via an external API (2)
  • CSV files from an external source copied to the raw bucket (3)

These sources don’t have a pattern or an interval of pushing new data.

Every few minutes, the ingested data is picked up by an S3 event trigger that runs an AWS Glue workflow (4 in the preceding diagram), which provides an orchestration layer to manage and run jobs and crawlers. The workflow includes a crawler (5) that updates the metadata schema and partitions of the dataset in the AWS Glue Data Catalog. The crawler then triggers an AWS Glue job that writes the curated data to the S3 curated bucket. From there, another AWS Glue job uploads data into Amazon Redshift (6).
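A minimal sketch of how such an S3-triggered step might look is a Lambda function that receives the S3 event notification and starts the AWS Glue workflow. The workflow name and the exact event wiring are assumptions for illustration, not part of the original design.

    import boto3

    glue = boto3.client("glue")

    WORKFLOW_NAME = "curation-workflow"  # hypothetical workflow name

    def handler(event, context):
        # Each record corresponds to an object created in the raw S3 bucket.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            print(f"New object s3://{bucket}/{key}; starting Glue workflow")

        run = glue.start_workflow_run(Name=WORKFLOW_NAME)
        return {"runId": run["RunId"]}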

In this scenario, if your workload is critical and you can’t afford a long downtime, then you need to plan your migration accordingly.

Dual write and transient data curation pipeline

As a first step of the migration, you need a parallel data processing pipeline alongside the existing AWS Glue job that writes data into the curated S3 bucket. Create another S3 bucket, name it migrated-curated-bucket, and modify the AWS Glue transform job (or replicate it as a second transform job) to write data to the new reserve S3 bucket in parallel.

In this scenario, live data ingestion occurs every 30 minutes. When an iteration of the extract, transform, and load (ETL) job is complete, this triggers a manual snapshot of the Amazon Redshift cluster. After the snapshot is captured, a new Amazon Redshift cluster is created using that snapshot. Cluster creation time can vary depending on the snapshot volume.
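A boto3 sketch of the snapshot-and-restore step follows; the identifiers, node type, and node count are placeholders, and depending on your source node type you may instead need to restore first and then resize. Waiters are used to block until each step completes.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # 1. Take a manual snapshot of the DC2 cluster after the ETL iteration completes.
    redshift.create_cluster_snapshot(
        SnapshotIdentifier="dc2-migration-snapshot",  # placeholder
        ClusterIdentifier="prod-dc2-cluster",         # placeholder
    )
    redshift.get_waiter("snapshot_available").wait(
        SnapshotIdentifier="dc2-migration-snapshot"
    )

    # 2. Restore the snapshot into a new RA3 cluster, resizing during the restore.
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier="prod-ra3-cluster",         # placeholder
        SnapshotIdentifier="dc2-migration-snapshot",
        NodeType="ra3.4xlarge",
        NumberOfNodes=4,
    )
    redshift.get_waiter("cluster_available").wait(ClusterIdentifier="prod-ra3-cluster")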

If snapshot creation takes more than 30 minutes, then the ETL job should be stopped, and resume after the snapshot creation is complete. For example, if the ETL job is triggered at 8:00 AM and finishes at 8:10 AM, then snapshot creation starts at 8:10 AM. If it finishes by 8:30 AM (the next ETL job will run at 8:30 AM as per the half-hour interval), then the ETL process continues according to the schedule. Otherwise, the job stops, and resumes after the snapshot completion.

Now we use the snapshot to launch a new RA3 Amazon Redshift cluster. The process doesn’t pause the existing ETL pipeline; rather, it starts writing curated data in parallel to the reserve S3 bucket. The following diagram illustrates this updated workflow.

At this point, the existing cluster is still live and continues to process the live workload. Even if creation of the Amazon Redshift cluster takes time (owing to the huge volume of data), you’re still covered: the curated data in the reserve S3 bucket acts as a staging area, and this data is loaded into the RA3 cluster after it’s launched.

Backfill the new RA3 cluster with missing data

After the RA3 cluster has been launched, you need to play back the captured live data from the reserve S3 bucket to the newly created cluster. The playback covers only the window from the snapshot capture to the current timestamp. With this process, you bring the RA3 cluster in sync with the existing live DC2 cluster.

You need to configure an AWS DMS migration task with the reserve S3 bucket as the source endpoint and the newly created RA3 cluster as the target endpoint.
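One way to wire this up with boto3 is sketched below. The endpoint and replication instance ARNs are placeholders (the S3 source endpoint in particular needs its own CDC-related settings, which are not shown), and the table mapping is a minimal include-everything rule.

    import json
    import boto3

    dms = boto3.client("dms", region_name="us-east-1")

    # Minimal selection rule: replicate everything exposed by the source endpoint.
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }

    dms.create_replication_task(
        ReplicationTaskIdentifier="reserve-bucket-to-ra3",         # placeholder
        SourceEndpointArn="arn:aws:dms:...:endpoint/source-s3",    # reserve S3 bucket endpoint
        TargetEndpointArn="arn:aws:dms:...:endpoint/target-ra3",   # RA3 cluster endpoint
        ReplicationInstanceArn="arn:aws:dms:...:rep/instance",     # placeholder
        MigrationType="full-load-and-cdc",
        TableMappings=json.dumps(table_mappings),
    )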

AWS DMS captures ongoing changes from the source data store and applies them to the target. This process is called ongoing replication or change data capture (CDC). When the source is a database, AWS DMS collects the changes from the database logs using the database engine’s native API. The following diagram illustrates this workflow.

Reconciliation and cutover

Data reconciliation is the process of verification of data between source and target. In this process, target data is compared with source data to ensure that the data is transferred completely without any alterations. To ensure reliability in the pipeline and the data processed, you should create an end-to-end reconciliation report. This report verifies the percentage of matching tables, columns, and data records. It also identifies missing records, missing values, incorrect values, badly formatted values, and duplicated records.

You can define the reconciliation process to check whether both clusters are running in sync. For that you can create simple Python scripts or shell scripts to query the source and target clusters, fetch the results, and compare.
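A minimal sketch of such a reconciliation check using the Amazon Redshift Data API is shown below. It compares row counts for a single table on both clusters; the cluster, database, user, and table names are placeholders, and a full reconciliation would extend this with checksum comparisons.

    import time
    import boto3

    client = boto3.client("redshift-data", region_name="us-east-1")

    def row_count(cluster_id, database, db_user, table):
        """Run a COUNT(*) on one cluster and wait for the result."""
        resp = client.execute_statement(
            ClusterIdentifier=cluster_id,
            Database=database,
            DbUser=db_user,
            Sql=f"SELECT COUNT(*) FROM {table};",
        )
        while client.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
            time.sleep(1)
        result = client.get_statement_result(Id=resp["Id"])
        return int(result["Records"][0][0]["longValue"])

    # Placeholder identifiers: source DC2 cluster, target RA3 cluster, and a table to verify.
    source = row_count("prod-dc2-cluster", "analytics", "admin", "public.orders")
    target = row_count("prod-ra3-cluster", "analytics", "admin", "public.orders")

    print(f"source={source} target={target} match={source == target}")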

Cutover is the final step of migration, and involves switching the existing cluster with the newly launched cluster. At this point, the clusters are running in parallel. Next, you validate that the downstream data consumption flows are up to date. Verify the reconciliation metrics from the DC2 and RA3 clusters such that table updates are in sync.

You can keep the dual write while you switch from the migration data pipeline. If you discover any issues after cutting over, you can switch back to the old data pipeline, which remains the source of truth until cutover. In this case, cutover involves updating the DC2 cluster endpoint to the new RA3 cluster endpoint in the application. Make sure to identify a relatively quiet window during the day to update the endpoint. To keep the same endpoint for your applications and users, you can rename the new RA3 cluster with the same name as the original DC2 cluster. To rename the cluster, modify it on the Amazon Redshift console or use the ModifyCluster API operation. For more information, see Renaming clusters or ModifyCluster in the Amazon Redshift API Reference.
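The rename itself comes down to a single ModifyCluster call. The following boto3 sketch uses placeholder identifiers; the old cluster must be renamed or deleted first so the identifier is free.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Give the new RA3 cluster the original DC2 cluster's identifier so the endpoint stays the same.
    redshift.modify_cluster(
        ClusterIdentifier="prod-ra3-cluster",
        NewClusterIdentifier="prod-cluster",
    )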

Up to this point, AWS DMS is continuing to update RA3. After you cut over to RA3, the DC2 cluster is no longer live and you can stop the AWS DMS replication job to RA3. Pause the last snapshot. Delete the reserve S3 bucket and AWS DMS resources used for RA3 load.

Conclusion

In this post, we presented an approach to migrate an existing Amazon Redshift cluster with minimal to no data loss, which also allows the cluster to serve both read and write operations during the resize window. Elastic resize is a quick way to resize your cluster to maintain the same number of slices in the target cluster. Slice mapping reduces the time required to resize a cluster. If you choose a resize configuration that isn’t available on elastic resize, you can choose classic resize or perform a snapshot, restore, and resize.

To learn more about what’s new with RA3 instances, refer to Amazon Redshift RA3 instances with managed storage. Amazon Redshift delivers better price performance and at the same time helps you keep your costs predictable. Amazon Redshift Serverless automatically provisions and scales the data warehouse capacity to deliver high performance for demanding and unpredictable workloads, and you pay only for the resources you use. This provides greater flexibility to choose either or both based on custom requirements. After you’ve made your choice, try the hands-on labs on Amazon Redshift.


About the Authors

Soujanya Konka is a Solutions Architect and Analytics specialist at AWS, focused on helping customers build their ideas on the cloud. She has expertise in the design and implementation of business information systems and data warehousing solutions. Before joining AWS, Soujanya had stints with companies such as HSBC and Cognizant.

Dipayan Sarkar is a Specialist Solutions Architect for Analytics at AWS, where he helps customers modernize their data platform using AWS Analytics services. He works with customers to design and build analytics solutions that enable businesses to make data-driven decisions.

How GE Proficy Manufacturing Data Cloud replatformed to improve TCO, data SLA, and performance

Post Syndicated from Jyothin Madari original https://aws.amazon.com/blogs/big-data/how-ge-proficy-manufacturing-data-cloud-replatformed-to-improve-tco-data-sla-and-performance/

This post is co-authored by Jyothin Madari, Madhusudhan Muppagowni, and Ayush Srivastava from GE.

GE Proficy Manufacturing Data Cloud (MDC), part of GE Digital’s Manufacturing Execution Systems (MES) suite of solutions, allows GE Digital’s customers to quickly and easily increase the value they derive from the MES by reliably bringing enterprise-wide manufacturing data into the cloud and transforming it into a structured dataset for advanced analytics and deeper insights into the manufacturing processes.

In this post, we share how MDC modernized the hybrid cloud strategy by replatforming. This solution improved scalability, their data availability Service Level Agreement (SLA), and performance.

Challenge

MDC v1 was built on Predix services using industrial use case-optimized Predix services such as Predix Columnar Store (Cassandra) and Predix Insights (Amazon EMR). MDC evolved in both features and the underlying platform over the past year with a goal to improve TCO, data SLA, and performance. MDC’s customer base grew, and the number of customer sites increased to over 100 over the past couple of years. The increased number of sites needed more compute and storage capacity, which increased infrastructure and operational cost significantly while introducing higher data latency and reducing data freshness from the cloud.

How we started

MDC evaluated several vendors for their storage and compute capabilities using various measurements: security, performance, scalability, ease of management and operation, reduction of overall cost and increase in ROI, partnership, and migration help (technology assistance). The MDC team saw opportunities to improve the product by using native AWS services such as Amazon Redshift, AWS Glue, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which made the product more performant and scalable while reducing operation costs and making it future-ready for advanced analytics and new customer use cases.

The GE Digital team, comprising domain experts, developers, and QA, worked shoulder to shoulder with the AWS ProServe team, comprising solutions architects, data architects, and big data experts, to determine the key architectural changes required and the solutions to implementation challenges.

Overview of solution

The following diagram illustrates the high-level architecture of the solution.

This is a broad overview, and the specifics of networking and security between components are out of scope for this post.

The solution includes the following main steps and components:

  1. CDC and log collector – Compressed CSV data is collected from over 100 manufacturing data sources (Proficy Plant Applications) and written into an Amazon Simple Storage Service (Amazon S3) bucket.
  2. S3 raw bucket – Our data lands in Amazon S3 without any transformation, but appropriately partitioned (tenant, site, date, and so on) for the ease of future processing.
  3. AWS Lambda – When the file lands in the S3 raw bucket, it triggers an S3 event notification, which invokes AWS Lambda. Lambda extracts metadata (bucket name, key name, date, and so on) from the event and saves it in Amazon DynamoDB.
  4. AWS Glue – Our goal is now to take CSV files, with varying schemas, and convert them into Apache Parquet format. An AWS Glue extract, transform, and load (ETL) job reads a list of files to be processed from the DynamoDB table and fetches them from the S3 raw bucket. We have preconfigured unified AVRO schemas in the AWS Glue Schema Registry for schema conversion. Converted data lands in the S3 raw Parquet bucket.
  5. S3 raw Parquet bucket – Data in this bucket is still raw and unmodified; only the format was changed. This intermediary storage is required due to schema and column order mismatch in CSV files.
  6. Amazon Redshift – The majority of transformations and data enrichment happens in this step. Amazon Redshift Spectrum consumes data from the S3 raw Parquet bucket and external PostgreSQL dimension tables (through a federated query). Transformations are performed via stored procedures, where we encapsulate logic for data transformation, data validation, and business-specific logic. The Amazon Redshift cluster is configured with concurrency scaling, auto workload management (WLM) with caching, and the latest RA3 instance types.
  7. MDC API – These custom-built, web-based, REST API microservices talk on the backend with Amazon Redshift and expose data to external users, business intelligence (BI) tools, and partners.
  8. Amazon Redshift data export and archival – On a scheduled basis, Amazon Redshift exports (via the UNLOAD command) contextualized and business-defined aggregated data. Exports land in the S3 bucket as Apache Parquet files (see the sketch after this list).
  9. S3 Parquet export bucket – This bucket stores the exported data (hundreds of TBs) used by external users who need to run extensive, heavy analytics and AI or machine learning (ML) with various tools (such as Amazon EMR, Amazon Athena, Apache Spark, and Dremio).
  10. End-users – External users consume data from the API. The main use case here is reporting and visual analytics.
  11. Amazon MWAA – The orchestrator of the solution, Amazon MWAA is used for scheduling Amazon Redshift stored procedures, AWS Glue ETL jobs, and Amazon Redshift exports at regular intervals with error handling and retries built in.
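As an illustration of step 8, an export of an aggregated table to Parquet can be expressed as an UNLOAD statement issued through the Amazon Redshift Data API. The table, bucket, IAM role, cluster, and database names below are placeholders, not the actual MDC objects.

    import boto3

    client = boto3.client("redshift-data", region_name="us-east-1")

    # Placeholder table, bucket, and IAM role; by default the unload is parallelized into multiple files.
    unload_sql = """
    UNLOAD ('SELECT * FROM analytics.daily_production_aggregates')
    TO 's3://mdc-parquet-export/daily_production/'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/redshift-unload-role'
    FORMAT AS PARQUET;
    """

    client.execute_statement(
        ClusterIdentifier="mdc-redshift-cluster",  # placeholder
        Database="mdc",                            # placeholder
        DbUser="etl_user",                         # placeholder
        Sql=unload_sql,
    )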

Bringing it all together

MDC replaced both Predix Columnar Store (Cassandra) and Predix Insights (Amazon EMR) with Amazon Redshift for both storage of the MDC data models and compute (ELT). Amazon MWAA is used to schedule the workloads that do the bulk of the ELT. Lambda, AWS Glue, and DynamoDB are used to normalize the schema differences between sites. It was important not to disrupt MDC customers while replatforming. To achieve this, MDC used a phased approach to migrate the data models to Amazon Redshift. They used federated queries to query existing PostgreSQL for dimensional data, which facilitated having some of the data models in Amazon Redshift, while the others were in Cassandra with no interruption to MDC customers. Redshift Spectrum facilitated querying the raw data in Amazon S3 directly both for ETL and data validation.

75% of the MDC team along with the AWS ProServe team and AWS Solution Architects collaborated with the GE Digital Security Team and Platform Team to implement the architecture with AWS native services. It took approximately 9 months to implement, secure, and performance tune the architecture and migrate data models in three phases. Each phase has gone through a GE Digital internal security review. Amazon Redshift Auto WLM, short query acceleration, and tuning the sort keys to optimize querying patterns improved the Proficy MDC API performance. Because the unload of the data from Amazon Redshift was fast, Proficy MDC is now able to export the data much more frequently to our end customers.

Conclusion

With replatforming, Proficy MDC was able to improve ETL performance by approximately 75%. Data latency and freshness improved by approximately 87%. The solution reduced the TCO of the platform by approximately 50%, and Proficy MDC was also able to reduce infrastructure and operational cost. Improved performance and reduced latency have allowed us to speed up the next steps in our journey to modernize the enterprise data architecture and hybrid cloud data platform.


About the Authors

Jyothin Madari leads the Manufacturing Data Cloud (MDC) engineering team, part of the manufacturing suite of products at GE Digital. He has 18 years of experience, 4 of which are with GE Digital. Most recently he has been working on data migration projects with an aim to reduce costs and improve performance. He is an AWS Certified Cloud Practitioner, a keen learner, and loves solving interesting problems. Connect with him on LinkedIn.

Madhusudhan (Madhu) Muppagowni is a Technical Architect and Principal Software Developer based in Silicon Valley, Bay Area, California. He is passionate about software development and architecture and thrives on producing well-architected, secure SaaS products and data pipelines that make a real impact. He loves the outdoors and is an avid hiker and backpacker. Connect with him on LinkedIn.

Ayush Srivastava is a Senior Staff Engineer and Technical Anchor based in Hyderabad, India. He is passionate about software development and architecture, and has a demonstrated track record of successfully anchoring small to large secure SaaS products and data pipelines from start to finish. He loves exploring different places and says, “I’m in love with cities I have never been to and people I have never met.” Connect with him on LinkedIn.

Karen Grygoryan is a Data Architect with AWS ProServe. Connect with him on LinkedIn.

Gnanasekaran Kailasam is a Data Architect at AWS. He has worked on building data warehouses and big data solutions for over 16 years. He loves learning new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS. Connect with him on LinkedIn.

Supercharging Dream11’s Data Highway with Amazon Redshift RA3 clusters

Post Syndicated from Dhanraj Gaikwad original https://aws.amazon.com/blogs/big-data/supercharging-dream11s-data-highway-with-amazon-redshift-ra3-clusters/

This is a guest post by Dhanraj Gaikwad, Principal Engineer on Dream11 Data Engineering team.

Dream11 is the world’s largest fantasy sports platform, with over 120 million users playing fantasy cricket, football, kabaddi, basketball, hockey, volleyball, handball, rugby, futsal, American football, and baseball. Dream11 is the flagship brand of Dream Sports, India’s leading Sports Technology company, and has partnerships with several national and international sports bodies and cricketers.

In this post, we look at how we supercharged our data highway, the backbone of our major analytics pipeline, by migrating our Amazon Redshift clusters to RA3 nodes. We also look at why we were excited about this migration, the challenges we faced during the migration and how we overcame them, as well as the benefits accrued from the migration.

Background

The Dream11 Data Engineering team runs the analytics pipelines (what we call our Data Highway) across Dream Sports. In near-real time, we analyze various aspects that directly impact the end-user experience, which can have a profound business impact for Dream11.

Initially, we were analyzing upwards of terabytes of data per day with Amazon Redshift clusters that ran mainly on dc2.8xlarge nodes. However, due to a rapid increase in our user participation over the last few years, we observed that our data volumes increased multi-fold. Because we were using dc2.8xlarge clusters, this meant adding more nodes of dc2.8xlarge instance types to the Amazon Redshift clusters. Not only was this increasing our costs, it also meant that we were adding additional compute power when what we really needed was more storage. Because we anticipated significant growth during the Indian Premier League (IPL) 2021, we actively explored various options using our AWS Enterprise Support team. Additionally, we were expecting more data volume over the next few years.

The solution

After discussions with AWS experts and the Amazon Redshift product team, we at Dream11 were recommended the most viable option of migrating our Amazon Redshift clusters from dc2.8xlarge to the newer RA3 nodes. The most obvious reason for this was the decoupled storage from compute. As a result, we could use lesser nodes and move our storage to Amazon Redshift managed storage. This allowed us to respond to data volume growth in the coming years as well as reduce our costs.

To start off, we conducted a few elementary tests using an Amazon Redshift RA3 test cluster. After we were convinced that this wouldn’t require many changes in our Amazon Redshift queries, we decided to carry out a complete head-to-head performance test between the two clusters.

Validating the solution

Because the user traffic on the Dream11 app tends to spike during big ticket tournaments like the IPL, we wanted to ensure that the RA3 clusters could handle the same traffic that we usually experience during our peak. The AWS Enterprise Support team suggested using the Simple Replay tool, an open-source tool released by AWS that you can use to record and replay the queries from one Amazon Redshift cluster to another. This tool allows you to capture queries on a source Amazon Redshift cluster, and then replay the same queries on a destination Amazon Redshift cluster (or clusters). We decided to use this tool to capture our performance test queries on the existing dc2.8xlarge clusters and replay them on a test Amazon Redshift cluster composed of RA3 nodes. At the time of our experimentation, the newer version of the automated AWS CloudFormation-based toolset (now available on GitHub) was not available.

Challenges faced

The first challenge came up when using the Simple Replay tool because there was no easy way to compare the performance of like-to-like queries on the two types of clusters. Although Amazon Redshift provides various statistics using meta-tables about individual queries and their performance, the Simple Replay tool adds additional comments in each Amazon Redshift query on the target cluster to make it easier to know if these queries were run by the Simple Replay tool. In addition, the Simple Replay tool drops comments from the queries on the source cluster.

Comparing each query’s performance with the Amazon Redshift performance test suite would have meant writing additional scripts for easy performance comparison. An alternative would have been to modify the Simple Replay tool code, because it’s open source on GitHub. However, with the IPL 2021 beginning in just a few days, we had to explore another option urgently.

After further discussions with the AWS Enterprise Support team, we decided to use two test clusters: one with the old dc2.8xlarge nodes, and another with the newer RA3 nodes. The idea was to use the Simple Replay tool to run the captured queries from our original cluster on both test clusters. This meant that the queries would be identical on both test clusters, making it easier to compare. Although this meant running an additional test cluster for a few days, we went ahead with this option. As a side note, the newer automated AWS CloudFormation-based toolset does exactly the same in an automated way.

After we were convinced that most of our Amazon Redshift queries performed satisfactorily, we noticed that certain queries were performing slower on the RA3-based cluster than the dc2.8xlarge cluster. We narrowed down the problem to SQL queries with full table scans. We rectified it by following proper data modelling practices in the ETL workflow. Then we were ready to migrate to the newer RA3 nodes.

The migration to RA3

The migration from the old cluster to the new cluster was smoother than we thought. We used the elastic resize approach, which meant we only had a few minutes of Amazon Redshift downtime. We completed the migration successfully with a sufficient buffer timeline for more tests. Additional tests indicated that the new cluster performed how we wanted it to.

The trial by fire

The new cluster performed satisfactorily during our peak performance loads in the IPL as well as the following ICC T20 Cricket World Cup. We’re excited that the new RA3 node-based Amazon Redshift cluster can support our data volume growth needs without needing to increase the number of instance nodes.

We migrated from DC2 to RA3 in April 2021. The data volume has grown by 50% since then. If we had continued with DC2 instances, the cluster cost would have increased by 50%. However, because of the migration to RA3 instances, even with a 50% increase in data volume since April 2021, the cluster cost has increased by only 0.7%, which is attributed to an increase in storage cost.

Conclusion

Migrating to the newer RA3-based Amazon Redshift cluster helped us decouple our computing needs from our storage needs, and now we’re prepared for our expected data volume growth for the next few years. Moreover, we don’t need to add compute nodes if we only need storage, which is expected to bring down our costs in the long run. We did need to fine-tune some of our queries on the newer cluster. With the Simple Replay tool, we could do a direct comparison between the older and the newer cluster. You can also use the newer automated AWS CloudFormation-based toolset if you want to follow a similar approach.

We highly recommend RA3 instances. They give you the flexibility to size your RA3 cluster based on the amount of data stored without increasing your compute costs.


About the Authors

Dhanraj Gaikwad is a Principal Data Engineer at Dream11. Dhanraj has more than 15 years of experience in the field of data and analytics. In his current role, Dhanraj is responsible for building the data platform for Dream Sports and is specialized in data warehousing, including data modeling, building data pipelines, and query optimizations. He is passionate about solving large-scale data problems and taking unique approaches to deal with them.

Sanket Raut is a Principal Technical Account Manager at AWS based in Vasai, India. Sanket has more than 16 years of industry experience, including roles in cloud architecture, systems engineering, and software design. He currently focuses on enabling large startups to streamline their cloud operations and optimize their cloud spend. His area of interest is in serverless technologies.

AWS Week In Review – May 23, 2022

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-week-in-review-may-27-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

This is the right place to quickly learn about recent AWS news from last week, in just about five minutes or less. This week, I have collected a couple of news items that might be of interest to you, the IT professionals, developers, system administrators, or any type of builders that have their hands on the AWS console, the CLI, or that are writing code.

Last Week’s Launches
The launches that caught my attention last week are the following:

EC2 now supports NitroTPM and SecureBoot – A Trusted Platform Module is often a discrete chip in a computer where you can store secrets and release them to the operating system only when the system is in a known good state. You typically use TPM modules to store operating-system-level volume encryption keys, such as the ones used by BitLocker on Windows or LUKS. NitroTPM is a virtual TPM module available on selected instance families that allows you to deploy workloads that depend on TPM functionality to EC2 instances.

Amazon EC2 Auto Scaling now backfills predictive scaling forecasts so you can quickly validate forecast accuracy. Predictive scaling is a capability of Auto Scaling that allows you to scale your fleet in and out based on observed usage patterns. It uses AI/ML to predict when your fleet needs more or less capacity, allowing you to scale in advance of the scaling event and have the fleet prepared at peak times. The new backfill capability shows you how predictive scaling would have scaled your fleet during the last 14 days. This allows you to quickly decide whether a predictive scaling policy is accurate for your applications by comparing the demand and capacity forecasts against actual demand immediately after you create the policy.

AWS Backup adds support for two new managed file systems, Amazon FSx for OpenZFS and Amazon FSx for NetApp ONTAP. These additions help you meet your centralized data protection and regulatory compliance needs. You can now use AWS Backup’s policy-based capabilities to centrally protect Amazon FSx for NetApp ONTAP or Amazon FSx for OpenZFS, along with the other AWS services for storage, database, and compute that AWS Backup supports.

AWS App Mesh now supports IPv6 – AWS App Mesh is a service mesh that provides application-level networking to make it easy for your services to communicate with each other across multiple types of compute infrastructure. The new support for IPv6 allows you to support workloads running in IPv6 networks and to invoke App Mesh APIs over IPv6. This helps you meet IPv6 compliance requirements, and removes the need for complex networking configuration to handle address translation between IPv4 and IPv6.

Amazon Chime SDK now supports video background replacement and blur on iOS and Android. When you want to integrate audio and video call capabilities in your mobile applications, the Chime SDK is the easiest way to get started. It provides an easy-to-use API that uses the scalable and robust Amazon Chime backend to power your communications. For example, Slack is using Chime as backend for the communications in their apps. The Chime SDK client libraries for iOS and Android now include video background replacement and blur, which developers can use to reduce visual distractions and help increase visual privacy for mobile users on iOS and Android.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

Amazon Redshift: Ten years of continuous reinvention. This is an Amazon Redshift research paper that will be presented at a leading international forum for database researchers. The authors reflect on how far the first petabyte-scale cloud data warehouse has advanced since it was announced ten years ago.

Improve Your Security at the Edge with AWS IoT Services is a new blog post on the IoT channel. We understand the risks associated with operating at the edge and that you need additional capabilities to ensure that your data is protected. AWS IoT services can help you with end-to-end data protection, device security, and device identification to create the foundation of an expanded information security model and confidently operate at the edge.

AWS Open Source News and Updates – Ricardo Sueiras, my colleague from the AWS Developer Relation team, runs this newsletter. It brings you all the latest open-source projects, posts, and more. Read edition #113 here.

Upcoming AWS Events
CDK Day, on May 26, is a one-day fully virtual event dedicated to the AWS Cloud Development Kit. With four versions of the CDK released (AWS, Terraform, CDK8s, and Projen), we thought the CDK deserves its own full-fledged conference. We will take one day and showcase the brightest and best of CDK from across the whole product family. Let’s talk serverless, Kubernetes, and multi-cloud all on the same day! CDK Day will take place on May 26, 2022 and will be fully virtual, live-streamed to our YouTube channel. Book your ticket now, it’s free.

The AWS Summit season is mostly over in Europe, but there are upcoming Summits in North America and the Asia Pacific Regions. Here are some virtual and in-person Summits that might be close to you:

More to come in July, August, and September.

You can register for re:MARS to get fresh ideas on topics such as machine learning, automation, robotics, and space. The conference will be in person in Las Vegas, June 21–24.

That’s all for this week. Check back next Monday for another Week in Review!

— seb

Use Amazon Redshift RA3 with managed storage in your modern data architecture

Post Syndicated from Bhanu Pittampally original https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

Over the years, Amazon Redshift has evolved a lot to meet our customer demands. Its journey started as a standalone data warehousing appliance that provided a low-cost, high-performance, cloud-based data warehouse. Support for Amazon Redshift Spectrum compute nodes was later added to extend your data warehouse to data lakes, and the concurrency scaling feature was added to support burst activity and scale your data warehouse to support thousands of queries concurrently. In its latest offering, Amazon Redshift runs on third-generation architecture where storage and compute layers are decoupled and scaled independent of each other. This latest generation powers the several modern data architecture patterns our customers are actively embracing to build flexible and scalable analytics platforms.

When spinning up a new instance of Amazon Redshift, you get to choose either Amazon Redshift Serverless, for when you need a data warehouse that can scale seamlessly and automatically as your demand evolves unpredictably, or you can choose an Amazon Redshift provisioned cluster for steady-state workloads and greater control over your Amazon Redshift cluster’s configuration.

An Amazon Redshift provisioned cluster is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs the Amazon Redshift engine and contains one or more databases. Creating an Amazon Redshift cluster is the first step in your process of building an Amazon Redshift data warehouse. While launching a provisioned cluster, one option that you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node.

In this post, we cover the current generation node RA3 architecture, different RA3 node types, important capabilities that are available only on RA3 node types, and how you can upgrade your current Amazon Redshift node types to RA3.

Amazon Redshift RA3 nodes

RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. RA3 node types are the latest node type for Amazon Redshift. With RA3, you choose the number of nodes based on your performance requirements and only pay for the managed storage that you use. RA3 architecture gives you the ability to size your cluster based on the amount of data you process daily or the amount of data that you want to store in your warehouse; there is no need to account for both storage and processing needs together.

Other node types that we previously offered include the following:

  • Dense compute – DC2 nodes enable you to have compute-intensive data warehouses with local SSD storage included. You choose the number of nodes you need based on data size and performance requirements.
  • Dense storage (deprecated) – DS2 nodes enable you to create large data warehouses using hard disk drives (HDDs). If you’re using the DS2 node type, we strongly recommend that you upgrade to RA3 to get twice as much storage and improved performance for the same on-demand cost.

When you choose the RA3 node size and the number of nodes, you can provision compute independent of storage. RA3 nodes are built on the AWS Nitro System and feature high-bandwidth networking and large, high-performance SSDs as local caches. RA3 nodes use your workload patterns and advanced data management techniques to deliver the performance of local SSD while scaling storage automatically to Amazon Simple Storage Service (Amazon S3).

RA3 node types come in three different sizes to accommodate your analytical workloads. You can quickly start experimenting with the RA3 node type by creating a single-node ra3.xlplus cluster and explore various features that are available. If you’re running a medium-sized data warehouse, you can size your cluster with ra3.4xlarge nodes. For large data warehouses, you can start with ra3.16xlarge. The following table gives more information about RA3 node types and their specifications as of this writing.

Node Type vCPU RAM (GiB) Default Slices Per Node Managed Storage Quota Per Node Node Range with Create Cluster Total Managed Storage Capacity
ra3.xlplus 4 32 2 32 TB 1-16 1024 TB
ra3.4xlarge 12 96 4 128 TB 2-32 8192 TB
ra3.16xlarge 48 384 16 128 TB 2-128 16384 TB

Amazon Redshift with managed storage

Amazon Redshift managed storage (RMS) provides the same resiliency on industry-leading hardware. With managed storage, Amazon Redshift uses intelligent data prefetching and data evictions based on the temperature of your data, automatically placing your most-queried data where it can be read fastest. The most frequently used blocks (hot data) are cached locally on SSD, and infrequently used blocks (cold data) are stored on an RMS layer backed by Amazon S3. The following diagram depicts the leader node, compute nodes, and Amazon Redshift managed storage.

In the following sections, we discuss the capabilities that Amazon Redshift RA3 with managed storage can provide.

Independently scale compute and storage

As an organization scales, its data continues to grow, reaching petabytes, and so does the amount of data you ingest into your Amazon Redshift data warehouse. You may be looking for ways to cost-effectively analyze all your data while retaining control over choosing the right compute or storage resource at the right time. For customers who want to be cost conscious and cost-effective, the RA3 platform provides the option to scale and pay for compute and storage resources separately.

With RA3 instances with managed storage, you can choose the number of nodes based on your performance requirements, and only pay for the managed storage that you use. This gives you the flexibility to size your RA3 cluster based on the amount of data you process daily without increasing your storage costs. You pay per hour for compute, and you can scale your data warehouse storage capacity separately without adding compute resources, paying only for what you use.

Another benefit of RMS is that Amazon Redshift manages which data should be stored locally for fastest access, and data that is slightly colder is still kept within fast-access reach.

Advanced hardware

RA3 instances use high-bandwidth networking built on the AWS Nitro System to further reduce the time taken for data to be offloaded to and retrieved from Amazon S3. Managed storage uses high-performance SSDs for your hot data and Amazon S3 for your cold data, providing ease of use, cost-effective storage, and fast query performance.

Additional security options

Amazon Redshift managed VPC endpoints enable you to set up a private connection to securely access your Amazon Redshift cluster within your virtual private cloud (VPC) from client applications in another VPC within the same AWS account, another AWS account, or a subnet without using public IPs and without requiring the traffic to traverse across the internet.

The following scenarios describe common reasons to allow access to a cluster using an Amazon Redshift managed VPC endpoint:

  • AWS account A wants to allow a VPC in AWS account B to have access to a cluster
  • AWS account A wants to allow a VPC that is also in AWS account A to have access to a cluster
  • AWS account A wants to allow a different subnet in the cluster’s VPC within AWS account A to have access to a cluster

For information about access options to another VPC, refer to Enable private access to Amazon Redshift from your client applications in another VPC.

Further optimize your workload

In this section, we discuss two ways to further optimize your workload.

AQUA

AQUA (Advanced Query Accelerator) is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run up to 10 times faster than other enterprise cloud data warehouses by automatically boosting certain types of queries. AQUA is available with the ra3.16xlarge, ra3.4xlarge, or ra3.xlplus nodes at no additional charge and with no code changes.

AQUA is an analytics query accelerator for Amazon Redshift that uses custom-designed hardware to speed up queries that scan, filter, and aggregate large datasets. AQUA automatically optimizes query performance on subsets of the data that require extensive scans, filters, and aggregation.

For more information about using AQUA, refer to How to evaluate the benefits of AQUA for your Amazon Redshift workloads.

Concurrency scaling for write operations

With RA3 nodes, you can take advantage of concurrency scaling for write operations, such as extract, transform, and load (ETL) statements. Concurrency scaling for write operations is especially useful when you want to maintain consistent response times when your cluster receives a large number of requests. It improves throughput for write operations contending for resources on the main cluster.

Concurrency scaling supports COPY, INSERT, DELETE, and UPDATE statements. In some cases, you might follow DDL statements, such as CREATE, with write statements in the same commit block. In these cases, the write statements are not sent to the concurrency scaling cluster.

When you accrue credit for concurrency scaling, this credit accrual applies to both read and write operations.
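
If you want to verify when concurrency scaling kicked in and for how long, you can query the concurrency scaling system views. The following is a minimal sketch, assuming the SVCS_CONCURRENCY_SCALING_USAGE system view is available in your cluster version:

-- Sketch: list usage periods during which concurrency scaling clusters were active.
-- Column names can vary by Amazon Redshift version, so we select all columns here.
select *
from svcs_concurrency_scaling_usage;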

Increased agility to scale compute resources

Elastic resize allows you to scale your Amazon Redshift cluster up and down in minutes to get the performance you need, when you need it. However, there are limits on the number of nodes that you can add to a cluster. With some RA3 node types, you can increase the number of nodes up to four times the existing count. All RA3 node types support decreasing the number of nodes to a quarter of the existing count. The following table lists the growth and reduction limits for each RA3 node type.

Node Type Growth Limit Reduction Limit
ra3.xlplus 2 times (from 4 to 8 nodes, for example) To a quarter of the number
ra3.4xlarge 4 times (from 4 to 16 nodes, for example) To a quarter of the number (from 16 to 4 nodes, for example)
ra3.16xlarge 4 times (from 4 to 16 nodes, for example) To a quarter of the number (from 16 to 4 nodes, for example)

RA3 node types also restore from snapshots faster because storage and compute are separated.

Improved resiliency

Amazon Redshift employs extensive fault detection and automatic remediation techniques to maximize the availability of a cluster. With the RA3 architecture, you can enable cluster relocation, which provides additional resiliency by giving you the ability to relocate a cluster across Availability Zones without losing any data (the RPO is zero) or having to change your client applications. The cluster’s endpoint remains the same after the relocation, so applications can continue operating without modifications. If the existing cluster fails, a new cluster is created on demand in another Availability Zone, so the cost of a standby replica cluster is avoided.

Accelerate data democratization

In this section, we share two techniques to accelerate data democratization.

Data sharing

Data sharing provides instant, granular, and high-performance access to data without copies or data movement. You can query live data across consumers on different RA3 clusters in the same AWS account, in a different AWS account, or in a different AWS Region. Data is shared securely, with governed collaboration. You can grant access at different levels of granularity, including databases, schemas, tables, views, and user-defined functions.

This opens up new use cases where one ETL cluster produces data and multiple consumers, such as ad hoc querying, dashboarding, and data science clusters, view the same data. It also enables bi-directional collaboration, where groups such as marketing and finance can share data with one another. Queries accessing shared data use the compute resources of the consumer Amazon Redshift cluster and don’t impact the performance of the producer cluster.

For more information about data sharing, refer to Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation.
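
As a brief illustration of the end-to-end flow, the following sketch shows the producer-side and consumer-side SQL. The datashare name, namespace GUIDs, and the store_orders table are placeholders for your own objects:

-- Producer cluster: create a datashare, add objects, and grant usage to a consumer namespace.
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA sales;
ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA sales;
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-cluster-namespace-guid>';

-- Consumer cluster: create a local database from the datashare and query it in place.
CREATE DATABASE sales_shared FROM DATASHARE sales_share OF NAMESPACE '<producer-cluster-namespace-guid>';
SELECT COUNT(*) FROM sales_shared.sales.store_orders;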

AWS Data Exchange for Amazon Redshift

AWS Data Exchange for Amazon Redshift enables you to find and subscribe to third-party data in AWS Data Exchange that you can query in an Amazon Redshift data warehouse in minutes. You can also license your data in Amazon Redshift through AWS Data Exchange. Access is automatically granted when a customer subscribes to your data and is automatically revoked when their subscription ends. Invoices are automatically generated, and payments are automatically collected and disbursed through AWS. This feature empowers you to quickly query, analyze, and build applications with third-party data.

For details on how to publish a data product and subscribe to a data product using AWS Data Exchange for Amazon Redshift, refer to New – AWS Data Exchange for Amazon Redshift.

Cross-database queries for Amazon Redshift

Amazon Redshift supports the ability to query across databases in a Redshift cluster. With cross-database queries, you can seamlessly query data from any database in the cluster, regardless of which database you are connected to. Cross-database queries can eliminate data copies and simplify your data organization to support multiple business groups on the same cluster.

One of the many use cases where cross-database queries help is when data is organized across multiple databases in a Redshift cluster to support multi-tenant configurations. For example, different business groups and teams that own and manage datasets in their specific database in the same data warehouse need to collaborate with other groups. You might want to perform common ETL staging and processing while your raw data is spread across multiple databases. Organizing data in multiple Redshift databases is also a common scenario when migrating from traditional data warehouse systems. With cross-database queries, you can access data from any of the databases on the Redshift cluster without having to connect to that specific database. You can also join datasets from multiple databases in a single query, as shown in the following example.
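
The following sketch joins a staging table in a separate etl_db database with a table in the database you’re connected to; the database, schema, and table names are illustrative:

-- Cross-database query using three-part (database.schema.table) notation.
-- etl_db, staging.orders, and public.customers are illustrative names.
SELECT o.order_id, o.order_total, c.customer_segment
FROM etl_db.staging.orders o
JOIN public.customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2022-01-01';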

You can read more about cross-database queries here.

Upgrade to RA3

You can upgrade to RA3 instances within minutes no matter the size of your current Amazon Redshift clusters. Simply take a snapshot of your cluster and restore it to a new RA3 cluster. For more information, refer to Upgrading to RA3 node types.

You can also simplify your migration efforts with Amazon Redshift Simple Replay. For more information, refer to Simplify Amazon Redshift RA3 migration evaluation with Simple Replay utility.

Summary

In this post, we talked about the RA3 node types, the benefits of Amazon Redshift managed storage, and the additional capabilities that you get by using Amazon Redshift RA3 with managed storage. Migrating to RA3 node types isn’t a complicated effort, so you can get started today.


About the Authors

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytic solutions. His background is in data warehouses—architecture, development, and administration. He has been in the data and analytics field for over 13 years.

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Ingest Stripe data in a fast and reliable way using Stripe Data Pipeline for Amazon Redshift

Post Syndicated from Jessica Ho original https://aws.amazon.com/blogs/big-data/ingest-stripe-data-in-a-fast-and-reliable-way-using-stripe-data-pipeline-for-amazon-redshift/

Enterprises typically host a myriad of business applications for varying data needs. As companies grow, so does the demand for insights from a complete set of business data. Having data from various applications that store data in disparate silos can delay the decision-making process. However, building and maintaining an API integration or a third-party extract, transform, and load (ETL) pipeline to move data into a destination data store can be time-consuming and expensive.

Today we’re delighted to introduce Stripe Data Pipeline for Amazon Redshift to help you access your Stripe data and extract insight securely and easily from Amazon Redshift. This data, including billing, issuing, and payment records, can be shared in a consistent and automated fashion. You can integrate your Stripe data with data from other sources in your Amazon Redshift clusters to create a single source of truth.

In this post, we discuss the benefits of Stripe Data Pipeline and some of its use cases.

Solution overview

Amazon Redshift is a fast, fully managed, petabyte-scale cloud data warehousing service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence (BI) tools. It’s optimized for datasets ranging from a few hundred gigabytes to petabytes or more. This columnar data warehouse provides provisioned as well as serverless deployment options and uses an industry-standard SQL interface to analyze structured and semi-structured data with fast query performance.

Stripe Data Pipeline is powered by Amazon Redshift’s latest RA3 instances, which provide cross-account data sharing capability. RA3 takes a performant, cost-effective approach to address rapidly growing data volume by decoupling data processing from managed storage. You can then scale compute and storage independently and only pay for what you use. Data sharing provides read access directly to data stored across Amazon Redshift clusters without data movement. This capability removes the complexity and delays that are often associated with managing large distributed datasets across multiple accounts.

The solution provides the following core features and benefits:

  • Scalable and managed data pipeline – You don’t need to build, maintain, and scale custom ETL jobs. You can set up Stripe Data Pipeline in minutes, and it scales automatically to handle increased business activities and data volume.
  • Up-to-date financial data – You automatically receive and refresh a complete set of your Stripe data and reports in Amazon Redshift on a low-latency schedule. Stripe Data Pipeline is built into Stripe and always provides accurate data.
  • Security and compliance – Data is shared directly from Stripe with your Amazon Redshift cluster, and confidentiality of the data is protected in transit and at rest. Amazon Redshift offers comprehensive security controls and monitoring via native integration with AWS CloudTrail and Amazon CloudWatch (for more information, see Logging Amazon Redshift API calls with AWS CloudTrail and Monitoring Amazon Redshift using CloudWatch metrics, respectively). You can define and audit who has access to what and ensure the compliance requirements are met.
  • Extensibility – Once the data is accessible in AWS, you benefit from the breadth of native integrations Amazon Redshift supports. You can join datasets from other data stores in operational databases, build reports and dashboards with BI tools, or identify patterns and generate predictions using Amazon Redshift ML.

The following architecture diagram provides a quick overview of how data sharing works and how other AWS services can be used together. We dive deeper into different use cases in the following sections.

Accept datashares from Stripe

You can configure the solution in a few steps with no code necessary.

Once Stripe creates a datashare from the producer cluster and authorizes your AWS account, you can view this datashare on your Amazon Redshift console. You need to associate it with specific clusters, or all clusters, in your AWS account as the consumer. Clusters are identified by namespaces, which are globally unique identifiers. Next, you create a database from the datashare in order to start querying data, as shown in the following sketch.
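
The consumer-side SQL follows the standard datashare pattern. The following is a minimal sketch, assuming Stripe has already authorized your account; the datashare name, namespace GUID, and local database name are placeholders:

-- List datashares that are visible to this cluster.
SELECT * FROM svv_datashares;

-- Create a local database from the Stripe datashare so that you can query it.
-- The datashare name, namespace GUID, and database name below are placeholders.
CREATE DATABASE stripe_data FROM DATASHARE stripe_datashare
OF NAMESPACE '<stripe-producer-namespace-guid>';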

Query data from the consumer Amazon Redshift cluster

You can now access your Stripe data and schema directly from Amazon Redshift’s web-based query editor. This direct connection enables teams to run accurate analyses across various functions of the business. For example:

  • Finance – “How does my cash flow change based on seasonality?”
  • Sales – “How many customers do we have in the US?”
  • Product – “How many active users do we have on each subscription plan?”
  • Sales operations – “Which customers haven’t paid their invoices?”

The following screenshot shows an example in which the query editor displays the number of charges blocked per Stripe’s connected account.
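
A query along those lines might look like the following sketch. The schema, table, and column names are purely illustrative placeholders; refer to your Stripe schema for the actual names:

-- Hypothetical example: count blocked charges per connected account.
-- stripe_data.stripe.charges, connected_account_id, and outcome_type are illustrative names.
SELECT connected_account_id,
       COUNT(*) AS blocked_charges
FROM stripe_data.stripe.charges
WHERE outcome_type = 'blocked'
GROUP BY connected_account_id
ORDER BY blocked_charges DESC;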

Use federated queries

The modern data architecture of Amazon Redshift enables you to store data in purpose-built data stores based on specific use cases, and allows querying external databases on Amazon Relational Database Service (Amazon RDS) or datasets in an Amazon Simple Storage Service (Amazon S3) data lake without moving these datasets to Amazon Redshift clusters. You can dive deeper by incorporating data from Amazon RDS, or from an S3 data lake through Amazon Redshift Spectrum. This capability provides a native integration without requiring additional ETL jobs.

The following syntax allows you to create an external schema from an Amazon Aurora MySQL-Compatible Edition database to an Amazon Redshift cluster. Amazon Redshift assumes an AWS Identity and Access Management (IAM) role and uses AWS Secrets Manager to access external data stores. For more information and examples with other supported data stores, refer to Querying data with federated queries in Amazon Redshift.

CREATE EXTERNAL SCHEMA auroram
FROM MYSQL
DATABASE 'example_database' SCHEMA 'example_schema' -- schema is optional
URI 'hostname'
IAM_ROLE 'iam_role_arn'
SECRET_ARN 'aws_secrets_manager_arn';

Coming back to Stripe Data Pipeline, you can now combine the data from an Aurora table and create further analysis. For example, you can correlate customer acquisition trends against sales campaigns by region, so you can understand campaign effectiveness and adjust your marketing strategies.
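
The following sketch shows what such a query could look like, joining a hypothetical campaigns table in the federated auroram schema created above with an illustrative Stripe customers table shared into Amazon Redshift:

-- Sketch only: join federated Aurora data with Stripe data shared into Amazon Redshift.
-- The table and column names below are illustrative placeholders.
SELECT cmp.region,
       cmp.campaign_name,
       COUNT(DISTINCT cust.id) AS customers_acquired
FROM auroram.campaigns cmp
JOIN stripe_data.stripe.customers cust
  ON cust.acquisition_campaign_id = cmp.campaign_id
GROUP BY cmp.region, cmp.campaign_name
ORDER BY customers_acquired DESC;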

Create visualizations and dashboards

Now that your complete set of business data is accessible from Amazon Redshift, you can start to explore the data and create visualizations. Amazon QuickSight is a serverless BI service that allows you to easily connect to a data source, create analyses, publish dashboards, and share between teams. QuickSight seamlessly integrates with AWS services such as Amazon Redshift, Amazon S3, and many more.

The following screenshot illustrates how straightforward it is to connect an Amazon Redshift instance to QuickSight as a new data source.

The following screenshot is of a sample QuickSight dashboard pulling data from Amazon Redshift.

Key considerations

When using Stripe Data Pipeline, consider the following:

  • Instance type – This solution is available for all RA3 node types. If you run an existing DS2 or DC2 cluster, there are multiple options to migrate to RA3, including elastic resize, snapshot and restore, and classic resize. For more information, including an upgrade sizing reference between different node types, refer to Upgrading to RA3 node types.
  • RI migration – If you have Amazon Redshift Reserved Instances (RIs), you can use the RI migration feature to migrate the DS2 RI clusters to equivalent RA3 RI clusters as part of a cross-instance resize or cross-instance snapshot restore operation. The RA3 RI covering the new cluster will be the same cost and on the same calendar terms as the original DS2 RI for supported configurations.
  • Encryption – The consumer cluster must be encrypted as part of the enhanced security control for cross-account sharing. You can enable encryption at cluster creation time, or modify an unencrypted cluster with either AWS Key Management Service (AWS KMS) or AWS CloudHSM.
  • Federated queries – This capability works with external DB instances, including Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Aurora MySQL-Compatible Edition. You should also ensure that you have an Amazon Redshift cluster with a cluster maintenance version that supports federated queries.

Conclusion

In this post, we introduced Stripe Data Pipeline for Amazon Redshift and discussed options to further integrate with AWS services. Stripe Data Pipeline removes the need to build custom API integration or adopt a third-party ETL pipeline, making data accessible with a few clicks and with no code required. Businesses can automatically receive up-to-date data from Stripe in their data warehouse on AWS, reduce data silos, and extract deep insights to address business needs.

Check out Stripe Data Pipeline for more information about the solution and how to get started.


About the Authors

Jessica Ho is a Sr. Partner Solutions Architect at AWS supporting ISV partners who build business applications. She is passionate about creating differentiated solutions that promote cloud adoption. Outside of work, she enjoys spoiling her garden into a mini jungle.

Alexander Mahabir is a Sr. Partner Solutions Architect at AWS based in the D.C. metropolitan area. Alex has over 16 years of experience building cloud and on-premises solutions for small, medium, and large enterprises. Alex currently works with ISV partners in the Digital Customer Experience segment.

Use a linear learner algorithm in Amazon Redshift ML to solve regression and classification problems

Post Syndicated from Phil Bates original https://aws.amazon.com/blogs/big-data/use-a-linear-learner-algorithm-in-amazon-redshift-ml-to-solve-regression-and-classification-problems/

Amazon Redshift is the fastest, most widely used, fully managed, and petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Amazon Redshift ML, powered by Amazon SageMaker, makes it easy for SQL users such as data analysts, data scientists, and database developers to create, train, and deploy machine learning (ML) models using familiar SQL commands and then use these models to make predictions on new data for use cases such as churn prediction, customer lifetime value prediction, and product recommendations. Redshift ML makes the model available as a SQL function within the Amazon Redshift data warehouse so you can easily use it in queries and reports. Customers across all verticals are using Redshift ML to derive better insights from their data. For example, Jobcase uses Redshift ML to recommend job searches at scale. Magellan RX Management uses Redshift ML to predict drug therapeutic use conditions.

Amazon Redshift supports supervised learning, including regression, binary classification, multi-class classification, and unsupervised learning using K-Means. You can optionally specify XGBoost, MLP, and now linear learner model types, which are supervised learning algorithms used for solving either classification or regression problems, and provide a significant increase in speed over traditional hyperparameter optimization techniques. Amazon Redshift also supports bring-your-own-model to invoke remote SageMaker endpoints.

In this post, we show you how to use Redshift ML to solve regression and classification problems using the SageMaker linear learner algorithm, which explores different training objectives and chooses the best solution from a validation set.

Solution overview

We first solve a linear regression problem, followed by a multi-class classification problem.

The following table shows some common use cases and algorithms used.

Use Case Algorithm / Problem Type
Customer churn prediction Classification
Predict if a sales lead will close Classification
Fraud detection Classification
Price and revenue prediction Linear regression
Customer lifetime value prediction Linear regression

To use the linear learner algorithm, you need to provide inputs or columns representing dimensional values and also the label or target, which is the value you’re trying to predict. The linear learner algorithm trains many models in parallel, and automatically determines the most optimized model.

Prerequisites

To get started, we need an Amazon Redshift cluster or an Amazon Redshift Serverless endpoint and an AWS Identity and Access Management (IAM) role attached that provides access to SageMaker and permissions to an Amazon Simple Storage Service (Amazon S3) bucket.

For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

To create a simple cluster with a default IAM role, see Use the default IAM role in Amazon Redshift to simplify accessing other AWS services.

Use case 1: Linear regression

In this use case, we analyze the Abalone dataset to determine the relationship between physical measurements and the age of abalone. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a time-consuming task. We want to predict the age using different physical measurements, which are easier to obtain. The age of an abalone is (number of rings + 1.5) years.

Prepare the data

Load the Abalone dataset into Amazon Redshift using the following SQL. You can use the Amazon Redshift query editor v2 or your preferred SQL tool to run these commands.

To create the table, use the following commands:

create table abalone_dataset
( 
id INT IDENTITY(1,1),
Sex     CHAR(1),
Length float,
Diameter float,
Height float,
Whole  float,
Shucked  float,
Viscera  float,
Shell float,
Rings  integer   
); 

To load data into Amazon Redshift, use the following COPY command:

COPY abalone_dataset 
from 's3://redshift-ml-multiclass/abalone.csv' 
REGION 'us-east-1'
IAM_ROLE default
CSV IGNOREHEADER 1
NULL AS 'NULL'; 

To train the model, we use 80% of the data in the abalone table, and then test the accuracy of that model by seeing if it correctly predicts the Rings label attribute on the remaining 20% of the data. Run the following commands to create the training and validation tables:

create table abalone_training as
  select * from abalone_dataset where mod(id,10) < 8 ; 

create table abalone_validation as
  select * from abalone_dataset where mod(id,10) >= 8; 

Create a model in Redshift ML

To create a model in Amazon Redshift, use the following command:

CREATE MODEL model_abalone_ring_prediction
FROM (
 SELECT   Sex ,
Length ,
Diameter ,
Height ,
Whole  ,
Shucked  ,
Viscera  ,
Shell,
Rings  as target_label
FROM abalone_training
) TARGET target_label
FUNCTION f_abalone_ring_prediction
IAM_ROLE default
MODEL_TYPE LINEAR_LEARNER 
PROBLEM_TYPE REGRESSION
OBJECTIVE 'MSE'
SETTINGS (
  S3_BUCKET '<your-s3-bucket>',
  MAX_RUNTIME 15000
  ); 

We define the following parameters in the CREATE MODEL statement:

  • Problem type – We use the linear learner problem type, which is newly added to extend upon typical linear models by training many models in parallel, in a computationally efficient manner.
  • Objective – We specified MSE (mean square error) as our objective, which is a common metric for evaluation of regression problems.
  • Max runtime – This parameter denotes how long the model training can run. Specifying a larger value may help create a better-tuned model. The default value for this parameter is 5400 (90 minutes). For this example, we set it to 15000.

The preceding statement takes a few seconds to complete. It initiates an Amazon SageMaker Autopilot process in the background to automatically build, train, and tune the best ML model for the input data. It then uses Amazon SageMaker Neo to deploy that model locally in the Amazon Redshift cluster or Amazon Redshift Serverless as a user-defined function (UDF). You can use the SHOW MODEL command in Amazon Redshift to track the progress of your model creation, which should be in the READY state within the max_runtime parameter you defined while creating the model.

To check the status of the model, use the following command:

show model model_abalone_ring_prediction;

The following is the tabular outcome for the preceding command after model training was done. It took approximately 120 minutes to train the model.

Key Value
Model Name model_abalone_ring_prediction
Schema Name public
Owner awsuser
Creation Time Tue, 10.05.2022 19:42:33
Model State READY
validation:mse 4.082088
Estimated Cost 5.423719
. .
TRAINING DATA: .
Query SELECT SEX , LENGTH , DIAMETER , HEIGHT , WHOLE , SHUCKED , VISCERA , SHELL, RINGS AS TARGET_LABEL
. FROM ABALONE_TRAINING
Target Column TARGET_LABEL
. .
PARAMETERS: .
Model Type linear_learner
Problem Type Regression
Objective MSE
AutoML Job Name redshiftml-20220510194233380173
Function Name f_abalone_ring_prediction
Function Parameters sex length diameter height whole shucked viscera shell
Function Parameter Types bpchar float8 float8 float8 float8 float8 float8 float8
IAM Role default-aws-iam-role
S3 Bucket poc-generic-bkt
Max Runtime 15000

Model validation

We notice from the preceding table that the MSE for the training data is 4.08. Now let’s run the prediction query and validate the accuracy of the model on the testing and validation dataset:

select
ROUND(AVG(POWER(( tgt_label - predicted ),2)),2) mse
, ROUND(SQRT(AVG(POWER(( tgt_label - predicted ),2))),2) rmse
from 
(
SELECT   Sex ,
Length ,
Diameter ,
Height ,
Whole  ,
Shucked  ,
Viscera  ,
Shell,
Rings as tgt_label,
f_abalone_ring_prediction(Sex ,Length ,Diameter ,Height ,Whole  ,Shucked  ,Viscera  ,Shell) as predicted,
case when tgt_label = predicted then 1
      else 0 end as match,
  case when tgt_label  <> predicted then 1
    else 0 end as nonmatch
FROM abalone_validation
)t1 

The following is the outcome from the query:

mse rmse
5.08 2.25

The MSE value from the preceding query indicates that our model’s predictions are close to the actual values in our validation dataset.
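
Because the age of an abalone is the number of rings plus 1.5, you can also turn the model output directly into an age estimate. The following is a minimal sketch using the prediction function created earlier:

-- Compare the actual age (rings + 1.5) with the age estimated from the predicted ring count.
select id,
       rings + 1.5 as actual_age,
       f_abalone_ring_prediction(sex, length, diameter, height, whole, shucked, viscera, shell) + 1.5 as predicted_age
from abalone_validation
limit 10;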

We can also observe that Redshift ML is able to identify the right combination of features to come up with a usable prediction model. We can further check each attribute’s contribution and weight in the model’s decisions using the following command:

select explain_model('model_abalone_ring_prediction');

The following is the outcome, where each attribute’s weight represents its role in the model’s decision-making:

{"explanations":
    {"kernel_shap":
      {"label0":
        {"expected_value":10.06938362121582,
          "global_shap_values":
          {
            "diameter":0.6897614190439705,
            "height":0.38391156156643987,
            "length":0.29646334630067408,
            "sex":0.5516722137411109,
            "shell":1.5679368167990147,
            "shucked":2.222549468867254,
            "viscera":0.2879883139361144,
            "whole":0.8085603201751219
            }
        }
       }
      }
      ,"version":"1.0"
    }

Use case 2: Multi-class classification

For this use case, we use the Covertype dataset (copyright Jock A. Blackard and Colorado State University), which contains information collected by the US Geological Survey and the US Forest Service about wilderness areas in northern Colorado. This has been downloaded to an S3 bucket to make it simple to create the model. You may want to download the dataset description. This dataset contains various measurements such as elevation, distance to waters and roadways, as well as the wilderness area designation and the soil type. Our ML task is to create a model to predict the cover type for a given area.

Prepare the data

To prepare the data for this model, you need to create and populate the table public.covertype_data in Amazon Redshift using the Covertype dataset. You may use the following SQL in Amazon Redshift query editor v2 or your preferred SQL tool:

CREATE TABLE public.covertype_data (
    elevation bigint ENCODE az64,
    aspect bigint ENCODE az64,
    slope bigint ENCODE az64,
    horizontal_distance_to_hydrology bigint ENCODE az64,
    vertical_distance_to_hydrology bigint ENCODE az64,
    horizontal_distance_to_roadways bigint ENCODE az64,
    hillshade_9am bigint ENCODE az64,
    hillshade_noon bigint ENCODE az64,
    hillshade_3pm bigint ENCODE az64,
    horizontal_distance_to_fire_points bigint ENCODE az64,
    wilderness_area1 bigint ENCODE az64,
    wilderness_area2 bigint ENCODE az64,
    wilderness_area3 bigint ENCODE az64,
    wilderness_area4 bigint ENCODE az64,
    soil_type1 bigint ENCODE az64,
    soil_type2 bigint ENCODE az64,
    soil_type3 bigint ENCODE az64,
    soil_type4 bigint ENCODE az64,
    soil_type5 bigint ENCODE az64,
    soil_type6 bigint ENCODE az64,
    soil_type7 bigint ENCODE az64,
    soil_type8 bigint ENCODE az64,
    soil_type9 bigint ENCODE az64,
    soil_type10 bigint ENCODE az64,
    soil_type11 bigint ENCODE az64,
    soil_type12 bigint ENCODE az64,
    soil_type13 bigint ENCODE az64,
    soil_type14 bigint ENCODE az64,
    soil_type15 bigint ENCODE az64,
    soil_type16 bigint ENCODE az64,
    soil_type17 bigint ENCODE az64,
    soil_type18 bigint ENCODE az64,
    soil_type19 bigint ENCODE az64,
    soil_type20 bigint ENCODE az64,
    soil_type21 bigint ENCODE az64,
    soil_type22 bigint ENCODE az64,
    soil_type23 bigint ENCODE az64,
    soil_type24 bigint ENCODE az64,
    soil_type25 bigint ENCODE az64,
    soil_type26 bigint ENCODE az64,
    soil_type27 bigint ENCODE az64,
    soil_type28 bigint ENCODE az64,
    soil_type29 bigint ENCODE az64,
    soil_type30 bigint ENCODE az64,
    soil_type31 bigint ENCODE az64,
    soil_type32 bigint ENCODE az64,
    soil_type33 bigint ENCODE az64,
    soil_type34 bigint ENCODE az64,
    soil_type35 bigint ENCODE az64,
    soil_type36 bigint ENCODE az64,
    soil_type37 bigint ENCODE az64,
    soil_type38 bigint ENCODE az64,
    soil_type39 bigint ENCODE az64,
    soil_type40 bigint ENCODE az64,
    cover_type bigint ENCODE az64 
)
DISTSTYLE AUTO;	


Copy public.covertype_data
From 's3://redshift-ml-multiclass/covtype.data.gz'
iam_role default 
gzip
delimiter ','
region 'us-east-1';

Now that our dataset is loaded, we run the following SQL statements to split the data into three sets for training (80%), validation (10%), and prediction (10%). Note that Redshift ML Autopilot automatically splits the data into training and validation, but by splitting it here, you’re able to verify the accuracy of your model.

To prepare the dataset, assign random values to split the data:

create table  public.covertype_data_prep  as
select a.*,
cast (random() * 100 as int) as data_group_id
from public.covertype_data a;

Use the following code for the training set:

Create table public.covertype_training as
Select * from public.covertype_data_prep
Where data_group_id < 80;

Use the following code for the validation set:

Create table public.covertype_validation as
Select * from public.covertype_data_prep
Where data_group_id between 80 and 89;

Use the following code for the test set:

Create table public.covertype_test as
Select * from public.covertype_data_prep
Where data_group_id > 89; 

Now that we have our datasets, it’s time to create the model.

Create a model in Redshift ML using linear learner

Run the following SQL command to create your model—note our target is cover_type and we use all the inputs from our training set:

CREATE  MODEL forest_cover_type_model
FROM (select Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
  Cover_type from public.covertype_training)
TARGET cover_type
FUNCTION predict_cover_type
IAM_ROLE default
MODEL_TYPE LINEAR_LEARNER              
PROBLEM_TYPE MULTICLASS_CLASSIFICATION 
 OBJECTIVE 'Accuracy'
SETTINGS (
  S3_BUCKET '<your-amazon-s3-bucket-name>',
  S3_GARBAGE_COLLECT OFF,
  MAX_RUNTIME 9600
) ;

You can use the SHOW MODEL command to view the status of the model.
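
For example, run the following against the model you just created:

show model forest_cover_type_model;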

You can see that the model has an accuracy score of .730279 and is in the READY state. Now let’s run a query to do some validation of our own.

Model validation

Run the following SQL query against the validation table, using the function created by our model:

select 
 cast(sum(t1.match)as decimal(7,2)) as predicted_matches
,cast(sum(t1.nonmatch) as decimal(7,2)) as predicted_non_matches
,cast(sum(t1.match + t1.nonmatch) as decimal(7,2))  as total_predictions
,predicted_matches / total_predictions as pct_accuracy
from 
(select 
Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
   Cover_type as actual_cover_type,
   predict_cover_type( Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40) as predicted_cover_type,
case when actual_cover_type = predicted_cover_type then 1
      else 0 end as match,
  case when actual_cover_type <>  predicted_cover_type then 1
    else 0 end as nonmatch

from public.covertype_validation
) t1;

You can see that our accuracy is very close to our score from the SHOW MODEL output.

Run a prediction query

Let’s run a prediction query in Amazon Redshift ML using our function against our test dataset to see the most common class of cover type for the Neota Wilderness Area. We can identify this area by checking wilderness_area2 for a value of 1.

The dataset includes the following wilderness areas:

  1. Rawah Wilderness Area
  2. Neota Wilderness Area
  3. Comanche Peak Wilderness Area
  4. Cache la Poudre Wilderness Area

The cover types are in seven different classes:

  1. Spruce/Fir
  2. Lodgepole Pine
  3. Ponderosa Pine
  4. Cottonwood/Willow
  5. Aspen
  6. Douglas Fir
  7. Krummholz

There are also 40 different soil type definitions, which you can see in the dataset description, with a value of 0 or 1 to note if it’s applicable for a particular row. The following are a few example soil types:

  1. Cathedral family – Rock outcrop complex, extremely stony
  2. Vanet-Ratake families – Rock outcrop complex, very stony
  3. Haploborolis family – Rock outcrop complex, rubbly
  4. Ratake family – Rock outcrop complex, rubbly
  5. Vanet family – Rock outcrop complex, rubbly
  6. Vanet-Wetmore families – Rock outcrop complex, stony
The following query returns the count of each predicted cover type in the Neota Wilderness Area:

select t1.predicted_cover_type, count(*) 
from 
(     
select 
   Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
   predict_cover_type( Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10,
   Soil_Type11,  
   Soil_Type12, 
   Soil_Type13,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40) as predicted_cover_type 

from public.covertype_test
where wilderness_area2 = 1)
t1
group by 1;

Our model has predicted that the majority of the cover type is Spruce/Fir.

You can experiment with various combinations, such as determining which soil types are most likely to occur in a predicted cover type.
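
For example, the following sketch counts a few soil type indicators by cover type in the test set. For brevity, it groups on the actual cover_type label, but you can substitute the predict_cover_type call from the previous query; the soil type columns shown were chosen arbitrarily:

-- Sketch: occurrences of a few (arbitrarily chosen) soil types per cover type in the test set.
select cover_type,
       sum(soil_type10) as soil_type10_count,
       sum(soil_type29) as soil_type29_count,
       sum(soil_type38) as soil_type38_count
from public.covertype_test
group by cover_type
order by cover_type;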

Conclusion

Redshift ML makes it easy for users of all levels to create, train, and tune models using a SQL interface. In this post, we walked you through how to use the linear learner algorithm to create regression and multi-class classification models. You can then use those models to make predictions using simple SQL commands and gain valuable insights.

To learn more about Redshift ML, visit Amazon Redshift ML.


About the Authors

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Sohaib Katariwala is an Analytics Specialist Solutions Architect at AWS. He has 12+ years of experience helping organizations derive insights from their data.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Jiayuan Chen is a Senior Software Development Engineer at AWS. He is passionate about designing and building data-intensive applications, and has been working in the areas of data lake, query engine, ingestion, and analytics. He keeps up with latest technologies and innovates things that spark joy.

Debu Panda, a Senior Manager of Product Management at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of EJB 3 in Action (Manning Publications, 2007 and 2014) and Middleware Management (Packt).

Federate single sign-on access to Amazon Redshift query editor v2 with Okta

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/federate-single-sign-on-access-to-amazon-redshift-query-editor-v2-with-okta/

Amazon Redshift query editor v2 is a web-based SQL client application that you can use to author and run queries on your Amazon Redshift data warehouse. You can visualize query results with charts and collaborate by sharing queries with members of your team. You can use query editor v2 to create databases, schemas, tables, and load data from Amazon Simple Storage Service (Amazon S3) using the COPY command or by using a wizard. You can browse multiple databases and run queries on your Amazon Redshift data warehouse or data lake, or run federated queries to operational databases such as Amazon Aurora. Because it’s a managed SQL editor in your browser and it’s integrated with your single sign-on (SSO) provider, Amazon Redshift query editor v2 reduces the number of steps to the first query so you gain insights faster.

Amazon Redshift query editor v2 integration with your identity provider (IdP) automatically redirects the user’s browser to the query editor v2 console instead of the Amazon Redshift console. This enables your users to easily access Amazon Redshift clusters through query editor v2 using federated credentials without managing database users and passwords.

In this post, we focus on Okta as the IdP and illustrate how to set up your Okta application and AWS Identity and Access Management (IAM) permissions. We also demonstrate how you can limit the access for your users to use only query editor v2 without giving them access to perform any admin functions on the AWS Management Console.

Solution overview

The high-level steps in this post are as follows:

  1. Set up Okta, which contains your users organized into logical groups and AWS account federation application.
  2. Set up two IAM roles: one that establishes a trust relationship between your IdP and AWS, and a second role that Okta uses to access Amazon Redshift.
  3. Complete Okta advanced configuration:
    1. Finalize the Okta configuration by inputting the roles that you just created.
    2. Define a default relay state to direct users to Amazon Redshift query editor v2 after successful SAML authentication.
    3. Configure the SAML PrincipalTagAttribute. This element allows you to pass attributes as session tags in the SAML assertion. For more information about session tags, see Passing session tags in AWS STS.
  4. Set up Amazon Redshift database groups:
    1. Create groups within the Amazon Redshift database to match the Okta groups.
    2. Authorize these groups to access certain schemas and tables.
    3. Access Amazon Redshift query editor v2 using your enterprise credentials and query your Amazon Redshift database.
  5. Sign in to your Okta account and access the application assigned to you. The application directs you to Amazon Redshift query editor v2 using federated login.
  6. Access and query your Amazon Redshift database.

Prerequisites

This post assumes that you have the following prerequisites:

Set up Okta

First, we set up the Okta application and create users and groups. Complete the following steps:

  1. Log in to your Okta admin console using the URL https://<prefix>-admin.okta.com/admin/dashboard, where <prefix> is specific to your account and was created at account setup.
  2. On the admin console, choose Admin.
  3. Under Directory in the navigation pane, choose People.
  4. To add users, choose Add person.
    The following screenshot shows the users that we created.
  5. To add groups into Okta, choose Groups in the navigation pane, then choose Add group.
    The following screenshot shows two groups that we created. We added Jane to analyst_users and Mike to bi_users.
  6. Under Applications in the navigation pane, choose Applications and choose Browse App Catalog.
  7. Search for AWS Account Federation and choose Add.
  8. After you add the application, choose AWS Account Federation.
  9. Leave the values in General Settings at their default and choose Next.
  10. Under Sign-On Options, select SAML 2.0.
  11. Choose the Identity Provider metadata link to download the metadata file in .xml format.

Configure IAM roles

Next, you set up an IAM role that establishes a trust relationship between the IdP and AWS. You also create an IAM role that Okta uses to access Amazon Redshift query editor v2.

  1. On the IAM console, under Access management in the navigation pane, choose Identity providers.
  2. Choose Add provider.
  3. For Provider type, select SAML.
  4. For Provider name, enter a name.
  5. Choose Choose file and upload the metadata file you downloaded.
  6. Choose Add provider.

    Now you create the IAM SAML 2.0 federation role.
  7. On the IAM console, choose Roles in the navigation pane.
  8. Choose Create role.
  9. For Trusted entity type, select SAML 2.0 federation.
  10. For SAML 2.0-based provider, choose the IdP you created in the previous step.
  11. Select Allow programmatic and AWS Management Console access.
  12. Choose Next and then choose Create Policy.

Amazon Redshift query editor v2 provides multiple managed policies to access the query editor. For a list of managed policies, refer to Configuring your AWS account. The managed policy enables you to limit the access for your users to use only query editor v2 without giving them access to perform any admin functions on the console. For this post, we use the AmazonRedshiftQueryEditorV2ReadSharing managed policy and create a custom policy.

In the following code, provide your Region, account, and cluster parameters to grant access to Amazon Redshift to get cluster credentials, create users, and allow users to join groups:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RedshiftClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials",
                "redshift:CreateClusterUser",
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<cluster>,                 
                "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${aws:PrincipalTag/RedshiftDbUser}",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/analyst_users",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/bi_users",
                "arn:aws:redshift:<region>:<account>:dbname:<cluster>/${redshift:DBName}"
            ]
        }
    ]
}

The group membership lasts only for the duration of the user session. Additionally, there is no CreateGroup permission because groups need to be manually created and granted DB privileges.

  1. Attach the policy you created to the role.
    The following screenshot shows the summary page for the role.
  2. Modify the trust relationships for your role and add sts:TagSession permission.
    When using session tags, trust policies for all roles connected to the IdP passing tags must have the sts:TagSession permission. For roles without this permission in the trust policy, the AssumeRole operation fails.
  3. Choose Update policy.

Set up Okta advanced configuration

In this section, you finalize the Okta configuration by adding the IAM roles that you just created. You set up SAML PrincipalTag attributes such as RedshiftDbUser and RedshiftDbGroups, which are passed in the SAML assertion to federate the access to Amazon Redshift query editor v2. You also define a default relay state, which is the URL that users are directed to after a successful authentication through SAML.

  1. In your Okta account, open the AWS Account Federation app.
  2. On the Sign On tab, set Default Relay State to the query editor URL, using the format https://<region>.console.aws.amazon.com/sqlworkbench/home. For this post, we use https://eu-west-1.console.aws.amazon.com/sqlworkbench/home.
  3. Choose Attributes and set up the following attribute mappings:
    1. Set the DB user using PrincipalTag:RedshiftDbUser. This uses the user name in the directory. This is a required tag and defines the database user that is used by query editor v2.
    2. Set the DB groups using PrincipalTag:RedshiftDbGroups. This uses the Okta groups to fill the principal tags. Its value must be a colon-separated list.
    3. Set the transitive keys using TransitiveTagKeys. This prevents users from changing the session tags in case of role chaining.

These tags are forwarded to the redshift:GetClusterCredentials API to get credentials for your cluster. The following table summarizes their attribute configuration.

Name Name Format Value
https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbUser Unspecified user.username
https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbGroups Unspecified String.join(":", isMemberOfGroupName("analyst_users") ? 'analyst_users' : '', isMemberOfGroupName("bi_users") ? 'bi_users' : '')
https://aws.amazon.com/SAML/Attributes/TransitiveTagKeys Unspecified Arrays.flatten("RedshiftDbUser", "RedshiftDbGroups")

  1. Under Advanced Sign-on Settings, select Use Group Mapping.
  2. Enter the IdP and IAM role ARNs, which are globally unique, and make sure that Okta is directed to your AWS account.
  3. Authorize users to use the AWS Account Federation application by selecting their respective groups or individual user accounts. In this example, we authorized users by group.

Set up Amazon Redshift database groups

Next, you set up groups in the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables.

  1. Log in to your Amazon Redshift cluster with an admin account.
  2. Create groups that match the IdP group names, and grant the appropriate permissions to tables and schemas:
CREATE GROUP analyst_users;
CREATE GROUP bi_users;

ALTER DEFAULT PRIVILEGES IN SCHEMA sales_analysis
GRANT SELECT on TABLES to GROUP analyst_users;
GRANT USAGE on SCHEMA sales_analysis to GROUP analyst_users;
GRANT SELECT on ALL TABLES in SCHEMA sales_analysis to GROUP analyst_users;

ALTER DEFAULT PRIVILEGES IN SCHEMA sales_bi
GRANT SELECT on TABLES to GROUP bi_users;
GRANT USAGE on SCHEMA sales_bi to GROUP bi_users;
GRANT SELECT on ALL TABLES in SCHEMA sales_bi to GROUP bi_users;

In Okta, you created the user Jane and assigned Jane to the analyst_users group.

In the Amazon Redshift database, you created two database groups: analyst_users and bi_users.

When user Jane logs in via federated authentication to Amazon Redshift using query editor v2, the user is created if it doesn’t already exist, and the analyst_users database group is assumed for the session. User Jane can query tables only in the sales_analysis schema.

Because user Jane isn’t part of the bi_users group, when they try to access the sales_bi schema, they get a permission denied error.

The following diagram illustrates this configuration.

Access Amazon Redshift query editor v2

Now you’re ready to connect to your Amazon Redshift cluster with Amazon Redshift query editor v2 using federated login. Log in to your Okta account with your user credentials and under My Apps, choose Amazon Redshift Query Editor V2.

You’re redirected to the Amazon Redshift query editor v2 URL that you specified as the default relay state.

Connect to an Amazon Redshift database and run queries

Now let’s set up the connection to your Amazon Redshift cluster.

  1. In the query editor, choose your cluster (right-click) and choose Create connection.
  2. For Database, enter a name.
  3. For Authentication, select Federated user.
    The user name is prepopulated with your federated login information.
  4. Choose Create connection.

When you’re connected to your Amazon Redshift database, you can verify the connection details, as shown in the following screenshot. Notice the session-level group association as per the group assignment in your Okta application configuration. In this case, user Jane is assigned to the analyst_users group.

This user has access to SELECT all tables in the sales_analysis schema and no access to the sales_bi schema. You can run the following statements to test your access.
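
For example (the sales_bi table name below is hypothetical, because no table in that schema is shown in this post):

SELECT * FROM sales_analysis.store_sales_us LIMIT 10;   -- succeeds for members of analyst_users
SELECT * FROM sales_bi.sales_summary LIMIT 10;          -- fails with a permission denied error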

The following screenshot shows the results from a query to the sales_analysis.store_sales_us table.

When user Jane tries to access the tables in the sales_bi schema, they get a permission denied error.

Summary

In this post, we demonstrated how to federate SSO access to Amazon Redshift query editor v2 using Okta as your IdP. We showed how to set up Okta, set different PrincipalTag attributes for query editor v2, and pass group memberships defined in your Okta IdP to your Amazon Redshift cluster. We showed how to log in to Amazon Redshift query editor v2 using federated login and validate the configuration by running a few queries. This solution allows you to control access to Amazon Redshift database objects, and your users can easily access Amazon Redshift clusters through query editor v2 using federated credentials without managing database users and passwords.

If you have any feedback or questions, please leave them in the comments.


About the Authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 16 years of experience in the data warehousing and analytical space.

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data and analytics for over 14 years. His LinkedIn profile can be found here.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Yanis Telaoumaten is a Software Development Engineer at AWS. His passions are building reliable software and creating tools to allow other engineers to work more efficiently. In the past years, he has worked on the identity, security, and reliability of Amazon Redshift services.

Federate access to Amazon Redshift query editor V2 with Active Directory Federation Services (AD FS): Part 3

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/federate-access-to-amazon-redshift-query-editor-v2-with-active-directory-federation-services-ad-fs-part-3/

In the first post of this series, Federate access to your Amazon Redshift cluster with Active Directory Federation Services (AD FS): Part 1, you set up Microsoft Active Directory Federation Services (AD FS) and Security Assertion Markup Language (SAML) based authentication and tested the SAML federation using a web browser.

In Part 2, you learned to set up an Amazon Redshift cluster and use federated authentication with AD FS to connect from a JDBC SQL client tool.

In this post, we walk through the steps to configure Amazon Redshift query editor v2 to work with AD FS federation SSO.

Organizations want to enable their end-users such as data analysts, data scientists, and database developers to use the query editor v2 to accelerate self-service analytics. Amazon Redshift query editor v2 lets users explore, analyze, and collaborate on data. You can use the query editor to create databases, schemas, tables, and load data from Amazon Simple Storage Service (Amazon S3) using the COPY command or by using a wizard. You can browse multiple databases and run queries on your Amazon Redshift data warehouse or data lake, or run federated queries to operational databases such as Amazon Aurora.

In this post, we show how you can use your corporate Active Directory (AD) and the SAML 2.0 AD FS identity provider (IdP) to enable your users to easily access Amazon Redshift clusters through query editor v2 using corporate user names without managing database users and passwords. We also demonstrate how you can limit the access for your users to use only the query editor without giving them access to perform any admin functions on the AWS Management Console.

Solution overview

After you follow the steps explained in Part 1, you set up a deep link for federated users via the SAML 2.0 RelayState parameter in AD FS. You use the user you set up in your AD in Part 1 (Bob) to authenticate using AD FS and control access to database objects based on the group the user is assigned to. You also test if user Bob is integrated with Amazon Redshift database groups as controlled in AD groups.

By the end of this post, you will have created a unique deep link that authenticates the user Bob using AD FS and redirects them directly to the query editor v2 console, where they’re authenticated using the federation SSO option.

The sign-in process is as follows:

  1. The user chooses a deep link that redirects to the IdP for authentication, with the destination URL (query editor v2, in our case) embedded in the RelayState parameter. The user enters their credentials on the login page.
  2. Your IdP (AD FS in this case) verifies the user’s identity in your organization.
  3. Your IdP generates a SAML authentication response that includes assertions that identify the user and attributes about the user. The IdP sends this response to the user’s browser.
  4. The user’s browser is redirected to the AWS Single Sign-On endpoint and posts the SAML assertion and the RelayState parameter.
  5. The endpoint calls the AssumeRoleWithSAML API action to request temporary credentials from the AWS Identity and Access Management (IAM) role specified in the SAML assertion and creates a query editor v2 console sign-in URL that uses those credentials. The IAM role trusts the SAML federation entity and also has a policy that has access to query editor V2. If the SAML authentication response includes attributes that map to multiple IAM roles, the user is first prompted to choose the role to use for access to the query editor v2 console. The sign-in URL is the one specified by the RelayState parameter.
  6. AWS sends the sign-in URL back to the user’s browser as a redirect.
  7. The user’s browser is redirected to the Amazon Redshift query editor v2 console defined by the RelayState parameter.

The following diagram illustrates this flow.

In this post, we walk you through the following steps:

  1. Set up the Sales group in AD and set up the PrincipalTag claim rules in AD FS.
  2. Update the IAM roles.
  3. Construct the SSO URL to authenticate and redirect users to the Amazon Redshift query editor v2 console.
  4. Set up Amazon Redshift database groups and permissions on the Amazon Redshift cluster.
  5. Set up Amazon Redshift query editor v2 to use federated authentication with AD FS to connect directly from the query editor interface.
  6. Query Amazon Redshift objects to validate your authorization.

Prerequisites

For this walkthrough, complete the following prerequisite steps:

  1. Create an Amazon Redshift cluster. For instructions, refer to Create a sample Amazon Redshift cluster or complete the steps in Part 2 of this series.
  2. Complete the steps in Part 1 to set up SAML federation with AD FS:
    1. Set up an AD domain controller using an AWS CloudFormation template on a Windows 2016 Amazon Elastic Compute Cloud (Amazon EC2) instance.
    2. Configure federation in AD FS.
    3. Configure AWS as the relying party with AD FS using an IAM SAML provider and SAML roles with an attached policy to allow access to the Amazon Redshift cluster.
    4. Configure claim rules.
    5. Test the SAML authentication using a web browser.
  3. Verify that your IdP supports RelayState and is enabled. If you’re using AD FS 2.0, you need to download and install either Update Rollup 3 or Update Rollup 2 from Microsoft to enable the RelayState parameter.

Configure AD and AD FS

After you configure your AD FS and AD services by following the instructions in Part 1, you can set up the following AD group and claim rules.

In this post, you use the user Bob to log in to Amazon Redshift and check if Bob can access the Sales and Marketing schemas on the Amazon Redshift cluster. To create the sales group and assign the user bob@adfsredshift.com to it, log in to your AD FS server (Amazon EC2 machine) that you created in Part 1 and use the Windows command tool to run the following command:

dsadd group "cn=RSDB-sales, cn=Users, dc=adfsredshift, dc=com" -members "cn=Bob, cn=Users, dc=adfsredshift, dc=com"

Now you’re ready to create your custom claim rules: PrincipalTag:RedshiftDbUser and PrincipalTag:RedshiftDbGroup.

PrincipalTag:RedshiftDbUser

The custom claim rule PrincipalTag:RedshiftDbUser is mapped to the user principal name in AD FS. When a user authenticates through federated SSO, this claim rule is mapped to the user name. If the user doesn’t exist in the Amazon Redshift database, the user is automatically created. The auto create option is granted through an IAM policy that is attached to the IAM role; the CreateClusterUser permission allows for auto creation of the user (you set this up in Part 1 as a prerequisite).

Complete the following steps to create your custom claim rule:

    1. On the AD FS management console, choose Relying Party Trusts.
    2. Choose Edit Claim Issuance Policy.
    3. Choose Add Rule.
    4. For Claim rule template, choose Send Claims Using a Custom Rule.
    5. Choose Next.
    6. For Claims rule name, enter RedshiftDbUser.
    7. Add the following custom rule:
      c:[Type == "http://schemas.microsoft.com/ws/2008/06/identity/claims/windowsaccountname", Issuer == "AD AUTHORITY"]
       => issue(store = "Active Directory", types = ("https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbUser"), query = ";userPrincipalName;{0}", param = c.Value);

    8. Choose Finish.
    9. Capture the claim rules sent in a SAML assertion response through your browser. For instructions, refer to How to view a SAML response in your browser for troubleshooting.

In my example, I use the following SAML attribute for the RedshiftDbUser PrincipalTag:

<Attribute Name="https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbUser"> <AttributeValue>bob@adfsredshift.com</AttributeValue>
</Attribute>

PrincipalTag:RedshiftDbGroup

The custom claim rule PrincipalTag:RedshiftDbGroup is built from the AD groups that the user is a member of. This rule is mapped to the Amazon Redshift database groups, so the AD group names and the Amazon Redshift database group names should match. The JoinGroup permission set in the IAM policy allows the user to assume a database group; the membership is session based. If the user is a member of multiple AD groups, the SAML assertion response should send those groups as a single colon-separated value and not as a multi-value claim. The following steps demonstrate how to send AD groups as colon-separated values.

In this example, the user Bob is assigned to the marketing and sales groups. The following code shows how to send multiple groups through the SAML response when the user is in multiple groups, and also how to handle the situation when a user doesn’t exist in any particular group.

  1. Follow the same steps as in the previous section to create the rule Marketing, using the following code for the custom rule:
    c:[Type == "http://temp/variable", Value =~ "RSDB-marketing"] => add(Type = "http://temp/marketing", Value = c.Value);

  2. Create the rule MarketingNotExists using the following code:
    NOT EXISTS([Type == "http://temp/variable", Value =~ "RSDB-marketing"]) => add(Type = "http://temp/marketing", Value = ""); 

  3. Create the rule sales using the following code:
    c:[Type == "http://temp/variable", Value =~ "RSDB-sales"] => add(Type = "http://temp/sales", Value = c.Value);

  4. Create the rule SalesNotExists using the following code:
    NOT EXISTS([Type == "http://temp/variable", Value =~ "RSDB-sales"])
     => add(Type = "http://temp/sales", Value = ""); 

  5. Create the rule RedshiftDbGroups using the following code:
    c:[Type == "http://temp/marketing"]
     && c2:[Type == "http://temp/sales"]
     => issue(Type = "https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbGroups", Value = c.Value + ":" + c2.Value);

The following screenshot shows the list of rules that I created in my AD FS. Note the number of rules and the order in which they’re positioned. We created rules 6–11 as part of this post.

If you see a similar SAML response for RedshiftDbGroups, your setup is good:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbGroups"> <AttributeValue> marketing:sales</AttributeValue>

If a user doesn’t exist in one of the groups, an empty value is passed to the claim rule. For example, if user Bob is removed from the marketing group, the SAML response for PrincipalTag:RedshiftDbGroups would be :sales.

Update IAM roles

In Part 1 of this series, you created two IAM roles: ADFZ-Dev and ADFZ-Production. These two roles aren’t yet set up with grants on the query editor. In this section, you update these roles with query editor permissions.

Amazon Redshift query editor v2 provides multiple managed policies to access the query editor. For a list of all the managed policies, refer to Configuring your AWS account. For this post, we attach the AmazonRedshiftQueryEditorV2ReadSharing managed policy to the roles.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose the role ADFZ-Dev.
  3. Choose Add permissions and then Attach policies.
  4. Under Other permission policies, search for the AmazonRedshiftQueryEditorV2ReadSharing managed policy.
  5. Select the policy and choose Attach policies.
  6. Modify the trust relationship for your role to add the sts:TagSession permission. With the role open, choose Trust relationships, then choose Edit trust policy. When using session tags, the trust policy of every role to which the IdP passes tags must include the sts:TagSession permission; for roles without this permission in the trust policy, the AssumeRole operation fails.
  7. Choose Update policy.
  8. Repeat these steps to attach the AmazonRedshiftQueryEditorV2ReadSharing managed policy to the ADFZ-Production role.

Limit user access to only the query editor

If you want to limit users to accessing only the query editor, update the policy redshift-marketing that you created in the Part 1 blog post as shown below.

Note: After this update, users lose admin privileges such as creating clusters.

Replace the Region, account, and cluster parameters. This custom policy grants access to Amazon Redshift to get cluster credentials, create users, and allow users to join groups.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RedshiftClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials",
                "redshift:CreateClusterUser",
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<cluster>,
                "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${aws:PrincipalTag/RedshiftDbUser}",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/marketing",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/sales",
                "arn:aws:redshift:<region>:<account>:dbname:<cluster>/${redshift:DBName}"
            ]
        }
    ]
}

There are a few important things to note:

  1. The group membership lasts only for the duration of the user session.
  2. There is no CreateGroup permission because groups need to be manually created and granted DB privileges.

Generate the SSO destination URL as the Amazon Redshift query editor v2 console

In this step, you construct the sign-in URL for the AD FS IdP to redirect users to the Amazon Redshift query editor v2 console. For instructions, refer to How to Use SAML to Automatically Direct Federated Users to a Specific AWS Management Console Page.

To provide a full SSO experience for the end-users, the SAML response can include an optional parameter called RelayState. This parameter contains the destination URL.

Microsoft provides a tool to help generate these SSO URLs for AD FS called the AD FS 2.0 RelayState Generator.

To build this URL, you need three pieces of information:

  • IdP URL string – The string is in the format https://ADFSSERVER/adfs/ls/idpinitiatedsignon.aspx. For this post, we use https://EC2AMAZ-F9TJOIC.adfsredshift.com/adfs/ls/IdpInitiatedSignOn.aspx.
  • Relying party identifier – For AWS, this is urn:amazon:webservices.
  • Relay state or target app – This is the AWS Management Console URL that you want your authenticated users to be redirected to. In this case, it’s https://eu-west-1.console.aws.amazon.com/sqlworkbench/home?. For this post, we use the eu-west-1 Region, but you can adjust this as needed.

I followed the instructions in How to Use SAML to Automatically Direct Federated Users to a Specific AWS Management Console Page and used the AD FS 2.0 RelayState Generator to generate the URL shown in the following screenshot.

The following is an example of the final URL that you use to get authenticated and also get redirected to Amazon Redshift query editor v2 (this URL won’t work in your setup because it has been created specifically for an AD FS server in my account): https://EC2AMAZ-F9TJOIC.adfsredshift.com/adfs/ls/IdpInitiatedSignOn.aspx?RelayState=RPID%3Durn%253Aamazon%253Awebservices%26RelayState%3Dhttps%253A%252F%252Feu-west-1.console.aws.amazon.com%252Fsqlworkbench%252Fhome%253Fregion%253Deu-west-1%2523%252Fclient

You can now save this URL and use it from anywhere you can reach your AD FS server. After you enter the URL in a browser, you first authenticate to AD FS, then you’re redirected to the Amazon Redshift query editor v2 console.

Set up DB groups on Amazon Redshift cluster

In this step, you set up your Amazon Redshift database group. This step is necessary because when the user is authenticated, they have to be part of an Amazon Redshift DB group with proper permissions set on a schema or table (or view).

In Active Directory, the user Bob is part of two groups: Sales and Marketing. In your Amazon Redshift database, you have three database groups: Sales, Marketing, and Finance.

When user Bob logs in via federated authentication, the user assumes the Sales and Marketing database groups, so this user can query tables in both the Sales and Marketing schemas. Because the user Bob isn’t part of the Finance group, when they try to access the Finance schema, they receive a permission denied error.

The following diagram illustrates this configuration.

Complete the following steps to set up your DB groups:

  1. Connect as awsuser (a superuser).
  2. Create three database groups:
    CREATE GROUP sales;
    CREATE GROUP marketing;
    CREATE GROUP finance;

  3. Create three schemas:
    CREATE SCHEMA sales;
    CREATE SCHEMA marketing;
    CREATE SCHEMA finance;

  4. Create a table in each schema:
    CREATE TABLE IF NOT EXISTS marketing.employee
    (
    	n_empkey INTEGER   
    	,n_name CHAR(25)   
    	,n_regionkey INTEGER   
    	,n_comment VARCHAR(152)   
    )
    DISTSTYLE AUTO
     SORTKEY (n_empkey);
    
    CREATE TABLE IF NOT EXISTS sales.employee_sales
    (
    	n_empkey INTEGER   
    	,n_name CHAR(25)   
    	,n_regionkey INTEGER   
    	,n_comment VARCHAR(152)   
    )
    DISTSTYLE AUTO
     SORTKEY (n_empkey);
    
    
    CREATE TABLE IF NOT EXISTS finance.accounts
    (
    	account_id INTEGER   
    	,account_name CHAR(25)   
    	 
    )
    DISTSTYLE AUTO
     SORTKEY (account_id);

  5. Insert sample data into the three tables:
    INSERT INTO marketing.employee
    VALUES(1, 'Bob', 0, 'Marketing');
    
    INSERT INTO sales.employee_sales
    VALUES(1, 'John', 0, 'Sales');
    
    INSERT INTO finance.accounts
    VALUES(1, 'online company');

  6. Validate the data is available in the tables:
    Select * from marketing.employee;
    Select * from sales.employee_sales;
    Select * from finance.accounts;

You can now set up appropriate privileges for the sales, finance, and marketing groups. Groups are collections of users who are all granted privileges associated with the group. You can use groups to assign privileges by job function. For example, you can create different groups for sales, administration, and support, and give the users in each group the appropriate access to the data they require for their work. You can grant or revoke privileges at the group level, and those changes apply to all members of the group, except for superusers.

  1. Enter the following SQL queries to grant access to all tables in the sales schema to the sales group, access to all tables in the marketing schema to the marketing group, and access to all tables in the finance schema to the finance group:
ALTER DEFAULT PRIVILEGES IN SCHEMA sales
GRANT SELECT on TABLES to GROUP sales;
GRANT USAGE on SCHEMA sales to GROUP sales;
GRANT SELECT on ALL TABLES in SCHEMA sales to GROUP sales;

ALTER DEFAULT PRIVILEGES IN SCHEMA marketing
GRANT SELECT on TABLES to GROUP marketing;
GRANT USAGE on SCHEMA marketing to GROUP marketing;
GRANT SELECT on ALL TABLES in SCHEMA marketing to GROUP marketing;

ALTER DEFAULT PRIVILEGES IN SCHEMA finance
GRANT SELECT on TABLES to GROUP finance;
GRANT USAGE on SCHEMA finance to GROUP finance;
GRANT SELECT on ALL TABLES in SCHEMA finance to GROUP finance;

Access Amazon Redshift query editor v2 through federated authentication

Now that you have completed your SAML integration, deep link setup, and DB groups and access rights configuration, you can set up Amazon Redshift query editor v2 to use federated authentication with AD FS and connect directly from the query editor interface.

  1. Navigate to the deep link URL you created earlier.
    You’re redirected to the AD FS login page.
  2. Sign in as bob@adfsredshift.com.
    For this post, I accessed this URL from an Amazon EC2 machine, but you can access it from any location where you can reach the AD FS IdP.
    After AD FS successfully authenticates you, it redirects to the AWS SSO endpoint and posts the SAML assertion and RelayState parameter. Because you configured two IAM roles on the AWS side, you’re prompted to select a role.
  3. Select a role (for this example, ADFZ-Dev) and choose Sign In.

    AWS sends the sign-in URL that is based on the RelayState value back to your browser as a redirect. Your browser is redirected to the query editor v2 console automatically.
  4. Right-click your Amazon Redshift cluster (for this post, redshift-cluster-1) and choose Edit connection.

    The value for User name is automatically populated, and Federated Access is automatically selected.
  5. Choose Edit connection to save the connection and log in to the database.

    After you’re successfully logged in, you can browse the database objects in the left pane.
  6. Test the connection by running the following query:
    select * from stv_sessions;

The following screenshot shows the output.

The output shows that the user bob@adfsredshift.com was authenticated using AD FS. The user also joined the marketing and sales groups, as enforced by the AD FS PrincipalTag:RedshiftDbGroups claim rule and the policy associated with the ADFZ-Dev role, which the user assumes during this session.

Run queries to validate authorization through federated groups

In this final step, you validate how the groups and membership configured in AD are seamlessly integrated with Amazon Redshift database groups.

Run the following query against the marketing and sales schema:

select * from marketing.employee;
select * from sales.employee_sales;

The following screenshots show the output:

The preceding images show that AD user Bob is part of the AD groups RSDB-marketing and RSDB-sales, which are mapped to the DB groups marketing and sales. These DB groups have select access to the marketing and sales schemas and all tables in those schemas. Therefore, the user can successfully query the tables.

To run a query against the finance schema, enter the following code:

select * from finance.accounts;

The following screenshot shows the output.

The output shows that Bob is only part of the AD groups RSDB-marketing and RSDB-sales. Due to the way the claim rule is set up, Bob doesn’t have access to the database group finance, and therefore the query returns with a permission denied error.

Clean up

To avoid incurring future charges, delete the resources by deleting the CloudFormation stack. This cleans up all the resources from your AWS account that you set up in Part 1.

Conclusion

In this post, we demonstrated how to set up an AD FS server, configure different PrincipalTag attributes used for Amazon Redshift query editor v2, and generate an SSO URL with the query editor as the destination location. You then connected to the Amazon Redshift DB cluster using a database user with administrator privileges to set up DB groups and permissions, and used a federated user authentication with AD FS to run several queries. This solution enables you to control access to your Amazon Redshift database objects by using AD groups and memberships seamlessly.

If you have any feedback or questions, please leave them in the comments.


About the Authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 16 years of experience in the data warehousing and analytical space.

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data and analytics for over 14 years. His LinkedIn profile can be found here.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Yanis Telaoumaten is a Software Development Engineer at AWS. His passions are building reliable software and creating tools to allow other engineers to work more efficiently. In the past years, he has worked on the identity, security, and reliability of Amazon Redshift services.

Analyze Amazon SES events at scale using Amazon Redshift

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/analyze-amazon-ses-events-at-scale-using-amazon-redshift/

Email is one of the most important methods for business communication across many organizations. It’s also one of the primary methods for many businesses to communicate with their customers. With the ever-increasing necessity to send emails at scale, monitoring and analysis has become a major challenge.

Amazon Simple Email Service (Amazon SES) is a cost-effective, flexible, and scalable email service that enables you to send and receive emails from your applications. You can use Amazon SES for several use cases, such as transactional, marketing, or mass email communications.

An important benefit of Amazon SES is its native integration with other AWS services, such as Amazon CloudWatch and Amazon Redshift, which allows you to monitor and analyze your email sending at scale seamlessly. You can store your email events in Amazon Redshift, which is a widely used, fast, and fully managed cloud data warehouse. You can then analyze these events using SQL to gain business insights such as marketing campaign success, email bounces, complaints, and so on.

In this post, you will learn how to implement an end-to-end solution to automate this email analysis and monitoring process.

Solution overview

The following architecture diagram highlights the end-to-end solution, which you can provision automatically with an AWS CloudFormation template.

In this solution, you publish Amazon SES email events to an Amazon Kinesis Data Firehose delivery stream that publishes data to Amazon Redshift. You then connect to the Amazon Redshift database and use a SQL query tool to analyze Amazon SES email events that meet the given criteria. We use the Amazon Redshift SUPER data type to store the event (JSON data) in Amazon Redshift. The SUPER data type handles semi-structured data, which can have varying table attributes and types.
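
As a sketch of the kind of query this enables, the following counts events by type using PartiQL-style navigation into a SUPER column. The table name (ses) is the one this post's template creates, but the SUPER column name (message) and the eventType field are assumptions, because the exact table definition comes from the CloudFormation template.

SELECT CAST(message.eventType AS VARCHAR) AS event_type,
       COUNT(*) AS event_count
FROM ses
GROUP BY 1
ORDER BY 2 DESC;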

The alarm system uses Amazon CloudWatch logs that Kinesis Data Firehose generates when a data load to Amazon Redshift fails. We have set up a metric filter that pattern matches the CloudWatch log events to determine the error condition and triggers a CloudWatch alarm. This in turn sends out email notifications using Amazon Simple Notification Service (Amazon SNS).

Prerequisites

As a prerequisite for deploying the solution in this post, you need to set up Amazon SES in your account. For more information, see Getting Started with Amazon Simple Email Service.

Solution resources and features

The architecture built by AWS CloudFormation supports AWS best practices for high availability and security. The CloudFormation template takes care of the following key resources and features:

  • Amazon Redshift cluster – An Amazon Redshift cluster with encryption at rest enabled using an AWS Key Management Service (AWS KMS) customer managed key (CMK). This cluster acts as the destination for Kinesis Data Firehose and stores all the Amazon SES email sending events in the table ses, as shown in the following screenshot.
  • Kinesis Data Firehose configuration – A Kinesis Data Firehose delivery stream that acts as the event destination for all Amazon SES email sending metrics. The delivery stream is set up with Amazon Redshift as the destination. Server-side encryption is enabled using an AWS KMS CMK, and destination error logging has been enabled as per best practices.
  • Amazon SES configuration – A configuration set in Amazon SES that is used to map Kinesis Data Firehose as the event destination to publish email metrics.

To use the configuration set when sending emails, you can specify a default configuration set for your verified identity, or include a reference to the configuration set in the headers of the email.

  • Exploring and analyzing the data – We use Amazon Redshift query editor v2 for exploring and analyzing the data.
  • Alarms and notifications for ingestion failures – A data load error notification system using CloudWatch and Amazon SNS generates email-based notifications in the event of a failure during data load from Kinesis Data Firehose to Amazon Redshift. The setup creates a CloudWatch log metric filter, as shown in the following screenshot.

A CloudWatch alarm based on the metric filter triggers an SNS notification when in alarm state. For more information, see Using Amazon CloudWatch alarms.
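
Besides the CloudWatch alarm, you can also inspect load failures from within Amazon Redshift itself; for example, a query along these lines surfaces recent COPY errors:

-- Show the most recent load errors reported by Amazon Redshift
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;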

Deploy the CloudFormation template

The provided CloudFormation template automatically creates all the required resources for this solution in your AWS account. For more information, see Getting started with AWS CloudFormation.

  1. Sign in to the AWS Management Console.
  2. Choose Launch Stack to launch AWS CloudFormation in your AWS account:
  3. For Stack name, enter a meaningful name for the stack, for example, ses_events.
  4. Provide the following values for the stack parameters:
    1. ClusterName – The name of the Amazon Redshift cluster.
    2. DatabaseName – The name of the first database to be created when the Amazon Redshift cluster is created.
    3. DeliveryStreamName – The name of the Firehose delivery stream.
    4. MasterUsername – The user name that is associated with the primary user account for the Amazon Redshift cluster.
    5. NodeType – The type of node to be provisioned. (Default dc2.large)
    6. NotificationEmailId – The email notification list that is used to configure an SNS topic for sending CloudWatch alarm and event notifications.
    7. NumberofNodes – The number of compute nodes in the Amazon Redshift cluster. For multi-node clusters, the NumberofNodes parameter must be greater than 1.
    8. OnPremisesCIDR – IP range (CIDR notation) for your existing infrastructure to access the target and replica Amazon Redshift clusters.
    9. SESConfigSetName – Name of the Amazon SES configuration set.
    10. SubnetId – Subnet ID where source Amazon Redshift cluster is created.
    11. Vpc – VPC in which Amazon Redshift cluster is launched.
  5. Choose Next.
  6. Review all the information and select I acknowledge that AWS CloudFormation might create IAM resources.
  7. Choose Create stack.

You can track the progress of the stack creation on the Events tab. Wait for the stack to complete and show the status CREATE_COMPLETE.

Test the solution

To send a test email, we use the Amazon SES mailbox simulator. Set the configuration-set header to the one created by the CloudFormation template.

We use the Amazon Redshift query editor V2 to query the Amazon Redshift table (created by the CloudFormation template) and see if the events have shown up.

If the data load of the event stream fails from Kinesis Data Firehose to Amazon Redshift, the failure notification system is triggered, and you receive an email notification via Amazon SNS.

Clean up

Some of the AWS resources deployed by the CloudFormation stacks in this post incur a cost as long as you continue to use them.

You can delete the CloudFormation stack to delete all AWS resources created by the stack. To clean up all your stacks, use the AWS CloudFormation console to remove the stacks that you created in reverse order.

  1. On the Stacks page on the AWS CloudFormation console, choose the stack to delete.
  2. In the stack details pane, choose Delete.
  3. Choose Delete stack when prompted.

After stack deletion begins, you can’t stop it. The stack proceeds to the DELETE_IN_PROGRESS state. When the stack deletion is complete, the stack changes to the DELETE_COMPLETE state. The AWS CloudFormation console doesn’t display stacks in the DELETE_COMPLETE state by default. To display deleted stacks, you must change the stack view filter. For more information, see Viewing deleted stacks on the AWS CloudFormation console.

If the delete fails, the stack enters the DELETE_FAILED state. For solutions, see Delete stack fails.

Conclusion

In this post, we walked through the process of setting up Amazon SES and Amazon Redshift to deploy an email reporting service that can scale to support millions of events. We used Amazon Redshift to store semi-structured messages using the SUPER data type in database tables to support varying message sizes and formats. With this solution, you can easily run analytics at scale and analyze your email event data for deliverability-related issues such as bounces or complaints.

Use the CloudFormation template provided to speed up provisioning of the cloud resources required for the solution (Amazon SES, Kinesis Data Firehose, and Amazon Redshift) in your account while following security best practices. Then you can analyze Amazon SES events at scale using Amazon Redshift.


About the Authors

Manash Deb is a Software Development Engineer in the AWS Directory Service team. He has worked on building end-to-end applications with different databases and technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems on AWS.

Arnab Ghosh is a Solutions Architect for AWS in North America helping enterprise customers build resilient and cost-efficient architectures. He has over 13 years of experience in architecting, designing, and developing enterprise applications solving complex business problems.

Sanjoy Thanneer is a Sr. Technical Account Manager with AWS based out of New York. He has over 20 years of experience working in database and analytics domains. He is passionate about helping enterprise customers build scalable, resilient, and cost-efficient applications.

Justin Morris is an Email Deliverability Manager for the Simple Email Service team. With over 10 years of experience in the IT industry, he has developed a natural talent for diagnosing and resolving customer issues and continuously looks for growth opportunities to learn new technologies and services.

Build a big data Lambda architecture for batch and real-time analytics using Amazon Redshift

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/build-a-big-data-lambda-architecture-for-batch-and-real-time-analytics-using-amazon-redshift/

With real-time information about customers, products, and applications in hand, organizations can take action as events happen in their business application. For example, you can prevent financial fraud, deliver personalized offers, and identify and prevent failures before they occur in near real time. Although batch analytics lets you analyze trends and process data at scale in time intervals (such as daily sales aggregations by individual store), real-time analytics is optimized for low-latency analytics, ensuring that data is available for querying in seconds. Both paradigms of data processing operate in silos, which results in data redundancy and operational overhead to maintain them. A big data Lambda architecture is a reference architecture pattern that allows for the seamless coexistence of the batch and near-real-time paradigms for large-scale data analytics.

Amazon Redshift allows you to easily analyze all data types across your data warehouse, operational database, and data lake using standard SQL. With data sharing, you can share live data across Amazon Redshift clusters for read purposes securely and easily out of the box. In this post, we collect, process, and analyze data streams in real time, and we discuss how to harness the data sharing ability of Amazon Redshift to set up a big data Lambda architecture that allows both batch and near-real-time analytics.

Solution overview

Example Corp. is a leading electric automotive company that revolutionized the automotive industry. Example Corp. operationalizes the connected vehicle data and improves the effectiveness of various connected vehicle and fleet use cases, including predictive maintenance, in-vehicle service monetization, usage-based insurance, and delivering exceptional driver experiences. In this post, we explore real-time and trend analytics using the connected vehicle data to illustrate the following use cases:

  • Usage-based insurance – Usage-based insurance (UBI) relies on analysis of near-real-time data from the driver’s vehicle to assess the risk profile of the driver. In addition, it also relies on the historical analysis (batch) of metrics (such as the number of miles driven in a year). The better the driver, the lower the premium.
  • Fleet performance trends – The performance of a fleet (such as a taxi fleet) relies on the analysis of historical trends of data across the fleet (batch) as well as the ability to drill down to a single vehicle within the fleet for near-real-time analysis of metrics like fuel consumption or driver distraction.

Architecture overview

In this section, we discuss the overall architectural setup for the Lambda architecture solution.

The following diagram shows the implementation architecture and the different computational layers:

  • Data ingestion from AWS IoT Core
  • Batch layer
  • Speed layer
  • Serving layer

Data ingestion

Vehicle telemetry data is ingested into the cloud through AWS IoT Core and routed to Amazon Kinesis Data Streams. The Kinesis Data Streams layer acts as a separation layer for the speed layer and batch layer, where the incoming telemetry is consumed by the speed layer’s Amazon Redshift cluster and Amazon Kinesis Data Firehose, respectively.

Batch layer

Amazon Kinesis Data Firehose is a fully managed service that can batch, compress, transform, and encrypt your data streams before loading them into your Amazon Simple Storage Service (Amazon S3) data lake. Kinesis Data Firehose also allows you to specify a custom expression for the Amazon S3 prefix where data records are delivered. This provides the ability to filter the partitioned data and control the amount of data scanned by each query, thereby improving performance and reducing cost.

The batch layer persists data in Amazon S3 and is accessed directly by an Amazon Redshift Serverless endpoint (serving layer). With Amazon Redshift Serverless, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.
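
For example, a minimal sketch of exposing the data lake through a Redshift Spectrum external schema looks like the following; the AWS Glue database name, IAM role ARN, and external table name are assumptions for illustration (the table name follows the S3 prefix used later in this post).

CREATE EXTERNAL SCHEMA vehicle_ext
FROM DATA CATALOG
DATABASE 'vehicle_iot_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Query the data lake in place, without loading it into Amazon Redshift
SELECT COUNT(*) FROM vehicle_ext.vehicle_telematics_raw;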

The batch layer can also optionally precompute results as batch views from the immutable Amazon S3 data lake and persist them as either native tables or materialized views for very high-performant use cases. You can create these precomputed batch views using AWS Glue, Amazon Redshift stored procedures, Amazon Redshift materialized views, or other options.

The batch views can be calculated as:

batch view = function (all data)

In this solution, we build a batch layer for Example Corp. for two types of queries:

  • rapid_acceleration_by_year – The number of rapid accelerations by each driver aggregated per year
  • total_miles_driven_by_year – The total number of miles driven by the fleet aggregated per year

For demonstration purposes, we use Amazon Redshift stored procedures to create the batch views as Amazon Redshift native tables from external tables using Amazon Redshift Spectrum.
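
The following is a sketch of what such a stored procedure could look like, assuming the external schema from the earlier sketch and hypothetical column names (vin, measuretime, acceleration) and threshold in the raw telemetry:

CREATE OR REPLACE PROCEDURE rapid_acceleration_by_year_sp()
AS $$
BEGIN
  -- Rebuild the batch view as a native table from the Spectrum external table
  DROP TABLE IF EXISTS public.batchlayer_rapid_acceleration_by_year;
  CREATE TABLE public.batchlayer_rapid_acceleration_by_year AS
  SELECT vin,
         EXTRACT(year FROM measuretime) AS year,
         COUNT(*) AS rapid_acceleration
  FROM vehicle_ext.vehicle_telematics_raw
  WHERE acceleration > 3.0   -- hypothetical rapid-acceleration threshold
  GROUP BY 1, 2;
END;
$$ LANGUAGE plpgsql;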

Speed layer

The speed layer processes data streams in real time and aims to minimize latency by providing real-time views into the most recent data.

Amazon Redshift Streaming Ingestion uses SQL to connect with one or more Kinesis data streams simultaneously. The native streaming ingestion feature in Amazon Redshift lets you ingest data directly from Kinesis Data Streams and enables you to ingest hundreds of megabytes of data per second and query it at exceptionally low latency—in many cases only 10 seconds after entering the data stream.

The speed cluster uses materialized views to materialize a point-in-time view of a Kinesis data stream, as accumulated up to the time it is queried. The real-time views are computed using this layer, which provide a near-real-time view of the incoming telemetry stream.
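
As a sketch, the streaming ingestion objects on the speed cluster look something like the following; the Kinesis stream name and IAM role ARN are placeholders, and the exact syntax may vary slightly with the streaming ingestion release you use. The script used in this post additionally extracts individual fields such as vin and measuretime from the JSON payload, which is why later queries reference them directly.

-- External schema that maps Kinesis Data Streams into Amazon Redshift
CREATE EXTERNAL SCHEMA kds
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';

-- Materialized view that reads the stream and parses the payload as SUPER
CREATE MATERIALIZED VIEW vehicleiotstream_mv AS
SELECT approximate_arrival_timestamp,
       partition_key,
       shard_id,
       sequence_number,
       json_parse(from_varbyte(kinesis_data, 'utf-8')) AS payload
FROM kds."vehicle-telemetry-stream";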

The speed views can be calculated as a function of recent data unaccounted for in the batch views:

speed view = function (recent data)

We calculate the speed views for these batch views as follows:

  • speedlayer_rapid_acceleration_by_year – The number of rapid accelerations by each driver for recent data not accounted for in the batch view batchlayer_rapid_acceleration_by_year
  • speedlayer_total_miles_by_year – The number of miles driven by each driver for recent data not accounted for in the batch view batchlayer_total_miles_by_year

Serving layer

The serving layer comprises an Amazon Redshift Serverless endpoint and any consumption services such as Amazon QuickSight or Amazon SageMaker.

Amazon Redshift Serverless (preview) is a serverless option of Amazon Redshift that makes it easy to run and scale analytics in seconds without the need to set up and manage data warehouse infrastructure. With Amazon Redshift Serverless, any user—including data analysts, developers, business professionals, and data scientists—can get insights from data by simply loading and querying data in the data warehouse.

Amazon Redshift data sharing enables instant, granular, and fast data access across Amazon Redshift clusters without the need to maintain redundant copies of data.

The speed cluster provides outbound data shares of the real-time materialized views to the Amazon Redshift Serverless endpoint (serving cluster).

The serving cluster joins data from the batch layer and speed layer to get near-real-time and historical data for a particular function with minimal latency. The consumption layer (such as Amazon API Gateway or QuickSight) is only aware of the serving cluster, and all the batch and stream processing is abstracted from the consumption layer.

We can view the queries to the speed layer from data consumption layer as follows:

query = function (batch views, speed views)

Deploy the CloudFormation template

We have provided an AWS CloudFormation template to demonstrate the solution. You can download and use this template to easily deploy the required AWS resources. This template has been tested in the us-east-1 Region.

The template requires you to provide the following parameters:

  • DatabaseName – The name of the first database to be created for speed cluster
  • NumberOfNodes – The number of compute nodes in the cluster.
  • NodeType – The type of node to be provisioned
  • MasterUserName – The user name that is associated with the master user account for the cluster that is being created
  • MasterUserPassword – The password that is associated with the master user account
  • InboundTraffic – The CIDR range to allow inbound traffic to the cluster
  • PortNumber – The port number on which the cluster accepts incoming connections
  • SQLForData – The source query to extract from the AWS IoT Core topic

Prerequisites

When setting up this solution and using your own application data to push to Kinesis Data Streams, you can skip setting up the IoT Device Simulator and start creating your Amazon Redshift Serverless endpoint. This post uses the simulator to create related database objects and assumes use of the simulator in the solution walkthrough.

Set up the IoT Device Simulator

We use the IoT Device simulator to generate and simulate vehicle IoT data. The solution allows you to create and simulate hundreds of connected devices, without having to configure and manage physical devices or develop time-consuming scripts.

Use the following CloudFormation template to create the IoT Device Simulator in your account for trying out this solution.

Configure devices and simulations

To configure your devices and simulations, complete the following steps:

  1. Use the login information you received in the email you provided to log in to the IoT Device Simulator.
  2. Choose Device Types and Add Device Type.
  3. Choose Automotive Demo.
  4. For Device type name, enter testVehicles.
  5. For Topic, enter the topic where the sensor data is sent to AWS IoT Core.
  6. Save your settings.
  7. Choose Simulations and Add simulation.
  8. For Simulation name, enter testSimulation.
  9. For Simulation type, choose Automotive Demo.
  10. For Select a device type, choose the device type you created (testVehicles).
  11. For Number of devices, enter 15.

You can choose up to 100 devices per simulation. You can configure a higher number of devices to simulate large data.

  1. For Data transmission interval, enter 1.
  2. For Data transmission duration, enter 300.

This configuration runs the simulation for 5 minutes.

  1. Choose Save.

Now you’re ready to simulate vehicle telemetry data to AWS IoT Core.

Create an Amazon Redshift Serverless endpoint

The solution uses an Amazon Redshift Serverless endpoint as the serving layer cluster. You can set up Amazon Redshift Serverless in your account.

Set up Amazon Redshift Query Editor V2

To query data, you can use Amazon Redshift Query Editor V2. For more information, refer to Introducing Amazon Redshift Query Editor V2, a Free Web-based Query Authoring Tool for Data Analysts.

Get namespaces for the provisioned speed layer cluster and Amazon Redshift Serverless

Connect to speed-cluster-iot (the speed layer cluster) through Query Editor V2 and run the following SQL:

select current_namespace; -- (Save as <producer_namespace>)

Similarly, connect to the Amazon Redshift Serverless endpoint and get the namespace:

select current_namespace; -- (Save as <consumer_namespace>)

You can also get this information via the Amazon Redshift console.

Now that we have all the prerequisites set up, let’s go through the solution walkthrough.

Implement the solution

The workflow includes the following steps:

  1. Start the IoT simulation created in the previous section.

Vehicle IoT data is simulated and ingested through the IoT Device Simulator for the configured number of vehicles. The raw telemetry payload is sent to AWS IoT Core, which routes the data to Kinesis Data Streams.

At the batch layer, data flows directly from Kinesis Data Streams to Kinesis Data Firehose, which converts the data to Parquet and delivers it to Amazon S3 with the prefix s3://<Bucketname>/vehicle_telematics_raw/year=<>/month=<>/day=<>/.

  1. When the simulation is complete, run the pre-created AWS Glue crawler vehicle_iot_crawler on the AWS Glue console.

The serving layer Amazon Redshift Serverless endpoint can directly access data from the Amazon S3 data lake through Redshift Spectrum external tables. In this demo, we compute batch views through Redshift Spectrum and store them as Amazon Redshift tables using Amazon Redshift stored procedures.

  1. Connect to the Amazon Redshift Serverless endpoint through Query Editor V2 and create the stored procedures using the following SQL script.
  2. Run the two stored procedures to create the batch views:
call rapid_acceleration_by_year_sp();
call total_miles_driven_by_year_sp();

The two stored procedures create batch views as Amazon Redshift native tables:

    • batchlayer_rapid_acceleration_by_year
    • batchlayer_total_miles_by_year

You can also schedule these stored procedures as batch jobs. For more information, refer to Scheduling SQL queries on your Amazon Redshift data warehouse.

At the speed layer, the incoming data stream is read and materialized by the speed layer Amazon Redshift cluster in the materialized view vehicleiotstream_mv.

  1. Connect to the provisioned speed-cluster-iot and run the following SQL script to create the required objects.

Two real-time views are created from this materialized view:

    • speedlayer_rapid_acceleration_by_year
    • speedlayer_total_miles_by_year
  1. Refresh the materialized view vehicleiotstream_mv at the required interval, which triggers Amazon Redshift to read from the stream and load data into the materialized view.
    REFRESH MATERIALIZED VIEW vehicleiotstream_mv;

Refreshes are currently manual, but can be automated using the query scheduler.

The real-time views are shared as an outbound data share by the speed cluster to the serving cluster.

  1. Connect to speed-cluster-iot and create an outbound data share (producer) with the following SQL:
    -- Create Datashare from Primary (Producer) to Serverless (Consumer)
    CREATE DATASHARE speedlayer_datashare SET PUBLICACCESSIBLE TRUE;
    ALTER DATASHARE speedlayer_datashare ADD SCHEMA public;
    ALTER DATASHARE speedlayer_datashare ADD ALL TABLES IN SCHEMA public;
    GRANT USAGE ON DATASHARE speedlayer_datashare TO NAMESPACE '<consumer_namespace>'; -- (replace with the consumer namespace you captured in the prerequisites)

  2. Connect to the Amazon Redshift Serverless endpoint and create a database from the data share (consumer) with the following SQL:
    CREATE DATABASE speedlayer_shareddb FROM DATASHARE speedlayer_datashare OF NAMESPACE '<producer_namespace>'; -- (replace with the producer namespace you captured in the prerequisites)

Now that the real-time views are available for the Amazon Redshift Serverless endpoint, we can run queries to get real-time metrics or historical trends with up-to-date data by accessing the batch and speed layers and joining them using the following queries.

For example, to calculate total rapid acceleration by year with up-to-the-minute data, you can run the following query:

-- Rapid Acceleration By Year

select SUM(rapid_acceleration) rapid_acceleration, vin, year from 
(
select rapid_acceleration, vin,year
  from public.batchlayer_rapid_acceleration_by_year batch
union all
select rapid_acceleration, vin,year
from speedlayer_shareddb.public.speedlayer_rapid_acceleration_by_year speed)
group by VIN, year;

Similarly, to calculate total miles driven by year with up-to-the-minute data, run the following query:

-- Total Miles Driven By Year

select SUM(total_miles) total_miles_driven , year from 
(
select total_miles, year
  from public.batchlayer_total_miles_by_year batch
union all
select total_miles, year
from speedlayer_shareddb.public.speedlayer_total_miles_by_year speed)
group by year;

If you only need access to real-time data to power daily dashboards, you can run queries against the real-time views shared to your Amazon Redshift Serverless endpoint.

For example, to calculate the average speed per trip of your fleet, you can run the following SQL:

select CAST(measuretime as DATE) "date",
vin,
trip_id,
avg(vehicleSpeed)
from speedlayer_shareddb.public.vehicleiotstream_mv 
group by vin, date, trip_id;

Because this demo uses the same data for both the batch and speed layers as a quick start, there are duplicates in this demonstration. In actual implementations, the serving cluster manages the data redundancy and duplication by creating views with date predicates that consume non-overlapping data from the batch and real-time views and provide overall metrics to the consumption layer.
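
A sketch of that idea follows; the miles column and the cutoff date are hypothetical, and the view assumes the batch layer has been loaded through that cutoff:

CREATE VIEW public.total_miles_driven_all AS
SELECT year, SUM(total_miles) AS total_miles
FROM (
    SELECT year, total_miles
    FROM public.batchlayer_total_miles_by_year                           -- batch layer: data up to the cutoff
    UNION ALL
    SELECT EXTRACT(year FROM measuretime) AS year, miles AS total_miles
    FROM speedlayer_shareddb.public.vehicleiotstream_mv                   -- speed layer: data after the cutoff
    WHERE measuretime > '2022-06-01'                                       -- hypothetical batch load cutoff
) combined
GROUP BY year
WITH NO SCHEMA BINDING;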

You can consume the data with QuickSight for dashboards, with API Gateway for API-based access, or via the Amazon Redshift Data API or SageMaker for AI and machine learning (ML) workloads. This is not included as part of the provided CloudFormation template.

Best practices

In this section, we discuss some best practices and lessons learned when using this solution.

Provisioned vs. serverless

The speed layer is a continuous ingestion layer reading data from the IoT streams, often running 24/7 workloads. Because there is less idle time and variability in these workloads, it is advantageous to have a provisioned cluster supporting persistent workloads that can scale elastically.

The serving layer can be provisioned (in case of 24/7 workloads) or Amazon Redshift Serverless in case of sporadic or ad hoc workloads. In this post, we assumed sporadic workloads, so serverless is the best fit. In addition, the serving layer can house multiple Amazon Redshift clusters, each consuming their data share and serving downstream applications.

RA3 instances for data sharing

Amazon Redshift RA3 instances enable data sharing to allow you to securely and easily share live data across Amazon Redshift clusters for reads. You can combine the data that is ingested in near-real time with the historical data using the data share to provide personalized driving characteristics to determine the insurance recommendation.

You can also grant fine-grained access control to the underlying data in the producer to the consumer cluster as needed. Amazon Redshift offers comprehensive auditing capabilities using system tables and AWS CloudTrail to allow you to monitor the data sharing permissions and usage across all the consumers and revoke access instantly when necessary. The permissions are granted by the superusers from both the producer and the consumer clusters to define who gets access to what objects, similar to the grant commands used in the earlier section. You can use the following commands to audit the usage and activities for the data share.

Track all changes to the data share and the shared database imported from the data share with the following code:

SELECT username, share_name, recordtime, action,
       share_object_type, share_object_name
FROM svl_datashare_change_log
ORDER BY recordtime DESC;

Track data share access activity (usage), which is relevant only on the producer, with the following code:

SELECT * FROM svl_datashare_usage;

Pause and resume

You can pause the producer cluster when batch processing is complete to save costs. The pause and resume actions in Amazon Redshift allow you to easily pause and resume clusters that may not be in operation at all times. You can schedule pause and resume actions to run at specific times, or initiate them manually. Flexible on-demand pricing and per-second billing give you greater control over the costs of your Amazon Redshift compute clusters while keeping your data simple to manage.

Materialized views for fast access to data

Materialized views store precomputed results from complex queries on large tables for faster access. The producer cluster exposes data as materialized views to simplify access for the consumer cluster. This also gives the producer cluster the flexibility to update the underlying table structure to address new business use cases without affecting consumer-dependent queries, enabling loose coupling. A minimal sketch of this pattern follows.
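
The following is a hypothetical sketch of this pattern; the view and base table names (trip_summary_mv and vehicle_trips) are illustrative and not part of the demo:

-- Hypothetical producer-side materialized view that pre-aggregates trip data;
-- consumers query the view, so the base table can evolve without breaking them.
create materialized view public.trip_summary_mv as
select vin,
       trip_id,
       min(measuretime) as trip_start,
       max(measuretime) as trip_end,
       avg(vehiclespeed) as avg_speed
from public.vehicle_trips
group by vin, trip_id;

-- Refresh after each batch load so consumers see current results.
refresh materialized view public.trip_summary_mv;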

Conclusion

In this post, we demonstrated how to process and analyze large-scale data from streaming and batch sources using Amazon Redshift as the core of the platform guided by the Lambda architecture principles.

You started by collecting real-time data from connected vehicles, and storing the streaming data in an Amazon S3 data lake through Kinesis Data Firehose. The solution simultaneously processes the data for near-real-time analysis through Amazon Redshift streaming ingestion.

Through the data sharing feature, you were able to share live, up-to-date data to an Amazon Redshift Serverless endpoint (serving cluster), which merges the data from the speed layer (near-real time) and batch layer (batch analysis) to provide low-latency access to data from near-real-time analysis to historical trends.

Get started with this solution today, and let us know how you implemented it in your organization through the comments section.


About the Authors

Jagadish Kumar is a Sr Analytics Specialist Solutions Architect at AWS. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS. He is an avid college football fan and enjoys reading, watching sports, and riding his motorcycle.

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Eesha Kumar is an Analytics Solutions Architect with AWS. He works with customers to realize the business value of data by helping them build solutions using the AWS platform and tools.

Secure data movement across Amazon S3 and Amazon Redshift using role chaining and ASSUMEROLE

Post Syndicated from Sudipta Mitra original https://aws.amazon.com/blogs/big-data/secure-data-movement-across-amazon-s3-and-amazon-redshift-using-role-chaining-and-assumerole/

Modern data architectures use a ring of purpose-built data services around a central data lake. Data needs to move between these services and data stores easily and securely. The following are some examples of such services:

  • Amazon Simple Storage Service (Amazon S3), which stores structured, unstructured, and semi-structured data
  • Amazon Redshift, a fully managed, petabyte-scale data warehouse product to analyze large-scale structured and semi-structured data across data warehouses and operational databases
  • Amazon SageMaker, which consumes data for machine learning (ML) capabilities

In multi-tenant architectures, groups or users within a group may require exclusive permissions to the group’s S3 bucket as well as to the group’s schemas and tables in Amazon Redshift. These teams also need to be able to control loading and unloading of data between the team-owned S3 buckets and Amazon Redshift schemas. Additionally, individual users within the team may require fine-grained control over objects in S3 buckets and specific schemas in Amazon Redshift. The implementation of this permissions control use case should scale as more teams and users are onboarded and as permission-separation requirements evolve.

Amazon Redshift and Amazon S3 provide a unified, natively integrated storage layer for data lakes. You can move data between Amazon Redshift and Amazon S3 using the Amazon Redshift COPY and UNLOAD commands.

This post presents an approach that you can apply at scale to achieve fine-grained access controls to resources in S3 buckets and Amazon Redshift schemas for tenants, including groups of users belonging to the same business unit down to the individual user level. This solution provides tenant isolation and data security. In this approach, we use the bridge model to store data and control access for each tenant at the individual schema level in the same Amazon Redshift database. We utilize ASSUMEROLE and role chaining to provide fine-grained access control when data is being copied and unloaded between Amazon Redshift and Amazon S3, so the data flows within each tenant’s namespace. Role chaining also streamlines the new tenant onboarding process.

Solution overview

In this post, we explore how to achieve resource isolation, data security, scaling to multiple tenants, and fine-grained access control at the individual user level for teams that access, store, and move data across storage using Amazon S3 and Amazon Redshift.

We use the bridge model to store data and control access for each tenant at the individual schema level in the same Amazon Redshift database. In the bridge model, a separate database schema is created for each tenant, and data for each tenant is stored in its own schema. The tenant has access only to its own schema.

We use the COPY and UNLOAD commands to load and unload data into and out of the Amazon Redshift cluster using an S3 bucket. These commands require Amazon Redshift to access Amazon S3 on your behalf, so security credentials must be provided to your clusters.

We create an AWS Identity and Access Management (IAM) role—we call it the Amazon Redshift onboarding role—and associate it with the Amazon Redshift cluster. For each tenant, we create a tenant-specific IAM role—we call it the tenant role—to define the fine-grained access to its own Amazon S3 resources. The Amazon Redshift onboarding role doesn’t have any permissions granted except allowing sts:AssumeRole to the tenant roles. The trust relationship to the Amazon Redshift onboarding role is defined in each of the tenant roles. We use the Amazon Redshift ASSUMEROLE privilege to control IAM role access privileges for database users and groups on COPY and UNLOAD commands.

Each tenant database user or group is granted ASSUMEROLE on the Amazon Redshift onboarding role and its own tenant role, which restricts the tenant to access its own Amazon S3 resources when using COPY and UNLOAD commands. We use role chaining when ASSUMEROLE is granted. This means that the tenant role isn’t required to be attached to the Amazon Redshift cluster, and the only IAM role associated is the Amazon Redshift onboarding role. Role chaining streamlines the new tenant onboarding process. With role chaining, we don’t need to modify the cluster; we can make modifications on the tenant IAM role definition when onboarding a new tenant.

For our use case, we have two tenants: team 1 and team 2. A tenant here represents a group of users—a team from the same business unit. We want separate S3 buckets and Amazon Redshift schemas for each team. These teams should be able to access their own data only and also be able to support fine-grained access control over copying and unloading data from Amazon S3 to Amazon Redshift (and vice versa) within the team boundary. We can apply access control at the individual user level using the same approach.

The following architecture diagram shows the AWS resources and process flow of our solution.

In this tutorial, you create two S3 buckets, two Amazon Redshift tenant schemas, two Amazon Redshift tenant groups, one Amazon Redshift onboarding role, and two tenant roles. Then you grant ASSUMEROLE on the onboarding and tenant role to each tenant, using role chaining. To verify that each tenant can only access its own S3 resources, you create two Amazon Redshift users assigned to their own tenant group and run COPY and UNLOAD commands.

Prerequisites

To follow along with this solution, you need the following prerequisites:

Download the source code to your local environment

To implement this solution in your local development environment, you can download the source code from the GitHub repo or clone the source code using the following command:

git clone https://github.com/aws-samples/amazon-redshift-assume-role-sample.git

The following files are included in the source code:

  • redshift-onboarding-role.cf.yaml – A CloudFormation template to deploy the Amazon Redshift onboarding role redshift-onboarding-role
  • redshift-tenant-resources.cf.yaml – A CloudFormation template to deploy an S3 bucket, KMS key, and IAM role for each tenant you want to onboard

Provision an IAM role for Amazon Redshift and attach this role to the Amazon Redshift cluster

Deploy the template redshift-onboarding-role.cf.yaml using the AWS CloudFormation console or the AWS Command Line Interface (AWS CLI). For more information about stack creation, see Create the stack. This template doesn’t have any required parameters. The stack provisions an IAM role named redshift-s3-onboarding-role for Amazon Redshift. The following code is the policy defining sts:AssumeRole to the tenant-specific IAM roles:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "sts:AssumeRole"
      ],
      "Resource": [
        "arn:aws:iam::xxxxxxxxxxxx:role/*-tenant-redshift-s3-access-role"
      ],
      "Effect": "Allow"
    }
  ]
}

Navigate to the Amazon Redshift console and select the cluster you want to update. On the Actions menu, choose Manage IAM roles. Choose the role redshift-s3-onboarding-role to associate with the cluster. For more information, see Associate the IAM role with your cluster.

Provision the IAM role and resources for tenants

Deploy the template redshift-tenant-resources.cf.yaml using the AWS CloudFormation console or the AWS CLI. For this post, you deploy the stack twice, supplying two unique tenant names for TenantName. For example, you can use team1 and team2 as the TenantName parameter values.

For each tenant, the stack provisions the following resources:

  • A KMS key
  • An S3 bucket named <TenantName>-data-<account id>-<region> (for example, team1-data-<account id>-<region>) with default encryption enabled with SSE-KMS using the created key
  • An IAM role named <TenantName>-tenant-redshift-s3-access-role (for example, team1-tenant-redshift-s3-access-role)

The policy attached to the role team1-tenant-redshift-s3-access-role grants access only to the team’s own S3 bucket. To enable role chaining, the role redshift-s3-onboarding-role is trusted to assume all tenant roles matching *-tenant-redshift-s3-access-role, and each tenant role *-tenant-redshift-s3-access-role has a trust relationship with redshift-s3-onboarding-role. See the following policy code:

        {
            "Action": [
                "s3:List*",
                "s3:Get*",
                "s3:Put*"
            ],
            "Resource": [
                "arn:aws:s3:::team1-data-<account id>-<region>/*",
                "arn:aws:s3:::team1-data-<account id>-<region>"
            ],
            "Effect": "Allow"
        }

Create a tenant schema and tenant user with appropriate privileges

For this post, you create the following Amazon Redshift database objects using the query editor on the Amazon Redshift console or a SQL client tool like SQL Workbench/J. Replace <password> with your password and <account id> with your AWS account ID before running the following SQL statements:

create schema team1;
create schema team2;

create group team1_grp;
create group team2_grp;

create user team1_usr with password '<password>' in group team1_grp;
create user team2_usr with password '<password>' in group team2_grp;

grant usage on schema team1 to group team1_grp;
grant usage on schema team2 to group team2_grp;

GRANT ALL ON SCHEMA team1 TO group team1_grp;
GRANT ALL ON SCHEMA team2 TO group team2_grp;

revoke assumerole on all from public for all;

grant assumerole
on 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team1-tenant-redshift-s3-access-role'
to group team1_grp for copy;

grant assumerole
on 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team1-tenant-redshift-s3-access-role'
to group team1_grp for unload;

grant assumerole
on 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team2-tenant-redshift-s3-access-role'
to group team2_grp for copy;

grant assumerole
on 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team2-tenant-redshift-s3-access-role'
to group team2_grp for unload;

commit;

Verify that each tenant can only access its own resources

To verify your access control settings, you can create a test table in each tenant schema and upload a file to the tenant’s S3 bucket using the following commands. You can use the Amazon Redshift query editor or a SQL client tool.

  1. Sign in as team1_usr and enter the following commands:
    CREATE TABLE TEAM1.TEAM1_VENUE(
    VENUEID SMALLINT,
    VENUENAME VARCHAR(100),
    VENUECITY VARCHAR(30),
    VENUESTATE CHAR(2),
    VENUESEATS INTEGER
    ) DISTSTYLE EVEN;
    
    commit;

  2. Sign in as team2_usr and enter the following commands:
    CREATE TABLE TEAM2.TEAM2_VENUE(
    VENUEID SMALLINT,
    VENUENAME VARCHAR(100),
    VENUECITY VARCHAR(30),
    VENUESTATE CHAR(2),
    VENUESEATS INTEGER
    ) DISTSTYLE EVEN;
    
    commit;

  3. Create a file named test-venue.txt with the following contents:
    7|BMO Field|Toronto|ON|0
    16|TD Garden|Boston|MA|0
    23|The Palace of Auburn Hills|Auburn Hills|MI|0
    28|American Airlines Arena|Miami|FL|0
    37|Staples Center|Los Angeles|CA|0
    42|FedExForum|Memphis|TN|0
    52|PNC Arena|Raleigh|NC|0
    59|Scotiabank Saddledome|Calgary|AB|0
    66|SAP Center|San Jose|CA|0
    73|Heinz Field|Pittsburgh|PA|65050

  4. Upload this file to both team1 and team2 S3 buckets.
  5. Sign in as team1_usr and enter the following commands to test Amazon Redshift COPY and UNLOAD:
    copy team1.team1_venue
    from 's3://team1-data-<account id>-<region>/'
    iam_role 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team1-tenant-redshift-s3-access-role'
    delimiter '|' ;
    
    unload ('select * from team1.team1_venue')
    to 's3://team1-data-<account id>-<region>/unload/' 
    iam_role 'arn:aws:iam::<account id>:role/redshift-s3-onboarding-role,arn:aws:iam::<account id>:role/team1-tenant-redshift-s3-access-role';

The file test-venue.txt uploaded to the team1 bucket is copied to the table team1_venue in the team1 schema, and the data in table team1_venue is unloaded to the team1 bucket successfully.

  6. Replace team1 with team2 in the preceding commands and then run them again, this time signed in as team2_usr.

If you’re signed in as team1_usr and try to access the team2 S3 bucket or team2 schema or table and team2 IAM role when running COPY and UNLOAD, you get an access denied error. You get the same error if trying to access team1 resources while logged in as team2_usr.

Clean up

To clean up the resources you created, delete the CloudFormation stacks you created in your AWS account.

Conclusion

In this post, we presented a solution to achieve role-based secure data movement between Amazon S3 and Amazon Redshift. This approach combines role chaining with the ASSUMEROLE feature in Amazon Redshift to allow fine-grained access control over the COPY and UNLOAD commands, down to the individual user level within a particular team. This in turn provides finer control over resource isolation and data security in a multi-tenant solution. Many use cases can benefit from this solution as more enterprises build data platforms to provide the foundations for highly scalable, customizable, and secure data consumption models.


About the Authors

Sudipta Mitra is a Senior Data Architect for AWS who is passionate about helping customers build modern data analytics applications by making innovative use of the latest AWS services and their constantly evolving features. He is a pragmatic architect who works backward from customer needs, making customers comfortable with the proposed solution and helping them achieve tangible business outcomes. His main areas of work are data mesh, data lakes, knowledge graphs, data security, and data governance.

Michelle Deng is a Sr. Data Architect at Amazon Web Services. She works with AWS customers to provide guidance and technical assistance on database migrations and big data projects.

Jared Cook is a Sr. Cloud Infrastructure Architect at Amazon Web Services. He is committed to driving business outcomes in the cloud, and uses Infrastructure as Code and DevOps best practices to build resilient architectures on AWS.  In his leisure time, Jared enjoys the outdoors, music, and plays the drums.

Lisa Matacotta is a Senior Customer Solutions Manager at Amazon Web Services. She works with AWS customers to help customers achieve business and strategic goals, understand their biggest challenges and provide guidance based on AWS best practices to overcome them.

Real-time analytics with Amazon Redshift streaming ingestion

Post Syndicated from Sam Selvan original https://aws.amazon.com/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as high-performance business intelligence (BI) reporting, dashboarding applications, data exploration, and real-time analytics.

We’re excited to launch Amazon Redshift streaming ingestion for Amazon Kinesis Data Streams, which enables you to ingest data directly from the Kinesis data stream without having to stage the data in Amazon Simple Storage Service (Amazon S3). Streaming ingestion allows you to achieve low latency in the order of seconds while ingesting hundreds of megabytes of data into your Amazon Redshift cluster.

In this post, we walk through the steps to create a Kinesis data stream, generate and load streaming data, create a materialized view, and query the stream to visualize the results. We also discuss the benefits of streaming ingestion and common use cases.

The need for streaming ingestion

We hear from our customers that you want to evolve your analytics from batch to real time, and access your streaming data in your data warehouses with low latency and high throughput. You also want to enrich your real-time analytics by combining them with other data sources in your data warehouse.

Use cases for Amazon Redshift streaming ingestion center around working with data that is generated continually (streamed) and needs to be processed within a short period (latency) of its generation. Sources of data can vary, from IoT devices to system telemetry, utility service usage, geolocation of devices, and more.

Before the launch of streaming ingestion, if you wanted to ingest real-time data from Kinesis Data Streams, you needed to stage your data in Amazon S3 and use the COPY command to load your data. This usually involved latency in the order of minutes and needed data pipelines on top of the data loaded from the stream. Now, you can ingest data directly from the data stream.

Solution overview

Amazon Redshift streaming ingestion allows you to connect to Kinesis Data Streams directly, without the latency and complexity associated with staging the data in Amazon S3 and loading it into the cluster. You can now connect to and access the data from the stream using SQL and simplify your data pipelines by creating materialized views directly on top of the stream. The materialized views can also include SQL transforms as part of your ELT (extract, load and transform) pipeline.

After you define the materialized views, you can refresh them to query the most recent stream data. This means that you can perform downstream processing and transformations of streaming data using SQL at no additional cost and use your existing BI and analytics tools for real-time analytics.

Amazon Redshift streaming ingestion works by acting as a stream consumer. A materialized view is the landing area for data that is consumed from the stream. When the materialized view is refreshed, Amazon Redshift compute nodes allocate each data shard to a compute slice. Each slice consumes data from the allocated shards until the materialized view attains parity with the stream. The very first refresh of the materialized view fetches data from the TRIM_HORIZON of the stream. Subsequent refreshes read data from the last SEQUENCE_NUMBER of the previous refresh until it reaches parity with the stream data. The following diagram illustrates this workflow.

Setting up streaming ingestion in Amazon Redshift is a two-step process. You first need to create an external schema to map to Kinesis Data Streams and then create a materialized view to pull data from the stream. The materialized view must be incrementally maintainable.

Create a Kinesis data stream

First, you need to create a Kinesis data stream to receive the streaming data.

  1. On the Amazon Kinesis console, choose Data streams.
  2. Choose Create data stream.
  3. For Data stream name, enter ev_stream_data.
  4. For Capacity mode, select On-demand.
  5. Provide the remaining configurations as needed to create your data stream.

Generate streaming data with the Kinesis Data Generator

You can synthetically generate data in JSON format using the Amazon Kinesis Data Generator (KDG) utility and the following template:

{
    
   "_id" : "{{random.uuid}}",
   "clusterID": "{{random.number(
        {   "min":1,
            "max":50
        }
    )}}", 
    "connectionTime": "{{date.now("YYYY-MM-DD HH:mm:ss")}}",
    "kWhDelivered": "{{commerce.price}}",
    "stationID": "{{random.number(
        {   "min":1,
            "max":467
        }
    )}}",
      "spaceID": "{{random.word}}-{{random.number(
        {   "min":1,
            "max":20
        }
    )}}",
 
   "timezone": "America/Los_Angeles",
   "userID": "{{random.number(
        {   "min":1000,
            "max":500000
        }
    )}}"
}

The following screenshot shows the template on the KDG console.

Load reference data

In the previous step, we showed you how to load synthetic data into the stream using the Kinesis Data Generator. In this section, you load reference data related to electric vehicle charging stations to the cluster.

Download the Plug-In EVerywhere Charging Station Network data from the City of Austin’s open data portal. Split the latitude and longitude values in the dataset and load it into a table with the following schema; a sample COPY command follows the table definition.

CREATE TABLE ev_station
  (
     siteid                INTEGER,
     station_name          VARCHAR(100),
     address_1             VARCHAR(100),
     address_2             VARCHAR(100),
     city                  VARCHAR(100),
     state                 VARCHAR(100),
     postal_code           VARCHAR(100),
     no_of_ports           SMALLINT,
     pricing_policy        VARCHAR(100),
     usage_access          VARCHAR(100),
     category              VARCHAR(100),
     subcategory           VARCHAR(100),
     port_1_connector_type VARCHAR(100),
     voltage               VARCHAR(100),
     port_2_connector_type VARCHAR(100),
     latitude              DECIMAL(10, 6),
     longitude             DECIMAL(10, 6),
     pricing               VARCHAR(100),
     power_select          VARCHAR(100)
  ) DISTSTYLE ALL;
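
The following is a sample COPY command to load the prepared file; the bucket name, object key, and IAM role are placeholders, and the CSV file is assumed to have a header row and columns in the same order as the table:

-- Load the prepared charging station reference file from Amazon S3.
COPY ev_station
FROM 's3://<your bucket>/ev_station/ev_station.csv'
IAM_ROLE 'arn:aws:iam::<account id>:role/<your-redshift-role>'
CSV
IGNOREHEADER 1;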

Create a materialized view

You can access your data from the data stream using SQL and simplify your data pipelines by creating materialized views directly on top of the stream. Complete the following steps:

  1. Create an external schema to map the data from Kinesis Data Streams to an Amazon Redshift object:
    CREATE EXTERNAL SCHEMA evdata FROM KINESIS
    IAM_ROLE 'arn:aws:iam::0123456789:role/redshift-streaming-role';

  2. Create an AWS Identity and Access Management (IAM) role (for the policy, see Getting started with streaming ingestion).

Now you can create a materialized view to consume the stream data. You can choose to use the SUPER data type to store the payload as is in JSON format, or use Amazon Redshift JSON functions to parse the JSON data into individual columns. For this post, we use the second method because the schema is well defined.
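
For reference, a minimal sketch of the first approach might look like the following; the view name is hypothetical, and the entire payload is kept as a single SUPER column instead of being parsed into individual columns:

-- Hypothetical alternative: store the raw JSON payload in a SUPER column.
CREATE MATERIALIZED VIEW ev_station_data_super AS
SELECT approximatearrivaltimestamp,
       partitionkey,
       shardid,
       sequencenumber,
       json_parse(from_varbyte(data, 'utf-8')) AS payload
FROM evdata."ev_stream_data";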

  1. Create the materialized view so it’s distributed on the UUID value from the stream and is sorted by the approximatearrivaltimestamp value:
    CREATE MATERIALIZED VIEW ev_station_data_extract DISTKEY(5) sortkey(1) AS
        SELECT approximatearrivaltimestamp,
        partitionkey,
        shardid,
        sequencenumber,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'_id')::character(36) as ID,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'clusterID')::varchar(30) as clusterID,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'connectionTime')::varchar(20) as connectionTime,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'kWhDelivered')::DECIMAL(10,2) as kWhDelivered,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'stationID')::DECIMAL(10,2) as stationID,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'spaceID')::varchar(100) as spaceID,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'timezone')::varchar(30) as timezone,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'userID')::varchar(30) as userID
        FROM evdata."ev_stream_data";

  2. Refresh the materialized view:
    REFRESH MATERIALIZED VIEW ev_station_data_extract;

The materialized view doesn’t auto-refresh while in preview, so you should schedule a query in Amazon Redshift to refresh the materialized view once every minute. For instructions, refer to Scheduling SQL queries on your Amazon Redshift data warehouse.

Query the stream

You can now query the refreshed materialized view to get usage statistics:

SELECT to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS') as connectiontime
,SUM(kWhDelivered) AS Energy_Consumed
,count(distinct userID) AS #Users
from ev_station_data_extract
group by to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS')
order by 1 desc;

The following table contains the results.

connectiontime energy_consumed #users
2022-02-27 23:52:07+00 72870 131
2022-02-27 23:52:06+00 510892 998
2022-02-27 23:52:05+00 461994 934
2022-02-27 23:52:04+00 540855 1064
2022-02-27 23:52:03+00 494818 999
2022-02-27 23:52:02+00 491586 1000
2022-02-27 23:52:01+00 499261 1000
2022-02-27 23:52:00+00 774286 1498
2022-02-27 23:51:59+00 505428 1000
2022-02-27 23:51:58+00 262413 500
2022-02-27 23:51:57+00 486567 1000
2022-02-27 23:51:56+00 477892 995
2022-02-27 23:51:55+00 591004 1173
2022-02-27 23:51:54+00 422243 823
2022-02-27 23:51:53+00 521112 1028
2022-02-27 23:51:52+00 240679 469
2022-02-27 23:51:51+00 547464 1104
2022-02-27 23:51:50+00 495332 993
2022-02-27 23:51:49+00 444154 898
2022-02-27 23:51:24+00 505007 998
2022-02-27 23:51:23+00 499133 999
2022-02-27 23:29:14+00 497747 997
2022-02-27 23:29:13+00 750031 1496

Next, you can join the materialized view with the reference data to analyze the charging station consumption data for the last 5 minutes and break it down by station category:

SELECT to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS') as connectiontime
,SUM(kWhDelivered) AS Energy_Consumed
,count(distinct userID) AS #Users
,st.category
from ev_station_data_extract ext
join ev_station st on
ext.stationID = st.siteid
where approximatearrivaltimestamp > current_timestamp -interval '5 minutes'
group by to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS'),st.category
order by 1 desc, 2 desc

The following table contains the results.

connectiontime energy_consumed #users category
2022-02-27 23:55:34+00 188887 367 Workplace
2022-02-27 23:55:34+00 133424 261 Parking
2022-02-27 23:55:34+00 88446 195 Multifamily Commercial
2022-02-27 23:55:34+00 41082 81 Municipal
2022-02-27 23:55:34+00 13415 29 Education
2022-02-27 23:55:34+00 12917 24 Healthcare
2022-02-27 23:55:34+00 11147 19 Retail
2022-02-27 23:55:34+00 8281 14 Parks and Recreation
2022-02-27 23:55:34+00 5313 10 Hospitality
2022-02-27 23:54:45+00 146816 301 Workplace
2022-02-27 23:54:45+00 112381 216 Parking
2022-02-27 23:54:45+00 75727 144 Multifamily Commercial
2022-02-27 23:54:45+00 29604 55 Municipal
2022-02-27 23:54:45+00 13377 30 Education
2022-02-27 23:54:45+00 12069 26 Healthcare

Visualize the results

You can set up a simple visualization using Amazon QuickSight. For instructions, refer to Quick start: Create an Amazon QuickSight analysis with a single visual using sample data.

We create a dataset in QuickSight to join the materialized view with the charging station reference data.

Then we create a dashboard showing energy consumption and the number of connected users over time. The dashboard also shows the list of locations on the map by category.

Streaming ingestion benefits

In this section, we discuss some of the benefits of streaming ingestion.

High throughput with low latency

Amazon Redshift can handle and process several gigabytes of data per second from Kinesis Data Streams. (Throughput is dependent on the number of shards in the data stream and the Amazon Redshift cluster configuration.) This allows you to experience low latency and high bandwidth when consuming streaming data, so you can derive insights from your data in seconds instead of minutes.

As we mentioned earlier, the key differentiator with the direct ingestion pull approach in Amazon Redshift is lower latency, which is in seconds. Contrast this to the approach of creating a process to consume the streaming data, staging the data in Amazon S3, and then running a COPY command to load the data into Amazon Redshift. This approach introduces latency in minutes due to the multiple steps involved in processing the data.

Straightforward setup

Getting started is easy. All the setup and configuration in Amazon Redshift uses SQL, which most cloud data warehouse users are already familiar with. You can get real-time insights in seconds without managing complex pipelines. Amazon Redshift with Kinesis Data Streams is fully managed, and you can run your streaming applications without requiring infrastructure management.

Increased productivity

You can perform rich analytics on streaming data within Amazon Redshift using existing familiar SQL, without needing to learn new skills or languages. You can create other materialized views, or views on materialized views, to do most of your ELT data pipeline transforms within Amazon Redshift using SQL, as sketched in the following example.
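
For example, the following is a hypothetical rollup view (the name and grain are illustrative) built on top of the streaming materialized view defined earlier:

-- Aggregate energy delivered and active users per station per minute.
CREATE VIEW station_energy_per_minute AS
SELECT stationid,
       date_trunc('minute', to_timestamp(connectiontime, 'YYYY-MM-DD HH24:MI:SS')) AS minute_bucket,
       SUM(kwhdelivered) AS energy_delivered,
       COUNT(DISTINCT userid) AS active_users
FROM ev_station_data_extract
GROUP BY 1, 2;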

Streaming ingestion use cases

With near-real-time analytics on streaming data, many use cases across industry verticals become possible. The following are just some of the many application use cases:

  • Improve the gaming experience – You can focus on in-game conversions, player retention, and optimizing the gaming experience by analyzing real-time data from gamers.
  • Analyze clickstream user data for online advertising – The average customer visits dozens of websites in a single session, yet marketers typically analyze only their own websites. You can analyze authorized clickstream data ingested into the warehouse to assess your customer’s footprint and behavior, and target ads to your customers just-in-time.
  • Real-time retail analytics on streaming POS data – You can access and visualize all your global point of sale (POS) retail sales transaction data for real-time analytics, reporting, and visualization.
  • Deliver real-time application insights – With the ability to access and analyze streaming data from your application log files and network logs, developers and engineers can conduct real-time troubleshooting of issues, deliver better products, and alert systems for preventative measures.
  • Analyze IoT data in real time – You can use Amazon Redshift streaming ingestion with Amazon Kinesis services for real-time applications such as device status and attributes such as location and sensor data, application monitoring, fraud detection, and live leaderboards. You can ingest streaming data using Kinesis Data Streams, process it using Amazon Kinesis Data Analytics, and emit the results to any data store or application using Kinesis Data Streams with low end-to-end latency.

Conclusion

This post showed how to create Amazon Redshift materialized views to ingest data from Kinesis data streams using Amazon Redshift streaming ingestion. With this new feature, you can easily build and maintain data pipelines to ingest and analyze streaming data with low latency and high throughput.

The streaming ingestion preview is now available in all AWS Regions where Amazon Redshift is available. To get started with Amazon Redshift streaming ingestion, provision an Amazon Redshift cluster on the current track and verify your cluster is running version 1.0.35480 or later.

For more information, refer to Streaming ingestion (preview), and check out the demo Real-time Analytics with Amazon Redshift Streaming Ingestion on YouTube.


About the Author

Sam Selvan is a Senior Analytics Solution Architect with Amazon Web Services.

Modernize your healthcare clinical quality data repositories with Amazon Redshift Data Vault

Post Syndicated from Sarathi Balakrishnan original https://aws.amazon.com/blogs/big-data/modernize-your-healthcare-clinical-quality-data-repositories-with-amazon-redshift-data-vault/

With the shift to value-based care, tapping data to its fullest to improve patient satisfaction tops the priority list of healthcare executives everywhere. To achieve this, they must rely on key technologies relevant to their sphere. This is where data lakes can help. A data lake is an architecture that can assist providers in storing, sharing, and utilizing electronic health records and other patient data. Healthcare organizations are already utilizing data lakes to unite heterogeneous data from across hospital systems. The use of data lakes has aided in the effective transformation of the organizational culture, resulting in a data-driven approach to resolving issues. Healthcare clinical quality metric systems stream massive amounts of structured and semi-structured data into a data lake in real time from various sources. However, storing structured data in a data lake and applying schema-on-read leads to quality issues and adds complexity to otherwise simple structured data processing.

In this post, we demonstrate how to modernize your clinical quality data repositories with Amazon Redshift Data Vault.

How healthcare systems use data lakes

Data lakes are popular in healthcare systems because they allow the integration and exploratory analysis of diverse data. Data lakes comprise multiple data assets stored within a Hadoop ecosystem with minimal alterations to the original data format (or file). Because of this, the data lake lacks schema-on-write functionality. In data lakes, information is accessed using a methodology called schema-on-read. On the other hand, a data warehouse is a subject-oriented, time-variant, and centrally controlled database system designed for the management of mixed workloads and large queries. The schema of the data warehouse is predefined, and the data written to it must conform to that schema (schema-on-write). It’s perfect for structured data storage.

Because the healthcare industry wants to preserve the data in raw format for compliance and regulatory requirements, data lakes are widely used in clinical quality tracking systems even though the incoming data are structured and semi-structured. For business and analytical use cases, the data in a data lake is further converted into a dimensional schema and loaded into a data warehouse. Data warehouses require several months of modeling, mapping, ETL (extract, transform, and load) planning and creation, and testing. The result is consistent, validated data for reporting and analytics.

The following diagram shows a typical clinical quality repository in a current healthcare system.

Data lakes offer information in its raw form, as well as a specialized means to obtaining data that applies the schema-on-read. In general, data lake users are experienced analysts familiar with data wrangling techniques that apply schema-on-read or understand data content from unstructured formats. Business users and data visualization builders struggle in the absence of powerful purpose-built search capabilities and data extraction algorithms. That isn’t to say data lakes are devoid of metadata or rules controlling their use, security, or administration. It’s the exact opposite. A successful data lake structures its data in such a way that it promotes better and more efficient access, and reuses data management methods or provides new tools that increase search and general knowledge of the data content.

Challenges in the data lake

However, data modeling shouldn’t be disregarded while employing the schema-on-read approach to handle data. A lack of meaningful data structure might lead to data quality problems, integration issues, performance issues, and deviations from organizational goals. Storing structured data in a data lake and then applying schema-on-read generates problems with data quality and complicates a basic structured data operation. You can also keep these structured datasets in data warehouses by skipping the data lake, but this increases the complexity of scaling, data transformation, and loading (ETL). But data warehouse ETL jobs apply a large number of business rules and heavily transform the data from its raw form to make the data fit one or more business purposes.

Dimensional representation data warehouses are located close to business applications but away from the source. This brings new challenges, like reverse tracking the data to its source to identify potential data errors and keep the original source for regulatory validation. In clinical quality reporting and regulatory reporting, accuracy is key. To provide high accuracy and data quality, developers often reverse track the data to its original source. This is very complex and sometimes not possible in a data warehouse model because the data loses its raw form. Compared to data warehouses, data lakes perform poorly in terms of dataset identification, processing, and catalog management. It’s complex to securely share data in a data lake between accounts. Cloud data lakes show improvement in this area, but some human intervention is still needed to make sure the data catalogs and security match healthcare requirements and standards.

Design and build a Data Vault model in Amazon Redshift

Amazon Redshift is a relational database system based on the PostgreSQL standard. It’s designed to efficiently run online analytical processing (OLAP) queries on petabyte-scale data. It also offers other query-handling efficiencies such as parallel processing, columnar database design, column data compression, a query optimizer, and compiled query code. A cluster is at the heart of every Amazon Redshift deployment. Each cluster consists of a leader node and one or more compute nodes, all of which are linked by high-speed networks.

Amazon Redshift RA3 instances with managed storage allow you to choose the number of nodes based on your performance requirements. The RA3 instance type can scale your data warehouse storage capacity automatically without manual intervention, and without needing to add additional compute resources. The number of nodes you choose is determined by the quantity of your data and the query processing performance you want.

The following diagram illustrates the Amazon Redshift RA3 instance architecture.

The interface to your business intelligence application is provided by a leader node. It provides conventional PostgreSQL JDBC/ODBC query and response interfaces. It acts as a traffic cop, routing requests from client applications to the proper compute nodes and managing the results that are delivered. It also distributes ingested data to computing nodes so that your databases may be built.

Amazon Redshift features like streaming, RA3’s near-limitless storage, Amazon Redshift ML, the SUPER data type, automatic table optimization, materialized views, and Amazon Redshift Spectrum open up the possibility of modernizing healthcare clinical quality data repositories and standardizing the data model for fast access and high accuracy. These features make RA3 instances the perfect candidate for your clinical quality repository. Additionally, the data sharing capability in Amazon Redshift keeps a single source for truth across the organization and removes the need for data silos. It’s tightly coupled with other AWS services, which allows you to take advantage of the AWS Cloud ecosystem and get the most out of your data with less effort.

Standardizing data modeling has two benefits: it allows you to reuse technical and organizational procedures (for example, ETL and project management), and you can readily merge data from diverse contexts without much complication. Because storage is separated from the computing, the Amazon Redshift RA3 instance retains the classic data lake feature of storing huge amounts of data. On the data side, to keep the data close to its raw form, a Data Vault model is the ideal solution to this problem.

Data Vault is a methodology to speed up the development of data warehouse initiatives and keep the data close to its sources. The Data Vault 2.0 data modeling standards are popular because they stress the business keys by hashing the keys across entities. This eliminates the need for strong relationships between entities and their linkages within the delivery of business processes. Data Vault makes it easy to integrate with versatile data sources for completely different use cases than it was originally designed for, due to the following features:

  • Entities can stand alone and be built on patterns, each with a well-defined purpose.
  • Because the data represents the source system and delivers business value, data silos are eliminated.
  • You can load data in parallel with the least number of dependencies.
  • Historized data is kept at the coarsest level of granularity possible.
  • You can implement flexible business rules independently of data loading. This reduces the time taken to load the data by 25%.
  • You can add new data sources without affecting the current model.

The following diagram shows a simple clinical quality data repository with the Data Vault model with Amazon Redshift.

The following shows a simple clinical quality business Data Vault model using Amazon Redshift materialized views.

The Data Vault architecture is divided into four stages:

  • Staging – A copy of the most recent modifications to data from the source systems is created. This layer doesn’t save history, and you can perform numerous alterations to the staged data while populating, such as data type changes or scaling, character set conversion, and the inclusion of metadata columns to enable subsequent processing.
  • Raw vault – This stores a history of all data from numerous source systems. Except for loading the data into source-system-independent targets, no filters or business transformations happen here (see the sketch after this list).
  • Business vault – This is an optional offering that is frequently developed. It includes business computations and denormalizations with the goal of enhancing the speed and ease of access inside the consumption layer, known as the information mart layer.
  • Data mart layer – This is where data is most typically accessible by users, such as reporting dashboards or extracts. You can build multiple marts from a single Data Vault integration layer, and the most frequent data modeling choice for these marts is Star/Kimball schemas.
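
The following is a minimal, hypothetical sketch of raw vault entities for a clinical source in Amazon Redshift; the schema, table, and column names (including staging.patient) are illustrative and not part of the post’s reference architecture:

CREATE SCHEMA IF NOT EXISTS raw_vault;

-- Hub: one row per business key, identified by a hashed key (Data Vault 2.0 style).
CREATE TABLE raw_vault.hub_patient (
  patient_hk    CHAR(32)    NOT NULL,  -- MD5 hash of the business key
  patient_id    VARCHAR(50) NOT NULL,  -- business key from the source system
  load_dts      TIMESTAMP   NOT NULL,
  record_source VARCHAR(50) NOT NULL
) DISTKEY (patient_hk) SORTKEY (load_dts);

-- Satellite: descriptive attributes and their history for the hub.
CREATE TABLE raw_vault.sat_patient_demographics (
  patient_hk    CHAR(32)    NOT NULL,
  load_dts      TIMESTAMP   NOT NULL,
  record_source VARCHAR(50) NOT NULL,
  hash_diff     CHAR(32)    NOT NULL,  -- change-detection hash of the payload
  birth_year    SMALLINT,
  gender        VARCHAR(10)
) DISTKEY (patient_hk) SORTKEY (patient_hk, load_dts);

-- Load new business keys from staging into the hub (staging.patient is assumed).
INSERT INTO raw_vault.hub_patient
SELECT MD5(s.patient_id), s.patient_id, GETDATE(), 'ehr_system_a'
FROM staging.patient s
WHERE NOT EXISTS (
  SELECT 1 FROM raw_vault.hub_patient h WHERE h.patient_id = s.patient_id
);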

The Amazon Redshift RA3 instance capabilities we discussed enable the development of highly performant and cost-effective Data Vault solutions. The staging and raw layers, for example, are populated in micro-batches by one Amazon Redshift cluster 24 hours a day. You can build the business Data Vault layer once a day and pause it to save costs when completed, and any number of consumer Amazon Redshift clusters can access the results.

The following diagram shows the complete architecture of the clinical quality Data Vault.

The architecture has the following components:

  • Batched raw data is loaded into the staging database raw schema.
  • Structured data that has passed data quality validation is delivered directly to the staging database curated schema.
  • Data from devices, EHR systems, and vendor medical systems are also sent directly into the Amazon Redshift staging database.
  • The Data Vault 2.0 approach is used in the Amazon Redshift producer cluster to standardize data structure and assure data integrity.
  • Materialized views make a business Data Vault.
  • The Amazon Redshift SUPER data type allows you to store and query semi-structured data (see the sketch after this list).
  • Amazon Redshift data share capabilities securely exchange the Data Vault tables and views between departments.
  • Support for streaming ingestion in Amazon Redshift eliminates the need to stage data in Amazon S3 before ingesting it into Amazon Redshift.
  • Amazon Redshift allows you to offload the historical data to Amazon S3. Healthcare clinical quality tracking systems usually offload data older than 6–7 years for archival.
  • Amazon S3 and Amazon Redshift are HIPAA eligible.
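
As an illustration of the SUPER data type mentioned in this list, the following hypothetical example lands a semi-structured clinical payload in a SUPER column and queries it with PartiQL-style navigation; the table and attribute names are assumptions, not part of the reference architecture:

-- Store the raw payload as SUPER and navigate it with dot notation.
CREATE TABLE staging.clinical_event_raw (
  event_id VARCHAR(64),
  load_dts TIMESTAMP DEFAULT GETDATE(),
  payload  SUPER
);

SELECT event_id,
       payload.patient.id    AS patient_id,
       payload.measure.id    AS measure_id,
       payload.measure.score AS measure_score
FROM staging.clinical_event_raw
WHERE payload.measure.id = 'CMS122';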

Advantages of the Amazon Redshift clinical data repository

A clinical data repository in Amazon Redshift has the following benefits:

  • Data Vault enables the preservation of data as raw in its native form.
  • Amazon Redshift RA3 provides limitless storage, a real-time data feed, structured and semi-structured data, and more.
  • The new streaming functionality and analytics capabilities in Amazon Redshift provide near-real-time analysis without the need for extra services.
  • Data quality is assured while maintaining data close to the business and source. You can reduce storage costs while maintaining data fidelity.
  • It’s compatible with business applications and allows for easy adoption when merging or expanding a variety of businesses without complexity.
  • The decoupled storage and compute option in Amazon Redshift allows for cost savings and offers ML capabilities within Amazon Redshift for quicker prediction, forecasting, and anomaly detection.

Conclusion

Data Vault is a one-of-a-kind modeling approach that promotes openness. It’s quick to adopt, infinitely configurable, and ready to overcome the most demanding data challenges in current clinical quality metric tracking systems. Combined with the advantages of Amazon Redshift RA3 instances, Data Vault can outperform and deliver more value for cost and time.

To get started, complete the following steps:

  • Create an Amazon Redshift RA3 instance for the primary clinical data repository and the data marts.
  • Build a Data Vault schema for the raw vault and create materialized views for the business vault.
  • Enable Amazon Redshift data sharing to share data between the producer and consumer clusters.
  • Load the structured and unstructured data into the producer cluster Data Vault for business use.

About the Authors

Sarathi Balakrishnan is a Global Partner Solutions Architect specializing in data, analytics, and AI/ML at AWS. He works closely with AWS partners globally to build solutions and platforms on AWS that accelerate customers’ business outcomes with state-of-the-art cloud technologies and help them achieve more in their cloud explorations. He helps with solution architecture, technical guidance, and best practices to build cloud-native solutions. He joined AWS with over 20 years of large enterprise experience in the agriculture, insurance, healthcare and life sciences, and marketing and advertising industries, developing and implementing data and AI strategies.

Kasi Muthu is a Senior Partner Solutions Architect based in Houston, TX. He is a trusted advisor in the data, analytics, and AI/ML space. He also helps customers migrate their workloads to AWS with a focus on scalability, resiliency, performance, and sustainability. Outside of work, he enjoys spending time with his family and spends quite a bit of his free time on YouTube.

Scale Amazon Redshift to meet high throughput query requirements

Post Syndicated from Erik Anderson original https://aws.amazon.com/blogs/big-data/scale-amazon-redshift-to-meet-high-throughput-query-requirements/

Many enterprise customers have demanding query throughput requirements for their data warehouses. Some may be able to address these requirements through horizontally or vertically scaling a single cluster. Others may have a short duration where they need extra capacity to handle peaks that can be addressed through Amazon Redshift concurrency scaling. However, enterprises with consistently high demand that can’t be serviced by a single cluster need another option. These enterprise customers require large datasets to be returned from queries at a high frequency. These scenarios are also often paired with legacy business intelligence (BI) tools where data is further analyzed.

Amazon Redshift is a fast, fully managed cloud data warehouse. Tens of thousands of customers use Amazon Redshift as their analytics platform. These customers range from small startups to some of the world’s largest enterprises. Users such as data analysts, database developers, and data scientists use Amazon Redshift to analyze their data to make better business decisions.

This post provides an overview of the available scaling options for Amazon Redshift and also shares a new design pattern that enables query processing in scenarios where having multiple leader nodes are required to extract large datasets for clients or BI tools without introducing additional overhead.

Common Amazon Redshift scaling patterns

Because Amazon Redshift is a managed cloud data warehouse, you only pay for what you use, so sizing your cluster appropriately is critical for getting the best performance at the lowest cost. This process begins with choosing the appropriate instance family for your Amazon Redshift nodes. For new workloads that are planning to scale, we recommend starting with our RA3 nodes, which allow you to independently tailor your storage and compute requirements. The RA3 nodes provide three instance types to build your cluster with: ra3.xlplus, ra3.4xlarge, and ra3.16xlarge.

Horizontal cluster scaling

Let’s assume for this example, you build your cluster with four ra3.4xlarge nodes. This configuration provides 48 vCPUs and 384 GiB RAM. Your workload is consistent throughout the day, with few peaks and valleys. As adoption increases and more users need access to the data, you can add nodes of the same node type to your cluster to increase the amount of compute power available to handle those queries. An elastic resize is the fastest way to horizontally scale your cluster to add nodes as a consistent load increases.

Vertical cluster scaling

Horizontal scaling has its limits, however. Each node type has a limit to the number of nodes that can be managed in a single cluster. To continue with the previous example, ra3.4xlarge nodes have a maximum of 64 nodes per cluster. If your workload continues to grow and you’re approaching this limit, you may decide to vertically scale your cluster. Vertically scaling increases the resources given to each node. Based on the additional resources provided by the larger nodes, you will likely decrease the quantity of nodes at the same time.

Rather than running a cluster with 64 ra3.4xlarge nodes, you could elastically resize your cluster to use 16 ra3.16xlarge nodes and have the equivalent resources to host your cluster. The transition to a larger node type allows you to horizontally scale with those larger nodes. You can create an Amazon Redshift cluster with up to 16 nodes. However, after creation, you can resize your cluster to contain up to 32 ra3.xlplus nodes, up to 64 ra3.4xlarge nodes, or up to 128 ra3.16xlarge nodes.

Concurrency scaling

In March 2019, AWS announced the availability of Amazon Redshift concurrency scaling. Concurrency scaling allows you to add more query processing power to your cluster, but only when you need it. Rather than a consistent volume of workload throughout the day, perhaps there are short periods of time when you need more resources. When you choose concurrency scaling, Amazon Redshift automatically and transparently adds more processing power for just those times when you need it. This is a cost-effective, low-touch option for burst workloads. You only pay for what you use on a per-second basis, and you accumulate 1 hour’s worth of concurrency scaling credits every 24 hours. Those free credits have met the needs of 97% of our Amazon Redshift customers’ concurrency scaling requirements, meaning that most customers get the benefits of concurrency scaling without increasing their costs.

The size of your concurrency scaling cluster is directly proportional to your cluster size, so it also scales as your cluster does. By right-sizing your base cluster and using concurrency scaling, you can address the vast majority of performance requirements.

Multi-cluster scaling

Although the previous three scaling options work together to address the needs of the vast majority of our customers, some customers need another option. These use cases require large datasets to be returned from queries at a high frequency and perform further analysis on them using legacy BI tools.

While working with customers to address these use cases, we have found that in these scenarios, multiple medium-sized clusters can perform better than a single large cluster. This phenomenon mostly relates to the single Amazon Redshift leader node’s throughput capacity.

This last scaling pattern uses multiple Amazon Redshift clusters, which allows you to achieve near-limitless read scalability. Rather than relying on a single cluster, a single leader node, and concurrency scaling, this architecture allows you to add as many resources as needed to address your high throughput query requirements. This pattern relies on Amazon Redshift data sharing abilities to enable a seamless multi-cluster experience.

The remainder of this post covers the details of this architecture.

Solution overview

The following diagram outlines a multi-cluster architecture.

The first supporting component for this architecture is Amazon Redshift managed storage. Managed storage is available for RA3 nodes and allows the complete decoupling of compute and storage resources. This decoupling supports another feature that was announced at AWS re:Invent 2020—data sharing. Data sharing is primarily intended to let you share data amongst different data warehouse groups so that you can retain a single set of data to remove duplication. Data sharing ensures that the users accessing the data are using compute on their clusters rather than using compute on the owning cluster, which better aligns cost to usage.

In this post, we introduce another use case of data sharing: horizontal cluster scaling. This architecture allows you to create two or more clusters to handle high throughput query requirements while maintaining a single data source.

An important component in this design is the Network Load Balancer (NLB). The NLB serves as a single access point for clients to connect to the backend data warehouse for performing reads. It also allows changing the number of underlying clusters transparently to users. If you decide to add or remove clusters, all you need to do is add or remove targets in your NLB. It’s also important to note that this design can use any of the previous three scaling options (horizontal, vertical, and concurrency scaling) to fine-tune the number of resources available to service your particular workload.

Prerequisites

Let’s start by creating two Amazon Redshift clusters of RA3 instance type, and name them producer_cluster and consumer_cluster. For instructions, refer to Create a cluster.

In this post, our producer cluster is a central ETL cluster hosting enterprise sales data using a 3 TB Cloud DW dataset based on the TPC-DS benchmark.

The next step is to configure data sharing between the producer and consumer clusters.

Set up data sharing at the producer cluster

In this step, you need a cluster namespace from the consumer_cluster. One way to find the namespace value of a cluster is to run the SQL statement SELECT CURRENT_NAMESPACE when connected to the consumer_cluster. Another way is through the Amazon Redshift console. Navigate to your Amazon Redshift consumer_cluster, and find the cluster namespace located in the General information section.
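For example, you can run the following statement while connected to the consumer_cluster and note the value it returns for the next step:

/* Run while connected to the consumer_cluster; the returned value is the consumer namespace */
SELECT current_namespace;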

After you connect to the producer cluster, create the data share and add the schema and tables to the data share. Then, grant usage to the consumer namespace by providing the namespace value. See the following code:

/* Create Datashare and add objects to the share */ 
CREATE DATASHARE producertpcds3tb;

ALTER DATASHARE producertpcds3tb ADD SCHEMA order_schema;
ALTER DATASHARE producertpcds3tb ADD ALL TABLES in SCHEMA order_schema;

GRANT USAGE ON DATASHARE producertpcds3tb TO NAMESPACE '<consumer namespace>';

You can validate that data sharing was correctly configured by querying these views from the producer cluster:

SELECT * FROM SVV_DATASHARES;
SELECT * FROM SVV_DATASHARE_OBJECTS;

Set up data sharing at the consumer cluster

Get the cluster namespace of the producer cluster by following same steps for the consumer cluster. After you connect to the consumer cluster, you can create a database referencing the data share of the producer cluster. Then you create an external schema and set the search path in the consumer cluster, which allows schema-level access control within the consumer cluster and uses a two-part notation when referencing shared data objects. Finally, you grant usage on the database to a user, and run a query to check if objects as part of data share are accessible. See the following code:

/* Create a local database and schema reference */

CREATE DATABASE tpcds_3tb FROM DATASHARE producertpcds3tb OF NAMESPACE '<producer namespace>';


/*Create External schema */
CREATE EXTERNAL SCHEMA order_schema FROM REDSHIFT DATABASE 'tpcds_3tb' SCHEMA 'order_schema';

SET SEARCH_PATH TO order_schema,public;


/* Grant usage on database to a user */ 

GRANT USAGE On DATABASE tpcds_3tb TO awsuser;

/* Query to check objects accessible from the consumer cluster */

SELECT * FROM SVV_DATASHARE_OBJECTS;

Set up the Network Load Balancer

After you set up data sharing at both the producer_cluster and consumer_cluster, the next step is to configure a Network Load Balancer to accept connections through a single endpoint and forward the connections to both clusters for reading data via queries.

As a prerequisite, collect the following information for the Amazon Redshift producer and consumer clusters from the cluster properties section on the Amazon Redshift console. Unless the consumer cluster is explicitly mentioned below, use the producer cluster’s information.

  • VPCid – Amazon Redshift cluster VPC
  • NLBSubnetid – Subnet where the NLB ENI is created. The NLB and Amazon Redshift subnets need to be in the same Availability Zone.
  • NLBSubnetCIDR – Used for allowlisting inbound access in the Amazon Redshift security group
  • NLBPort – Port to be used by the NLB listener, usually the same as the Amazon Redshift port 5439
  • RedshiftPrivateIP – IP address of the Amazon Redshift leader node of the producer cluster
  • RedshiftPrivateIP – IP address of the Amazon Redshift leader node of the consumer cluster
  • RedshiftPort – Port used by the Amazon Redshift clusters, usually 5439
  • RedshiftSecurityGroup – Security group that allows connectivity to the Amazon Redshift cluster

After you collect this information, run the AWS CloudFormation script NLB.yaml to set up the Network Load Balancer for the producer and consumer clusters. The following screenshot shows the stack parameters.

After you create the CloudFormation stack, note the NLB endpoint on the stack’s Outputs tab. You use this endpoint to connect to the Amazon Redshift clusters.

The CloudFormation stack performs this NLB setup for both the producer and consumer clusters. If needed, you can add more Amazon Redshift clusters to an existing NLB by navigating to the Target groups page of the Amazon EC2 console. Then navigate to rsnlbsetup-target and add the Amazon Redshift cluster leader node’s private IP and port.

Validate the connections to the Amazon Redshift clusters

After you set up the NLB, the next step is to validate the connectivity to the Amazon Redshift clusters. You can do this by first configuring SQL tools like SQL Workbench, DBeaver, or Aginity Workbench and setting the host name and endpoint to the Amazon Redshift cluster’s NLB endpoint, as shown in the following screenshot. For additional configuration information, see Connecting to an Amazon Redshift cluster using SQL client tools.

Repeat this process a few times to validate that connections are established to both clusters. You can use the same NLB endpoint as the host name when configuring any of these client tools.

As a next step, we use JMeter to show how the NLB is connecting to each of the clusters. The Apache JMeter application is open-source software, a 100% pure Java application designed to load test functional behavior and measure performance. Our NLB connects to each cluster in a round-robin manner, which enables even distribution of read load on Amazon Redshift clusters.

Setting up JMeter is out of scope of this post; refer to Building high-quality benchmark tests for Amazon Redshift using Apache JMeter to learn more about setting up JMeter and performance testing on an Amazon Redshift cluster.

The following screenshot shows the HTML output of the response data from JMeter testing. It shows that requests go to both the Amazon Redshift producer and consumer clusters in a round-robin manner.

The preceding screenshot shows a sample output from running 20 SQL queries. Testing with over 1,000 SQL runs was performed across more than four Amazon Redshift clusters, and the NLB distributed them as evenly as possible across all of those clusters.

With this setup, you have the flexibility to add Amazon Redshift clusters to your NLB as needed and can configure data sharing to enable horizontal scaling of Amazon Redshift clusters. When demand reduces, you can either deregister some of the Amazon Redshift clusters from the NLB target group or simply pause them; the NLB automatically routes connections only to the clusters that are available at the time.

Conclusion

In this post, you learned about the different ways that Amazon Redshift can scale to meet your needs as they adjust over time. Use horizontal scaling to increase the number of nodes in your cluster. Use vertical scaling to increase the size of each node. Use concurrency scaling to dynamically address peak workloads. Use multiple clusters with data sharing behind an NLB to provide near-endless scalability. You can use these architectures independently or in combination with each other to build your high-performing, cost-effective data warehouse using Amazon Redshift.

To learn more about some of the foundational features used in the architecture mentioned in this post, refer to:


About the Authors

Erik Anderson is a Principal Solutions Architect at AWS. He has nearly two decades of experience guiding numerous Fortune 100 companies along their technology journeys. He is passionate about helping enterprises build scalable, performant, and cost-effective solutions in the cloud. In his spare time, he loves spending time with his family, home improvement projects, and playing sports.

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build next-generation Analytics solutions using other AWS Analytics Services.

Amazon Redshift continues its price-performance leadership

Post Syndicated from Stefan Gromoll original https://aws.amazon.com/blogs/big-data/amazon-redshift-continues-its-price-performance-leadership/

Data is a strategic asset. Getting timely value from data requires high-performance systems that can deliver performance at scale while keeping costs low. Amazon Redshift is the most popular cloud data warehouse that is used by tens of thousands of customers to analyze exabytes of data every day. We continue to add new capabilities to improve the price-performance ratio for our customers as you bring more data to your Amazon Redshift environments.

This post goes into detail on the analytic workload trends we’re seeing from the Amazon Redshift fleet’s telemetry data, new capabilities we have launched to improve Amazon Redshift’s price-performance, and the results from the latest benchmarks derived from TPC-DS and TPC-H, which reinforce our leadership.

Data-driven performance optimization

We relentlessly focus on improving Amazon Redshift’s price-performance so that you continue to see improvements in your real-world workloads. To this end, the Amazon Redshift team takes a data-driven approach to performance optimization. Werner Vogels discussed our methodology in Amazon Redshift and the art of performance optimization in the cloud, and we have continued to focus our efforts on using performance telemetry from our large customer base to drive the Amazon Redshift performance improvements that matter most to our customers.

At this point, you might ask: why does price-performance matter? One critical aspect of a data warehouse is how it scales as your data grows. Will you be paying more per TB as you add more data, or will your costs remain consistent and predictable? We work to make sure that Amazon Redshift delivers not only strong performance as your data grows, but also consistent price-performance.

Optimizing high-concurrency, low-latency workloads

One of the trends that we have observed is that customers are increasingly building analytics applications that require high concurrency of low-latency queries. In the context of data warehousing, this can mean hundreds or even thousands of users running queries with response time SLAs of under 5 seconds.

A common scenario is an Amazon Redshift-powered business intelligence dashboard that serves analytics to a very large number of analysts. For example, one of our customers processes foreign exchange rates and delivers insights based on this data to their users using an Amazon Redshift-powered dashboard. These users generate an average of 200 concurrent queries to Amazon Redshift that can spike to 1,200 concurrent queries at the open and close of the market, with a P90 query SLA of 1.5 seconds. Amazon Redshift is able to meet this requirement, so this customer can meet their business SLAs and provide the best service possible to their users.

A specific metric we track is the percentage of runtime across all clusters that is spent on short-running queries (queries with runtime less than 1 second). Over the last year, we’ve seen a significant increase in short query workloads in the Amazon Redshift fleet, as shown in the following chart.

As we started to look deeper into how Amazon Redshift ran these kinds of workloads, we discovered several opportunities to optimize performance to give you even better throughput on short queries:

  • We significantly reduced Amazon Redshift’s query-planning overhead. Although this overhead is small in absolute terms, it can account for a significant portion of the runtime of short queries.
  • We improved the performance of several core components for situations where many concurrent processes contend for the same resources. This further reduced our query overhead.
  • We made improvements that allowed Amazon Redshift to more efficiently burst these short queries to concurrency scaling clusters to improve query parallelism.

To see where Amazon Redshift stood after making these engineering improvements, we ran an internal test using the Cloud Data Warehouse Benchmark derived from TPC-DS (see a later section of this post for more details on the benchmark, which is available in GitHub). To simulate a high-concurrency, low-latency workload, we used a small 10 GB dataset so that all queries ran in a few seconds or less. We also ran the same benchmark against several other cloud data warehouses. We didn’t enable auto scaling features such as concurrency scaling on Amazon Redshift for this test because not all data warehouses support it. We used an ra3.4xlarge Amazon Redshift cluster, and sized all other warehouses to the closest matching price-equivalent configuration using on-demand pricing. Based on this configuration, we found that Amazon Redshift can deliver up to 8x better performance on analytics applications that predominantly required short queries with low latency and high concurrency, as shown in the following chart.

With Concurrency Scaling on Amazon Redshift, throughput can be seamlessly and automatically scaled to additional Amazon Redshift clusters as user concurrency grows. We increasingly see customers using Amazon Redshift to build such analytics applications based on our telemetry data.

This is just a small peek into the behind-the-scenes engineering improvements our team is continually making to help you improve performance and save costs using a data-driven approach.

New features improving price-performance

With the constantly evolving data landscape, customers want high-performance data warehouses that continue to launch new capabilities to deliver the best performance at scale while keeping costs low for all workloads and applications. We have continued to add features that improve Amazon Redshift’s price-performance out of the box at no additional cost to you, allowing you to solve business problems at any scale. These features include the use of best-in-class hardware through the AWS Nitro System, hardware acceleration with AQUA, auto-rewriting queries so that they run faster using materialized views, Automatic Table Optimization (ATO) for schema optimization, Automatic Workload Management (WLM) to offer dynamic concurrency and optimize resource utilization, short query acceleration, automatic materialized views, vectorization and single instruction/multiple data (SIMD) processing, and much more. Amazon Redshift has evolved to become a self-learning, self-tuning data warehouse, abstracting away the performance management effort needed so you can focus on high-value activities like building analytics applications.

To validate the impact of the latest Amazon Redshift performance enhancements, we ran price-performance benchmarks comparing Amazon Redshift with other cloud data warehouses. For these tests, we ran both a TPC-DS-derived benchmark and a TPC-H-derived benchmark using a 10-node ra3.4xlarge Amazon Redshift cluster. To run the tests on other data warehouses, we chose warehouse sizes that most closely matched the Amazon Redshift cluster in price ($32.60 per hour), using published on-demand pricing for all data warehouses. Because Amazon Redshift is an auto-tuning warehouse, all tests are “out of the box,” meaning no manual tunings or special database configurations are applied—the clusters are launched and the benchmark is run. Price-performance is then calculated as cost per hour (USD) times the benchmark runtime in hours, which is equivalent to the cost to run the benchmark.
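To make that formula concrete, here is a hypothetical illustration (the 30-minute runtime is illustrative, not a published result): a warehouse billed at $32.60 per hour that completes the benchmark in 30 minutes costs $16.30 to run it.

/* Hypothetical illustration: cost per hour (USD) x runtime (hours) = cost to run the benchmark */
SELECT 32.60 * (30.0 / 60.0) AS cost_to_run_benchmark_usd;   -- returns 16.30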

For both the TPC-DS-derived and TPC-H-derived tests, we find that Amazon Redshift consistently delivers the best price-performance. The following chart shows the results for the TPC-DS-derived benchmark.

The following chart shows the results for the TPC-H-derived benchmark.

Although these benchmarks reaffirm Amazon Redshift’s price-performance leadership, we always encourage you to try Amazon Redshift using your own proof-of-concept workloads as the best way to see how Amazon Redshift can meet your data needs.

Find the best price-performance for your workloads

The benchmarks used in this post are derived from the industry-standard TPC-DS and TPC-H benchmarks, and have the following characteristics:

  • The schema and data are used unmodified from TPC-DS and TPC-H.
  • The queries are used unmodified from TPC-DS and TPC-H. TPC-approved query variants are used for a warehouse if the warehouse doesn’t support the SQL dialect of the default TPC-DS or TPC-H query.
  • The test includes only the 99 TPC-DS and 22 TPC-H SELECT queries. It doesn’t include maintenance and throughput steps.
  • Three power runs (single stream) were run with query parameters generated using the default random seed of the TPC-DS and TPC-H kits.
  • The primary metric of total query runtime is used when calculating price-performance. The runtime is taken as the best of the three runs.
  • Price-performance is calculated as cost per hour (USD) times the benchmark runtime in hours, which is equivalent to cost to run the benchmark. Published on-demand pricing is used for all data warehouses.

We call this benchmark the Cloud Data Warehouse Benchmark, and you can easily reproduce the preceding benchmark results using the scripts, queries, and data available on GitHub. It is derived from the TPC-DS and TPC-H benchmarks as described earlier, and as such is not comparable to published TPC-DS or TPC-H results, because the results of our tests don’t comply with the specification.

Each workload has unique characteristics, so if you’re just getting started, a proof of concept is the best way to understand how Amazon Redshift performs for your requirements. When running your own proof of concept, it’s important to focus on the right metrics—query throughput (number of queries per hour) and price-performance. You can make a data-driven decision by running a proof of concept on your own or with assistance from AWS or a system integration and consulting partner.

Conclusion

This post discussed the analytic workload trends we’re seeing from Amazon Redshift customers, new capabilities we have launched to improve Amazon Redshift’s price-performance, and the results from the latest benchmarks.

If you’re an existing Amazon Redshift customer, connect with us for a free optimization session and briefing on the new features announced at AWS re:Invent 2021. To stay up to date with the latest developments in Amazon Redshift, follow the What’s New in Amazon Redshift feed.


About the Authors

Stefan Gromoll is a Senior Performance Engineer with Amazon Redshift where he is responsible for measuring and improving Redshift performance. In his spare time, he enjoys cooking, playing with his three boys, and chopping firewood.

Ravi Animi is a Senior Product Management leader in the Redshift Team and manages several functional areas of the Amazon Redshift cloud data warehouse service including performance, spatial analytics, streaming ingestion and migration strategies. He has experience with relational databases, multi-dimensional databases, IoT technologies, storage and compute infrastructure services and more recently as a startup founder using AI/deep learning, computer vision, and robotics.

Florian Wende is a Performance Engineer with Amazon Redshift.

Automate notifications on Slack for Amazon Redshift query monitoring rule violations

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/automate-notifications-on-slack-for-amazon-redshift-query-monitoring-rule-violation/

In this post, we walk you through how to set up automatic notifications of query monitoring rule (QMR) violations in Amazon Redshift to a Slack channel, so that Amazon Redshift users can take timely action.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze your data to derive holistic insights about your business and your customers. One of the challenges is to protect the data warehouse workload from poorly written queries that can consume significant resources. Amazon Redshift query monitoring rules are a feature of workload management (WLM) that allow automatic handling of poorly written queries. Rules that are applied to a WLM queue allow queries to be logged, canceled, hopped (only available with manual WLM), or to change priority (only available with automatic WLM). The reason to use QMRs is to protect against wasteful use of the cluster. You can also use these rules to log resource-intensive queries, which provides the opportunity to establish governance for ad hoc workloads.

The Amazon Redshift cluster automatically collects query monitoring rules metrics. This convenient mechanism lets you view attributes like the following:

  • Query runtime, in seconds
  • Query return row count
  • The CPU time for a SQL statement

It also makes Amazon Redshift Spectrum metrics available, such as the number of Redshift Spectrum rows and MBs scanned by a query.

When a query violates a QMR, Amazon Redshift logs the violation to the STL_WLM_RULE_ACTION system view. If the rule’s action is set to abort, end-users see an error indicating that the query failed due to a QMR violation. We recommend that administrative team members periodically examine violations listed in the STL_WLM_RULE_ACTION table and coach the involved end-users on how to avoid future rule violations.
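For example, an administrator could review recent violations with a query along the following lines (the time window and column list are illustrative):

/* Review QMR violations recorded over the last 24 hours (illustrative window) */
SELECT userid, query, service_class, TRIM(rule) AS rule, TRIM(action) AS action, recordtime
FROM stl_wlm_rule_action
WHERE recordtime >= DATEADD(hour, -24, GETDATE())
ORDER BY recordtime DESC;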

Alternatively, a centralized team that uses a Slack channel for collaboration and monitoring can have Amazon Redshift events and alarms sent to their channel, so that they can take timely action. In the following sections, we walk you through how to set up automatic notifications of QMR violations to a Slack channel using Lambda functions and Amazon SNS. This allows Amazon Redshift users to be notified and take timely action without the need to query the system view.

Solution overview

To demonstrate how you can receive automatic notification to a Slack channel for QMR violation, we have designed the following architecture. As shown in the following diagram, we have mixed workload extract, transform, and load (ETL), business intelligence (BI) dashboards, and analytics applications that are powered by an Amazon Redshift cluster. The solution relies on AWS Lambda and Amazon Simple Notification Service (Amazon SNS) to send notifications of Amazon Redshift QMR violations to Slack.

To implement this solution, you create an Amazon Redshift cluster and attach a custom defined parameter group.

Amazon Redshift provides one default parameter group for each parameter group family. The default parameter group has preset values for each of its parameters, and it can’t be modified. If you want to use different parameter values than the default parameter group, you must create a custom parameter group and then associate your cluster with it.

In the parameter group, you can use automatic WLM and define a few workload queues, such as a queue for processing ETL workloads and a reporting queue for user queries. You can name the default queue adhoc. With automatic WLM, Amazon Redshift determines the optimal concurrency and memory allocation for each query that is running in each queue.

For each workload queue, you can define one or more QMRs. For example, you can create a rule to abort a user query if it runs for more than 300 seconds or returns more than 1 billion rows. Similarly, you can create a rule to log a Redshift Spectrum query that scans more than 100 MB.

The Amazon Redshift WLM evaluates metrics every 10 seconds. It records details about actions that result from QMR violations associated with user-defined queues in the STL_WLM_RULE_ACTION system table. In this solution, a Lambda function is scheduled to monitor the STL_WLM_RULE_ACTION system table every few minutes. When the function is invoked, if it finds a new entry, it publishes a detailed message to an SNS topic. A second Lambda function, created as the target subscriber to the SNS topic, is invoked whenever any message is published to the SNS topic. This second function invokes a pre-created Slack webhook, which sends the message that was received through the SNS topic to the Slack channel of your choice. (For more information on publishing messages by using Slack webhooks, see Sending messages using incoming webhooks.)

To summarize, the solution involves the following steps:

  1. Create an Amazon Redshift custom parameter group and add workload queues.
  2. Configure query monitoring rules.
  3. Attach the custom parameter group to the cluster.
  4. Create an SNS topic.
  5. Create a Lambda function and schedule it to run every 5 minutes by using an Amazon EventBridge rule.
  6. Create the Slack resources.
  7. Add an incoming webhook and authorize the Slack app to post messages to a Slack channel.
  8. Create the second Lambda function and subscribe to the SNS topic.
  9. Test the solution.

Create an Amazon Redshift custom parameter group and add workload queues

In this step, you create an Amazon Redshift custom parameter group with automatic WLM enabled. You also create the following queues to separate the workloads in the parameter group:

  • reporting – The reporting queue runs BI reporting queries that are performed by any user who belongs to the Amazon Redshift database group named reporting_group
  • adhoc – The default queue, renamed adhoc, performs any query that is not sent to any other queue

Complete the following steps to create your parameter group and add workload queues:

  1. Create a parameter group, named csblog, with automatic WLM enabled.
  2. On the Amazon Redshift console, select the custom parameter group you created.
  3. Choose Edit workload queues.
  4. On the Modify workload queues page, choose Add queue.
  5. Fill in the Concurrency scaling mode and Query priority fields as needed to create the reporting queue.
  6. Repeat these steps to add the adhoc queue.

For more information about WLM queues, refer to Configuring workload management.

Configure query monitoring rules

In this step, you add QMRs to each workload queue. For instructions, refer to Creating or modifying a query monitoring rule using the console.

For the reporting queue, add the following QMRs:

  • nested_loop – Logs any query involved in a nested loop join that produces more than 10,000,000 rows.
  • long_running – Stops queries that run for more than 300 seconds (5 minutes).

For the adhoc queue, add the following QMRs:

  • returned_rows – Stops any query that returns more than 1,000,000 rows back to the calling client application (this isn’t practical and can degrade the end-to-end performance of the application).
  • spectrum_scan – Stops any query that scans more than 1000 MB of data from an Amazon Simple Storage Service (Amazon S3) data lake by using Redshift Spectrum.

Attach the custom parameter group to the cluster

To attach the custom parameter group to your provisioned Redshift cluster, follow the instructions in Associating a parameter group with a cluster. If you don’t already have a provisioned Redshift cluster, refer to Create a cluster.

For this post, we attached our custom parameter group csblog to an already created provisioned Amazon Redshift cluster.

Create an SNS topic

In this step, you create an SNS topic that receives a detailed message of QMR violation from the Lambda function that checks the Amazon Redshift system table for QMR violation entries. For instructions, refer to Creating an Amazon SNS topic.

For this post, we created an SNS topic named redshiftqmrrulenotification.

Create a Lambda function to monitor the system table

In this step, you create a Lambda function that monitors the STL_WLM_RULE_ACTION system table. Whenever any record is found in the table since the last time the function ran, the function publishes a detailed message to the SNS topic that you created earlier. You also create an EventBridge rule to invoke the function every 5 minutes.

For this post, we create a Lambda function named redshiftqmrrule that is scheduled to run every 5 minutes via an EventBridge rule named Redshift-qmr-rule-Lambda-schedule. For instructions, refer to Building Lambda functions with Python.

The following screenshot shows the function that checks the pg_catalog.stl_wlm_rule_action table.

To create an EventBridge rule and associate it with the Lambda function, refer to Create a Rule.

The following screenshot shows the EventBridge rule Redshift-qmr-rule-Lambda-schedule, which calls the function every 5 minutes.

We use the following Python 3.9 code for this Lambda function. The function uses an Amazon Redshift Data API call that uses GetClusterCredentials for temporary credentials.

import time
import sys
import os
import boto3
from datetime import datetime

# initiate redshift-data client in boto3
client = boto3.client("redshift-data")

# Fetch QMR violations recorded during the last 5 minutes (matches the EventBridge schedule)
query = "select userid,query,service_class,trim(rule) as rule,trim(action) as action,recordtime from stl_wlm_rule_action WHERE userid > 1 AND recordtime >= current_timestamp AT TIME ZONE 'UTC' - INTERVAL '5 minute' order by recordtime desc;"

# SNS topic ARN (from an environment variable) used as the publish target for notifications
sns = boto3.resource('sns')
sns_arn = os.environ['sns_arn']
platform_endpoint = sns.PlatformEndpoint('{sns_arn}'.format(sns_arn = sns_arn))

def status_check(client, query_id):
    desc = client.describe_statement(Id=query_id)
    status = desc["Status"]
    if status == "FAILED":
        raise Exception('SQL query failed:' + query_id + ": " + desc["Error"])
    return status.strip('"')

def execute_sql(sql_text, redshift_database, redshift_user, redshift_cluster_id):
    print("Executing: {}".format(sql_text))
    res = client.execute_statement(Database=redshift_database, DbUser=redshift_user, Sql=sql_text,
                                   ClusterIdentifier=redshift_cluster_id)
    
    query_id = res["Id"]
    print("query id")
    print(query_id)
    done = False
    while not done:
        time.sleep(1)
        status = status_check(client, query_id)
        if status in ("FAILED", "FINISHED"):
            print("status is: {}".format(status))
            break
    return query_id

def publish_to_sns(message):
    try:
        # Publish a message.
        response = platform_endpoint.publish(
                  Subject='Redshift Query Monitoring Rule Notifications',
                  Message=message,
                  MessageStructure='string'

            )
        return  response

    except:
        print(' Failed to publish messages to SNS topic: exception %s' % sys.exc_info()[1])
        return 'Failed'

def lambda_handler(event, context):

    # Cluster connection details are provided through environment variables
    rsdb = os.environ['rsdb']
    rsuser = os.environ['rsuser']
    rscluster = os.environ['rscluster']
    # Run the QMR violation query through the Data API
    res = execute_sql(query, rsdb, rsuser, rscluster)
    print("res")
    print(res)
    response = client.get_statement_result(
        Id = res
    )
    # datetime object containing current date and time
    now = datetime.now()
    dt_string = now.strftime("%d-%b-%Y %H:%M:%S")
    print(response) 
    if response['TotalNumRows'] > 0:
        messageText = '################## Reporting Begin' + ' [' + str(dt_string) + ' UTC] ##################\n\n'
        messageText = messageText + 'Total number of queries affected by QMR Rule violation for Redshift cluster "' + rscluster + '" is ' + str(len(response['Records'])) + '.' + '\n' + '\n'
        for i in range(len(response['Records'])):
            messageText = messageText + 'It was reported at ' + str(response['Records'][i][5]['stringValue'])[11:19] + ' UTC on ' + str(response['Records'][i][5]['stringValue'])[0:10] + ' that a query with Query ID - ' + str(response['Records'][i][1]['longValue']) + ' had to ' +  str(response['Records'][i][4]['stringValue']) + ' due to violation of QMR Rule "' + str(response['Records'][i][3]['stringValue']) + '".\n'
        messageText = messageText + '\n########################### Reporting End ############################\n\n'
        query_result_json = messageText
        response = publish_to_sns(query_result_json)
    else:
        print('No rows to publish to SNS')

We use four environment variables for this Lambda function:

  • rscluster – The Amazon Redshift provisioned cluster identifier
  • rsdb – The Amazon Redshift database where you’re running these tests
  • rsuser – The Amazon Redshift user who has the privilege to run queries on pg_catalog.stl_wlm_rule_action
  • sns_arn – The Amazon Resource Name (ARN) of the SNS topic that we created earlier

Create Slack resources

In this step, you create a new Slack workspace (if you don’t have one already), a new private Slack channel (only if you don’t have one or don’t want to use an existing one), and a new Slack app in the Slack workspace. For instructions, refer to Create a Slack workspace, Create a channel, and Creating an app.

For this post, we created the following resources in the Slack website and Slack desktop app:

  • A Slack workspace named RedshiftQMR*****
  • A private channel, named redshift-qmr-notification-*****-*******, in the newly created Slack workspace
  • A new Slack app in the Slack workspace, named RedshiftQMRRuleNotification (using the From Scratch option)

Add an incoming webhook and authorize Slack app

In this step, you enable and add an incoming webhook to the Slack workspace that you created. For full instructions, refer to Enable Incoming Webhooks and Create an Incoming Webhook. You also authorize your Slack app so that it can post messages to the private Slack channel.

  1. In the Slack app, under Settings in the navigation pane, choose Basic Information.
  2. Choose Incoming Webhooks.
  3. Turn on Activate Incoming Webhooks.
  4. Choose Add New Webhook to Workspace.
  5. Authorize the Slack app RedshiftQMRRuleNotification so that it can post messages to the private Slack channel redshift-qmr-notification-*****-*******.

The following screenshot shows the details of the newly added incoming webhook.

Create a second Lambda function and subscribe to the SNS topic

In this step, you create a second Lambda function that is subscribed to the SNS topic that you created earlier. For full instructions, refer to Building Lambda functions with Python and Subscribing a function to a topic.

For this post, we create a second function named redshiftqmrrulesnsinvoke, which is subscribed to the SNS topic redshiftqmrrulenotification. The second function sends a detailed QMR violation message (received from the SNS topic) to the designated Slack channel named redshift-qmr-notification-*. This function uses the incoming Slack webhook that we created earlier.

We also create an SNS subscription of the second Lambda function to the SNS topic that we created previously.

The following is the Python 3.9 code used for the second Lambda function:

import urllib3
import json
import os

http = urllib3.PoolManager()

def lambda_handler(event, context):

    # Slack incoming webhook URL and target channel are provided through environment variables
    url = os.environ['webhook']
    channel = os.environ['channel']

    # Build the Slack message payload from the SNS notification that invoked this function
    msg = {
        "channel": channel,
        "username": "WEBHOOK_USERNAME",
        "text": event['Records'][0]['Sns']['Message'],
        "icon_emoji": ""
    }

    # Post the message to the Slack incoming webhook and log the response
    encoded_msg = json.dumps(msg).encode('utf-8')
    resp = http.request('POST', url, body=encoded_msg)
    print({
        "message": event['Records'][0]['Sns']['Message'], 
        "status_code": resp.status, 
        "response": resp.data
    })

We use two environment variables for the second Lambda function:

  • channel – The Slack channel that we created
  • webhook – The Slack webhook that we created

Test the solution

To show the effect of the QMRs, we ran queries that violate the QMRs we set up.

Test 1: Returned rows

Test 1 looks for violations of the returned_rows QMR, in which the return row count is over 1,000,000 for a query that ran in the adhoc queue.

We created and loaded a table named lineitem in a schema named aquademo, which has more than 18 billion records. You can refer to the GitHub repo to create and load the table.

We ran the following query, which violated the returned_rows QMR, and the query was stopped as specified in the action set in the QMR.

select * from aquademo.lineitem limit 1000001;

The following screenshot shows the view from the Amazon Redshift client after running the query.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Test 2: Long-running queries

Test 2 looks for violations of the long_running QMR, in which query runtime is over 300 seconds for a user who belongs to reporting_group.

In the following code, we created a new Amazon Redshift group named reporting_group and added a new user, named reporting_user, to the group. reporting_group is assigned USAGE and SELECT privileges on all tables in the retail and aquademo schemas.

create group reporting_group;
create user reporting_user in group reporting_group password 'Test12345';
grant usage on schema retail,aquademo to group reporting_group;
grant select on all tables in schema retail,aquademo to group reporting_group;

We set the session authorization to reporting_user so the query runs in the reporting queue. We ran the following query, which violated the long_running QMR, and the query was stopped as specified in the action set in the QMR:

set session authorization reporting_user;
set enable_result_cache_for_session  to off;
select * from aquademo.lineitem;

The following screenshot shows the view from the Amazon Redshift client.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Test 3: Nested loops

Test 3 looks for violations of the nested_loop QMR, in which the nested loop join row count is over 10,000,000 for a user who belongs to reporting_group.

We set the session authorization to reporting_user so the query runs in the reporting queue. We ran the following query, which violated the nested_loop QMR, and the query logged the violation as specified in the action set in the QMR:

set session authorization reporting_user;
set enable_result_cache_for_session  to off;
select ss.*,cd.* 
from retail.store_sales ss
, retail.customer_demographics cd;

Before we ran the original query, we also checked the explain plan and noted that this nested loop would return more than 10,000,000 rows. The following screenshot shows the query explain plan.
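If you want to reproduce that check, you can prefix the query with EXPLAIN and look at the estimated rows on the Nested Loop step:

/* Inspect the plan first; the Nested Loop step shows the estimated row count */
EXPLAIN
SELECT ss.*, cd.*
FROM retail.store_sales ss, retail.customer_demographics cd;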

The following screenshot shows the notification we received in our Slack channel.

Test 4: Redshift Spectrum scans

Test 4 looks for violations of the spectrum_scan QMR, in which Redshift Spectrum scans exceed 1000 MB for a query that ran in the adhoc queue.

For this example, we used store_sales data (unloaded from an Amazon Redshift table that was created by using the TPC-DS benchmark data) loaded in an Amazon S3 location. Data in Amazon S3 is non-partitioned under one prefix and has a volume around 3.9 GB. We created an external schema (qmr_spectrum_rule_test) and external table (qmr_rule_store_sales) in Redshift Spectrum.

We used the following steps to run this test with the sample data:

  1. Run an unload SQL command:
    unload ('select * from store_sales')
    to 's3://<<Your Amazon S3 Location>>/store_sales/' 
    iam_role default;

  2. Create an external schema from Redshift Spectrum:
    CREATE EXTERNAL SCHEMA if not exists qmr_spectrum_rule_test
    FROM DATA CATALOG DATABASE 'qmr_spectrum_rule_test' region 'us-east-1' 
    IAM_ROLE default
    CREATE EXTERNAL DATABASE IF NOT exists;

  3. Create an external table in Redshift Spectrum:
    create external table qmr_spectrum_rule_test.qmr_rule_store_sales
    (
    ss_sold_date_sk int4 ,            
      ss_sold_time_sk int4 ,     
      ss_item_sk int4  ,      
      ss_customer_sk int4 ,           
      ss_cdemo_sk int4 ,              
      ss_hdemo_sk int4 ,         
      ss_addr_sk int4 ,               
      ss_store_sk int4 ,           
      ss_promo_sk int4 ,           
      ss_ticket_number int8 ,        
      ss_quantity int4 ,           
      ss_wholesale_cost numeric(7,2) ,          
      ss_list_price numeric(7,2) ,              
      ss_sales_price numeric(7,2) ,
      ss_ext_discount_amt numeric(7,2) ,             
      ss_ext_sales_price numeric(7,2) ,              
      ss_ext_wholesale_cost numeric(7,2) ,           
      ss_ext_list_price numeric(7,2) ,               
      ss_ext_tax numeric(7,2) ,                 
      ss_coupon_amt numeric(7,2) , 
      ss_net_paid numeric(7,2) ,   
      ss_net_paid_inc_tax numeric(7,2) ,             
      ss_net_profit numeric(7,2)                     
    ) ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY '|' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://<<Your Amazon S3 Location>>/store_sales/'
    TABLE PROPERTIES (
      'averageRecordSize'='130', 
      'classification'='csv', 
      'columnsOrdered'='true', 
      'compressionType'='none', 
      'delimiter'='|', 
      'recordCount'='11083990573', 
      'sizeKey'='1650877678933', 
      'typeOfData'='file');

  4. Run the following query:
    select * 
    FROM qmr_spectrum_rule_test.qmr_rule_store_sales 
    where ss_sold_date_sk = 2451074;

The query violated the spectrum_scan QMR, and the query was stopped as specified in the action set in the QMR.

The following screenshot shows the view from the Amazon Redshift client.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Clean up

When you’re finished with this solution, we recommend deleting the resources you created to avoid incurring any further charges.

Conclusion

Amazon Redshift is a powerful, fully managed data warehouse that can offer significantly increased performance and lower cost in the cloud. In this post, we discussed how you can automate notification of misbehaving queries on Slack by using query monitoring rules. QMRs can help you maximize cluster performance and throughput when supporting mixed workloads. Use these instructions to set up your Slack channel to receive automatic notifications from your Amazon Redshift cluster for any violation of QMRs.


About the Authors

Dipankar Kushari is a Senior Specialist Solutions Architect in the Analytics team at AWS.

Harshida Patel is a Specialist Senior Solutions Architect in the Analytics team at AWS.

Share data securely across Regions using Amazon Redshift data sharing

Post Syndicated from Rahul Chaturvedi original https://aws.amazon.com/blogs/big-data/share-data-securely-across-regions-using-amazon-redshift-data-sharing/

Today’s global, data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. This requires you to seamlessly share and consume live, consistent data as a single source of truth without copying the data, regardless of where LOB users are located.

Amazon Redshift is a fast, scalable, secure, and fully managed data warehouse that enables you to analyze all your data using standard SQL easily and cost-effectively. Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Amazon Redshift cluster with another Amazon Redshift cluster across accounts and Regions, without needing to copy or move data from one cluster to the other.

Amazon Redshift data sharing was initially launched in March 2021, and support for cross-account data sharing was added in August 2021. Cross-Region support became generally available in February 2022. This provides full flexibility and agility to easily share data across Amazon Redshift clusters in the same AWS account, different accounts, or different Regions.

In this post, we discuss how to configure cross-Region data sharing between different accounts or in the same account.

Solution overview

It’s easy to set up cross-account and cross-Region data sharing from producer to consumer clusters, as shown in the following flow diagram. The workflow consists of the following components:

  • Producer cluster administrator – The producer admin is responsible for the following:
    • Create an Amazon Redshift database share (a new named object to serve as a unit of sharing, referred to as a datashare) on the AWS Management Console
    • Add database objects (schemas, tables, views) to the datashare
    • Specify a list of consumers that the objects should be shared with
  • Consumer cluster administrator – The consumer admin is responsible for the following:
    • Examine the datashares that are made available and review the content of each share
    • Create an Amazon Redshift database from the datashare object
    • Assign permissions on this database to appropriate users and group in the consumer cluster
  • Users and groups in the consumer cluster – Users and groups can perform the following actions:
    • List the shared objects as part of standard metadata queries
    • Start querying immediately

In the following sections, we discuss use cases and how to implement the cross-Region sharing feature between different accounts or in the same account.

Use cases

Common use cases for the data sharing feature include implementing multi-tenant patterns, data sharing for workload isolation, and security considerations related to data sharing.

In this post, we demonstrate cross-Region data sharing, which is especially useful for the following use cases:

  • Support for different kinds of business-critical workloads – You can use a central extract, transform, and load (ETL) cluster that shares data with multiple business intelligence (BI) or analytic clusters owned by geographically distributed business groups
  • Enabling cross-group collaboration – The solution enables seamless collaboration across teams and business groups for broader analytics, data science, and cross-product impact analysis
  • Delivering data as a service – You can share data as a service across your organization
  • Sharing data between environments – You can share data among development, test, and production environments at different levels of granularity
  • Licensing access to data in Amazon Redshift – You can publish Amazon Redshift datasets on AWS Data Exchange that customers can find, subscribe to, and query in minutes

Cross-Region data sharing between different accounts

In this use case, we share data from a cluster in an account in us-east-1 with a cluster owned by a different account in us-west-2. The following diagram illustrates this architecture.

When setting up cross-Region data sharing across different AWS accounts, there’s an additional step to authorize the datashare on the producer account and associate the datashare on the consumer account. The reason is that shares that go outside of your account or organization require a two-step authorization process. For example, consider a scenario where a database analyst creates a datashare using the console and the producer administrator must then authorize it. For same-account sharing, that two-step approach isn’t required because you’re granting access within the perimeter of your own account.

This setup broadly follows a four-step process:

  1. Create a datashare on the producer account.
  2. Authorize the datashare on the producer account.
  3. Associate the datashare on the consumer account.
  4. Query the datashare.

Create a datashare

To create your datashare, complete the following steps:

  1. On the Amazon Redshift console, create a producer cluster with encryption enabled.

This is in your source account. For instructions on how to automate process of creating an Amazon Redshift cluster, refer to Automate building an integrated analytics solution with AWS Analytics Automation Toolkit.

On the Datashares tab, no datashares are currently available on the producer account.

  1. Create a consumer cluster with encryption enabled in a different AWS account and different Region.

This is your target account. On the Datashares tab, no datashares are currently available on the consumer side.

  1. Choose Connect to database to start setting up your datashare.
  2. For Connection, select Create a new connection.

You can connect to the database using either temporary credentials or AWS Secrets Manager. Refer to Querying a database using the query editor to learn more about these two connection options. We use temporary credentials in this post.

  1. For Authentication, select Temporary credentials.
  2. For Database name, enter a name (for this post, dev).
  3. For Database user, enter the user authorized to access the database (for this post, awsuser).
  4. Choose Connect.
  5. Repeat these steps on the producer cluster.
  6. Next, create a datashare on the producer.
  7. For Datashare type, select Datashare.
  8. For Datashare name, enter a name.
  9. For Database name, choose the database you specified earlier.
  10. For Publicly accessible, select Enable.
  11. In the Datashare objects section, choose Add to add the database schemas and tables to share.

In our case, we add all tables and views from the public schema to the datashare.

  1. Choose Add.

Next, you add the consumer account in order to add a cross-account, cross-Region consumer to this datashare.

  1. In the Data consumers section, select Add AWS accounts to the datashare and enter your AWS account ID.
  2. Choose Create datashare.

Authorize the datashare

To authorize this datashare, complete the following steps:

  1. On the Amazon Redshift console, choose Datashares in the navigation pane.
  2. Choose the datashare you created earlier.
  3. In the Data consumers section, select the data consumer ID and choose Authorize.
  4. Choose Authorize to confirm.

Associate the datashare

On the consumer account in a different Region, we have multiple clusters. In this step, the consumer administrator associates this datashare with only one of those clusters. As a consumer, you have the option to accept or decline a datashare. We associate this datashare with cross-region-us-west-2-team-a.

  1. On the Datashares page, choose the From other accounts tab to find the database you created in the source account.
  2. Select the appropriate datashare and choose Associate.

In the next step, you as a consumer admin have the option to associate the datashare with specific clusters in a different Region. Each cluster has its own globally unique identifier, known as a namespace. For this post, we associate the datashare to cluster cross-region-us-west-2-team-a with the namespace 90dab986-b5f1-4cbe-b985-dbb4ebdc55a8 in the us-west-2 Region.

  1. For Choose namespaces, select Add specific namespaces.
  2. Select your namespace and choose Add Region.
  3. Choose Clusters in the navigation pane.
  4. Choose the cluster that you authorized access to the datashare.
  5. On the Datashares tab, confirm the datashare from the producer cluster is listed in the Datashares from other namespaces and AWS accounts section.

Query the datashare

You can query shared data using standard SQL interfaces, JDBC or ODBC drivers, and the Data API. You can also query data with high performance from familiar BI and analytic tools. You can run queries by referring to the objects from other Amazon Redshift databases that are both local to and remote from your cluster that you have permission to access.

You can do so simply by staying connected to local databases in your cluster. Then you can create consumer databases from datashares to consume shared data.

After you have done so, you can run cross-database queries joining the datasets. You can query objects in consumer databases using the three-part notation consumer_database_name.schema_name.table_name. You can also query using external schema links to schemas in the consumer database. You can query both local data and data shared from other clusters within the same query. Such a query can reference objects from the current connected database and from other non-connected databases, including consumer databases created from datashares.

  1. On the consumer cluster, create a database to map the datashare objects to your cluster (see the example following this list).
  2. You can now join your local schema objects with the objects in the shared database in a single query.
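The following is a sketch of step 1 for this cross-account scenario; the datashare name, account ID, and namespace are placeholders you replace with your own values, and the local database name producerdata matches the one referenced in the query below:

/* On the consumer cluster: create a local database from the cross-account datashare */
CREATE DATABASE producerdata
FROM DATASHARE <datashare_name> OF ACCOUNT '<producer account id>' NAMESPACE '<producer namespace>';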

You can access the shared objects using a three-part notation, as highlighted in the following query:

select a.c_custkey,count(b.o_orderkey) as cnt_of_orders
from dev.public.customer a
join producerdata.public.orders b
on a.c_custkey = b.o_custkey
group by 1
order by cnt_of_orders desc 
limit 5;

The following screenshot shows our query results.

One alternative to using three-part notation is to create external schemas for the shared objects. As shown in the following code, we create producerSchema to map the datashare schema to a local schema:

CREATE EXTERNAL SCHEMA producerSchema
FROM REDSHIFT DATABASE 'producerdata' SCHEMA 'public';

Cross-Region data sharing in the same account

When setting up cross-Region data sharing in the same account, we don’t need to follow the authorize-associate flow on the console. We can easily configure this share using SQL commands. Complete the following steps:

  1. On the consumer cluster, run the following SQL statement to get the current namespace:
    select current_namespace;

The following screenshot shows our results.

  1. On the producer cluster, run the SQL commands as shown in the following screenshot.

This step involves creating a datashare and making it publicly accessible so that the consumer can access it from across the Region. It then adds the database objects to the datashare, followed by granting permissions on this datashare to the consumer namespace. Use your consumer namespace from the previous step.
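The commands in that screenshot follow this general pattern; the datashare name (crossregionshare) and schema (public) used here are placeholders for illustration:

/* On the producer cluster: create the datashare and make it publicly accessible */
CREATE DATASHARE crossregionshare SET PUBLICACCESSIBLE TRUE;

/* Add the database objects to the datashare */
ALTER DATASHARE crossregionshare ADD SCHEMA public;
ALTER DATASHARE crossregionshare ADD ALL TABLES IN SCHEMA public;

/* Grant usage to the consumer namespace captured in the previous step */
GRANT USAGE ON DATASHARE crossregionshare TO NAMESPACE '<consumer namespace>';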

  1. Capture the producer namespace from the producer cluster to use in the next step.

The consumer cluster now has access to the datashare that was created on the producer.

  1. You can verify this from svv_datashares.
  2. After you confirm that the datashare is available on consumer, the next step is to map the objects of the datashare to a database in order to start querying them, as shown in the following screenshots.
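As a sketch of those two steps on the consumer cluster (again, the datashare and database names are placeholders):

/* Verify that the datashare is visible on the consumer cluster */
SELECT * FROM SVV_DATASHARES;

/* Map the datashare objects to a local database so you can start querying them */
CREATE DATABASE crossregion_db FROM DATASHARE crossregionshare OF NAMESPACE '<producer namespace>';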

Considerations for using cross-Region data sharing

Keep in mind the following when using cross-Region data sharing:

  • You must enable encryption on both the producer and consumer clusters.
  • For cross-Region data sharing, the consumer pays the cross-Region data transfer fee from the producer Region to the consumer Region based on the price of Amazon Simple Storage Service (Amazon S3), in addition to the cost of using the Amazon Redshift cluster.
  • You can’t share stored procedures and Python UDF functions.
  • This feature is only supported on the RA3 instance type.
  • The producer cluster administrator is responsible for implementing any Region locality and compliance requirements (such as GDPR) by performing the following actions:
    • Share only data that can be shared outside a Region in a datashare.
    • Authorize the appropriate data consumer to access the datashare.
  • The performance of the queries on data shared across Regions depends on the compute capacity of the consumer clusters.
  • Network throughput between different Regions could yield different query performance. For instance, data sharing from us-east-1 to us-east-2 could yield better performance compared to us-east-1 to ap-northeast-1.

Summary

Amazon Redshift cross-Region data sharing is now available. Previously, data sharing was only allowed within the same Region. With this launch, you can share data across Regions with clusters residing in the same account or different accounts.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Rahul Chaturvedi is an Analytics Specialist Solutions Architect at AWS. Prior to this role, he was a Data Engineer at Amazon Advertising and Prime Video, where he helped build petabyte-scale data lakes for self-serve analytics.

Kishore Arora is a Sr Analytics Specialist Solutions Architect at AWS. In this role, he helps customers modernize their data platform and architect big data solutions with AWS purpose-built data services. Prior to this role, he was a big data architect and delivery lead working with Tier-1 telecom customers.

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/accelerate-snowflake-to-amazon-redshift-migration-using-aws-schema-conversion-tool/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Today, tens of thousands of AWS customers—from Fortune 500 companies, startups, and everything in between—use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics jobs. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

Several new features of Amazon Redshift address a wide range of analytics requirements and improve the performance of the cloud data warehouse.

To take advantage of these capabilities and future innovations, you must migrate from your existing data warehouse to Amazon Redshift.

In this post, we show you how to migrate your data warehouse schema from Snowflake to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT). AWS SCT is a service that makes heterogeneous database migrations predictable by automatically converting the source database schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. AWS SCT can also scan your application source code for embedded SQL statements and convert them as part of a database schema conversion project. During this process, AWS SCT performs cloud-native code optimization by converting legacy data warehouse functions to their equivalent AWS service, thereby helping you modernize your applications at the same time as the database migration.

Solution overview

To implement this solution, you complete the following high-level steps:

  1. Configure your AWS SCT application.
  2. Analyze your source Snowflake schema.
  3. Convert your Snowflake schema to an Amazon Redshift schema.
  4. Deploy the schema to Amazon Redshift.

The following diagram illustrates the solution architecture.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

Set up an AWS SCT project and extract the schema from the source

In this walkthrough, we use the Snowflake sample database TPCDS_SF10TCL as the source of the schema conversion.

To set up the database migration project, complete the following steps:

  1. Launch the AWS SCT application.
  2. On the File menu, choose New project wizard.
  3. Enter the project name and location.
  4. For Source engine, choose Snowflake.
  5. Choose Next.
  6. Provide the database information and credentials for Snowflake and choose Test Connection.
  7. When the connection is successful, choose Next.

For more information, see Using Snowflake as a source for AWS SCT.

  8. Expand SNOWFLAKE_SAMPLE_DATA and Schemas.
  9. Select TPCDS_SF10TCL and choose Next.

AWS SCT analyzes the schema and prepares an assessment report summary, as shown in the following screenshot.

This report summarizes the objects that AWS SCT converts to Amazon Redshift.

  10. Review the report and choose Next.
  11. Provide the database information and credentials for Amazon Redshift and deselect Use AWS Glue.
  12. Choose Test Connection.
  13. When the connection is successful, choose Finish.

For more information about establishing a connection to Amazon Redshift, see Getting the JDBC URL.

Review and apply the schema from Snowflake to Amazon Redshift

To convert the schema from Snowflake objects to Amazon Redshift, complete the following steps:

  1. Expand SNOWFLAKE_SAMPLE_DATA and Schemas.
  2. Choose (right-click) TPCDS_SF10TCL and choose Convert schema.
  3. Choose Collect and continue.

AWS SCT converts the schema and shows the converted schema and objects in the right pane. The converted schema is marked with a red check mark.

AWS SCT takes some actions automatically while converting the schema to Amazon Redshift; objects with such actions are marked with a red warning sign.

  4. Choose the object and then, on the top menu, choose Main view and choose Assessment Report view.
  5. Choose the Action items tab.

You can see a list of all the issues and the actions taken by AWS SCT.

You can evaluate and inspect the individual object DDL by selecting it in the right pane, and you can also edit it as needed. In the following example, we modify the DISTKEY to use inv_item_sk. AWS SCT analyzes the tables and recommends distribution and sort keys based on table statistics. For cases where you’re not sure, you should set them to AUTO. For more information about automatic data distribution and optimization in Amazon Redshift, refer to Automate your Amazon Redshift performance tuning with automatic table optimization.
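
For illustration only, a converted and edited DDL for the TPC-DS inventory table might look similar to the following (this isn’t the exact AWS SCT output):

    CREATE TABLE tpcds_sf10tcl.inventory (
        inv_date_sk INTEGER NOT NULL,
        inv_item_sk INTEGER NOT NULL,
        inv_warehouse_sk INTEGER NOT NULL,
        inv_quantity_on_hand INTEGER
    )
    DISTSTYLE KEY
    DISTKEY (inv_item_sk)
    SORTKEY (inv_date_sk);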

  6. To deploy the object DDLs to Amazon Redshift, select the converted schema in the right pane, right-click, and choose Apply to database.
  7. Optionally, if you want to export a copy of the DDLs generated by AWS SCT and apply them manually, you can select the Amazon Redshift converted schema, right-click, and choose Save as SQL.
  8. Log in to your Amazon Redshift cluster using the Query Editor V2.

For more information about Query Editor V2, refer to Introducing Amazon Redshift Query Editor V2, a Free Web-based Query Authoring Tool for Data Analysts.

  9. To verify that the converted schema objects are deployed in Amazon Redshift, select the desired table, right-click, and choose Show table definition to see the underlying DDL.
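
If you prefer SQL, you can also retrieve the DDL with the SHOW TABLE command; the table name below is just an example:

    SHOW TABLE tpcds_sf10tcl.inventory;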

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Redshift cluster created for this demonstration.
  2. If you were using an existing cluster, delete the new tables that you created as part of this exercise.
  3. Stop any Amazon Elastic Compute Cloud (Amazon EC2) instances that were created to run the AWS SCT application.

Summary

In this post, we showed how to use AWS SCT to automatically convert a Snowflake schema to an Amazon Redshift schema and deploy it to your cluster.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

BP Yau is a Sr Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Shawn Sachdev is a Sr Analytics Specialist Solutions Architect at AWS. He works with customers and provides guidance to help them innovate and build well-architected and high-performance data warehouses and implement analytics at scale on the AWS platform. Before AWS, he has worked in several Analytics and System Engineering roles. Outside of work, he loves watching sports, and is an avid foodie and a craft beer enthusiast.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.