
Migrating your Netezza data warehouse to Amazon Redshift

Post Syndicated from John Hwang original https://aws.amazon.com/blogs/big-data/migrating-your-netezza-data-warehouse-to-amazon-redshift/

With IBM announcing that Netezza has reached end-of-life, you’re faced with the prospect of having to migrate your data and workloads off your analytics appliance. For some, this presents an opportunity to transition to the cloud.

Enter Amazon Redshift.

Amazon Redshift is a cloud-native data warehouse platform built to handle workloads at scale, and it shares key similarities with Netezza that make it an excellent candidate to replace your on-premises appliance. You can migrate your data and applications to Amazon Redshift in less time and with fewer changes than migrating to other analytics platforms. For developers, this means less time spent retraining on the new database. For stakeholders, it means a lower cost of, and time to, migration. For more information, see How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime.

This post discusses important similarities and differences between Netezza and Amazon Redshift, and how they could impact your migration timeline.

Similarities

Three significant similarities between Netezza and Amazon Redshift are their compatibility with Postgres, their massively parallel processing (MPP) architecture, and the parallel between the Amazon Redshift Advanced Query Accelerator (AQUA) and Netezza’s use of FPGAs.

Postgres compatibility

Both Netezza and Amazon Redshift share some compatibility with Postgres, an open-source database. This means that Netezza SQL and Amazon Redshift SQL have a similar syntax. In particular, both support many features of PL/pgSQL, Postgres’s procedural language. Your Netezza stored procedures can translate to Amazon Redshift with little-to-no rewriting of code.
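As a minimal sketch of what this looks like in practice, the following is a stored procedure in the PL/pgSQL dialect that Amazon Redshift supports; the orders table and procedure name are hypothetical, but the same structure translates from Netezza’s NZPLSQL with little change:

-- Hypothetical example: delete old rows and report how many were removed.
CREATE OR REPLACE PROCEDURE purge_old_orders(cutoff_date DATE)
AS $$
DECLARE
  deleted_rows INT;
BEGIN
  DELETE FROM orders WHERE order_date < cutoff_date;
  GET DIAGNOSTICS deleted_rows := ROW_COUNT;
  RAISE INFO 'Deleted % rows', deleted_rows;
END;
$$ LANGUAGE plpgsql;

You can then run it with CALL purge_old_orders('2019-01-01');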

You can also use the AWS Schema Conversion Tool, which can automatically migrate a large percentage of Netezza stored procedures to Amazon Redshift syntax with zero user effort. And because both databases are built for analytics rather than transactional workloads, the two share similar characteristics. For example, neither Netezza nor Amazon Redshift enforces primary keys, which improves performance, though you can still define primary keys on your tables to help the optimizer create better query plans.

MPP architecture

Both Netezza and Amazon Redshift are MPP databases. This means that a query is sent to a leader node, which compiles a set of commands that it distributes to multiple compute nodes. Each compute node performs its task in parallel and returns the results to the leader node, where they are aggregated and returned to the user. This means that you can apply similar architectural strategies from Netezza, such as zone maps and distribution styles, to Amazon Redshift. For more information, see Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization.
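As a minimal sketch, assuming a hypothetical sales table, Redshift DDL that applies these strategies might look like the following; the sort key plays the role that zone maps play in Netezza for range-restricted scans, and the primary key is informational only, as noted earlier:

CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  sale_date   DATE,
  amount      DECIMAL(12,2),
  PRIMARY KEY (sale_id)   -- informational; helps the optimizer, not enforced
)
DISTSTYLE KEY
DISTKEY (customer_id)     -- distribute rows across compute nodes by customer
SORTKEY (sale_date);      -- enables block pruning on date-range predicates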

Amazon Redshift AQUA and Netezza’s FPGA

AWS recently announced AQUA, a new Amazon Redshift feature that is in preview as of this writing. AQUA uses AWS-designed processors to bring certain compute tasks closer to the storage layer. Compression, encryption, filtering, aggregation, and other tasks can now happen in this intermediary layer between storage and compute, which leaves the CPU free to handle more complex tasks. When it’s released, AQUA is expected to accelerate queries by up to 10 times.

Netezza uses FPGAs to perform simple compute tasks before data reaches the CPU. Applications designed to employ these FPGA features on Netezza (for example, queries that rely heavily on certain aggregate functions and data filtering) translate well to Amazon Redshift clusters using AQUA.

Differences

For all the similarities that Amazon Redshift and Netezza share, they also have differences. There are three important differences that could have significant impact on your data and application architecture when migrating from Netezza to Amazon Redshift: column store vs. row store, concurrency scaling, and data lake integration.

Column store vs. row store

Netezza stores each row of data onto disk in data blocks, whereas Amazon Redshift stores each column of data. For many analytics workloads, a column store can have dramatic performance benefits over a row store. Typically, an analytics database has tables with many columns, but only a small handful of those columns are used in any one query. For example, assume that you have a table in a row store database with 100 columns and 100 million rows. If you want to sum the entire value of one column in this table, your query has to suffer the I/O penalty of scanning the entire table to retrieve the data in this single column. In a columnar database with the same table, the query only faces I/O for the single column. In addition to enjoying the improved performance for this type of workload in Amazon Redshift, this gives you options for designing wide tables without having to weigh the increase in I/O for typical analytics workloads the way you do in Netezza.

Concurrency scaling

Although both Netezza and Amazon Redshift offer queue priority and short query acceleration to help reduce concurrency issues, Amazon Redshift also uses the benefits of the cloud to offer additional options to handle concurrency.

One option is to take unrelated workloads from a single Netezza appliance and migrate them to separate Amazon Redshift clusters, each with an instance type and number of nodes sized for the workload it has to support. Netezza’s more traditional license and support pricing model, combined with its limited options for appliance sizes, can make this type of architecture untenable for your organization’s budget.

Additionally, Amazon Redshift offers Concurrency Scaling to scale out (and scale back in) additional compute capacity automatically and on the fly. If you want to add or remove nodes in the cluster, or experiment with other node types, you can do so in just a few minutes using elastic resize (to change the number of nodes) and classic resize (to change instance types).

This type of scaling and resizing isn’t feasible on Netezza because the on-premises appliance has a fixed number of blades. Concurrency scaling in Amazon Redshift can support virtually unlimited concurrent users and concurrent queries, and its ability to automatically add and remove additional capacity means you only pay for the time the concurrency scaling clusters are in use.

Data lake integration

Netezza offers the ability to create an external table from a data file either on the Netezza host or a remote host. However, querying those tables means migrating the entire dataset internally before running the query. With Amazon Redshift Spectrum, rather than using external tables as a convenient way to migrate entire datasets to and from the database, you can run analytical queries against data in your data lake the same way you do an internal table. As the volume of your data in the lake grows, partitioning can help keep the amount of data scanned for each query (and therefore performance) consistent. You can even join the data from your external tables in Amazon Simple Storage Service (Amazon S3) to your internal tables in Amazon Redshift.
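The following is an illustrative sketch of this pattern; the external schema, Glue database, IAM role, S3 location, and table names are assumptions, not part of any specific setup:

-- Register an external schema backed by the AWS Glue Data Catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define a partitioned external table over Parquet files in S3
CREATE EXTERNAL TABLE spectrum_schema.clickstream (
  event_time TIMESTAMP,
  user_id    BIGINT,
  url        VARCHAR(2048)
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://my-data-lake/clickstream/';

-- Query lake data directly and join it to an internal Redshift table
SELECT u.segment, COUNT(*) AS events
FROM spectrum_schema.clickstream c
JOIN users u ON u.user_id = c.user_id
WHERE c.event_date = '2020-04-01'
GROUP BY u.segment;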

Not only does this mean true integration of your data warehouse and data lake, which can lower your warehouse’s more expensive storage requirements, but because the Amazon Redshift Spectrum layer uses its own dedicated infrastructure outside of your cluster, you can offload many compute-intensive tasks from your cluster and push them down to the Redshift Spectrum layer. For more information, see Amazon Redshift Spectrum overview.

Conclusion

Amazon Redshift can accelerate migration from Netezza and potentially lower the time and cost of moving your analytics workloads into the cloud. The similarity between the two systems eases the pain of migration, but Amazon Redshift has important differences that can offer additional benefits around performance and cost. The expertise of AWS and its partner network can help with your migration strategy and offer guidance to avoid potential roadblocks and pitfalls.

 


About the Authors

John Hwang is a senior solutions architect at AWS.

Brian McDermott is a senior sales manager at AWS.

Monitor and optimize queries on the new Amazon Redshift console

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/monitor-and-optimize-queries-on-the-new-amazon-redshift-console/

Tens of thousands of customers use Amazon Redshift to power their workloads to enable modern analytics use cases, such as Business Intelligence, predictive analytics, and real-time streaming analytics. As an administrator or data engineer, it’s important that your users, such as data analysts and BI professionals, get optimal performance. You can use the Amazon Redshift console to monitor and diagnose query performance issues.

The Amazon Redshift console features a monitoring dashboard and updated flows to create, manage, and monitor Amazon Redshift clusters. For more information, see Simplify management of Amazon Redshift clusters with the Redshift console.

This post discusses how you can use the new Amazon Redshift console to monitor your user queries, identify slow queries, and terminate runaway queries. The post also reviews details such as query plans, execution details for your queries, in-place recommendations to optimize slow queries, and how to use the Advisor recommendations to improve your query performance.

User query vs. rewritten query

Any query that users submit to Amazon Redshift is a user query. Either an analyst authors the user query, or a BI tool such as Amazon QuickSight or Tableau generates it. Amazon Redshift typically rewrites queries for optimization purposes. It can rewrite a user query into a single query or break it down into multiple queries. The resulting queries are called rewritten queries.

The following steps are performed by Amazon Redshift for each query:

  1. The leader node receives and parses the query.
  2. The parser produces an initial query tree, which is a logical representation of the original query. Amazon Redshift inputs this query tree into the query optimizer.
  3. The optimizer evaluates and, if necessary, rewrites the query to maximize its efficiency. This process sometimes results in creating multiple queries to replace a single query.

The query rewrite is done automatically and is transparent to the user.

Query monitoring with the original Amazon Redshift console and system tables

Previously, you could monitor the performance of rewritten queries in the original Amazon Redshift console or system tables. However, it was often challenging to find the SQL your users submitted.

The following comparison shows the query monitoring differences between the original Amazon Redshift console, system tables, and the new console.

Original console:

•  User query not supported; monitors only rewritten queries

•  Shows only the top 100 queries

System tables:

•  User query not available; contains only rewritten queries

•  Shows all rewritten queries

New console:

•  Supports user queries

•  Shows all queries available in system tables

•  Allows you to correlate rewritten queries with user queries

The new console simplifies monitoring user queries and provides visibility to all query monitoring information available in the system.

Monitoring and diagnosing queries

The Amazon Redshift console provides information about the performance of queries that run in the cluster. You can use this information to identify and diagnose queries that take a long time to process and create bottlenecks that prevent other queries from executing efficiently.

The following are common questions you may have when monitoring, isolating, and diagnosing query performance issues.

Monitor:

•  How is my cluster doing in terms of query performance and resource utilization?

•  How is my cluster throughput, concurrency, and latency looking?

•  Are queries being queued in my cluster?

•  Which queries or loads are running now?

•  Which queries or loads are taking longer than usual?

•  What are my top queries by duration in the last hour or last 24 hours?

•  Which queries have failed?

•  Is the average query latency for my cluster increasing or decreasing over time?

Isolate and diagnose:

•  A user complained about performance issues at a specific time. How do I identify that SQL and diagnose problems?

•  Which other queries were running when my query was slow? Were all queries slow?

•  Is my database overloaded with queries from other users? Is my queue depth increasing or decreasing?

•  How do I identify queries that a specific user runs?

•  The resources of my cluster are running very high. How do I find out which queries are running?

Optimize:

•  How can I optimize the SQL that our end users author?

•  Is there any optimization required in my schema design?

•  Is there any tuning required for my WLM queues?

•  Can I get any benefit if I enable concurrency scaling?

You can answer these questions by either using the Amazon Redshift console or developing scripts using the system catalog.
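For the scripting route, the following is a minimal sketch against the system tables; the one-hour window and the LIMIT are arbitrary choices for illustration:

-- Queries currently running, longest first
SELECT pid, user_name, starttime, duration, TRIM(query) AS sql_text
FROM stv_recents
WHERE status = 'Running'
ORDER BY duration DESC;

-- Top 10 completed queries by elapsed time over the last hour
SELECT query, TRIM(querytxt) AS sql_text,
       DATEDIFF(seconds, starttime, endtime) AS elapsed_seconds
FROM stl_query
WHERE starttime > DATEADD(hour, -1, GETDATE())
ORDER BY elapsed_seconds DESC
LIMIT 10;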

Monitoring queries

You can monitor your queries on the Amazon Redshift console on the Queries and loads page or on the Query monitoring tab on the Clusters page. Both options are similar, but the Queries and loads page lets you quickly get to the queries for all your clusters. You select a cluster and a time period to view its queries.

You can view the queries as a list on the Query monitoring tab on the Clusters page. The Query monitoring page also shows the queries visually in a Gantt chart.

Each bar represents a user query, and the length of the bar represents runtime for a query. The X-axis shows the selected period, and the location of the bar indicates when a query started and ended. The queries include both standard SQL statements such as SELECT, INSERT, and DELETE, and loads such as COPY commands.

Monitoring top queries

By default, the Query monitoring page shows the top 100 longest queries by runtime or duration for the selected time window. You can change the time window to view the top queries for that period. The top queries include both completed and running queries; completed queries are sorted in descending order of runtime or duration.

Identifying running queries

You can find out your running queries by choosing Running queries from the drop-down menu.

To see the query’s details such as SQL text, runtime details, related rewritten queries, and execution details, choose the query ID.

The Duration column shows the estimated duration and runtime for a query. You can terminate a query by selecting the query and choosing Terminate query. You need to have the redshift:CancelQuerySession action added to your IAM policy to cancel a query.
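If you prefer SQL, you can do the same from any client session; a minimal sketch, assuming the process ID 8219 came from a system-table query like the one shown earlier:

-- Cancel the running query for a given process ID (the pid is hypothetical)
CANCEL 8219;

-- Or terminate the whole session that is running it
SELECT pg_terminate_backend(8219);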

Viewing loads

As a data engineer or Redshift administrator, ensuring that your load jobs complete correctly and meet required performance SLAs is a major priority.

You can view all your load jobs by choosing Loads from the drop-down menu on the Query monitoring page. You can then zoom in on the desired time window.

The preceding Gantt chart shows all loads completed successfully. The query status indicates if the load failed or if an administrator terminated it.

Identifying failed queries

You can identify failed queries by choosing Failed or stopped queries from the drop-down menu on the Query monitoring page and then zooming in on the desired time.

The query page shows 50 queries by default, and you have to paginate to view more results. You can change the page size by choosing the settings gear icon.

In the Preferences section, you can customize what fields you want to see on the Queries and loads list. For example, you can see the PID and not the transaction ID. These changes persist across browser sessions.

Monitoring long-running queries

Amazon Redshift categorizes a query as long-running if the query or load runs for more than 10 minutes. You can filter long-running queries by choosing Long queries from the drop-down menu. Similarly, you can filter medium and short queries.

Isolating problematic queries

The following section looks at some use cases in which you use the console to diagnose query performance issues.

Query performance issues

For this use case, a user complains that their dashboard queries are slow, and you want to identify the associated queries. These queries might not be among the top queries. To isolate them, you can either choose Completed queries or All queries from the drop-down menu and specify the time window by choosing Custom.

You can also drill down to view the queries in a specific period, or filter for queries from one particular user by searching their user name.

You can also filter your queries by searching SQL query text.

As with the earlier charts, the size of a bar represents a relative duration of the runtime for a query. In this period, the highlighted query is the slowest. If you mouse over a bar in the Gantt chart, it provides helpful information about the query such as query ID, part of the query text, and runtime. To view details about a specific query, choose Query ID.

Identifying systemic query performance problems

For this use case, many of your users are complaining about longer-than-normal query runtimes. You want to diagnose what is happening in your cluster. You can customize the time window and switch to the graph view, which helps you correlate longer runtimes with what is happening in the cluster. As the following Gantt chart and CPU utilization graph show, many queries were running at that time, and CPU utilization almost reached 100%.

The concurrency scaling feature of Amazon Redshift could have helped maintain consistent performance throughout the workload spike.

High CPU utilization

You can correlate query performance with cluster performance and highlight a given metric such as CPU utilization, which shows you which queries were running at that time.

Monitoring workload performance

You can get a detailed view of your workload’s performance by looking at the Workload execution breakdown chart. You can find out how long it took to plan, wait, and execute your workload. You can also view time spent in operations such as INSERT, UPDATE, DELETE, COPY, UNLOAD, or CTAS. The time window you choose in the query history persists as you navigate between pages.

In the preceding screenshot, you can see several waits in the workload breakdown graph. You can take advantage of concurrency scaling to process a burst of queries.

Correlating query throughput with query duration

You can view the trend of the performance of your queries, such as duration or execution time for your long, medium, and short queries, and correlate with the query throughput.

This information can offer insight into how well the cluster serves each query category with its current configuration.

Monitoring workload for your WLM queues

You can view the average throughput, average duration, and average queue time by WLM queue, alongside the cluster metrics discussed earlier. Insight from this graph might help you tune your queries; for example, by assigning the right priority to a WLM queue or enabling concurrency scaling for it.

Diagnosing and optimizing query performance

After you isolate a slow query, you can drill down to the execution details of the query by choosing Query ID. The following screenshot shows multiple query IDs for a query that has been rewritten to multiple queries.

The Query details page shows you the parent query and all rewritten queries.

You can also find out whether any of the rewritten queries ran on a concurrency scaling cluster.

You can view the query plans, execution statistics such as the cost of each step of the plan, and data scanned for the query.

You can also view the cluster metrics at the time the query ran on the cluster. The following screenshot shows the problematic steps for your query plan. Choosing a problematic step reveals in-place recommendations to improve this query.

Implementing Advisor recommendations

Amazon Redshift Advisor provides recommendations that could improve workload performance. Amazon Redshift uses machine learning to look at your workload and provide customized recommendations. Amazon Redshift monitors and offers guidance for improved performance on the following crucial areas:

  • Short query acceleration (SQA) – Checks for query patterns and reports the number of recent queries in which you can reduce latency and the daily queue time for SQA-eligible queries by enabling SQA, thus improving your query performance
  • Sort key for tables – Analyzes the workload for your data warehouse over several days to identify a beneficial sort key for your tables and makes sort key recommendations
  • Distribution key for tables – Analyzes your workload to identify the most appropriate distribution key for tables that can significantly benefit from a key distribution style

The following screenshot shows a recommendation to alter the distribution key for the table.
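Applying such a recommendation is a single DDL statement; a minimal sketch, assuming a hypothetical orders table and recommended columns:

-- Change the distribution style and key in place
ALTER TABLE public.orders ALTER DISTSTYLE KEY DISTKEY customer_id;

-- Sort key recommendations can be applied the same way
ALTER TABLE public.orders ALTER SORTKEY (order_date);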

Enabling concurrency scaling

To deliver optimal performance for your users, you can monitor user workloads and take action if you diagnose a problem. You can drill down to the query history for that specific time, and see several queries running at that time.

If you aren’t using concurrency scaling, your queries might be getting queued. You can also see that on the Workload concurrency tab. In the following screenshot, you can see that many queries are queued during that time because you didn’t enable concurrency scaling.

You can monitor all submitted queries and enable concurrency scaling when queued queries are increasing.


Conclusion

This post showed you the new features in the Amazon Redshift console that allow you to monitor user queries and help you diagnose performance issues in your user workload. The console also allows you to view your top queries by duration, filter failed and long-running queries, and drill down to view related rewritten queries and their execution details, which you can use to tune your queries. Start using the query monitoring features of the new Amazon Redshift console to monitor your user workload today!

 


About the Authors

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry, has published numerous articles on analytics, enterprise Java, and databases, and has presented at multiple conferences. He is the lead author of EJB 3 in Action (Manning Publications, 2007 and 2014) and Middleware Management (Packt).

Apurva Gupta is a user experience designer at AWS. She specializes in databases, analytics, and AI solutions. Previously, she worked with companies both big and small, leading end-to-end design and helping teams set up design-first product development processes, design systems, and accessibility programs.

Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective databases to empower customers with data-driven decision making.

Sudhakar Reddy is a full stack software development engineer with Amazon Redshift. He specializes in building cloud services and applications for big data, databases, and analytics.

Zayd Simjee is a software development engineer with Amazon Redshift.

Federate Amazon Redshift access with Microsoft Azure AD single sign-on

Post Syndicated from Harshida Patel original https://aws.amazon.com/blogs/big-data/federate-amazon-redshift-access-with-microsoft-azure-ad-single-sign-on/

Recently, we helped a large enterprise customer who was building their data warehouse on Amazon Redshift, using Azure AD as a corporate directory. Their requirement was to enable data warehouse users to use their corporate credentials to query data in Amazon Redshift. Doing so not only provided a better user experience, because users can use their corporate credentials and don’t have to remember extra passwords, but also made maintenance easier, because all corporate users are managed in a single place. The solution was to set up Azure AD federated access to Amazon Redshift.

You can use federation to centrally manage access to Amazon Redshift. This simplifies administration by allowing you to control user access at a central location and reducing the overhead of creating and maintaining database users. In addition to Active Directory Federation Service (ADFS), PingFederate, and Okta, Amazon Redshift also supports Microsoft Azure Active Directory (Azure AD) federation.

For more information about using ADFS with Amazon Redshift, see Federate Database User Authentication Easily with IAM and Amazon Redshift. For more information about integrating Azure AD, see Setting Up JDBC or ODBC Single Sign-on Authentication with Microsoft Azure AD.

This post illustrates how to set up federation using Azure AD and IAM. Azure AD manages the users and provides federated access to Amazon Redshift using IAM. You do not need to create separate database users in Amazon Redshift with this setup.

Solution overview

The solution configures federation between Azure AD and IAM to allow seamless access to Amazon Redshift from a SQL client. It contains the following components:

  • Azure AD serves as the identity provider for user authentication.
  • AWS serves as the service provider for authorization.
  • Security Assertion Markup Language (SAML 2.0) simplifies the interaction between identity provider and service provider.
  • AWS Security Token Service (AWS STS) issues the temporary credentials used to connect to Amazon Redshift.

The following diagram illustrates the architecture of the solution.

The solution contains the following steps:

  1. You configure a JDBC or ODBC driver in your SQL client to use Azure AD federation.
  2. The client makes an authentication request to the enterprise application in Azure using Azure AD credentials.
  3. The Azure enterprise application queries Azure AD and generates a SAML response, which includes the IAM roles assigned to the user.
  4. The client receives the SAML response.
  5. The SAML assertion goes to the AWS federation endpoint, which invokes the AssumeRoleWithSAML API of AWS STS and generates temporary IAM credentials.
  6. You use the temporary credentials to connect to the Amazon Redshift cluster.

Prerequisites

This blog post assumes that you have the following:

On Azure AD side:

On Amazon Redshift side:

Walkthrough overview

This walkthrough consists of the following three sections:

  1. Setting up the Azure enterprise non-gallery application using single sign-on (SSO) with SAML.
  2. Setting up the IAM provider and roles, which includes the following steps:
    • Creating the SAML identity provider.
    • Creating an IAM role for access to the Amazon Redshift cluster.
    • Creating an IAM provider and an IAM role to use SAML-based federation.
    • Testing the SSO setup.
  3. Configuring the JDBC client to use Azure AD user credentials to log in to the Amazon Redshift cluster. This post uses a JDBC client. However, you can use the same setup to support ODBC clients.

Setting up the Azure enterprise application with SAML single sign-on

To set up the Azure enterprise application, complete the following steps:

  1. Log in to Azure Portal.
  2. Choose Enterprise applications.
  3. Choose New application.
  4. Choose Non-gallery application.
  5. For Name, enter Redshift.
  6. Choose Add.
  7. Under Manage, choose Single sign-on.
  8. In the Basic SAML Configuration section, for Entity ID and Reply URL, enter https://signin.aws.amazon.com/saml.
  9. In the User Attributes & Claims section, choose Edit.
  10. For Additional claims, Add new claim with the following values (if they do not exist):
    • For http://schemas.microsoft.com/ws/2008/06/identity/claims/groups, choose user.groups [SecurityGroup].
    • For https://redshift.amazon.com/SAML/Attributes/RoleSessionName, choose user.userprincipalname.
    • For https://redshift.amazon.com/SAML/Attributes/DbGroups, choose user.assignedroles.
    • For https://redshift.amazon.com/SAML/Attributes/DbUser, choose user.userprincipalname.
    • For https://redshift.amazon.com/SAML/Attributes/AutoCreate, enter true. Note that the values of DbUser and DbGroups must be lowercase, begin with a letter, contain only alphanumeric characters, underscore (‘_’), plus sign (‘+’), dot (‘.’), at (‘@’), or hyphen (‘-‘), and be less than 128 characters.
  11. In addition to the claims added in the previous step, add a new claim:
    • For Manage claim, for Name, enter Role.
    • For Namespace, enter https://aws.amazon.com/SAML/Attributes.
    • For Source, select Attribute.
    • For Source attribute, enter arn:aws:iam::<yourAWSAccount>:role/AzureSSO,arn:aws:iam::<yourAWSAccount>:saml-provider/AzureADProvider. Ensure the Role claim is configured properly, and make sure you change this to your own AWS account. You set up the role AzureSSO and the provider AzureADProvider in AWS Identity and Access Management (IAM) in the next section.
  12. In the Single sign-on section, under SAML Signing Certificate, for Federation Metadata XML, choose Download and save the file locally. Often, the name of the file is the same as the application name; for example, Redshift.xml. You use this file to configure the IAM identity provider in the next section.
  13. On the App registrations page, choose the application Redshift. If you don’t see your application in the list, choose All apps from the drop-down menu and search for it. Register it if it is not registered.
  14. Record the Application (client) ID and Directory (tenant) ID. You use these values in the JDBC connection when you connect to Amazon Redshift.
  15. Under API permissions, choose Add a permission.
  16. Choose Microsoft Graph.
  17. Choose Delegated permissions.
  18. For User, choose User.Read.
  19. Choose Add permission. This allows the Amazon Redshift enterprise application to grant admin consent to read the user profile and perform login using SSO.
  20. Under Security, choose Permissions.
  21. Add Users/groups to the application or grant universal admin consent for the entire organization.
  22. Choose Certificates & secrets.
  23. Generate the client secret key.
  24. Record the client secret key to use when you configure the JDBC connection.

You have now set up the enterprise application in Azure. You are ready to set up the IAM provider and role to use this Azure federation.

Setting Up IAM Identity Provider and Roles in AWS

An IAM identity provider is an entity in IAM that describes an external identity provider (IdP) service that supports the SAML 2.0 standard. You use an IAM identity provider when you want to establish trust between AWS and a SAML-compatible IdP such as Azure AD. It is important to name this identity provider AzureADProvider to match the SAML claims you created earlier.

For this walkthrough, you create the following in IAM:

  • IAM SAML identity provider AzureADProvider
  • IAM role AzureSSO
  • IAM policy redshiftAccessPolicy, which you attach to the role AzureSSO to provide permission to access the Amazon Redshift cluster

Creating the IAM SAML identity provider

To create your IAM SAML identity provider, complete the following steps:

  1. Sign in to the AWS Management Console as an admin.
  2. Under Security, Identity, & Compliance, choose IAM.
  3. Choose Identity providers.
  4. Choose Create Provider.
  5. On the Configure Provider page, for Provider Type, choose SAML.
    • For Provider Name, enter AzureADProvider.
    • For Metadata Document, upload the federation metadata XML file that you downloaded earlier (for example, Redshift.xml).
  6. Choose Next Step.
  7. Choose Create.

Creating the IAM role for access to the Amazon Redshift cluster

You now create a role for SAML-based federation. The name of this role needs to match what you named the SAML claim you created earlier: AzureSSO. Complete the following steps:

  1. On the IAM console, choose Roles.
  2. Choose Create role.
  3. On the Create role page, you can choose from several options. To set up Azure AD authentication, choose SAML 2.0 federation.
  4. Under Choose a SAML 2.0 provider:
    • For SAML provider, choose AzureADProvider from the drop-down menu.
    • Select Allow programmatic and AWS Management Console access.
    • For Attribute, keep at the default SAML:aud.
    • For Value, keep at the default https://signin.aws.amazon.com/saml.
    • For Condition, you can optionally add additional conditions of who can use SSO. For this walkthrough, do not add any conditions.
  5. Choose Next: Permissions. You attach the policy to provide permission to Amazon Redshift after you create the policy.
  6. Choose Next: Tags. For Add tags (optional), you can optionally add key-value tags to help better organize, track, or control access for this role. For this post, do not add any tags.
  7. Choose Next: Review.
  8. On the Review page, for Role name, enter AzureSSO.
    • For Role description, you can optionally enter a brief description of what your role does.
    • For Trusted entities, verify the ARN of the provider you specified earlier is correct.
  9. Choose Create role.

Creating the IAM Amazon Redshift access policy

In the following steps, you allow the role AzureSSO to generate an AWS STS token to connect to Amazon Redshift by adding an IAM permissions policy redshiftAccessPolicy to the role.

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. On the Create policy page, choose JSON.
  4. For the policy, enter the following JSON:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:DescribeClusters"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:cluster:<YOUR-REDSHIFT-CLUSTER-NAME>"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:GetClusterCredentials"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbname:<YOUR-REDSHIFT-CLUSTER-NAME>/dev",
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbuser:<YOUR-REDSHIFT-CLUSTER-NAME>/${redshift:DbUser}"
                ],
                "Condition": {
                    "StringEquals": {
                        "aws:userid": "<YOUR-ROLE-ID>:${redshift:DbUser}"
                    }
                }
            },
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:CreateClusterUser"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbuser:<YOUR-REDSHIFT-CLUSTER-NAME>/${redshift:DbUser}"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:JoinGroup"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbgroup:<YOUR-REDSHIFT-CLUSTER-NAME>/<YOUR-DB-GROUP>"
                ]
            }
        ]
    }

    In the preceding code, replace the placeholders with the appropriate values:

•  <YOUR-REGION>: The Region hosting your solution.

•  <AWS-ACCOUNT-NUMBER>: Your AWS account number.

•  <YOUR-REDSHIFT-CLUSTER-NAME>: Your Amazon Redshift cluster name; for example, cluster-1.

•  <YOUR-ROLE-ID>: The unique identifier of the role AzureSSO, which you created in the previous section. You can run the command aws iam get-role --role-name AzureSSO and use the RoleId value from the output. For more information, see IAM Identifiers.

•  <YOUR-DB-GROUP>: The database group that the user can join. For this post, use developer_grp.

  5. Choose Review policy.
  6. In the Review policy section, for Name, enter the name of your policy; for example, redshiftAccessPolicy.
  7. For Description, you can optionally enter a brief description of what the policy does.
  8. For Summary, review your policy components and make sure to resolve any warnings or errors.
  9. Choose Create policy.
  10. Add this policy to the IAM role AzureSSO. For instructions, see Adding IAM Identity Permissions (Console).
    • Choose Managed IAM Policy.

Testing the SSO setup

You can now test the SSO setup. Complete the following steps.

  1. On the Azure Portal, for your Amazon Redshift application, choose Single sign-on.
  2. Choose Test.
  3. Choose Sign in as current user.

If the setup is correct, this brings you to the AWS Management Console (it may open in a new tab in some browsers). The following screenshot shows that the user test is signed in with the assumed role AzureSSO.

After you verify the SSO setup, you are ready to connect the SQL client to Amazon Redshift using Azure AD federation.

Setting up your JDBC SQL client to use Azure AD federation

You can use any client tool that supports an Amazon Redshift JDBC or ODBC driver to connect to Amazon Redshift using Azure SSO. For this post, use the SQL Workbench/J client to connect to Amazon Redshift using JDBC. For instructions on installing SQL Workbench/J, see Connect to Your Cluster by Using SQL Workbench/J.

To connect to the Amazon Redshift cluster using your Azure AD credentials, complete the following steps. If you are using another client, you can pass all of these parameters in the JDBC URL as URL parameters.

To use Azure AD with JDBC, the Amazon Redshift JDBC driver must be version 1.2.37.1061 or later. To use Azure AD with ODBC, the Amazon Redshift ODBC driver must be version 1.4.10.1000 or later. For this walkthrough, use the driver with AWS SDK.

  1. Start SQLWorkbenchJ.
  2. On the Select Connection Profile page, choose Add Profile Group.
  3. Choose New Connection Profile. This opens a new profile in the profile group you created. Give the profile an appropriate name; we named it Azure.
  4. For Driver, choose your Amazon Redshift driver.
  5. Construct the JDBC URL and enter it in the URL field, using the following pattern: jdbc:redshift:iam://<your-cluster-identifier-here>:<your-cluster-region-here>/<your-DB-name-here>. Replace <your-cluster-identifier-here>, <your-cluster-region-here>, and <your-DB-name-here> with the values matching your Amazon Redshift cluster. This post uses cluster-identifier, us-west-1, and dev.
  6. Choose Extended Properties.

Add the following key-value pairs in the Extended Properties dialog of SQL Workbench/J.

•  plugin_name: To use Azure AD federation, use com.amazon.redshift.plugin.AzureCredentialsProvider.

•  idp_tenant: The tenant name of your company configured on your IdP (Azure). This value can be either the tenant name or the tenant unique ID with hyphens. For this walkthrough, use the tenant ID that you recorded earlier.

•  client_secret: The client secret value that you created earlier.

•  client_id: The client ID (with hyphens) of the Redshift application. Use the client ID that you recorded earlier.
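As noted earlier, clients that don’t expose an extended-properties dialog can pass the same settings in the URL itself. The following is a sketch using the example cluster values from this post and placeholder Azure values:

jdbc:redshift:iam://cluster-identifier:us-west-1/dev?plugin_name=com.amazon.redshift.plugin.AzureCredentialsProvider&idp_tenant=<your-tenant-id>&client_id=<your-client-id>&client_secret=<your-client-secret>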

If you completed these steps correctly, you can connect to the Amazon Redshift cluster using your Azure AD credentials.

Summary

In this post, we provided a step-by-step guide to configure and use Azure AD as your IdP to enable federation to an Amazon Redshift cluster.

 


About the Authors

Harshida Patel is a Data Warehouse Specialist Solutions Architect with AWS.

Pir Abdul Rasool Qureshi is a Solutions Architect at AWS. He enjoys solving complex customer problems in databases and analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and play cricket whenever possible.

New – Amazon Keyspaces (for Apache Cassandra) is Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-keyspaces-for-apache-cassandra-is-now-generally-available/

We introduced Amazon Managed Apache Cassandra Service (MCS) in preview at re:Invent last year. In the few months since, the service has introduced many new features, and it is generally available today with a new name: Amazon Keyspaces (for Apache Cassandra).

Amazon Keyspaces is built on Apache Cassandra, and you can use it as a fully managed, serverless database. Your applications can read and write data from Amazon Keyspaces using your existing Cassandra Query Language (CQL) code, with little or no changes. For each table, you can select the best configuration depending on your use case:

  • With on-demand, you pay based on the actual reads and writes you perform. This is the best option for unpredictable workloads.
  • With provisioned capacity, you can reduce your costs for predictable workloads by configuring capacity settings up front, as in the template fragment after this list. You can also further optimize costs by enabling auto scaling, which updates your provisioned capacity settings automatically as your traffic changes throughout the day.
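As a hedged sketch of what the provisioned option looks like in a CloudFormation template (the capacity numbers are illustrative, and the rest of the resource mirrors the books table defined later in this post):

  BooksTable:
    Type: AWS::Cassandra::Table
    Properties:
      KeyspaceName: bookstore
      PartitionKeyColumns:
        - ColumnName: isbn
          ColumnType: text
      BillingMode:
        Mode: PROVISIONED          # the default is ON_DEMAND
        ProvisionedThroughput:
          ReadCapacityUnits: 5
          WriteCapacityUnits: 5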

Using Amazon Keyspaces
One of the first “serious” applications I built as a kid was an archive for my books. I’d like to rebuild it now as a serverless API, using:

With Amazon Keyspaces, your data is stored in keyspaces and tables. A keyspace gives you a way to group related tables together. In the blog post for the preview, I used the console to configure my data model. Now, I can also use AWS CloudFormation to manage my keyspaces and tables as code. For example I can create a bookstore keyspace and a books table with this CloudFormation template:

AWSTemplateFormatVersion: '2010-09-09'
Description: Amazon Keyspaces for Apache Cassandra example

Resources:

  BookstoreKeyspace:
    Type: AWS::Cassandra::Keyspace
    Properties: 
      KeyspaceName: bookstore

  BooksTable:
    Type: AWS::Cassandra::Table
    Properties: 
      TableName: books
      KeyspaceName: !Ref BookstoreKeyspace
      PartitionKeyColumns: 
        - ColumnName: isbn
          ColumnType: text
      RegularColumns: 
        - ColumnName: title
          ColumnType: text
        - ColumnName: author
          ColumnType: text
        - ColumnName: pages
          ColumnType: int
        - ColumnName: year_of_publication
          ColumnType: int

Outputs:
  BookstoreKeyspaceName:
    Description: "Keyspace name"
    Value: !Ref BookstoreKeyspace # Or !Select [0, !Split ["|", !Ref BooksTable]]
  BooksTableName:
    Description: "Table name"
    Value: !Select [1, !Split ["|", !Ref BooksTable]]

If you don’t specify a name for a keyspace or a table in the template, CloudFormation generates a unique name for you. Note that generated names may contain uppercase characters, which fall outside the usual Cassandra conventions, so you need to put those names between double quotes when using them in Cassandra Query Language (CQL).
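For example, a query against a generated mixed-case name has to quote it; the keyspace name below is made up for illustration:

-- Without the double quotes, CQL lowercases the identifier and fails to find it
SELECT * FROM "bookstoreKeyspace1A2B3C"."books";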

When the creation of the stack is complete, I see the new bookstore keyspace in the console:

Selecting the books table, I have an overview of its configuration, including the partition key, the clustering columns, and all the columns, and the option to change the capacity mode for the table from on-demand to provisioned:

For authentication and authorization, Amazon Keyspaces supports AWS Identity and Access Management (IAM) identity-based policies that you can use with IAM users, groups, and roles. Here’s a list of actions, resources, and conditions that you can use in IAM policies with Amazon Keyspaces. You can now also manage access to resources based on tags.

You can use IAM roles using AWS Signature Version 4 Process (SigV4) with this open source authentication plugin for the DataStax Java driver. In this way you can run your applications inside an Amazon Elastic Compute Cloud (EC2) instance, a container managed by Amazon ECS or Amazon Elastic Kubernetes Service, or a Lambda function, and leverage IAM roles for authentication and authorization to Amazon Keyspaces, without the need to manage credentials. Here’s a sample application that you can test on an EC2 instance with an associated IAM role giving access to Amazon Keyspaces.

Going back to my books API, I create all the resources I need, including a keyspace and a table, with the following AWS Serverless Application Model (SAM) template.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Sample Books API using Cassandra as database

Globals:
  Function:
    Timeout: 30

Resources:

  BookstoreKeyspace:
    Type: AWS::Cassandra::Keyspace

  BooksTable:
    Type: AWS::Cassandra::Table
    Properties: 
      KeyspaceName: !Ref BookstoreKeyspace
      PartitionKeyColumns: 
        - ColumnName: isbn
          ColumnType: text
      RegularColumns: 
        - ColumnName: title
          ColumnType: text
        - ColumnName: author
          ColumnType: text
        - ColumnName: pages
          ColumnType: int
        - ColumnName: year_of_publication
          ColumnType: int

  BooksFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: BooksFunction
      Handler: books.App::handleRequest
      Runtime: java11
      MemorySize: 2048
      Policies:
        - Statement:
          - Effect: Allow
            Action:
            - cassandra:Select
            Resource:
              - !Sub "arn:aws:cassandra:${AWS::Region}:${AWS::AccountId}:/keyspace/system*"
              - !Join
                - ""
                - - !Sub "arn:aws:cassandra:${AWS::Region}:${AWS::AccountId}:/keyspace/${BookstoreKeyspace}/table/"
                  - !Select [1, !Split ["|", !Ref BooksTable]] # !Ref BooksTable returns "Keyspace|Table"
          - Effect: Allow
            Action:
            - cassandra:Modify
            Resource:
              - !Join
                - ""
                - - !Sub "arn:aws:cassandra:${AWS::Region}:${AWS::AccountId}:/keyspace/${BookstoreKeyspace}/table/"
                  - !Select [1, !Split ["|", !Ref BooksTable]] # !Ref BooksTable returns "Keyspace|Table"
      Environment:
        Variables:
          KEYSPACE_TABLE: !Ref BooksTable # !Ref BooksTable returns "Keyspace|Table"
      Events:
        GetAllBooks:
          Type: HttpApi
          Properties:
            Method: GET
            Path: /books
        GetBookByIsbn:
          Type: HttpApi
          Properties:
            Method: GET
            Path: /books/{isbn}
        PostNewBook:
          Type: HttpApi
          Properties:
            Method: POST
            Path: /books

Outputs:
  BookstoreKeyspaceName:
    Description: "Keyspace name"
    Value: !Ref BookstoreKeyspace # Or !Select [0, !Split ["|", !Ref BooksTable]]
  BooksTableName:
    Description: "Table name"
    Value: !Select [1, !Split ["|", !Ref BooksTable]]
  BooksApi:
    Description: "API Gateway HTTP API endpoint URL"
    Value: !Sub "https://${ServerlessHttpApi}.execute-api.${AWS::Region}.amazonaws.com/"
  BooksFunction:
    Description: "Books Lambda Function ARN"
    Value: !GetAtt BooksFunction.Arn
  BooksFunctionIamRole:
    Description: "Implicit IAM Role created for Books function"
    Value: !GetAtt BooksFunctionRole.Arn

In this template I don’t specify the keyspace and table names, and CloudFormation generates unique names automatically. The function IAM policy gives access to read (cassandra:Select) and write (cassandra:Modify) only on the books table. I am using the CloudFormation Fn::Select and Fn::Split intrinsic functions to get the table name. The driver also needs read access to the system* keyspaces.

To use the authentication plugin for the DataStax Java driver that supports IAM roles, I write the Lambda function in Java, using the APIGatewayV2ProxyRequestEvent and APIGatewayV2ProxyResponseEvent classes to communicate with the HTTP API created by the API Gateway.

package books;

import java.net.InetSocketAddress;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.List;
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;
import javax.net.ssl.SSLContext;

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.*;

import software.aws.mcs.auth.SigV4AuthProvider;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.LambdaLogger;
import com.amazonaws.services.lambda.runtime.events.APIGatewayV2ProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayV2ProxyResponseEvent;

public class App implements RequestHandler<APIGatewayV2ProxyRequestEvent, APIGatewayV2ProxyResponseEvent> {
    
    JSONParser parser = new JSONParser();
    String[] keyspace_table = System.getenv("KEYSPACE_TABLE").split("\\|");
    String keyspace = keyspace_table[0];
    String table = keyspace_table[1];
    CqlSession session = getSession();
    PreparedStatement selectBookByIsbn = session.prepare("select * from \"" + table + "\" where isbn = ?");
    PreparedStatement selectAllBooks = session.prepare("select * from \"" + table + "\"");
    PreparedStatement insertBook = session.prepare("insert into \"" + table + "\" "
    + "(isbn, title, author, pages, year_of_publication)" + "values (?, ?, ?, ?, ?)");
    
    public APIGatewayV2ProxyResponseEvent handleRequest(APIGatewayV2ProxyRequestEvent request, Context context) {
        
        LambdaLogger logger = context.getLogger();
        
        String responseBody;
        int statusCode = 200;
        
        String routeKey = request.getRequestContext().getRouteKey();
        logger.log("routeKey = '" + routeKey + "'");
        
        if (routeKey.equals("GET /books")) {
            ResultSet rs = execute(selectAllBooks.bind());
            StringJoiner jsonString = new StringJoiner(", ", "[ ", " ]");
            for (Row row : rs) {
                String json = row2json(row);
                jsonString.add(json);
            }
            responseBody = jsonString.toString();
        } else if (routeKey.equals("GET /books/{isbn}")) {
            String isbn = request.getPathParameters().get("isbn");
            logger.log("isbn: '" + isbn + "'");
            ResultSet rs = execute(selectBookByIsbn.bind(isbn));
            if (rs.getAvailableWithoutFetching() == 1) {
                responseBody = row2json(rs.one());
            } else {
                statusCode = 404;
                responseBody = "{\"message\": \"not found\"}";
            }
        } else if (routeKey.equals("POST /books")) {
            String body = request.getBody();
            logger.log("Body: '" + body + "'");
            JSONObject requestJsonObject = null;
            if (body != null) {
                try {
                    requestJsonObject = (JSONObject) parser.parse(body);
                } catch (ParseException e) {
                    e.printStackTrace();
                }
                if (requestJsonObject != null) {
                    int i = 0;
                    BoundStatement boundStatement = insertBook.bind()
                    .setString(i++, (String) requestJsonObject.get("isbn"))
                    .setString(i++, (String) requestJsonObject.get("title"))
                    .setString(i++, (String) requestJsonObject.get("author"))
                    .setInt(i++, ((Long) requestJsonObject.get("pages")).intValue())
                    .setInt(i++, ((Long) requestJsonObject.get("year_of_publication")).intValue())
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
                    ResultSet rs = execute(boundStatement);
                    statusCode = 201;
                    responseBody = body;
                } else {
                    statusCode = 400;
                    responseBody = "{\"message\": \"JSON parse error\"}";
                }
            } else {
                statusCode = 400;
                responseBody = "{\"message\": \"body missing\"}";
            }
        } else {
            statusCode = 405;
            responseBody = "{\"message\": \"not implemented\"}";
        }
        
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-Type", "application/json");
        
        APIGatewayV2ProxyResponseEvent response = new APIGatewayV2ProxyResponseEvent();
        response.setStatusCode(statusCode);
        response.setHeaders(headers);
        response.setBody(responseBody);
        
        return response;
    }
    
    private String getStringColumn(Row row, String columnName) {
        return "\"" + columnName + "\": \"" + row.getString(columnName) + "\"";
    }
    
    private String getIntColumn(Row row, String columnName) {
        return "\"" + columnName + "\": " + row.getInt(columnName);
    }
    
    private String row2json(Row row) {
        StringJoiner jsonString = new StringJoiner(", ", "{ ", " }");
        jsonString.add(getStringColumn(row, "isbn"));
        jsonString.add(getStringColumn(row, "title"));
        jsonString.add(getStringColumn(row, "author"));
        jsonString.add(getIntColumn(row, "pages"));
        jsonString.add(getIntColumn(row, "year_of_publication"));
        return jsonString.toString();
    }
    
    private ResultSet execute(BoundStatement bs) {
        final int MAX_RETRIES = 3;
        ResultSet rs = null;
        int retries = 0;

        do {
            try {
                rs = session.execute(bs);
            } catch (Exception e) {
                e.printStackTrace();
                session = getSession(); // New session
            }
        } while (rs == null && retries++ < MAX_RETRIES);
        return rs;
    }
    
    private CqlSession getSession() {
        
        System.setProperty("javax.net.ssl.trustStore", "./cassandra_truststore.jks");
        System.setProperty("javax.net.ssl.trustStorePassword", "amazon");
        
        String region = System.getenv("AWS_REGION");
        String endpoint = "cassandra." + region + ".amazonaws.com";
        
        System.out.println("region: " + region);
        System.out.println("endpoint: " + endpoint);
        System.out.println("keyspace: " + keyspace);
        System.out.println("table: " + table);
        
        SigV4AuthProvider provider = new SigV4AuthProvider(region);
        List<InetSocketAddress> contactPoints = Collections.singletonList(new InetSocketAddress(endpoint, 9142));
        
        CqlSession session;
                
        try {
            session = CqlSession.builder().addContactPoints(contactPoints).withSslContext(SSLContext.getDefault())
            .withLocalDatacenter(region).withAuthProvider(provider).withKeyspace("\"" + keyspace + "\"")
            .build();
        } catch (NoSuchAlgorithmException e) {
            session = null;
            e.printStackTrace();
        }
        
        return session;
    }
}

To connect to Amazon Keyspaces with TLS/SSL using the Java driver, I need to include a trustStore in the JVM arguments. When using the Cassandra Java Client Driver in a Lambda function, I can’t pass parameters to the JVM, so I pass the same options as system properties, and specify the SSL context when creating the CQL session with the  withSslContext(SSLContext.getDefault()) parameter. Note that I also have to configure the pom.xml file, used by Apache Maven, to include the trustStore file as a dependency.

System.setProperty("javax.net.ssl.trustStore", "./cassandra_truststore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "amazon");

Now, I can use a tool like curl or Postman to test my books API. First, I take the endpoint of the API from the output of the CloudFormation stack. At the beginning there are no books stored in the books table, and if I do an HTTP GET on the resource, I get an empty JSON list. For readability, I am removing all HTTP headers from the output.

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books

HTTP/1.1 200 OK
[]

In the code, I am using a PreparedStatement to run a CQL statement that selects all rows from the books table. The names of the keyspace and of the table are passed to the Lambda function in environment variables, as described in the SAM template above.

Let’s use the API to add a book, by doing an HTTP POST on the resource.

$ curl -i -d '{ "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 }' -H "Content-Type: application/json" -X POST https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books

HTTP/1.1 201 Created
{ "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 }

I can check that the data has been inserted in the table using the CQL Editor in the console, where I select all the rows in the table.

I repeat the previous HTTP GET to get the list of the books, and I see the one I just created.

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books

HTTP/1.1 200 OK
[ { "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 } ]

I can get a single book by ISBN, because the isbn column is the primary key of the table and I can use it in the where condition of a select statement.

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books/978-0201896831

HTTP/1.1 200 OK
{ "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 }

If there is no book with that ISBN, I return a “not found” message:

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books/1234567890

HTTP/1.1 404 Not Found
{"message": "not found"}

It works! We just built a fully serverless API using CQL to read and write data using temporary security credentials, managing the whole infrastructure, including the database table, as code.

Available Now
Amazon Keyspaces (for Apache Cassandra) is ready for your applications; please see this table for regional availability. You can find more information on how to use Keyspaces in the documentation. In this post, I built a new application, but you can get lots of benefits by migrating your current tables to a fully managed environment. For migrating data, you can now use cqlsh as described in this post.

Let me know what you are going to use it for!

Danilo

Achieve finer-grained data security with column-level access control in Amazon Redshift

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/achieve-finer-grained-data-security-with-column-level-access-control-in-amazon-redshift/

Amazon Redshift is the most popular cloud data warehouse because it provides fast insights at a low cost. Customers can confidently run mission-critical workloads, even in highly regulated industries, because Amazon Redshift comes with out-of-the-box security and compliance. The security features, combined with the ability to easily analyze data in place and in open formats, compute and storage elasticity, and ease of use, are what make tens of thousands of customers choose Amazon Redshift.

Many organizations store sensitive data, commonly classified as personally identifiable information (PII) or sensitive personal information (SPI), in Amazon Redshift, and access to this data must be restricted for different personas in the organization. For example, your human resources, finance, sales, data science, and marketing departments may all have the privileges required to view customer data, whereas only the finance department should have access to sensitive data like PII or payment card industry (PCI) data.

Previously, views or AWS Lake Formation on Amazon Redshift Spectrum were used to manage such scenarios, but this adds the overhead of creating and maintaining views or Amazon Redshift Spectrum. The view-based approach is also difficult to scale and can lead to a lack of security controls. Amazon Redshift column-level access control is a new feature that supports access control at the column level for data in Amazon Redshift. You can use column-level GRANT and REVOKE statements to help meet your security and compliance needs, just as you manage any other database object.

This post shows you how to set up Amazon Redshift column-level access control on a table, a view, and a materialized view.

Use Case

Two tables store customer demographic and account balance data. The finance department can see all customer data, while the sales department can only view and update market segment and account balance data, because the rest of the customer demographic data, such as customer name, phone, and nation, is considered PII and should have restricted access. This is a good use case for column-level access control to secure the PII data. Below is a simple entity relationship diagram for the two tables.

Prerequisites

Before trying out the walkthrough in this post, note the following prerequisites:

  1. An Amazon Redshift cluster.
  2. A database user with permission to create tables, or a superuser.

Setting up the environment

To setup the environment and implement the use case, complete the following steps:

  1. Connect to your Amazon Redshift cluster using any SQL client of your choice, as a user with permission to create tables or as a superuser.
  2. Create two tables with the following code:

    CREATE TABLE customer 
    (
      customerid       INT8 NOT NULL,
      customername     VARCHAR(25) NOT NULL,
      phone            CHAR(15) NOT NULL,
      nationid        INT4 NOT NULL,
      marketsegment    CHAR(10) NOT NULL,
      accountbalance   NUMERIC(12,2) NOT NULL
    );
    CREATE TABLE nation 
    (
      nationid    INT4 NOT NULL,
      nationname   CHAR(25) NOT NULL
    );

  3. Populate some sample data into the two tables with the following code:

    INSERT INTO customer VALUES
    (1, 'Customer#000000001', '33-687-542-7601', 3, 'HOUSEHOLD', 2788.52),
    (2, 'Customer#000000002', '13-806-545-9701', 1, 'MACHINERY', 591.98),
    (3, 'Customer#000000003', '13-312-472-8245', 1, 'HOUSEHOLD', 3332.02),
    (4, 'Customer#000000004', '23-127-851-8031', 2, 'MACHINERY', 9255.67),
    (5, 'Customer#000000005', '13-137-193-2709', 1, 'BUILDING', 5679.84)
    ;
    INSERT INTO nation VALUES
    (1, 'UNITED STATES'),
    (2, 'AUSTRALIA'),
    (3, 'UNITED KINGDOM')
    ;

  4. Create a view and a materialized view with the following code:

    CREATE OR REPLACE VIEW customer_vw AS
    SELECT customername,
           phone,
           marketsegment,
           accountbalance,
           CASE
             WHEN accountbalance < 1000 THEN 'low'
             WHEN accountbalance < 5000 THEN 'middle'
             ELSE 'high'
           END AS incomegroup
    FROM customer;

    CREATE MATERIALIZED VIEW customernation_mv AS
    SELECT customername,
           phone,
           nationname,
           marketsegment,
           sum(accountbalance) AS accountbalance
    FROM customer c
    INNER JOIN nation n ON c.nationid = n.nationid
    GROUP BY customername, phone, nationname, marketsegment;

  5. The purpose of the view customer_vw is to implement a business rule: categorizing customers into income groups based on the customer dataset.
  6. Analytical dashboards frequently access this dataset by joining and aggregating the tables customer and nation, so the materialized view customernation_mv is created to speed up such queries significantly.
  7. Create the users and grant table-level permissions to user finance, which represents finance department users. Note that the users below are created only for illustration purposes. We recommend that you use AWS IAM federation to bring in your corporate users instead of creating them manually in Amazon Redshift. For more information, see https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-iam-authentication-access-control.html#authentication.
    CREATE USER finance WITH password 'Abcd1234!';
    CREATE USER sales WITH password 'Abcd1234!'; 
    GRANT SELECT, UPDATE ON customer TO finance;
    GRANT SELECT ON customer_vw TO finance;
    GRANT SELECT ON customernation_mv TO finance;

  8. Note that user finance has SELECT and UPDATE permissions on all columns of the customer table.
  9. Test and validate that user finance can view all data from the customer table, the customer_vw view, and the customernation_mv materialized view, and can update data in the customer table.
  10. Enter the following code:
    SET SESSION AUTHORIZATION 'finance';
    SELECT CURRENT_USER;
    SELECT * FROM customer;
    SELECT * FROM customer_vw;
    SELECT * FROM customernation_mv;
    UPDATE customer SET accountbalance = 1000 WHERE marketsegment = 'BUILDING';
    RESET SESSION AUTHORIZATION;

    Note that the SQL statement SET SESSION AUTHORIZATION 'finance' is used to impersonate user finance in the above code.

    Each select statement should return five rows and the update statement should return one row updated. See the following code:

    dev=# SET SESSION AUTHORIZATION 'finance';
    SET
    dev=> SELECT CURRENT_USER;
     current_user 
    --------------
     finance
    (1 row)
     
    dev=> SELECT * FROM customer;
     customerid |    customername    |      phone      | nationid | marketsegment | accountbalance 
    ------------+--------------------+-----------------+----------+---------------+----------------
              1 | Customer#000000001 | 33-687-542-7601 |        3 | HOUSEHOLD     |        2788.52
              2 | Customer#000000002 | 13-806-545-9701 |        1 | MACHINERY     |         591.98
              3 | Customer#000000003 | 13-312-472-8245 |        1 | HOUSEHOLD     |        3332.02
              4 | Customer#000000004 | 23-127-851-8031 |        2 | MACHINERY     |        9255.67
              5 | Customer#000000005 | 13-137-193-2709 |        1 | BUILDING      |        5679.84
    (5 rows)
     
    dev=> SELECT * FROM customer_vw;
        customername    |      phone      | marketsegment | accountbalance | incomegroup 
    --------------------+-----------------+---------------+----------------+-------------
     Customer#000000001 | 33-687-542-7601 | HOUSEHOLD     |        2788.52 | middle
     Customer#000000002 | 13-806-545-9701 | MACHINERY     |         591.98 | low
     Customer#000000003 | 13-312-472-8245 | HOUSEHOLD     |        3332.02 | middle
     Customer#000000004 | 23-127-851-8031 | MACHINERY     |        9255.67 | high
     Customer#000000005 | 13-137-193-2709 | BUILDING      |        5679.84 | high
    (5 rows)
     
    dev=> SELECT * FROM customernation_mv;
        customername    |      phone      |        nationname         | marketsegment | accountbalance 
    --------------------+-----------------+---------------------------+---------------+----------------
     Customer#000000005 | 13-137-193-2709 | UNITED STATES             | BUILDING      |        5679.84
     Customer#000000004 | 23-127-851-8031 | AUSTRALIA                 | MACHINERY     |        9255.67
     Customer#000000003 | 13-312-472-8245 | UNITED STATES             | HOUSEHOLD     |        3332.02
     Customer#000000002 | 13-806-545-9701 | UNITED STATES             | MACHINERY     |         591.98
     Customer#000000001 | 33-687-542-7601 | UNITED KINGDOM            | HOUSEHOLD     |        2788.52
    (5 rows)
     
    dev=> UPDATE customer SET accountbalance = 1000 WHERE marketsegment = 'BUILDING';
    UPDATE 1
    dev=>  
    dev=> RESET SESSION AUTHORIZATION;
    RESET

You have now successfully set up table-level permissions for user finance to view and update all customer data.

Setting up Amazon Redshift column-level access control

Column-level access control can be enabled and disabled by using GRANT and REVOKE statements with the following syntax:

GRANT { { SELECT | UPDATE } ( column_name [, ...] ) [, ...] | ALL [ PRIVILEGES ] ( column_name [,...] ) }
ON { [ TABLE ] table_name [, ...] }
TO { username | GROUP group_name | PUBLIC } [, ...]
 
REVOKE { { SELECT | UPDATE } ( column_name [, ...] ) [, ...] | ALL [ PRIVILEGES ] ( column_name [,...] ) }
ON { [ TABLE ] table_name [, ...] }
FROM { username | GROUP group_name | PUBLIC } [, ...]
[ CASCADE | RESTRICT ]

To set up column-level privileges, complete the following steps:

  1. To determine which users have column-level access control, you can query the PG_ATTRIBUTE_INFO system view. Enter the following code:
    SELECT b.attacl, b.attname, c.relname FROM pg_catalog.pg_attribute_info b JOIN pg_class c ON c.oid=b.attrelid WHERE c.relname in ('customer','customer_vw','customernation_mv') AND b.attacl IS NOT NULL ORDER BY c.relname, b.attname;

  2. The query should return zero records as we have not implemented column-level access control yet.
  3. Grant user sales SELECT permission on the columns marketsegment and accountbalance of the table customer, the view customer_vw, and the materialized view customernation_mv. Also grant UPDATE permission on the columns marketsegment and accountbalance of the table customer by entering the following code:
    RESET SESSION AUTHORIZATION;
    GRANT SELECT (marketsegment, accountbalance) ON customer TO sales WITH GRANT OPTION;
    GRANT SELECT (marketsegment, accountbalance),UPDATE (marketsegment, accountbalance) ON customer TO sales;
    GRANT SELECT (marketsegment, accountbalance) ON customer_vw TO sales;
    GRANT SELECT (marketsegment, accountbalance) ON customernation_mv TO sales;

  4. The error message “Grant options are not supported for column privileges” should be returned for the first statement. This is because the WITH GRANT OPTION clause isn’t supported for column-level privileges, which keeps the security model simple: only a table’s owner or a superuser can grant column-level privileges.
  5. Validate if above permissions have been granted with the following code:
    SELECT b.attacl, b.attname, c.relname FROM pg_catalog.pg_attribute_info b  JOIN pg_class c ON c.oid=b.attrelid WHERE c.relname in ('customer','customer_vw','customernation_mv') AND b.attacl IS NOT NULL ORDER BY c.relname, b.attname;

  6. The query should return six rows. See the following code:
    dev=# SELECT b.attacl, b.attname, c.relname FROM pg_catalog.pg_attribute_info b  JOIN pg_class c ON c.oid=b.attrelid WHERE c.relname in ('customer','customer_vw','customernation_mv') AND b.attacl IS NOT NULL ORDER BY c.relname, b.attname;
          attacl       |    attname     |      relname      
    -------------------+----------------+-------------------
     {sales=rw/fqdemo} | accountbalance | customer
     {sales=rw/fqdemo} | marketsegment  | customer
     {sales=r/fqdemo}  | accountbalance | customer_vw
     {sales=r/fqdemo}  | marketsegment  | customer_vw
     {sales=r/fqdemo}  | accountbalance | customernation_mv
     {sales=r/fqdemo}  | marketsegment  | customernation_mv
    (6 rows)

    The output above shows:
    Users: sales (attacl column)
    Permissions: read/write (attacl column value “rw”)
    On Column: accountbalance, marketsegment (attname column)
    Of table: customer (relname column)
    Granted by: fqdemo (attacl column)

    Users: sales (attacl column)
    Permissions: read (attacl column value “r”)
    On Column: accountbalance, marketsegment (attname column)
    Of table: customer_vw, customernation_mv (relname column)
    Granted by: fqdemo (attacl column)

  7. After you confirm the column-level access controls are correct, run the following code as user sales to query the table customer, the view customer_vw, and the materialized view customernation_mv:
    SET SESSION AUTHORIZATION 'sales';
    SELECT CURRENT_USER;
    SELECT * FROM customer;
    SELECT * FROM customer_vw;
    SELECT * FROM customernation_mv;

  8. Each select statement should return a permission denied error because the user does not have permissions on all columns of the objects being queried. See the following code:
    dev=# SET SESSION AUTHORIZATION 'sales';
    SET
    dev=> SELECT CURRENT_USER;
     current_user 
    --------------
     sales
    (1 row)
     
    dev=> SELECT * FROM customer;
    ERROR:  permission denied for relation customer
    dev=> SELECT * FROM customer_vw;
    ERROR:  permission denied for relation customer_vw
    dev=> SELECT * FROM customernation_mv;
    ERROR:  permission denied for relation customernation_mv

  9. Query only the columns marketsegment and accountbalance from table customer, view customer_vw and materialized view customernation_mv with the following code:
    SELECT marketsegment, accountbalance FROM customer;
    SELECT marketsegment, accountbalance FROM customer_vw;
    SELECT marketsegment, accountbalance FROM customernation_mv;

  10. Each select statement should return five rows as user sales has permission to query columns marketsegment and accountbalance. See the following code:
    dev=> SELECT marketsegment, accountbalance FROM customer;
     marketsegment | accountbalance 
    ---------------+----------------
     HOUSEHOLD     |        2788.52
     MACHINERY     |         591.98
     HOUSEHOLD     |        3332.02
     MACHINERY     |        9255.67
     BUILDING      |        1000.00
    (5 rows)
     
    dev=> SELECT marketsegment, accountbalance FROM customer_vw;
     marketsegment | accountbalance 
    ---------------+----------------
     HOUSEHOLD     |        2788.52
     MACHINERY     |         591.98
     HOUSEHOLD     |        3332.02
     MACHINERY     |        9255.67
     BUILDING      |        1000.00
    (5 rows)
     
    dev=> SELECT marketsegment, accountbalance FROM customernation_mv;
     marketsegment | accountbalance 
    ---------------+----------------
     MACHINERY     |        9255.67
     BUILDING      |        5679.84
     MACHINERY     |         591.98
     HOUSEHOLD     |        3332.02
     HOUSEHOLD     |        2788.52
    (5 rows)

  11. Update the accountbalance column with the following code:
    UPDATE customer SET accountbalance = 2000 WHERE marketsegment = 'BUILDING';
    SELECT accountbalance FROM customer WHERE marketsegment = 'BUILDING';

  12. The select statement should return one row that shows value 2000. See the following code:
    dev=> UPDATE customer SET accountbalance = 2000 WHERE marketsegment = 'BUILDING';
    UPDATE 1
    dev=> SELECT accountbalance FROM customer WHERE marketsegment = 'BUILDING';
     accountbalance 
    ----------------
            2000.00
    (1 row)

  13. Update the accountbalance column with condition nationid=1 by using the following code:
    UPDATE customer SET accountbalance = 3000 WHERE nationid = 1;

  14. The update statement should return a permission denied error because user sales does not have column-level privileges on the column nationid used in the WHERE clause.
  15. Query the count of records grouped by nationid with the following code:
    SELECT COUNT(*) FROM customer GROUP BY nationid;

  16. The select statement should return a permission denied error because user sales doesn’t have column-level privileges on the column nationid used in the GROUP BY clause.
  17. Also note that column-level privileges are checked not only for columns in the select list, but also for columns in the WHERE, ORDER BY, GROUP BY, HAVING, and other clauses of a query that require SELECT or UPDATE privileges on a column.
  18. Remove the column marketsegment from column-level access control for user sales using the REVOKE command and see what happens. Enter the following code:
    RESET SESSION AUTHORIZATION;
    REVOKE SELECT (marketsegment) ON customer FROM sales;
    SET SESSION AUTHORIZATION 'sales';
    SELECT CURRENT_USER;
    SELECT marketsegment, accountbalance FROM customer;
    SELECT accountbalance FROM customer;

  19. As you can see, user sales is no longer able to view marketsegment from table customer.
    dev=> SELECT marketsegment, accountbalance FROM customer;
    ERROR:  permission denied for relation customer
    dev=> SELECT accountbalance FROM customer;
     accountbalance 
    ----------------
            2788.52
             591.98
            3332.02
            9255.67
            2000.00
    (5 rows)

  20. Enter the following code to query column marketsegment from view customer_vw:
    SELECT marketsegment FROM customer_vw;

  21. The statement should return five rows because user sales still has access to the column marketsegment on the view, even though column-level privileges have been revoked from the table customer. Views execute with the permissions of the view owner, so the view continues to work as long as its owner still has column-level or table-level privileges on the base tables used by the view. To prevent unauthorized access to the sensitive data, the column-level privileges for user sales should be revoked from the view as well.
  22. Revoke all permissions for user sales with the following code:
    RESET SESSION AUTHORIZATION;
    SELECT CURRENT_USER;
    REVOKE SELECT ON customernation_mv FROM sales;
    REVOKE SELECT ON customer_vw FROM sales;
    REVOKE SELECT ON customer FROM sales;
    REVOKE UPDATE ON customer FROM sales;

  23. Query the table, view and materialized view again with user sales using the following code:
    SET SESSION AUTHORIZATION 'sales';
    SELECT CURRENT_USER;
    SELECT marketsegment, accountbalance FROM customer;
    SELECT marketsegment, accountbalance FROM customer_vw;
    SELECT marketsegment, accountbalance FROM customernation_mv;

  24. A permission denied error should be returned for each statement, which shows that REVOKE removed all permissions.

In summary, a simple GRANT statement enables column-level access control on an Amazon Redshift table, view, or materialized view, and a REVOKE statement removes the permission. This eliminates the complexity of legacy view-based access control for achieving fine-grained read and write access.

Clean up

Once you are done with the above testing, you can remove the objects and users with the following code:

RESET SESSION AUTHORIZATION;
REVOKE SELECT ON customernation_mv FROM finance;
REVOKE SELECT ON customer_vw FROM finance;
REVOKE SELECT ON customer FROM finance;
REVOKE UPDATE ON customer FROM finance;
DROP VIEW customer_vw;
DROP MATERIALIZED VIEW customernation_mv;
DROP TABLE nation;
DROP TABLE customer;
DROP USER IF EXISTS sales;
DROP USER IF EXISTS finance;

Summary

Amazon Redshift is secure by default, and security doesn’t cost extra. It provides authentication and federation (Active Directory, Okta, Ping Federate, and Azure AD) and comes pre-integrated with AWS IAM and KMS. Since September 2019 it has also supported table-based access control for data in Amazon Redshift and column-level access control for data in Amazon S3 through Amazon Redshift Spectrum. Amazon Redshift now supports access control at the column level for local tables, eliminating the need to implement view-based access control or use another system.

This post showed you how easy it is to set up Amazon Redshift column-level access control. The use case demonstrated how to confirm that you have fine-grained access control on a table, a view, and a materialized view. You can adopt this feature to support your business needs.

If you have any questions or suggestions, please leave a comment below.

 


About the Authors

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

 

 

 

Srikanth Sopirala is a Sr. Specialist Solutions Architect focused on Analytics at AWS. He is passionate about helping customers build scalable data and analytics solutions in the cloud.

Speed up your ELT and BI queries with Amazon Redshift materialized views

Post Syndicated from Juan Yu original https://aws.amazon.com/blogs/big-data/speed-up-your-elt-and-bi-queries-with-amazon-redshift-materialized-views/

The Amazon Redshift materialized views function helps you achieve significantly faster query performance on repeated or predictable workloads such as dashboard queries from Business Intelligence (BI) tools, such as Amazon QuickSight. It also speeds up and simplifies extract, load, and transform (ELT) data processing. You can use materialized views to store frequently used precomputations and seamlessly use them to achieve lower latency on subsequent analytical queries.

This post demonstrates how to create a materialized view, refresh it after data ingestion, and speed up your BI workload.

Setting up your sample datasets

This walkthrough uses the Amazon Customer Reviews Dataset. It is a public dataset stored in the us-east-1 Region. You will create the following three tables:

  • product_reviews – Contains customer reviews for a specific product
  • customer – Contains customer profile data
  • customer_address – Contains customer address information

The following diagram shows the relationship of the three tables.

To download the script and set up the tables, choose mv_blog.sql.

Creating and using materialized views

For this use case, your marketing team wants to build a report that shows how many customers per state like your products. You also want to drill down to each product category when needed.

In this first step, you create a regular view. See the following code:

CREATE VIEW v_reviews_byprod_and_state AS
SELECT PR.product_category,
       A.ca_state AS customer_state,
       count(PR.star_rating) AS cnt
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND PR.marketplace = 'US'
GROUP BY 1,
         2;

The following code is a report to analyze the product review count per state:

SELECT customer_state,
       sum(cnt)
FROM v_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

The following code is a report to analyze the product review count per state for specific categories:

SELECT customer_state,
       sum(cnt)
FROM v_reviews_byprod_and_state
WHERE product_category IN ('Home',
                           'Grocery')
GROUP BY 1
ORDER BY 2;

The preceding reports take approximately 4 seconds to run. As you sell more products and get more reviews, this elapsed time gradually gets longer. To speed up those reports, you can create a materialized view to precompute the count of reviews per product category and per state. See the following code:

CREATE MATERIALIZED VIEW mv_reviews_byprod_and_state AS
SELECT PR.product_category,
       A.ca_state AS customer_state,
       count(PR.star_rating) AS cnt
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND PR.marketplace = 'US'
GROUP BY 1,
         2;

The following code shows the reports that analyze the product reviews against the materialized view:

SELECT customer_state,
       sum(cnt)
FROM mv_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

SELECT customer_state,
       sum(cnt)
FROM mv_reviews_byprod_and_state
WHERE product_category IN ('Home',
                           'Grocery')
GROUP BY 1
ORDER BY 2;

The same reports against materialized views take less than 200 milliseconds because the new queries access precomputed joins, filters, grouping, and partial sums instead of the multiple, larger base tables.

Speeding up and simplifying ELT data processing

To achieve similar performance without the use of materialized views, many users use the CREATE TABLE AS (CTAS) command. However, as you update base tables with new data inserts, updates, or deletes, the CTAS tables become stale; you must recreate them to keep them up-to-date with the latest changes from the base tables. Now with Amazon Redshift materialized views, you can overcome this problem by efficiently and incrementally refreshing the materialized views with supported SQL. For example, the following code ingests another 10,000 reviews:

INSERT INTO product_reviews
SELECT   marketplace, 
  cast(customer_id as bigint) customer_id, 
  review_id, 
  product_id, 
  cast(product_parent as bigint) product_parent, 
  product_title, 
  star_rating, 
  helpful_votes, 
  total_votes, 
  vine, 
  verified_purchase, 
  review_headline, 
  review_body, 
  review_date, 
  year,
  product_category
FROM demo.products_reviews
WHERE review_date = '2015-07-01' LIMIT 10000;

Now the materialized view is out-of-date. To refresh the materialized view, enter the following code:

REFRESH MATERIALIZED VIEW mv_reviews_byprod_and_state;

Within 200 milliseconds, the materialized view is up-to-date again. Your report queries have the same consistent, fast performance.

The following screenshot is the query log that shows query performance. The log shows newer statements at the top.

The materialized view refresh is much faster because it’s incremental: Amazon Redshift only uses the new data to update the materialized view instead of recomputing the entire materialized view from the base tables. For more information, see REFRESH MATERIALIZED VIEW.

Materialized views also simplify and make ELT easier and more efficient. Without materialized views, you might create an ELT job and use CTAS to precompute the product analysis data. The ELT job recomputes this data after new data is ingested and stores the data in the precomputed product analysis table to meet the dashboard latency requirement.

In particular, the ELT job drops and recreates the precomputed product analysis table after each ingestion. See the following code:

BEGIN;
    DROP TABLE IF EXISTS latest_product_analysis;
    CREATE TABLE latest_product_analysis as SELECT ...;
END;

With materialized views, you only need to create the materialized view one time and refresh it to keep it up to date. To refresh materialized views after ingesting new data, add REFRESH MATERIALIZED VIEW to your ELT data ingestion scripts, and Amazon Redshift will automatically and incrementally bring the materialized view up to date.
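As an illustration, the following is a minimal sketch of such an ingestion script in Python, assuming a psycopg2 connection to the cluster; the endpoint, credentials, and the demo.products_reviews staging table are placeholders from this walkthrough, not values to copy verbatim:

import psycopg2

# Connection details are placeholders; supply your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="your-password",
)
conn.autocommit = True

with conn.cursor() as cur:
    # Step 1: ingest the new batch of reviews into the base table.
    cur.execute("""
        INSERT INTO product_reviews
        SELECT marketplace, cast(customer_id AS bigint), review_id, product_id,
               cast(product_parent AS bigint), product_title, star_rating,
               helpful_votes, total_votes, vine, verified_purchase,
               review_headline, review_body, review_date, year, product_category
        FROM demo.products_reviews
        WHERE review_date = '2015-07-01' LIMIT 10000;
    """)
    # Step 2: incrementally refresh the materialized view so dashboards stay current.
    cur.execute("REFRESH MATERIALIZED VIEW mv_reviews_byprod_and_state;")

conn.close()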

Achieving faster performance for BI dashboards

You can use materialized views to help your BI team build a dashboard to analyze product trends.

For example, to create a materialized view to join customer and customer_address dimension tables and precompute reviews and ratings, enter the following code:

CREATE MATERIALIZED VIEW mv_product_analysis 
    sortkey(product_category, Customer_State, review_date) 
AS
SELECT PR.product_category,
       A.ca_state AS Customer_State,
       PR.review_date AS Review_Date,
       PR.product_title,
       COUNT(1) AS review_total,
       SUM(PR.star_rating) AS rating
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND marketplace = 'US'
GROUP BY 1,2,3,4;

You access materialized views the same way you access a regular table. For this walkthrough, choose a materialized view as the source for an Amazon QuickSight dataset, as shown in the following screenshot.

You can preview data of the materialized view in Amazon QuickSight to understand what information can be used to build the dashboard. The following screenshot shows the sample data of mv_product_analysis.

To track how many reviews customers post over time, use review_date as the X-axis and Sum(review_total) as the Y-axis. The following graph shows this visualization.

The following screenshot shows a complete dashboard “Product trend” that analyzes the top product category, product popularity by state, and more.

Because you are using materialized views, the product trend dashboard loads in seconds and is always up-to-date. You can gain the latest insights, understand customer purchase behavior, and identify business opportunities and optimizations.

You can compare the performance of materialized views with other possible alternatives, such as using regular views and using CTAS. The following graph shows the overall query execution for the product trend dashboard. Materialized views not only improve query performance by more than an order of magnitude compared to using a regular view, but also have low maintenance costs compared to using CTAS, because the incremental refresh time is proportional to the delta of changes in the base tables. In contrast, the CTAS recreate approach needs to process all the data in the base tables.

The following animated gif shows the actual response time for the product trend dashboard built using Amazon QuickSight in direct query mode.

Conclusion

This post showed how to create Amazon Redshift materialized views with one or more base tables to speed up both BI queries and ELT. You can easily build and maintain efficient data processing pipelines and seamlessly extend the low latency query execution benefits of materialized views to data analysis.

 


About the Authors

Juan Yu is a Data Warehouse Specialist Solutions Architect at AWS.

 

 

 

 

Jose Kunnackal John is principal product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.

 

 

 

Serving Billions of Ads in Just 100 ms Using Amazon Elasticache for Redis

Post Syndicated from Rodrigo Asensio original https://aws.amazon.com/blogs/architecture/serving-billions-of-ads-with-amazon-elasticache-for-redis/

This post was co-written with Lucas Ceballos, CTO of Smadex

Introduction

Showing ads may seem to be a simple task, but it’s not. Showing the right ad to the right user is an incredibly complex challenge that involves multiple disciplines such as artificial intelligence, data science, and software engineering. Doing it one million times per second with a 100-ms constraint is even harder.

In the ad-tech business, speed and infrastructure costs are the keys to success. The less the final user waits for an ad, the higher the probability of that user clicking on the ad. Doing that while keeping infrastructure costs under control is crucial for business profitability.

About Smadex

Smadex is the leading mobile-first programmatic advertising platform specifically built to deliver best user acquisition performance and complete transparency.

Its state-of-the-art demand-side platform (DSP) technology provides advertisers with the tools they need to achieve their goals and ROI, with measurable results from web forms, post-app install events, store visits, and sales.

Smadex advertising architecture

What does showing ads look like under the hood? At Smadex, our technology works based on the OpenRTB (Real-Time Bidding) protocol.

RTB is a means by which advertising inventory is bought and sold on a per-impression basis, via programmatic instantaneous auction, which is similar to financial markets.

To show ads, we participate in auctions deciding in real time which ad to show and how much to bid trying to optimize the cost of every impression.

High level diagram

  1. The final user browses the publisher’s website or app.
  2. Ad-exchange is called to start a new auction.
  3. Smadex receives the bid request and has to decide which ad to show and how much to offer in just 100 ms (and this is happening one million times per second).
  4. If Smadex won the auction, the ad must be sent and rendered on the publisher’s website or app.
  5. In the end, the user interacts with the ad sending new requests and events to Smadex platform.

Flow of data

As you can see in the previous diagram, showing ads is just one part of the challenge. After the ad is shown, the final user interacts with it in multiple ways, such as clicking it, installing an application, subscribing to a service, etc. This happens during a determined period that we call the “attribution window.” All of those interactions must be tracked and linked to the original bid transaction (using the request_id parameter).

Doing this is complicated: billions of bid transactions must be stored and available so that they can be quickly accessed every time the user interacts with the ad. The longer we store the transactions, the longer we can “wait” for an interaction to take place, and the better for our business and our clients, too.

Detailed diagram

Challenge #1: Cost

The challenge is: What kind of database can store billions of records per day, with at least a 30-day retention capacity (attribution window), be accessed by key-value, and all by spending as little as possible?

The answer is…none! Based on our research, all the available options that met the technical requirements were way out of our budget.

So…how to solve it? Here is when creativity and the combination of different AWS services comes into place.

We started to analyze the time dispersion of the events trying to find some clues. The interesting thing we spotted was that 90% of what we call “post-bid events” (impression, click, install, etc.) happened within one hour after the auction took place.

That means that we can process 90% of post-bid events by storing just one hour of bids.

Under our current workload, in one hour we participate in approximately 3.7 billion auctions generating 100 million bid records of an average 600 bytes each. This adds up to 55 gigabytes per hour, an easier amount of data to process.
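As a quick sanity check on that estimate, here is the back-of-the-envelope arithmetic as a short Python snippet (an illustration only, not part of the Smadex pipeline):

# 100 million bid records per hour at ~600 bytes each
bid_records_per_hour = 100_000_000
avg_record_size_bytes = 600

total_bytes = bid_records_per_hour * avg_record_size_bytes
print(total_bytes / 1024 ** 3)  # ~55.9 GiB per hour, in line with the ~55 GB figure above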

Instead of thinking about one single database to store all the bid requests, we decided to split bids into two different categories:

  • Hot Bid: A request that took place within the last hour (a small amount, frequently accessed)
  • Cold Bid: A request that took place more than one hour ago (a huge amount, infrequently accessed)

Amazon ElastiCache for Redis is the best option to store 55 GB of data in memory, which gives us the ability to query in a key-value way with the lowest possible latency.

Hot Bids flow

Hot Bids flow diagram

  1. Every new bid is a hot bid by definition so it’s going to be stored in the hot bids Redis cluster.
  2. At the moment of the user interaction with the ad, the Smadex tracker component receives an HTTPS notification, including the bid request UUID that originated it.
  3. Based on the date of occurrence extracted from the received UUID, the tracker component can determine whether it’s looking for a hot bid or not. If it’s a hot bid, the tracker reads it directly from Redis with a key-value lookup query, as in the sketch after this list.
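The following is a minimal sketch of that hot-bid path in Python using the redis-py client; the cluster endpoint, the key naming scheme, and the one-hour TTL are assumptions for illustration, not Smadex’s actual implementation:

import json
import redis

# Placeholder endpoint; use your ElastiCache for Redis cluster endpoint.
r = redis.Redis(host="my-cluster.xxxxxx.0001.euw1.cache.amazonaws.com", port=6379)

HOT_BID_TTL_SECONDS = 3600  # bids stay "hot" for one hour

def store_hot_bid(request_id, bid_record):
    # Store the bid keyed by its request UUID and let it expire after one hour.
    r.setex("bid:" + request_id, HOT_BID_TTL_SECONDS, json.dumps(bid_record))

def lookup_hot_bid(request_id):
    # Key-value lookup performed by the tracker when a post-bid event arrives.
    raw = r.get("bid:" + request_id)
    return json.loads(raw) if raw is not None else None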

It’s been easy so far, but what about the other 29 days and 23 hours we need to store?

Challenge #2: Performance

As we previously mentioned, cold bids are a huge, infrequently accessed set of records, with only 10% of post-bid events pointing to them. That sounds like a good use case for an inexpensive, slower data store like Amazon S3.

Thanks to S3’s low storage prices combined with the ability to query S3 objects directly using Amazon Athena, we were able to optimize our costs by storing and querying cold bids with a serverless architecture.

Cold Bids Flow

Cold Bids flow diagram

  1. Incoming bids are buffered by Fluentd and flushed to S3 every minute in JSON format. Each file flushed to S3 contains all the bids processed by a specific EC2 instance during that minute.
  2. An AWS Lambda function is automatically triggered on every new PutObject event from S3. This function transforms the JSON records to Parquet format and saves them back to the S3 bucket, but this time into a specific partition folder based on the file creation timestamp.
  3. As in the hot bids flow, the tracker component determines whether it’s looking for a hot or a cold bid based on the timestamp extracted from the request UUID. In this case, the cold bid is retrieved by running an Amazon Athena lookup query that leverages partitions and the Parquet format to reduce the latency and the amount of data scanned as much as possible, as in the sketch after this list.
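As a rough illustration of the cold-bid lookup, the following Python sketch uses boto3 to run an Athena query against the partitioned Parquet data; the database, table, partition column, and S3 output location are assumed names for this example, not the actual Smadex schema:

import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

def lookup_cold_bid(request_id, bid_date):
    # Restricting the query to a single date partition keeps the scanned data small.
    query = (
        "SELECT * FROM bids.cold_bids "
        "WHERE dt = '{}' AND request_id = '{}' LIMIT 1".format(bid_date, request_id)
    )
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "bids"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )
    query_id = execution["QueryExecutionId"]

    # Athena queries are asynchronous, so poll until the query finishes.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        return None
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return rows[1] if len(rows) > 1 else None  # the first row is the column header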

Conclusion

Thanks to this combined approach using different technologies and a variety of AWS services, we were able to extend our attribution window from 30 to 90 days while reducing infrastructure costs by 45%.

 

 

How to track changes to secrets stored in AWS Secrets Manager using AWS Config and AWS Config Rules

Post Syndicated from Jerry Hayes original https://aws.amazon.com/blogs/security/how-to-track-changes-to-secrets-stored-in-aws-secrets-manager-using-aws-config-and-aws-config-rules/

On April 20th, AWS Config announced support for AWS Secrets Manager, making it easier to track configuration changes to the secrets you manage in AWS Secrets Manager. You can now use AWS Config to track changes to secrets’ metadata, such as the secret description and rotation configuration; relationships to other AWS resources, such as the KMS key used for secret encryption and the Lambda function used for secret rotation; and attributes such as tags associated with the secrets.

You can also leverage two new AWS managed Config rules to evaluate whether your secrets’ configuration is in compliance with your organization’s security and compliance requirements, identify secrets that don’t conform to these standards, and receive notifications about them via Amazon Simple Notification Service (Amazon SNS). Once enabled, these rules trigger every time a secret’s configuration changes.

  • secretsmanager-rotation-enabled-check: Checks whether or not secrets present in AWS Secrets Manager are configured for rotation. This rule also supports the maximumAllowedRotationFrequency parameter which, if specified, compares the secret’s configured rotation frequency to the value set in the parameter (a sketch of enabling this rule programmatically follows this list).
  • secretsmanager-scheduled-rotation-success-check: Checks whether or not secrets present in AWS Secrets Manager configured for rotation have been rotated within their rotation schedule.
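If you prefer to enable the rotation rule programmatically instead of through the console, here is a minimal Python sketch using boto3; the rule name and the 30-day value are examples, and the managed rule identifier SECRETSMANAGER_ROTATION_ENABLED_CHECK is an assumption for illustration:

import json
import boto3

config = boto3.client("config")

# Create (or update) the AWS managed rule that checks whether secrets have rotation enabled.
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "secretsmanager-rotation-enabled-check",
        "Description": "Checks whether secrets in AWS Secrets Manager are configured for rotation.",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "SECRETSMANAGER_ROTATION_ENABLED_CHECK",
        },
        # Optional parameter: flag secrets whose rotation frequency exceeds 30 days.
        "InputParameters": json.dumps({"maximumAllowedRotationFrequency": "30"}),
    }
)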

In this blog post, I walk you through two ways to use AWS Config rules to determine if your organization’s secrets are being managed in compliance with your security requirements:

  • Example 1: Drive rotation adoption by identifying secrets in a single account that aren’t configured for rotation. This maps to the first managed rule listed above.
  • Example 2: Drive compliance with your security standards across multiple AWS accounts by creating an AWS Config Aggregator, which allows you to collect configuration and compliance data from multiple accounts across multiple regions.

Example 1: Drive rotation adoption by identifying secrets that aren’t configured for rotation in a single account and region

Many organizations require regular secret rotation. Use the new managed rule secretsmanager-rotation-enabled-check to verify whether your secrets are configured for automatic rotation.

  1. From the AWS Config console, navigate to Settings and ensure that Recording is on. Under Resource types to record, turn on recording for all resources by checking the All resources box next to Record all resources supported in this region, as shown in Figure 1 below.

    Figure 1: Enable Recording

  2. To set up the rule, go to the Rules page in the AWS Config console and select Add rule, as shown in Figure 2.

    Figure 2: Add Rule

  3. Search for secretsmanager-rotation-enabled-check in the search bar and select the rule that appears, as shown in Figure 3.

    Figure 3: Search for rule

  4. In Figure 4, I use secretsmanager-rotation-enabled-check as the name of my rule. Trigger type is set to run upon changes to the resource’s configuration. For Scope of changes, you can monitor all applicable resources for this rule type or only resources with specific tags. In my example, I am monitoring all secrets where the ENVIRONMENT tag is set to PRODUCTION. And finally, under Rule Parameters, I set maximumAllowedRotationFrequency to 30 days.

    Figure 4: Add managed rule

  5. In my example, I specify AWS-PublishSNSNotification as my Remediation action and enter the parameters for AutomationAssumeRole, Message, and TopicArn so that I can receive notifications from an Amazon SNS topic regarding non-compliant secrets, as shown in Figure 5 below. Setting a Remediation action is optional. Once the rule is set up the way you want it, select Save.
    Figure 5: Choose remediation action

    Once you’ve saved the rule, it will evaluate your secrets every time there’s a change in the secret metadata, and you’ll receive an Amazon SNS notification about non-compliant secrets.

  6. In the AWS Config console, view your results by selecting Rules from the menu on the left. In Figure 6, secretsmanager-rotation-enabled-check shows that it has detected 1 noncompliant resource.

    Figure 6: View rule evaluation results

  7. Select secretsmanager-rotation-enabled-check and it provides a link to the Resource ID of the non-compliant secret, as shown in Figure 7.

    Figure 7: Detailed view of rule with noncompliant secret

Example 2: Drive security compliance across multiple AWS accounts in your AWS Organization by creating an AWS Config Aggregator

Next, I’ll show you how to use the AWS Config Aggregator to review how secrets are configured across all accounts and regions in your AWS Organization so you can see whether they’re in compliance with your organization’s security and compliance requirements. AWS Organizations helps you centrally govern your environment as you grow and scale your workloads on AWS.

NOTE: You must enable AWS Config and the AWS Config managed rules specific to secrets in all accounts and regions that you want to monitor before creating the aggregator. You can use AWS CloudFormation StackSets to enable AWS Config and provision rules across accounts and regions as described here.

  1. In this example, I create the aggregator in my AWS Organization’s master account. From the AWS Config console, select Aggregators from the left menu, then select Add aggregator, as shown in Figure 8.

    Figure 8: Add aggregator

  2. Select the check box next to Allow data replication, as shown in Figure 9 below. This provides the permission for your AWS Organization’s master account to access the resource configuration and compliance details for all the accounts and regions in your AWS Organization.

    Figure 9: Allow data replication

  3. Provide a name for the aggregator. In Figure 10, I’ve named mine MyOrganizationsSecrets. Select Add my organization, then Choose IAM role. Select Create a Role and enter a role name and then select Choose IAM role. The IAM role allows AWS Config to get the list of accounts in your AWS Organization.
    Figure 10: Enable data replication and configure aggregator

    NOTE: If you do not have an organization configured in AWS Organizations, you can select Add individual account IDs and then either add account IDs manually or provide a comma-separated list of accounts.

  4. Select Choose IAM role. Ensure Create a role is selected and enter a unique name. In Figure 11, I’ve named my role aws-config-aggregator-role. Select Choose IAM role again to create the role and again to continue.

    Figure 11: Choose IAM role

  5. Select the Regions from which you want to aggregate data, and select Save. In Figure 12, I’ve selected the two regions in which my AWS Organization uses Secrets Manager.
    Figure 12: Select target regions for aggregation

    Once you’ve selected your regions, click Save.

  6. Select the aggregator you just created to see the Aggregated view. In Figure 13, I select MyOrganizationsSecrets. As noted on the console, an aggregator is an AWS Config resource type that collects AWS Config data from multiple accounts and regions. The data displayed in the dashboard is received from multiple aggregation sources and is refreshed at different intervals; data might be delayed by a few minutes.

    Figure 13: Select aggregator

  7. In the Aggregated view shown in Figure 14 below, you can now see a dashboard view of all resources in your organization, across all accounts and regions. On the top right, the Config rule compliance status shows that this organization has 11 compliant and 7 non-compliant rules. Below that is the Top 5 non-compliant rules list, which shows the rule name, the region, the account number, and the number of non-compliant resources.
    Figure 14: Aggregated view

    You can drill down into this data to view all compliant and non-compliant secrets in all your organization’s accounts and regions, and you can work with individual account or secret owners to drive security compliance: ensuring all secrets are configured for rotation, all secrets meet your organization’s standard for rotation frequency, and secrets are rotated successfully.

  8. In Figure 15, I select secretsmanager-rotation-enabled-check for us-east-1 from the Top 5 non-compliant rules.

    Figure 15: Top 5 noncompliant rules

  9. The detail view in Figure 16 below shows the 5 non-compliant resources and their corresponding Resource IDs.

    Figure 16: Compliant and non-compliant secrets

Summary

In this post, I showed you how to track and evaluate secret configuration using AWS Config and AWS Config Rules using the AWS Management Console. You can also do this using the AWS Config APIs or the AWS Command Line Interface (CLI).

This enables you to drive secrets management best practices in an individual account or across your AWS Organization. To get started managing secrets, open the Secrets Manager console. To learn more, read How to Store, Distribute, and Rotate Credentials Securely with Secrets Manager or refer to the Secrets Manager documentation.

If you have comments about this post, submit them in the Comments section below. If you have questions about anything in this post, start a new thread on the Secrets Manager forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Jerry Hayes

Jerry Hayes is a Solutions Architect Manager on the World Wide Public Sector (WWPS) Solutions Architect (SA) team where he manages a high-performing team of Specialist SAs supporting National Security customers. He holds a Master’s degree from George Washington University and a Bachelor’s degree from Virginia Tech (Go Hokies!). Outside of work, Jerry enjoys spending time with his family, watching football, running, and traveling to new and exciting places.

Enhancing automated database continuous integration with AWS CodeBuild and Amazon RDS Database Snapshot

Post Syndicated from bobyeh original https://aws.amazon.com/blogs/devops/enhancing-automated-database-continuous-integration-with-aws-codebuild-and-amazon-rds-database-snapshot/

In major integration merges, it’s sometimes necessary to verify the changes against existing online data. Inspecting the changes against a cloned database can give us the confidence to deploy to the production database. This post demonstrates how to use AWS CodeBuild and Amazon RDS database snapshots to verify your code revisions in both the application layer and the underlying data layer, ensuring that your existing data works seamlessly with your revised code.

Making code revisions using continuous integration requires running periodic verification to ensure that your new deliverable works functionally and reliably. It’s easy to focus attention solely on the surface-level changes made to the application layer. However, it’s important to remember to inspect the changes made to the underlying data layer too.

From the application layer, users modify the data model for different reasons. Any data model definition change in the application layer maps to a schema change in the database. For services backed by a relational database (RDBMS), a user might perform data definition language (DDL) operations directly against the database schema or rely on an object-relational mapping (ORM) library to migrate the schema to fit the application revision. These schema changes (CREATE, DROP, ALTER, TRUNCATE, etc.) can be very critical, especially for services serving real customers.

Performing proper verification and simulation for these changes mitigates the risk of bringing down services. After the changes are applied, fundamental operation testing (CRUD: CREATE, READ, UPDATE, DELETE) of the data models is mandatory; this leads to data manipulation language (DML) operations (INSERT, SELECT, UPDATE, DELETE, etc.). After all the necessary steps, a user can move on to the deployment stage.

About this page

  • Time to read: 6 minutes
  • Time to complete: 30 minutes
  • Cost to complete (estimated): Less than $1 for a 1-GB database snapshot and restored instance
  • Learning level: Advanced (300)
  • Services used: AWS CodeBuild, IAM, RDS

Solution overview

This example uses a buildspec file in CodeBuild. Set up a build project that points to a source control repository containing that buildspec file. The CodeBuild runtime environment restores the database server from an RDS snapshot. As an example, we restore the snapshot to an Amazon Aurora cluster through the AWS Command Line Interface (AWS CLI). After the database is restored, the build process runs your integration process, which appears as mock code in the buildspec definition. After the verification stage, CodeBuild drops the restored database.

 

Architecture diagram showing an overview of how we use CodeBuild to restore a database snapshot to verify and validate the new database schema change.

Prerequisites

The following components are required to implement this example:

Walkthrough

Follow these steps to execute the solution.

Prepare your build specification file

Before you begin, prepare your CodeBuild Build Specification file with following information:

  • db-cluster-identifier-prefix
  • db-snapshot-identifier
  • region-ID
  • account-ID
  • vpc-security-group-id

The db-cluster-identifier-prefix names the temporary database, to which a timestamp is appended. Make sure that this value does not overlap with any other database identifiers. The db-snapshot-identifier points to the snapshot you want to run with your application. The region-ID and account-ID describe the account and Region you are running in. The vpc-security-group-id indicates the security group used by the CodeBuild environment and the temporary database.

YAML
version: 0.2
phases:
  install:
    runtime-versions:
      python: 3.7
  pre_build:
    commands:
      - pip3 install awscli --upgrade --user
      - export DATE=`date +%Y%m%d%H%M`
      - export DBIDENTIFIER=db-cluster-identifier-prefix-$DATE
      - echo $DBIDENTIFIER
      - aws rds restore-db-cluster-from-snapshot --snapshot-identifier arn:aws:rds:region-ID:account-ID:cluster-snapshot:db-snapshot-identifier --vpc-security-group-ids vpc-security-group-id --db-cluster-identifier $DBIDENTIFIER --engine aurora
      - while [ $(aws rds describe-db-cluster-endpoints --db-cluster-identifier $DBIDENTIFIER | grep -c available) -eq 0 ]; do echo "sleep 60s"; sleep 60; done
      - echo "Temp db ready"
      - export ENDPOINT=$(aws rds describe-db-cluster-endpoints --db-cluster-identifier $DBIDENTIFIER | grep "\"Endpoint\"" | grep -v "\-ro\-" | awk -F '\"' '{print $4}')
      - echo $ENDPOINT
  build:
    commands:
      - echo Build started on `date`
      - echo proceed db connection to $ENDPOINT
      - echo proceed db migrate update, DDL proceed here
      - echo proceed application test, CRUD test run here
  post_build:
    commands:
      - echo Build completed on `date`
      - echo $DBIDENTIFIER
      - aws rds delete-db-cluster --db-cluster-identifier $DBIDENTIFIER --skip-final-snapshot &

 

After you finish editing the file, name it buildspec.yml. Save it in the root directory of the repository you plan to build, then commit the file into your code repository.

  1. Open the CodeBuild console.
  2. Choose Create build project.
  3. In Project Configuration, enter the name and description for the build project.
  4. In Source, select the source provider for your code repository.
  5. In Environment image, choose Managed image, Ubuntu, and the latest runtime version.
  6. Choose the appropriate service role for your project.
  7. In the Additional configuration menu, select the VPC with your Amazon RDS database snapshots, as shown in the following screenshot, and then select Validate VPC Settings. For more information, see Use CodeBuild with Amazon Virtual Private Cloud.
  8. In Security Groups, select the security group needed for the CodeBuild environment to access your temporary database.
  9. In Build Specifications, select Use a buildspec file.

CodeBuild Project Additional Configuration - VPC

Grant permission for the build project

Follow these steps to grant permission.

  1. Open the IAM console and choose Policies.
  2. Choose Create policy and select the JSON tab. To give CodeBuild access to the Amazon RDS resources used in the pre_build stage, you must grant the RestoreDBClusterFromSnapshot and DeleteDBCluster actions. Follow the least-privilege guideline and limit the DeleteDBCluster action to “arn:aws:rds:*:*:cluster:db-cluster-identifier-*”.
  3. Copy the following code and paste it into your policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": "rds:RestoreDBClusterFromSnapshot",
          "Resource": "*"
        },
        {
          "Sid": "VisualEditor1",
          "Effect": "Allow",
          "Action": "rds:DeleteDBCluster*",
          "Resource": "arn:aws:rds:*:*:cluster:db-cluster-identifier-*"
        }
      ]
    }
  4. Choose Review Policy.
  5. Enter a Name and Description for this policy, then choose Create Policy.
  6. After the policy is ready, attach it to your CodeBuild service role, as shown in the following screenshot.

Attach created policy to IAM role

Use database snapshot restore to launch the build process

  1. Navigate back to CodeBuild and locate the project you just created.
  2. Set an appropriate timeout and make sure the project points to the correct branch of your repository.
  3. Choose Start Build.
  4. Open the Build Log to view the database cluster restored from your snapshot in the pre_build stage, as shown in the following screenshot. CodeBuild Project Build Log - pre_build stage
  5. In the build stage, use $ENDPOINT to point your application to this temporary database, as shown in the following screenshot. CodeBuild Project Build Log - build stage
  6. In the post_build stage, delete the cluster, as shown in the following screenshot. CodeBuild Project Build Log - post_build stage

Test your database schema change

After you set up this pipeline, you can begin to test your database schema change with your application code. This example defines several steps in the build specification file to migrate the schema and run the latest application code against it. This lets you verify that all the modifications work end to end, from the application to the database.

YAML
build:
  commands:
    - echo Build started on `date`
    - echo proceed db connection to $ENDPOINT
    # run a script to apply your latest schema change
    - echo proceed db migrate update
    # start the latest code, and run your own testing
    - echo proceed application test
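For illustration, the schema-change step above could apply a SQL script along the lines of the following minimal sketch; the store_sales table, the discount_code column, and the migration itself are hypothetical placeholders for your own change.

-- Hypothetical migration script applied to the temporary database at $ENDPOINT
ALTER TABLE store_sales ADD COLUMN discount_code VARCHAR(16);

-- Confirm the new column is visible before the application tests run
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'store_sales' AND column_name = 'discount_code';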

After validation

After we validated the database schema change in the above steps, a suitable strategy for deployment to production should be utilized that would align with the criteria to satisfy the business goals.

Cleaning up

To avoid incurring future charges, delete the resources by completing the following steps:

  1. Open the CodeBuild console.
  2. Choose the project you created for this test.
  3. Choose Delete build project and enter delete to confirm deletion.

Conclusion

In this post, you created a mechanism to set up a temporary database and limit access to it within the build runtime. The temporary database stands alone and is isolated from other environments. You can apply this mechanism to keep permission control over the database snapshot tight and to avoid breaking any existing environment. The approach works with all available RDS database engines, including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. It gives you options, without impacting any existing environments, for critical events such as major changes to the production database schema or data format changes required by business decisions.

 

Accelerate Amazon Redshift Federated Query adoption with AWS CloudFormation

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/accelerate-amazon-redshift-federated-query-adoption-with-aws-cloudformation/

Amazon Redshift Federated Query allows you to combine the data from one or more Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL databases with data already in Amazon Redshift. You can also combine such data with data in an Amazon S3 data lake.

This post shows you how to set up Aurora PostgreSQL and Amazon Redshift with a 10 GB TPC-H dataset, and Amazon Redshift Federated Query using AWS CloudFormation. For more information about using Federated Query, see Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query. You can use the environment you set up in this post to experiment with various use cases in the preceding post.

Benefits of using CloudFormation templates

The standard workflow of setting up Amazon Redshift Federated Query involves six steps. For more information, see Querying Data with Federated Query in Amazon Redshift. With a CloudFormation template, you can condense these manual procedures into a few steps listed in a text file. The declarative code in the file captures the intended state of the resources to create and allows you to automate the creation of AWS resources to support Amazon Redshift Federated Query. You can further enhance this template to become the single source of truth for your infrastructure.

A CloudFormation template acts as an accelerator. It helps you automate the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

Architecture overview

The following diagram illustrates the solution architecture.

The CloudFormation templates provision the following components in the architecture:

  • VPC
  • Subnets
  • Route tables
  • Internet gateway
  • Amazon Linux Bastion host
  • Secrets
  • Aurora PostgreSQL cluster with TPC-H dataset preloaded
  • Amazon Redshift cluster with TPC-H dataset preloaded
  • Amazon Redshift IAM role with required permissions

Prerequisites

Before you create your resources in AWS CloudFormation, you must complete the following prerequisites:

  • Have an IAM user with sufficient permissions to interact with the AWS Management Console and related AWS services. Your IAM permissions must also include access to create IAM roles and policies via the CloudFormation template.
  • Create an Amazon EC2 key pair in the us-east-1 Region. Make sure that you save the private key; this is the only time you can do so. You use this key pair as an input parameter when you set up the CloudFormation stack.

Setting up the resources with AWS CloudFormation

This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use.

To create these resources, complete the following steps:

  1. Sign in to the console.
  2. Choose the us-east-1 Region in which to create the stack.
  3. Choose Launch Stack:
  4. Choose Next. This automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console.
  5. For Stack name, enter a stack name.
  6. For Session, leave as the default.
  7. For ec2KeyPair, choose the key pair you created earlier.
  8. Choose Next.
  9. On the next screen, choose Next.
  10. Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
  11. Choose Create. Stack creation can take up to 45 minutes.
  12. After the stack creation is complete, in the Outputs section, record the value of the key for the following components, which you use in a later step:
  • AuroraClusterEndpoint
  • AuroraSecretArn
  • RedshiftClusterEndpoint
  • RedshiftClusterRoleArn

You are now ready to log in to both the Aurora PostgreSQL and Amazon Redshift cluster and run some basic commands to test them.

Logging in to the clusters using the Amazon Linux Bastion host

The following steps assume that you use a computer with an SSH client to connect to the Bastion host. For more information about connecting using various clients, see Connect to Your Linux Instance.

  1. Move the private key of the EC2 key pair (that you saved previously) to a location on your SSH client, where you are connecting to the Amazon Linux Bastion host.
  2. Change the permissions on the private key using the following command so that it’s not publicly viewable: chmod 400 <private key file name; for example, bastion-key.pem>
  3. On the Amazon EC2 console, choose Instances.
  4. Choose the Amazon Linux Bastion host that the CloudFormation stack created.
  5. Choose Connect.
  6. Copy the value for SSHCommand.
  7. On the SSH client, change the directory to the location where you saved the EC2 private key, and paste the SSHCommand value.
  8. On the console, open the Secrets Manager dashboard.
  9. Choose the secret secretAuroraMasterUser-*.
  10. Choose Retrieve secret value.
  11. Record the password under Secret key/value, which you use to log in to the Aurora PostgreSQL cluster.
  12. Choose the secret SecretRedshiftMasterUser.
  13. Choose Retrieve secret value.
  14. Record the password under Secret key/value, which you use to log in to the Amazon Redshift cluster.
  15. Log in to both the Aurora PostgreSQL and Amazon Redshift databases using the psql client. The CloudFormation template has already set up the psql client binaries on the Amazon Linux Bastion host.
  16. Enter the following code in the command prompt of the Bastion host (substitute <RedshiftClusterEndpoint> with the value from the AWS CloudFormation output): psql -h <RedshiftClusterEndpoint> -d dev -p 5439 -U fqdemo
  17. When prompted, enter the database user password you recorded earlier.
  18. Enter the following SQL command:
    select "table" from svv_table_info where schema='public';

    You should see the following eight tables as the output:

    dev=# select "table" from svv_table_info where schema='public';
     table   
    ----------
     orders
     customer
     region
     nation
     supplier
     part
     lineitem
     partsupp
    (8 rows)

  19. Launch another command prompt session of the Bastion host and enter the following code (substitute <AuroraClusterEndpoint> with the value from the AWS CloudFormation output): psql -h <AuroraClusterEndpoint> -d dev -p 5432 -U awsuser
  20. When prompted, enter the database user password you recorded earlier.
  21. Enter the following SQL command:
    select tablename from pg_tables where schemaname='public';

    You should see the following eight tables as the output:

    dev=# select tablename from pg_tables where schemaname='public';
     tablename 
    -----------
     region
     nation
     lineitem
     orders
     part
     supplier
     partsupp
     customer
    (8 rows)

Completing Federated Query setup

The final step is to create an external schema to connect to the Aurora PostgreSQL instance. The following example code is the CREATE EXTERNAL SCHEMA statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS pg 
FROM POSTGRES 
DATABASE 'dev' 
SCHEMA 'public' 
URI '<AuroraClusterEndpoint>' 
PORT 5432 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – AuroraClusterEndpoint value from the CloudFormation stack outputs. Value is in the format <stackname>-cluster.<randomcharacter>.us-east-1.rds.amazonaws.com
  • IAMRole – RedshiftClusterRoleArn value from the CloudFormation stack outputs. Value is in the format arn:aws:iam::<accountnumber>:role/<stackname>-RedshiftClusterRole-<randomcharacter>
  • SecretARN – AuroraSecretArn value from the CloudFormation stack outputs. Value is in the format arn:aws:secretsmanager:us-east-1:<accountnumber>:secret:secretAuroraMasterUser-<randomcharacter>
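After you run the CREATE EXTERNAL SCHEMA statement, you can optionally confirm that the schema registered correctly; one quick check, assuming the schema name pg from the example above, is to query the svv_external_schemas system view:

SELECT schemaname, databasename, esoptions
FROM svv_external_schemas
WHERE schemaname = 'pg';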

Testing Federated Query

Now that you have set up Federated Query, you can start testing the feature using the TPC-H dataset that was preloaded into both Aurora PostgreSQL and Amazon Redshift.

The following query shows the parts and supplier relationship. Tables PARTSUPP and PART are stored in Amazon Redshift, and the SUPPLIER table in the subquery is from Aurora PostgreSQL:

SELECT TOP 10 P_BRAND,
       P_TYPE,
       P_SIZE,
       COUNT(DISTINCT PS_SUPPKEY) AS SUPPLIER_CNT
FROM PARTSUPP,
     PART
WHERE P_PARTKEY = PS_PARTKEY
AND   P_BRAND <> 'Brand#23'
AND   P_TYPE NOT LIKE 'MEDIUM ANODIZED%'
AND   P_SIZE IN (1,32,33,46,7,42,21,40)
AND   PS_SUPPKEY NOT IN (SELECT S_SUPPKEY
                         FROM pg.SUPPLIER
                         WHERE S_COMMENT LIKE '%Customer%Complaints%')
GROUP	BY P_BRAND,
         P_TYPE,
         P_SIZE
ORDER	BY SUPPLIER_CNT DESC,
         P_BRAND,
         P_TYPE,
         P_SIZE;

The following query shows the order priority by combining ORDERS table data from Amazon Redshift and Aurora PostgreSQL. This demonstrates the use case of live data query from an OLTP source federated with historical data on a data warehouse:

SELECT O_ORDERPRIORITY,
       COUNT(*) AS ORDER_COUNT
FROM (SELECT O_ORDERPRIORITY
      FROM ORDERS o
      WHERE O_ORDERDATE < '1997-07-01'
      AND   O_ORDERDATE >= CAST(DATE '1997-07-01' - INTERVAL '3 months' AS DATE)
      UNION ALL
      SELECT O_ORDERPRIORITY
      FROM pg.ORDERS o
      WHERE O_ORDERDATE >= '1997-07-01'
      AND   O_ORDERDATE < CAST(DATE '1997-07-01' + INTERVAL '1 day' AS DATE)) AS all_orders
GROUP	BY O_ORDERPRIORITY
ORDER	BY O_ORDERPRIORITY;

You can continue to experiment with the dataset and explore the three main use cases from the post, Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query.

Deleting the CloudFormation stack

When you are finished, delete the CloudFormation stack; some of the AWS resources in this walkthrough incur a cost if you continue to use them. Complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose the stack you launched in this walkthrough. The stack must be currently running.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack.

Summary

This post showed you how to automate the creation of an Aurora PostgreSQL and Amazon Redshift cluster preloaded with the TPC-H dataset, the prerequisites of the new Amazon Redshift Federated Query feature using AWS CloudFormation, and a single manual step to complete the setup. The post also provided some example federated queries using the TPC-H dataset, which you can use to accelerate your learning and adoption of the new features. You can continue to modify the CloudFormation templates from this post to support your business needs.

If you have any questions or suggestions, please leave a comment.

 


About the Authors

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

 

 

 

Srikanth Sopirala is a Sr. Specialist Solutions Architect focused on Analytics at AWS. He is passionate about helping customers build scalable data and analytics solutions in the cloud.

Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query

Post Syndicated from Tito Mijares original https://aws.amazon.com/blogs/big-data/build-a-simplified-etl-and-live-data-query-solution-using-redshift-federated-query/

You may have heard the saying that the best ETL is no ETL. Amazon Redshift now makes this possible with Federated Query. In its initial release, this feature lets you query data in Amazon Aurora PostgreSQL or Amazon RDS for PostgreSQL using Amazon Redshift external schemas. Federated Query also exposes the metadata from these source databases through system views and driver APIs, which allows business intelligence tools like Tableau and Amazon QuickSight to connect to Amazon Redshift and query data in PostgreSQL without having to make local copies. This enables a new data warehouse pattern—live data query—in which you can seamlessly retrieve data from PostgreSQL databases, or build data into a late binding view, which combines operational PostgreSQL data, analytical Amazon Redshift local data, and historical Amazon Redshift Spectrum data in an Amazon S3 data lake.

Simplified ETL use case

For this ETL use case, you can simplify the familiar upsert pattern with a federated query. You can bypass the need for incremental extracts in Amazon S3 and the subsequent load via COPY by querying the data in place within its source database. This change can be a single line of code that replaces the COPY command with a query to an external table. See the following code:

BEGIN;
CREATE TEMP TABLE staging (LIKE ods.store_sales);
-- replace the following COPY from S3 
COPY staging FROM 's3://yourETLbucket/daily_store_sales/' 
     IAM_ROLE 'arn:aws:iam::<account_id>:role/<s3_reader_role>' DELIMITER '|' COMPUPDATE OFF;
-- with this federated query to load staging data from PostgreSQL source
INSERT INTO staging SELECT * FROM pg.store_sales p
	WHERE p.last_updated_date > (SELECT MAX(last_updated_date) FROM ods.store_sales);
DELETE FROM ods.store_sales USING staging s WHERE ods.store_sales.id = s.id;
INSERT INTO ods.store_sales SELECT * FROM staging;
DROP TABLE staging;
COMMIT;

In the preceding example, the table pg.store_sales resides in PostgreSQL, and you use a federated query to retrieve fresh data to load into a staging table in Amazon Redshift, keeping the actual delete and insert operations unchanged. This pattern is likely the most common application of federated queries.
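Because Amazon Redshift pushes query predicates down to PostgreSQL where possible (discussed further in the next section), you can inspect exactly what is shipped to the source database. The following is a minimal sketch using the same pg.store_sales table; the literal date is only illustrative:

-- Look for a "Remote PG Seq Scan ... Filter:" line in the plan output;
-- it shows the predicate that Amazon Redshift pushed down to PostgreSQL.
EXPLAIN
SELECT COUNT(*)
FROM pg.store_sales
WHERE last_updated_date > '2020-01-01';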

Setting up an external schema

The external schema pg in the preceding example was set up as follows:

CREATE EXTERNAL SCHEMA IF NOT EXISTS pg                                                                         
FROM POSTGRES                                                                                                           
DATABASE 'dev' 
SCHEMA 'retail'                                                                                     
URI 'database-1.cluster-ro-samplecluster.us-east-1.rds.amazonaws.com'                                                    
PORT 5432                                                                                                               
IAM_ROLE 'arn:aws:iam::555566667777:role/myFederatedQueryRDS'                                                           
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:555566667777:secret:MyRDSCredentials-TfzFSB'

If you’re familiar with the CREATE EXTERNAL SCHEMA command from using it in Spectrum, note some new parameter options to enable federated queries.

FROM POSTGRES                                                                                                           
DATABASE 'dev' 
SCHEMA 'retail'

Whereas Amazon Redshift Spectrum references an external data catalog that resides within AWS Glue, Amazon Athena, or Hive, this code points to a Postgres catalog. Also, expect more keywords to be used with FROM as Amazon Redshift adds support for more source databases for federated querying. If you do not specify SCHEMA, it defaults to public.

Within the target database, you identify DATABASE ‘dev’ and SCHEMA ‘retail’, so any queries to the Amazon Redshift table pg.<some_table> get issued to PostgreSQL as a request for retail.<some_table> in the dev database. For Amazon Redshift, query predicates are pushed down and run entirely in PostgreSQL, which reduces the result set returned to Amazon Redshift for subsequent operations. Going further, the query planner derives cardinality estimates for external tables to optimize joins between Amazon Redshift and PostgreSQL. From the preceding example:

URI 'database-1.cluster-ro-samplecluster.us-east-1.rds.amazonaws.com'                                                    
PORT 5432

The URI and PORT parameters that reference both the PostgreSQL endpoint and port are self-explanatory, but there are a few things to consider in your configuration:

  • Use a read replica endpoint in Aurora or Amazon RDS for PostgreSQL to reduce load on the primary instance.
  • Set up your Amazon RDS for PostgreSQL instance, Aurora serverless or provisioned instances, and Amazon Redshift clusters to use the same VPC and subnet groups. That way, you can add the security group for the cluster to the inbound rules of the security group for the Aurora or Amazon RDS for PostgreSQL instance.
  • If both Amazon Redshift and Aurora or Amazon RDS for PostgreSQL are on different VPCs, set up VPC peering. For more information, see What is VPC Peering?

Configuring AWS Secrets Manager for remote database credentials

To retrieve AWS Secrets Manager remote database credentials, our example uses the following code:

IAM_ROLE 'arn:aws:iam::555566667777:role/myFederatedQueryRDS'                                                           
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:555566667777:secret:MyRDSCredentials-TfzFSB'

These two parameters are interrelated because the SECRET_ARN is also embedded in the IAM policy for the role.

If a service like Secrets Manager didn’t exist and you wanted to issue a federated query from Amazon Redshift to PostgreSQL, you would need to supply the database credentials to the CREATE EXTERNAL SCHEMA command via a parameter like CREDENTIALS, which you also use with the COPY command. However, this hardcoded approach doesn’t take into account that the PostgreSQL credentials could expire.

You avoid this problem by keeping PostgreSQL database credentials within Secrets Manager, which provides a centralized service to manage secrets. Because Amazon Redshift retrieves and uses these credentials, they are transient and not stored in any generated code and are discarded after query execution.

Storing credentials in Secrets Manager takes up to a few minutes. To store a new secret, complete the following steps:

 

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. In the Store a new secret section, complete the following:

 

  • Supply your PostgreSQL database credentials
  • Name the secret; for example, MyRDSCredentials
  • Configure rotation (you can enable this at a later time)
  • Optionally, copy programmatic code for accessing your secret using your preferred programming languages (which is not needed for this post)
  4. Choose Next.

You can also retrieve the credentials easily.

  1. On the Secrets Manager console, choose your secret.
  2. Choose Retrieve secret value.

The following screenshot shows you the secret value details.

This secret is now an AWS resource referenced via a secret ARN. See the following screenshot.

Setting up an IAM role

You can now pull everything together by embedding the secret ARN into an IAM policy, naming the policy, and attaching it to an IAM role. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager:us-east-1:555566667777:secret:MyRDSCredentials-TfzFSB"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

The following screenshot shows the details of the IAM role called myFederatedQueryRDS, which contains the MyRDSSecretPolicy policy. It’s the same role that’s supplied in the IAM_ROLE parameter of the CREATE EXTERNAL SCHEMA DDL.

Finally, attach the same IAM role to your Amazon Redshift cluster.

  1. On the Amazon Redshift console, choose your cluster.
  2. From the Actions drop-down menu, choose Manage IAM roles.
  3. Choose and add the IAM role you just created.

You have now completed the following steps:

  1. Create an IAM policy and role
  2. Store your PostgreSQL database credentials in Secrets Manager
  3. Create an Amazon Redshift external schema definition that uses the secret and IAM role to authenticate with a PostgreSQL endpoint
  4. Map an Amazon Redshift database and schema to a PostgreSQL database and schema so that Amazon Redshift can issue queries to PostgreSQL tables.

You only need to complete this configuration one time.
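As a quick sanity check after this one-time setup, you can run a simple federated query through the new external schema; for example, using the pg.store_sales table referenced earlier (substitute one of your own PostgreSQL tables):

SELECT COUNT(*) FROM pg.store_sales;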

Querying live operational data

This section explores another use case: querying operational data across multiple source databases. In this use case, a global online retailer has databases deployed by different teams across distinct geographies:

  • Region us-east-1 runs serverless Aurora PostgreSQL.
  • Region us-west-1 runs provisioned Aurora PostgreSQL, which is also configured as a global database with a read replica in us-east-1.
  • Region eu-west-1 runs an Amazon RDS for PostgreSQL instance with a read replica in us-east-1.

Serverless and provisioned Aurora PostgreSQL and Amazon RDS for PostgreSQL are visible in the Amazon RDS console in Region us-east-1. See the following screenshot:

For this use case, assume that you configured the read replicas for Aurora and Amazon RDS to share the same VPC and subnets in us-east-1 with the local serverless Aurora PostgreSQL cluster. Furthermore, you have already created secrets for each of these instances’ credentials, and also an IAM role with the MyCombinedRDSSecretPolicy policy attached, which is more permissive and allows Amazon Redshift to retrieve the value of any Amazon RDS secret within any Region. Be mindful of security in actual production use, however, and explicitly specify the resource ARNs for each secret in separate statements in your IAM policy. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager:*:555566667777:secret:*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

External schema DDLs in Amazon Redshift can then reference the combined IAM role and individual secret ARNs. See the following code:

CREATE EXTERNAL SCHEMA IF NOT EXISTS useast
FROM POSTGRES
DATABASE 'dev'
URI 'us-east-1-aurora-pg-serverless.cluster-samplecluster.us-east-1.rds.amazonaws.com'
PORT 5432
IAM_ROLE 'arn:aws:iam::555566667777:role/MyCombinedRDSFederatedQuery'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:555566667777:secret:MyEastUSAuroraServerlessCredentials-dXOlEq'
;

CREATE EXTERNAL SCHEMA IF NOT EXISTS uswest
FROM POSTGRES
DATABASE 'dev'
URI 'global-aurora-pg-west-coast-stores-instance-1.samplecluster.us-east-1.rds.amazonaws.com'
PORT 5432
IAM_ROLE 'arn:aws:iam::555566667777:role/MyCombinedRDSFederatedQuery'
SECRET_ARN 'arn:aws:secretsmanager:us-west-1:555566667777:secret:MyWestUSAuroraGlobalDBCredentials-p3sV9m'
;

CREATE EXTERNAL SCHEMA IF NOT EXISTS europe
FROM POSTGRES
DATABASE 'dev'
URI 'eu-west-1-postgres-read-replica.samplecluster.us-east-1.rds.amazonaws.com'
PORT 5432
IAM_ROLE 'arn:aws:iam::555566667777:role/MyCombinedRDSFederatedQuery'
SECRET_ARN 'arn:aws:secretsmanager:eu-west-1:555566667777:secret:MyEuropeRDSPostgresCredentials-mz2u9L'
;

This late binding view abstracts the underlying queries to TPC-H lineitem test data within all PostgreSQL instances. See the following code:

CREATE VIEW global_lineitem AS
SELECT 'useast' AS region, * from useast.lineitem
UNION ALL
SELECT 'uswest', * from uswest.lineitem
UNION ALL
SELECT 'europe', * from europe.lineitem
WITH NO SCHEMA BINDING
;

Amazon Redshift can query live operational data across multiple distributed databases and aggregate results into a unified view with this feature. See the following code:

dev=# SELECT region, extract(month from l_shipdate) as month,
      sum(l_extendedprice * l_quantity) - sum(l_discount) as sales
      FROM global_lineitem
      WHERE l_shipdate >= '1997-01-01'
      AND l_shipdate < '1998-01-01'
      AND month < 4
      GROUP BY 1, 2
      ORDER BY 1, 2
;
 region | month |      sales
--------+-------+------------------
 europe |     1 | 16036160823.3700
 europe |     2 | 15089300790.7200
 europe |     3 | 16579123912.6700
 useast |     1 | 16176034865.7100
 useast |     2 | 14624520114.6700
 useast |     3 | 16645469098.8600
 uswest |     1 | 16800599170.4600
 uswest |     2 | 14547930407.7000
 uswest |     3 | 16595334825.9200
(9 rows)

If you examine Remote PG Seq Scan in the following query plan, you see that predicates are pushed down for execution in all three PostgreSQL databases. Unlike your initial simplified ETL use case, no ETL is performed because data is queried and filtered in place. See the following code:

dev=# EXPLAIN SELECT region, extract(month from l_shipdate) as month,
      sum(l_extendedprice * l_quantity) - sum(l_discount) as sales
FROM global_lineitem
WHERE l_shipdate >= '1997-01-01'
AND l_shipdate < '1998-01-01'
AND month < 4
GROUP BY 1, 2
ORDER BY 1, 2
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Merge  (cost=1000000060145.67..1000000060146.17 rows=200 width=100)
   Merge Key: derived_col1, derived_col2
   ->  XN Network  (cost=1000000060145.67..1000000060146.17 rows=200 width=100)
         Send to leader
         ->  XN Sort  (cost=1000000060145.67..1000000060146.17 rows=200 width=100)
               Sort Key: derived_col1, derived_col2
               ->  XN HashAggregate  (cost=60136.52..60138.02 rows=200 width=100)
                     ->  XN Subquery Scan global_lineitem  (cost=20037.51..60130.52 rows=600 width=100)
                           ->  XN Append  (cost=20037.51..60124.52 rows=600 width=52)
                                 ->  XN Subquery Scan "*SELECT* 1"  (cost=20037.51..20041.51 rows=200 width=52)
                                       ->  XN HashAggregate  (cost=20037.51..20039.51 rows=200 width=52)
                                             ->  XN PG Query Scan lineitem  (cost=0.00..20020.84 rows=1667 width=52)
                                                   ->  Remote PG Seq Scan useast.lineitem  (cost=0.00..20000.00 rows=1667 width=52)
                                                         Filter: ((l_shipdate < '1998-01-01'::date) AND (l_shipdate >= '1997-01-01'::date) AND ("date_part"('month'::text, l_shipdate) < 4))
                                 ->  XN Subquery Scan "*SELECT* 2"  (cost=20037.51..20041.51 rows=200 width=52)
                                       ->  XN HashAggregate  (cost=20037.51..20039.51 rows=200 width=52)
                                             ->  XN PG Query Scan lineitem  (cost=0.00..20020.84 rows=1667 width=52)
                                                   ->  Remote PG Seq Scan uswest.lineitem  (cost=0.00..20000.00 rows=1667 width=52)
                                                         Filter: ((l_shipdate < '1998-01-01'::date) AND (l_shipdate >= '1997-01-01'::date) AND ("date_part"('month'::text, l_shipdate) < 4))
                                 ->  XN Subquery Scan "*SELECT* 3"  (cost=20037.51..20041.51 rows=200 width=52)
                                       ->  XN HashAggregate  (cost=20037.51..20039.51 rows=200 width=52)
                                             ->  XN PG Query Scan lineitem  (cost=0.00..20020.84 rows=1667 width=52)
                                                   ->  Remote PG Seq Scan europe.lineitem  (cost=0.00..20000.00 rows=1667 width=52)
                                                         Filter: ((l_shipdate < '1998-01-01'::date) AND (l_shipdate >= '1997-01-01'::date) AND ("date_part"('month'::text, l_shipdate) < 4))
(24 rows)

Combining the data lake, data warehouse, and live operational data

In this next use case, you join Amazon Redshift Spectrum historical data with current data in Amazon Redshift and live data in PostgreSQL. You use a 3TB TPC-DS dataset and unload data from 1998 through 2001 from the store_sales table in Amazon Redshift to Amazon S3. The unloaded files are stored in Parquet format with ss_sold_date_sk as partitioning key.

To access this historical data via Amazon Redshift Spectrum, create an external table. See the following code:

CREATE EXTERNAL TABLE spectrum.store_sales_historical
(
  ss_sold_time_sk int ,
  ss_item_sk int ,
  ss_customer_sk int ,
  ss_cdemo_sk int ,
  ss_hdemo_sk int ,
  ss_addr_sk int ,
  ss_store_sk int ,
  ss_promo_sk int ,
  ss_ticket_number bigint,
  ss_quantity int ,
  ss_wholesale_cost numeric(7,2) ,
  ss_list_price numeric(7,2) ,
  ss_sales_price numeric(7,2) ,
  ss_ext_discount_amt numeric(7,2) ,
  ss_ext_sales_price numeric(7,2) ,
  ss_ext_wholesale_cost numeric(7,2) ,
  ss_ext_list_price numeric(7,2) ,
  ss_ext_tax numeric(7,2) ,
  ss_coupon_amt numeric(7,2) ,
  ss_net_paid numeric(7,2) ,
  ss_net_paid_inc_tax numeric(7,2) ,
  ss_net_profit numeric(7,2)
)
PARTITIONED BY (ss_sold_date_sk int)
STORED AS PARQUET
LOCATION 's3://mysamplebucket/historical_store_sales/';   

The external spectrum schema is defined as the following:

CREATE EXTERNAL SCHEMA spectrum
FROM data catalog DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::555566667777:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

Instead of an Amazon S3 read-only policy, the IAM role mySpectrumRole contains both the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies; the former allows Amazon Redshift to write to Amazon S3. See the following code:

UNLOAD ('SELECT * FROM tpcds.store_sales WHERE ss_sold_date_sk < 2452276')
TO 's3://mysamplebucket/historical_store_sales/'
IAM_ROLE 'arn:aws:iam::555566667777:role/mySpectrumRole'
FORMAT AS PARQUET
PARTITION BY (ss_sold_date_sk) ALLOWOVERWRITE;

To make partitioned data visible, the ALTER TABLE ... ADD PARTITION command needs to specify all partition values. For this use case, 2450816 through 2452275 correspond to dates 1998-01-02 through 2001-12-31, respectively. To generate these DDLs quickly, use the following code:

WITH partitions AS (SELECT * FROM generate_series(2450816, 2452275))
SELECT 'ALTER TABLE spectrum.store_sales_historical ADD PARTITION (ss_sold_date_sk='|| generate_series || ') '
    || 'LOCATION \'s3://mysamplebucket/historical_store_sales/ss_sold_date_sk=' || generate_series || '/\';'
FROM partitions;

You can run the generated ALTER TABLE statements individually or as a batch to make partition data visible. See the following code:

ALTER TABLE spectrum.store_sales_historical 
ADD PARTITION (ss_sold_date_sk=2450816)
LOCATION 's3://mysamplebucket/historical_store_sales/ss_sold_date_sk=2450816/';
-- repeated for all partition values

The three combined sources in the following view consist of historical data in Amazon S3 for 1998 through 2001, current data local to Amazon Redshift for 2002, and live data for two months of 2003 in PostgreSQL. When you create this late binding view, you have to re-order Amazon Redshift Spectrum external table columns because the previous UNLOAD operation specifying ss_sold_date_sk as partition key shifted that column’s order to last. See the following code:

CREATE VIEW store_sales_integrated AS
SELECT * FROM uswest.store_sales_live
UNION ALL
SELECT * FROM tpcds.store_sales_current
UNION ALL
SELECT ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, 
       ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, 
       ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, 
       ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, 
       ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, 
       ss_net_paid_inc_tax, ss_net_profit
FROM spectrum.store_sales_historical
WITH NO SCHEMA BINDING;

You can now run a query on the view to aggregate data by date and join tables across the three sources. See the following code:

dev=# SELECT extract(year from b.d_date), count(a.ss_sold_date_sk)
FROM store_sales_integrated a
JOIN tpcds.date_dim b on (a.ss_sold_date_sk = b.d_date_sk)
GROUP BY 1
ORDER BY 1
;
 date_part |   count
-----------+------------
      1998 | 1632403114
      1999 | 1650163390
      2000 | 1659168880
      2001 | 1641184375
      2002 | 1650209644
      2003 |   17994540
(6 rows)

Time: 77624.926 ms (01:17.625)

The preceding federated query ran on a two-node DC2.8XL cluster and took 1 minute and 17 seconds to join store sales in Amazon S3, PostgreSQL, and Amazon Redshift with the date dimension table in Amazon Redshift, aggregating and sorting row counts by year. The following EXPLAIN output shows its query plan:

dev=# EXPLAIN SELECT extract(year from b.d_date), count(a.ss_sold_date_sk)
FROM store_sales_integrated a
JOIN tpcds.date_dim b on (a.ss_sold_date_sk = b.d_date_sk)
GROUP BY 1
ORDER BY 1;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Merge  (cost=1461036320912.29..1461036321094.91 rows=73049 width=8)
   Merge Key: "date_part"('year'::text, b.d_date)
   ->  XN Network  (cost=1461036320912.29..1461036321094.91 rows=73049 width=8)
         Send to leader
         ->  XN Sort  (cost=1461036320912.29..1461036321094.91 rows=73049 width=8)
               Sort Key: "date_part"('year'::text, b.d_date)
               ->  XN HashAggregate  (cost=461036314645.93..461036315011.18 rows=73049 width=8)
                     ->  XN Hash Join DS_DIST_ALL_NONE  (cost=913.11..428113374829.91 rows=6584587963204 width=8)
                           Hash Cond: ("outer".ss_sold_date_sk = "inner".d_date_sk)
                           ->  XN Subquery Scan a  (cost=0.00..263498674836.70 rows=6584587963204 width=4)
                                 ->  XN Append  (cost=0.00..197652795204.66 rows=6584587963204 width=4)
                                       ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..539836.20 rows=17994540 width=4)
                                             ->  XN PG Query Scan store_sales_live  (cost=0.00..359890.80 rows=17994540 width=4)
                                                   ->  Remote PG Seq Scan uswest.store_sales_live  (cost=0.00..179945.40 rows=17994540 width=4)
                                       ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..33004193.28 rows=1650209664 width=4)
                                             ->  XN Seq Scan on store_sales_current  (cost=0.00..16502096.64 rows=1650209664 width=4)
                                       ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..197619251175.18 rows=6582919759000 width=4)
                                             ->  XN Partition Loop  (cost=0.00..131790053585.18 rows=6582919759000 width=4)
                                                   ->  XN Seq Scan PartitionInfo of spectrum.store_sales_historical  (cost=0.00..10.00 rows=1000 width=4)
                                                   ->  XN S3 Query Scan store_sales_historical  (cost=0.00..131658395.18 rows=6582919759 width=0)
                                                         ->  S3 Seq Scan spectrum.store_sales_historical location:"s3://mysamplebucket/historical_store_sales" format:PARQUET (cost=0.00..65829197.59 rows=6582919759 width=0)
                           ->  XN Hash  (cost=730.49..730.49 rows=73049 width=8)
                                 ->  XN Seq Scan on date_dim b  (cost=0.00..730.49 rows=73049 width=8)
(23 rows)

The query plan shows that full sequential scans run on the three source tables, returning a total of about 8.2 billion rows. Because Amazon Redshift Spectrum does not generate statistics for external tables, you manually set the numRows property to the row count for the historical data in Amazon S3. See the following code:

ALTER TABLE spectrum.store_sales_historical SET TABLE PROPERTIES ('numRows' = '6582919759');

You can join with another dimension table local to Amazon Redshift, this time the 30 million row customer table, and filter by column c_birth_country. See the following code:

dev=# SELECT extract(year from b.d_date), count(a.ss_sold_date_sk)
FROM store_sales_integrated a
JOIN tpcds.date_dim b on (a.ss_sold_date_sk = b.d_date_sk)
JOIN tpcds.customer c on (a.ss_customer_sk = c.c_customer_sk)
AND c.c_birth_country = 'UNITED STATES'
GROUP BY 1
ORDER BY 1
;
 date_part |  count
-----------+---------
      1998 | 7299277
      1999 | 7392156
      2000 | 7416905
      2001 | 7347920
      2002 | 7390590
      2003 |   81627
(6 rows)

Time: 77878.586 ms (01:17.879)

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Merge  (cost=1363288861214.20..1363288861396.83 rows=73049 width=8)
   Merge Key: "date_part"('year'::text, b.d_date)
   ->  XN Network  (cost=1363288861214.20..1363288861396.83 rows=73049 width=8)
         Send to leader
         ->  XN Sort  (cost=1363288861214.20..1363288861396.83 rows=73049 width=8)
               Sort Key: "date_part"('year'::text, b.d_date)
               ->  XN HashAggregate  (cost=363288854947.85..363288855313.09 rows=73049 width=8)
                     ->  XN Hash Join DS_DIST_ALL_NONE  (cost=376252.50..363139873158.03 rows=29796357965 width=8)
                           Hash Cond: ("outer".ss_sold_date_sk = "inner".d_date_sk)
                           ->  XN Hash Join DS_BCAST_INNER  (cost=375339.39..362394963295.79 rows=29796357965 width=4)
                                 Hash Cond: ("outer".ss_customer_sk = "inner".c_customer_sk)
                                 ->  XN Subquery Scan a  (cost=0.00..263498674836.70 rows=6584587963204 width=8)
                                       ->  XN Append  (cost=0.00..197652795204.66 rows=6584587963204 width=8)
                                             ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..539836.20 rows=17994540 width=8)
                                                   ->  XN PG Query Scan store_sales_live  (cost=0.00..359890.80 rows=17994540 width=8)
                                                         ->  Remote PG Seq Scan uswest.store_sales_live  (cost=0.00..179945.40 rows=17994540 width=8)
                                             ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..33004193.28 rows=1650209664 width=8)
                                                   ->  XN Seq Scan on store_sales_current  (cost=0.00..16502096.64 rows=1650209664 width=8)
                                             ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..197619251175.18 rows=6582919759000 width=8)
                                                   ->  XN Partition Loop  (cost=0.00..131790053585.18 rows=6582919759000 width=8)
                                                         ->  XN Seq Scan PartitionInfo of spectrum.store_sales_historical  (cost=0.00..10.00 rows=1000 width=4)
                                                         ->  XN S3 Query Scan store_sales_historical  (cost=0.00..131658395.18 rows=6582919759 width=4)
                                                               ->  S3 Seq Scan spectrum.store_sales_historical location:"s3://mysamplebucket/historical_store_sales" format:PARQUET (cost=0.00..65829197.59 rows=6582919759 width=4)
                                 ->  XN Hash  (cost=375000.00..375000.00 rows=135755 width=4)
                                       ->  XN Seq Scan on customer c  (cost=0.00..375000.00 rows=135755 width=4)
                                             Filter: ((c_birth_country)::text = 'UNITED STATES'::text)
                           ->  XN Hash  (cost=730.49..730.49 rows=73049 width=8)
                                 ->  XN Seq Scan on date_dim b  (cost=0.00..730.49 rows=73049 width=8)
(28 rows)

Query performance hardly changed from the previous query. Because the query only scanned one column (ss_sold_date_sk), it benefits from Parquet’s columnar structure for the historical data subquery. To put it another way, if the historical data is stored as CSV, all the data is scanned, which degrades performance significantly.

Additionally, the TPC-DS model does not store date values in the store_sales fact table; instead, a foreign key references the date_dim table. If you plan to implement something similar and frequently filter by a date column, consider adding that column to the fact table and using it as a sort key, and also adding it as a partitioning column in Amazon Redshift Spectrum. That way, Amazon Redshift can skip blocks more efficiently for local data, prune partitions for Amazon S3 data, and push filtering criteria down to Amazon Redshift Spectrum. A minimal sketch of this layout follows.
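The following sketch assumes a new ss_sold_date date column alongside the existing surrogate key; the table names with a _v2 suffix and the S3 location are hypothetical placeholders, and most columns are omitted for brevity.

-- Local Amazon Redshift fact table, with the date column as the sort key
CREATE TABLE tpcds.store_sales_current_v2 (
  ss_sold_date     date,
  ss_sold_date_sk  int,
  ss_item_sk       int,
  ss_sales_price   numeric(7,2)
  -- remaining columns omitted for brevity
)
SORTKEY (ss_sold_date);

-- Amazon Redshift Spectrum external table, partitioned by the same date column
CREATE EXTERNAL TABLE spectrum.store_sales_historical_v2 (
  ss_sold_date_sk  int,
  ss_item_sk       int,
  ss_sales_price   numeric(7,2)
)
PARTITIONED BY (ss_sold_date date)
STORED AS PARQUET
LOCATION 's3://mysamplebucket/historical_store_sales_by_date/';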

Conclusion

Applications of live data integration in real-world scenarios include data discovery, data preparation for machine learning, operational analytics, IoT telemetry analytics, fraud detection, and compliance and security audits. Whereas Amazon Redshift Spectrum extends the reach of Amazon Redshift into the AWS data lake, Federated Query extends its reach into operational databases and beyond.

For more information about data type differences between these databases, see Data Type Differences Between Amazon Redshift and Supported RDS PostgreSQL or Aurora PostgreSQL Databases. For more information about accessing federated data with Amazon Redshift, see Limitations and Considerations When Accessing Federated Data with Amazon Redshift.

 


About the Authors

Tito Mijares is a Data Warehouse Specialist Solutions Architect at AWS. He helps AWS customers adopt and optimize their use of Amazon Redshift. Outside of AWS, he enjoys playing jazz guitar and working on audio recording and playback projects.

 

 

 

Entong Shen is a Senior Software Development Engineer for AWS Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

 

 

 

Niranjan Kamat is a software engineer on the Amazon Redshift query processing team. His focus of PhD research was in interactive querying over large databases. In Redshift, he has worked in different query processing areas such as query optimization, analyze command and statistics, and federated querying. In his spare time, he enjoys playing with his three year old daughter, practicing table tennis (was ranked in top 10 in Ohio, USATT rating 2143), and chess.

 

 

Sriram Krishnamurthy is a Software Development Manager for AWS Redshift Query Processing team. He is passionate about Databases and has been working on Semi Structured Data Processing and SQL Compilation & Execution for over 15 years. In his free time, you can find him on the tennis court, often with his two young daughters in tow.

Improved speed and scalability in Amazon Redshift

Post Syndicated from Naresh Chainani original https://aws.amazon.com/blogs/big-data/improved-speed-and-scalability-in-amazon-redshift/

Since Amazon Redshift launched in 2012, its focus has always been on providing you with the best possible performance, at scale, and at the lowest possible cost. For most organizations across the globe, the amount of data going into their data warehouse is growing exponentially, and the number of people who want insights from that data increases daily. Because of this, Amazon Redshift is continually innovating to handle this growing volume of data and demand for insights.

Amazon Redshift delivers fast performance, at scale, for the most demanding workloads. Getting there was not easy, and it takes consistent investment across a variety of technical focus areas to make this happen. This post breaks down what it takes to build the world’s fastest cloud data warehouse.

Amazon Redshift’s performance investments are based on a broad spectrum of customer needs and use extensive fleet telemetry data from tens of thousands of customers to guide development work. Your needs may vary based on a variety of factors, including the size of your data warehouse, the number of concurrent users, the skillset of those users, and workload characteristics such as frequency of usage and query latency requirements.

Based on this, Amazon Redshift has performance investments anchored in four key areas:

  • Out-of-the-box performance – Amazon Redshift is over twice as fast out of the box as it was 6 months ago, and keeps getting faster without any additional manual optimization and tuning
  • Automatic optimizations – Amazon Redshift is self-learning, self-optimizing, and constantly adapts to your actual workload to deliver the best possible performance
  • Scalability – Amazon Redshift can boost throughput to be 35 times greater to support increases in concurrent users and scales linearly for a wide range of workloads
  • Price-performance – Amazon Redshift provides predictable performance at significantly lower total cost than other data warehouse solutions by optimizing resource utilization

Out-of-the-box performance

Amazon Redshift has always enabled you to manually tune workload performance based on statically partitioning memory and specifying the number of concurrent queries. Amazon Redshift also uses advanced machine learning (ML) algorithms to tune configuration settings automatically for optimal data warehouse performance.

These algorithms are extremely effective because they are trained with real-world telemetry data generated from processing over two exabytes of data a day. Additionally, because Amazon Redshift has more customers than any other cloud data warehouse, the telemetry data makes the configuration recommendations from the ML algorithms even more accurate.

In addition, Amazon Redshift Advisor guides optimization by recommending sort keys, distribution keys, and more. However, if you want to override the self-learning, self-tuning behavior of Amazon Redshift, you still have fine-grained control.

The out-of-the-box performance of Amazon Redshift is continually improving. In November 2019, our Cloud Data Warehouse benchmark[1] showed that the out-of-the-box performance of Amazon Redshift was twice as fast as 6 months ago. This ongoing improvement in performance is the culmination of many technical innovations. This post presents three improvements that have had the most impact.

Improved compression

Amazon Redshift uses AZ64, a novel compression encoding, which delivers high compression ratios and significantly improved query performance. With AZ64, you no longer need to make the trade-off between storage size and performance. AZ64 compresses data, on average, 35% more than the popular, high-performance LZO algorithm, and processes the data 40% faster. AZ64 achieves this by efficiently compressing small groups of data values and uses CPU vector instructions and single instruction, multiple data (SIMD) for parallel processing of AZ64-encoded columns. To benefit from this, you simply select the data type for each column, and Amazon Redshift chooses the compression encoding method. Five months after launch, the AZ64 encoding has become the fourth most popular encoding option in Amazon Redshift with millions of columns.
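Amazon Redshift picks the encoding automatically for supported data types, but you can also specify AZ64 explicitly per column if you prefer. The following minimal sketch uses illustrative table and column names; AZ64 applies to numeric, date, and time types, so the varchar column uses a different encoding:

CREATE TABLE sales_events (
  event_id    bigint       ENCODE az64,
  event_time  timestamp    ENCODE az64,
  quantity    int          ENCODE az64,
  notes       varchar(256) ENCODE lzo   -- AZ64 does not apply to character types
);

-- Inspect the encoding that each column ended up with
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'sales_events';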

Efficient large-scale join operations

When complex queries join large tables, massive amounts of data transfers over the network for the join processing on the Amazon Redshift compute nodes. This used to create network bottlenecks that impacted query performance.

Amazon Redshift now uses distributed bloom filters to enable high-performance joins for such workloads. Distributed bloom filters efficiently filter rows at the source that do not match the join relation, which greatly reduces the amount of data transferred over the network. Across the fleet, we see millions of selective bloom filters deployed each day, with some of them filtering more than one billion rows. This removes the network from being the bottleneck and improves overall query performance. Amazon Redshift automatically determines what queries are suitable for this optimization and adapts the optimization at runtime. In addition, Amazon Redshift uses CPU cache-optimized processing to make query execution more efficient for large-scale joins and aggregations. Together, these features improve performance for over 70% of the queries that Amazon Redshift processes daily.

Enhanced query planning

Query planning determines the quickest way to process a given query by evaluating the costs associated with various query plans. These costs include many factors, for example, the number of I/O operations required, the amount of disk buffer space, the time to read from disk, the parallelism available for the query, and whether table statistics such as the number of rows and the number of distinct values in a column are fresh and relevant.

The Amazon Redshift query planner factors in the capabilities of modern hardware, including the network stack, to take full advantage of the performance that the hardware offers. Its statistic collection process is automated, and it computes statistics by using algorithms like HyperLogLog (HLL), which improves the quality of statistics and therefore enables the cost-based planner to make better choices.
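Statistics collection is automated, but you can still check how fresh the statistics for a given table are; one way, using the svv_table_info system view (the table name here is just a placeholder), is the following sketch:

-- stats_off close to 0 means statistics are current; if it grows large,
-- running ANALYZE refreshes them (Amazon Redshift also does this automatically)
SELECT "table", tbl_rows, stats_off
FROM svv_table_info
WHERE "table" = 'store_sales';

ANALYZE store_sales;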

Automatic optimizations

Amazon Redshift uses ML techniques and advanced graph algorithms to continuously improve performance. Different workloads have different data access patterns, and Amazon Redshift observes the workload to learn and adapt. It automatically adjusts data layout, distribution keys, and query plans to provide optimal performance for a given workload.

The automatic optimizations in Amazon Redshift make intelligent decisions, such as how to distribute data across nodes, which datasets are frequently queried together and should be co-located, how to sort data, and how many parallel and concurrent queries should run on the system based on the complexity of the query. Amazon Redshift automatically adjusts these configurations to optimize throughput and performance as you use the data warehouse. To make sure these optimizations do not interrupt your workload, Amazon Redshift runs them incrementally and only during the periods when clusters have low utilization.

For maintenance operations, Amazon Redshift reduces the amount of compute resources required by operating only on frequently accessed tables and portions within those tables. Amazon Redshift prioritizes which portions of the table to operate on by analyzing query patterns. When relevant, it provides prescriptive guidance through recommendations in Amazon Redshift Advisor. You can then evaluate and apply those recommendations as needed.

AWS also constantly evaluates and assesses how effective its ML-based self-learning systems are in delivering fast query performance when compared to traditional methods such as expert human tuning.

Amazon Redshift has been adding automatic optimizations for years. The following timeline shows some of the automatic optimizations delivered over the last 12 months. 

For more information about specific optimizations, see the following posts about automatic VACUUM DELETE, automatic ANALYZE, distribution key recommendations, sort key recommendations, automatic table sort, and automatic distribution style.

Scalability

Amazon Redshift can boost throughput by more than 35 times to support increases in concurrent users, and scales linearly for simple and mixed workloads.

Scaling to support a growing number of users

As the number of users accessing the data warehouse grows, you should not experience delays in your query responses. Most of the time, your workloads are not predictable and can vary throughout the day. Traditional data warehouses are typically provisioned for peak usage to avoid delays and missed SLAs, and you end up paying for resources that you are not fully using.

Concurrency Scaling in Amazon Redshift allows the data warehouse to handle spikes in workloads while maintaining consistent SLAs by elastically scaling the underlying resources as needed. Amazon Redshift continuously monitors the designated workload. If the queries start to get backlogged because of bursts of user activity, Amazon Redshift automatically adds transient cluster capacity and routes the requests to these new clusters. This transient capacity is available in a few seconds, so your queries continue to be served with low latency. Amazon Redshift removes the additional transient capacity automatically when activity reduces on the cluster.

With a simple one-step configuration, you can choose whether to elastically scale for certain workloads and by how much. For every 24 hours that the main Amazon Redshift cluster is in use, you accrue one hour of Concurrency Scaling credit, which makes concurrency scaling free for more than 97% of use cases.

With the ability to automatically add and remove additional capacity, Amazon Redshift data warehouses can improve overall throughput by over 35 times. The following experiment demonstrates how far you can dynamically allocate compute power to satisfy the demands of concurrent users. First, take a baseline measurement using the Cloud Data Warehouse benchmark with five concurrent users. Then enable Concurrency Scaling and add more users with each iteration. As soon as Amazon Redshift detects queuing, it automatically allocates additional scaling clusters. Ultimately, this experiment ran over 200 concurrent queries on Amazon Redshift and achieved more than 35 times greater throughput. That many concurrently executing queries represents a user population of several thousand, which demonstrates how you can support virtually unlimited concurrent users on your Amazon Redshift data warehouse.

The scaling for concurrent users is also linear. You get consistent increases in performance with every extra dollar spent on the Concurrency Scaling clusters. This helps to keep data warehouse costs predictable as business needs grow. With Concurrency Scaling, AWS can perform benchmark tests with tens of thousands of queries per hour, with hundreds of queries running concurrently and providing linear scale. This represents a real-world workload in enterprises with thousands of concurrent users connecting to the data warehouse.

Scaling while running multiple mixed workloads

As data warehouses grow over time, the number and complexity of the workloads that run on the data warehouse also increase. For example, if you migrate from an on-premises data warehouse to Amazon Redshift, you might first run traditional analytics workloads, and eventually bring more operational and real-time data into the cluster to build new use cases and applications. To scale any data warehouse effectively, you must be able to prioritize and manage multiple types of workloads concurrently. Automatic workload management (WLM) and query priorities are two recent capabilities added to Amazon Redshift that enable you to do just that.

Automatic WLM makes sure that you use cluster resources efficiently, even with dynamic and unpredictable workloads. With automatic WLM, Amazon Redshift uses ML to classify incoming queries based on their memory and processing requirements and routes them to appropriate internal queues to start executing in parallel. Amazon Redshift dynamically adjusts the number of queries to execute in parallel to optimize overall throughput and performance. When queries that require large amounts of resources are in the system (for example, hash joins between large tables), the concurrency is lower. When you submit lighter queries (such as inserts, deletes, or simple aggregations), concurrency is higher. There is a feedback loop to continuously monitor system utilization and regulate admission into the cluster.
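
The sketch below is a conceptual illustration of such a feedback loop, not Amazon Redshift's actual algorithm: a crude classifier stands in for the ML model, and the utilization target, per-query costs, and query metadata are invented. Heavy queries consume more of the admission budget, so fewer of them run concurrently, while light queries are admitted at higher concurrency.

```python
import collections

HEAVY, LIGHT = "heavy", "light"

def classify(query):
    # Stand-in for the ML classifier: here, classification uses only estimated memory needs.
    return HEAVY if query["estimated_memory_mb"] > 1_000 else LIGHT

def admit(pending, running, utilization, target=0.8):
    """Admit queued queries while predicted utilization stays under the target."""
    admitted = []
    while pending and utilization < target:
        query = pending[0]
        cost = 0.3 if classify(query) == HEAVY else 0.05
        if utilization + cost > target:
            break                          # leave it queued; try again when load drops
        pending.popleft()
        utilization += cost
        running.append(query)
        admitted.append(query)
    return admitted, utilization

pending = collections.deque([
    {"id": 1, "estimated_memory_mb": 4_000},   # large hash join
    {"id": 2, "estimated_memory_mb": 50},      # simple insert
    {"id": 3, "estimated_memory_mb": 80},      # light aggregation
])
admitted, util = admit(pending, running=[], utilization=0.5)
print([q["id"] for q in admitted], f"utilization now {util:.2f}")   # [1] utilization now 0.80
```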

However, not all queries may be equally important to you; the performance of one workload or set of users might be more important than others. Query priorities in Amazon Redshift address this. You can give higher-priority workloads preferential treatment, including more resources during busy times, for consistent query performance. Amazon Redshift workload management uses intelligent algorithms to make sure that lower-priority queries continue to make progress and don’t stall.

You can combine Amazon Redshift Concurrency Scaling and automatic WLM with query priorities to solve complex data warehouse use cases. For example, the following table summarizes an Amazon Redshift configuration that effectively mixes ETL with analytics workloads.

| WLM queue | Queue priority | Concurrency Scaling | Notes |
| --- | --- | --- | --- |
| ETL | High | Off | When ETL runs, it gets the highest priority |
| BI queries | Normal | On | When the BI workload suddenly increases, Concurrency Scaling adds capacity to maintain user SLAs |
| One-time or exploratory queries | Low | Off | The cluster offers analytic access for casual users and data scientists when resources are available |

For this use case, and many more, you can maintain SLAs, achieve efficiencies with your cluster utilization, and get sufficient flexibility to invest according to business priorities.
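
As a concrete sketch of how such a policy could be applied, the example below sets the cluster parameter group's wlm_json_configuration parameter through the AWS SDK for Python (boto3). The parameter group name and user group names are hypothetical, and the exact WLM JSON fields and values should be verified against the current Amazon Redshift documentation before use.

```python
import json
import boto3

# Hypothetical WLM configuration mirroring the preceding table; user group names are made up,
# and the field names and values should be checked against the Amazon Redshift WLM documentation.
wlm_config = [
    {"user_group": ["etl_users"],   "priority": "high",   "concurrency_scaling": "off",  "queue_type": "auto"},
    {"user_group": ["bi_users"],    "priority": "normal", "concurrency_scaling": "auto", "queue_type": "auto"},
    {"user_group": ["adhoc_users"], "priority": "low",    "concurrency_scaling": "off",  "queue_type": "auto"},
    {"auto_wlm": True},             # default queue with automatic WLM enabled
]

redshift = boto3.client("redshift")
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-analytics-params",   # hypothetical parameter group attached to the cluster
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
```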

Price performance

You can measure price performance by calculating both the total cost of the computing service consumed and the total amount of computing work performed. Maximum performance for minimum cost gives you the best price performance.

As your data warehouses grow, Amazon Redshift gets more efficient. It moderates cost increases and keeps costs predictable, even as your needs grow. This sets Amazon Redshift apart from other offerings in the market whose prices increase much more steeply as the number of users grows.

The investments in automatic optimizations, out-of-the-box performance, and scale all contribute to the unbeatable price performance that Amazon Redshift offers. When you compare typical customer quotes and investments, you find that Amazon Redshift costs 50–75% less than other cloud data warehouses.

Measuring performance

AWS measures performance, throughput, and price-performance on a nightly basis. AWS also runs larger and more comprehensive benchmarks regularly to make sure the tests extend beyond your current needs. For benchmark results to be useful, they need to be well defined and easily reproducible. AWS uses the Cloud DW benchmark, which is based on the current TPC-DS and TPC-H benchmarks[1] without any query or data modifications and is compliant with TPC rules and requirements.

It’s important that anyone can reproduce these benchmarks; you can download the benchmark code and scripts from GitHub and the accompanying dataset from a public Amazon S3 bucket.

Summary

Amazon Redshift is self-learning and self-optimizing, and it consistently uses telemetry from your actual workload to deliver the best possible performance. Amazon Redshift is more than twice as fast out of the box as it was 6 months ago, and it keeps getting faster without any manual optimization and tuning. Amazon Redshift can boost throughput by more than 35 times to support increases in concurrent users, and it scales linearly for simple and mixed workloads.

In addition to software improvements, AWS continues to build data warehouses on the best hardware available. The new RA3 node type with managed storage features high-bandwidth networking and sizable high-performance SSDs as local caches. RA3 nodes use your workload patterns and advanced data management techniques, such as automatic fine-grained data eviction and intelligent data pre-fetching, to deliver the performance of local SSD while scaling storage automatically to Amazon S3. AQUA (Advanced Query Accelerator), currently in preview, brings even more dramatic hardware-based performance gains and drives costs down with a new distributed and hardware-accelerated cache.

These performance improvements are the cumulative result of years of strategic and sustained product investment and technical innovation across multiple areas such as automatic optimizations, out-of-the-box performance, and scalability. Additionally, price-performance remains a priority so you receive the best value.

Each dataset and workload has unique characteristics, and a proof of concept is the best way to understand how Amazon Redshift performs in your unique situation. When running your own proof of concept, it’s important that you focus on the right metrics—query throughput (number of queries per hour) and price-performance for your workload.
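
As a worked example of these two metrics, the short sketch below computes query throughput and a simple price-performance figure from a hypothetical test run; all numbers are invented and only illustrate the arithmetic.

```python
# Hypothetical proof-of-concept results; every number here is invented for illustration.
queries_completed = 4_200        # queries finished during the test window
test_window_hours = 2.0
cluster_cost_per_hour = 13.04    # assumed hourly price of the tested configuration

throughput_qph = queries_completed / test_window_hours
cost_of_run = cluster_cost_per_hour * test_window_hours
price_performance = throughput_qph / cost_of_run      # queries per hour, per dollar spent

print(f"throughput:        {throughput_qph:,.0f} queries/hour")
print(f"cost of test run:  ${cost_of_run:,.2f}")
print(f"price-performance: {price_performance:,.1f} queries/hour per dollar")
```
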
You can make a data-driven decision by requesting assistance with a proof of concept or working with a system integration and consulting partner. It’s also important to consider not only how a data warehouse performs with your current needs, but also its future performance with increasingly complex workloads, datasets, and users.

To stay up-to-date with the latest developments in Amazon Redshift, subscribe to the What’s New in Amazon Redshift RSS feed.

———
[1] The TPC Benchmark, TPC-DS and TPC-H are trademarks of the Transaction Processing Performance Council www.tpc.org


About the Author

Naresh Chainani is a Senior Software Development Manager at Amazon Redshift where he leads Query Processing, Query Performance, Distributed Systems and Workload Management with a strong team. Naresh is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.


Berni Schiefer is a Senior Development Manager for EMEA at Amazon Web Services, leading the Amazon Redshift development team in Berlin, Germany. The Berlin team focuses on Redshift Performance and Scalability, SQL Query Compilation, Spatial Data support, and Redshift Spectrum. Previously, Berni was an IBM Fellow working in the area of Private Cloud, Db2, Db2 Warehouse, BigSQL, with a focus on SQL-based engines, query optimization and performance.


Neeraja Rentachintala is a seasoned Product Management and GTM leader at Amazon Web Services, bringing over 20 years of experience in product vision, strategy, and leadership roles for industry-leading data products and platforms. During her career, she delivered products in analytics, big data, databases, data and application integration, and AI/ML, serving Fortune 500 enterprises and ventures including MapR (acquired by HPE), Microsoft, Oracle, Informatica, and Expedia.com. Currently, Neeraja is a Principal Product Manager with Amazon Web Services building Amazon Redshift, the world’s most popular, highest performance, and most scalable cloud data warehouse. Neeraja earned a Bachelor of Technology in Electronics and Communication Engineering from the National Institute of Technology in India and various business program certifications from the University of Washington, MIT Sloan School of Management, and Stanford University.

Use MAP for Windows to Simplify your Migration to AWS

Post Syndicated from Fred Wurden original https://aws.amazon.com/blogs/compute/use-map-for-windows-to-simplify-your-migration-to-aws/

There’s no question that organizations today are being disrupted in their industry. In a previous blog post, I shared that such disruption often accelerates organizations’ decisions to move to the cloud. When these organizations migrate to the cloud, Windows workloads are often critical to their business and these workloads require a performant, reliable, and secure cloud infrastructure. Customers tell us that reducing risk, building cloud expertise, and lowering costs are important factors when choosing that infrastructure.

Today, we are announcing the general availability of the Migration Acceleration Program (MAP) for Windows, a comprehensive program that helps you execute large-scale migrations and modernizations of your Windows workloads on AWS. We have millions of customers on AWS, and have spent the last 11 years helping Windows customers successfully move to our cloud. We’ve built a proven methodology, providing you with AWS services, tools, and expertise to help simplify the migration of your Windows workloads to AWS. MAP for Windows provides prescriptive guidance, consulting support from experts, tools, trainings, and service credits to help reduce the risk and cost of migrating to the cloud as you embark on your migration journey.

MAP for Windows also helps you along the pathways to modernize current and legacy versions of Windows Server and SQL Server to cloud native and open source solutions, enabling you to break free from commercial licensing costs. With the strong price-performance of open-source solutions and the proven reliability of AWS, you can innovate quickly while reducing your risk.

With MAP for Windows, you follow a simple three-step process for your migration:

  1. Assess Your Readiness: The migration readiness assessment helps you identify gaps along the six dimensions of the AWS Cloud Adoption Framework: business, process, people, platform, operations, and security. This assessment helps customers identify capabilities required in the migration. MAP for Windows also includes an Optimization and Licensing Assessment, which provides recommendations on how to optimize your licenses on AWS.
  2. Mobilize Your Resources: The mobilize phase helps you build an operational foundation for your migration, with the goal of fixing the capability gaps identified in the assessment phase. The mobilize phase accelerates your migration decisions by providing clear guidance on migration plans that improve the success of your migration.
  3. Migrate or Modernize Your Workloads: APN Partners and the AWS ProServe team help customers execute the large-scale migration plan developed during the mobilize phase. MAP for Windows also offers financial incentives to help you offset migration costs such as labor, training, and the expense of sometimes running two environments in parallel.

MAP for Windows includes support from AWS Professional Services and AWS Migration Competency Partners, such as Rackspace, 2nd Watch, Accenture, Cloudreach, Enimbos Global Services, Onica, and Slalom. Our MAP for Windows partners have successfully demonstrated completion of multiple large-scale migrations to AWS. They have received the APN Migration Competency Partner and the Microsoft Workloads Competency designations.

Learn about what MAP for Windows can do for you on this page. Learn also about the migration experiences of AWS customers. And contact us to discuss your Windows migration or modernization initiatives and apply to MAP for Windows.

About the Author

Fred Wurden is the GM of Enterprise Engineering (Windows, VMware, Red Hat, SAP, benchmarking), working to make AWS the most customer-centric cloud platform on Earth. Prior to AWS, Fred worked at Microsoft for 17 years, where he held positions including EU/DOJ engineering compliance for Windows and Azure, interoperability principles and partner engagements, and open source engineering. He lives with his wife and a few four-legged friends since his kids are all in college now.