This is a guest post co-written with Siva Bangaru and Leon Liu from ADP.
ADP helps organizations of all types and sizes by providing human capital management (HCM) solutions that unite HR, payroll, talent, time, tax, and benefits administration. ADP is a leader in business outsourcing services, analytics, and compliance expertise. ADP’s unmatched experience, deep insights, and cutting-edge technology have transformed human resources from a back-office administrative function to a strategic business advantage.
People Analytics powered by ADP DataCloud is an application that provides analytics and enhanced insights to ADP’s clients. It delivers a guided analytics experience that make it easy for you to create, use, and distribute tailored analytics for your organization. ADP People Analytics’s streamlined, configurable dashboards can help you identify potential issues in key areas, like overtime, turnover, compensation, and much more.
ADP provides this analytics experience to thousands of clients today. Securing customers’ data is a top priority for ADP. The company requires the highest security standards when implementing a multi-tenant analytics platform on Amazon Redshift.
ADP DataCloud integrates with Amazon Redshift row-level security (RLS) to implement granular data entitlements and enforce the access restrictions on their tables in Amazon Redshift.
In this post, we discuss how the ADP DataCloud team implemented Amazon Redshift RLS on the foundation of role-based access control (RBAC) to simplify managing privileges required in a multi-tenant environment, and also enabled and enforced access to granular data entitlements in business terms.
The ADP DataCloud team had the following key requirements and challenges:
- Support a multi-tenant application to enforce a logical separation of each tenant’s data rows
- Support dynamic provisioning of new tenants
- Minimal impact on performance
Row-level security in Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. One of the challenges with security is that enterprises want to provide fine-grained access control at the row level for sensitive data. This can be done by creating views or using different databases and schemas for different users. However, this approach isn’t scalable and becomes complex to maintain over time, especially when supporting multi-tenant environments.
In early 2022, Amazon Redshift released row-level security, which is built on the foundation of role-based access control. RLS allows you to control which users or roles can access specific records of data within tables, based on security policies that are defined at the database object level. This new RLS capability in Amazon Redshift enables you to dynamically filter existing rows of data in a table along with session context variable setting capabilities to dynamically assign the appropriate tenant configuration. This is in addition to column-level access control, where you can grant users permissions to a subset of columns. Now you can combine column-level access control with RLS policies to further restrict access to particular rows of visible columns. Refer to Achieve fine-grained data security with row-level access control in Amazon Redshift for additional details.
As part of ADP’s key requirements to support a multi-tenant data store wherein a single table holds data of multiple tenants, enforcement of security policies to ensure no cross-tenant data access is of paramount importance. One obvious way to ensure this is by creating database users for each tenant and implementing RLS policies to filter a single tenant’s data as per the logged-in user. But this can be tedious and become cumbersome to maintain as the number of tenants grow by the thousands.
This post presents another way to handle this use case by combining session context variables and RLS policies on tables to filter a single tenant’s data, thereby easing the burden of creating and maintaining thousands of database users. In fact, a single database user is all that is needed to connect and query different tenant’s data in different sessions from a multi-tenant table by setting different values to a session context variable in each session, as shown in the following diagram.
Let’s start by covering the high-level implementation steps. Consider there is a database user in Amazon Redshift
app_user (which is neither a super user, nor has the
sys:secadmin role granted, nor has the
IGNORE RLS system privilege granted via another role). The user
app_user owns a schema with the same name and all objects in it. The following is a typical multi-tenant table
employee in the
app_user schema with some sample records shown in the table:
To implement that, the following steps are required:
- Create a RLS policy on a column using a predicate that is set using a session context variable.
- Enable RLS at the table level and attach the RLS policy on the table.
- Create a stored procedure that sets the session context variable used in the RLS policy predicate.
- Connect and call the stored procedure to set the session context variable and query the table.
Now RLS can be enabled on this table in such a way that whenever
app_user queries the employee table, the user will either see no rows or retrieve only rows specific to single tenant despite being the owner of the table.
An administrator, such as
app_admin, either a super user or a user that has the
sys:secadmin role, can enforce this as follows:
- Create a RLS policy that attaches a
tenant_idpredicate using a session context variable:
- Enable RLS and attach the policy on the employee table:
- Create a stored procedure to set the
tenant_idin a session variable and grant access to
- Connect to
app_userand call the stored procedure to set the session context variable:
When this setup is complete, whenever tenants are connecting to ADP Analytics dashboards, it connects as
app_user and runs stored procedures by passing
tenant_id, which sets the session context variable using the tenant ID. In this case, when requests come to connect and query the
employee table, the user will experience the following scenarios:
- No data is retrieved if
current_setting('app_context.tenant_id')is not set or is null
- Data is retrieved if
current_setting('app_context.tenant_id')is set by calling the
app_admin.set_app_context(varchar)procedure to a value that exists in the
employeetable (for example,
No data is retrieved if
current_setting('app_context.tenant_id') is set to a value that doesn’t exist in the
employee table (for example,
Validate RLS by examining query plans
Now let’s review the preceding scenarios by running an explain plan and observing how RLS works for the test setup. If a query contains a table that is subject to RLS policies,
EXPLAIN displays a special RLS
SecureScan node. Amazon Redshift also logs the same node type to the
STL_EXPLAIN system table.
EXPLAIN doesn’t reveal the RLS predicate that applies to the
employee table. To view an explain plan with RLS predicate details, the
EXPLAIN RLS system privilege is granted to
app_user via a role.
In this first scenario,
tenant_id wasn’t set by the stored procedure and was passed as a null value, therefore below select statement returns no rows .
Explain plan output shows the filter as NULL:
In the second scenario,
tenant_id was set by the stored procedure and passed as a value of
T0001, therefore returning only corresponding rows as shown in the explain plan output:
Call stored procedure to set the session context variable as ‘T0001’ and then run the select :
Explain plan output shows the filter on
tenant_id as ‘T0001’
In the third scenario, a non-existing
tenant_id was set by the stored procedure, therefore returning no rows:
Another key point is that you can apply the same policy to multiple tables as long as they have the column (
tenant_id varchar(50)) defined with the same data type, because RLS polices are strongly typed in Amazon Redshift. Similarly, you can combine multiple RLS policies defined using different session context variables or other relevant column predicates and attach them to a single table.
Also, this RLS implementation doesn’t need any changes when a new tenant’s data is added to the table, because it can be queried by simply setting the new tenant’s identifier in the session context variable that is used to define the filter predicate inside the RLS policy. A tenant to its corresponding identifier mapping is typically done during an application’s tenant onboarding process and is generally maintained in a separate metastore, which is also referred to during each tenant’s login to get the tenant’s identifier. With that, thousands of tenants could be provisioned without needing to change any policy in Amazon Redshift. In our testing, we found no performance impact by tenants after RLS was implemented.
In this post, we demonstrated how the ADP DataCloud team implemented row-level security in a multi-tenant environment for thousands of customers using Amazon Redshift RLS and session context variables. For more information about RLS best practices, refer to Amazon Redshift security overview.
Try out RLS for your future Amazon Redshift implementations, and feel free to leave a comment about your use cases and experience.
About the authors
Siva Bangaru is a Database Architect at ADP. He has more than 13 years of experience with technical expertise on design, development, administration, and performance tuning of database solutions for a variety of OLAP and OLTP use cases on multiple database engines like Oracle, Amazon Aurora PostgreSQL, and Amazon Redshift.
Leon Liu is a Chief Architect at ADP. He has over 20 years of experience with enterprise application framework, architecture, data warehouses, and big data real-time processing.
Neha Daudani is a Solutions Architect at AWS. She has 15 years of experience in the data and analytics space. She has enabled clients on various projects on enterprise data warehouses, data governance, data visualization, master data management, data modeling, and data migration for clients to use business intelligence and analytics in business growth and operational efficiency.
Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build next-generation analytics solutions using other AWS Analytics services.