
Set up cross-account AWS Glue Data Catalog access using AWS Lake Formation and AWS IAM Identity Center with Amazon Redshift and Amazon QuickSight

Post Syndicated from Poulomi Dasgupta original https://aws.amazon.com/blogs/big-data/set-up-cross-account-aws-glue-data-catalog-access-using-aws-lake-formation-and-aws-iam-identity-center-with-amazon-redshift-and-amazon-quicksight/

Most organizations manage their workforce identity centrally in external identity providers (IdPs) and comprise multiple business units that produce their own datasets and manage their lifecycle across multiple AWS accounts. These business units have varying landscapes, where a data lake is built on Amazon Simple Storage Service (Amazon S3) and analytics workloads run on Amazon Redshift, a fast, scalable, and fully managed cloud data warehouse that lets you process and run complex SQL analytics workloads on structured and semi-structured data.

Business units that create data products want to share them with other units, without copying the data, so that consumers can analyze them and derive insights. They also want tighter control over user access and the ability to audit access to their data products. To address this, enterprises usually catalog the datasets in the AWS Glue Data Catalog for data discovery and use AWS Lake Formation for fine-grained access control to adhere to the compliance and operating security model for their business units. Given the diverse range of services, fine-grained data sharing, and personas involved, these enterprises often want a streamlined experience for enterprise user identities when accessing their data using AWS Analytics services.

AWS IAM Identity Center enables centralized management of workforce user access to AWS accounts and applications using a local identity store or by connecting corporate directories using IdPs. Amazon Redshift and AWS Lake Formation are integrated with the new trusted identity propagation capability in IAM Identity Center, allowing you to use third-party IdPs such as Microsoft Entra ID (Azure AD), Okta, Ping, and OneLogin.

With trusted identity propagation, Lake Formation enables data administrators to directly provide fine-grained access to corporate users and groups, and simplifies the traceability of end-to-end data access across supported AWS services. Because access is managed based on a user’s corporate identity, end-users don’t need to use database local user credentials or assume an AWS Identity and Access Management (IAM) role to access data. Furthermore, this enables effective user permissions based on collective group membership and supports group hierarchy.

In this post, we cover how to enable trusted identity propagation with AWS IAM Identity Center, Amazon Redshift, and AWS Lake Formation residing in separate AWS accounts, and how to set up cross-account sharing of an S3 data lake for enterprise identities using Lake Formation so that the data can be analyzed with Amazon Redshift. Then we use Amazon QuickSight to build insights using Redshift tables as our data source.

Solution overview

This post covers a use case where an organization centrally manages corporate users within their IdP and where the users belong to multiple business units. Their goal is to enable centralized user authentication through IAM Identity Center in the management account, while keeping the business unit that analyzes data using a Redshift cluster and the business unit that produces data cataloged in the Data Catalog in separate member accounts. This allows them to maintain a single authentication mechanism through IAM Identity Center within the organization while retaining access control, resource, and cost separation through separate AWS accounts per business unit, and enables cross-account data sharing using Lake Formation.

For this solution, AWS Organizations is enabled in the central management account and IAM Identity Center is configured for managing workforce identities. The organization has two member accounts: one that manages the S3 data lake using the Data Catalog, and another that runs analytical workloads on Amazon Redshift and QuickSight, with all of these services enabled for trusted identity propagation. In member account 1, Amazon Redshift (and QuickSight on top of it) accesses the cross-account AWS Glue resources as the IAM Identity Center users and groups set up in the central management account. In member account 2, permissions on the AWS Glue resources are managed using Lake Formation and are shared with member account 1 using Lake Formation data sharing.

The following diagram illustrates the solution architecture.

The solution consists of the following:

  • In the centralized management account, we create a permission set and create account assignments for Redshift_Member_Account. We integrate users and groups from the IdP with IAM Identity Center.
  • Member account 1 (Redshift_Member_Account) is where the Redshift cluster and application exist.
  • Member account 2 (Glue_Member_Account) is where metadata is cataloged in the Data Catalog and Lake Formation is enabled with IAM Identity Center integration.
  • We assign permissions to two IAM Identity Center groups to access the Data Catalog resources:
    • awssso-sales – We apply column-level filtering for this group so that users belonging to this group will be able to select two columns and read all rows.
    • awssso-finance – We apply row-level filtering using data filters for this group so that users belonging to this group will be able to select all columns and see rows after row-level filtering is applied.
  • We apply different permissions for three IAM Identity Center users:
    • User Ethan, part of awssso-sales – Ethan will be able to select two columns and read all rows.
    • User Frank, part of awssso-finance – Frank will be able to select all columns and see rows after row-level filtering is applied.
    • User Brian, part of awssso-sales and awssso-finance – Brian inherits permissions defined for both groups.
  • We set up QuickSight in the same account where Amazon Redshift exists, enabling authentication using IAM Identity Center.

Prerequisites

You should have the required prerequisites already set up before you begin.

Member account 2 configuration

Sign in to the Lake Formation console as the data lake administrator. To learn more about setting up permissions for a data lake administrator, see Create a data lake administrator.

In this section, we walk through the steps to set up Lake Formation, enable Lake Formation permissions, and grant database and table permissions to IAM Identity Center groups.

Set up Lake Formation

Complete the steps in this section to set up Lake Formation.

Create AWS Glue resources

You can use an existing AWS Glue database that has a few tables. For this post, we use a database called customerdb and a table called reviews whose data is stored in the S3 bucket lf-datalake-<account-id>-<region>.
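If you don’t already have these resources, the following Athena DDL is one way to create them. This is a minimal, illustrative sketch rather than the schema used in the original post: only product_title and star_rating are referenced later, so adjust the remaining columns and the file format to match your dataset.

-- Illustrative sketch: create the customerdb database and a simplified reviews table
-- over data in the S3 location that will be registered with Lake Formation.
-- Columns other than product_title and star_rating are placeholders.
create database if not exists customerdb;

create external table if not exists customerdb.reviews (
    review_id     string,
    product_title string,
    star_rating   int,
    review_body   string,
    review_date   date
)
stored as parquet
location 's3://lf-datalake-<account-id>-<region>/reviews/';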

Register the S3 bucket location

Complete the following steps to register the S3 bucket location:

  • On the Lake Formation console, in the navigation pane, under Administration, choose Data lake locations.
  • Choose Register location.
  • For Amazon S3 location, enter the S3 bucket location that contains table data.
  • For IAM role, provide a user-defined IAM role. For instructions to create a user-defined IAM role, refer to Requirements for roles used to register locations.
  • For Permission mode, select Lake Formation.
  • Choose Register location.

Set cross-account version

Complete the following steps to set your cross-account version:

  • Sign in to the Lake Formation console as the data lake admin.
  • In the navigation pane, under Administration, choose Data Catalog settings.
  • Under Cross-account version settings, keep the latest version (Version 4) as the current cross-account version.
  • Choose Save.

Add permissions required for cross-account access

If a Data Catalog resource policy is already enabled in the account, you can either remove it or add the following permissions, which are required for cross-account grants. The policy allows AWS Resource Access Manager (AWS RAM) to share resources when cross-account grants are made using Lake Formation. For more information, refer to Prerequisites. If the policy under Catalog settings is blank, you can skip this step and move on to the next section.

  • Sign in to the AWS Glue console as an IAM admin.
  • In the navigation pane, under Data Catalog, choose Catalog settings.
  • Under Permissions, add the following policy, and provide the account ID where your AWS Glue resources exist:
{ "Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ram.amazonaws.com"
},
"Action": "glue:ShareResource",
"Resource": [
"arn:aws:glue:us-east-1:<account-id>:table/*/*",
"arn:aws:glue:us-east-1:<account-id>:database/*",
"arn:aws:glue:us-east-1:<account-id>:catalog"
]
}
]
}
  • Choose Save.

For more information, see Granting cross-account access.

Enable IAM Identity Center integration for Lake Formation

To integrate Lake Formation with the organization instance of IAM Identity Center, refer to Connecting Lake Formation with IAM Identity Center.

To enable cross-account sharing for IAM Identity Center users and groups, add the target recipient accounts to your Lake Formation IAM Identity Center integration under the AWS account and organization IDs.

  • Sign in to the Lake Formation console as a data lake admin.
  • In the navigation pane, under Administration, choose IAM Identity Center integration.
  • Under AWS account and organization IDs, choose Add.
  • Enter your target accounts.
  • Choose Add.

Enable Lake Formation permissions for databases

For Data Catalog databases that contain tables that you might share, you can stop new tables from having the default grant of Super to IAMAllowedPrincipals. Complete the following steps:

  • Sign in to the Lake Formation console as a data lake admin.
  • In the navigation pane, under Data Catalog, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Edit.
  • Under Default permissions for newly created tables, deselect Use only IAM access control for new tables in this database.
  • Choose Save.

Next, remove the default permissions granted to IAMAllowedPrincipals on the Data Catalog database:

  • Under Data Catalog in the navigation pane, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose View.
  • Select IAMAllowedPrincipals and choose Revoke.

Repeat the same steps for tables under the customerdb database.

Grant database permissions to IAM Identity Center groups

Complete the following steps to grant database permissions to your IAM Identity Center groups:

  • On the Lake Formation console, under Data Catalog, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Grant.
  • Select IAM Identity Center.
  • Choose Add and select Get Started.
  • Search for and select your IAM Identity Center group names and choose Assign.

  • Select Named Data Catalog resources.
  • Under Databases, choose customerdb.
  • Under Database permissions, select Describe.
  • Choose Grant.

Grant table permissions to IAM Identity Center groups

In the following sections, we grant different permissions to our two IAM Identity Center groups.

Column filter

We first add permissions to the group awssso-sales. This group will have access to the customerdb database and be able to select only two columns and read all rows.

  • On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Grant.
  • Select IAM Identity Center.
  • Choose Add and select Get Started.
  • Search for and select awssso-sales and choose Assign.

  • Select Named Data Catalog resources.
  • Under Databases, choose customerdb.
  • Under Tables, choose reviews.
  • Under Table permissions, select Select.
  • Select Column-based access.
  • Select Include columns and choose product_title and star_rating.
  • Choose Grant.

Row filter

Next, we grant permissions to awssso-finance. This group will have access to customerdb and be able to select all columns, with row-level filtering applied to the rows it can see.

We need to first create a data filter by performing the following steps:

  • On the Lake Formation console, choose Data filters under Data Catalog.
  • Choose Create data filter.
  • For Data filter name, provide a name (for this post, High_Rating).
  • For Target database, choose customerdb.
  • For Target table, choose reviews.
  • For Column-level access, select Access to all columns.
  • For Row-level access, choose Filter rows and apply your filter. In this example, we filter for reviews with a star_rating of 5 (the filter expression is shown after these steps).
  • Choose Create data filter.
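For reference, the row filter expression for this example would look like the following (Lake Formation row filters use a SQL-style predicate over the table’s columns; this assumes star_rating is stored as a numeric type):

star_rating = 5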

  • Under Data Catalog in the navigation pane, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Grant.
  • Select IAM Identity Center.
  • Choose Add and select Get Started.
  • Search for and select awssso-finance and choose Assign.
  • Select Named Data Catalog resources.
  • Under Databases, choose customerdb.
  • Under Tables, choose reviews.
  • Under Data filters, select the High_Rating data filter.
  • Under Data Filter permissions, select Select.
  • Choose Grant.

Member account 1 configuration

In this section, we walk through the steps to add Amazon Redshift Spectrum table access in member account 1, where the Redshift cluster and application exist.

Accept the AWS RAM invitation

You should have received an AWS Resource Access Manager (AWS RAM) invitation from member account 2 when you added member account 1 under the IAM Identity Center integration in Lake Formation. To accept it, complete the following steps:

  • Sign in to member account 1 and open the AWS RAM console.
  • Under Shared with me, choose Resource shares.
  • Select the resource share and choose Accept resource share.

Make sure you have fully established the Amazon Redshift integration with IAM Identity Center before following the next steps.

Set up Redshift Spectrum table access for the IAM Identity Center group

Complete the following steps to set up Redshift Spectrum table access:

  1. Sign in to the Amazon Redshift console using the admin role.
  2. Navigate to Query Editor v2.
  3. Choose the options menu (three dots) next to the cluster and choose Create connection.
  4. Connect as the admin user and run the following commands to make the shared data lake tables available to the IAM Identity Center groups (use the account ID of the account where the Data Catalog exists):
create external schema if not exists <schema_name> from DATA CATALOG database '<glue_catalog_name>' catalog_id '<accountid>';
grant usage on schema <schema_name> to role "<role_name>";

For example:

create external schema if not exists cross_account_glue_schema from DATA CATALOG database 'customerdb' catalog_id '932880906720';
grant usage on schema cross_account_glue_schema to role "awsidc:awssso-sales";
grant usage on schema cross_account_glue_schema to role "awsidc:awssso-finance";
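As an optional verification step (not part of the original walkthrough), you can confirm that the external schema is mapped to the shared cross-account database by querying the SVV_EXTERNAL_SCHEMAS system view:

-- Optional check: confirm the external schema points at the shared Glue database
select schemaname, databasename, esoptions
from svv_external_schemas
where schemaname = 'cross_account_glue_schema';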

Validate Redshift Spectrum access as an IAM Identity Center user

Complete the following steps to validate access:

  • On the Amazon Redshift console, navigate to Query Editor v2.
  • Choose the options menu (three dots) next to the cluster and choose Create connection.
  • Select IAM Identity Center.
  • Enter your Okta user name and password in the browser pop-up.

  • When you’re connected as a federated user, run the following SQL command to query the data lake table in cross_account_glue_schema.
select * from "dev"."cross_account_glue_schema"."reviews";

The following screenshot shows that user Ethan, who is part of the awssso-sales group, has access to two columns and all rows from the Data Catalog.

The following screenshot shows that user Frank, who is part of the awssso-finance group, has access to all columns for records that have star_rating as 5.

The following screenshot shows that user Brian, who is part of awssso-sales and awssso-finance, has access to all columns for records that have star_rating as 5 and access to only two columns (other columns are returned NULL) for records with star_rating other than 5.

Subscribe to QuickSight with IAM Identity Center

In this post, we set up QuickSight in the same account where the Redshift cluster exists. You can use the same or a different member account for QuickSight setup. To subscribe to QuickSight, complete the following steps:

  • Sign in to your AWS account and open the QuickSight console.
  • Choose Sign up for QuickSight.

  • Enter a notification email address for the QuickSight account owner or group. This email address will receive service and usage notifications.
  • Select the identity option that you want to subscribe with. For this post, we select Use AWS IAM Identity Center.
  • Enter a QuickSight account name.
  • Choose Configure.

  • Next, assign groups in IAM Identity Center to roles in QuickSight (admin, author, and reader). This step enables your users to access the QuickSight application. In this post, we choose awssso-sales and awssso-finance for the Admin group.
  • Specify an IAM role to control QuickSight access to your AWS resources. In this post, we select Use QuickSight managed role (default).
  • For this post, we deselect Add Paginated Reports.
  • Review the choices that you made, then choose Finish.

Enable trusted identity propagation in QuickSight

Trusted identity propagation authenticates the end-user in Amazon Redshift when they access QuickSight assets that use a trusted identity propagation enabled data source. When an author creates a data source with trusted identity propagation, the identity of the data source consumers in QuickSight is propagated and logged in AWS CloudTrail. This allows database administrators to centrally manage data security in Amazon Redshift and automatically apply data security rules to data consumers in QuickSight.

To configure QuickSight to connect to Amazon Redshift data sources with trusted identity propagation, add the Amazon Redshift OAuth scopes to your QuickSight account:

aws quicksight update-identity-propagation-config --aws-account-id "AWSACCOUNTID" --service "REDSHIFT" --authorized-targets "IAM Identity Center managed application ARN"

For example:

aws quicksight update-identity-propagation-config --aws-account-id "1234123123" --service "REDSHIFT" --authorized-targets "arn:aws:sso::XXXXXXXXXXXX:application/ssoins-XXXXXXXXXXXX/apl-XXXXXXXXXXXX"

After you have added the scope, the following command lists all OAuth scopes that are currently on a QuickSight account:

aws quicksight list-identity-propagation-configs --aws-account-id "AWSACCOUNTID"

The following is an example with its output:

aws quicksight list-identity-propagation-configs --aws-account-id "1234123123"
{
    "Status": 200,
    "Services": [
        {
            "Service": "REDSHIFT",
            "AuthorizedTargets": [
                "arn:aws:sso::1004123000:application/ssoins-1234f1234bb1f123/apl-12a1234e2e391234"
            ]
        }
    ],
    "RequestId": "116ec1b0-1533-4ed2-b5a6-d7577e073b35"
}

For more information, refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters.

For QuickSight to connect to a Redshift instance, you must add an appropriate IP address range in the Redshift security group for the specific AWS Region. For more information, see AWS Regions, websites, IP address ranges, and endpoints.

Test your IAM Identity Center and Amazon Redshift integration with QuickSight

Now you’re ready to connect to Amazon Redshift using QuickSight.

  • In the management account, open the IAM Identity Center console and copy the AWS access portal URL from the dashboard.
  • Sign out from the management account and enter the AWS access portal URL in a new browser window.
  • In the pop-up window, enter your IdP credentials.
  • On the Applications tab, select the QuickSight app.
  • After you federate to QuickSight, choose Datasets.
  • Select New Dataset and then choose Redshift (Auto Discovered).
  • Enter your data source details. Make sure to select Single sign-on for Authentication method.
  • Choose Create data source.

Congratulations! You’re signed in using IAM Identity Center integration with Amazon Redshift and are ready to explore and analyze your data using QuickSight.

The following screenshot from QuickSight shows that user Ethan, who is part of the awssso-sales group, has access to two columns and all rows from the Data Catalog.

The following screenshot from QuickSight shows that user Frank, who is part of the awssso-finance group, has access to all columns for records that have star_rating as 5.

The following screenshot from QuickSight shows that user Brian, who is part of awssso-sales and awssso-finance, has access to all columns for records that have star_rating as 5 and access to only two columns (other columns are returned NULL) for records with star_rating other than 5.

Clean up

Complete the following steps to clean up your resources:

  • Delete the data from the S3 bucket.
  • Delete the Data Catalog objects that you created as part of this post.
  • Delete the Lake Formation resources and QuickSight account.
  • If you created a new Redshift cluster to test this solution, delete the cluster.

Conclusion

In this post, we established cross-account access to enable centralized user authentication through IAM Identity Center in the management account, while keeping the Amazon Redshift and AWS Glue resources isolated by business unit in separate member accounts. We used Query Editor V2 for querying the data from Amazon Redshift. Then we showed how to build user-facing dashboards by integrating with QuickSight. Refer to Integrate Tableau and Okta with Amazon Redshift using AWS IAM Identity Center to learn about integrating Tableau and Okta with Amazon Redshift using IAM Identity Center.

Learn more about IAM Identity Center with Amazon Redshift, QuickSight, and Lake Formation. Leave your questions and feedback in the comments section.


About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

Post Syndicated from Chanpreet Singh original https://aws.amazon.com/blogs/big-data/unlock-scalability-cost-efficiency-and-faster-insights-with-large-scale-data-migration-to-amazon-redshift/

Large-scale data warehouse migration to the cloud is a complex and challenging endeavor that many organizations undertake to modernize their data infrastructure, enhance data management capabilities, and unlock new business opportunities. As data volumes continue to grow exponentially, traditional data warehousing solutions may struggle to keep up with the increasing demands for scalability, performance, and advanced analytics.

Migrating to Amazon Redshift offers organizations the potential for improved price-performance, enhanced data processing, faster query response times, and better integration with technologies such as machine learning (ML) and artificial intelligence (AI). However, you might face significant challenges when planning for a large-scale data warehouse migration. These challenges can range from ensuring data quality and integrity during the migration process to addressing technical complexities related to data transformation, schema mapping, performance, and compatibility issues between the source and target data warehouses. Additionally, organizations must carefully consider factors such as cost implications, security and compliance requirements, change management processes, and the potential disruption to existing business operations during the migration. Effective planning, thorough risk assessment, and a well-designed migration strategy are crucial to mitigating these challenges and implementing a successful transition to the new data warehouse environment on Amazon Redshift.

In this post, we discuss best practices for assessing, planning, and implementing a large-scale data warehouse migration into Amazon Redshift.

Success criteria for large-scale migration

The following diagram illustrates a scalable migration pattern for an extract, load, and transform (ELT) scenario using Amazon Redshift data sharing patterns.

The following diagram illustrates a scalable migration pattern for an extract, transform, and load (ETL) scenario.


Success criteria alignment by all stakeholders (producers, consumers, operators, and auditors) is key to a successful transition to a modern data architecture on Amazon Redshift. The success criteria are the key performance indicators (KPIs) for each component of the data workflow. This includes the ETL processes that capture source data, the functional refinement and creation of data products, the aggregation for business metrics, and the consumption from analytics, business intelligence (BI), and ML.

KPIs make sure you can track and audit optimal implementation, achieve consumer satisfaction and trust, and minimize disruptions during the final transition. They measure workload trends, cost usage, data flow throughput, consumer data rendering, and real-life performance. This makes sure the new data platform can meet current and future business goals.

Migration from a large-scale, mission-critical, monolithic legacy data warehouse (such as Oracle, Netezza, Teradata, or Greenplum) is typically planned and implemented over 6–16 months, depending on the complexity of the existing implementation. Monolithic data warehouse environments built over the last 30 years contain proprietary business logic and multiple data design patterns, including an operational data store, star or snowflake schemas, dimensions and facts, data warehouses and data marts, online transaction processing (OLTP) real-time dashboards, and online analytical processing (OLAP) cubes with multi-dimensional analytics. The data warehouse is highly business critical with minimal allowable downtime. If your data warehouse platform has gone through multiple enhancements over the years, your operational service level documentation may not be current with the latest operational metrics and desired SLAs for each tenant (such as business unit, data domain, or organization group).

As part of the success criteria for operational service levels, you need to document the expected service levels for the new Amazon Redshift data warehouse environment. This includes the expected response time limits for dashboard queries or analytical queries, elapsed runtime for daily ETL jobs, desired elapsed time for data sharing with consumers, total number of tenants with concurrency of loads and reports, and mission-critical reports for executives or factory operations.

As part of your modern data architecture transition strategy, the migration goal of a new Amazon Redshift-based platform is to use the scalability, performance, cost-optimization, and additional lake house capabilities of Amazon Redshift, thereby improving the existing data consumption experience. Depending on your enterprise’s culture and goals, your migration pattern of a legacy multi-tenant data platform to Amazon Redshift could follow one of several strategies.

A majority of organizations opt for the organic strategy (lift and shift) when migrating their large data platforms to Amazon Redshift. This approach uses AWS migration tools such as the AWS Schema Conversion Tool (AWS SCT) or the managed service version DMS Schema Conversion to rapidly meet goals around data center exit, cloud adoption, reducing legacy licensing costs, and replacing legacy platforms.

By establishing clear success criteria and monitoring KPIs, you can implement a smooth migration to Amazon Redshift that meets performance and operational goals. Thoughtful planning and optimization are crucial, including optimizing your Amazon Redshift configuration and workload management, addressing concurrency needs, implementing scalability, tuning performance for large result sets, minimizing schema locking, and optimizing join strategies. This will enable right-sizing the Redshift data warehouse to meet workload demands cost-effectively. Thorough testing and performance optimization will facilitate a smooth transition with minimal disruption to end-users, fostering exceptional user experiences and satisfaction. A successful migration can be accomplished through proactive planning, continuous monitoring, and performance fine-tuning, thereby aligning with and delivering on business objectives.

Migration involves the following phases, which we delve into in the subsequent sections:

  • Assessment
    • Discovery of workload and integrations
    • Dependency analysis
    • Effort estimation
    • Team sizing
    • Strategic wave planning
  • Functional and performance
    • Code conversion
    • Data validation
  • Measure and benchmark KPIs
    • Platform-level KPIs
    • Tenant-level KPIs
    • Consumer-level KPIs
    • Sample SQL
  • Monitoring Amazon Redshift performance and continual optimization
    • Identify top offending queries
    • Optimization strategies

To achieve a successful Amazon Redshift migration, it’s important to address infrastructure, security, and deployment considerations alongside these phases, thereby enabling a smooth and secure transition.

Assessment

In this section, we discuss the steps you can take in the assessment phase.

Discovery of workload and integrations

Conducting discovery and assessment for migrating a large on-premises data warehouse to Amazon Redshift is a critical step in the migration process. This phase helps identify potential challenges, assess the complexity of the migration, and gather the necessary information to plan and implement the migration effectively. You can use the following steps:

  • Data profiling and assessment – This involves analyzing the schema, data types, table sizes, and dependencies. Special attention should be given to complex data types such as arrays, JSON, or custom data types and custom user-defined functions (UDFs), because they may require specific handling during the migration process. Additionally, it’s essential to assess the volume of data and daily incremental data to be migrated, and estimate the required storage capacity in Amazon Redshift. Furthermore, analyzing the existing workload patterns, queries, and performance characteristics provides valuable insights into the resource requirements needed to optimize the performance of the migrated data warehouse in Amazon Redshift.
  • Code and query assessment – It’s crucial to assess the compatibility of existing SQL code, including queries, stored procedures, and functions. The AWS SCT can help identify any unsupported features, syntax, or functions that need to be rewritten or replaced to achieve a seamless integration with Amazon Redshift. Additionally, it’s essential to evaluate the complexity of the existing processes and determine if they require redesigning or optimization to align with Amazon Redshift best practices.
  • Performance and scalability assessment – This includes identifying performance bottlenecks, concurrency issues, or resource constraints that may be hindering optimal performance. This analysis helps determine the need for performance tuning or workload management techniques that may be required to achieve optimal performance and scalability in the Amazon Redshift environment.
  • Application integrations and mapping – Embarking on a data warehouse migration to a new platform necessitates a comprehensive understanding of the existing technology stack and business processes intertwined with the legacy data warehouse. Consider the following:
    • Meticulously document all ETL processes, BI tools, and scheduling mechanisms employed in conjunction with the current data warehouse. This includes commercial tools, custom scripts, and any APIs or connectors interfacing with source systems.
    • Take note of any custom code, frameworks, or mechanisms utilized in the legacy data warehouse for tasks such as managing slowly changing dimensions (SCDs), generating surrogate keys, implementing business logic, and other specialized functionalities. These components may require redevelopment or adaptation to operate seamlessly on the new platform.
    • Identify all upstream and downstream applications, as well as business processes that rely on the data warehouse. Map out their specific dependencies on database objects, tables, views, and other components. Trace the flow of data from its origins in the source systems, through the data warehouse, and ultimately to its consumption by reporting, analytics, and other downstream processes.
  • Security and access control assessment – This includes reviewing the existing security model, including user roles, permissions, access controls, data retention policies, and any compliance requirements and industry regulations that need to be adhered to.

Dependency analysis

Understanding dependencies between objects is crucial for a successful migration. You can use system catalog views and custom queries on your on-premises data warehouses to create a comprehensive object dependency report. This report shows how tables, views, and stored procedures rely on each other. This also involves analyzing indirect dependencies (for example, a view built on top of another view, which in turn uses a set of tables), and having a complete understanding of data usage patterns.
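As a starting point, the following is an illustrative query (not from the original post) for a PostgreSQL-compatible source such as Greenplum; it uses the standard information_schema.view_table_usage view to list which tables each view reads from, which is one input to the dependency report. Other engines have their own catalog equivalents.

-- Illustrative dependency query for a PostgreSQL-compatible source (for example, Greenplum):
-- lists the tables each user-defined view depends on.
select view_schema,
       view_name,
       table_schema as depends_on_schema,
       table_name   as depends_on_table
from information_schema.view_table_usage
where view_schema not in ('pg_catalog', 'information_schema')
order by view_schema, view_name;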

Effort estimation

The discovery phase serves as your compass for estimating the migration effort. You can translate those insights into a clear roadmap as follows:

  • Object classification and complexity assessment – Based on the discovery findings, categorize objects (tables, views, stored procedures, and so on) based on their complexity. Simple tables with minimal dependencies will require less effort to migrate than intricate views or stored procedures with complex logic.
  • Migration tools – Use the AWS SCT to estimate the base migration effort per object type. The AWS SCT can automate schema conversion, data type mapping, and function conversion, reducing manual effort.
  • Additional considerations – Factor in additional tasks beyond schema conversion. This may include data cleansing, schema optimization for Amazon Redshift performance, unit testing of migrated objects, and migration script development for complex procedures. The discovery phase sheds light on potential schema complexities, allowing you to accurately estimate the effort required for these tasks.

Team sizing

With a clear picture of the effort estimate, you can now size the team for the migration.

Person-months calculation

Divide the total estimated effort by the desired project duration to determine the total person-months required. This provides a high-level understanding of the team size needed.

For example, for an ELT migration project from an on-premises data warehouse to Amazon Redshift to be completed within 6 months, we estimate the team requirements based on the number of schemas or tenants (for example, 30), number of database tables (for example, 5,000), average migration estimate for a schema (for example, 4 weeks based on complexity of stored procedures, tables and views, platform-specific routines, and materialized views), and number of business functions (for example, 2,000 segmented by simple, medium, and complex patterns). We can determine the following are needed:

  • Migration time period (65% migration / 35% validation and transition) = 0.8 × 6 months ≈ 5 months, or 22 weeks
  • Dedicated teams = (number of tenants / migration time period) × average migration period per tenant = (30 / 5) × 1 = 6 teams
  • Migration team structure:
    • One to three data developers with stored procedure conversion expertise per team, performing over 25 conversions per week
    • One data validation engineer per team, testing over 50 objects per week
    • One to two data visualization experts per team, confirming consumer downstream applications are accurate and performant
  • A common shared DBA team with performance tuning expertise, responding to standardization needs and migration challenges
  • A platform architecture team (3–5 individuals) focused on platform design, service levels, availability, operational standards, cost, observability, scalability, performance, and design pattern issue resolutions

Team composition expertise

Based on the skillsets required for various migration tasks, we assemble a team with the right expertise. Platform architects define a well-architected platform. Data engineers are crucial for schema conversion and data transformation, and DBAs can handle cluster configuration and workload monitoring. An engagement or project management team makes sure the project runs smoothly, on time, and within budget.

For example, for an ETL migration project from Informatica/Greenplum to a target Redshift lakehouse with an Amazon Simple Storage Service (Amazon S3) data lake to be completed within 12 months, we estimate the team requirements based on the number of schemas and tenants (for example, 50 schemas), number of database tables (for example, 10,000), average migration estimate for a schema (6 weeks based on complexity of database objects), and number of business functions (for example, 5,000 segmented by simple, medium, and complex patterns). We can determine the following are needed:

  • An open data format ingestion architecture processing the source dataset and refining the data in the S3 data lake. This requires a dedicated team of 3–7 members building a serverless data lake for all data sources. Ingestion migration implementation is segmented by tenants and type of ingestion patterns, such as internal database change data capture (CDC); data streaming, clickstream, and Internet of Things (IoT); public dataset capture; partner data transfer; and file ingestion patterns.
  • The migration team composition is tailored to the needs of a project wave. Depending on each migration wave and what is being done in the wave (development, testing, or performance tuning), the right people will be engaged. When the wave is complete, the people from that wave will move to another wave.
  • A loading team builds a producer-consumer architecture in Amazon Redshift to process concurrent near real-time publishing of data. This requires a dedicated team of 3–7 members building and publishing refined datasets in Amazon Redshift.
  • A shared DBA group of 3–5 individuals helping with schema standardization, migration challenges, and performance optimization outside the automated conversion.
  • Data transformation experts to convert database stored functions in the producer or consumer.
  • A migration sprint plan spanning 10 months, using 2-week sprints, with multiple waves to release tenants to the new architecture.
  • A validation team to confirm a reliable and complete migration.
  • One to two data visualization experts per team, confirming that consumer downstream applications are accurate and performant.
  • A platform architecture team (3–5 individuals) focused on platform design, service levels, availability, operational standards, cost, observability, scalability, performance, and design pattern issue resolutions.

Strategic wave planning

Migration waves can be determined as follows:

  • Dependency-based wave delineation – Objects can be grouped into migration waves based on their dependency relationships. Objects with no or minimal dependencies will be prioritized for earlier waves, whereas those with complex dependencies will be migrated in subsequent waves. This provides a smooth and sequential migration process.
  • Logical schema and business area alignment – You can further revise migration waves by considering logical schema and business areas. This allows you to migrate related data objects together, minimizing disruption to specific business functions.

Functional and performance

In this section, we discuss the steps for refactoring the legacy SQL codebase to follow Amazon Redshift SQL best practices, building validation routines to ensure accuracy and completeness during the transition, capturing KPIs to ensure similar or better service levels for consumption tools and downstream applications, and incorporating performance hooks and procedures for a scalable and performant Redshift platform.

Code conversion

We recommend using the AWS SCT as the first step in the code conversion journey. The AWS SCT is a powerful tool that can streamline the database schema and code migrations to Amazon Redshift. With its intuitive interface and automated conversion capabilities, the AWS SCT can significantly reduce the manual effort required during the migration process. Refer to Converting data warehouse schemas to Amazon Redshift using AWS SCT for instructions to convert your database schema, including tables, views, functions, and stored procedures, to Amazon Redshift format. For an Oracle source, you can also use the managed service version DMS Schema Conversion.

When the conversion is complete, the AWS SCT generates a detailed conversion report. This report highlights any potential issues, incompatibilities, or areas requiring manual intervention. Although the AWS SCT automates a significant portion of the conversion process, manual review and modifications are often necessary to address various complexities and optimizations.

Some common cases where manual review and modifications are typically required include:

  • Incompatible data types – The AWS SCT may not always handle custom or non-standard data types, requiring manual intervention to map them to compatible Amazon Redshift data types.
  • Database-specific SQL extensions or proprietary functions – If the source database uses SQL extensions or proprietary functions specific to the database vendor (for example, STRING_AGG() or ARRAY_UPPER functions, or custom UDFs for PostgreSQL), these may need to be manually rewritten or replaced with equivalent Amazon Redshift functions or UDFs. The AWS SCT extension pack is an add-on module that emulates functions present in a source database that are required when converting objects to the target database.
  • Performance optimization – Although the AWS SCT can convert the schema and code, manual optimization is often necessary to take advantage of the features and capabilities of Amazon Redshift. This may include adjusting distribution and sort keys, converting row-by-row operations to set-based operations (see the sketch after this list), optimizing query plans, and other performance tuning techniques specific to Amazon Redshift.
  • Stored procedures and code conversion – The AWS SCT offers comprehensive capabilities to seamlessly migrate stored procedures and other code objects across platforms. Although its automated conversion process efficiently handles the majority of cases, certain intricate scenarios may necessitate manual intervention due to the complexity of the code and utilization of database-specific features or extensions. To achieve optimal compatibility and accuracy, it’s advisable to undertake testing and validation procedures during the migration process.
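To make the set-based rewrite mentioned above concrete, the following is a hypothetical sketch; the table and column names are illustrative and not from the post. Replacing a cursor-style, row-by-row load with a single INSERT ... SELECT typically performs far better on Amazon Redshift’s columnar, MPP architecture.

-- Hypothetical sketch: instead of looping over staging rows one at a time in a stored
-- procedure, load the target in one set-based statement. Table names are illustrative.
insert into fact_orders (order_id, customer_id, order_total, load_date)
select s.order_id,
       s.customer_id,
       sum(s.line_amount) as order_total,
       current_date       as load_date
from staging_orders s
group by s.order_id, s.customer_id;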

After you address the issues identified during the manual review process, it’s crucial to thoroughly test the converted stored procedures, as well as other database objects and code, such as views, functions, and SQL extensions, in a non-production Redshift cluster before deploying them in the production environment. This exercise is mostly undertaken by QA teams. This phase also involves conducting holistic performance testing (individual queries, batch loads, consumption reports and dashboards in BI tools, data mining applications, ML algorithms, and other relevant use cases) in addition to functional testing to make sure the converted code meets the required performance expectations. The performance tests should simulate production-like workloads and data volumes to validate the performance under realistic conditions.

Data validation

When migrating data from an on-premises data warehouse to a Redshift cluster on AWS, data validation is a crucial step to confirm the integrity and accuracy of the migrated data. There are several approaches you can consider:

  • Custom scripts – Use scripting languages like Python, SQL, or Bash to develop custom data validation scripts tailored to your specific data validation requirements. These scripts can connect to both the source and target databases, extract data, perform comparisons, and generate reports.
  • Open source tools – Use open source data validation tools like Amazon Deequ or Great Expectations. These tools provide frameworks and utilities for defining data quality rules, validating data, and generating reports.
  • AWS native or commercial tools – Use AWS native tools such as AWS Glue Data Quality or commercial data validation tools like Collibra Data Quality. These tools often provide comprehensive features, user-friendly interfaces, and dedicated support.

The following are different types of validation checks to consider:

  • Structural comparisons – Compare the list of columns and data types of columns between the source and target (Amazon Redshift). Any mismatches should be flagged.
  • Row count validation – Compare the row counts of each core table in the source data warehouse with the corresponding table in the target Redshift cluster. This is the most basic validation step to make sure no data has been lost or duplicated during the migration process.
  • Column-level validation – Validate individual columns by comparing column-level statistics (min, max, count, sum, average) for each column between the source and target databases. This can help identify any discrepancies in data values or data types (see the example query after this list).
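As one concrete example of the row count and column-level checks, a query like the following could be run against both the source and the target (Amazon Redshift) and the outputs compared. It’s an illustrative sketch; the table and column names are placeholders for your own.

-- Illustrative validation query: run the same statement on the source and on Amazon Redshift,
-- then diff the results. Table and column names are placeholders.
select count(*)         as row_count,
       min(order_date)  as min_order_date,
       max(order_date)  as max_order_date,
       sum(order_total) as sum_order_total,
       avg(order_total) as avg_order_total
from sales.orders;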

You can also consider the following validation strategies:

  • Data profiling – Perform data profiling on the source and target databases to understand the data characteristics, identify outliers, and detect potential data quality issues. For example, you can use the data profiling capabilities of AWS Glue Data Quality or the Amazon Deequ library.
  • Reconciliation reports – Produce detailed validation reports that highlight errors, mismatches, and data quality issues. Consider generating reports in various formats (CSV, JSON, HTML) for straightforward consumption and integration with monitoring tools.
  • Automate the validation process – Integrate the validation logic into your data migration or ETL pipelines using scheduling tools or workflow orchestrators like Apache Airflow or AWS Step Functions.

Lastly, keep in mind the following considerations for collaboration and communication:

  • Stakeholder involvement – Involve relevant stakeholders, such as business analysts, data owners, and subject matter experts, throughout the validation process to make sure business requirements and data quality expectations are met.
  • Reporting and sign-off – Establish a clear reporting and sign-off process for the validation results, involving all relevant stakeholders and decision-makers.

Measure and benchmark KPIs

For multi-tenant Amazon Redshift implementation, KPIs are segmented at the platform level, tenant level, and consumption tools level. KPIs evaluate the operational metrics, cost metrics, and end-user response time metrics. In this section, we discuss the KPIs needed for achieving a successful transition.

Platform-level KPIs

As new tenants are gradually migrated to the platform, it’s imperative to monitor the current state of the Amazon Redshift platform-level KPIs. The current state of these KPIs helps the platform team make the necessary scalability modifications (add nodes, add consumer clusters, add producer clusters, or increase Concurrency Scaling clusters). Amazon Redshift query monitoring rules (QMR) also help govern the overall state of the data platform, providing optimal performance for all tenants by managing outlier workloads.

The following summarizes the relevant platform-level KPIs and their service levels or success criteria, grouped by component.

  • ETL
    • Ingestion data volume – Daily or hourly peak volume in GBps, number of objects, number of threads.
    • Ingestion threads – Peak hourly ingestion threads (COPY or INSERT), number of dependencies, KPI segmented by tenants and domains.
    • Stored procedure volume – Peak hourly stored procedure invocations segmented by tenants and domains.
    • Concurrent load – Peak concurrent load supported by the producer cluster; distribution of ingestion pattern across multiple producer clusters using data sharing.
    • Data sharing dependency – Data sharing between producer clusters (objects refreshed, locks per hour, waits per hour).
  • Workload
    • Number of queries – Peak hour query volume supported by the cluster, segmented by short (less than 10 seconds), medium (less than 60 seconds), long (less than 5 minutes), very long (less than 30 minutes), and outlier (more than 30 minutes); segmented by tenant, domain, or sub-domain.
    • Number of queries per queue – Peak hour query volume supported by each priority automatic WLM queue, segmented by short (less than 10 seconds), medium (less than 60 seconds), long (less than 5 minutes), very long (less than 30 minutes), and outlier (more than 30 minutes); segmented by tenant, business group, domain, or sub-domain.
    • Runtime pattern – Total runtime per hour; max, median, and average run pattern; segmented by service class across clusters.
    • Wait time patterns – Total wait time per hour; max, median, and average wait pattern for queries waiting.
  • Performance
    • Leader node usage – Service level for leader node (recommended less than 80%).
    • Compute node CPU usage – Service level for compute node (recommended less than 90%).
    • Disk I/O usage per node – Service level for disk I/O per node.
    • QMR rules – Number of outlier queries stopped by QMR (large scan, large spilling disk, large runtime); logging thresholds for potential large queries running more than 5 minutes.
    • History of WLM queries – Historical trend of queries stored in a historical archive table for all instances of queries in STL_WLM_QUERY; trend analysis over 30 days, 60 days, and 90 days to fine-tune the workload across clusters.
  • Cost
    • Total cost per month of the Amazon Redshift platform – Service level for the mix of instances (reserved, on-demand, serverless), cost of Concurrency Scaling, and cost of Amazon Redshift Spectrum usage. Use AWS tools like AWS Cost Explorer or the daily cost usage report to capture monthly costs for each component.
    • Daily Concurrency Scaling usage – Service limits to monitor cost for Concurrency Scaling; invoke for outlier activity on spikes.
    • Daily Amazon Redshift Spectrum usage – Service limits to monitor cost for using Amazon Redshift Spectrum; invoke for outlier activity.
    • Redshift Managed Storage usage cost – Track usage of Redshift Managed Storage, monitoring wastage on temporary, archival, and old data assets.
  • Localization
    • Remote or on-premises tools – Service level for rendering large datasets to remote destinations.
    • Data transfer to remote tools – Data transfer to BI tools or workstations outside the Redshift cluster VPC; separation of datasets to Amazon S3 using the unload feature, avoiding bottlenecks at the leader node.

Tenant-level KPIs

Tenant-level KPIs help capture current performance levels from the legacy system and document expected service levels for the data flow from the source capture to end-user consumption. The captured legacy KPIs assist in providing the best target modern Amazon Redshift platform (a single Redshift data warehouse, a lake house with Amazon Redshift Spectrum, and data sharing with the producer and consumer clusters). Cost usage tracking at the tenant level helps you spread the cost of a shared platform across tenants.

The following summarizes the relevant tenant-level KPIs and their service levels or success criteria, grouped by component.

  • Cost
    • Compute usage by tenant – Track usage by tenant, business group, or domain; capture query volume by business unit by associating the Redshift user identity to the internal business unit; data observability by consumer usage of data products, helping with cost attribution.
  • ETL
    • Orchestration SLA – Service level for daily data availability.
    • Runtime – Service level for data loading and transformation.
    • Data ingestion volume – Peak expected volume for the service level guarantee.
  • Query consumption
    • Response time – Response time SLA for query patterns (dashboards, SQL analytics, ML analytics, BI tool caching).
    • Concurrency – Peak query consumers for the tenant.
    • Query volume – Peak hourly volume service levels and daily query volumes.
    • Individual query response for critical data consumption – Service level and success criteria for critical workloads.

Consumer-level KPIs

A multi-tenant modern data platform can set service levels for a variety of consumer tools. The service levels provide guidance to end-users of the capability of the new deployment.

The following summarizes the relevant consumer-level KPIs and their service levels or success criteria, grouped by consumer.

  • BI tools
    • Large data extraction – Service level for unloading data for caching or for rendering a large result dataset.
  • Dashboards
    • Response time – Service level for data refresh.
  • SQL query tools
    • Response time – Service level for response time by query type.
    • Concurrency – Service level for concurrent query access by all consumers.
  • One-time analytics
    • Response time – Service level for large data unloads or aggregation.
  • ML analytics
    • Response time – Service level for large data unloads or aggregation.

Sample SQL

The post includes sample SQL to capture daily KPI metrics. The following example KPI dashboard trends assist in capturing historic workload patterns, identifying deviations in workload, and providing guidance on the platform workload capacity to meet the current workload and anticipated growth patterns.
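The sample SQL referenced above isn’t reproduced here; the following is an illustrative query in the same spirit. It derives daily query volume, queries that waited at least 5 seconds, median wait, and total runtime from the STL_WLM_QUERY system table (which retains only a few days of history, so persist the results to an archive table for longer trend analysis).

-- Illustrative daily KPI query (not the post's sample SQL): daily query volume, queries
-- queued at least 5 seconds, median wait, and total runtime from STL_WLM_QUERY.
select trunc(service_class_start_time)                                   as query_day,
       count(*)                                                          as total_queries,
       sum(case when total_queue_time >= 5 * 1000000 then 1 else 0 end)  as queued_queries_5s,
       round(median(total_queue_time) / 1000000.0, 2)                    as median_wait_seconds,
       round(sum(total_exec_time) / 1000000.0, 2)                        as total_runtime_seconds
from stl_wlm_query
where service_class > 5   -- user queues only; system service classes are excluded
group by 1
order by 1;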

The following figure shows a daily query volume snapshot (queries per day and queued queries per day, which waited a minimum of 5 seconds).

The following figure shows a daily usage KPI. It monitors percentage waits and median wait for waiting queries (identifies the minimal threshold for wait to compute waiting queries and median of all wait times to infer deviation patterns).

The following figure illustrates concurrency usage (monitors concurrency compute usage for Concurrency Scaling clusters).

The following figure shows a 30-day pattern (computes volume in terms of total runtime and total wait time).

Monitoring Redshift performance and continual optimization

Amazon Redshift uses automatic table optimization (ATO) to choose the right distribution style, sort keys, and encoding when you create a table with AUTO options. Therefore, it’s a good practice to take advantage of the AUTO feature and create tables with DISTSTYLE AUTO, SORTKEY AUTO, and ENCODING AUTO. When tables are created with AUTO options, Amazon Redshift initially creates tables with optimal keys for the best first-time query performance possible using information such as the primary key and data types. In addition, Amazon Redshift analyzes the data volume and query usage patterns to evolve the distribution strategy and sort keys to optimize performance over time. Finally, Amazon Redshift performs table maintenance activities on your tables that reduce fragmentation and make sure statistics are up to date.
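For example, the following sketch creates a table that leaves all three decisions to automatic table optimization; the table and column names are illustrative.

-- Sketch: let automatic table optimization choose distribution, sort keys, and encodings.
create table sales_orders (
    order_id    bigint,
    customer_id bigint,
    order_date  date,
    order_total decimal(12,2)
)
diststyle auto
sortkey auto
encode auto;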

During a large, phased migration, it’s important to monitor and measure Amazon Redshift performance against target KPIs at each phase and implement continual optimization. As new workloads are onboarded at each phase of the migration, it’s recommended to perform regular Redshift cluster reviews and analyze query pattern and performance. Cluster reviews can be done by engaging the Amazon Redshift specialist team through AWS Enterprise support or your AWS account team. The goal of a cluster review includes the following:

  • Use cases – Review the application use cases and determine if the design is suitable to solve for those use cases.
  • End-to-end architecture – Assess the current data pipeline architecture (ingestion, transformation, and consumption). For example, determine if too many small inserts are occurring and review their ETL pipeline. Determine if integration with other AWS services can be useful, such as AWS Lake Formation, Amazon Athena, Redshift Spectrum, or Amazon Redshift federation with PostgreSQL and MySQL.
  • Data model design – Review the data model and table design and provide recommendations for sort and distribution keys, keeping in mind best practices.
  • Performance – Review cluster performance metrics. Identify bottlenecks or irregularities and suggest recommendations. Dive deep into specific long-running queries to identify solutions specific to the customer’s workload.
  • Cost optimization – Provide recommendations to reduce costs where possible.
  • New features – Stay up to date with the new features in Amazon Redshift and identify where they can be used to meet these goals.

New workloads can introduce query patterns that could impact performance and miss target SLAs. A number of factors can affect query performance. In the following sections, we discuss aspects impacting query speed and optimizations for improving Redshift cluster performance.

Identify top offending queries

A compute node is partitioned into slices. More nodes means more processors and more slices, which enables you to redistribute the data as needed across the slices. However, more nodes also means greater expense, so you will need to find the balance of cost and performance that is appropriate for your system. For more information on Redshift cluster architecture, see Data warehouse system architecture. Each node type offers different sizes and limits to help you scale your cluster appropriately. The node size determines the storage capacity, memory, CPU, and price of each node in the cluster. For more information on node types, see Amazon Redshift pricing.

Redshift Test Drive is an open source tool that lets you evaluate which data warehouse configuration options are best suited for your workload. We created Redshift Test Drive from Simple Replay and Amazon Redshift Node Configuration Comparison (see Compare different node types for your workload using Amazon Redshift for more details) to provide a single entry point for finding the best Amazon Redshift configuration for your workload. Redshift Test Drive also provides additional features such as a self-hosted analysis UI and the ability to replicate external objects that a Redshift workload may interact with. With Amazon Redshift Serverless, you can start with a base Redshift Processing Unit (RPU) capacity, and Redshift Serverless automatically scales based on your workload needs.

Optimization strategies

If you choose to fine-tune manually, the following are key concepts and considerations:

  • Data distribution – Amazon Redshift stores table data on the compute nodes according to a table’s distribution style. When you run a query, the query optimizer redistributes the data to the compute nodes as needed to perform any joins and aggregations. Choosing the right distribution style for a table helps minimize the impact of the redistribution step by locating the data where it needs to be before the joins are performed. For more information, see Working with data distribution styles.
  • Data sort order – Amazon Redshift stores table data on disk in sorted order according to a table’s sort keys. The query optimizer and query processor use the information about where the data is located to reduce the number of blocks that need to be scanned and thereby improve query speed. For more information, see Working with sort keys.
  • Dataset size – A higher volume of data in the cluster can slow query performance for queries, because more rows need to be scanned and redistributed. You can mitigate this effect by regular vacuuming and archiving of data, and by using a predicate (a condition in the WHERE clause) to restrict the query dataset.
  • Concurrent operations – Amazon Redshift offers a powerful feature called automatic workload management (WLM) with query priorities, which enhances query throughput and overall system performance. By intelligently managing multiple concurrent operations and allocating resources dynamically, automatic WLM makes sure high-priority queries receive the necessary resources promptly, while lower-priority queries are processed efficiently without compromising system stability. This advanced queuing mechanism allows Amazon Redshift to optimize resource utilization, minimizing potential bottlenecks and maximizing query throughput, ultimately delivering a seamless and responsive experience for users running multiple operations simultaneously.
  • Query structure – How your query is written will affect its performance. As much as possible, write queries to process and return as little data as will meet your needs. For more information, see Amazon Redshift best practices for designing queries.
  • Queries with a long return time – Queries with a long return time can impact the processing of other queries and the overall performance of the cluster, so it’s critical to identify and optimize them. You can optimize these queries by moving clients to the same network or by using the UNLOAD feature of Amazon Redshift and then configuring the client to read the output from Amazon S3. To identify percentile and top-running queries, you can download the sample SQL notebook of system queries and import it into Query Editor V2; a minimal query sketch follows this list.
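
The following is a minimal sketch (not the sample notebook itself) that lists the longest-running user SELECT queries over the last 7 days; it assumes the SYS_QUERY_HISTORY system view is available on your cluster.

-- Top 20 longest-running SELECT queries in the last 7 days
SELECT query_id,
       user_id,
       ROUND(elapsed_time / 1000000.0, 1) AS elapsed_seconds,
       ROUND(queue_time / 1000000.0, 1)   AS queued_seconds,
       LEFT(query_text, 80)               AS query_snippet
FROM sys_query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_DATE)
  AND query_type = 'SELECT'
ORDER BY elapsed_time DESC
LIMIT 20;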

Conclusion

In this post, we discussed best practices for assessing, planning, and implementing a large-scale data warehouse migration into Amazon Redshift.

The assessment phase of a data migration project is critical for implementing a successful migration. It involves a comprehensive analysis of the existing workload, integrations, and dependencies to accurately estimate the effort required and determine the appropriate team size. Strategic wave planning is crucial for prioritizing and scheduling the migration tasks effectively. Establishing KPIs and benchmarking them helps measure progress and identify areas for improvement. Code conversion and data validation processes validate the integrity of the migrated data and applications. Monitoring Amazon Redshift performance, identifying and optimizing top offending queries, and conducting regular cluster reviews are essential for maintaining optimal performance and addressing any potential issues promptly.

By addressing these key aspects, organizations can seamlessly migrate their data workloads to Amazon Redshift while minimizing disruptions and maximizing the benefits of Amazon Redshift.

We hope this post provides you with valuable guidance. We welcome any thoughts or questions in the comments section.


About the authors

Chanpreet Singh is a Senior Lead Consultant at AWS, specializing in Data Analytics and AI/ML. He has over 17 years of industry experience and is passionate about helping customers build scalable data warehouses and big data solutions. In his spare time, Chanpreet loves to explore nature, read, and spend time with his family.

Harshida Patel is an Analytics Specialist Principal Solutions Architect with AWS.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Ram Bhandarkar is a Principal Data Architect at AWS based out of Northern Virginia. He helps customers with planning future Enterprise Data Strategy and assists them with transition to Modern Data Architecture platform on AWS. He has worked with building and migrating databases, data warehouses and data lake solutions for over 25 years.

Vijay Bagur is a Sr. Technical Account Manager. He works with enterprise customers to modernize and cost optimize workloads, improve security posture, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, biking and traveling.

Get started with the new Amazon DataZone enhancements for Amazon Redshift

Post Syndicated from Carmen Manzulli original https://aws.amazon.com/blogs/big-data/get-started-with-the-new-amazon-datazone-enhancements-for-amazon-redshift/

In today’s data-driven landscape, organizations are seeking ways to streamline their data management processes and unlock the full potential of their data assets, while controlling access and enforcing governance. That’s why we introduced Amazon DataZone.

Amazon DataZone is a powerful data management service that empowers data engineers, data scientists, product managers, analysts, and business users to seamlessly catalog, discover, analyze, and govern data across organizational boundaries, AWS accounts, data lakes, and data warehouses.

On March 21, 2024, Amazon DataZone introduced several exciting enhancements to its Amazon Redshift integration that simplify the process of publishing and subscribing to data warehouse assets like tables and views, while enabling Amazon Redshift customers to take advantage of the data management and governance capabilities of Amazon DataZone.

These updates improve the experience for both data users and administrators.

Data producers and consumers can now quickly create data warehouse environments using preconfigured credentials and connection parameters provided by their Amazon DataZone administrators.

Additionally, these enhancements grant administrators greater control over who can access and use the resources within their AWS accounts and Redshift clusters, and for what purpose.

As an administrator, you can now create parameter sets on top of DefaultDataWarehouseBlueprint by providing parameters such as cluster, database, and an AWS secret. You can use these parameter sets to create environment profiles and authorize Amazon DataZone projects to use these environment profiles for creating environments.

In turn, data producers and data consumers can now select an environment profile to create environments without having to provide the parameters themselves, saving time and reducing the risk of issues.

In this post, we explain how you can use these enhancements to the Amazon Redshift integration to publish your Redshift tables to the Amazon DataZone data catalog, and enable users across the organization to discover and access them in a self-service fashion. We present a sample end-to-end customer workflow that covers the core functionalities of Amazon DataZone, and include a step-by-step guide of how you can implement this workflow.

The same workflow is available as a video demonstration on the Amazon DataZone official YouTube channel.

Solution overview

To get started with the new Amazon Redshift integration enhancements, consider the following scenario:

  • A sales team acts as the data producer, owning and publishing product sales data (a single table in a Redshift cluster called catalog_sales)
  • A marketing team acts as the data consumer, needing access to the sales data in order to analyze it and build product adoption campaigns

At a high level, the steps we walk you through in the following sections include tasks for the Amazon DataZone administrator, Sales team, and Marketing team.

Prerequisites

For the workflow described in this post, we assume a single AWS account, a single AWS Region, and a single AWS Identity and Access Management (IAM) user, who will act as Amazon DataZone administrator, Sales team (producer), and Marketing team (consumer).

To follow along, you need an AWS account. If you don’t have an account, you can create one.

In addition, you must have the following resources configured in your account:

  • An Amazon DataZone domain with admin, sales, and marketing projects
  • A Redshift namespace and workgroup

If you don’t have these resources already configured, you can create them by deploying an AWS CloudFormation stack:

  1. Choose Launch Stack to deploy the provided CloudFormation template.
  2. For AdminUserPassword, enter a password, and take note of this password to use in later steps.
  3. Leave the remaining settings as default.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
  5. When the stack deployment is complete, on the Amazon DataZone console, choose View domains in the navigation pane to see the newly created Amazon DataZone domain.
  6. On the Amazon Redshift Serverless console, choose Workgroup configuration in the navigation pane and see the newly created resource.

You should be logged in using the same role that you used to deploy the CloudFormation stack and verify that you’re in the same Region.

As a final prerequisite, you need to create a catalog_sales table in the default Redshift database (dev).

  1. On the Amazon Redshift Serverless console, select your workgroup and choose Query data to open the Amazon Redshift query editor.
  2. In the query editor, choose your workgroup and select Database user name and password as the type of connection, then provide your admin database user name and password.
  3. Use the following query to create the catalog_sales table, which the Sales team will publish in the workflow:
    CREATE TABLE catalog_sales AS 
    SELECT 146776932 AS order_number, 23 AS quantity, 23.4 AS wholesale_cost, 45.0 as list_price, 43.0 as sales_price, 2.0 as discount, 12 as ship_mode_sk,13 as warehouse_sk, 23 as item_sk, 34 as catalog_page_sk, 232 as ship_customer_sk, 4556 as bill_customer_sk
    UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
    UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
    UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
    UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
    UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
    UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
    UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
    UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
    UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
    UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Now you’re ready to get started with the new Amazon Redshift integration enhancements.

Amazon DataZone administrator tasks

As the Amazon DataZone administrator, you perform the following tasks:

  1. Configure the DefaultDataWarehouseBlueprint.
    • Authorize the Amazon DataZone admin project to use the blueprint to create environment profiles.
    • Create a parameter set on top of DefaultDataWarehouseBlueprint by providing parameters such as cluster, database, and AWS secret.
  2. Set up environment profiles for the Sales and Marketing teams.

Configure the DefaultDataWarehouseBlueprint

Amazon DataZone blueprints define what AWS tools and services are provisioned to be used within an Amazon DataZone environment. Enabling the data warehouse blueprint allows data producers and data consumers to use Amazon Redshift and the query editor to share, access, and consume data.

  1. On the Amazon DataZone console, choose View domains in the navigation pane.
  2. Choose your Amazon DataZone domain.
  3. Choose Default Data Warehouse.

If you used the CloudFormation template, the blueprint is already enabled.

Part of the new Amazon Redshift experience involves the Managing projects and Parameter sets tabs. The Managing projects tab lists the projects that are allowed to create environment profiles using the data warehouse blueprint. By default, this is set to all projects. For our purpose, let’s grant only the admin project.

  1. On the Managing projects tab, choose Edit.

  1. Select Restrict to only managing projects and choose the AdminPRJ project.
  2. Choose Save changes.

With this enhancement, the administrator can control which projects can use the default blueprints in their account to create environment profiles.

The Parameter sets tab lists parameter sets that you can create on top of DefaultDataWarehouseBlueprint by providing parameters such as the Redshift cluster or Redshift Serverless workgroup name, the database name, and the credentials that allow Amazon DataZone to connect to your cluster or workgroup. You can also create AWS secrets on the Amazon DataZone console. Before these enhancements, AWS secrets had to be managed separately using AWS Secrets Manager, making sure to include the proper tags (key-value) for Amazon Redshift Serverless.

For our scenario, we need to create a parameter set to connect a Redshift Serverless workgroup containing sales data.

  1. On the Parameter sets tab, choose Create parameter set.
  2. Enter a name and optional description for the parameter set.
  3. Choose the Region containing the resource you want to connect to (for example, our workgroup is in us-east-1).
  4. In the Environment parameters section, select Amazon Redshift Serverless.

If you already have an AWS secret with credentials to your Redshift Serverless workgroup, you can provide the existing AWS secret ARN. In this case, the secret must be tagged with the following (key-value): AmazonDataZoneDomain: <Amazon DataZone domain ID>.
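
For example, assuming you reuse an existing secret, you could add the required tag with the AWS CLI as follows (the secret name and domain ID shown are placeholders):

# Tag an existing secret so Amazon DataZone is allowed to use it
aws secretsmanager tag-resource \
    --secret-id my-redshift-credentials \
    --tags Key=AmazonDataZoneDomain,Value=dzd_exampledomainid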

  1. Because we don’t have an existing AWS secret, we create a new one by choosing Create new AWS Secret.
  2. In the pop-up, enter a secret name and your Amazon Redshift credentials, then choose Create new AWS Secret.

Amazon DataZone creates a new secret using Secrets Manager and makes sure the secret is tagged with the domain in which you’re creating the parameter set.

  1. Enter the Redshift Serverless workgroup name and database name to complete the parameters list. If you used the provided CloudFormation template, use sales-workgroup for the workgroup name and dev for the database name.
  2. Choose Create parameter set.

You can see the parameter set created for your Redshift environment and the blueprint enabled with a single managing project configured.


Set up environment profiles for the Sales and Marketing teams

Environment profiles are predefined templates that encapsulate technical details required to create an environment, such as the AWS account, Region, and resources and tools to be added to projects. The next Amazon DataZone administrator task consists of setting up environment profiles, based on the default enabled blueprint, for the Sales and Marketing teams.

This task will be performed from the admin project in the Amazon DataZone data portal, so let’s follow the data portal URL and start creating an environment profile for the Sales team to publish their data.

  1. On the details page of your Amazon DataZone domain, in the Summary section, choose the link for your data portal URL.

When you open the data portal for the first time, you’re prompted to create a project. If you used the provided CloudFormation template, the projects are already created.

  1. Choose the AdminPRJ project.
  2. On the Environments page, choose Create environment profile.
  3. Enter a name (for example, SalesEnvProfile) and optional description (for example, Sales DWH Environment Profile) for the new environment profile.
  4. For Owner, choose AdminPRJ.
  5. For Blueprint, select the DefaultDataWarehouse blueprint (you’ll only see blueprints where the admin project is listed as a managing project).
  6. Choose the current enabled account and the parameter set you previously created.

Then you will see each prefilled value for Redshift Serverless. Under Authorized projects, you can pick the authorized projects allowed to use this environment profile to create an environment. By default, this is set to All projects.

  1. Select Authorized projects only.
  2. Choose Add projects and choose the SalesPRJ project.
  3. Configure the publishing permissions for this environment profile. Because the Sales team is our data producer, we select Publish from any schema.
  4. Choose Create environment profile.

Next, you create a second environment profile for the Marketing team to consume data. To do this, you repeat similar steps made for the Sales team.

  1. Choose the AdminPRJ project.
  2. On the Environments page, choose Create environment profile.
  3. Enter a name (for example, MarketingEnvProfile) and optional description (for example, Marketing DWH Environment Profile).
  4. For Owner, choose AdminPRJ.
  5. For Blueprint, select the DefaultDataWarehouse blueprint.
  6. Select the parameter set you created earlier.
  7. This time, keep All projects as the default (alternatively, you could select Authorized projects only and add MarketingPRJ).
  8. Configure the publishing permissions for this environment profile. Because the Marketing team is our data consumer, we select Don’t allow publishing.
  9. Choose Create environment profile.

With these two environment profiles in place, the Sales and Marketing teams can start working on their projects on their own to create their proper environments (resources and tools) with fewer configurations and less risk to incur errors, and publish and consume data securely and efficiently within these environments.

To recap, the new enhancements offer the following features:

  • When creating an environment profile, you can choose to provide your own Amazon Redshift parameters or use one of the parameter sets from the blueprint configuration. If you choose to use the parameter set created in the blueprint configuration, the AWS secret only requires the AmazonDataZoneDomain tag (the AmazonDataZoneProject tag is only required if you choose to provide your own parameter sets in the environment profile).
  • In the environment profile, you can specify a list of authorized projects, so that only authorized projects can use this environment profile to create data warehouse environments.
  • You can also specify what data authorized projects are allowed to be published. You can choose one of the following options: Publish from any schema, Publish from the default environment schema, and Don’t allow publishing.

These enhancements grant administrators more control over Amazon DataZone resources and projects and facilitate the common activities of all roles involved.

Sales team tasks

As a data producer, the Sales team performs the following tasks:

  1. Create a sales environment.
  2. Create a data source.
  3. Publish sales data to the Amazon DataZone data catalog.

Create a sales environment

Now that you have an environment profile, you need to create an environment in order to work with data and analytics tools in this project.

  1. Choose the SalesPRJ project.
  2. On the Environments page, choose Create environment.
  3. Enter a name (for example, SalesDwhEnv) and optional description (for example, Environment DWH for Sales) for the new environment.
  4. For Environment profile, choose SalesEnvProfile.

Data producers can now select an environment profile to create environments, without the need to provide their own Amazon Redshift parameters. The AWS secret, Region, workgroup, and database are ported over to the environment from the environment profile, streamlining and simplifying the experience for Amazon DataZone users.

  1. Review your data warehouse parameters to confirm everything is correct.
  2. Choose Create environment.

The environment will be automatically provisioned by Amazon DataZone with the preconfigured credentials and connection parameters, allowing the Sales team to publish Amazon Redshift tables seamlessly.

Create a data source

Now, let’s create a new data source for our sales data.

  1. Choose the SalesPRJ project.
  2. On the Data page, choose Create data source.
  3. Enter a name (for example, SalesDataSource) and optional description.
  4. For Data source type, select Amazon Redshift.
  5. For Environment, choose SalesDwhEnv.
  6. For Redshift credentials, you can use the same credentials you provided during environment creation, because you’re still using the same Redshift Serverless workgroup.
  7. Under Data Selection, enter the schema name where your data is located (for example, public) and then specify a table selection criterion (for example, *).

Here, the * indicates that this data source will bring into Amazon DataZone all the technical metadata from the database tables of your schema (in this case, a single table called catalog_sales).

  1. Choose Next.

On the next page, automated metadata generation is enabled. This means that Amazon DataZone will automatically generate the business names of the table and columns for that asset. 

  1. Leave the settings as default and choose Next.
  2. For Run preference, select when to run the data source. Amazon DataZone can automatically publish these assets to the data catalog, but let’s select Run on demand so we can curate the metadata before publishing.
  3. Choose Next.
  4. Review all settings and choose Create data source.
  5. After the data source has been created, you can manually pull technical metadata from the Redshift Serverless workgroup by choosing Run.

When the data source has finished running, you can see the catalog_sales asset correctly added to the inventory.

Publish sales data to the Amazon DataZone data catalog

Open the catalog_sales asset to see details of the new asset (business metadata, technical metadata, and so on).

In a real-world scenario, this pre-publishing phase is when you can enrich the asset with more business context and information, such as a readme, glossaries, or metadata forms. For example, you can start accepting some of the automatically generated metadata recommendations and rename the asset or its columns to make them more readable, descriptive, and easier for a business user to search and understand.

For this post, simply choose Publish asset to complete the Sales team tasks.

Marketing team tasks

Let’s switch to the Marketing team and subscribe to the catalog_sales asset published by the Sales team. As a consumer team, the Marketing team will complete the following tasks:

  1. Create a marketing environment.
  2. Discover and subscribe to sales data.
  3. Query the data in Amazon Redshift.

Create a marketing environment

To subscribe and access Amazon DataZone assets, the Marketing team needs to create an environment.

  1. Choose the MarketingPRJ project.
  2. On the Environments page, choose Create environment.
  3. Enter a name (for example, MarketingDwhEnv) and optional description (for example, Environment DWH for Marketing).
  4. For Environment profile, choose MarketingEnvProfile.

As with data producers, data consumers can also benefit from a preconfigured profile (created and managed by the administrator) to speed up the environment creation process and reduce the risk of errors.

  1. Review your data warehouse parameters to confirm everything is correct.
  2. Choose Create environment.

Discover and subscribe to sales data

Now that we have a consumer environment, let’s search the catalog_sales table in the Amazon DataZone data catalog.

  1. Enter sales in the search bar.
  2. Choose the catalog_sales table.
  3. Choose Subscribe.
  4. In the pop-up window, choose your marketing consumer project, provide a reason for the subscription request, and choose Subscribe.

When you get a subscription request as a data producer, Amazon DataZone will notify you through a task in the sales producer project. Because you’re acting as both subscriber and publisher here, you will see a notification.

  1. Choose the notification, which will open the subscription request.

You can see details including which project has requested access, who is the requestor, and why access is needed.

  1. To approve, enter a message for approval and choose Approve.

Now that the subscription has been approved, let’s go back to MarketingPRJ. On the Subscribed data page, catalog_sales is listed as an approved asset, but access hasn’t been granted yet. If you choose the asset, you can see that Amazon DataZone is working on the backend to automatically grant access. When it’s complete, you’ll see the subscription as granted and the message “Asset added to 1 environment.”

Query data in Amazon Redshift

Now that the marketing project has access to the sales data, we can use the Amazon Redshift Query Editor V2 to analyze the sales data.

  1. Under MarketingPRJ, go to the Environments page and select the marketing environment.
  2. Under the analytics tools, choose Query data with Amazon Redshift, which redirects you to the query editor within the environment of the project.
  3. To connect to Amazon Redshift, choose your workgroup and select Federated user as the connection type.

When you’re connected, you will see the catalog_sales table under the public schema.

  1. To make sure that you have access to this table, run the following query:
SELECT * FROM catalog_sales LIMIT 10

As a consumer, you’re now able to explore data and create reports, or you can aggregate data and create new assets to publish in Amazon DataZone, becoming a producer of a new data product to share with other users and departments.
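
For instance, a simple aggregation the Marketing team might run on the subscribed table (using the columns defined earlier in catalog_sales) could look like the following sketch:

-- Hypothetical campaign analysis: orders, revenue, and discounting by ship mode
SELECT ship_mode_sk,
       COUNT(*)                    AS orders,
       SUM(quantity * sales_price) AS gross_revenue,
       AVG(discount)               AS avg_discount
FROM catalog_sales
GROUP BY ship_mode_sk
ORDER BY gross_revenue DESC;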

Clean up

To clean up your resources, complete the following steps:

  1. On the Amazon DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
  2. Clean up all Amazon Redshift resources (workgroup and namespace) to avoid incurring additional charges.

Conclusion

In this post, we demonstrated how you can get started with the new Amazon Redshift integration in Amazon DataZone. We showed how to streamline the experience for data producers and consumers and how to grant administrators control over data resources.

Embrace these enhancements and unlock the full potential of Amazon DataZone and Amazon Redshift for your data management needs.



About the author

Carmen is a Solutions Architect at AWS, based in Milan, Italy. She is a data lover who enjoys helping companies adopt cloud technologies, especially for data analytics and data governance. Outside of work, she is a creative person who loves being in contact with nature and sometimes practicing adrenaline-filled activities.

Manage Amazon Redshift provisioned clusters with Terraform

Post Syndicated from Amit Ghodke original https://aws.amazon.com/blogs/big-data/manage-amazon-redshift-provisioned-clusters-with-terraform/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze all your data using standard SQL and your existing extract, transform, and load (ETL); business intelligence (BI); and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

HashiCorp Terraform is an infrastructure as code (IaC) tool that lets you define cloud resources in human-readable configuration files that you can version, reuse, and share. You can then use a consistent workflow to provision and manage your infrastructure throughout its lifecycle.

In this post, we demonstrate how to use Terraform to manage common Redshift cluster operations, such as:

  • Creating a new provisioned Redshift cluster using Terraform code and adding an AWS Identity and Access Management (IAM) role to it
  • Scheduling pause, resume, and resize operations for the Redshift cluster

Solution overview

The following diagram illustrates the solution architecture for provisioning a Redshift cluster using Terraform.

Solution Overview

In addition to Amazon Redshift, the solution uses the following AWS services:

  • Amazon Elastic Compute Cloud (Amazon EC2) offers the broadest and deepest compute platform, with over 750 instance types and a choice of the latest processors, storage, networking, operating systems (OS), and purchase models to help you best match the needs of your workload. For this post, we use an m5.xlarge instance with the Windows Server 2022 Datacenter Edition. The choice of instance type and Windows OS is flexible; you can choose a configuration that suits your use case.
  • IAM allows you to securely manage identities and access to AWS services and resources. We use IAM roles and policies to securely access services and perform relevant operations. An IAM role is an AWS identity that you can assume to gain temporary access to AWS services and resources. Each IAM role has a set of permissions defined by IAM policies. These policies determine the actions and resources the role can access.
  • AWS Secrets Manager allows you to securely store the user name and password needed to log in to Amazon Redshift.

In this post, we demonstrate how to set up an environment that connects AWS and Terraform. The following are the high-level tasks involved:

  1. Set up an EC2 instance with Windows OS in AWS.
  2. Install Terraform on the instance.
  3. Configure your environment variables (Windows OS).
  4. Define an IAM policy that grants the minimum access needed to perform activities on a Redshift cluster, including pause, resume, and resize.
  5. Establish an IAM role using the policy you created.
  6. Create a provisioned Redshift cluster using Terraform code.
  7. Attach the IAM role you created to the Redshift cluster.
  8. Write the Terraform code to schedule cluster operations like pause, resume, and resize.

Prerequisites

To complete the activities described in this post, you need an AWS account and administrator privileges on the account to use the key AWS services and create the necessary IAM roles.

Create an EC2 instance

We begin with creating an EC2 instance. Complete the following steps to create a Windows OS EC2 instance:

  1. On the Amazon EC2 console, choose Launch Instance.
  2. Choose a Windows Server Amazon Machine Image (AMI) that suits your requirements.
  3. Select an appropriate instance type for your use case.
  4. Configure the instance details:
    1. Choose the VPC and subnet where you want to launch the instance.
    2. Enable Auto-assign Public IP.
    3. For Add storage, configure the desired storage options for your instance.
    4. Add any necessary tags to the instance.
  5. For Configure security group, select or create a security group that allows the necessary inbound and outbound traffic to your instance.
  6. Review the instance configuration and choose Launch to start the instance creation process.
  7. For Select an existing key pair or create a new key pair, choose an existing key pair or create a new one.
  8. Choose Launch instance.
  9. When the instance is running, you can connect to it using the Remote Desktop Protocol (RDP) and the administrator password obtained using the Get Windows password option on the Amazon EC2 console.

Install Terraform on the EC2 instance

Install Terraform on the Windows EC2 instance using the following steps:

  1. RDP into the EC2 instance you created.
  2. Install Terraform on the EC2 instance.

You need to update the environment variables to point to the directory where the Terraform executable is available.

  1. Under System Properties, on the Advanced tab, choose Environment Variables.

Environment Variables

  1. Choose the path variable.

Path Variables

  1. Choose New and enter the path where Terraform is installed. For this post, it’s in the C:\ directory.

Add Terraform to path variable

  1. Confirm Terraform is installed by entering the following command:

terraform -v

Check Terraform version

Optionally, you can use an editor like Visual Studio Code (VS Code) and add the Terraform extension to it.

Create a user for accessing AWS through code (AWS CLI and Terraform)

Next, we create an administrator user in IAM, which performs the operations on AWS through Terraform and the AWS Command Line Interface (AWS CLI). Complete the following steps:

  1. Create a new IAM user.
  2. On the IAM console, download and save the access key ID and secret access key.

Create New IAM User

  1. Install the AWS CLI.
  2. Launch the AWS CLI and run aws configure and pass the access key ID, secret access key, and default AWS Region.

This prevents the AWS credentials from being visible in plain text in the Terraform code and prevents accidental sharing when the code is committed to a code repository.

AWS Configure

Create a user for Accessing Redshift through code (Terraform)

Because we’re creating a Redshift cluster and performing subsequent operations, the administrator user name and password required for these processes (different from the administrator user we created earlier for logging in to the AWS Management Console) need to be referenced in the code. To do this securely, we use Secrets Manager to store the user name and password, and we write Terraform code to access these credentials during the cluster create operation. Complete the following steps:

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Choose Store a new secret.

Store a New Secret

  1. For Secret type, select Credentials for Amazon Redshift data warehouse.
  2. Enter your credentials.

Choose Secret Type

Set up Terraform

Complete the following steps to set up Terraform:

  1. Create a folder or directory for storing all your Terraform code.
  2. Open the VS Code editor and browse to your folder.
  3. Choose New File and enter a name for the file using the .tf extension.

Now we’re ready to start writing our code starting with defining providers. The providers definition is a way for Terraform to get the necessary APIs to interact with AWS.

  1. Configure a provider for Terraform:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.53.0"
    }
  }
}

# Configure the AWS Provider
provider "aws" {
  region = "us-east-1"
}
  1. Access the admin credentials for the Amazon Redshift admin user:
data "aws_secretsmanager_secret_version" "creds" {
# Fill in the name you gave to your secret
secret_id = "terraform-creds"
}
/*json decode to parse the secret*/
locals {
terraform-creds = jsondecode(
data.aws_secretsmanager_secret_version.creds.secret_string
)
}

Create a Redshift cluster

To create a Redshift cluster, use the aws_redshift_cluster resource:

# Create an encrypted Amazon Redshift cluster

resource "aws_redshift_cluster" "dw_cluster" {
cluster_identifier = "tf-example-redshift-cluster"
database_name      = "dev"
master_username    = local.terraform-creds.username
master_password    = local.terraform-creds.password
node_type          = "ra3.xlplus"
cluster_type       = "multi-node"
publicly_accessible = "false"
number_of_nodes    = 2
encrypted         = true
kms_key_id        = local.RedshiftClusterEncryptionKeySecret.arn
enhanced_vpc_routing = true
cluster_subnet_group_name="<<your-cluster-subnet-groupname>>"
}

In this example, we create a Redshift cluster called tf-example-redshift-cluster using the ra3.xlplus node type and two nodes. We use the credentials from Secrets Manager and jsondecode to access these values, which makes sure the user name and password aren’t passed in plain text.

Add an IAM role to the cluster

Because we didn’t have the option to associate an IAM role during cluster creation, we do so now with the following code:

resource "aws_redshift_cluster_iam_roles" "cluster_iam_role" {
cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
iam_role_arns      = ["arn:aws:iam::yourawsaccountId:role/service-role/yourIAMrolename"]
}

Enable Redshift cluster operations

Performing operations on the Redshift cluster such as resize, pause, and resume on a schedule offers a more practical use of these operations. Therefore, we create a trust policy that allows the Amazon Redshift scheduler service to assume a role, and a permissions policy that allows the pause, resume, and resize operations on the cluster. We then create a role with that trust relationship and attach the permissions policy to it.

You can also perform these steps directly from the console and then reference them in Terraform code. The following example demonstrates the code snippets to create policies and a role, and then to attach the policy to the role.

  1. Create the Amazon Redshift scheduler policy document and create the role that assumes this policy:
#define policy document to establish the Trust Relationship between the role and the entity (Redshift scheduler)

data "aws_iam_policy_document" "assume_role_scheduling" {
  statement {
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["scheduler.redshift.amazonaws.com"]
    }
    actions = ["sts:AssumeRole"]
  }
}

#create a role that has the above trust relationship attached to it, so that it can invoke the redshift scheduling service
resource "aws_iam_role" "scheduling_role" {
  name               = "redshift_scheduled_action_role"
  assume_role_policy = data.aws_iam_policy_document.assume_role_scheduling.json
}
  1. Create a policy document and policy for Amazon Redshift operations:
/*define the policy document for other redshift operations*/

data "aws_iam_policy_document" "redshift_operations_policy_definition" {
  statement {
    effect = "Allow"
    actions = [
      "redshift:PauseCluster",
      "redshift:ResumeCluster",
      "redshift:ResizeCluster",
    ]
    resources = ["arn:aws:redshift:*:youraccountid:cluster:*"]
  }
}

/*create the policy and add the above data (json) to the policy*/
resource "aws_iam_policy" "scheduling_actions_policy" {
  name   = "redshift_scheduled_action_policy"
  policy = data.aws_iam_policy_document.redshift_operations_policy_definition.json
}
  1. Attach the policy to the IAM role:
/*connect the policy and the role*/
resource "aws_iam_role_policy_attachment" "role_policy_attach" {
  policy_arn = aws_iam_policy.scheduling_actions_policy.arn
  role       = aws_iam_role.scheduling_role.name
}
  1. Pause the Redshift cluster:
#pause a cluster
resource "aws_redshift_scheduled_action" "pause_operation" {
  name     = "tf-redshift-scheduled-action-pause"
  schedule = "cron(00 22 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    pause_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
    }
  }
}

In the preceding example, we created a scheduled action called tf-redshift-scheduled-action-pause that pauses the cluster at 10:00 PM every day as a cost-saving action.

  1. Resume the Redshift cluster:
name     = "tf-redshift-scheduled-action-resume"
schedule = "cron(15 07 * * ? *)"
iam_role = aws_iam_role.scheduling_role.arn
target_action {
resume_cluster {
cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
}
}
}

In the preceding example, we created a scheduled action called tf-redshift-scheduled-action-resume that resumes the cluster at 7:15 AM every day in time for business operations to start using the Redshift cluster.

  1. Resize the Redshift cluster:
#resize a cluster
resource "aws_redshift_scheduled_action" "resize_operation" {
  name     = "tf-redshift-scheduled-action-resize"
  schedule = "cron(15 14 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    resize_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
      cluster_type       = "multi-node"
      node_type          = "ra3.xlplus"
      number_of_nodes    = 4    /*increase the number of nodes using the resize operation*/
      classic            = true /*default behavior is elastic resize; set classic = true to use classic resize*/
    }
  }
}

In the preceding example, we created a scheduled action called tf-redshift-scheduled-action-resize that increases the number of nodes from 2 to 4. You can perform other operations, such as changing the node type, as well. By default, elastic resize is used; if you want to use classic resize, you have to pass the parameter classic = true as shown in the preceding code. This can be a scheduled action that anticipates the needs of peak periods and resizes the cluster appropriately for that duration. You can then downsize using similar code during non-peak times.

Test the solution

We apply the following code to test the solution. Change the resource details accordingly, such as account ID and Region name.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.53.0"
    }
  }
}

# Configure the AWS Provider
provider "aws" {
  region = "us-east-1"
}

# access secrets stored in secret manager
data "aws_secretsmanager_secret_version" "creds" {
  # Fill in the name you gave to your secret
  secret_id = "terraform-creds"
}

/*json decode to parse the secret*/
locals {
  terraform-creds = jsondecode(
    data.aws_secretsmanager_secret_version.creds.secret_string
  )
}

#Store the arn of the KMS key to be used for encrypting the redshift cluster

data "aws_secretsmanager_secret_version" "encryptioncreds" {
  secret_id = "RedshiftClusterEncryptionKeySecret"
}
locals {
  RedshiftClusterEncryptionKeySecret = jsondecode(
    data.aws_secretsmanager_secret_version.encryptioncreds.secret_string
  )
}

# Create an encrypted Amazon Redshift cluster
resource "aws_redshift_cluster" "dw_cluster" {
  cluster_identifier = "tf-example-redshift-cluster"
  database_name      = "dev"
  master_username    = local.terraform-creds.username
  master_password    = local.terraform-creds.password
  node_type          = "ra3.xlplus"
  cluster_type       = "multi-node"
  publicly_accessible = "false"
  number_of_nodes    = 2
  encrypted         = true
  kms_key_id        = local.RedshiftClusterEncryptionKeySecret.arn
  enhanced_vpc_routing = true
  cluster_subnet_group_name="redshiftclustersubnetgroup-yuu4sywme0bk"
}

#add IAM Role to the Redshift cluster

resource "aws_redshift_cluster_iam_roles" "cluster_iam_role" {
  cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
  iam_role_arns      = ["arn:aws:iam::youraccountid:role/service-role/yourrolename"]
}

#for audit logging, create an S3 bucket that grants read/write privileges to the Redshift service; this example does not include S3 bucket creation

resource "aws_redshift_logging" "redshiftauditlogging" {
  cluster_identifier   = aws_redshift_cluster.dw_cluster.cluster_identifier
  log_destination_type = "s3"
  bucket_name          = "your-s3-bucket-name"
}

#to do operations like pause, resume, resize on a schedule we need to first create a role that has permissions to perform these operations on the cluster

#define policy document to establish the Trust Relationship between the role and the entity (Redshift scheduler)

data "aws_iam_policy_document" "assume_role_scheduling" {
  statement {
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["scheduler.redshift.amazonaws.com"]
    }

    actions = ["sts:AssumeRole"]
  }
}

#create a role that has the above trust relationship attached to it, so that it can invoke the redshift scheduling service
resource "aws_iam_role" "scheduling_role" {
  name               = "redshift_scheduled_action_role"
  assume_role_policy = data.aws_iam_policy_document.assume_role_scheduling.json
}

/*define the policy document for other redshift operations*/

data "aws_iam_policy_document" "redshift_operations_policy_definition" {
  statement {
    effect = "Allow"
    actions = [
      "redshift:PauseCluster",
      "redshift:ResumeCluster",
      "redshift:ResizeCluster",
    ]

    resources =  ["arn:aws:redshift:*:youraccountid:cluster:*"]
  }
}

/*create the policy and add the above data (json) to the policy*/

resource "aws_iam_policy" "scheduling_actions_policy" {
  name   = "redshift_scheduled_action_policy"
  policy = data.aws_iam_policy_document.redshift_operations_policy_definition.json
}

/*connect the policy and the role*/

resource "aws_iam_role_policy_attachment" "role_policy_attach" {
  policy_arn = aws_iam_policy.scheduling_actions_policy.arn
  role       = aws_iam_role.scheduling_role.name
}

#pause a cluster

resource "aws_redshift_scheduled_action" "pause_operation" {
  name     = "tf-redshift-scheduled-action-pause"
  schedule = "cron(00 14 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    pause_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
    }
  }
}

#resume a cluster

resource "aws_redshift_scheduled_action" "resume_operation" {
  name     = "tf-redshift-scheduled-action-resume"
  schedule = "cron(15 14 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    resume_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
    }
  }
}

#resize a cluster

resource "aws_redshift_scheduled_action" "resize_operation" {
  name     = "tf-redshift-scheduled-action-resize"
  schedule = "cron(15 14 * * ? *)"
  iam_role = aws_iam_role.scheduling_role.arn
  target_action {
    resize_cluster {
      cluster_identifier = aws_redshift_cluster.dw_cluster.cluster_identifier
      cluster_type = "multi-node"
      node_type = "ra3.xlplus"
      number_of_nodes = 4 /*increase the number of nodes using resize operation*/
      classic = true /*default behavior is elastic resize; set classic = true to use classic resize*/
    }
  }
}

Run terraform plan to see a list of changes that will be made, as shown in the following screenshot.

Terraform plan

After you have reviewed the changes, use terraform apply to create the resources you defined.

Terraform Apply

You will be asked to enter yes or no before Terraform starts creating the resources.

Confirmation of apply

You can confirm that the cluster is being created on the Amazon Redshift console.

redshift cluster creation

After the cluster is created, the IAM roles and schedules for pause, resume, and resize operations are added, as shown in the following screenshot.

Terraform actions

You can also view these scheduled operations on the Amazon Redshift console.

Scheduled Actions

Clean up

If you deployed resources such as the Redshift cluster, IAM roles, or any of the other associated resources by running terraform apply, run terraform destroy to tear them down and clean up your environment so you avoid incurring charges on your AWS account.

Conclusion

Terraform offers a powerful and flexible solution for managing your infrastructure as code using a declarative approach, with a cloud-agnostic nature, resource orchestration capabilities, and strong community support. This post provided a comprehensive guide to using Terraform to deploy a Redshift cluster and perform important operations such as resize, resume, and pause on the cluster. Embracing IaC and using the right tools, such as VS Code and Terraform, will enable you to build scalable and maintainable distributed applications, and automate processes.


About the Authors

Amit Ghodke is an Analytics Specialist Solutions Architect based out of Austin. He has worked with databases, data warehouses and analytical applications for the past 16 years. He loves to help customers implement analytical solutions at scale to derive maximum business value.

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

How ActionIQ built a truly composable customer data platform using Amazon Redshift

Post Syndicated from Mackenzie Johnson original https://aws.amazon.com/blogs/big-data/how-actioniq-built-a-truly-composable-customer-data-platform-using-amazon-redshift/

This post is written in collaboration with Mackenzie Johnson and Phil Catterall from ActionIQ.

ActionIQ is a leading composable customer data platform (CDP) designed for enterprise brands to grow faster and deliver meaningful experiences for their customers. ActionIQ taps directly into a brand’s data warehouse to build smart audiences, resolve customer identities, and design personalized interactions to unlock revenue across the customer lifecycle. Enterprise brands including Albertsons, Atlassian, Bloomberg, e.l.f. Beauty, DoorDash, HP, and more use ActionIQ to drive growth through better customer experiences.

High costs associated with launching campaigns, the security risk of duplicating data, and the time spent on SQL requests have created a demand for a better solution for managing and activating customer data. Organizations are demanding secure, cost efficient, and time efficient solutions to power their marketing outcomes.

This post will demonstrate how ActionIQ built a connector for Amazon Redshift to tap directly into your data warehouse and deliver a secure, zero-copy CDP. It will cover how you can get started with building a truly composable CDP with Amazon Redshift—from the solution architecture to setting up and testing the connector.

The challenge

Copying or moving data means heavy and complex logistics, along with added cost and the security risks associated with replicating data. On the logistics side, data engineering teams have to set up additional extract, transform, and load (ETL) pipelines out of their Amazon Redshift warehouse into ActionIQ, then configure ActionIQ to ingest the data on a recurring basis. Additional ETL jobs mean more moving parts, which introduces more potential points of failure, such as breaking schema changes, partial data transfers, delays, and more. All of this requires additional observability overhead to help your team alert on and manage issues as they come up.

These additional ETL jobs add latency to the end-to-end process from data collection to activation, which makes it more likely that your campaigns are activating on stale data and missing key audience members. That has implications for the customer experience, thereby directly affecting your ability to drive revenue.

The solution

Our solution aims to reduce the logistics already discussed and enables up-to-the-minute data by establishing a secure connection and pushing queries directly down to your data warehouse. Instead of loading full datasets into ActionIQ, the query is pushed down to the data warehouse, which does the hard querying and aggregation work, and ActionIQ waits only for the result set.

With Amazon Redshift as your data warehouse, you can run complex workloads with consistently high performance while minimizing the time and effort spent in copying data over to the data warehouse through the use of features like zero-ETL integration with transactional data stores, streaming ingestion, and data sharing. You can also train machine learning models and make predictions directly from your Amazon Redshift data warehouse using familiar SQL commands.

Solution architecture

Within AWS, ActionIQ has a virtual private cloud (VPC) and you have your own VPC. We work within our own private area in AWS, with our own locks and access restrictions. Because ActionIQ is going to have access to your Amazon Redshift data warehouse, this implies that an outside organization (ActionIQ) will be able to make direct database queries to the production database environment.

A Solution architecture diagram showing AWS PrivateLink set up between ActionIQ's VPC and the customer's VPC

For your information security (infosec) teams to approve this design, we need very clear and tight guardrails to ensure that:

  • ActionIQ only has access to what is absolutely necessary
  • No unintended third party can access these assets

ActionIQ needs to communicate securely to satisfy every information security requirement. To do that, within those AWS environments, you must set up AWS PrivateLink with ActionIQ to create a secure connection between the two VPCs. PrivateLink sets up a secure tunnel between the two VPCs, thus avoiding any opening of either VPC to the public internet. After PrivateLink is set up, ActionIQ needs to be granted privileges to the relevant database objects in your Amazon Redshift data warehouse.

In Amazon Redshift, you must create a distinct database, a service account specifically for ActionIQ, and views to populate the data to be shared with ActionIQ. The views need to adhere to ActionIQ’s data model guidelines, which aren’t rigid, but nonetheless require some structure, such as a clear profile_id that is used in all the views for easy joins between the various data sets.

Getting started with ActionIQ

When starting a hybrid compute integration with Amazon Redshift, it’s key to align your data to ActionIQ’s table types in the following manner:

  • Customer base table: A single dimension table with one record per customer that contains all customers.
  • User info tables: Dimension tables that describe customers and join to the customer base table. They often contain slow-moving or static demographic information and are typically matched one to one with customer records.
  • Event tables: Fact or log-like tables that contain events or actions your customers take. The primary key is typically a user_id and timestamp.
  • Entity tables: Dimension tables that describe non-customer objects. They often provide additional information to augment the data in event tables. For example, an entity table could be a product table that contains product metadata and joins to a transaction event table on a product_id.

Visual representation of the relationships of the high level entities in the customer, event and product subject areas

Note: User info and event tables can join on any available identifier to the customer base table, not just the base user ID.
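To make these relationships concrete, the following is a minimal sketch in Redshift SQL. The table and column names (customers, orders, products, profile_id, and so on) are hypothetical and only illustrate how event and entity tables join back to the customer base table:

-- Hypothetical tables: customers (customer base), orders (event table), products (entity table)
SELECT
    c.profile_id,
    c.email,
    o.order_id,
    o.order_ts,
    p.product_category
FROM customers c
JOIN orders o
    ON o.profile_id = c.profile_id   -- event table joins to the customer base table
JOIN products p
    ON p.product_id = o.product_id   -- entity table augments the event table
WHERE o.order_ts >= DATEADD(day, -30, CURRENT_DATE);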

Now you can set up the connection and declare the views in the ActionIQ UI. After ActionIQ establishes a table with master profiles, users can begin to interpret those tables and work with them to build out campaigns.

Establish a secure connection

After setting up PrivateLink, the remaining steps to prepare for hybrid compute are the following:

  1. Create a separate database in Amazon Redshift to define the shared dataset with ActionIQ.
  2. Create a service account for ActionIQ in Amazon Redshift.
  3. Grant READ access to the service account for the dedicated database.
  4. Define the views that will be shared with ActionIQ.

Allow listing

If your data warehouse is on a private network, you must add ActionIQ’s IP addresses to your network’s allow list so that ActionIQ can access your cloud warehouse. For more information on how to set this up, see Configure inbound rules for SQL clients.

Database setup

Create an Amazon Redshift user for ActionIQ

  1. Sign in to the Amazon Redshift console.
  2. From the navigation menu, choose Query Editor and connect to your database.
  3. Create a user for ActionIQ:
    CREATE USER actioniq PASSWORD 'password';

  4. Grant permissions to the tables within a given schema you want to give ActionIQ access to:
GRANT USAGE ON SCHEMA yourschema TO actioniq;
GRANT SELECT ON ALL TABLES IN SCHEMA yourschema TO actioniq;

Running these commands in the query editor creates the new user and grants it permission to the datasets you want to make accessible through ActionIQ.

The result is that ActionIQ now has programmatic query access to a dedicated database in your Amazon Redshift data warehouse, and that access is limited to that database.

In order to make this easy to govern, we recommend the following guidelines on the shared views:

  • As much as possible, the shared objects should be views, not tables.
  • The views should never use select *; instead, they should explicitly list each field exposed by the view (see the sketch after this list). This has multiple benefits:
    • The schema is stable: changes to the underlying table don’t silently alter the shared view.
    • It makes it very clear which fields are accessible to ActionIQ and which are not, enabling a proper governance approval process.
  • Limiting privileges to READ access means the data warehouse administrators can be structurally certain that the shared views won’t change unless they want them to.
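The following is a minimal sketch of a shared view that follows these guidelines. The names (an actioniq_share schema, a yourschema.customers source table, and its columns) are hypothetical, and the sketch uses a dedicated schema for brevity where you might use the dedicated database described earlier:

-- Hypothetical dedicated schema for the objects shared with ActionIQ
CREATE SCHEMA IF NOT EXISTS actioniq_share;

-- Explicit column list instead of SELECT *, with a consistent profile_id join key
CREATE VIEW actioniq_share.customer_profiles AS
SELECT
    c.customer_id AS profile_id,
    c.email,
    c.city,
    c.signup_date
FROM yourschema.customers c;

-- READ-only access for the ActionIQ service account
GRANT USAGE ON SCHEMA actioniq_share TO actioniq;
GRANT SELECT ON actioniq_share.customer_profiles TO actioniq;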

The importance of providing views instead of actual tables is two-fold:

  1. A view doesn’t replicate data. The whole point is to avoid data replication, and we don’t want to replicate data within Amazon Redshift either. With a view, which is essentially a query definition on top of the actual data tables, we avoid the need to replicate data at all. There is a legitimate question of “why not give access to tables directly?” which brings us to the second point.
  2. Tables and data schema change on their own schedule, and ActionIQ needs a stable data schema to work with. By defining a view, we’re also defining a contract for sharing data between you and ActionIQ. The underlying data table can change, and the view definition can absorb this change, without modifying the structure of what the view delivers. This stability is critical for any enterprise software as a service to work effectively with a large organization.

On the ActionIQ side, there’s no caching or persisting of data of any kind. ActionIQ runs a query whenever a scheduled campaign launches and needs data, and whenever a user of the platform asks for an audience count. In other words, queries generally happen during business hours, but technically they can happen at any time.

Testing the connector

ActionIQ deploys the Amazon Redshift connector and runs test queries to validate that the connector works. After an audience is defined and validated, ActionIQ sends the SQL query to Amazon Redshift, which returns the result set. We also validate the results in Amazon Redshift to ensure the logic works as intended.

With this, you get a lean and more transparent deployment process. Because the queries ActionIQ sends to Amazon Redshift are logged, you can see what’s going on, what is attributable to ActionIQ, and how adoption and usage grow over time.

Image showing connection set up to an Amazon Redshift database

A connector defines the credentials and other parameters needed to connect to a cloud database or warehouse. The Connector screen is used to create, view, and manage your connectors.

Key considerations

Organizations need strong data governance. ActionIQ requires a contract for what the data will look like within the defined view. Because data changes over time, strong governance workflows with defined fields are required to run the connector smoothly and achieve the ultimate outcome: driving revenue through marketing campaigns.

Because ActionIQ is used as the central hub of marketing orchestration and activation, it needs to process a lot of queries. Marketing activity can spike significantly, so it’s prudent to plan for the maximum load on the underlying database.

If you have spiky workloads, Amazon Redshift Serverless can scale your data warehouse automatically to handle those spikes. That means Amazon Redshift can absorb large and sudden bursts of queries from ActionIQ without much technical planning.

If workload isolation is a priority and you want to run the ActionIQ workloads on dedicated compute resources, you can use the data sharing feature to create a data share that is accessed by a dedicated Redshift Serverless endpoint. This allows ActionIQ to query up-to-date data from a separate Redshift Serverless instance without copying any data while maintaining complete workload isolation.
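The following is a minimal sketch of that setup. The share, schema, database, and namespace names are hypothetical, and data sharing requires Redshift Serverless or RA3-based warehouses on both sides:

-- On the producer warehouse: share the schema that holds the ActionIQ views
CREATE DATASHARE actioniq_ds;
ALTER DATASHARE actioniq_ds ADD SCHEMA actioniq_share;
ALTER DATASHARE actioniq_ds ADD ALL TABLES IN SCHEMA actioniq_share;
GRANT USAGE ON DATASHARE actioniq_ds TO NAMESPACE '<consumer-namespace-id>';

-- On the dedicated consumer endpoint: expose the share as a local database
CREATE DATABASE actioniq_db FROM DATASHARE actioniq_ds OF NAMESPACE '<producer-namespace-id>';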

The data team needs data to run business intelligence. ActionIQ is driving marketing activation and creating a new data set for the universal contact history—essentially the log of all marketing contacts from the activity. ActionIQ provides this dataset back to Amazon Redshift, which can then be included in the BI reports for ROI measurements.

Conclusion

For your information security teams, ActionIQ’s Amazon Redshift connector presents a viable solution because ActionIQ doesn’t replicate the data, and the controls outlined above establish how ActionIQ accesses the data. Key benefits include:

  • Control: Choose where data is stored and queried to improve security and fit existing technology investments.
  • Performance: Reduce operational effort, increase productivity, and cut down on unnecessary technology costs.
  • Power: Use the auto-scaling capabilities of Amazon Redshift for running your workload.

For business teams, the ActionIQ Amazon Redshift connector queries the freshest data available. Because no copies are maintained, there is no added data latency from replication, which is an important consideration when key audience members are primed to convert.

ActionIQ is excited to launch the Amazon Redshift connector to activate your data where it lives—within your Amazon Redshift data warehouse—for a zero-copy, real-time experience that drives outcomes with your customers. To learn more about how organizations are modernizing their data platforms using Amazon Redshift, visit the Amazon Redshift page.

Enhance your Amazon Redshift investment with ActionIQ.


About the authors

Mackenzie Johnson is a Senior Manager at ActionIQ. She is an innovative marketing strategist who’s passionate about the convergence of complementary technologies and amplifying joint value. With extensive experience across digital transformation storytelling, she thrives on educating enterprise businesses about the impact of CX based on a data-driven approach.

Phil Catterall is a Senior Product Manager at ActionIQ and leads product development on ActionIQ’s foundational data management, processing, and query federation capabilities. He’s passionate about designing and building scalable data products to empower business users in new ways.

Sain Das is a Senior Product Manager on the Amazon Redshift team and leads Amazon Redshift GTM for partner programs including the Powered by Amazon Redshift and Redshift Ready programs.

AWS Weekly Roundup: Global AWS Heroes Summit, AWS Lambda, Amazon Redshift, and more (July 22, 2024)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-global-aws-heroes-summit-aws-lambda-amazon-redshift-and-more-july-22-2024/

Last week, AWS Heroes from around the world gathered to celebrate the 10th anniversary of the AWS Heroes program at Global AWS Heroes Summit. This program recognizes a select group of AWS experts worldwide who go above and beyond in sharing their knowledge and making an impact within developer communities.

Matt Garman, CEO of AWS and a long-time supporter of developer communities, made a special appearance for a Q&A session with the Heroes to listen to their feedback and respond to their questions.

Here’s an epic photo from the AWS Heroes Summit:

As Matt mentioned in his LinkedIn post, “The developer community has been core to everything we have done since the beginning of AWS.” Thank you, Heroes, for all you do. Wishing you all a safe flight home.

Last week’s launches
Here are some launches that caught my attention last week:

Announcing the July 2024 updates to Amazon Corretto — The latest updates for the Corretto distribution of OpenJDK are now available. This includes security and critical updates for the Long-Term Supported (LTS) and Feature (FR) versions.

New open-source Advanced MySQL ODBC Driver now available for Amazon Aurora and RDS — The new AWS ODBC Driver for MySQL provides faster switchover and failover times, and authentication support for AWS Secrets Manager and AWS Identity and Access Management (IAM), making it a more efficient and secure option for connecting to Amazon RDS and Amazon Aurora MySQL-Compatible Edition databases.

Productionize Fine-tuned Foundation Models from SageMaker Canvas — Amazon SageMaker Canvas now allows you to deploy fine-tuned Foundation Models (FMs) to SageMaker real-time inference endpoints, making it easier to integrate generative AI capabilities into your applications outside the SageMaker Canvas workspace.

AWS Lambda now supports SnapStart for Java functions that use the ARM64 architecture — Lambda SnapStart for Java functions on ARM64 architecture delivers up to 10x faster function startup performance and up to 34% better price performance compared to x86, enabling the building of highly responsive and scalable Java applications using AWS Lambda.

Amazon QuickSight improves controls performance — Amazon QuickSight has improved the performance of controls, allowing readers to interact with them immediately without having to wait for all relevant controls to reload. This enhancement reduces the loading time experienced by readers.

Amazon OpenSearch Serverless levels up speed and efficiency with smart caching — The new smart caching feature for indexing in Amazon OpenSearch Serverless automatically fetches and manages data, leading to faster data retrieval, efficient storage usage, and cost savings.

Amazon Redshift Serverless with lower base capacity available in the Europe (London) Region — Amazon Redshift Serverless now allows you to start with a lower data warehouse base capacity of 8 Redshift Processing Units (RPUs) in the Europe (London) region, providing more flexibility and cost-effective options for small to large workloads.

AWS Lambda now supports Amazon MQ for ActiveMQ and RabbitMQ in five new regions — AWS Lambda now supports Amazon MQ for ActiveMQ and RabbitMQ in five new regions, enabling you to build serverless applications with Lambda functions that are invoked based on messages posted to Amazon MQ message brokers.

From community.aws
Here are my top five personal favorite posts from community.aws:

Upcoming AWS events
Check your calendars and sign up for upcoming AWS events:

AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. To learn more about future AWS Summit events, visit the AWS Summit page. Register in your nearest city: AWS Summit Taipei (July 23–24), AWS Summit Mexico City (Aug. 7), and AWS Summit Sao Paulo (Aug. 15).

AWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world. Upcoming AWS Community Days are in Aotearoa (Aug. 15), Nigeria (Aug. 24), New York (Aug. 28), and Belfast (Sept. 6).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Donnie

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Run Apache XTable on Amazon MWAA to translate open table formats

Post Syndicated from Matthias Rudolph original https://aws.amazon.com/blogs/big-data/run-apache-xtable-on-amazon-mwaa-to-translate-open-table-formats/

Open table formats (OTFs) like Apache Iceberg are being increasingly adopted, for example, to improve transactional consistency of a data lake or to consolidate batch and streaming data pipelines on a single file format and reduce complexity. In practice, architects need to integrate the chosen format with the various layers of a modern data platform. However, the level of support for the different OTFs varies across common analytical services.

Commercial vendors and the open source community have recognized this situation and are working on interoperability between table formats. One approach is to make a single physical dataset readable in different formats by translating its metadata and avoiding reprocessing of actual data files. Apache XTable is an open source solution that follows this approach and provides abstractions and tools for the translation of open table format metadata.

In this post, we show you how to get started with Apache XTable on AWS and how you can use it in a batch pipeline orchestrated with Amazon Managed Workflows for Apache Airflow (Amazon MWAA). To understand how XTable and similar solutions work, we start with a high-level background on metadata management in an OTF and then dive deeper into XTable and its usage.

Open table formats

Open table formats overcome the limitations of traditional data lake storage formats such as Apache Hive tables. They provide abstractions and capabilities known from relational databases, like transactional consistency and the ability to create, update, or delete single records. In addition, they help manage schema evolution.

In order to understand how the XTable metadata translation approach works, you must first understand how the metadata of an OTF is represented on the storage layer.

An OTF comprises a data layer and a metadata layer, which are both represented as files on storage. The data layer contains the data files. The metadata layer contains metadata files that keep track of the data files and the transactionally consistent sequence of changes to these. The following figure illustrates this configuration.

Inspecting the files of an Iceberg table on storage, we identify the metadata layer through the folder metadata. Adjacent to it are the data files—in this example, as snappy-compressed Parquet:

<table base folder>
├── metadata # contains metadata files
│ ├── 00000-6c64b16d-affa-4f0e-8635-a996ec13a7fa.metadata.json
│ ├── 23ba9e94-7819-4661-b698-f512f5b51054-m0.avro
│ └── snap-5514083086862319381-1-23ba9e94-7819-4661-b698-f512f5b51054.avro
└── part-00011-587322f1-1007-4500-a5cf-8022f6e7fa3c-c000.snappy.parquet # data files

Comparable to Iceberg, in Delta Lake, the metadata layer is represented through the folder _delta_log:

<table base folder>
├── _delta_log # contains metadata files
│ └── 00000000000000000000.json
└── part-00011-587322f1-1007-4500-a5cf-8022f6e7fa3c-c000.snappy.parquet # data files

Although the metadata layer varies in structure and capabilities between OTFs, it’s eventually just files on storage. Typically, it resides in the table’s base folder adjacent to the data files.

Now, the question emerges: what if metadata files of multiple different formats are stored in parallel for the same table?

Current approaches to interoperability do exactly that, as we will see in the next section.

Apache XTable

XTable is currently provided as a standalone Java binary. It translates the metadata layer between Apache Hudi, Apache Iceberg, or Delta Lake without rewriting data files and integrates with Iceberg-compatible catalogs like the AWS Glue Data Catalog.

In practice, XTable reads the latest snapshot of an input table and creates additional metadata for configurable target formats. It adds this additional metadata to the table on the storage layer—in addition to existing metadata.

Through this, you can choose either format, source or target, read the respective metadata, and get the same consistent view on the table’s data.

The following diagram illustrates the metadata translation process.

Let’s assume you have an existing Delta Lake table that you want to make readable as an Iceberg table. To run XTable, you invoke its Java binary and provide a dataset config file that specifies source and target format, as well as source table paths:

java -jar utilities-0.1.0-SNAPSHOT-bundled.jar \
	--datasetConfig datasetConfig.yaml

A minimal datasetConfig.yaml looks as follows, assuming the table is stored on Amazon Simple Storage Service (Amazon S3):

---
sourceFormat: DELTA
targetFormats:
  - ICEBERG
datasets:
  - tableBasePath: s3://<URI to base folder of table>
    tableName: <table name>
...

As shown in the following listing, XTable adds the Iceberg-specific metadata folder to the table’s base path in addition to the existing _delta_log folder. Now, clients can read the table in either Delta Lake or Iceberg format.

<table base folder>
├── _delta_log # Previously existing Delta Lake metadata
│   └── ...
├── metadata   # Added by XTable: Apache Iceberg metadata
│   └── ...
└── part-00011-587322f1-1007-4500-a5cf-8022f6e7fa3c-c000.snappy.parquet # data files

To register the Iceberg table in Data Catalog, pass a further config file to XTable that is responsible for Iceberg catalogs:

java -jar utilities-0.1.0-SNAPSHOT-bundled.jar \
	--datasetConfig datasetConfig.yaml \
	--icebergCatalogConfig glueDataCatalog.yaml

The minimal contents of glueDataCatalog.yaml are as follows. It configures XTable to use the Data Catalog-specific IcebergCatalog implementation provided by the iceberg-aws module, which is part of the Apache Iceberg core project:

---
catalogImpl: org.apache.iceberg.aws.glue.GlueCatalog
catalogName: glue
catalogOptions:
  warehouse: s3://<URI to base folder of Iceberg tables>
  catalog-impl: org.apache.iceberg.aws.glue.GlueCatalog
  io-impl: org.apache.iceberg.aws.s3.S3FileIO
... 

Run Apache XTable as an Airflow Operator

You can use XTable in batch data pipelines that write tables on the data lake and make sure these are readable in different file formats. For instance, operating in the Delta Lake ecosystem, a data pipeline might create Delta tables, which need to be accessible as Iceberg tables as well.

One tool to orchestrate data pipelines on AWS is Amazon MWAA, which is a managed service for Apache Airflow. In the following sections, we explore how XTable can run within a custom Airflow Operator on Amazon MWAA. We elaborate on the initial design of such an Operator and demonstrate its deployment on Amazon MWAA.

Why a custom operator? Although XTable could also be invoked from a BashOperator directly, we chose to wrap this step in a custom operator to allow for configuration through Airflow’s native programming language (Python) and operator parameters only. For a background on how to write custom operators, see Creating a custom operator.

The following diagram illustrates the dependency between the operator and XTable’s binary.

Input parameters of the Operator

XTable’s primary inputs are YAML-based configuration files:

  • Dataset config – Contains source format, target formats, and source tables
  • Iceberg catalog config (optional) – Contains the reference to an external Iceberg catalog into which to register the table in the target format

We choose to reflect the data structures of the YAML files in the Operator’s input parameters, as listed in the following table.

Parameter Type Values
dataset_config dict Contents of dataset config as dict literal
iceberg_catalog_config dict Contents of Iceberg catalog config as dict literal

As the Operator runs, the YAML files are generated from the input parameters.

Refer to XTable’s GitHub repository for a full reference of all possible dict keys.

Example parameterization

The following example shows the configuration to translate a table from Delta Lake to both Iceberg and Hudi. The attribute dataset_config reflects the structure of the dataset config file through a Python dict literal:

from mwaa_plugin.plugin import XtableOperator
operator = XtableOperator(
    task_id="xtableTask",
    dataset_config={
        "sourceFormat": "DELTA",
        "targetFormats": ["ICEBERG", "HUDI"],
        "datasets": [
            {
                "tableBasePath": "s3://datalake/sales",
                "tableName": "sales",
                "namespace": "table",
            }
        ],
    }
)

Sample code: The full source code of the sample XtableOperator and all other code used in this post is provided through this GitHub repository.

Solution overview

To deploy the custom operator to Amazon MWAA, we upload it together with DAGs into the configured DAG folder.

Besides the operator itself, we also need to upload XTable’s executable JAR. As of writing this post, the JAR needs to be compiled by the user from source code. To simplify this, we provide a container-based build script.

Prerequisites

We assume you have at least an environment consisting of Amazon MWAA itself, an S3 bucket, and an AWS Identity and Access Management (IAM) role for Amazon MWAA that has read access to the bucket and optionally write access to the AWS Glue Data Catalog.

In addition, you need one of the following container runtimes to run the provided build script for XTable:

  • Finch
  • Docker

Build and deploy the XTableOperator

To compile XTable, you can use the provided build script and complete the following steps:

  1. Clone the sample code from GitHub:
    git clone https://github.com/aws-samples/apache-xtable-on-aws-samples.git
    cd apache-xtable-on-aws-samples

  2. Run the build script:
    ./build-airflow-operator.sh

  3. Because the Airflow operator uses the library JPype to invoke XTable’s JAR, add a dependency in the Amazon MWAA requirements.txt file:
    JPype1==1.5.0

    For a background on installing additional Python libraries on Amazon MWAA, see Installing Python dependencies.
    Because XTable is Java-based, a Java 11 runtime environment (JRE) is required on Amazon MWAA. You can use the Amazon MWAA startup script to install a JRE.

  4. Add the following lines to an existing startup script or create a new one as provided in the sample code base of this post:
    if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]]
    then
        sudo yum install -y java-11-amazon-corretto-headless
    fi

    For more information about this mechanism, see Using a startup script with Amazon MWAA.

  5. Upload xtable_operator/, requirements.txt, startup.sh, and .airflowignore to the S3 bucket and the respective paths from which Amazon MWAA will read files.
    Make sure the IAM role for Amazon MWAA has appropriate read permissions.
    With regard to the custom operator, make sure to upload the local folder xtable_operator/ and .airflowignore into the configured DAG folder.

  6. Update the configuration of your Amazon MWAA environment as follows and start the update process:
    1. Add or update the S3 URI to the requirements.txt file through the Requirements file configuration option.
    2. Add or update the S3 URI to the startup.sh script through Startup script configuration option.
  7. Optionally, you can use the AWS Glue Data Catalog as an Iceberg catalog. In case you create Iceberg metadata and want to register it in the AWS Glue Data Catalog, the Amazon MWAA role needs permissions to create or modify tables in AWS Glue. The following listing shows a minimal policy for this. It constrains permissions to a defined database in AWS Glue:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:CreateTable",
                    "glue:GetTables",
                    "glue:UpdateTable",
                    "glue:GetDatabases",
                    "glue:GetTable"
                ],
                "Resource": [
                    "arn:aws:glue:<AWS Region>:<AWS Account ID>:catalog",
                    "arn:aws:glue:<AWS Region>:<AWS Account ID>:database/<Database name>",
                    "arn:aws:glue:<AWS Region>:<AWS Account ID>:table/<Database name>/*"
                ]
            }
        ]
    }

Using the XTableOperator in practice: Delta Lake to Apache Iceberg

Let’s look into a practical example that uses the XTableOperator. We continue the scenario of a data pipeline in the Delta Lake ecosystem and assume it is implemented as a DAG on Amazon MWAA. The following figure shows our example batch pipeline.

The pipeline uses an Apache Spark job that is run by AWS Glue to write a Delta table into an S3 bucket. Additionally, the table is made accessible as an Iceberg table without data duplication. Finally, we want to load the Iceberg table into Amazon Redshift, which is a fully managed, petabyte-scale data warehouse service in the cloud.

As shown in the following screenshot of the graph visualization of the example DAG, we run the XTableOperator after creating the Delta table through a Spark job. Then we use the RedshiftDataOperator to refresh a materialized view, which is used in downstream transformations as a source table. Materialized views are a common construct to precompute complex queries on large tables. In this example, we use them to simplify data loading into Amazon Redshift because of the incremental update capabilities in combination with Iceberg.
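As an illustration, the following sketch shows how such a materialized view could be defined over the Iceberg table registered in the Data Catalog and then refreshed from the DAG. The external schema, Data Catalog database, IAM role, table, and column names are hypothetical:

-- Expose the Data Catalog database that holds the Iceberg table created by XTable
CREATE EXTERNAL SCHEMA IF NOT EXISTS datalake
FROM DATA CATALOG
DATABASE 'sales_db'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<spectrum-role>';

-- Materialized view that downstream transformations use as their source table
CREATE MATERIALIZED VIEW sales_mv AS
SELECT order_id, customer_id, order_amt, order_date
FROM datalake.sales;

-- Statement issued through the RedshiftDataOperator after each pipeline run
REFRESH MATERIALIZED VIEW sales_mv;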

The input parameters of the XTableOperator are as follows:

operator = XtableOperator(
    task_id="xtableTask",
    dataset_config={
        "sourceFormat": "DELTA",
        "targetFormats": ["ICEBERG"],
        "datasets": [
            {
                "tableBasePath": "s3://<datalake>/<sales>",
                "tableName": "sales",
                "namespace": "table",
            }
        ],
    },
    iceberg_catalog_config={
        "catalogImpl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "catalogName": "glue",
        "catalogOptions": {
            "warehouse": "s3://datalake/sales",
            "catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            "io-impl": "org.apache.iceberg.aws.s3.S3FileIO"
    }
)

The XTableOperator creates Apache Iceberg metadata on Amazon S3 and registers a table accordingly in the Data Catalog. The following screenshot shows the created Iceberg table. AWS Glue stores a pointer to Iceberg’s most recent metadata file. As updates are applied to the table and new metadata files are created, XTable updates the pointer after each job.

Amazon Redshift is able to discover the Iceberg table through the Data Catalog and read it using Amazon Redshift Spectrum.

Summary

In this post, we showed how Apache XTable translates the metadata layer of open table formats without data duplication. This provides advantages from both a cost and data integrity perspective, especially in large-scale environments, and allows for the migration of an existing historical estate of datasets. We also discussed how you can implement a custom Airflow operator that embeds Apache XTable into data pipelines on Amazon MWAA.

For further reading, visit What’s new with Amazon MWAA and Apache XTable’s website. For more examples of custom operators, refer to the following GitHub repository.


About the Authors

Matthias Rudolph is an Associate Solutions Architect, digitalizing the German manufacturing industry.

Stephen Said is a Senior Solutions Architect and works with Retail/CPG customers. His areas of interest are data platforms and cloud-native software engineering.

How EchoStar ingests terabytes of data daily across its 5G Open RAN network in near real-time using Amazon Redshift Serverless Streaming Ingestion

Post Syndicated from Balaram Mathukumilli original https://aws.amazon.com/blogs/big-data/how-echostar-ingests-terabytes-of-data-daily-across-its-5g-open-ran-network-in-near-real-time-using-amazon-redshift-serverless-streaming-ingestion/

This post was co-written with Balaram Mathukumilli, Viswanatha Vellaboyana and Keerthi Kambam from DISH Wireless, a wholly owned subsidiary of EchoStar.

EchoStar, a connectivity company providing television entertainment, wireless communications, and award-winning technology to residential and business customers throughout the US, deployed the first standalone, cloud-native Open RAN 5G network on AWS public cloud.

Amazon Redshift Serverless is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, simple, and secure analytics at scale. Amazon Redshift data sharing allows you to share data within and across organizations, AWS Regions, and even third-party providers, without moving or copying the data. Additionally, it allows you to use multiple warehouses of different types and sizes for extract, transform, and load (ETL) jobs so you can tune your warehouses based on your write workloads’ price-performance needs.

You can use the Amazon Redshift Streaming Ingestion capability to update your analytics data warehouse in near real time. Redshift Streaming Ingestion simplifies data pipelines by letting you create materialized views directly on top of data streams. With this capability in Amazon Redshift, you can use SQL to connect to and directly ingest data from data streams, such as Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK), and pull data directly to Amazon Redshift.

EchoStar uses Redshift Streaming Ingestion to ingest over 10 TB of data daily from more than 150 MSK topics in near real time across its Open RAN 5G network. This post provides an overview of real-time data analysis with Amazon Redshift and how EchoStar uses it to ingest hundreds of megabytes per second. As data sources and volumes grew across its network, EchoStar migrated from a single Redshift Serverless workgroup to a multi-warehouse architecture with live data sharing. This resulted in improved performance for ingesting and analyzing their rapidly growing data.

“By adopting the strategy of ‘parse and transform later,’ and establishing an Amazon Redshift data warehouse farm with a multi-cluster architecture, we leveraged the power of Amazon Redshift for direct streaming ingestion and data sharing.

“This innovative approach improved our data latency, reducing it from two–three days to an average of 37 seconds. Additionally, we achieved better scalability, with Amazon Redshift direct streaming ingestion supporting over 150 MSK topics.”

—Sandeep Kulkarni, VP, Software Engineering & Head of Wireless OSS Platforms at EchoStar

EchoStar use case

EchoStar needed to provide near real-time access to 5G network performance data for downstream consumers and interactive analytics applications. This data is sourced from the 5G network EMS observability infrastructure and is streamed in near real-time using AWS services like AWS Lambda and AWS Step Functions. The streaming data produced many small files, ranging from bytes to kilobytes. To efficiently integrate this data, a messaging system like Amazon MSK was required.

EchoStar was processing over 150 MSK topics from their messaging system, with each topic containing around 1 billion rows of data per day. This resulted in an average total data volume of 10 TB per day. To use this data, EchoStar needed to visualize it, perform spatial analysis, join it with third-party data sources, develop end-user applications, and use the insights to make near real-time improvements to their terrestrial 5G network. EchoStar needed a solution that does the following:

  • Optimize parsing and loading of over 150 MSK topics to enable downstream workloads to run simultaneously without impacting each other
  • Allow hundreds of queries to run in parallel with desired query throughput
  • Seamlessly scale capacity with the increase in user base and maintain cost-efficiency

Solution overview

EchoStar migrated from a single Redshift Serverless workgroup to a multi-warehouse Amazon Redshift architecture in partnership with AWS. The new architecture enables workload isolation by separating streaming ingestion and ETL jobs from analytics workloads across multiple Redshift compute instances, while providing live data sharing using a single copy of the data between the warehouses. This architecture takes advantage of AWS capabilities to scale Redshift streaming ingestion jobs and isolate workloads while maintaining data access.

The following diagram shows the high-level end-to-end serverless architecture and overall data pipeline.

Architecture Diagram

The solution consists of the following key components:

  • Primary ETL Redshift Serverless workgroup – A primary ETL producer workgroup of size 392 RPU
  • Secondary Redshift Serverless workgroups – Additional producer workgroups of varying sizes to distribute and scale near real-time data ingestion from over 150 MSK topics based on price-performance requirements
  • Consumer Redshift Serverless workgroup – A consumer workgroup instance to run analytics using Tableau

To efficiently load multiple MSK topics into Redshift Serverless in parallel, we first identified the topics with the highest data volumes in order to determine the appropriate sizing for secondary workgroups.

We began by initially sizing the system to a Redshift Serverless workgroup of 64 RPU. We then onboarded a small number of MSK topics and created the related streaming materialized views. We incrementally added more materialized views, evaluating overall ingestion cost, performance, and latency within a single workgroup. This initial benchmarking gave us a solid baseline for onboarding the remaining MSK topics across multiple workgroups.

In addition to a multi-warehouse approach and workgroup sizing, we optimized such large-scale data volume ingestion with an average latency of 37 seconds by splitting ingestion jobs into two steps:

  • Streaming materialized views – Use JSON_PARSE to ingest data from MSK topics in Amazon Redshift
  • Flattening materialized views – Shred and perform transformations as a second step, reading data from the respective streaming materialized view

The following diagram depicts the high-level approach.

MSK to Redshift

Best practices

In this section, we share some of the best practices we observed while implementing this solution:

  • We performed an initial Redshift Serverless workgroup sizing based on three key factors:
    • Number of records per second per MSK topic
    • Average record size per MSK topic
    • Desired latency SLA
  • Additionally, we created only one streaming materialized view for a given MSK topic. Creation of multiple materialized views per MSK topic can slow down the ingestion performance because each materialized view becomes a consumer for that topic and shares the Amazon MSK bandwidth for that topic.
  • While defining the streaming materialized view, we avoided using JSON_EXTRACT_PATH_TEXT to pre-shred data, because json_extract_path_text operates on the data row by row, which significantly impacts ingestion throughput. Instead, we adopted JSON_PARSE with the CAN_JSON_PARSE function to ingest data from the stream at lowest latency and to guard against errors. The following is a sample SQL query we used for the MSK topics (the actual data source names have been masked due to security reasons):
CREATE MATERIALIZED VIEW <source-name>_streaming_mvw AUTO REFRESH YES AS
SELECT
    kafka_partition,
    kafka_offset,
    refresh_time,
    case when CAN_JSON_PARSE(kafka_value) = true then JSON_PARSE(kafka_value) end as Kafka_Data,
    case when CAN_JSON_PARSE(kafka_value) = false then kafka_value end as Invalid_Data
FROM
    external_<source-name>."<source-name>_mvw";
  • We kept the streaming materialized views simple and moved all transformations like unnesting, aggregation, and case expressions to a later step as flattening materialized views. The following is a sample SQL query we used to flatten data by reading the streaming materialized views created in the previous step (the actual data source and column names have been masked due to security reasons):
CREATE MATERIALIZED VIEW <source-name>_flatten_mvw AUTO REFRESH NO AS
SELECT
    kafka_data."<column1>" :: integer as "<column1>",
    kafka_data."<column2>" :: integer as "<column2>",
    kafka_data."<column3>" :: bigint as "<column3>",
    … 
    …
    …
    …
FROM
    <source-name>_streaming_mvw;
  • The streaming materialized views were set to auto refresh so that they can continuously ingest data into Amazon Redshift from MSK topics.
  • The flattening materialized views were set to manual refresh based on SLA requirements using Amazon Managed Workflows for Apache Airflow (Amazon MWAA).
  • We skipped defining any sort key in the streaming materialized views to further accelerate the ingestion speed.
  • Lastly, we used the SYS_MV_REFRESH_HISTORY and SYS_STREAM_SCAN_STATES system views to monitor the streaming ingestion refreshes and latencies (see the sample queries after this list).
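The following sketch shows both of these steps: the manual refresh issued on the Amazon MWAA schedule and a monitoring query against SYS_MV_REFRESH_HISTORY. The materialized view names are placeholders, and the selected columns follow the documented system view, so verify them against your Amazon Redshift version:

-- Manual refresh of a flattening materialized view, run on the SLA-driven schedule
REFRESH MATERIALIZED VIEW <source-name>_flatten_mvw;

-- Recent refresh activity for a given streaming materialized view
SELECT mv_name, refresh_type, status, start_time, end_time
FROM sys_mv_refresh_history
WHERE mv_name = '<source-name>_streaming_mvw'
ORDER BY start_time DESC
LIMIT 20;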

For more information about best practices and monitoring techniques, refer to Best practices to implement near-real-time analytics using Amazon Redshift Streaming Ingestion with Amazon MSK.

Results

EchoStar saw improvements with this solution in both performance and scalability across their 5G Open RAN network.

Performance

By isolating and scaling Redshift Streaming Ingestion refreshes across multiple Redshift Serverless workgroups, EchoStar met their latency SLA requirements. We used the following SQL query to measure latencies:

WITH curr_qry as (
    SELECT
        mv_name,
        cast(partition_id as int) as partition_id,
        max(query_id) as current_query_id
    FROM
        sys_stream_scan_states
    GROUP BY
        mv_name,
        cast(partition_id as int)
)
SELECT
    strm.mv_name,
    tmp.partition_id,
    min(datediff(second, stream_record_time_max, record_time)) as min_latency_in_secs,
    max(datediff(second, stream_record_time_min, record_time)) as max_latency_in_secs
FROM
    sys_stream_scan_states strm,
    curr_qry tmp
WHERE
    strm.query_id = tmp.current_query_id
    and strm.mv_name = tmp.mv_name
    and strm.partition_id = tmp.partition_id
GROUP BY 1,2
ORDER BY 1,2;

When we further aggregate the preceding query to only the mv_name level (removing partition_id, which uniquely identifies a partition in an MSK topic), we find the average daily performance results achieved on a Redshift Serverless workgroup of size 64 RPU, as shown in the following table. (The actual materialized view names have been hashed for security reasons because they map to an external vendor name and data source.)

S.No. stream_name_hash min_latency_secs max_latency_secs avg_records_per_day
1 e022b6d13d83faff02748d3762013c 1 6 186,395,805
2 a8cc0770bb055a87bbb3d37933fc01 1 6 186,720,769
3 19413c1fc8fd6f8e5f5ae009515ffb 2 4 5,858,356
4 732c2e0b3eb76c070415416c09ffe0 3 27 12,494,175
5 8b4e1ffad42bf77114ab86c2ea91d6 3 4 149,927,136
6 70e627d11eba592153d0f08708c0de 5 5 121,819
7 e15713d6b0abae2b8f6cd1d2663d94 5 31 148,768,006
8 234eb3af376b43a525b7c6bf6f8880 6 64 45,666
9 38e97a2f06bcc57595ab88eb8bec57 7 100 45,666
10 4c345f2f24a201779f43bd585e53ba 9 12 101,934,969
11 a3b4f6e7159d9b69fd4c4b8c5edd06 10 14 36,508,696
12 87190a106e0889a8c18d93a3faafeb 13 69 14,050,727
13 b1388bad6fc98c67748cc11ef2ad35 25 118 509
14 cf8642fccc7229106c451ea33dd64d 28 66 13,442,254
15 c3b2137c271d1ccac084c09531dfcd 29 74 12,515,495
16 68676fc1072f753136e6e992705a4d 29 69 59,565
17 0ab3087353bff28e952cd25f5720f4 37 71 12,775,822
18 e6b7f10ea43ae12724fec3e0e3205c 39 83 2,964,715
19 93e2d6e0063de948cc6ce2fb5578f2 45 45 1,969,271
20 88cba4fffafd085c12b5d0a01d0b84 46 47 12,513,768
21 d0408eae66121d10487e562bd481b9 48 57 12,525,221
22 de552412b4244386a23b4761f877ce 52 52 7,254,633
23 9480a1a4444250a0bc7a3ed67eebf3 58 96 12,522,882
24 db5bd3aa8e1e7519139d2dc09a89a7 60 103 12,518,688
25 e6541f290bd377087cdfdc2007a200 71 83 176,346,585
26 6f519c71c6a8a6311f2525f38c233d 78 115 100,073,438
27 3974238e6aff40f15c2e3b6224ef68 79 82 12,770,856
28 7f356f281fc481976b51af3d76c151 79 96 75,077
29 e2e8e02c7c0f68f8d44f650cd91be2 92 99 12,525,210
30 3555e0aa0630a128dede84e1f8420a 97 105 8,901,014
31 7f4727981a6ba1c808a31bd2789f3a 108 110 11,599,385

All 31 materialized views running and refreshing concurrently and continuously show a minimum latency of 1 second and a maximum latency of 118 seconds over the last 7 days, meeting EchoStar’s SLA requirements.
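For reference, the following query shows the aggregation to the mv_name level behind the preceding table; it is the same query as before with partition_id dropped from the grouping:

WITH curr_qry as (
    SELECT
        mv_name,
        cast(partition_id as int) as partition_id,
        max(query_id) as current_query_id
    FROM
        sys_stream_scan_states
    GROUP BY
        mv_name,
        cast(partition_id as int)
)
SELECT
    strm.mv_name,
    min(datediff(second, stream_record_time_max, record_time)) as min_latency_in_secs,
    max(datediff(second, stream_record_time_min, record_time)) as max_latency_in_secs
FROM
    sys_stream_scan_states strm,
    curr_qry tmp
WHERE
    strm.query_id = tmp.current_query_id
    and strm.mv_name = tmp.mv_name
    and strm.partition_id = tmp.partition_id
GROUP BY 1
ORDER BY 1;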

Scalability

With this data sharing enabled multi-warehouse architecture, EchoStar can now quickly scale their Redshift compute resources on demand to onboard the remaining 150 MSK topics. In addition, as their data sources and MSK topics increase further, they can quickly add additional Redshift Serverless workgroups (for example, another 128 RPU workgroup) to meet their desired SLA requirements.

Conclusion

By using the scalability of Amazon Redshift and a multi-warehouse architecture with data sharing, EchoStar delivers near real-time access to over 150 million rows of data across over 150 MSK topics, totaling 10 TB ingested daily, to their users.

This split multi-producer/consumer model of Amazon Redshift can bring benefits to many workloads that have similar performance characteristics as EchoStar’s warehouse. With this pattern, you can scale your workload to meet SLAs while optimizing for price and performance. Please reach out to your AWS Account Team to engage an AWS specialist for additional help or for a proof of concept.


About the authors

Balaram Mathukumilli is Director, Enterprise Data Services at DISH Wireless. He is deeply passionate about Data and Analytics solutions. With 20+ years of experience in Enterprise and Cloud transformation, he has worked across domains such as PayTV, Media Sales, Marketing and Wireless. Balaram works closely with the business partners to identify data needs, data sources, determine data governance, develop data infrastructure, build data analytics capabilities, and foster a data-driven culture to ensure their data assets are properly managed, used effectively, and are secure

Viswanatha Vellaboyana, a Solutions Architect at DISH Wireless, is deeply passionate about Data and Analytics solutions. With 20 years of experience in enterprise and cloud transformation, he has worked across domains such as Media, Media Sales, Communication, and Health Insurance. He collaborates with enterprise clients, guiding them in architecting, building, and scaling applications to achieve their desired business outcomes.

Keerthi Kambam is a Senior Engineer at DISH Network specializing in AWS Services. She builds scalable data engineering and analytical solutions for dish customer faced applications. She is passionate about solving complex data challenges with cloud solutions.

Raks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.

Adi Eswar has been a core member of the AI/ML and Analytics Specialist team, leading the customer experience for customers’ existing workloads and driving key initiatives as part of the Analytics Customer Experience Program and Redshift enablement for AWS-TELCO customers. He spends his free time exploring new food, cultures, national parks, and museums with his family.

Shirin Bhambhani is a Senior Solutions Architect at AWS. She works with customers to build solutions and accelerate their cloud migration journey. She enjoys simplifying customer experiences on AWS.

Vinayak Rao is a Senior Customer Solutions Manager at AWS. He collaborates with customers, partners, and internal AWS teams to drive customer success, delivery of technical solutions, and cloud adoption.

Automate data loading from your database into Amazon Redshift using AWS Database Migration Service (DMS), AWS Step Functions, and the Redshift Data API

Post Syndicated from Ritesh Sinha original https://aws.amazon.com/blogs/big-data/automate-data-loading-from-your-database-into-amazon-redshift-using-aws-database-migration-service-dms-aws-step-functions-and-the-redshift-data-api/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

As more and more data is being generated, collected, processed, and stored in many different systems, making the data available for end-users at the right place and right time is a very important aspect for data warehouse implementation. A fully automated and highly scalable ETL process helps minimize the operational effort that you must invest in managing the regular ETL pipelines. It also provides timely refreshes of data in your data warehouse.

You can approach the data integration process in two ways:

  • Full load – This method involves completely reloading all the data within a specific data warehouse table or dataset
  • Incremental load – This method focuses on updating or adding only the changed or new data to the existing dataset in a data warehouse

This post discusses how to automate ingestion of source data that changes completely and has no way to track the changes. This is useful for customers who want to use this data in Amazon Redshift; some examples of such data are products and bills of materials without tracking details at the source.

We show how to build an automatic extract and load process from various relational database systems into a data warehouse for full load only. A full load is performed from SQL Server to Amazon Redshift using AWS Database Migration Service (AWS DMS). When Amazon EventBridge receives a full load completion notification from AWS DMS, ETL processes are run on Amazon Redshift to process data. AWS Step Functions is used to orchestrate this ETL pipeline. Alternatively, you could use Amazon Managed Workflows for Apache Airflow (Amazon MWAA), a managed orchestration service for Apache Airflow that makes it straightforward to set up and operate end-to-end data pipelines in the cloud.

Solution overview

The workflow consists of the following steps:

  1. The solution uses an AWS DMS migration task that replicates the full load dataset from the configured SQL Server source to a target Redshift cluster in a staging area.
  2. AWS DMS publishes the replicationtaskstopped event to EventBridge when the replication task is complete, which invokes an EventBridge rule.
  3. EventBridge routes the event to a Step Functions state machine.
  4. The state machine calls a Redshift stored procedure through the Redshift Data API, which loads the dataset from the staging area to the target production tables. With this API, you can also access Redshift data with web-based service applications, including AWS Lambda.

The following architecture diagram highlights the end-to-end solution using AWS services.

In the following sections, we demonstrate how to create the full load AWS DMS task, configure the ETL orchestration on Amazon Redshift, create the EventBridge rule, and test the solution.

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  • An AWS account
  • A SQL Server database configured as a replication source for AWS DMS
  • A Redshift cluster to serve as the target database
  • An AWS DMS replication instance to migrate data from source to target
  • A source endpoint pointing to the SQL Server database
  • A target endpoint pointing to the Redshift cluster

Create the full load AWS DMS task

Complete the following steps to set up your migration task:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Choose Create task.
  3. For Task identifier, enter a name for your task, such as dms-full-dump-task.
  4. Choose your replication instance.
  5. Choose your source endpoint.
  6. Choose your target endpoint.
  7. For Migration type, choose Migrate existing data.

  1. In the Table mapping section, under Selection rules, choose Add new selection rule
  2. For Schema, choose Enter a schema.
  3. For Schema name, enter a name (for example, dms_sample).
  4. Keep the remaining settings as default and choose Create task.

The following screenshot shows your completed task on the AWS DMS console.

Create Redshift tables

Create the following tables on the Redshift cluster using the Redshift query editor:

  • dbo.dim_cust – Stores customer attributes:
CREATE TABLE dbo.dim_cust (
    cust_key integer ENCODE az64,
    cust_id character varying(10) ENCODE lzo,
    cust_name character varying(100) ENCODE lzo,
    cust_city character varying(50) ENCODE lzo,
    cust_rev_flg character varying(1) ENCODE lzo
)
DISTSTYLE AUTO;
  • dbo.fact_sales – Stores customer sales transactions:
CREATE TABLE dbo.fact_sales (
    order_number character varying(20) ENCODE lzo,
    cust_key integer ENCODE az64,
    order_amt numeric(18,2) ENCODE az64
)
DISTSTYLE AUTO;
  • dbo.fact_sales_stg – Stores daily customer incremental sales transactions:
CREATE TABLE dbo.fact_sales_stg (
    order_number character varying(20) ENCODE lzo,
    cust_id character varying(10) ENCODE lzo,
    order_amt numeric(18,2) ENCODE az64
)
DISTSTYLE AUTO;

Use the following INSERT statements to load sample data into the sales staging table:

insert into dbo.fact_sales_stg(order_number,cust_id,order_amt) values (100,100,200);
insert into dbo.fact_sales_stg(order_number,cust_id,order_amt) values (101,100,300);
insert into dbo.fact_sales_stg(order_number,cust_id,order_amt) values (102,101,25);
insert into dbo.fact_sales_stg(order_number,cust_id,order_amt) values (103,101,35);
insert into dbo.fact_sales_stg(order_number,cust_id,order_amt) values (104,102,80);
insert into dbo.fact_sales_stg(order_number,cust_id,order_amt) values (105,102,45);

Create the stored procedures

In the Redshift query editor, create the following stored procedures to process customer and sales transaction data (a manual test of both procedures follows the list):

  • sp_load_cust_dim() – This procedure compares the customer dimension with incremental customer data in staging and populates the customer dimension:
CREATE OR REPLACE PROCEDURE dbo.sp_load_cust_dim()
LANGUAGE plpgsql
AS $$
BEGIN
truncate table dbo.dim_cust;
insert into dbo.dim_cust(cust_key,cust_id,cust_name,cust_city) values (1,100,'abc','chicago');
insert into dbo.dim_cust(cust_key,cust_id,cust_name,cust_city) values (2,101,'xyz','dallas');
insert into dbo.dim_cust(cust_key,cust_id,cust_name,cust_city) values (3,102,'yrt','new york');
update dbo.dim_cust
set cust_rev_flg=case when cust_city='new york' then 'Y' else 'N' end
where cust_rev_flg is null;
END;
$$
  • sp_load_fact_sales() – This procedure transforms the incremental order data by joining it with the customer dimension and populates the corresponding dimension key in the final sales fact table:
CREATE OR REPLACE PROCEDURE dbo.sp_load_fact_sales()
LANGUAGE plpgsql
AS $$
BEGIN
--Process Fact Sales
insert into dbo.fact_sales
select
sales_fct.order_number,
cust.cust_key as cust_key,
sales_fct.order_amt
from dbo.fact_sales_stg sales_fct
--join to customer dim
inner join (select * from dbo.dim_cust) cust on sales_fct.cust_id=cust.cust_id;
END;
$$
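Before wiring up the orchestration, you can optionally test the procedures directly from the query editor and inspect the loaded tables:

-- Run the ETL steps manually
CALL dbo.sp_load_cust_dim();
CALL dbo.sp_load_fact_sales();

-- Inspect the results
SELECT * FROM dbo.dim_cust ORDER BY cust_key;
SELECT * FROM dbo.fact_sales ORDER BY order_number;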

Create the Step Functions state machine

Complete the following steps to create the state machine redshift-elt-load-customer-sales. This state machine is invoked as soon as the AWS DMS full load task for the customer table is complete.

  1. On the Step Functions console, choose State machines in the navigation pane.
  2. Choose Create state machine.
  3. For Template, choose Blank.
  4. On the Actions dropdown menu, choose Import definition to import the workflow definition of the state machine.

  1. Open your preferred text editor and save the following code as an ASL file extension (for example, redshift-elt-load-customer-sales.ASL). Provide your Redshift cluster ID and the secret ARN for your Redshift cluster.
{
    "Comment": "State Machine to process ETL for Customer Sales Transactions",
    "StartAt": "Load_Customer_Dim",
    "States": {
        "Load_Customer_Dim": {
            "Type": "Task",
            "Parameters": {
                "ClusterIdentifier": "redshiftcluster-abcd",
                "Database": "dev",
                "Sql": "call dbo.sp_load_cust_dim()",
                "SecretArn": "arn:aws:secretsmanager:us-west-2:xxx:secret:rs-cluster-secret-abcd"
            },
            "Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
            "Next": "Wait on Load_Customer_Dim"
        },
        "Wait on Load_Customer_Dim": {
            "Type": "Wait",
            "Seconds": 30,
            "Next": "Check_Status_Load_Customer_Dim"
        },
        "Check_Status_Load_Customer_Dim": {
            "Type": "Task",
            "Next": "Choice",
            "Parameters": {
                "Id.$": "$.Id"
            },
            "Resource": "arn:aws:states:::aws-sdk:redshiftdata:describeStatement"
        },
        "Choice": {
            "Type": "Choice",
            "Choices": [
                {
                    "Not": {
                        "Variable": "$.Status",
                        "StringEquals": "FINISHED"
                    },
                    "Next": "Wait on Load_Customer_Dim"
                }
            ],
            "Default": "Load_Sales_Fact"
        },
        "Load_Sales_Fact": {
            "Type": "Task",
            "End": true,
            "Parameters": {
                "ClusterIdentifier": "redshiftcluster-abcd",
                "Database": "dev",
                "Sql": "call dbo.sp_load_fact_sales()",
                "SecretArn": "arn:aws:secretsmanager:us-west-2:xxx:secret:rs-cluster-secret-abcd"
            },
            "Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement"
        }
    }
}
  1. Choose Choose file and upload the ASL file to create a new state machine.

  1. For State machine name, enter a name for the state machine (for example, redshift-elt-load-customer-sales).
  2. Choose Create.

After the successful creation of the state machine, you can verify the details as shown in the following screenshot.

The following diagram illustrates the state machine workflow.

The state machine includes the following steps:

  • Load_Customer_Dim – Performs the following actions:
    • Passes the stored procedure sp_load_cust_dim to the execute-statement API to run in the Redshift cluster to load the incremental data for the customer dimension
    • Sends data back the identifier of the SQL statement to the state machine
  • Wait_on_Load_Customer_Dim – Waits 30 seconds before checking the status of the SQL statement
  • Check_Status_Load_Customer_Dim – Invokes the Data API’s describeStatement to get the status of the API call
  • is_run_Load_Customer_Dim_complete – Routes the next step of the ETL workflow depending on its status:
    • FINISHED – Passes the stored procedure Load_Sales_Fact to the execute-statement API to run in the Redshift cluster, which loads the incremental data for fact sales and populates the corresponding keys from the customer and date dimensions
    • All other statuses – Goes back to the wait_on_load_customer_dim step to wait for the SQL statements to finish

The state machine redshift-elt-load-customer-sales loads the dim_cust, fact_sales_stg, and fact_sales tables when invoked by the EventBridge rule.
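
If you want to prototype the same submit-and-poll pattern outside Step Functions, the following minimal sketch uses the Redshift Data API through boto3; the cluster identifier, database, and secret ARN are placeholders that you would replace with your own values.

import time
import boto3

# Placeholders: replace with your cluster ID, database, and Secrets Manager ARN
CLUSTER_ID = "redshiftcluster-abcd"
DATABASE = "dev"
SECRET_ARN = "arn:aws:secretsmanager:us-west-2:123456789012:secret:rs-cluster-secret-abcd"

client = boto3.client("redshift-data")

# Submit the stored procedure call, as the Load_Customer_Dim state does
response = client.execute_statement(
    ClusterIdentifier=CLUSTER_ID,
    Database=DATABASE,
    SecretArn=SECRET_ARN,
    Sql="call dbo.sp_load_cust_dim()",
)

# Poll describe_statement until the statement finishes, as the Wait/Check/Choice states do
while True:
    status = client.describe_statement(Id=response["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(30)

print(f"Statement {response['Id']} completed with status {status}")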

As an optional step, you can set up event-based notifications on completion of the state machine to invoke any downstream actions, such as Amazon Simple Notification Service (Amazon SNS) or further ETL processes.
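
As a hedged sketch of such a notification, the following boto3 snippet creates an EventBridge rule that matches successful executions of the state machine and targets an SNS topic; the state machine and topic ARNs are placeholders, and the topic’s access policy must allow EventBridge to publish to it.

import json
import boto3

events = boto3.client("events")

# Placeholders: replace with your state machine and SNS topic ARNs
STATE_MACHINE_ARN = "arn:aws:states:us-west-2:123456789012:stateMachine:redshift-elt-load-customer-sales"
SNS_TOPIC_ARN = "arn:aws:sns:us-west-2:123456789012:elt-notifications"

# Rule that matches successful executions of the state machine
events.put_rule(
    Name="notify-on-elt-completion",
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["aws.states"],
        "detail-type": ["Step Functions Execution Status Change"],
        "detail": {
            "status": ["SUCCEEDED"],
            "stateMachineArn": [STATE_MACHINE_ARN],
        },
    }),
)

# Send matched events to the SNS topic (the topic policy must allow events.amazonaws.com to publish)
events.put_targets(
    Rule="notify-on-elt-completion",
    Targets=[{"Id": "sns-notify", "Arn": SNS_TOPIC_ARN}],
)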

Create an EventBridge rule

EventBridge sends event notifications to the Step Functions state machine when the full load is complete. You can also turn event notifications on or off in EventBridge.

Complete the following steps to create the EventBridge rule:

  1. On the EventBridge console, in the navigation pane, choose Rules.
  2. Choose Create rule.
  3. For Name, enter a name (for example, dms-test).
  4. Optionally, enter a description for the rule.
  5. For Event bus, choose the event bus to associate with this rule. If you want this rule to match events that come from your account, select AWS default event bus. When an AWS service in your account emits an event, it always goes to your account’s default event bus.
  6. For Rule type, choose Rule with an event pattern.
  7. Choose Next.
  8. For Event source, choose AWS events or EventBridge partner events.
  9. For Method, select Use pattern form.
  10. For Event source, choose AWS services.
  11. For AWS service, choose Database Migration Service.
  12. For Event type, choose All Events.
  13. For Event pattern, enter the following JSON expression, which looks for the REPLICATION_TASK_STOPPED status for the AWS DMS task:
{
  "source": ["aws.dms"],
  "detail": {
    "eventId": ["DMS-EVENT-0079"],
    "eventType": ["REPLICATION_TASK_STOPPED"],
    "detailMessage": ["Stop Reason FULL_LOAD_ONLY_FINISHED"],
    "type": ["REPLICATION_TASK"],
    "category": ["StateChange"]
  }
}

  14. For Target type, choose AWS service.
  15. For AWS service, choose Step Functions state machine.
  16. For State machine name, enter redshift-elt-load-customer-sales.
  17. Choose Create rule.

The following screenshot shows the details of the rule created for this post.

Test the solution

Run the task and wait for the workload to complete. This workflow moves the full volume data from the source database to the Redshift cluster.

The following screenshot shows the load statistics for the customer table full load.

AWS DMS provides notifications when an AWS DMS event occurs, for example the completion of a full load or if a replication task has stopped.

After the full load is complete, AWS DMS sends events to the default event bus for your account. The following screenshot shows an example of invoking the target Step Functions state machine using the rule you created.

We configured the Step Functions state machine as a target in EventBridge. This enables EventBridge to invoke the Step Functions workflow in response to the completion of an AWS DMS full load task.

Validate the state machine orchestration

When the entire customer sales data pipeline is complete, you can review the event history for the Step Functions state machine, as shown in the following screenshots.

Limitations

The Data API and Step Functions AWS SDK integration offers a robust mechanism to build highly distributed ETL applications with minimal developer overhead. Be aware of the Data API quotas and the Step Functions service limits when you design your workflows.

Clean up

To avoid incurring future charges, delete the Redshift cluster, AWS DMS full load task, AWS DMS replication instance, EventBridge rule, and Step Functions state machine that you created as part of this post.

Conclusion

In this post, we demonstrated how to build an ETL orchestration for full loads from operational data stores using the Redshift Data API, EventBridge, Step Functions with AWS SDK integration, and Redshift stored procedures.

To learn more about the Data API, see Using the Amazon Redshift Data API to interact with Amazon Redshift clusters and Using the Amazon Redshift Data API.


About the authors

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Praveen Kadipikonda is a Senior Analytics Specialist Solutions Architect at AWS based out of Dallas. He helps customers build efficient, performant, and scalable analytic solutions. He has worked with building databases and data warehouse solutions for over 15 years.

Jagadish Kumar (Jag) is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Implement disaster recovery with Amazon Redshift

Post Syndicated from Nita Shah original https://aws.amazon.com/blogs/big-data/implement-disaster-recovery-with-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

The objective of a disaster recovery plan is to reduce disruption by enabling quick recovery in the event of a disaster that leads to system failure. Disaster recovery plans also allow organizations to make sure they meet all compliance requirements for regulatory purposes, providing a clear roadmap to recovery.

This post outlines proactive steps you can take to mitigate the risks associated with unexpected disruptions and make sure your organization is better prepared to respond and recover Amazon Redshift in the event of a disaster. With built-in features such as automated snapshots and cross-Region replication, you can enhance your disaster resilience with Amazon Redshift.

Disaster recovery planning

Any kind of disaster recovery planning has two key components:

  • Recovery Point Objective (RPO) – RPO is the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
  • Recovery Time Objective (RTO) – RTO is the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.

To develop your disaster recovery plan, you should complete the following tasks:

  • Define your recovery objectives for downtime and data loss (RTO and RPO) for data and metadata. Make sure your business stakeholders are engaged in deciding appropriate goals.
  • Identify recovery strategies to meet the recovery objectives.
  • Define a fallback plan to return production to the original setup.
  • Test out the disaster recovery plan by simulating a failover event in a non-production environment.
  • Develop a communication plan to notify stakeholders of downtime and its impact to the business.
  • Develop a communication plan for progress updates, and recovery and availability.
  • Document the entire disaster recovery process.

Disaster recovery strategies

Amazon Redshift is a cloud-based data warehouse that supports many recovery capabilities out of the box to address unforeseen outages and minimize downtime.

Amazon Redshift RA3 instance types and Amazon Redshift Serverless store their data in Redshift Managed Storage (RMS), which is backed by Amazon Simple Storage Service (Amazon S3) and is highly available and durable by default.

In the following sections, we discuss the various failure modes and associated recovery strategies.

Using backups

Backing up data is an important part of data management. Backups protect against human error, hardware failure, virus attacks, power outages, and natural disasters.

Amazon Redshift supports two kinds of snapshots: automatic and manual, which can be used to recover data. Snapshots are point-in-time backups of the Redshift data warehouse. Amazon Redshift stores these snapshots internally with RMS by using an encrypted Secure Sockets Layer (SSL) connection.

Redshift provisioned clusters offer automated snapshots with a default retention of 1 day, which can be extended to up to 35 days. These snapshots are taken approximately every 8 hours or after every 5 GB of data changes per node, whichever comes first, with a minimum interval of 15 minutes between two snapshots; for the data-change trigger, the total change across the cluster must exceed 5 GB times the number of nodes. You can also set a custom snapshot schedule with frequencies between 1–24 hours. You can use the AWS Management Console or the ModifyCluster API to manage how long your automated backups are retained by modifying the AutomatedSnapshotRetentionPeriod parameter. If you want to turn off automated backups altogether, you can set the retention period to 0 (not recommended). For additional details, refer to Automated snapshots.
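
For example, the retention period can be extended programmatically; the following sketch assumes a placeholder cluster identifier and uses the boto3 equivalent of the ModifyCluster API.

import boto3

redshift = boto3.client("redshift")

# Keep automated snapshots for 35 days instead of the 1-day default
redshift.modify_cluster(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    AutomatedSnapshotRetentionPeriod=35,
)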

Amazon Redshift Serverless automatically creates recovery points approximately every 30 minutes. These recovery points have a default retention of 24 hours, after which they get automatically deleted. You do have the option to convert a recovery point into a snapshot if you want to retain it longer than 24 hours.

Both Amazon Redshift provisioned and serverless clusters offer manual snapshots that can be taken on-demand and be retained indefinitely. Manual snapshots allow you to retain your snapshots longer than automated snapshots to meet your compliance needs. Manual snapshots accrue storage charges, so it’s important that you delete them when you no longer need them. For additional details, refer to Manual snapshots.
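
As an illustration, a manual snapshot with its own retention period can be created for a provisioned cluster with boto3 as follows; the identifiers are placeholders.

import boto3

redshift = boto3.client("redshift")

# Take an on-demand manual snapshot and keep it for 90 days
redshift.create_cluster_snapshot(
    ClusterIdentifier="my-redshift-cluster",             # placeholder
    SnapshotIdentifier="my-manual-snapshot-2024-06-01",  # placeholder
    ManualSnapshotRetentionPeriod=90,  # in days; omit to retain indefinitely
)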

Amazon Redshift integrates with AWS Backup to help you centralize and automate data protection across all your AWS services, in the cloud, and on premises. With AWS Backup for Amazon Redshift, you can configure data protection policies and monitor activity for different Redshift provisioned clusters in one place. You can create and store manual snapshots for Redshift provisioned clusters. This lets you automate and consolidate backup tasks that you had to do separately before, without any manual processes. To learn more about setting up AWS Backup for Amazon Redshift, refer to Amazon Redshift backups. As of this writing, AWS Backup does not integrate with Redshift Serverless.

Node failure

A Redshift data warehouse is a collection of computing resources called nodes.
Amazon Redshift will automatically detect and replace a failed node in your data warehouse cluster. Amazon Redshift makes your replacement node available immediately and loads your most frequently accessed data from Amazon S3 first to allow you to resume querying your data as quickly as possible.

If this is a single-node cluster (which is not recommended for customer production use), there is only one copy of the data in the cluster. When it’s down, AWS needs to restore the cluster from the most recent snapshot on Amazon S3, and that becomes your RPO.

We recommend using at least two nodes for production.

Cluster failure

Each cluster has a leader node and one or more compute nodes. In the event of a cluster failure, you must restore the cluster from a snapshot. Snapshots are point-in-time backups of a cluster. A snapshot contains data from all databases that are running on your cluster. It also contains information about your cluster, including the number of nodes, node type, and admin user name. If you restore your cluster from a snapshot, Amazon Redshift uses the cluster information to create a new cluster. Then it restores all the databases from the snapshot data. Note that the new cluster is available before all of the data is loaded, so you can begin querying the new cluster in minutes. The cluster is restored in the same AWS Region and a random, system-chosen Availability Zone, unless you specify another Availability Zone in your request.

Availability Zone failure

A Region is a physical location around the world where data centers are located. An Availability Zone is one or more discrete data centers with redundant power, networking, and connectivity in a Region. Availability Zones enable you to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center. All Availability Zones in a Region are interconnected with high-bandwidth, low-latency networking, over fully redundant, dedicated metro fiber providing high-throughput, low-latency networking between Availability Zones.

To recover from Availability Zone failures, you can use one of the following approaches:

  • Relocation capabilities (active-passive) – If your Redshift data warehouse is a single-AZ deployment and the cluster’s Availability Zone becomes unavailable, then Amazon Redshift will automatically move your cluster to another Availability Zone without any data loss or application changes. To activate this, you must enable cluster relocation for your provisioned cluster through configuration settings, which is automatically enabled for Redshift Serverless. Cluster relocation is free of cost, but it is a best-effort approach subject to resource availability in the Availability Zone being recovered in, and RTO can be impacted by other issues related to starting up a new cluster. This can result in recovery times between 10–60 minutes. To learn more about configuring Amazon Redshift relocation capabilities, refer to Build a resilient Amazon Redshift architecture with automatic recovery enabled.
  • Amazon Redshift Multi-AZ (active-active) – A Multi-AZ deployment allows you to run your data warehouse in multiple Availability Zones simultaneously and continue operating in unforeseen failure scenarios. No application changes are required to maintain business continuity because the Multi-AZ deployment is managed as a single data warehouse with one endpoint. Multi-AZ deployments reduce recovery time by guaranteeing capacity to automatically recover and are intended for customers with mission-critical analytics applications that require the highest levels of availability and resiliency to Availability Zone failures. This also allows you to implement a solution that is more compliant with the recommendations of the Reliability Pillar of the AWS Well-Architected Framework. Our pre-launch tests found that the RTO with Amazon Redshift Multi-AZ deployments is under 60 seconds in the unlikely case of an Availability Zone failure. To learn more about configuring Multi-AZ, refer to Enable Multi-AZ deployments for your Amazon Redshift data warehouse. As of this writing, Redshift Serverless does not support Multi-AZ.
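
If you adopt the Multi-AZ approach described above, an existing RA3 provisioned cluster can be converted programmatically; the following sketch is only illustrative, uses a placeholder cluster identifier, and assumes the MultiAZ parameter of the ModifyCluster API.

import boto3

redshift = boto3.client("redshift")

# Convert an existing single-AZ RA3 cluster to a Multi-AZ deployment
redshift.modify_cluster(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    MultiAZ=True,  # assumes the MultiAZ parameter of the ModifyCluster API
)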

Region failure

Amazon Redshift currently supports single-Region deployments for clusters. However, you have several options to help with disaster recovery or accessing data across multi-Region scenarios.

Use a cross-Region snapshot

You can configure Amazon Redshift to copy snapshots for a cluster to another Region. To configure cross-Region snapshot copy, you need to enable this copy feature for each data warehouse (serverless and provisioned) and configure where to copy snapshots and how long to keep copied automated or manual snapshots in the destination Region. When cross-Region copy is enabled for a data warehouse, all new manual and automated snapshots are copied to the specified Region. In the event of a Region failure, you can restore your Redshift data warehouse in a new Region using the latest cross-Region snapshot.

The following diagram illustrates this architecture.

For more information about how to enable cross-Region snapshots, refer to the Amazon Redshift documentation on cross-Region snapshot copy.
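
As a minimal sketch for a provisioned cluster, cross-Region snapshot copy can be enabled with the boto3 EnableSnapshotCopy call; the cluster identifier, destination Region, and retention values are placeholders.

import boto3

redshift = boto3.client("redshift")

# Copy new automated and manual snapshots to a secondary Region
redshift.enable_snapshot_copy(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    DestinationRegion="us-east-1",            # placeholder secondary Region
    RetentionPeriod=7,                        # days to keep copied automated snapshots
    ManualSnapshotRetentionPeriod=30,         # days to keep copied manual snapshots
)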

Use a custom domain name

A custom domain name is easier to remember and use than the default endpoint URL provided by Amazon Redshift. With a CNAME record, you can quickly route traffic to a new cluster or workgroup created from a snapshot in a failover situation. When a disaster happens, connections can be rerouted centrally with minimal disruption, without clients having to change their configuration.

For high availability, you should have a warm-standby cluster or workgroup available that regularly receives restored data from the primary cluster. This backup data warehouse could be in another Availability Zone or in a separate Region. You can redirect clients to the secondary Redshift cluster by setting up a custom domain name in the unlikely scenario of an entire Region failure.

In the following sections, we discuss how to use a custom domain name to handle Region failure in Amazon Redshift. Make sure the following prerequisites are met:

  • You need a registered domain name. You can use Amazon Route 53 or a third-party domain registrar to register a domain.
  • You need to configure cross-Region snapshots for your Redshift cluster or workgroup.
  • Turn on cluster relocation for your Redshift cluster. Use the AWS Command Line Interface (AWS CLI) to turn on relocation for a Redshift provisioned cluster, as shown in the sketch after this list. For Redshift Serverless, this is automatically enabled. For more information, see Relocating your cluster.
  • Take note of your Redshift endpoint. You can locate the endpoint by navigating to your Redshift workgroup or provisioned cluster name on the Amazon Redshift console.
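
The following sketch shows the relocation prerequisite from the preceding list for a provisioned cluster, using boto3 instead of the AWS CLI; the cluster identifier is a placeholder.

import boto3

redshift = boto3.client("redshift")

# Turn on Availability Zone relocation for a provisioned cluster
# (equivalent to: aws redshift modify-cluster --cluster-identifier my-redshift-cluster --availability-zone-relocation)
redshift.modify_cluster(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    AvailabilityZoneRelocation=True,
)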

Set up a custom domain with Amazon Redshift in the primary Region

In the hosted zone that Route 53 created when you registered the domain, create records to tell Route 53 how you want to route traffic to the Redshift endpoint by completing the following steps:

  1. On the Route 53 console, choose Hosted zones in the navigation pane.
  2. Choose your hosted zone.
  3. On the Records tab, choose Create record.
  4. For Record name, enter your preferred subdomain name.
  5. For Record type, choose CNAME.
  6. For Value, enter the Redshift endpoint name. Make sure to provide the value by removing the colon (:), port, and database. For example, redshift-provisioned.eabc123.us-east-2.redshift.amazonaws.com.
  7. Choose Create records.

  8. Use the CNAME record name to create a custom domain in Amazon Redshift. For instructions, see Use custom domain names with Amazon Redshift.

You can now connect to your cluster using the custom domain name. The JDBC URL will be similar to jdbc:redshift://prefix.rootdomain.com:5439/dev?sslmode=verify-full, where prefix.rootdomain.com is your custom domain name and dev is the default database. Use your preferred editor to connect to this URL using your user name and password.
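
As an illustration, the following sketch connects through the custom domain name using the open source redshift_connector driver; the domain, user, and password are placeholders.

import redshift_connector  # open source Amazon Redshift Python driver

# Placeholders: custom domain, user, and password
conn = redshift_connector.connect(
    host="prefix.rootdomain.com",  # the CNAME record created above
    port=5439,
    database="dev",
    user="awsuser",
    password="my-password",
)

cursor = conn.cursor()
cursor.execute("select current_database()")
print(cursor.fetchone())
conn.close()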

Steps to handle a Regional failure

In the unlikely situation of a Regional failure, complete the following steps:

  1. Use a cross-Region snapshot to restore a Redshift cluster or workgroup in your secondary Region.
  2. Turn on cluster relocation for your Redshift cluster in the secondary Region. Use the AWS CLI to turn on relocation for a Redshift provisioned cluster.
  3. Use the CNAME record name from the Route 53 hosted zone setup to create a custom domain in the newly created Redshift cluster or workgroup.
  4. Take note of the Redshift endpoint of the newly created Redshift cluster or workgroup.

Next, you need to update the Redshift endpoint in Route 53 to achieve seamless connectivity.

  1. On the Route 53 console, choose Hosted zones in the navigation pane.
  2. Choose your hosted zone.
  3. On the Records tab, select the CNAME record you created.
  4. Under Record details, choose Edit record.
  5. Change the value to the newly created Redshift endpoint. Make sure to provide the value by removing the colon (:), port, and database. For example, redshift-provisioned.eabc567.us-west-2.redshift.amazonaws.com.
  6. Choose Save.

Now when you connect to your custom domain name using the same JDBC URL from your application, you should be connected to your new cluster in your secondary Region.
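
The DNS change described in the preceding steps can also be scripted; the following sketch uses a boto3 Route 53 UPSERT, with a placeholder hosted zone ID and the example endpoint shown earlier.

import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEFGHIJ",  # placeholder hosted zone ID
    ChangeBatch={
        "Comment": "Fail over Redshift custom domain to the secondary Region",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "prefix.rootdomain.com",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [
                        {"Value": "redshift-provisioned.eabc567.us-west-2.redshift.amazonaws.com"}
                    ],
                },
            }
        ],
    },
)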

Use active-active configuration

For business-critical applications that require high availability, you can set up an active-active configuration at the Region level. There are many ways to make sure all writes occur to all clusters; one way is to keep the data in sync between the two clusters by ingesting data concurrently into the primary and secondary cluster. You can also use Amazon Kinesis to sync the data between two clusters. For more details, see Building Multi-AZ or Multi-Region Amazon Redshift Clusters.

Additional considerations

In this section, we discuss additional considerations for your disaster recovery strategy.

Amazon Redshift Spectrum

Amazon Redshift Spectrum is a feature of Amazon Redshift that allows you to run SQL queries against exabytes of data stored in Amazon S3. With Redshift Spectrum, you don’t have to load or extract the data from Amazon S3 into Amazon Redshift before querying.

If you’re using external tables with Redshift Spectrum, you need to make sure they are configured and accessible on your secondary failover cluster.

You can set this up with the following steps:

  1. Replicate existing S3 objects between the primary and secondary Region.
  2. Replicate data catalog objects between the primary and secondary Region.
  3. Set up AWS Identity and Access Management (IAM) policies for accessing the S3 bucket residing in the secondary Region.

Cross-Region data sharing

With Amazon Redshift data sharing, you can securely share read access to live data across Redshift clusters, workgroups, AWS accounts, and Regions without manually moving or copying the data.

If you’re using cross-Region data sharing and one of the Regions has an outage, you need to have a business continuity plan to fail over your producer and consumer clusters to minimize the disruption.

In the event of an outage affecting the Region where the producer cluster is deployed, you can take the following steps to create a new producer cluster in another Region using a cross-Region snapshot and by reconfiguring data sharing, allowing your system to continue operating:

  1. Create a new Redshift cluster using the cross-Region snapshot. Make sure you have the correct node type, node count, and security settings.
  2. Identify the Redshift data shares that were previously configured for the original producer cluster.
  3. Recreate these data shares on the new producer cluster in the target Region.
  4. Update the data share configurations in the consumer cluster to point to the newly created producer cluster.
  5. Confirm that the necessary permissions and access controls are in place for the data shares in the consumer cluster.
  6. Verify that the new producer cluster is operational and the consumer cluster is able to access the shared data.

In the event of an outage in the Region where the consumer cluster is deployed, you will need to create a new consumer cluster in a different Region. This makes sure all applications that are connecting to the consumer cluster continue to function as expected, with proper access.

The steps to accomplish this are as follows:

  1. Identify an alternate Region that is not affected by the outage.
  2. Provision a new consumer cluster in the alternate Region.
  3. Provide necessary access to data sharing objects.
  4. Update the application configurations to point to the new consumer cluster.
  5. Validate that all the applications are able to connect to the new consumer cluster and are functioning as expected.

For additional information on how to configure data sharing, refer to Sharing datashares.

Federated queries

With federated queries in Amazon Redshift, you can query and analyze data across operational databases, data warehouses, and data lakes. If you’re using federated queries, you need to set up federated queries from the failover cluster as well to prevent any application failure.

Summary

In this post, we discussed various failure scenarios and recovery strategies associated with Amazon Redshift. Disaster recovery solutions make restoring your data and workloads seamless so you can get business operations back online quickly after a catastrophic event.

As an administrator, you can now work on defining your Amazon Redshift disaster recovery strategy and implement it to minimize business disruptions. You should develop a comprehensive plan that includes:

  • Identifying critical Redshift resources and data
  • Establishing backup and recovery procedures
  • Defining failover and failback processes
  • Enforcing data integrity and consistency
  • Implementing disaster recovery testing and drills

Try out these strategies for yourself, and leave any questions and feedback in the comments section.


About the authors

Nita Shah is a Senior Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Jason Pedreza is a Senior Redshift Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com and Amazon Devices. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Agasthi Kothurkar is an AWS Solutions Architect, and is based in Boston. Agasthi works with enterprise customers as they transform their business by adopting the Cloud. Prior to joining AWS, he worked with leading IT consulting organizations on customers engagements spanning Cloud Architecture, Enterprise Architecture, IT Strategy, and Transformation. He is passionate about applying Cloud technologies to resolve complex real world business problems.

Access Amazon Redshift data from Salesforce Data Cloud with Zero Copy Data Federation

Post Syndicated from Vijay Gopalakrishnan original https://aws.amazon.com/blogs/big-data/access-amazon-redshift-data-from-salesforce-data-cloud-with-zero-copy-data-federation/

This post is co-authored by Vijay Gopalakrishnan, Director of Product, Salesforce Data Cloud.

In today’s data-driven business landscape, organizations collect a wealth of data across various touch points and unify it in a central data warehouse or a data lake to deliver business insights. This data is primarily used for analytical and machine learning purposes, but it is not easily accessible to the business users across Sales, Service, and Marketing teams who need it to make data-driven decisions. Salesforce and Amazon collaborated to address this challenge by making the data accessible to the users in the flow of their work, with Zero Copy Data Federation between Salesforce Data Cloud and Amazon Redshift. This solution empowers businesses to access Redshift data within the Salesforce Data Cloud, breaking down data silos, gaining deeper insights, and creating unified customer profiles to deliver highly personalized experiences across various touchpoints. By eliminating the need for data replication, this integration improves efficiency and reduces costs while enabling real-time access to valuable business data.

In this post, we explore the benefits of the new Zero Copy Data Federation and provide a step-by-step guidance to configure it in Salesforce Data Cloud.

What is Salesforce Data Cloud?

Salesforce Data Cloud is a data platform that unifies all of your company’s data into Salesforce’s Einstein 1 Platform, giving every team a 360-degree view of the customer to drive automation, create analytics, personalize engagement, and power trusted artificial intelligence (AI). Data Cloud creates a holistic customer view by turning volumes of disconnected data into a unified customer profile that’s straightforward to access and understand. This includes diverse datasets like telemetry data, web engagement data, and more across your organization or your external data lakes and warehouses. This unified view helps your Sales, Service, and Marketing teams build personalized customer experiences, invoke data-driven actions and workflows, and safely drive AI across all your Salesforce apps.

What is Amazon Redshift?

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence (BI) tools. It’s optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and delivers better price-performance compared to most traditional data warehousing solutions. With a fully managed AI powered massively parallel processing (MPP) architecture, Amazon Redshift makes business decision-making quick and cost-effective.

What is Zero Copy Data Federation?

Zero Copy Data Federation, a Salesforce Data Cloud capability, unifies Salesforce and Amazon Redshift data through a point-and-click interface. It provides secure, real-time access to Redshift data without copying, keeping enterprise data in place. This eliminates replication overhead and ensures access to current information, enhancing data integration while maintaining data integrity and efficiency.

Data federated from Amazon Redshift is represented as a native Data Cloud object, which powers various Data Cloud features, including marketing segmentation, activations, and process automation. With these capabilities at your fingertips, you can enrich the unified customer profile in Salesforce Data Cloud with transaction data from Amazon Redshift to create a rich customer 360, gain insights, harness predictive and generative AI on the unified data, and ultimately deliver highly personalized experiences across multiple touchpoints.

The following diagram depicts the Zero Copy Data Federation flow, the key features it enables, and a few potential actions and activations.


Connection to Amazon Redshift is established by deploying a data stream in Salesforce Data Cloud. When you deploy a data stream from Amazon Redshift to Data Cloud, an external data lake object (DLO) is created within the Data Cloud environment. This external DLO acts as a storage container, housing metadata for your federated Redshift data. Importantly, the DLO serves as a reference, pointing to the data physically stored in your Redshift data warehouse, keeping your data in its original location. Similar to native DLOs, the Amazon Redshift backed external DLOs can power several key features, including batch transform, calculated insights, identity resolution, query, segmentation, and activation, among others. Customer unified profiles enriched with Redshift data could be actioned by Amazon SageMaker to drive predictive outcomes and activated across several platforms, including Amazon Ads and Salesforce Marketing Cloud, for creating audience journeys and running targeted campaigns.

To increase performance, you can opt for acceleration, which is designed to enhance query runtimes. For more information on this feature, refer to Acceleration in Data Federation.

To summarize, Zero Copy Data Federation provides the following benefits:

  • Unified data view: Integrates external data seamlessly with Salesforce data for a comprehensive customer view.
  • Real-time access: Provides near real-time access to data stored in external sources like Amazon Redshift.
  • Data efficiency: Eliminates the need to copy or move large datasets, reducing storage costs and data duplication.
  • Cost-effective: Reduces data transfer pipeline and storage costs associated with traditional data integration methods.
  • Enhanced security: Data remains in its original secure environment, reducing exposure risks.
  • Streamlined compliance: Simplifies data governance by maintaining data in its original, regulated environment.

Prerequisites

Before configuring data federation, you must have access to Salesforce Data Cloud and the information to connect to your Redshift provisioned or serverless warehouse. The Redshift warehouse must be publicly accessible, and we recommend restricting access by allowlisting only the Data Cloud IP addresses.
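
As a hedged sketch of that restriction for a provisioned cluster in a VPC, the following boto3 snippet adds inbound rules for the Redshift port to the cluster’s security group; the security group ID and CIDR ranges are placeholders that you would replace with the IP ranges Salesforce publishes for Data Cloud.

import boto3

ec2 = boto3.client("ec2")

# Placeholder CIDR ranges; replace with the IP ranges Salesforce publishes for Data Cloud
DATA_CLOUD_CIDRS = ["203.0.113.0/24", "198.51.100.0/24"]

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder: the cluster's security group
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5439,  # default Redshift port
            "ToPort": 5439,
            "IpRanges": [
                {"CidrIp": cidr, "Description": "Salesforce Data Cloud"}
                for cidr in DATA_CLOUD_CIDRS
            ],
        }
    ],
)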

For information on setting up an Amazon Redshift Serverless or Amazon Redshift provisioned cluster, refer to Amazon Redshift Serverless or Amazon Redshift provisioned clusters, respectively.

Configure Zero Copy Data Federation

To federate Redshift data to Salesforce Data Cloud, start by configuring a Redshift connection.

  1. Log in to Salesforce Data Cloud and navigate to Data Cloud Setup.
  2. In the navigation pane, choose Connectors under Configuration.
  3. Choose New, choose Amazon Redshift, and choose Next.
  4. Retrieve the Redshift endpoint by navigating to the Redshift Serverless or provisioned cluster in the AWS console. The following image shows how to obtain the endpoint URL for Redshift Serverless.
  5. Back in Salesforce Data Cloud, configure the connector with a unique name and enter the endpoint from your Redshift server.
  6. Enter the user name and password configured for your Redshift serverless namespace.
  7. Enter the name of the database configured in your Redshift serverless namespace.
  8. Choose Test Connection to confirm you’re able to successfully connect to the Redshift instance and choose Save.

Create a Redshift Zero Copy Data Federation data stream

Complete the following steps to create a data stream using the connection you created:

  1. Navigate to Data Cloud and choose Data Streams in the navigation bar.
  2. Choose New to set up a new data stream.
  3. Choose Amazon Redshift and choose Next.
  4. Choose your connector, database, and objects, then choose Next.
  5. Configure the object, category, primary key, and fields:
    1. Set the object name and object API name. For more information, see Data Lake Object Naming Standards.
    2. Set the category to specify the type of data to ingest. For more information, see Category.
    3. Set the primary key to identify the incoming records uniquely. For more information, see Primary Key.
    4. Select the source fields you want to ingest.
  6. Choose Next.
  7. Select the relevant data space. Choose default if you don’t have any other data space provisioned in your organization. For more information, see Manage Data Spaces.
  8. If you want to query the data in your Redshift instance with reduced latency, select Enable acceleration and choose your acceleration schedule. For more information, see Acceleration in Data Federation.
  9. Choose Deploy.

On successful deployment, a data stream is created.


Use cases for Zero Copy Data Federation

The following are key use cases enabled by Zero Copy Data Federation between Redshift and Salesforce Data Cloud:

  • Marketing insurance campaign journey – Combine customer profile, insurance policy, and plan data in Amazon Redshift with customer data in Salesforce Cloud for targeted outreach campaigns in Marketing Cloud. This facilitates cross-selling of other financial products.
  • Targeted promotions and customer outreach – Merge customer purchase and profile data from Amazon Redshift with customer feedback and service data in Salesforce for targeted customer outreach in Marketing Cloud, including promotional deals.
  • Customer satisfaction using service cloud data – Combine customer and case data in Salesforce with customer feedback data in Amazon Redshift to determine customer satisfaction ratings, enhancing service quality.
  • Prioritized offers and data-driven next-best actions – Utilize customer billing accounts and service data from Salesforce along with prospect, order, and billing data in Amazon Redshift to generate prioritized offers and next-best actions. The transition from ETL pipelines to Zero Copy BYOL integration has streamlined operations.
  • Customer segmentation and activation – Federate purchase data and billing history from Amazon Redshift to enrich unified profiles in Salesforce Data Cloud and generate actionable insights based on the recency, frequency, and monetary value to create customer segments and activate to your desired source.
  • Customer 360 with rich insights – Enrich customer profiles in Salesforce Data Cloud with purchase, billing, and product data from Amazon Redshift to empower Marketing, Sales, and Service teams to improve customer engagement with rich customer insights.

Conclusion

Zero Copy Data Federation between Salesforce Data Cloud and Amazon Redshift empowers businesses to break down data silos, enhance customer experiences, and drive operational efficiencies. By federating Redshift data to Salesforce Data Cloud, organizations can make informed decisions faster, personalize customer interactions at scale, and optimize resources across marketing, sales, service, and operations. This integration sets a new standard for data-driven business success in the digital age. Check out the Salesforce Zero Copy Data Federation announcement and the following resources to learn more and get started:


About the Authors

Vijay Gopalakrishnan is a Director of Product Management with Salesforce with several years of experience in the data space. He currently is a part of the Salesforce Data Cloud team.

Ravi Bhattiprolu is a Sr. Partner Solutions Architect at AWS. Ravi works with strategic ISV partners, Salesforce and Tableau, to deliver innovative and well-architected products and solutions that help joint customers achieve their business and technical objectives.

Avijit Goswami is a Principal Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open-source solutions. Outside of his work, Avijit likes to travel, hike, watch sports, and listen to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Mike Patterson is a Senior Customer Solutions Manager in the Strategic ISV segment at AWS. He has partnered with Salesforce Data Cloud to align business objectives with innovative AWS solutions to achieve impactful customer experiences. In Mike’s spare time, he enjoys spending time with his family, sports, and outdoor activities.

Apply fine-grained access and transformation on the SUPER data type in Amazon Redshift

Post Syndicated from Ritesh Sinha original https://aws.amazon.com/blogs/big-data/apply-fine-grained-access-and-transformation-on-the-super-data-type-in-amazon-redshift/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

Amazon Redshift, a cloud data warehouse service, supports attaching dynamic data masking (DDM) policies to paths of SUPER data type columns, and uses the OBJECT_TRANSFORM function with the SUPER data type. SUPER data type columns in Amazon Redshift contain semi-structured data like JSON documents. Previously, data masking in Amazon Redshift only worked with regular table columns, but now you can apply masking policies specifically to elements within SUPER columns. For example, you could apply a masking policy to mask sensitive fields like credit card numbers within JSON documents stored in a SUPER column. This allows for more granular control over data masking in Amazon Redshift. Amazon Redshift gives you more flexibility in how you apply data masking to protect sensitive information stored in SUPER columns containing semi-structured data.

With DDM support in Amazon Redshift, you can do the following:

  • Define masking policies that apply custom obfuscation policies, such as masking policies to handle credit card, personally identifiable information (PII) entries, HIPAA or GDPR needs, and more
  • Transform the data at query time to apply masking policies
  • Attach masking policies to roles or users
  • Attach multiple masking policies with varying levels of obfuscation to the same column in a table and assign them to different roles with priorities to avoid conflicts
  • Implement cell-level masking by using conditional columns when creating your masking policy
  • Use masking policies to partially or completely redact data, or hash it by using user-defined functions (UDFs)

In this post, we demonstrate how a retail company can control the access of PII data stored in the SUPER data type to users based on their access privilege without duplicating the data.

Solution overview

For our use case, we have the following data access requirements:

  • Users from the Customer Service team should be able to view the order data but not PII information
  • Users from the Sales team should be able to view customer IDs and all order information
  • Users from the Executive team should be able to view all the data
  • Staff should not be able to view any data

The following diagram illustrates how DDM support in Amazon Redshift policies works with roles and users for our retail use case.

The solution encompasses creating masking policies with varying masking rules and attaching one or more to the same role and table with an assigned priority to remove potential conflicts. These policies may pseudonymize results or selectively nullify results to comply with retailers’ security requirements. We refer to multiple masking policies being attached to a table as a multi-modal masking policy. A multi-modal masking policy consists of three parts:

  • A data masking policy that defines the data obfuscation rules
  • Roles with different access levels depending on the business case
  • The ability to attach multiple masking policies on a user or role and table combination with priority for conflict resolution

Prerequisites

To implement this solution, you need the following prerequisites:

Prepare the data

To set up our use case, complete the following steps:

  1. On the Amazon Redshift console, choose Query editor v2 under Explorer in the navigation pane.

If you’re familiar with SQL Notebooks, you can download the SQL notebook for the demonstration and import it to quickly get started.

  1. Create the table and populate contents:
    -- 1- Create the orders table
    drop table if exists public.order_transaction;
    create table public.order_transaction (
     data_json super
    );
    
    -- 2- Populate the table with sample values
    INSERT INTO public.order_transaction
    VALUES
        (
            json_parse('
            {
            "c_custkey": 328558,
            "c_name": "Customer#000328558",
            "c_phone": "586-436-7415",
            "c_creditcard": "4596209611290987",
            "orders":{
              "o_orderkey": 8014018,
              "o_orderstatus": "F",
              "o_totalprice": 120857.71,
              "o_orderdate": "2024-01-01"
              }
            }'
            )
        ),
        (
            json_parse('
            {
            "c_custkey": 328559,
            "c_name": "Customer#000328559",
            "c_phone": "789-232-7421",
            "c_creditcard": "8709000219329924",
            "orders":{
              "o_orderkey": 8014019,
              "o_orderstatus": "S",
              "o_totalprice": 9015.98,
              "o_orderdate": "2024-01-01"
              }
            }'
            )
        ),
        (
            json_parse('
            {
            "c_custkey": 328560,
            "c_name": "Customer#000328560",
            "c_phone": "276-564-9023",
            "c_creditcard": "8765994378650090",
            "orders":{
              "o_orderkey": 8014020,
              "o_orderstatus": "C",
              "o_totalprice": 18765.56,
              "o_orderdate": "2024-01-01"
              }
            }
            ')
        );

Implement the solution

To satisfy the security requirements, we need to make sure that each user sees the same data in different ways based on their granted privileges. To do that, we use user roles combined with masking policies as follows:

  1. Create users and roles, and add users to their respective roles:
    --create four users
    set session authorization admin;
    CREATE USER Kate_cust WITH PASSWORD disable;
    CREATE USER Ken_sales WITH PASSWORD disable;
    CREATE USER Bob_exec WITH PASSWORD disable;
    CREATE USER Jane_staff WITH PASSWORD disable;
    
    -- 1. Create User Roles
    CREATE ROLE cust_srvc_role;
    CREATE ROLE sales_srvc_role;
    CREATE ROLE executives_role;
    CREATE ROLE staff_role;
    
    -- note that public role exists by default.
    -- Grant Roles to Users
    GRANT ROLE cust_srvc_role to Kate_cust;
    GRANT ROLE sales_srvc_role to Ken_sales;
    GRANT ROLE executives_role to Bob_exec;
    GRANT ROLE staff_role to Jane_staff;
    
    -- note that regular users are attached to the public role by default.
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE cust_srvc_role;
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE sales_srvc_role;
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE executives_role;
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE staff_role;

  2. Create masking policies:
    -- Mask Full Data
    CREATE MASKING POLICY mask_full
    WITH(pii_data VARCHAR(256))
    USING ('000000XXXX0000'::TEXT);
    
    -- This policy rounds down the given price to the nearest 10.
    CREATE MASKING POLICY mask_price
    WITH(price INT)
    USING ( (FLOOR(price::FLOAT / 10) * 10)::INT );
    
    -- This policy converts the first 12 digits of the given credit card to 'XXXXXXXXXXXX'.
    CREATE MASKING POLICY mask_credit_card
    WITH(credit_card TEXT)
    USING ( 'XXXXXXXXXXXX'::TEXT || SUBSTRING(credit_card::TEXT FROM 13 FOR 4) );
    
    -- This policy mask the given date
    CREATE MASKING POLICY mask_date
    WITH(order_date TEXT)
    USING ( 'XXXX-XX-XX'::TEXT);
    
    -- This policy mask the given phone number
    CREATE MASKING POLICY mask_phone
    WITH(phone_number TEXT)
    USING ( 'XXX-XXX-'::TEXT || SUBSTRING(phone_number::TEXT FROM 9 FOR 4) );

  3. Attach the masking policies:
    • Attach the masking policy for the customer service use case:
      --customer_support (cannot see customer PHI/PII data but can see the order id , order details and status etc.)
      
      set session authorization admin;
      
      ATTACH MASKING POLICY mask_full
      ON public.order_transaction(data_json.c_custkey)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_phone
      ON public.order_transaction(data_json.c_phone)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_credit_card
      ON public.order_transaction(data_json.c_creditcard)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_price
      ON public.order_transaction(data_json.orders.o_totalprice)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_date
      ON public.order_transaction(data_json.orders.o_orderdate)
      TO ROLE cust_srvc_role;

    • Attach the masking policy for the sales use case:
      --sales —> can see the customer ID (non phi data) and all order info
      
      set session authorization admin;
      
      ATTACH MASKING POLICY mask_phone
      ON public.order_transaction(data_json.c_phone)
      TO ROLE sales_srvc_role;

    • Attach the masking policy for the staff use case:
      --Staff — > cannot see any data about the order. all columns masked for them ( we can hand pick some columns) to show the functionality
      
      set session authorization admin;
      
      ATTACH MASKING POLICY mask_full
      ON public.order_transaction(data_json.orders.o_orderkey)
      TO ROLE staff_role;
      
      ATTACH MASKING POLICY mask_full
      ON public.order_transaction(data_json.orders.o_orderstatus)
      TO ROLE staff_role;
      
      ATTACH MASKING POLICY mask_price
      ON public.order_transaction(data_json.orders.o_totalprice)
      TO ROLE staff_role;
      
      ATTACH MASKING POLICY mask_date
      ON public.order_transaction(data_json.orders.o_orderdate)
      TO ROLE staff_role;

Test the solution

Let’s confirm that the masking policies are created and attached.

  1. Check that the masking policies are created with the following code:
    -- 1.1- Confirm the masking policies are created
    SELECT * FROM svv_masking_policy;

  2. Check that the masking policies are attached:
    -- 1.2- Verify attached masking policy on table/column to user/role.
    SELECT * FROM svv_attached_masking_policy;

Now you can test that different users can see the same data masked differently based on their roles.

  1. Test that the customer support can’t see customer PHI/PII data but can see the order ID, order details, and status:
    set session authorization Kate_cust;
    select * from order_transaction;

  2. Test that the sales team can see the customer ID (non PII data) and all order information:
    set session authorization Ken_sales;
    select * from order_transaction;

  3. Test that the executives can see all data:
    set session authorization Bob_exec;
    select * from order_transaction;

  4. Test that the staff can’t see any data about the order. All columns should be masked for them:
    set session authorization Jane_staff;
    select * from order_transaction;

Object_Transform function

In this section, we dive into the capabilities and benefits of the OBJECT_TRANSFORM function and explore how it empowers you to efficiently reshape your data for analysis. The OBJECT_TRANSFORM function in Amazon Redshift is designed to facilitate data transformations by allowing you to manipulate JSON data directly within the database. With this function, you can apply transformations to semi-structured or SUPER data types, making it less complicated to work with complex data structures in a relational database environment.

Let’s look at some usage examples.

First, create a table and populate contents:

--1- Create the customer table 

DROP TABLE if exists customer_json;

CREATE TABLE customer_json (
    col_super super,
    col_text character varying(100) ENCODE lzo
) DISTSTYLE AUTO;

--2- Populate the table with sample data 

INSERT INTO customer_json
VALUES
    (
        
        json_parse('
            {
                "person": {
                    "name": "GREGORY HOUSE",
                    "salary": 120000,
                    "age": 17,
                    "state": "MA",
                    "ssn": ""
                }
            }
        ')
        ,'GREGORY HOUSE'
    ),
    (
        json_parse('
              {
                "person": {
                    "name": "LISA CUDDY",
                    "salary": 180000,
                    "age": 30,
                    "state": "CA",
                    "ssn": ""
                }
            }
        ')
        ,'LISA CUDDY'
    ),
     (
        json_parse('
              {
                "person": {
                    "name": "JAMES WILSON",
                    "salary": 150000,
                    "age": 35,
                    "state": "WA",
                    "ssn": ""
                }
            }
        ')
        ,'JAMES WILSON'
    )
;
-- 3 select the data 

SELECT * FROM customer_json;

Apply the transformations with the OBJECT_TRANSFORM function:

SELECT
    OBJECT_TRANSFORM(
        col_super
        KEEP
            '"person"."name"',
            '"person"."age"',
            '"person"."state"'
           
        SET
            '"person"."name"', LOWER(col_super.person.name::TEXT),
            '"person"."salary"',col_super.person.salary + col_super.person.salary*0.1
    ) AS col_super_transformed
FROM customer_json;

As you can see in the example, by applying the transformation with OBJECT_TRANSFORM, the person name is formatted in lowercase and the salary is increased by 10%. This demonstrates how the transformation makes it less complicated to work with semi-structured or nested data types.

Clean up

When you’re done with the solution, clean up your resources:

  1. Reset the session authorization to the default:
    -- Cleanup
    --reset session authorization to the default
    RESET SESSION AUTHORIZATION;

  2. Drop the masking policies:
    DROP MASKING POLICY mask_full CASCADE;
    DROP MASKING POLICY mask_price CASCADE;
    DROP MASKING POLICY mask_credit_card CASCADE;
    DROP MASKING POLICY mask_date CASCADE;
    DROP MASKING POLICY mask_phone CASCADE;

  3. Revoke or drop the roles and users:
    REVOKE ROLE cust_srvc_role from Kate_cust;
    REVOKE ROLE sales_srvc_role from Ken_sales;
    REVOKE ROLE executives_role from Bob_exec;
    REVOKE ROLE staff_role from Jane_staff;
    DROP ROLE cust_srvc_role;
    DROP ROLE sales_srvc_role;
    DROP ROLE executives_role;
    DROP ROLE staff_role;
    DROP USER Kate_cust;
    DROP USER Ken_sales;
    DROP USER Bob_exec;
    DROP USER Jane_staff;

  4. Drop the table:
    DROP TABLE order_transaction CASCADE;
    DROP TABLE if exists customer_json;

Considerations and best practices

Consider the following when implementing this solution:

  • When attaching a masking policy to a path on a column, that column must be defined as the SUPER data type. You can only apply masking policies to scalar values on the SUPER path. You can’t apply masking policies to complex structures or arrays.
  • You can apply different masking policies to multiple scalar values on a single SUPER column as long as the SUPER paths don’t conflict. For example, the SUPER paths a.b and a.b.c conflict because they’re on the same path, with a.b being the parent of a.b.c. The SUPER paths a.b.c and a.b.d don’t conflict.

Refer to Using dynamic data masking with SUPER data type paths for more details on considerations.

Conclusion

In this post, we discussed how to use DDM support for the SUPER data type in Amazon Redshift to define configuration-driven, consistent, format-preserving, and irreversible masked data values. With DDM support in Amazon Redshift, you can control your data masking approach using familiar SQL language. You can take advantage of the Amazon Redshift role-based access control capability to implement different levels of data masking. You can create a masking policy to identify which column needs to be masked, and you have the flexibility of choosing how to show the masked data. For example, you can completely hide all the information of the data, replace partial real values with wildcard characters, or define your own way to mask the data using SQL expressions, Python, or Lambda UDFs. Additionally, you can apply conditional masking based on other columns, which selectively protects the column data in a table based on the values in one or more columns.

We encourage you to create your own user-defined functions for various use cases and achieve your desired security posture using dynamic data masking support in Amazon Redshift.


About the Authors

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15+ years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Omama Khurshid is an Acceleration Lab Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies.

How Swisscom automated Amazon Redshift as part of their One Data Platform solution using AWS CDK – Part 2

Post Syndicated from Asad Bin Imtiaz original https://aws.amazon.com/blogs/big-data/how-swisscom-automated-amazon-redshift-as-part-of-their-one-data-platform-solution-using-aws-cdk-part-2/

In this series, we talk about Swisscom’s journey of automating Amazon Redshift provisioning as part of the Swisscom One Data Platform (ODP) solution using the AWS Cloud Development Kit (AWS CDK), and we provide code snippets and other useful references.

In Part 1, we did a deep dive on provisioning a secure and compliant Redshift cluster using the AWS CDK and the best practices of secret rotation. We also explained how Swisscom used AWS CDK custom resources to automate the creation of dynamic user groups that are relevant for the AWS Identity and Access Management (IAM) roles matching different job functions.

In this post, we explore using the AWS CDK and some of the key topics for self-service usage of the provisioned Redshift cluster by end-users as well as other managed services and applications. These topics include federation with the Swisscom identity provider (IdP), JDBC connections, detective controls using AWS Config rules and remediation actions, cost optimization using the Redshift scheduler, and audit logging.

Scheduled actions

To optimize cost-efficiency for provisioned Redshift cluster deployments, Swisscom implemented a scheduling mechanism. This functionality is driven by the user configuration of the cluster, as described in Part 1 of this series, wherein the user may enable dynamic pausing and resuming of clusters based on specified cron expressions:

redshift_options:
...
  use_scheduler: true                                         # Whether to use Redshift scheduler
  scheduler_pause_cron: "cron(00 18 ? * MON-FRI *)"           # Cron expression for scheduler pause
  scheduler_resume_cron: "cron(00 08 ? * MON-FRI *)"          # Cron expression for scheduler resume
...

This feature allows Swisscom to reduce operational costs by pausing and resuming clusters at appropriate times, suspending cluster activity during off-peak hours and leading to significant cost savings. The scheduling is achieved using the AWS CDK L1 construct CfnScheduledAction, which maps to the AWS::Redshift::ScheduledAction CloudFormation resource. The following code illustrates how Swisscom implemented this scheduling:

if config.use_scheduler:
    cfn_scheduled_action_pause = aws_redshift.CfnScheduledAction(
        scope, "schedule-pause-action",
        # ...
        schedule=config.scheduler_pause_cron,
        # ...
        target_action=aws_redshift.CfnScheduledAction.ScheduledActionTypeProperty(
                         pause_cluster=aws_redshift.CfnScheduledAction.PauseClusterMessageProperty(
                            cluster_identifier='cluster-identifier'
                         )
                      )
    )

    cfn_scheduled_action_resume = aws_redshift.CfnScheduledAction(
        scope, "schedule-resume-action",
        # ...
        schedule=config.scheduler_resume_cron,
        # ...
        target_action=aws_redshift.CfnScheduledAction.ScheduledActionTypeProperty(
                         resume_cluster=aws_redshift.CfnScheduledAction.ResumeClusterMessageProperty(
                            cluster_identifier='cluster-identifier'
                         )
                      )
    )

JDBC connections

JDBC connectivity for the Redshift clusters is also flexible, adapting to the user-defined subnet types and security groups in the configuration:

redshift_options:
...
  subnet_type: "routable-private"         # 'routable-private' OR 'non-routable-private'
  security_group_id: "sg-test_redshift"   # Security Group ID for Amazon Redshift (referenced group must exists in Account)
...

As illustrated in the ODP architecture diagram in Part 1 of this series, a considerable part of extract, transform, and load (ETL) processing is expected to run outside of Amazon Redshift, within the serverless AWS Glue environment. Swisscom therefore needed a mechanism for AWS Glue to connect to Amazon Redshift. This connectivity is provided by creating an AWS Glue JDBC connection within the AWS CDK code, which allows ETL processes to interact with the Redshift cluster. The subnet and security group defined in the user configuration guide the creation of the JDBC connectivity; if no security group is defined in the configuration, a default one is created. The connection is configured with details of the data product from which the Redshift cluster is being provisioned, such as the ETL user and default database, along with network elements like the cluster endpoint, security group, and subnet to use, providing secure and efficient data transfer. The following code snippet demonstrates how this was achieved:

jdbc_connection = glue.Connection(
    scope, "redshift-glue-connection",
    type=ConnectionType("JDBC"),
    connection_name="redshift-glue-connection",
    subnet=connection_subnet,
    security_groups=connection_security_groups,
    properties={
        "JDBC_CONNECTION_URL": f"jdbc:redshift://{cluster_endpoint}/{database_name}",
        "USERNAME": etl_user.username,
        "PASSWORD": etl_user.password.to_string(),
        "redshiftTmpDir": f"s3://{data_product_name}-redshift-work"
    }
)

By doing this, Swisscom made sure that serverless ETL workflows in AWS Glue can securely communicate with the newly provisioned Redshift cluster running within a secured virtual private cloud (VPC).

Identity federation

Identity federation allows a centralized system (the IdP) to be used for authenticating users in order to access a service provider like Amazon Redshift. A more general overview of the topic can be found in Identity Federation in AWS.

Identity federation not only enhances security due to its centralized user lifecycle management and centralized authentication mechanism (for example, supporting multi-factor authentication), but also improves the user experience and reduces the overall complexity of identity and access management and thereby also its governance.

In Swisscom’s setup, Microsoft Active Directory Services are used for identity and access management. At the initial build stages of ODP, Amazon Redshift offered two different options for identity federation.

In Swisscom’s context, during the initial implementation, Swisscom opted for IAM-based SAML 2.0 IdP federation because this is a more general approach, which can also be used for other AWS services, such as Amazon QuickSight (see Setting up IdP federation using IAM and QuickSight).

At AWS re:Invent 2023, AWS announced a new connection option to Amazon Redshift based on AWS IAM Identity Center. IAM Identity Center provides a single place for workforce identities in AWS, allowing the creation of users and groups directly within itself or by federation with standard IdPs like Okta, PingOne, Microsoft Entra ID (Azure AD), or any IdP that supports SAML 2.0 and SCIM. It also provides a single sign-on (SSO) experience for Redshift features and other analytics services such as Amazon Redshift Query Editor V2 (see Integrate Identity Provider (IdP) with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On), QuickSight, and AWS Lake Formation. Moreover, a single IAM Identity Center instance can be shared with multiple Redshift clusters and workgroups with a simple auto-discovery and connect capability. This makes sure all Redshift clusters and workgroups have a consistent view of users, their attributes, and groups. This whole setup fits well with ODP’s vision of providing self-service analytics across the Swisscom workforce with necessary security controls in place. At the time of writing, Swisscom is actively working towards using IAM Identity Center as the standard federation solution for ODP. The following diagram illustrates the high-level architecture for the work in progress.

Audit logging

Amazon Redshift audit logging is useful for auditing for security purposes, monitoring, and troubleshooting. The logging provides information, such as the IP address of the user’s computer, the type of authentication used by the user, or the timestamp of the request. Amazon Redshift logs the SQL operations, including connection attempts, queries, and changes, and makes it straightforward to track the changes. These logs can be accessed through SQL queries against system tables, saved to a secure Amazon Simple Storage Service (Amazon S3) location, or exported to Amazon CloudWatch.

Amazon Redshift logs information in the following log files:

  • Connection log – Provides information to monitor users connecting to the database and related connection information like their IP address.
  • User log – Logs information about changes to database user definitions.
  • User activity log – Tracks information about the types of queries that both the users and the system perform in the database. It’s useful primarily for troubleshooting purposes.

With the ODP solution, Swisscom wanted to write all the Amazon Redshift logs to CloudWatch. This is currently not directly supported by the AWS CDK, so Swisscom implemented a workaround solution using the AWS CDK custom resources option, which invokes the SDK on the Redshift action enableLogging. See the following code:

    custom_resources.AwsCustomResource(self, f"{self.cluster_identifier}-custom-sdk-logging",
           on_update=custom_resources.AwsSdkCall(
               service="Redshift",
               action="enableLogging",
               parameters={
                   "ClusterIdentifier": self.cluster_identifier,
                   "LogDestinationType": "cloudwatch",
                   "LogExports": ["connectionlog","userlog","useractivitylog"],
               },
               physical_resource_id=custom_resources.PhysicalResourceId.of(
                   f"{self.account}-{self.region}-{self.cluster_identifier}-logging")
           ),
           policy=custom_resources.AwsCustomResourcePolicy.from_sdk_calls(
               resources=[f"arn:aws:redshift:{self.region}:{self.account}:cluster:{self.cluster_identifier}"]
           )
        )
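
Once the logs are exported, they can be inspected directly in CloudWatch Logs. The following is a minimal sketch, assuming the default /aws/redshift/cluster/<cluster name>/<log type> log group naming; the cluster identifier shown is a placeholder for your own environment.

import boto3

# Read recent connection log events exported by Amazon Redshift to CloudWatch Logs.
# The log group name below is an assumption based on the default naming pattern;
# replace "my-cluster" with your cluster identifier.
logs_client = boto3.client("logs")

response = logs_client.filter_log_events(
    logGroupName="/aws/redshift/cluster/my-cluster/connectionlog",
    limit=25,
)

for event in response["events"]:
    print(event["message"])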

AWS Config rules and remediation

After a Redshift cluster has been deployed, Swisscom needed to make sure that the cluster continues to meet the defined governance rules at every point in time after creation. For that, Swisscom decided to use AWS Config.

AWS Config provides a detailed view of the configuration of AWS resources in your AWS account. This includes how the resources are related to one another and how they were configured in the past so you can see how the configurations and relationships change over time.

An AWS resource is an entity you can work with in AWS, such as an Amazon Elastic Compute Cloud (Amazon EC2) instance, Amazon Elastic Block Store (Amazon EBS) volume, security group, or Amazon VPC.

The following diagram illustrates the process Swisscom implemented.

If an AWS Config rule isn’t compliant, a remediation can be applied. Swisscom defined the pause cluster action as default in case of a non-compliant cluster (based on your requirements, other remediation actions are possible). This is covered using an AWS Systems Manager automation document (SSM document).

Automation, a capability of Systems Manager, simplifies common maintenance, deployment, and remediation tasks for AWS services like Amazon EC2, Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Amazon S3, and many more.

The SSM document is based on the AWS document AWSConfigRemediation-DeleteRedshiftCluster. It looks like the following code:

description: | 
  ### Document name - PauseRedshiftCluster-WithCheck 

  ## What does this document do? 
  This document pauses the given Amazon Redshift cluster using the [PauseCluster](https://docs.aws.amazon.com/redshift/latest/APIReference/API_PauseCluster.html) API. 

  ## Input Parameters 
  * AutomationAssumeRole: (Required) The ARN of the role that allows Automation to perform the actions on your behalf. 
  * ClusterIdentifier: (Required) The identifier of the Amazon Redshift Cluster. 

  ## Output Parameters 
  * PauseRedshiftCluster.Response: The standard HTTP response from the PauseCluster API. 
schemaVersion: '0.3' 
assumeRole: '{{ AutomationAssumeRole }}' 
parameters: 
  AutomationAssumeRole: 
    type: String 
    description: (Required) The ARN of the role that allows Automation to perform the actions on your behalf. 
    allowedPattern: '^arn:aws[a-z0-9-]*:iam::\d{12}:role\/[\w-\/.@+=,]{1,1017}$' 
  ClusterIdentifier: 
    type: String 
    description: (Required) The identifier of the Amazon Redshift Cluster. 
    allowedPattern: '[a-z]{1}[a-z0-9_.-]{0,62}' 
mainSteps: 
  - name: GetRedshiftClusterStatus 
    action: 'aws:executeAwsApi' 
    inputs: 
      ClusterIdentifier: '{{ ClusterIdentifier }}' 
      Service: redshift 
      Api: DescribeClusters 
    description: |- 
      ## GetRedshiftClusterStatus 
      Gets the status for the given Amazon Redshift Cluster. 
    outputs: 
      - Name: ClusterStatus 
        Selector: '$.Clusters[0].ClusterStatus' 
        Type: String 
    timeoutSeconds: 600 
  - name: Condition 
    action: 'aws:branch' 
    inputs: 
      Choices: 
        - NextStep: PauseRedshiftCluster 
          Variable: '{{ GetRedshiftClusterStatus.ClusterStatus }}' 
          StringEquals: available 
      Default: Finish 
  - name: PauseRedshiftCluster 
    action: 'aws:executeAwsApi' 
    description: | 
      ## PauseRedshiftCluster 
      Makes PauseCluster API call using Amazon Redshift Cluster identifier and pauses the cluster without taking any final snapshot. 
      ## Outputs 
      * Response: The standard HTTP response from the PauseCluster API. 
    timeoutSeconds: 600 
    isEnd: false 
    nextStep: VerifyRedshiftClusterPause 
    inputs: 
      Service: redshift 
      Api: PauseCluster 
      ClusterIdentifier: '{{ ClusterIdentifier }}' 
    outputs: 
      - Name: Response 
        Selector: $ 
        Type: StringMap 
  - name: VerifyRedshiftClusterPause 
    action: 'aws:assertAwsResourceProperty' 
    timeoutSeconds: 600 
    isEnd: true 
    description: | 
      ## VerifyRedshiftClusterPause 
      Verifies the given Amazon Redshift Cluster is paused. 
    inputs: 
      Service: redshift 
      Api: DescribeClusters 
      ClusterIdentifier: '{{ ClusterIdentifier }}' 
      PropertySelector: '$.Clusters[0].ClusterStatus' 
      DesiredValues: 
        - pausing 
  - name: Finish 
    action: 'aws:sleep' 
    inputs: 
      Duration: PT1S 
    isEnd: true

The SSM automation document is deployed with the AWS CDK:

import yaml

from aws_cdk import aws_ssm as ssm

# Read the automation document (YAML) as a dict; the file path below is an example
with open("pause_redshift_cluster.yaml") as f:
    ssm_document_content = yaml.safe_load(f)

document_id = 'automation_id'
document_name = 'automation_name'

document = ssm.CfnDocument(scope, id=document_id, content=ssm_document_content,
                           document_format="YAML", document_type='Automation', name=document_name)

To run the automation document, AWS Config needs the right permissions. You can create an IAM role for this purpose:

from aws_cdk import aws_iam as iam

#Create role for the automation 
role_name = 'role-to-pause-redshift'
automation_role = iam.Role(scope, 'role-to-pause-redshift-cluster', 
                           assumed_by=iam.ServicePrincipal('ssm.amazonaws.com'), 
                           role_name=role_name) 

automation_policy = iam.Policy(scope, "policy-to-pause-cluster", 
                               policy_name='policy-to-pause-cluster', 
                               statements=[ 
                                   iam.PolicyStatement( 
                                       effect=iam.Effect.ALLOW, 
                                       actions=['redshift:PauseCluster', 
                                                'redshift:DescribeClusters'], 
                                       resources=['*'] 
                                   ) 
                               ]) 

automation_role.attach_inline_policy(automation_policy) 
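
Swisscom expresses the rule-to-remediation association inside conformance pack templates, described next. For illustration only, the following minimal sketch shows what wiring a single AWS Config managed rule to the pause automation document as an automatic remediation could look like; the rule choice, construct IDs, and retry settings are assumptions, not Swisscom's production code.

from aws_cdk import aws_config

# Illustrative managed rule: flag Redshift clusters that allow public access
public_access_rule = aws_config.ManagedRule(
    scope, "redshift-public-access-rule",
    identifier="REDSHIFT_CLUSTER_PUBLIC_ACCESS_CHECK",
)

# Remediate non-compliant clusters by invoking the pause automation document
aws_config.CfnRemediationConfiguration(
    scope, "redshift-pause-remediation",
    config_rule_name=public_access_rule.config_rule_name,
    target_type="SSM_DOCUMENT",
    target_id=document_name,   # the PauseRedshiftCluster-WithCheck document deployed above
    automatic=True,
    maximum_automatic_attempts=3,
    retry_attempt_seconds=60,
    parameters={
        "AutomationAssumeRole": {"StaticValue": {"Values": [automation_role.role_arn]}},
        # RESOURCE_ID resolves to the identifier of the non-compliant cluster
        "ClusterIdentifier": {"ResourceValue": {"Value": "RESOURCE_ID"}},
    },
)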

Swisscom defined the rules to be applied following AWS best practices (see Security Best Practices for Amazon Redshift). These are deployed as AWS Config conformance packs. A conformance pack is a collection of AWS Config rules and remediation actions that can be quickly deployed as a single entity in an AWS account and AWS Region or across an organization in AWS Organizations.

Conformance packs are created by authoring YAML templates that contain the list of AWS Config managed or custom rules and remediation actions. You can also use SSM documents to store your conformance pack templates on AWS and directly deploy conformance packs using SSM document names.

This AWS conformance pack can be deployed using the AWS CDK:

from aws_cdk import aws_config

# Read the conformance pack template (YAML) as a string; the file path below is an example
with open("redshift_conformance_pack.yaml") as f:
    conformance_pack_template = f.read()

# Substitute the automation role ARN and the SSM document name in the template
conformance_pack_content = (conformance_pack_template
                            .replace("role_arn_for_substitution", automation_role.role_arn)
                            .replace("document_for_substitution", document_name))

conformance_pack_id = 'conformance-pack-id'
conformance_pack_name = 'conformance-pack-name'

conformance_pack = aws_config.CfnConformancePack(scope, id=conformance_pack_id,
                                                 conformance_pack_name=conformance_pack_name,
                                                 template_body=conformance_pack_content)

Conclusion

Swisscom is building its next-generation data-as-a-service platform through a combination of automated provisioning processes, advanced security features, and user-configurable options to cater for diverse data handling and data products’ needs. The integration of the Amazon Redshift construct in the ODP framework is a significant stride in Swisscom’s journey towards a more connected and data-driven enterprise landscape.

In Part 1 of this series, we demonstrated how to provision a secure and compliant Redshift cluster using the AWS CDK as well as how to deal with the best practices of secret rotation. We also showed how to use AWS CDK custom resources in automating the creation of dynamic user groups that are relevant for the IAM roles matching different job functions.

In this post, we showed, through the usage of the AWS CDK, how to address key Redshift cluster usage topics such as federation with the Swisscom IdP, JDBC connections, detective controls using AWS Config rules and remediation actions, cost optimization using the Redshift scheduler, and audit logging.

The code snippets in this post are provided as is and will need to be adapted to your specific use cases. Before you get started, we highly recommend speaking to an Amazon Redshift specialist.


About the Authors

Asad bin Imtiaz is an Expert Data Engineer at Swisscom, with over 17 years of experience in architecting and implementing enterprise-level data solutions.

Jesús Montelongo Hernández is an Expert Cloud Data Engineer at Swisscom. He has over 20 years of experience in IT systems, data warehousing, and data engineering.

Samuel Bucheli is a Lead Cloud Architect at Zühlke Engineering AG. He has over 20 years of experience in software engineering, software architecture, and cloud architecture.

Srikanth Potu is a Senior Consultant in EMEA, part of the Professional Services organization at Amazon Web Services. He has over 25 years of experience in Enterprise data architecture, databases and data warehousing.

How Swisscom automated Amazon Redshift as part of their One Data Platform solution using AWS CDK – Part 1

Post Syndicated from Asad Bin Imtiaz original https://aws.amazon.com/blogs/big-data/how-swisscom-automated-amazon-redshift-as-part-of-their-one-data-platform-solution-using-aws-cdk-part-1/

Swisscom is a leading telecommunications provider in Switzerland. Swisscom’s Data, Analytics, and AI division is building a One Data Platform (ODP) solution that will enable every Swisscom employee, process, and product to benefit from the massive value of Swisscom’s data.

In a two-part series, we talk about Swisscom’s journey of automating Amazon Redshift provisioning as part of the Swisscom ODP solution using the AWS Cloud Development Kit (AWS CDK), and we provide code snippets and other useful references.

In this post, we deep dive into provisioning a secure and compliant Redshift cluster using the AWS CDK and discuss the best practices of secret rotation. We also explain how Swisscom used AWS CDK custom resources in automating the creation of dynamic user groups that are relevant for the AWS Identity and Access management (IAM) roles matching different job functions.

In Part 2 of this series, we explore using the AWS CDK and some of the key topics for self-service usage of the provisioned Redshift cluster by end-users as well as other managed services and applications. These topics include federation with the Swisscom identity provider (IdP), JDBC connections, detective controls using AWS Config rules and remediation actions, cost optimization using the Redshift scheduler, and audit logging.

Amazon Redshift is a fast, scalable, secure, fully managed, and petabyte scale data warehousing service empowering organizations and users to analyze massive volumes of data using standard SQL tools. Amazon Redshift benefits from seamless integration with many AWS services, such as Amazon Simple Storage Service (Amazon S3), AWS Key Management Service (AWS KMS), IAM, and AWS Lake Formation, to name a few.

The AWS CDK helps you build reliable, scalable, and cost-effective applications in the cloud with the considerable expressive power of a programming language. The AWS CDK supports TypeScript, JavaScript, Python, Java, C#/.Net, and Go. Developers can use one of these supported programming languages to define reusable cloud components known as constructs. A data product owner in Swisscom can use the ODP AWS CDK libraries with a simple config file to provision ready-to-use infrastructure, such as S3 buckets; AWS Glue ETL (extract, transform, and load) jobs, Data Catalog databases, and crawlers; Redshift clusters; JDBC connections; and more, with all the needed permissions in just a few minutes.

One Data Platform

The ODP architecture is based on the AWS Well Architected Framework Analytics Lens and follows the pattern of having raw, standardized, conformed, and enriched layers as described in Modern data architecture. By using infrastructure as code (IaC) tools, ODP enables self-service data access with unified data management, metadata management (data catalog), and standard interfaces for analytics tools with a high degree of automation by providing the infrastructure, integrations, and compliance measures out of the box. At the same time, the ODP will also be continuously evolving and adapting to the constant stream of new additional features being added to the AWS analytics services. The following high-level architecture diagram shows ODP with different layers of the modern data architecture. In this series, we specifically discuss the components specific to Amazon Redshift (highlighted in red).

Harnessing Amazon Redshift for ODP

A pivotal decision in the data warehousing migration process involves evaluating the extent of a lift-and-shift approach vs. re-architecture. Balancing system performance, scalability, and cost while taking into account the rigid system pieces requires a strategic solution. In this context, Amazon Redshift has stood out as a cloud-centered data warehousing solution, especially with its straightforward and seamless integration into the modern data architecture. Its straightforward integration and fluid compatibility with AWS services like Amazon QuickSight, Amazon SageMaker, and Lake Formation further solidifies its choice for forward-thinking data warehousing strategies. As a columnar database, it’s particularly well suited for consumer-oriented data products. Consequently, Swisscom chose to provide a solution wherein use case-specific Redshift clusters are provisioned using IaC, specifically using the AWS CDK.

A crucial aspect of Swisscom’s strategy is the integration of these data domain and use case-oriented individual clusters into a virtually single and unified data environment, making sure that data ingestion, transformation, and eventual data product sharing remains convenient and seamless. This is achieved by custom provisioning of the Redshift clusters based on user or use case needs, in a shared virtual private cloud (VPC), with data and system governance policies and remediation, IdP federation, and Lake Formation integration already in place.

Although many controls for governance and security were put in place in the AWS CDK construct, Swisscom users also have the flexibility to customize their clusters based on what they need. The cluster configurator allows users to define the cluster characteristics based on individual use case requirements while remaining within the bounds of defined best practices. The key configurable parameters include node types, sizing, subnet types for routing based on different security policies per use case, enabling the scheduler, integration with the IdP setup, and any additional post-provisioning setup, like the creation of specific schemas and group-level access on them. This flexibility in configuration is achieved for the Amazon Redshift AWS CDK construct through a Python data class, which serves as a template for users to specify aspects like subnet types, scheduler cron expressions, and specific security groups for the cluster, among other configurations. Users are also able to select the type of subnets (routable-private or non-routable-private) to adhere to network security policies and architectural standards. See the following data class options:

from dataclasses import dataclass
from typing import Optional

@dataclass
class RedShiftOptions:
    node_type: NodeType
    number_of_nodes: int
    vpc_id: str
    security_group_id: Optional[str]
    subnet_type: SubnetType
    use_redshift_scheduler: bool
    scheduler_pause_cron: str
    scheduler_resume_cron: str
    maintenance_window: str
    # Additional configuration options ...

The separation of the configuration in the RedShiftOptions data class from the cluster provisioning logic in the RedShiftCluster AWS CDK construct is in line with AWS CDK best practices, wherein both constructs and stacks should accept a property object to allow for full configurability in code. This separates the concerns of configuration and resource creation, enhancing readability and maintainability. The data class structure reflects the user configuration from a configuration file, making it straightforward for users to specify their requirements. The following code shows what the configuration file for the Redshift construct looks like:

# ===============================
# Amazon Redshift Options
# ===============================
# The enriched layer is based on Amazon Redshift.
# This section has properties for Amazon Redshift.
#
redshift_options:
  provision_cluster: true                                     # Skip provisioning Amazon Redshift in enriched layer (required)
  number_of_nodes: 2                                          # Number of nodes for redshift cluster to provision (optional) (default = 2)
  node_type: "ra3.xlplus"                                     # Type of the cluster nodes (optional) (default = "ra3.xlplus")
  use_scheduler: true                                        # Whether to use the Amazon Redshift scheduler (optional)
  scheduler_pause_cron: "cron(00 18 ? * MON-FRI *)"           # Cron expression for scheduler pause (optional)
  scheduler_resume_cron: "cron(00 08 ? * MON-FRI *)"          # Cron expression for scheduler resume (optional)
  maintenance_window: "sun:23:45-mon:00:15"                   # Maintenance window for Amazon Redshift (optional)
  subnet_type: "routable-private"                             # 'routable-private' OR 'non-routable-private' (optional)
  security_group_id: "sg-test-redshift"                       # Security group ID for Amazon Redshift (optional) (reference must exist)
  user_groups:                                                # User groups and their privileges on default DB
    - group_name: dba
      access: [ 'ALL' ]
    - group_name: data_engineer
      access: [ 'SELECT' , 'INSERT' , 'UPDATE' , 'DELETE' , 'TRUNCATE' ]
    - group_name: qa_engineer
      access: [ 'SELECT' ]
  integrate_all_groups_with_idp: false
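
The construct itself then consumes this configuration as a property object. The following is an illustrative sketch only; the RedShiftCluster construct name and its internals are simplified placeholders rather than Swisscom's actual implementation.

from constructs import Construct

class RedShiftCluster(Construct):
    # Illustrative sketch: the construct receives the parsed options object,
    # keeping configuration separate from resource creation
    def __init__(self, scope: Construct, construct_id: str, options: RedShiftOptions) -> None:
        super().__init__(scope, construct_id)
        self._options = options
        # ... create the cluster, scheduler, JDBC connection, and user groups from options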

Admin user secret rotation

As part of the cluster deployment, an admin user is created with its credentials stored in AWS Secrets Manager for database management. This admin user is used for automating several setup operations, such as the setup of database schemas and integration with Lake Formation. For the admin user, as well as other users created for Amazon Redshift, Swisscom used AWS KMS for encryption of the secrets associated with cluster users. The use of Secrets Manager made it simple to adhere to IAM security best practices by supporting the automatic rotation of credentials. Such a setup can be quickly implemented on the AWS Management Console or may be integrated in AWS CDK code with friendly methods in the aws_redshift_alpha module. This module provides higher-level constructs (specifically, Layer 2 constructs), including convenience and helper methods, as well as sensible default values. This module is experimental and under active development and may have changes that aren’t backward compatible. See the following admin user code:

admin_secret_kms_key_options = KmsKeyOptions(
    ...
    key_name='redshift-admin-secret',
    service="secretsmanager"
)
admin_secret_kms_key = aws_kms.Key(
    scope, 'AdminSecretKmsKey',
    # ...
)

# ...

cluster = aws_redshift_alpha.Cluster(
            scope, cluster_identifier,
            # ...
            master_user=aws_redshift_alpha.Login(
                master_username='admin',
                encryption_key=admin_secret_kms_key
                ),
            default_database_name=database_name,
            # ...
        )

See the following code for secret rotation:

self.cluster.add_rotation_single_user(aws_cdk.Duration.days(60))

Methods such as add_rotation_single_user internally rely on a serverless application hosted in the AWS Serverless Application Model repository, which may be in a different AWS Region outside of the organization’s permission boundary. To effectively use such functions, make sure that access to this serverless repository is allowed by your organization’s service control policies. If such access is not feasible, consider implementing solutions such as custom AWS Lambda functions replicating these functionalities (within your organization’s permission boundary).
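
For the latter case, a custom rotation function can be attached to the cluster secret directly. The following is a minimal sketch under the assumption that a rotation handler exists at the lambda/rotation asset path; the function name, runtime, asset path, and the rotation logic itself (not shown) are illustrative.

from aws_cdk import Duration, aws_lambda, aws_secretsmanager

# Illustrative custom rotation function; the handler implementation and its
# Secrets Manager and Amazon Redshift permissions are omitted here
rotation_fn = aws_lambda.Function(
    scope, "redshift-admin-secret-rotation-fn",
    runtime=aws_lambda.Runtime.PYTHON_3_11,
    handler="rotate.handler",
    code=aws_lambda.Code.from_asset("lambda/rotation"),
)

# Rotate the admin secret every 60 days using the custom function
aws_secretsmanager.RotationSchedule(
    scope, "redshift-admin-secret-rotation",
    secret=cluster.secret,
    rotation_lambda=rotation_fn,
    automatically_after=Duration.days(60),
)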

AWS CDK custom resource

A key challenge Swisscom faced was automating the creation of dynamic user groups tied to specific IAM roles at deployment time. As an initial and simple solution, Swisscom’s approach was creating an AWS CDK custom resource using the admin user to submit and run SQL statements. This allowed Swisscom to embed the logic for the database schema, user group assignments, and Lake Formation-specific configurations directly within AWS CDK code, making sure that these crucial steps are automatically handled during cluster deployment. See the following code:

sql = get_rendered_stacked_sqls()

groups_cr = custom_resources.AwsCustomResource(
    scope, 'RedshiftSQLCustomResource',
    on_update=custom_resources.AwsSdkCall(
        service='RedshiftData',
        action='executeStatement',
        parameters={
            'ClusterIdentifier': cluster_identifier,
            'SecretArn': secret_arn,
            'Database': database_name,
            'Sql': sql,
        },
        physical_resource_id=custom_resources.PhysicalResourceId.of(
            f'{account}-{region}-{cluster_identifier}-groups')
    ),
    policy=custom_resources.AwsCustomResourcePolicy.from_sdk_calls(
        resources=[f'arn:aws:redshift:{region}:{account}:cluster:{cluster_identifier}']
    )
)

cluster.secret.grant_read(groups_cr)

This method of dynamic SQL, embedded within the AWS CDK code, provides a unified deployment and post-setup of the Redshift cluster in a convenient manner. Although this approach unifies the deployment and post-provisioning configuration with SQL-based operations, it remains an initial strategy. It is tailored for convenience and efficiency in the current context. As ODP further evolves, Swisscom will iterate this solution to streamline SQL operations during cluster provisioning. Swisscom remains open to integrating external schema management tools or similar approaches where they add value.

Another aspect of Swisscom’s architecture is the dynamic creation of IAM roles tailored for the user groups for different job functions within the Amazon Redshift environment. This IAM role generation is also driven by the user configuration, acting as a blueprint for dynamically defining user role to policy mappings. This allowed them to quickly adapt to evolving requirements. The following code illustrates the role assignment:

policy_mappings = {
    "role1": ["Policy1", "Policy2"],
    "role2": ["Policy3", "Policy4"],
    ...
    # Example:
    # "dba-role": ["AmazonRedshiftFullAccess", "CloudWatchFullAccess"],
    # ...
}

def create_redshift_role(data_product_name, role_name, policy_names):
   # Implementation to create Redshift role with provided policies
   ...

redshift_role_1 = create_redshift_role(
    data_product_name, "role1", policy_names=policy_mappings["role1"])
redshift_role_2 = create_redshift_role(
    data_product_name, "role2", policy_names=policy_mappings["role2"])
# Example:
# redshift_dba_role = create_redshift_role(
#   data_product_name, "dba-role", policy_names=policy_mappings["dba-role"])
...
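
As an illustration only, a factory function like create_redshift_role could be implemented along the following lines; this is a sketch, not Swisscom's implementation, and it assumes the enclosing construct exposes scope in the same way as the other snippets in this post.

from aws_cdk import aws_iam as iam

def create_redshift_role(data_product_name, role_name, policy_names):
    # Create an IAM role that Amazon Redshift can assume and attach the requested managed policies
    role = iam.Role(
        scope, f"{data_product_name}-{role_name}",
        role_name=f"{data_product_name}-{role_name}",
        assumed_by=iam.ServicePrincipal("redshift.amazonaws.com"),
    )
    for policy_name in policy_names:
        role.add_managed_policy(iam.ManagedPolicy.from_aws_managed_policy_name(policy_name))
    return role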

Conclusion

Swisscom is building its data-as-a-service platform, and Amazon Redshift has a crucial role as part of the solution. In this post, we discussed the aspects that need to be covered in your IaC best practices to deploy secure and maintainable Redshift clusters using the AWS CDK. Although Amazon Redshift supports industry-leading security, there are aspects organizations need to adjust to their specific requirements. It is therefore important to define the configurations and best practices that are right for your organization and bring it to your IaC to make it available for your end consumers.

We also discussed how to provision a secure and compliant Redshift cluster using the AWS CDK and deal with the best practices of secret rotation. We also showed how to use AWS CDK custom resources in automating the creation of dynamic user groups that are relevant for the IAM roles matching different job functions.

In Part 2 of this series, we will delve into enhancing self-service capabilities for end-users. We will cover topics like integration with the Swisscom IdP, setting up JDBC connections, and implementing detective controls and remediation actions, among others.

The code snippets in this post are provided as is and will need to be adapted to your specific use cases. Before you get started, we highly recommend speaking to an Amazon Redshift specialist.


About the Authors

Asad bin Imtiaz is an Expert Data Engineer at Swisscom, with over 17 years of experience in architecting and implementing enterprise-level data solutions.

Jesús Montelongo Hernández is an Expert Cloud Data Engineer at Swisscom. He has over 20 years of experience in IT systems, data warehousing, and data engineering.

Samuel Bucheli is a Lead Cloud Architect at Zühlke Engineering AG. He has over 20 years of experience in software engineering, software architecture, and cloud architecture.

Srikanth Potu is a Senior Consultant in EMEA, part of the Professional Services organization at Amazon Web Services. He has over 25 years of experience in Enterprise data architecture, databases and data warehousing.

Integrate Tableau and Okta with Amazon Redshift using AWS IAM Identity Center

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/integrate-tableau-and-okta-with-amazon-redshift-using-aws-iam-identity-center/

This blog post is co-written with Sid Wray and Jake Koskela from Salesforce, and Adiascar Cisneros from Tableau. 

Amazon Redshift is a fast, scalable cloud data warehouse built to serve workloads at any scale. With Amazon Redshift as your data warehouse, you can run complex queries using sophisticated query optimization to quickly deliver results to Tableau, which offers a comprehensive set of capabilities and connectivity options for analysts to efficiently prepare, discover, and share insights across the enterprise. For customers who want to integrate Amazon Redshift with Tableau using single sign-on capabilities, we introduced AWS IAM Identity Center integration to seamlessly implement authentication and authorization.

IAM Identity Center provides capabilities to manage single sign-on access to AWS accounts and applications from a single location. Amazon Redshift now integrates with IAM Identity Center and supports trusted identity propagation, making it possible to integrate with third-party identity providers (IdPs) such as Microsoft Entra ID (Azure AD), Okta, Ping, and OneLogin. This integration positions Amazon Redshift as an IAM Identity Center-managed application, enabling you to use database role-based access control on your data warehouse for enhanced security. Role-based access control allows you to apply fine-grained access control using row-level security, column-level security, and dynamic data masking in your data warehouse.

AWS and Tableau have collaborated to enable single sign-on support for accessing Amazon Redshift from Tableau. Tableau now supports single sign-on capabilities with the Amazon Redshift connector to simplify authentication and authorization. The Tableau Desktop 2024.1 and Tableau Server 2023.3.4 releases support trusted identity propagation with IAM Identity Center. This allows users to seamlessly access Amazon Redshift data within Tableau using their external IdP credentials without needing to specify AWS Identity and Access Management (IAM) roles in Tableau. This single sign-on integration is available for Tableau Desktop, Tableau Server, and Tableau Prep.

In this post, we outline a comprehensive guide for setting up single sign-on to Amazon Redshift using integration with IAM Identity Center and Okta as the IdP. By following this guide, you’ll learn how to enable seamless single sign-on authentication to Amazon Redshift data sources directly from within Tableau Desktop, streamlining your analytics workflows and enhancing security.

Solution overview

The following diagram illustrates the architecture of the Tableau SSO integration with Amazon Redshift, IAM Identity Center, and Okta.

Figure 1: Solution overview for Tableau integration with Amazon Redshift using IAM Identity Center and Okta

The solution depicted in Figure 1 includes the following steps:

  1. The user configures Tableau to access Amazon Redshift using IAM Identity Center authentication.
  2. On a user sign-in attempt, Tableau initiates a browser-based OAuth flow and redirects the user to the Okta login page to enter the login credentials.
  3. On successful authentication, Okta issues an authentication token (ID and access token) to Tableau.
  4. The Redshift driver then makes a call to the Redshift-enabled IAM Identity Center application and forwards the access token.
  5. Amazon Redshift passes the token to IAM Identity Center and requests an access token.
  6. IAM Identity Center verifies and validates the token using the OIDC discovery connection to the trusted token issuer and returns an IAM Identity Center-generated access token for the same user. In Figure 1, the trusted token issuer (TTI) is the Okta server that IAM Identity Center trusts to provide tokens that third-party applications like Tableau use to call AWS services.
  7. Amazon Redshift then uses the token to obtain the user and group membership information from IAM Identity Center.
  8. The Tableau user can then connect to Amazon Redshift and access data based on the user and group membership returned from IAM Identity Center.

Prerequisites

Before you begin implementing the solution, make sure that you have the following in place:

Walkthrough

In this walkthrough, you build the solution with the following steps:

  • Set up the Okta OIDC application
  • Set up the Okta authorization server
  • Set up the Okta claims
  • Set up the Okta access policies and rules
  • Set up trusted token issuer in AWS IAM Identity Center
  • Set up client connections and trusted token issuers
  • Set up the Tableau OAuth config files for Okta
  • Install the Tableau OAuth config file for Tableau Desktop
  • Set up the Tableau OAuth config file for Tableau Server or Tableau Cloud
  • Federate to Amazon Redshift from Tableau Desktop
  • Federate to Amazon Redshift from Tableau Server

Set up the Okta OIDC application

To create an OIDC web app in Okta, you can follow the instructions in this video, or use the following steps to create the web app in the Okta admin console:

Note: The Tableau Desktop redirect URLs should always use localhost. The examples below also use localhost for the Tableau Server hostname for ease of testing in a test environment. For this setup, you should also access the server at localhost in the browser. If you decide to use localhost for early testing, you will also need to configure the gateway to accept localhost using this tsm command:

 tsm configuration set -k gateway.public.host -v localhost

In a production environment, or Tableau Cloud, you should use the full hostname that your users will access Tableau on the web, along with https. If you already have an environment with https configured, you may skip the localhost configuration and use the full hostname from the start.

  1. Sign in to your Okta organization as a user with administrative privileges.
  2. On the admin console, under Applications in the navigation pane, choose Applications.
  3. Choose Create App Integration.
  4. Select OIDC – OpenID Connect as the Sign-in method and Web Application as the Application type.
  5. Choose Next.
  6. In General Settings:
    1. App integration name: Enter a name for your app integration. For example, Tableau_Redshift_App.
    2. Grant type: Select Authorization Code and Refresh Token.
    3. Sign-in redirect URIs: The sign-in redirect URI is where Okta sends the authentication response and ID token for the sign-in request. The URIs must be absolute URIs. Choose Add URI and, along with the default URI, add the following URIs.
      • http://localhost:55556/Callback
      • http://localhost:55557/Callback
      • http://localhost:55558/Callback
      • http://localhost/auth/add_oauth_token
    4. Sign-out redirect URIs: keep the default value as http://localhost:8080.
    5. Skip the Trusted Origins section and for Assignments, select Skip group assignment for now.
    6. Choose Save.

Figure 2: OIDC application

  1. In the General Settings section, choose Edit and select Require PKCE as additional verification under Proof Key for Code Exchange (PKCE). This option indicates if a PKCE code challenge is required to verify client requests.
  2. Choose Save.

Figure 3: OIDC App Overview

  1. Select the Assignments tab and then choose Assign to Groups. In this example, we’re assigning awssso-finance and awssso-sales.
  2. Choose Done.

Figure 4: OIDC application group assignments

For more information on creating an OIDC app, see Create OIDC app integrations.

Set up the Okta authorization server

Okta allows you to create multiple custom authorization servers that you can use to protect your own resource servers. Within each authorization server you can define your own OAuth 2.0 scopes, claims, and access policies. If you have an Okta Developer Edition account, you already have a custom authorization server created for you called default.

For this blog post, we use the default custom authorization server. If your application has requirements such as requiring more scopes, customizing rules for when to grant scopes, or you need more authorization servers with different scopes and claims, then you can follow this guide.

Figure 5: Authorization server

Set up the Okta claims

Tokens contain claims that are statements about the subject (for example: name, role, or email address). For this example, we use the default custom claim sub. Follow this guide to create claims.

Figure 6: Create claims

Set up the Okta access policies and rules

Access policies are containers for rules. Each access policy applies to a particular OpenID Connect application. The rules that the policy contains define different access and refresh token lifetimes depending on the nature of the token request. In this example, you create a simple policy for all clients as shown in Figure 7 that follows. Follow this guide to create access policies and rules.

Figure 7: Create access policies

Rules for access policies define token lifetimes for a given combination of grant type, user, and scope. They’re evaluated in priority order and after a matching rule is found, no other rules are evaluated. If no matching rule is found, then the authorization request fails. This example uses the rule depicted in Figure 8 that follows. Follow this guide to create rules for your use case.

Figure 8: Access policy rules

Set up trusted token issuer in AWS IAM Identity Center

At this point, you switch to setting up the AWS configuration, starting by adding a trusted token issuer (TTI), which makes it possible to exchange tokens. This involves connecting IAM Identity Center to the OpenID Connect (OIDC) discovery URL of the external OAuth authorization server and defining an attribute-based mapping between the user from the external OAuth authorization server and a corresponding user in IAM Identity Center. In this step, you create a TTI in the centralized management account. To create a TTI:

  1. Open the AWS Management Console and navigate to IAM Identity Center, and then to the Settings page.
  2. Select the Authentication tab and under Trusted token issuers, choose Create trusted token issuer.
  3. On the Set up an external IdP to issue trusted tokens page, under Trusted token issuer details, do the following:
    • For Issuer URL, enter the OIDC discovery URL of the external IdP that will issue tokens for trusted identity propagation. The administrator of the external IdP can provide this URL (for example, https://prod-1234567.okta.com/oauth2/default).

To get the issuer URL from Okta, sign in to Okta as an admin, navigate to Security and then API, choose default under the Authorization Servers tab, and copy the Issuer URL.

Figure 9: Authorization server issuer

  1. For Trusted token issuer name, enter a name to identify this trusted token issuer in IAM Identity Center and in the application console.
  2. Under Map attributes, do the following:
    • For Identity provider attribute, select an attribute from the list to map to an attribute in the IAM Identity Center identity store.
    • For IAM Identity Center attribute, select the corresponding attribute for the attribute mapping.
  3. Under Tags (optional), choose Add new tag, enter a value for Key and optionally for Value. Choose Create trusted token issuer. For information about tags, see Tagging AWS IAM Identity Center resources.

This example uses Subject (sub) as the Identity provider attribute to map with Email from the IAM Identity Center attribute. Figure 10 that follows shows the setup for the TTI.

Figure 10: Create Trusted Token Issuer

Set up client connections and trusted token issuers

In this step, the Amazon Redshift applications that exchange externally generated tokens must be configured to use the TTI you created in the previous step. Also, the audience claim (or aud claim) from Okta must be specified. In this example, you are configuring the Amazon Redshift application in the member account where the Amazon Redshift cluster or serverless instance exists.

  1. Select IAM Identity Center connection from Amazon Redshift console menu.

Figure 11: Amazon Redshift IAM Identity Center connection

  1. Select the Amazon Redshift application that you created as part of the prerequisites.
  2. Select the Client connections tab and choose Edit.
  3. Choose Yes under Configure client connections that use third-party IdPs.
  4. Select the checkbox for Trusted token issuer which you have created in the previous section.
  5. Enter the aud claim value under section Configure selected trusted token issuers. For example, okta_tableau_audience.

To get the audience value from Okta, sign in to Okta as an admin, navigate to Security and then API, choose default under the Authorization Servers tab, and copy the Audience value.

Figure 12: Authorization server audience

Note: The audience claim value must exactly match the IdP audience value, otherwise your OIDC connection with a third-party application like Tableau will fail.

  1. Choose Save.

Figure 13: Adding Audience Claim for Trusted Token Issuer

Set up the Tableau OAuth config files for Okta

At this point, your IAM Identity Center, Amazon Redshift, and Okta configuration are complete. Next, you need to configure Tableau.

To integrate Tableau with Amazon Redshift using IAM Identity Center, you need to use a custom XML configuration. In this step, you use the following XML and replace the values starting with the $ sign. The rest of the values can be kept as they are, or you can modify them based on your use case. For detailed information on each of the elements in the XML file, see the Tableau documentation on GitHub.

Note: The XML file will be used for all the Tableau products including Tableau Desktop, Server, and Cloud.

<?xml version="1.0" encoding="utf-8"?>
<pluginOAuthConfig>
<dbclass>redshift</dbclass>
<oauthConfigId>custom_redshift_okta</oauthConfigId>
<clientIdDesktop>$copy_client_id_from_okta_oidc_app</clientIdDesktop>
<clientSecretDesktop>$copy_client_secret_from_okta_oidc_app</clientSecretDesktop>
<redirectUrisDesktop>http://localhost:55556/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55557/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55558/Callback</redirectUrisDesktop>
<authUri>https://$copy_okta_host_value.okta.com/oauth2/default/v1/authorize</authUri>
<tokenUri>https://$copy_okta_host_value.okta.com/oauth2/default/v1/token</tokenUri>
<scopes>openid</scopes>
<scopes>email</scopes>
<scopes>profile</scopes>
<scopes>offline_access</scopes>
<capabilities>
<entry>
<key>OAUTH_CAP_FIXED_PORT_IN_CALLBACK_URL</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_PKCE_REQUIRES_CODE_CHALLENGE_METHOD</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_REQUIRE_PKCE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_STATE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_CLIENT_SECRET_IN_URL_QUERY_PARAM</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_GET_USERINFO_FROM_ID_TOKEN</key>
<value>true</value>
</entry>
</capabilities>
<accessTokenResponseMaps>
<entry>
<key>ACCESSTOKEN</key>
<value>access_token</value>
</entry>
<entry>
<key>REFRESHTOKEN</key>
<value>refresh_token</value>
</entry>
<entry>
<key>id-token</key>
<value>id_token</value>
</entry>
<entry>
<key>access-token-issue-time</key>
<value>issued_at</value>
</entry>
<entry>
<key>access-token-expires-in</key>
<value>expires_in</value>
</entry>
<entry>
<key>username</key>
<value>preferred_username</value>
</entry>
</accessTokenResponseMaps>
</pluginOAuthConfig>

The following is an example XML file:

<?xml version="1.0" encoding="utf-8"?>
<pluginOAuthConfig>
<dbclass>redshift</dbclass>
<oauthConfigId>custom_redshift_okta</oauthConfigId>
<clientIdDesktop>ab12345z-a5nvb-123b-123b-1c434ghi1234</clientIdDesktop>
<clientSecretDesktop>3243jkbkjb~~ewf.112121.3432423432.asd834k</clientSecretDesktop>
<redirectUrisDesktop>http://localhost:55556/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55557/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55558/Callback</redirectUrisDesktop>
<authUri>https://prod-1234567.okta.com/oauth2/default/v1/authorize</authUri>
<tokenUri>https://prod-1234567.okta.com/oauth2/default/v1/token</tokenUri>
<scopes>openid</scopes>
<scopes>email</scopes>
<scopes>profile</scopes>
<scopes>offline_access</scopes>
<capabilities>
<entry>
<key>OAUTH_CAP_FIXED_PORT_IN_CALLBACK_URL</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_PKCE_REQUIRES_CODE_CHALLENGE_METHOD</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_REQUIRE_PKCE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_STATE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_CLIENT_SECRET_IN_URL_QUERY_PARAM</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_GET_USERINFO_FROM_ID_TOKEN</key>
<value>true</value>
</entry>
</capabilities>
<accessTokenResponseMaps>
<entry>
<key>ACCESSTOKEN</key>
<value>access_token</value>
</entry>
<entry>
<key>REFRESHTOKEN</key>
<value>refresh_token</value>
</entry>
<entry>
<key>id-token</key>
<value>id_token</value>
</entry>
<entry>
<key>access-token-issue-time</key>
<value>issued_at</value>
</entry>
<entry>
<key>access-token-expires-in</key>
<value>expires_in</value>
</entry>
<entry>
<key>username</key>
<value>preferred_username</value>
</entry>
</accessTokenResponseMaps>
</pluginOAuthConfig>

Install the Tableau OAuth config file for Tableau Desktop

After the configuration XML file is created, it must be copied to a location where the Amazon Redshift connector from Tableau Desktop can use it. Save the file from the previous step with a .xml extension under Documents\My Tableau Repository\OAuthConfigs.

Note: Currently this integration isn't supported on macOS because the Redshift ODBC 2.x driver isn't yet supported for macOS. It will be supported soon.

Set up the Tableau OAuth config file for Tableau Server or Tableau Cloud

To integrate with Amazon Redshift using IAM Identity Center authentication, you must install the Tableau OAuth config file in Tableau Server or Tableau Cloud.

  1. Sign in to the Tableau Server or Tableau Cloud using admin credentials.
  2. Navigate to Settings.
  3. Go to the OAuth Clients Registry and select Add OAuth Client.
  4. Choose the following settings:
    • Connection Type: Amazon Redshift
    • OAuth Provider: Custom_IdP
    • Client ID: Enter your IdP client ID value
    • Client Secret: Enter your client secret value
    • Redirect URL: Enter http://localhost/auth/add_oauth_token. This example uses localhost for testing in a local environment. You should use the full hostname with https.
    • Choose OAuth Config File. Select the XML file that you configured in the previous section.
    • Select Add OAuth Client and choose Save.

Figure 14: Create an OAuth connection in Tableau Server or Tableau Cloud

Federate to Amazon Redshift from Tableau Desktop

Now you’re ready to connect to Amazon Redshift from Tableau through federated sign-in using IAM Identity Center authentication. In this step, you create a Tableau Desktop report and publish it to Tableau Server.

  1. Open Tableau Desktop.
  2. Select Amazon Redshift Connector and enter the following values:
    1. Server: Enter the name of the server that hosts the database and the name of the database you want to connect to.
    2. Port: Enter 5439.
    3. Database: Enter your database name. This example uses dev.
    4. Authentication: Select OAuth.
    5. Federation Type: Select Identity Center.
    6. Identity Center Namespace: You can leave this value blank.
    7. OAuth Provider: This value should automatically be pulled from your configured XML. It will be the value from the element oauthConfigId.
    8. Select Require SSL.
    9. Choose Sign in.

Figure 15: Tableau Desktop OAuth connection

  3. Enter your IdP credentials in the browser pop-up window.

Figure 16: Okta Login Page

  4. When authentication is successful, you will see the message shown in Figure 17 that follows.

Figure 17: Successful authentication using Tableau

Congratulations! You’re signed in using IAM Identity Center integration with Amazon Redshift and are ready to explore and analyze your data using Tableau Desktop.

Figure 18: Successfully connected using Tableau Desktop

Figure 19 is a screenshot from the Amazon Redshift system table (sys_query_history) showing that user Ethan from Okta is accessing the sales report.

Figure 19: User audit in sys_query_history
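
If you want to run this audit yourself, the following query is a minimal sketch (the federated user name filter and the one-day window are hypothetical; adjust both for your environment). It joins SYS_QUERY_HISTORY with SVV_USER_INFO to resolve the user name:

-- List recent queries run by a specific federated user (hedged sketch)
SELECT u.user_name,
       q.query_id,
       q.database_name,
       q.status,
       q.start_time,
       q.query_text
FROM sys_query_history q
JOIN svv_user_info u
  ON q.user_id = u.user_id
WHERE u.user_name LIKE '%ethan%'                  -- hypothetical federated user
  AND q.start_time > DATEADD(day, -1, GETDATE())  -- last 24 hours
ORDER BY q.start_time DESC
LIMIT 20;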

After signing in, you can create your own Tableau Report on the desktop version and publish it to your Tableau Server. For this example, we created and published a report named SalesReport.

Federate to Amazon Redshift from Tableau Server

After you have published the report from Tableau Desktop to Tableau Server, sign in as a non-admin user and view the published report (SalesReport in this example) using IAM Identity Center authentication.

  1. Sign in to the Tableau Server site as a non-admin user.
  2. Navigate to Explore and go to the folder where your published report is stored.
  3. Select the report and choose Sign In.

Figure 20: Tableau Server Sign In

  4. To authenticate, enter your non-admin Okta credentials in the browser pop-up.

Figure 21: Okta Login Page

  5. After your authentication is successful, you can access the report.

Figure 22: Tableau report

Clean up

Complete the following steps to clean up your resources:

  1. Delete the IdP applications that you have created to integrate with IAM Identity Center.
  2. Delete the IAM Identity Center configuration.
  3. Delete the Amazon Redshift application and the Amazon Redshift provisioned cluster or serverless instance that you created for testing.
  4. Delete the IAM role and IAM policy that you created for IAM Identity Center and Amazon Redshift integration.
  5. Delete the permission set from IAM Identity Center that you created for Amazon Redshift Query Editor V2 in the management account.

Conclusion

This post covered streamlining access management for data analytics by using Tableau’s capability to support single sign-on based on the OAuth 2.0 OpenID Connect (OIDC) protocol. The solution enables federated user authentication, where user identities from an external IdP are trusted and propagated to Amazon Redshift. You walked through the steps to configure Tableau Desktop and Tableau Server to integrate seamlessly with Amazon Redshift using IAM Identity Center for single sign-on. By harnessing this integration of a third-party IdP with IAM Identity Center, users can securely access Amazon Redshift data sources within Tableau without managing separate database credentials.

The following are key resources to learn more about Amazon Redshift integration with IAM Identity Center:


About the Authors

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Sid Wray is a Senior Product Manager at Salesforce based in the Pacific Northwest with nearly 20 years of experience in Digital Advertising, Data Analytics, Connectivity Integration and Identity and Access Management. He currently focuses on supporting ISV partners for Salesforce Data Cloud.

Adiascar Cisneros is a Tableau Senior Product Manager based in Atlanta, GA. He focuses on the integration of the Tableau Platform with AWS services to amplify the value users get from our products and accelerate their journey to valuable, actionable insights. His background includes analytics, infrastructure, network security, and migrations.

Jade Koskela is a Principal Software Engineer at Salesforce. He has over a decade of experience building Tableau with a focus on areas including data connectivity, authentication, and identity federation.

Harshida Patel is a Principal Solutions Architect, Analytics with AWS.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Ravi Bhattiprolu is a Senior Partner Solutions Architect at Amazon Web Services (AWS). He collaborates with strategic independent software vendor (ISV) partners like Salesforce and Tableau to design and deliver innovative, well-architected cloud products, integrations, and solutions to help joint AWS customers achieve their business goals.

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

Post Syndicated from Krishna Gogineni original https://aws.amazon.com/blogs/big-data/migrate-a-petabyte-scale-data-warehouse-from-actian-vectorwise-to-amazon-redshift/

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.

In this post, we discuss how a financial services industry customer achieved scalability, resiliency, and availability by migrating from an on-premises Actian Vectorwise data warehouse to Amazon Redshift.

Challenges

The customer’s use case required a high-performing, highly available, and scalable data warehouse to process queries against large datasets in a low-latency environment. Their Actian Vectorwise system was designed to replace Excel plugins and stock screeners, but it eventually evolved into a much larger and more ambitious portfolio analysis solution running multiple API clusters on premises, serving some of the largest financial services firms worldwide. The customer saw growing demand for high performance and scalability due to a 30% year-over-year increase in usage driven by the success of their products. The customer needed to keep up with the increased volume of read requests, but they couldn’t do this without deploying additional hardware in the data center. There was also a customer mandate that business-critical products must have their hardware updated to cloud-based solutions or be deemed on the path to obsolescence. In addition, the business started moving customers onto a new commercial model, so new projects would need to provision a new cluster, which meant they needed improved performance, scalability, and availability.

They faced the following challenges:

  • Scalability – The customer understood that infrastructure maintenance was a growing issue and, although operations were a consideration, the existing implementation didn’t have a scalable and efficient solution to meet the advanced sharding requirements needed for query, reporting, and analysis. Over-provisioning of data warehouse capacity to meet unpredictable workloads left roughly 30% of capacity underutilized during normal operations.
  • Availability and resiliency – Because the customer was running business-critical analytical workloads, it required the highest levels of availability and resiliency, which was a concern with the on-premises data warehouse solution.
  • Performance – Some of their queries needed to be processed with priority, and users were starting to experience performance degradation with longer-running query times as usage of the solution grew. The need for a scalable and efficient solution to manage customer demand, address infrastructure maintenance concerns, replace legacy tooling, and tackle availability led them to choose Amazon Redshift as the future state solution. If these concerns were not addressed, the customer would be prevented from growing their user base.

Legacy architecture

The customer’s platform was the main source for one-time, batch, and content processing. It served many enterprise use cases across API feeds, content mastering, and analytics interfaces. It was also the single strategic platform within the company for entity screening, on-the-fly aggregation, and other one-time, complex request workflows.

The following diagram illustrates the legacy architecture.

The architecture consists of many layers:

  • Rules engine – The rules engine was responsible for intercepting every incoming request. Based on the nature of the request, it routed the request to the API cluster that could optimally process that specific request based on the response time requirement.
  • API – Scalability was one of the primary challenges with the existing on-premises system. It wasn’t possible to quickly scale up and down API service capacity to meet growing business demand. Both the API and data store had to support a highly volatile workload pattern. This included simple data retrieval requests that had to be processed within a few milliseconds vs. power user-style batch requests with complex analytics-based workloads that could take several seconds and significant compute resources to process. To separate these different workload patterns, the API and data store infrastructure was split into multiple isolated physical clusters. This made sure each workload group was provisioned with sufficient reserved capacity to meet the respective response time expectations. However, this model of reserving capacity for each workload type resulted in suboptimal usage of compute resources because each cluster would only process a specific workload type.
  • Data store – The data store used a custom data model that had been highly optimized to meet low-latency query response requirements. The current on-premises data store wasn’t horizontally scalable, and there was no built-in replication or data sharding capability. Due to this limitation, multiple database instances were created to meet concurrent scalability and availability requirements because the schema wasn’t generic per dataset. This model caused operational maintenance overhead and wasn’t easily expandable.
  • Data ingestion – Pentaho was used to ingest data sourced from multiple data publishers into the data store. The ingestion framework itself didn’t have any major challenges. However, the primary bottleneck was due to scalability issues associated with the data store. Because the data store didn’t support sharding or replication, data ingestion had to explicitly ingest the same data concurrently across multiple database nodes within a single transaction to provide data consistency. This significantly impacted overall ingestion speed.

Overall, the current architecture didn’t support workload prioritization; therefore, physical resources were reserved for each workload type. The downside of this approach is over-provisioning. The system also had integrations with legacy backend services that were all hosted on premises.

Solution overview

Amazon Redshift is an industry-leading cloud data warehouse. Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning (ML) to deliver the best price-performance at any scale.

Amazon Redshift is designed for high-performance data warehousing, which provides fast query processing and scalable storage to handle large volumes of data efficiently. Its columnar storage format minimizes I/O and improves query performance by reading only the relevant data needed for each query, resulting in faster data retrieval. Lastly, you can integrate Amazon Redshift with data lakes like Amazon Simple Storage Service (Amazon S3), combining structured and semi-structured data for comprehensive analytics.

The following diagram illustrates the architecture of the new solution.

In the following sections, we discuss the features of this solution and how it addresses the challenges of the legacy architecture.

Rules engine and API

Amazon API Gateway is a fully managed service that helps developers deliver secure, robust, API-driven application backends at any scale. To address the scalability and availability requirements of the rules and routing layer, we introduced API Gateway to route client requests to different integration paths using routes and parameter mappings. Having API Gateway as the entry point allowed the customer to move away from the design, testing, and maintenance of their rules engine development workload. In their legacy environment, handling fluctuating amounts of traffic posed a significant challenge. However, API Gateway seamlessly addressed this issue by acting as a proxy and automatically scaling to accommodate varying traffic demands, providing optimal performance and reliability.

Data storage and processing

Amazon Redshift allowed the customer to meet their scalability and performance requirements. Amazon Redshift features such as workload management (WLM), massively parallel processing (MPP) architecture, concurrency scaling, and parameter groups helped address the requirements:

  • WLM provided the ability for query prioritization and managing resources effectively
  • The MPP architecture model provided horizontal scalability
  • Concurrency scaling added additional cluster capacity to handle unpredictable and spiky workloads
  • Parameter groups defined configuration parameters that control database behavior

Together, these capabilities allowed them to meet their scalability and performance requirements in a managed fashion.
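
As a quick way to verify the resulting setup, the queue configuration and the live queue state can be inspected from Amazon Redshift system tables. The following is a minimal sketch; with automatic WLM, user queues typically appear as service classes 100 and above.

-- Inspect WLM queue (service class) configuration
SELECT service_class, name, num_query_tasks, query_working_mem
FROM stv_wlm_service_class_config
WHERE service_class >= 100;

-- See which queries are queued or running in each queue right now
SELECT service_class, query, state, queue_time, exec_time
FROM stv_wlm_query_state
ORDER BY service_class;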

Data distribution

The legacy data center architecture was unable to partition the data without deploying additional hardware in the data center, and it couldn’t handle read workloads efficiently.

The MPP architecture of Amazon Redshift distributes data efficiently across all the compute nodes, so heavy workloads are processed in parallel and response times are lower. In addition, Amazon Redshift separates compute and storage for efficient scaling and performance.
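
The following DDL is a hedged sketch, with hypothetical table and column names, of how a distribution key and sort key might be declared so that frequently joined rows are co-located on the same node and date-range filters prune blocks efficiently:

-- Hypothetical table illustrating distribution and sort key choices
CREATE TABLE portfolio_positions (
    account_id   BIGINT,
    security_id  BIGINT,
    as_of_date   DATE,
    quantity     DECIMAL(18,4),
    market_value DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (account_id)   -- co-locate rows that are frequently joined on account_id
SORTKEY (as_of_date);  -- prune blocks for date-range queries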

Operational efficiency and hygiene

Infrastructure maintenance and operational efficiency were concerns for the customer in their current state architecture. Amazon Redshift is a fully managed service that takes care of data warehouse management tasks such as hardware provisioning, software patching, setup, configuration, and monitoring nodes and drives to recover from failures or backups. Amazon Redshift periodically performs maintenance to apply fixes, enhancements, and new features to your Redshift data warehouse. As a result, the customer’s operational costs were reduced by 500%, and they are now able to spend more time innovating and building mission-critical applications.

Workload management

Amazon Redshift WLM was able to resolve issues with the legacy architecture, where longer-running queries consumed all the resources, causing other queries to run slower and impacting performance SLAs. With automatic WLM, the customer was able to create separate WLM queues with different priorities, which allowed them to manage the priorities for the critical SLA-bound workloads and other non-critical workloads. With short query acceleration (SQA) enabled, selected short-running queries were prioritized ahead of longer-running queries. Furthermore, the customer benefited from using query monitoring rules in WLM to apply performance boundaries that control poorly designed queries and take action when a query goes beyond those boundaries. To learn more about WLM, refer to Implementing workload management.
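
For example, queries that triggered a query monitoring rule action can be reviewed afterward from the STL_WLM_RULE_ACTION system log table. The following is a minimal sketch:

-- Show the most recent query monitoring rule actions (log, hop, or abort)
SELECT query, service_class, rule, action, recordtime
FROM stl_wlm_rule_action
ORDER BY recordtime DESC
LIMIT 20;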

Workload isolation

In the legacy architecture, all the workloads—extract, transform, and load (ETL); business intelligence (BI); and one-time workloads—were running on the same on-premises data warehouse, leading to the noisy neighbor problem and performance issues with the increase in users and workloads.

With the new solution architecture, this issue is remediated using data sharing in Amazon Redshift. With data sharing, the customer is able to share live data with security and ease across Redshift clusters, AWS accounts, or AWS Regions for read purposes, without the need to copy any data.

Data sharing improved the agility of the customer’s organization. It does this by giving them instant, granular, and high-performance access to data across Redshift clusters without the need to copy or move it manually. With data sharing, customers have live access to data, so their users can see the most up-to-date and consistent information as it’s updated in Redshift clusters. Data sharing provides workload isolation by running ETL workloads in its own Redshift cluster and sharing data with other BI and analytical workloads in their respective Redshift clusters.
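
The following is a hedged sketch of this pattern using hypothetical schema, share, and namespace names. The first block runs on the producer (ETL) cluster; the second block runs on a consumer (BI) cluster that reads the shared data live.

-- Producer: create a datashare and add the reporting schema to it
CREATE DATASHARE reporting_share;
ALTER DATASHARE reporting_share ADD SCHEMA reporting;
ALTER DATASHARE reporting_share ADD ALL TABLES IN SCHEMA reporting;
GRANT USAGE ON DATASHARE reporting_share
  TO NAMESPACE 'consumer-namespace-guid';  -- hypothetical consumer namespace

-- Consumer: expose the shared data as a local database and query it live
CREATE DATABASE reporting_shared
  FROM DATASHARE reporting_share
  OF NAMESPACE 'producer-namespace-guid';  -- hypothetical producer namespace

SELECT COUNT(*) FROM reporting_shared.reporting.daily_positions;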

Scalability

With the legacy architecture, the customer was facing scalability challenges during large events to handle unpredictable spiky workloads and over-provisioning of the database capacity. Using concurrency scaling and elastic resize allowed the customer to meet their scalability requirements and handle unpredictable and spiky workloads.
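
Concurrency scaling activity can be tracked over time from the SVCS_CONCURRENCY_SCALING_USAGE system view, which helps validate that spiky periods are being absorbed by the additional capacity. A minimal sketch:

-- Review concurrency scaling usage by time period
SELECT start_time, end_time, queries, usage_in_seconds
FROM svcs_concurrency_scaling_usage
ORDER BY start_time DESC;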

Data migration to Amazon Redshift

The customer used a home-grown process to extract the data from Actian Vectorwise as CSV files and store them in Amazon S3. The data was then ingested from Amazon S3 into Amazon Redshift.

The loading process used the COPY command to ingest the data from Amazon S3 quickly and efficiently. Using the COPY command is a best practice for loading data into Amazon Redshift: it’s the most efficient way to load a table because it uses the Amazon Redshift MPP architecture to read and load data in parallel from one or more files in an S3 bucket.
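
A minimal sketch of such a load, assuming a hypothetical S3 prefix, staging table, and IAM role, might look like the following:

-- Parallel load of CSV exports from Amazon S3 into a staging table (hedged sketch)
COPY staging.trades
FROM 's3://example-migration-bucket/actian-export/trades/'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto'
COMPUPDATE OFF
STATUPDATE OFF;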

To learn about the best practices for source data files to load using the COPY command, see Loading data files.

After the data is ingested into Redshift staging tables from Amazon S3, transformation jobs are run from Pentaho to apply the incremental changes to the final reporting tables.
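
The incremental apply itself is orchestrated by Pentaho; expressed directly in Redshift SQL, an equivalent upsert from the staging table into a reporting table could look like the following hedged sketch with hypothetical table and column names:

-- Apply incremental changes from staging to the final reporting table (hedged sketch)
MERGE INTO reporting.trades
USING staging.trades s
ON reporting.trades.trade_id = s.trade_id
WHEN MATCHED THEN
  UPDATE SET quantity = s.quantity, price = s.price, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (trade_id, quantity, price, updated_at)
  VALUES (s.trade_id, s.quantity, s.price, s.updated_at);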

The following diagram illustrates this workflow.

Key considerations for the migration

There are three ways of migrating an on-premises data warehouse to Amazon Redshift: one-step, two-step, and wave-based migration. To minimize the risk of migrating over 20 databases that vary in complexity, we decided on the wave-based approach. The fundamental concept behind wave-based migration involves dividing the migration program into projects based on factors such as complexity and business outcomes. The implementation then migrates each project individually or by combining certain projects into a wave. Subsequent waves follow, which may or may not be dependent on the results of the preceding wave.

This strategy requires both the legacy data warehouse and Amazon Redshift to operate concurrently until the migration and validation of all workloads are successfully complete. This provides a smooth transition while making sure the on-premises infrastructure can be retired only after thorough migration and validation have taken place.

In addition, within each wave, we followed a set of phases to make sure that each wave was successful:

  • Assess and plan
  • Design the Amazon Redshift environment
  • Migrate the data
  • Test and validate
  • Perform cutover and optimizations

In the process, we didn’t want to rewrite the legacy code for each migration wave. Because SQL compatibility was very important, given the existing knowledge within the organization and downstream application consumption, we migrated the data to Amazon Redshift with minimal code changes. After the data was ingested into the Redshift cluster, we adjusted the tables for best performance.

One of the main benefits we realized as part of the migration was the option to integrate data in Amazon Redshift with other business groups in the future that use AWS Data Exchange, without significant effort.

We performed blue/green deployments to make sure that the end-users didn’t encounter any latency degradation while retrieving the data. We migrated the end-users in a phased manner to measure the impact and adjust the cluster configuration as needed.

Results

The customer’s decision to use Amazon Redshift for their solution was further reinforced by the platform’s ability to handle both structured and semi-structured data seamlessly. Amazon Redshift allows the customer to efficiently analyze and derive valuable insights from their diverse range of datasets, including equities and institutional data, all while using standard SQL commands that teams are already comfortable with.

Through rigorous testing, Amazon Redshift consistently demonstrated remarkable performance, meeting the customer’s stringent SLAs and delivering subsecond query response times. With the AWS migration, the customer achieved a 5% improvement in query performance. Clusters could be scaled in minutes, compared to 6 months in the data center. Operational costs were reduced by 500% due to the simplicity of Redshift cluster operations on AWS. Stability of the clusters improved by 100%, and the upgrade and patching cycle time improved by 200%. Overall, the improvement in operational posture and total savings for the footprint resulted in significant savings for the team and the platform in general. In addition, the ability to scale the overall architecture based on market data trends in a resilient and highly available way not only met customer demand in terms of time to market, but also significantly reduced the operational costs and total cost of ownership.

Conclusion

In this post, we covered how a large financial services customer improved performance and scalability, and reduced their operational costs by migrating to Amazon Redshift. This enabled the customer to grow and onboard new workloads into Amazon Redshift for their business-critical applications.

To learn about other migration use cases, refer to the following:


About the Authors

Krishna Gogineni is a Principal Solutions Architect at AWS helping financial services customers. Krishna is a Cloud-Native Architecture evangelist who helps customers transform the way they build software. Krishna works with customers to learn their unique business goals, and then super-charges their ability to meet these goals through software delivery that leverages industry best practices and tools such as DevOps, data lakes, data analytics, microservices, containers, and continuous integration/continuous delivery.

Dayananda Shenoy is a Senior Solution Architect with over 20 years of experience designing and architecting backend services for financial services products. Currently, he leads the design and architecture of distributed, high-performance, low-latency analytics services for a data provider. He is passionate about solving scalability and performance challenges in distributed systems by leveraging emerging technologies that improve existing tech stacks and add value to the business to enhance customer experience.

Vishal Balani is a Sr. Customer Solutions Manager based out of New York. He works closely with financial services customers to help them leverage the cloud for business agility, innovation, and resiliency. He has extensive experience leading large-scale cloud migration programs. Outside of work he enjoys spending time with family, tinkering with a new project, or riding his bike.

Ranjan Burman is a Sr. PostgreSQL Database Specialist SA. He specializes in RDS & Aurora PostgreSQL. He has more than 18 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Muthuvelan Swaminathan is an Enterprise Solutions Architect based out of New York. He works with enterprise customers providing architectural guidance in building resilient, cost-effective and innovative solutions that address business needs.

Build a decentralized semantic search engine on heterogeneous data stores using autonomous agents

Post Syndicated from Dhaval Shah original https://aws.amazon.com/blogs/big-data/build-a-decentralized-semantic-search-engine-on-heterogeneous-data-stores-using-autonomous-agents/

Large language models (LLMs) such as Anthropic Claude and Amazon Titan have the potential to drive automation across various business processes by processing both structured and unstructured data. For example, financial analysts currently have to manually read and summarize lengthy regulatory filings and earnings transcripts in order to respond to Q&A on investment strategies. LLMs could automate the extraction and summarization of key information from these documents, enabling analysts to query the LLM and receive reliable summaries. This would allow analysts to process the documents to develop investment recommendations faster and more efficiently. Anthropic Claude and other LLMs on Amazon Bedrock can bring new levels of automation and insight across many business functions that involve both human expertise and access to knowledge spread across an organization’s databases and content repositories.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

In this post, we show how to build a Q&A bot with RAG (Retrieval Augmented Generation). RAG uses data sources like Amazon Redshift and Amazon OpenSearch Service to retrieve documents that augment the LLM prompt. To get data from Amazon Redshift, we use Anthropic Claude 2.0 on Amazon Bedrock, summarizing the final response based on predefined prompt template libraries from LangChain. To get data from Amazon OpenSearch Service, we chunk the source data and convert the chunks to vectors using the Amazon Titan Text Embeddings model.

For client interaction we use Agent Tools based on ReAct. A ReAct prompt consists of few-shot task-solving trajectories, with human-written text reasoning traces and actions, as well as environment observations in response to actions. In this example, we use ReAct for zero-shot training to generate responses to fit in a pre-defined template. The additional information is concatenated as context with the original input prompt and fed to the text generator which produces the final output. This makes RAG adaptive for situations where facts could evolve over time.

Solution overview

Our solution demonstrates how financial analysts can use generative artificial intelligence (AI) to adapt their investment recommendations based on financial reports and earnings transcripts, with RAG enabling LLMs to generate factual content.

The hybrid architecture uses multiple databases and LLMs, with foundation models from Amazon Bedrock for data source identification, SQL generation, and text generation with results. In the following architecture, Steps 1 and 2 represent data ingestion to be done by data engineering in batch mode. Steps 3, 4, and 5 are the queries and response formation.

The following diagram shows a more detailed view of the Q&A processing chain. The user asks a question, and LangChain queries the Redshift and OpenSearch Service data stores for relevant information to build the prompt. It sends the prompt to the Anthropic Claude on Amazon Bedrock model, and returns the response.

The details of each step are as follows:

  1. Populate the Amazon Redshift Serverless data warehouse with company stock information stored in Amazon Simple Storage Service (Amazon S3). Redshift Serverless is a fully functional data warehouse holding data tables maintained in real time.
  2. Load the unstructured data from your S3 data lake to OpenSearch Service to create an index to store and perform semantic search. The LangChain library loads knowledge base documents, splits the documents into smaller chunks, and uses Amazon Titan to generate embeddings for chunks.
  3. The client submits a question via an interface like a chatbot or website.
  4. The user query passed from the Amazon SageMaker notebook goes through multiple steps that make API calls to LLMs on Amazon Bedrock. An LLM-based agent generates SQL from the text and validates whether the query is relevant to the data warehouse tables; if it is, the query runs to extract the information. The LangChain library calls Amazon Titan embeddings to generate a vector for the user’s question and calls OpenSearch vector search to get similar documents.
  5. LangChain calls the Anthropic Claude on Amazon Bedrock model with the additional, retrieved knowledge as context to generate an answer for the question. It returns the generated content to the client.

In this deployment, you will choose Amazon Redshift Serverless and use the Anthropic Claude 2.0 model on Amazon Bedrock and the Amazon Titan Text Embeddings model. The overall spend for the deployment will be directly proportional to the number of input/output tokens for the Amazon Bedrock models, the knowledge base volume, usage hours, and so on.

To deploy the solution, you need two datasets: SEC Edgar Annual Financial Filings and Stock pricing data. To join these datasets for analysis, you need to choose Stock Symbol as the join key. The provided AWS CloudFormation template deploys the datasets required for this post, along with the SageMaker notebook.

Prerequisites

To follow along with this post, you should have an AWS account with AWS Identity and Access Management (IAM) user credentials to deploy AWS services.

Deploy the chat application using AWS CloudFormation

To deploy the resources, complete the following steps:

  1. Deploy the following CloudFormation template to create your stack in the us-east-1 AWS Region. The stack will deploy an OpenSearch Service domain, a Redshift Serverless endpoint, a SageMaker notebook, and other resources like a VPC and IAM roles that you will use in this post. The template sets a default user name and password for the OpenSearch Service domain, and sets up a Redshift Serverless admin. You can choose to modify them or use the default values.
  2. On the AWS CloudFormation console, navigate to the stack you created.
  3. On the Outputs tab, choose the URL for SageMakerNotebookURL to open the notebook.
  4. In Jupyter, choose semantic-search-with-amazon-opensearch, then blog, then the LLM-Based-Agent folder.
  5. Open the notebook Generative AI with LLM based autonomous agents augmented with structured and unstructured data.ipynb.
  6. Follow the instructions in the notebook and run the code sequentially.

Run the notebook

There are six major sections in the notebook:

  • Prepare the unstructured data in OpenSearch Service – Download the SEC Edgar Annual Financial Filings dataset and convert the company financial filing document into vectors with Amazon Titan Text Embeddings model and store the vector in an Amazon OpenSearch Service vector database.
  • Prepare the structured data in a Redshift database – Ingest the structured data into your Amazon Redshift Serverless table.
  • Query the unstructured data in OpenSearch Service with a vector search – Create a function to implement semantic search with OpenSearch Service. In OpenSearch Service, match the relevant company financial information to be used as context information to LLM. This is unstructured data augmentation to the LLM.
  • Query the structured data in Amazon Redshift with SQLDatabaseChain – Use the LangChain library LLM text to SQL to query company stock information stored in Amazon Redshift. The search result will be used as context information to the LLM.
  • Create an LLM-based ReAct agent augmented with data in OpenSearch Service and Amazon Redshift – Use the LangChain library to define a ReAct agent to judge whether the user query is stock- or investment-related. If the query is stock related, the agent will query the structured data in Amazon Redshift to get the stock symbol and stock price to augment context to the LLM. The agent also uses semantic search to retrieve relevant financial information from OpenSearch Service to augment context to the LLM.
  • Use the LLM-based agent to generate a final response based on the template used for zero-shot training – The following is a sample user flow for a stock price recommendation for the query, “Is ABC a good investment choice right now.”

Example questions and responses

In this section, we show three example questions and responses to test our chatbot.

Example 1: Historical data is available

In our first test, we explore how the bot responds to a question when historical data is available. We use the question, “Is [Company Name] a good investment choice right now?” Replace [Company Name] with a company you want to query.

This is a stock-related question. The company stock information is in Amazon Redshift and the financial statement information is in OpenSearch Service. The agent will run the following process:

  1. Determine if this is a stock-related question.
  2. Get the company name.
  3. Get the stock symbol from Amazon Redshift.
  4. Get the stock price from Amazon Redshift.
  5. Use semantic search to get related information from 10k financial filing data from OpenSearch Service.
response = zero_shot_agent("\n\nHuman: Is {company name} a good investment choice right now? \n\nAssistant:")

The output may look like the following:

Final Answer: Yes, {company name} appears to be a good investment choice right now based on the stable stock price, continued revenue and earnings growth, and dividend payments. I would recommend investing in {company name} stock at current levels.

You can view the final response from the complete chain in your notebook.

Example 2: Historical data is not available

In this next test, we see how the bot responds to a question when historical data is not available. We ask the question, “Is Amazon a good investment choice right now?”

This is a stock-related question. However, there is no Amazon stock price information in the Redshift table. Therefore, the bot will answer “I cannot provide stock analysis without stock price information.” The agent will run the following process:

  1. Determine if this is a stock-related question.
  2. Get the company name.
  3. Get the stock symbol from Amazon Redshift.
  4. Get the stock price from Amazon Redshift.
response = zero_shot_agent("\n\nHuman: Is Amazon a good investment choice right now? \n\nAssistant:")

The output looks like the following:

Final Answer: I cannot provide stock analysis without stock price information.

Example 3: Unrelated question and historical data is not available

For our third test, we see how the bot responds to an irrelevant question when historical data is not available. This is testing for hallucination. We use the question, “What is SageMaker?”

This is not a stock-related query. The agent will run the following process:

  1. Determine if this is a stock-related question.
response = zero_shot_agent("\n\nHuman: What is SageMaker? \n\nAssistant:")

The output looks like the following:

Final Answer: What is SageMaker? is not a stock related query.

This was a simple RAG-based ReAct chat agent analyzing the corpus from different data stores. In a realistic scenario, you might choose to further enhance the response with restrictions or guardrails for input and output, such as filtering harsh words for robust input sanitization, output filtering, conversational flow control, and more. You may also want to explore programmable guardrails for LLM-based conversational systems.

Clean up

To clean up your resources, delete the CloudFormation stack llm-based-agent.

Conclusion

In this post, you explored how LLMs play a part in answering user questions. You looked at a scenario for helping financial analysts. You could employ this methodology for other Q&A scenarios, like supporting insurance use cases, by quickly contextualizing claims data or customer interactions. You used a knowledge base of structured and unstructured data in a RAG approach, merging the data to create intelligent chatbots. You also learned how to use autonomous agents to help provide responses that are contextual and relevant to the customer data and limit irrelevant and inaccurate responses.

Leave your feedback and questions in the comments section.

References


About the Authors

Dhaval Shah is a Principal Solutions Architect with Amazon Web Services based out of New York, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 20 years of technology experience on Software Development and Architecture, Data Engineering, and IT Management.

Soujanya Konka is a Senior Solutions Architect and Analytics specialist at AWS, focused on helping customers build their ideas on the cloud. She has expertise in the design and implementation of data platforms. Before joining AWS, Soujanya worked at companies such as HSBC and Cognizant.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Jianwei Li is a Principal Analytics Specialist TAM at Amazon Web Services. Jianwei provides consulting services to help customers design and build modern data platforms. Jianwei has worked in the big data domain as a software developer, consultant, and tech leader.

Hrishikesh Karambelkar is a Principal Architect for Data and AIML with AWS Professional Services for Asia Pacific and Japan. He is proactively engaged with customers in the APJ region to enable enterprises in their digital transformation journey on the AWS Cloud in the areas of generative AI, machine learning, data, and analytics. Previously, Hrishikesh authored books on enterprise search and big data, and co-authored research publications in the areas of enterprise search and AI/ML.

Understanding Apache Iceberg on AWS with the new technical guide

Post Syndicated from Carlos Rodrigues original https://aws.amazon.com/blogs/big-data/understanding-apache-iceberg-on-aws-with-the-new-technical-guide/

We’re excited to announce the launch of the Apache Iceberg on AWS technical guide. Whether you are new to Apache Iceberg on AWS or already running production workloads on AWS, this comprehensive technical guide offers detailed guidance on foundational concepts to advanced optimizations to build your transactional data lake with Apache Iceberg on AWS.

Apache Iceberg is an open source table format that simplifies data processing on large datasets stored in data lakes. It does so by bringing the familiarity of SQL tables to big data and capabilities such as ACID transactions, row-level operations (merge, update, delete), partition evolution, data versioning, incremental processing, and advanced query scanning. Apache Iceberg seamlessly integrates with popular open source big data processing frameworks like Apache Spark, Apache Hive, Apache Flink, Presto, and Trino. It is natively supported by AWS analytics services such as AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift.

The following diagram illustrates a reference architecture of a transactional data lake with Apache Iceberg on AWS.

AWS customers and data engineers use the Apache Iceberg table format for its many benefits, as well as for its high performance and reliability at scale to build transactional data lakes and write-optimized solutions with Amazon EMR, AWS Glue, Athena, and Amazon Redshift on Amazon Simple Storage Service (Amazon S3).

We believe Apache Iceberg adoption on AWS will continue to grow rapidly, and you can benefit from this technical guide that delivers productive guidance on working with Apache Iceberg on supported AWS services, best practices on cost-optimization and performance, and effective monitoring and maintenance policies.

Related resources


About the Authors

Carlos Rodrigues is a Big Data Specialist Solutions Architect at AWS. He helps customers worldwide build transactional data lakes on AWS using open table formats like Apache Iceberg and Apache Hudi. He can be reached via LinkedIn.

Imtiaz (Taz) Sayed is the WW Tech Leader for Analytics at AWS. He is an expert on data engineering and enjoys engaging with the community on all things data and analytics. He can be reached via LinkedIn.

Shana Schipers is an Analytics Specialist Solutions Architect at AWS, focusing on big data. She supports customers worldwide in building transactional data lakes using open table formats like Apache Hudi, Apache Iceberg, and Delta Lake on AWS.

Breaking barriers in geospatial: Amazon Redshift, CARTO, and H3

Post Syndicated from Ravi Animi original https://aws.amazon.com/blogs/big-data/breaking-barriers-in-geospatial-amazon-redshift-carto-and-h3/

This post is co-written with Javier de la Torre from CARTO.

In this post, we discuss how Amazon Redshift spatial functions for the Hexagonal hierarchical geospatial indexing system (H3) can be used to represent spatial data with H3 indexing for fast spatial lookups at scale. Navigating the vast landscape of data-driven insights has always been an exciting endeavor. As technology continues to evolve, one specific facet of this journey is reaching unprecedented proportions: geospatial data. In our increasingly interconnected world, where every step we take, every location we visit, and every event we encounter leaves a digital footprint, the volume and complexity of geospatial data are expanding at an astonishing pace. From GPS-enabled smartphones to remote sensing satellites, the sources of geospatial information are multiplying, generating an immense gold mine of location-based insights.

However, visualizing and analyzing large-scale geospatial data presents a formidable challenge due to the sheer volume and intricacy of information. This often overwhelms traditional visualization tools and methods. The need to balance detail and context while maintaining real-time interactivity can lead to issues of scalability and rendering complexity.

Because of this, many organizations are turning to novel ways of approaching geospatial data, such as spatial indexes like H3.

Figure 1 – Map built with CARTO Builder and the native support to visualize H3 indexes

What are spatial indexes?

Spatial indexes are global grid systems that exist at multiple resolutions. But what makes them special? Traditionally, spatial data is represented through a geography or geometry in which features are geolocated on the earth by a long reference string describing the coordinates of every vertex. Unlike geometries, spatial indexes are georeferenced by a short ID string. This makes them far smaller to store and lightning fast to process! Because of this, many organizations are utilizing them as a support geography, aggregating their data to these grids to optimize both their storage and analysis.

Figure 2 shows some of the possible types of savings with spatial indexes. To learn more details about their benefits, see Introduction to Spatial Indexes.

Figure 2 – Comparison of performance between geometries and spatial indexes. Learn more about these differences in CARTO’s free ebook Spatial Indexes

Benefits of H3

One of the flagship examples of spatial indexes is H3, which is a hexagonal spatial index. Originally developed by Uber, it is now used far beyond the ridesharing industry. Unlike square-based grids, H3’s well-structured hexagons accurately represent intricate geographic features like rivers and roads, enabling precise depiction of nonperpendicular shapes. The hexagonal geometry excels at capturing gradual spatial changes and movement, and its consistent distance between one centroid and neighboring centroids eliminates outliers. This ensures robust data representation in all directions. Learn more about the benefits of using hexagons for location intelligence at Hexagons for Location Intelligence.

Figure 3 – H3: the relationships between different resolutions

H3 available now in Amazon Redshift

Given the immense benefits of H3 for spatial analysis, we’re very excited to announce the availability of H3 in Amazon Redshift. Seamlessly accessible through the powerful infrastructure of Amazon Redshift, H3 unlocks a new realm of possibilities for visualizing, analyzing, and deriving insights from geospatial data.

Amazon Redshift support for H3 offers an easy way to index spatial coordinates into a hexagonal grid, down to a square meter resolution. Indexed data can be quickly joined across different datasets and aggregated at different levels of precision. H3 enables several spatial algorithms and optimizations based on the hexagonal grid, including nearest neighbors, shortest path, gradient smoothing, and more. H3 indexes refer to cells that can be either hexagons or pentagons, and the space is subdivided hierarchically at a given resolution. H3 supports 16 resolutions from 0–15, inclusive, with 0 being the coarsest and 15 being the finest. H3 indexing and related H3 spatial functions are now available for Amazon Redshift spatial analytics.

Support for the three new H3 indexing functions, H3_FromLongLat, H3_FromPoint, and H3_PolyFill, is now available in all commercial AWS Regions. For more information or to get started with Amazon Redshift spatial analytics, see the documentation for querying spatial data, spatial functions, and the spatial tutorial.

Examples of H3 functions in Amazon Redshift:

To create or access the indexed values of the hexagonal tiles, you use the H3 indexing function that Amazon Redshift has released for the particular spatial GEOMETRY object you want to index: a polygon (a series of Cartesian X Y points that forms a closed 2D shape), a point (a single Cartesian X Y value), or a point expressed as a latitude, longitude pair. For example, if you already have a spatial polygon, you use the H3_PolyFill function to get the index values of the hexagonal tiles that cover the polygon. Imagine you have a polygon with the following Cartesian (X Y) coordinates:

(0 0, 0 1, 1 1, 1 0, 0 0), which is just a 1 x 1 unit square. You wrap these coordinates in POLYGON well-known text, convert them to a GEOMETRY object with the ST_GeomFromText function, and pass the result to H3_PolyFill. This is what you would call:

SELECT H3_Polyfill(ST_GeomFromText('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))'), 4);

The return values from this function are the index values of the individual hexagonal tiles that cover the 1 x 1 polygon. Of course, you could define arbitrary polygons of any shape just by using the vertices of the enclosing 2D polygon of GEOMETRY data type. The actual H3 tile index values, returned as an Amazon Redshift SUPER data type array for the preceding example, are:

h3_polyfill
_____________________________________________________________________
[596538848238895103,596538805289222143,596538856828829695,596538813879156735,596537920525959167,596538685030137855,596538693620072447,596538839648960511]
_____________________________________________________________________

So eight hexagonal tiles are returned when a resolution of 4 is used with the H3_PolyFill function.

Similarly, the following SQL returns the H3 cell ID from longitude 0, latitude 0, and resolution 10.

SELECT H3_FromLongLat(0, 0, 10);

 h3_fromlonglat
______________________________________________________________
 623560421467684863
______________________________________________________________

Similarly, the following SQL returns the H3 cell ID for the point (0, 0) at resolution 10.

SELECT H3_FromPoint(ST_GeomFromText('POINT(0 0)'), 10);

 h3_frompoint
_____________________________________________________________________________________
 623560421467684863
_____________________________________________________________________________________
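
Beyond computing individual cell IDs, the same functions let you join and aggregate datasets on H3 cells instead of running expensive geometric containment tests. The following is a hedged sketch with hypothetical tables: delivery_events holds point events, and delivery_zones_h3 is assumed to hold one row per H3 cell (at resolution 8) produced by flattening the H3_PolyFill output for each zone polygon.

-- Count point events per zone by matching on H3 cell IDs (hedged sketch)
SELECT z.zone_id,
       COUNT(*) AS events_in_zone
FROM delivery_events e
JOIN delivery_zones_h3 z
  ON H3_FromLongLat(e.longitude, e.latitude, 8) = z.h3_cell
GROUP BY z.zone_id
ORDER BY events_in_zone DESC;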

Data visualization and analysis made easy with H3 and CARTO

To illustrate how H3 can be used in action, let’s turn to CARTO. As an AWS Partner, CARTO offers a software solution on AWS Marketplace, the curated digital catalog, that seamlessly integrates distinctive capabilities for spatial visualization, analysis, and app development directly within the AWS data warehouse environment. What notably sets CARTO apart from certain GIS platforms is its strategy of optimizing queries by pushing analytical tasks and computations down into Amazon Redshift through the use of user-defined functions (UDFs).

Figure 4 – Basic workflow built with CARTO to polyfill a set of polygons into H3 indexes

Amazon Redshift comes equipped with a variety of preexisting spatial functions, and CARTO enhances this foundation by providing additional spatial functions within its Analytics Toolbox for Amazon Redshift, thereby expanding the range of analytical possibilities even further. Let’s dive into a use case to see how this can be used to solve an example spatial analysis problem.

Unveiling H3 spatial indexes in logistics

Logistics, particularly in last-mile delivery, harness substantial benefits from utilizing H3 spatial indexes in operational analytics. This framework has revolutionized geospatial analysis, particularly in efficiently managing extensive datasets.

H3 divides earth’s surface into varying-sized hexagons, precisely representing different geographic areas across multiple hierarchy levels. This precision allows detailed location representation at various scales, offering versatility in analyses and optimizations—from micro to macro, spanning neighborhoods to cities—efficiently managing vast datasets.

H3-based analytics empower the processing and understanding of delivery data patterns, such as peak times, popular destinations, and high-demand areas. This insight aids in predicting future demand and facilitates operations-related decisions. H3 can also help create location-based profiling features for predictive machine learning (ML) models such as risk-mitigation models. Further use cases can include adjustments to inventory, strategic placement of permanent or temporary distribution centers, or even refining pricing strategies to become more effective and adaptive.
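
As an illustration, delivery events could be aggregated into H3 cells to surface high-demand areas and peak hours. The following is a hedged sketch with a hypothetical deliveries table:

-- Aggregate deliveries into resolution-8 H3 cells by hour of day (hedged sketch)
SELECT H3_FromLongLat(longitude, latitude, 8) AS h3_cell,
       EXTRACT(hour FROM delivered_at)        AS hour_of_day,
       COUNT(*)                               AS deliveries
FROM deliveries
GROUP BY 1, 2
ORDER BY deliveries DESC
LIMIT 50;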

The uniform scalability and size consistency of H3 make it an ideal structure for organizing data, effectively replacing traditional zip codes in day-to-day operations.

In essence, insights derived from H3-based analytics empower businesses to make informed decisions, swiftly adapt to market changes, and elevate customer satisfaction through efficient deliveries.

The feature is eagerly anticipated by Amazon Redshift and CARTO customers. “The prospect of leveraging H3’s advanced spatial capabilities within the robust framework of Amazon Redshift has us excited about the new insights and efficiencies we can unlock for our geospatial analysis. This partnership truly aligns with our vision for smarter, data-driven decision-making,” says the Data Science Team at Aramex.

Figure 5 – Diagram illustrating the process of using H3-powered analytics for strategic decision-making

Figure 5 – Diagram illustrating the process of using H3-powered analytics for strategic decision-making

Let’s talk about your use case

You can experience the future of location intelligence firsthand by requesting a demo from CARTO today. Discover how H3’s hexagonal spatial index, seamlessly integrated with Amazon Redshift, can empower your organization with efficiency in handling large-scale geospatial data.

About Amazon Redshift

Thousands of customers rely on Amazon Redshift to analyze data from terabytes to petabytes and run complex analytical queries.

With Amazon Redshift, you can get real-time insights and predictive analytics on all of your data across your operational databases, data lake, data warehouse, and third-party datasets. It delivers this at a price performance that’s up to three times better than other cloud data warehouses out of the box, helping you keep your costs predictable.

Amazon Redshift provides capabilities like Amazon Redshift spatial analytics, Amazon Redshift streaming analytics, Amazon Redshift ML, and Amazon Redshift Serverless to further simplify application building and make it easier, simpler, and faster for independent software vendors (ISVs) to embed rich data analytics capabilities within their applications.

With Amazon Redshift Serverless, ISVs can run and scale analytics quickly without the need to set up and manage data warehouse infrastructure. Developers, data analysts, business professionals, and data scientists can go from data to insights in seconds by simply loading and querying data in the data warehouse.

To request a demo of Amazon Redshift, visit Amazon Redshift free trial or to get started on your own, visit Getting started with Amazon Redshift.

About CARTO

From smartphones to connected cars, location data is changing the way we live and the way we run businesses. Everything happens somewhere, but visualizing data to see where things are isn’t the same as understanding why they happen there. CARTO is the world’s leading cloud-based location intelligence platform, enabling organizations to use spatial data and analysis for more efficient delivery routes, better behavioral marketing, strategic store placements, and much more.

Data scientists, developers, and analysts use CARTO to optimize business processes and predict future outcomes through the power of spatial data science. To learn more, visit CARTO.


About the authors

Ravi Animi is a senior product leader in the Amazon Redshift team and manages several functional areas of the Amazon Redshift cloud data warehouse service, including spatial analytics, streaming analytics, query performance, Spark integration, and analytics business strategy. He has experience with relational databases, multidimensional databases, IoT technologies, storage and compute infrastructure services, and more recently, as a startup founder in the areas of artificial intelligence (AI) and deep learning, computer vision, and robotics.

Ioanna Tsalouchidou is a software development engineer in the Amazon Redshift team focusing on spatial analytics and query processing. She holds a PhD in graph algorithms from UPF Spain and a Masters in distributed systems and computing from KTH Sweden and UPC Spain.

Hinnerk Gildhoff is a senior engineering leader in the Amazon Redshift team leading query processing, spatial analytics, materialized views, autonomics, query languages and more. Prior to joining Amazon, Hinnerk spent over a decade as both an engineer and a manager in the field of in-memory and cluster computing, specializing in building databases and distributed systems.

Javier de la Torre is founder and Chief Strategy Officer of CARTO and has been instrumental in advancing the geospatial industry. At CARTO, he’s led innovations in location intelligence. He also serves on the Open Geospatial Consortium board, aiding in the development of standards like GeoParquet. Javier’s commitment extends to environmental causes through his work with Tierra Pura, focusing on climate change and conservation, demonstrating his dedication to using data for global betterment.

Achieve peak performance and boost scalability using multiple Amazon Redshift serverless workgroups and Network Load Balancer

Post Syndicated from Ricardo Serafim original https://aws.amazon.com/blogs/big-data/achieve-peak-performance-and-boost-scalability-using-multiple-amazon-redshift-serverless-workgroups-and-network-load-balancer/

As data analytics use cases grow, scalability and concurrency become crucial for businesses. Your analytics solution architecture should be able to handle large data volumes at high concurrency without compromising speed, thereby delivering a scalable, high-performance analytics environment.

Amazon Redshift Serverless provides a fully managed, petabyte-scale, auto scaling cloud data warehouse to support high-concurrency analytics. It offers data analysts, developers, and data scientists a fast, flexible analytic environment to gain insights from their data with optimal price-performance. Redshift Serverless auto scales during usage spikes, enabling enterprises to cost-effectively meet changing business demands. You can benefit from this simplicity without changing your existing analytics and business intelligence (BI) applications.

To help meet demanding performance needs like high concurrency, usage spikes, and fast query response times while optimizing costs, this post proposes using Redshift Serverless. The proposed solution aims to address three key performance requirements:

  • Support thousands of concurrent connections with high availability by using multiple Redshift Serverless endpoints behind a Network Load Balancer
  • Accommodate hundreds of concurrent queries with low-latency service level agreements through scalable and distributed workgroups
  • Enable subsecond response times for short queries against large datasets using the fast query processing of Amazon Redshift

The suggested architecture uses multiple Redshift Serverless endpoints accessed through a single Network Load Balancer client endpoint. The Network Load Balancer evenly distributes incoming requests across workgroups. This improves performance and reduces latency by scaling out resources to meet high throughput and low latency demands.

Solution overview

The following diagram outlines a Redshift Serverless architecture with multiple Amazon Redshift managed VPC endpoints behind a Network Load Balancer.

The following are the main components of this architecture:

  • Amazon Redshift data sharing – This allows you to securely share live data across Redshift clusters, workgroups, AWS accounts, and AWS Regions without manually moving or copying the data. Users can see up-to-date and consistent information in Amazon Redshift as soon as it’s updated. With Amazon Redshift data sharing, ingestion can be done at the producer or consumer endpoint, allowing the other consumer endpoints to read and write the same data, thereby enabling horizontal scaling.
  • Network Load Balancer – This serves as the single point of contact for clients. The load balancer distributes incoming traffic across multiple targets, such as Redshift Serverless managed VPC endpoints. This increases the availability, scalability, and performance of your application. You can add one or more listeners to your load balancer. A listener checks for connection requests from clients, using the protocol and port that you configure, and forwards requests to a target group. A target group routes requests to one or more registered targets, such as Redshift Serverless managed VPC endpoints, using the protocol and the port number that you specify.
  • VPC – Redshift Serverless is provisioned in a VPC. By creating a Redshift managed VPC endpoint, you enable private access to Redshift Serverless from applications in another VPC. This design allows you to scale by having multiple VPCs as needed. Each Redshift managed VPC endpoint provides a dedicated private IP address for its Redshift Serverless workgroup, which is registered as a target in the Network Load Balancer target group.

Create an Amazon Redshift managed VPC endpoint

Complete the following steps to create the Amazon Redshift managed VPC endpoint:

  1. On the Redshift Serverless console, choose Workgroup configuration in the navigation pane.
  2. Choose a workgroup from the list.
  3. On the Data access tab, in the Redshift managed VPC endpoints section, choose Create endpoint.
  4. Enter the endpoint name. Create a name that is meaningful for your organization.
  5. The AWS account ID is populated automatically with your 12-digit account ID.
  6. Choose a VPC where the endpoint will be created.
  7. Choose a subnet ID. In the most common use case, this is a subnet where you have a client that you want to connect to your Redshift Serverless instance.
  8. Choose which VPC security groups to add. Each security group acts as a virtual firewall that controls inbound and outbound traffic to the resources it protects, such as the Redshift Serverless managed VPC endpoint.

The following screenshot shows an example of this workgroup. Note down the IP address to use during the creation of the target group.

Repeat these steps to create all your Redshift Serverless workgroups.
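
If you prefer to script this step, the following minimal boto3 sketch creates one managed VPC endpoint per workgroup and then prints the private IP addresses to register later as Network Load Balancer targets. The workgroup names, subnet ID, and security group ID are placeholders for illustration.

import boto3

# Placeholder identifiers - replace with values from your environment.
WORKGROUPS = ["wg-analytics-1", "wg-analytics-2", "wg-analytics-3"]
SUBNET_ID = "subnet-0123456789abcdef0"
SECURITY_GROUP_ID = "sg-0123456789abcdef0"

rss = boto3.client("redshift-serverless")

for workgroup in WORKGROUPS:
    # Create one Redshift managed VPC endpoint per Redshift Serverless workgroup.
    rss.create_endpoint_access(
        endpointName=f"{workgroup}-nlb-endpoint",
        workgroupName=workgroup,
        subnetIds=[SUBNET_ID],
        vpcSecurityGroupIds=[SECURITY_GROUP_ID],
    )

for workgroup in WORKGROUPS:
    # Once an endpoint is ACTIVE, read back its private IP address; these are the
    # targets to register on the Network Load Balancer in the next section.
    endpoint = rss.get_endpoint_access(endpointName=f"{workgroup}-nlb-endpoint")["endpoint"]
    for eni in endpoint["vpcEndpoint"]["networkInterfaces"]:
        print(workgroup, eni["privateIpAddress"])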

Add VPC endpoints for the target group for the Network Load Balancer

To add these VPC endpoints to the target group for the Network Load Balancer on the Amazon Elastic Compute Cloud (Amazon EC2) console, complete the following steps (a scripted sketch follows the list):

  1. On the Amazon EC2 console, choose Target groups under Load Balancing in the navigation pane.
  2. Choose Create target group.
  3. For Choose a target type, select IP addresses, because the Redshift Serverless managed VPC endpoints are registered by their private IP addresses.
  4. For Target group name, enter a name for the target group.
  5. For Protocol, choose TCP (Amazon Redshift connections use TCP).
  6. For Port, use 5439 (Amazon Redshift port).
  7. For IP address type, choose IPv4 or IPv6. This option is available only if the target type is Instances or IP addresses and the protocol is TCP or TLS. Note that an IPv6 target group must be associated with a dual-stack load balancer, all targets in the target group must have the same IP address type, and you can’t change the IP address type of a target group after you create it.
  8. For VPC, choose the VPC with the targets to register.
  9. Leave the default selections for the Health checks section, Attributes section, and Tags section.
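
The following is a minimal boto3 sketch of the same configuration: it creates a TCP target group on port 5439 with the IP addresses target type and registers the managed VPC endpoint IP addresses noted earlier. The VPC ID and IP addresses are placeholders for illustration.

import boto3

# Placeholders - the VPC of the targets and the private IPs of the managed VPC endpoints.
VPC_ID = "vpc-0123456789abcdef0"
ENDPOINT_IPS = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]

elbv2 = boto3.client("elbv2")

# Target type "ip" so the managed VPC endpoint IP addresses can be registered directly.
target_group = elbv2.create_target_group(
    Name="redshift-serverless-tg",
    Protocol="TCP",
    Port=5439,
    VpcId=VPC_ID,
    TargetType="ip",
)["TargetGroups"][0]

elbv2.register_targets(
    TargetGroupArn=target_group["TargetGroupArn"],
    Targets=[{"Id": ip, "Port": 5439} for ip in ENDPOINT_IPS],
)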

Create a load balancer

After you create the target group, you can create your load balancer. We recommend using port 5439 (the Amazon Redshift default port) for its listener.

The Network Load Balancer serves as a single access endpoint and is used for connections to reach Amazon Redshift. This allows you to add more Redshift Serverless workgroups and increase concurrency transparently.
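
If you script this step, a minimal boto3 sketch looks like the following: it creates an internal Network Load Balancer and a TCP listener on port 5439 that forwards to the target group created earlier. The subnet IDs and target group ARN are placeholders for illustration.

import boto3

# Placeholders - subnets for the load balancer and the ARN of the target group created earlier.
SUBNET_IDS = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/redshift-serverless-tg/0123456789abcdef"

elbv2 = boto3.client("elbv2")

# Internal Network Load Balancer that clients use as the single Amazon Redshift endpoint.
nlb = elbv2.create_load_balancer(
    Name="redshift-serverless-nlb",
    Type="network",
    Scheme="internal",
    Subnets=SUBNET_IDS,
)["LoadBalancers"][0]

# TCP listener on the Amazon Redshift port that forwards to the target group of endpoint IPs.
elbv2.create_listener(
    LoadBalancerArn=nlb["LoadBalancerArn"],
    Protocol="TCP",
    Port=5439,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": TARGET_GROUP_ARN}],
)

print("Point SQL clients at:", nlb["DNSName"])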

Testing the solution

We tested this architecture by running three BI reports with the TPC-DS dataset (a decision support benchmark dataset) as our data. Amazon Redshift includes this dataset for free when you choose to load sample data (the sample_data_dev database). The installation also provides the queries to test the setup.

Among all the queries in the TPC-DS benchmark, we chose the following three to use as our report queries. We changed the first two report queries to use a CREATE TABLE AS SELECT (CTAS) query on temporary tables instead of the WITH clause to emulate the query patterns a typical BI tool generates. For our testing, we also disabled the result cache to make sure that Amazon Redshift would run the queries every time.

Each set of queries creates temporary tables, joins those tables, and then cleans up. The cleanup step drops the temporary tables; this isn’t strictly necessary, because they’re deleted at the end of the session, but it simulates everything the BI tool does.
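
The following sketch illustrates that pattern (these are not the actual TPC-DS report queries): it connects through the Network Load Balancer DNS name, disables the result cache for the session, creates a temporary table with CTAS, joins it to a dimension table, and drops it at the end. The connection details are placeholders, and the table and column names assume the tpcds schema in the sample_data_dev database.

import redshift_connector

# Connect through the Network Load Balancer DNS name (placeholder host and credentials).
conn = redshift_connector.connect(
    host="redshift-serverless-nlb-0123456789abcdef.elb.us-east-1.amazonaws.com",
    port=5439,
    database="sample_data_dev",
    user="awsuser",
    password="REPLACE_ME",
)
cursor = conn.cursor()

# Force Amazon Redshift to run the query every time instead of serving cached results.
cursor.execute("SET enable_result_cache_for_session TO off;")

# CTAS into a temporary table, emulating what a BI tool might generate instead of a WITH clause.
cursor.execute("""
    CREATE TEMPORARY TABLE tmp_store_sales_agg AS
    SELECT ss_store_sk, SUM(ss_net_paid) AS total_net_paid
    FROM tpcds.store_sales
    GROUP BY ss_store_sk;
""")

# Join the temporary table back to a dimension table, as a report query would.
cursor.execute("""
    SELECT s.s_store_name, t.total_net_paid
    FROM tmp_store_sales_agg t
    JOIN tpcds.store s ON s.s_store_sk = t.ss_store_sk
    ORDER BY t.total_net_paid DESC
    LIMIT 10;
""")
print(cursor.fetchall())

# Cleanup: not strictly required (temp tables vanish at session end), but mirrors the BI tool.
cursor.execute("DROP TABLE tmp_store_sales_agg;")
conn.close()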

We used Apache JMeter to simulate clients invoking the requests. To learn more about how to use and configure Apache JMeter with Amazon Redshift, refer to Building high-quality benchmark tests for Amazon Redshift using Apache JMeter.

For the tests, we used the following configurations:

  • Test 1 – A single Redshift Serverless workgroup at 96 RPUs vs. three workgroups at 32 RPUs each
  • Test 2 – A single Redshift Serverless workgroup at 48 RPUs vs. three workgroups at 16 RPUs each

We tested three reports by spawning 100 sessions per report (300 total). There were 14 statements across the three reports (4,200 total). All sessions were triggered simultaneously.
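
Apache JMeter drove the actual benchmark; purely as an illustration of the test shape, the following sketch fans out 300 concurrent sessions with Python threads, each opening its own connection through the Network Load Balancer and running one report. The host, credentials, and report statements are placeholders.

from concurrent.futures import ThreadPoolExecutor

import redshift_connector

NLB_HOST = "redshift-serverless-nlb-0123456789abcdef.elb.us-east-1.amazonaws.com"  # placeholder
SESSIONS_PER_REPORT = 100

# Placeholder report statements - the real reports together contained 14 statements
# (CTAS on temporary tables, joins, and cleanup).
REPORTS = {
    "report_1": ["SET enable_result_cache_for_session TO off;", "SELECT 1;"],
    "report_2": ["SET enable_result_cache_for_session TO off;", "SELECT 1;"],
    "report_3": ["SET enable_result_cache_for_session TO off;", "SELECT 1;"],
}

def run_report(statements):
    # Each session opens its own connection so the Network Load Balancer can spread
    # the sessions across the registered Redshift Serverless workgroups.
    conn = redshift_connector.connect(
        host=NLB_HOST, port=5439, database="sample_data_dev",
        user="awsuser", password="REPLACE_ME",
    )
    cursor = conn.cursor()
    for sql in statements:
        cursor.execute(sql)
    conn.close()

# Trigger all 300 sessions at roughly the same time.
with ThreadPoolExecutor(max_workers=300) as pool:
    futures = [
        pool.submit(run_report, statements)
        for statements in REPORTS.values()
        for _ in range(SESSIONS_PER_REPORT)
    ]
    for future in futures:
        future.result()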

The following table summarizes the tables used in the test.

Table Name | Row Count
Catalog_page | 93,744
Catalog_sales | 23,064,768
Customer_address | 50,000
Customer | 100,000
Date_dim | 73,049
Item | 144,000
Promotion | 2,400
Store_returns | 4,600,224
Store_sales | 46,086,464
Store | 96
Web_returns | 1,148,208
Web_sales | 11,510,144
Web_site | 240

Some tables were modified by ingesting more data than the TPC-DS sample on Amazon Redshift provides; data was reinserted into these tables to increase their size.

Test results

The following table summarizes our test results.

TEST 1 | Time Consumed | Number of Queries | Cost | Max Scaled RPU | Performance
Single: 96 RPUs | 0:02:06 | 2,100 | $6 | 279 | Base
Parallel: 3x 32 RPUs | 0:01:06 | 2,100 | $1.20 | 96 | 48.03%
Parallel 1 (32 RPU) | 0:01:03 | 688 | $0.40 | 32 | 50.10%
Parallel 2 (32 RPU) | 0:01:03 | 703 | $0.40 | 32 | 50.13%
Parallel 3 (32 RPU) | 0:01:06 | 709 | $0.40 | 32 | 48.03%
TEST 2 | Time Consumed | Number of Queries | Cost | Max Scaled RPU | Performance
Single: 48 RPUs | 0:01:55 | 2,100 | $3.30 | 168 | Base
Parallel: 3x 16 RPUs | 0:01:47 | 2,100 | $1.90 | 96 | 6.77%
Parallel 1 (16 RPU) | 0:01:47 | 712 | $0.70 | 36 | 6.77%
Parallel 2 (16 RPU) | 0:01:44 | 696 | $0.50 | 25 | 9.13%
Parallel 3 (16 RPU) | 0:01:46 | 692 | $0.70 | 35 | 7.79%

The preceding table shows that the parallel setup was faster than the single-workgroup setup at a lower cost. Also, in our tests, even though the parallel setup in Test 1 had double the capacity of the parallel setup in Test 2, its cost was still 36% lower and it ran 39% faster. Based on these results, we can conclude that for workloads with high throughput (I/O), low latency, and high concurrency requirements, this architecture is cost-efficient and performant. Refer to the AWS Pricing Calculator for Network Load Balancer and VPC endpoint pricing.

Redshift Serverless automatically scales capacity to deliver optimal performance during periods of peak workload, including spikes in concurrency. This is evident from the Max Scaled RPU results in the preceding table.

Recently released features of Redshift Serverless such as MaxRPU and AI-driven scaling were not used for this test. These new features can increase the price-performance of the workload even further.

We recommend enabling cross-zone load balancing on the Network Load Balancer so that it distributes client requests across the registered Redshift Serverless managed VPC endpoints irrespective of the Availability Zone they are configured in. Also, if the Network Load Balancer receives traffic from only one source (the same client IP), you should use an odd number of Redshift Serverless managed VPC endpoints behind the Network Load Balancer.
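
Cross-zone load balancing is controlled by a load balancer attribute. The following minimal boto3 sketch enables it; the load balancer ARN is a placeholder.

import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN of the Network Load Balancer created earlier.
NLB_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/redshift-serverless-nlb/0123456789abcdef"

# Enable cross-zone load balancing so requests are spread across targets in all Availability Zones.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=NLB_ARN,
    Attributes=[{"Key": "load_balancing.cross_zone.enabled", "Value": "true"}],
)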

Conclusion

In this post, we discussed a scalable architecture that increases the throughput of Redshift Serverless in low-latency, high-concurrency scenarios. Having multiple Redshift Serverless workgroups behind a Network Load Balancer can deliver a horizontally scalable solution at the best price-performance.

Additionally, Redshift Serverless uses AI techniques (currently in preview) to scale automatically with workload changes across all key dimensions—such as data volume changes, concurrent users, and query complexity—to meet and maintain your price-performance targets.

We hope this post provides you with valuable guidance. We welcome any thoughts or questions in the comments section.


About the Authors

Ricardo Serafim is a Senior Analytics Specialist Solutions Architect at AWS.

Harshida Patel is an Analytics Specialist Principal Solutions Architect with AWS.

Urvish Shah is a Senior Database Engineer at Amazon Redshift. He has more than a decade of experience working on databases, data warehousing, and the analytics space. Outside of work, he enjoys cooking, travelling, and spending time with his daughter.

Amol Gaikaiwari is a Sr. Redshift Specialist focused on helping customers realize their business outcomes with optimal Redshift price-performance. He loves to simplify data pipelines and enhance capabilities through the adoption of the latest Redshift features.