All posts by Jyoti Aggarwal

Announcing data filtering for Amazon Aurora MySQL zero-ETL integration with Amazon Redshift

2024-03-20 Jyoti Aggarwal

Post Syndicated from Jyoti Aggarwal original https://aws.amazon.com/blogs/big-data/announcing-data-filtering-for-amazon-aurora-mysql-zero-etl-integration-with-amazon-redshift/

As your organization becomes more data driven and uses data as a source of competitive advantage, you’ll want to run analytics on your data to better understand your core business drivers to grow sales, reduce costs, and optimize your business. To run analytics on your operational data, you might build a solution that is a combination of a database, a data warehouse, and an extract, transform, and load (ETL) pipeline. ETL is the process data engineers use to combine data from different sources.

To reduce the effort involved in building and maintaining ETL pipelines between transactional databases and data warehouses, AWS announced Amazon Aurora zero-ETL integration with Amazon Redshift at AWS re:Invent 2022. The integration is now generally available (GA) for Amazon Aurora MySQL-Compatible Edition 3.05.0.

AWS is now announcing data filtering on zero-ETL integrations between Amazon Aurora MySQL and Amazon Redshift, enabling you to bring in only selected data from the database instance. This feature allows you to select individual databases and tables to be replicated to your Redshift data warehouse for analytics use cases.

In this post, we provide an overview of use cases where you can use this feature, and provide step-by-step guidance on how to get started with near real-time operational analytics using this feature.

Data filtering use cases

Data filtering allows you to choose the databases and tables to be replicated from Amazon Aurora MySQL to Amazon Redshift. You can apply multiple filters to the zero-ETL integration, allowing you to tailor the replication to your specific needs. Data filtering applies either an exclude or include filter rule, and can use regular expressions to match multiple databases and tables.

In this section, we discuss some common use cases for data filtering.

Improve data security by excluding tables containing PII data from replication

Operational databases often contain personally identifiable information (PII). This is information that is sensitive in nature, and can include information such as mailing addresses, customer verification documentation, or credit card information.

Due to strict security compliance regulations, you may not want to use PII for your analytics use cases. Data filtering allows you to filter out databases or tables containing PII data, excluding them from replication to Amazon Redshift. This improves data security and compliance for analytics workloads.

Save on storage costs and manage analytics workloads by replicating tables required for specific use cases

Operational databases often contain many different datasets that aren’t useful for analytics. This includes supplementary data, specific application data, and multiple copies of the same dataset for different applications.

Moreover, it’s common to build different use cases on different Redshift warehouses. This architecture requires different datasets to be available in individual endpoints.

Data filtering allows you to only replicate the datasets that are required for your use cases. This can save costs by eliminating the need to store data that is not being used.

You can also modify existing zero-ETL integrations to apply more restrictive data replication where desired. If you add a data filter to an existing integration, Aurora will fully reevaluate the data being replicated with the new filter. This will remove the newly filtered data from the target Redshift endpoint.

For more information about quotas for Aurora zero-ETL integrations with Amazon Redshift, refer to Quotas.

Start with small data replication and incrementally add tables as required

As more analytics use cases are developed on Amazon Redshift, you may want to add more tables to an individual zero-ETL replication. Rather than replicating all tables to Amazon Redshift to satisfy the chance that they may be used in the future, data filtering allows you to start small with a subset of tables from your Aurora database and incrementally add more tables to the filter as they’re required.

After a data filter on a zero-ETL integration is updated, Aurora will fully reevaluate the entire filter as if the previous filter didn’t exist, so workloads using previously replicated tables aren’t impacted by the addition of new tables.

Improve individual workload performance by load balancing replication processes

For large transactional databases, you may need to spread replication and any downstream processing across multiple Redshift clusters. This reduces the compute required by any individual Redshift endpoint and lets you split workloads across endpoints. By load balancing workloads across multiple Redshift endpoints, you can effectively create a data mesh architecture, where endpoints are appropriately sized for individual workloads. This can improve performance and lower overall cost.

Data filtering allows you to replicate different databases and tables to separate Redshift endpoints.

The following figure shows how you could use data filters on zero-ETL integrations to split different databases in Aurora to separate Redshift endpoints.

Example use case

Consider the TICKIT database. The TICKIT sample database contains data from a fictional company where users can buy and sell tickets for various events. The company’s business analysts want to use the data that is stored in their Aurora MySQL database to generate various metrics, and would like to perform this analysis in near real time. For this reason, the company has identified zero-ETL as a potential solution.

While investigating the datasets required, the company’s analysts noted that the users table contains personal information about their customers that is not useful for their analytics requirements. Therefore, they want to replicate all data except the users table and will use zero-ETL’s data filtering to do so.

Setup

Start by following the steps in Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift to create a new Aurora MySQL database, Amazon Redshift Serverless endpoint, and zero-ETL integration. Then open the Redshift query editor v2 and run the following query to show that data from the users table has been replicated successfully:

```
select * from aurora_zeroetl.demodb.users;
```

Data filters

Data filters are applied directly to the zero-ETL integration on Amazon Relational Database Service (Amazon RDS). You can define multiple filters for a single integration, and each filter is defined as either an Include or Exclude filter type. Data filters apply a pattern to existing and future database tables to determine which filter should be applied.

Apply a data filter

To apply a filter to remove the users table from the zero-ETL integration, complete the following steps:

On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
Choose the zero-ETL integration to add a filter to.

The default filter is to include all databases and tables represented by an include:*.* filter.

Choose Modify.
Choose Add filter in the Source section.
For Choose filter type, choose Exclude.
For Filter expression, enter the expression demodb.users.

Filter expression order matters. Filters are evaluated left to right, top to bottom, and subsequent filters will override previous filters. In this example, Aurora will evaluate that every table should be included (filter 1) and then evaluate that the demodb.users table should be excluded (filter 2). The exclusion filter therefore overrides the inclusion because it’s after the inclusion filter.

Choose Continue.
Review the changes, making sure that the order of the filters is correct, and choose Save changes.

The integration will be updated and will remain in a Modifying state until the changes have been applied. This can take up to 30 minutes. To check if the changes have finished applying, choose the zero-ETL integration and check its status. When it shows as Active, the changes have been applied.
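To make the ordering rule concrete, the following Python snippet is a minimal sketch of "later filters override earlier ones". It is an illustration only, not Aurora's implementation: the function name `is_replicated` and the use of Python's `fnmatchcase` for the `*` wildcard are our assumptions (note that `fnmatchcase` also treats `?` and `[...]` as special characters), and the `sales` table is used purely as an example of a table other than `users`.

```python
from fnmatch import fnmatchcase


def is_replicated(filters, database, table):
    """Return True if database.table would be replicated under `filters`.

    `filters` is an ordered list of (filter_type, expression) pairs, where
    filter_type is "include" or "exclude" and expression has the form
    "database.table" with "*" as a wildcard. Filters are checked in order,
    and the last matching filter decides the outcome.
    """
    # Assumption for this sketch: nothing is replicated unless a filter includes it.
    decision = False
    for filter_type, expression in filters:
        db_pattern, table_pattern = expression.split(".", 1)
        if fnmatchcase(database, db_pattern) and fnmatchcase(table, table_pattern):
            decision = filter_type == "include"
    return decision


# The two filters configured in the console steps above.
filters = [("include", "*.*"), ("exclude", "demodb.users")]

print(is_replicated(filters, "demodb", "users"))  # False
print(is_replicated(filters, "demodb", "sales"))  # True
```

Swapping the two filters in this sketch would replicate `demodb.users` again, because the broad include would then be evaluated last.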

Verify the change

To verify the zero-ETL integration has been updated, complete the following steps:

In the Redshift query editor v2, connect to your Redshift cluster.
Choose (right-click) the aurora_zeroetl database you created and choose Refresh.
Expand demodb and Tables.

The users table is no longer available because it has been removed from the replication. All other tables are still available.

If you run the same SELECT statement from earlier, you will receive an error stating the object does not exist in the database:
```
select * from aurora_zeroetl.demodb.users;
```

Apply a data filter using the AWS CLI

The company’s business analysts now understand that more databases are being added to the Aurora MySQL database and they want to ensure only the demodb database is replicated to their Redshift cluster. To this end, they want to update the filters on the zero-ETL integration with the AWS Command Line Interface (AWS CLI).

To add data filters to a zero-ETL integration using the AWS CLI, you can call the modify-integration command. In addition to the integration identifier, specify the --data-filter parameter with a comma-separated list of include and exclude filters.

Complete the following steps to alter the filter on the zero-ETL integration:

Open a terminal with the AWS CLI installed.
Enter the following command to list all available integrations:
```
aws rds describe-integrations
```
Find the integration you want to update and copy the integration identifier.

The integration identifier is an alphanumeric string at the end of the integration ARN.

Run the following command, updating <integration identifier> with the identifier copied from the previous step:

```
aws rds modify-integration --integration-identifier "<integration identifier>" --data-filter 'exclude: *.*, include: demodb.*, exclude: demodb.users'
```

When Aurora is assessing this filter, it will exclude everything by default, then only include the demodb database, but exclude the demodb.users table.
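If you script this step, it can help to assemble the filter string from a list instead of typing it by hand. The following Python sketch does that using only the command and options shown above. The helper names and the sample ARN are hypothetical placeholders, and the ARN parsing relies solely on the statement that the identifier is the last part of the integration ARN.

```python
import shlex


def integration_id_from_arn(arn):
    """Return the final colon-separated segment of an integration ARN."""
    return arn.rsplit(":", 1)[-1]


def build_data_filter(filters):
    """Join ordered (filter_type, expression) pairs into a --data-filter value."""
    return ", ".join(f"{filter_type}: {expression}" for filter_type, expression in filters)


def build_modify_command(integration_id, filters):
    """Build the argument list for `aws rds modify-integration`."""
    return [
        "aws", "rds", "modify-integration",
        "--integration-identifier", integration_id,
        "--data-filter", build_data_filter(filters),
    ]


# Made-up placeholder ARN; use the one returned by `aws rds describe-integrations`.
example_arn = "arn:aws:rds:us-east-1:123456789012:integration:EXAMPLE-ID"

filters = [
    ("exclude", "*.*"),
    ("include", "demodb.*"),
    ("exclude", "demodb.users"),
]

print(build_data_filter(filters))
# exclude: *.*, include: demodb.*, exclude: demodb.users

command = build_modify_command(integration_id_from_arn(example_arn), filters)
print(shlex.join(command))  # shell-quoted form, ready to review before running
```

Keeping the filters in an ordered list makes the evaluation order explicit, which matters because later filters override earlier ones.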

Data filters can use regular expressions for database and table names. For example, if you want to filter out any tables starting with user, you can run the following:

```
aws rds modify-integration --integration-identifier "<integration identifier>" --data-filter 'exclude: *.*, include: demodb.*, exclude: *./^user/'
```
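To see which table names a pattern such as `/^user/` would catch, you can try it locally first. The following Python sketch is an approximation under two assumptions of ours: that a part wrapped in forward slashes is a regular expression searched within the name, and that Python's `re` module behaves like the engine Aurora uses for this simple pattern. The function name `part_matches` and the sample table names are hypothetical.

```python
import re


def part_matches(pattern, name):
    """Match one database or table name against one part of a filter expression."""
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        # Assumption: /.../ delimits a regular expression.
        return re.search(pattern[1:-1], name) is not None
    if pattern == "*":
        # A bare asterisk matches every name.
        return True
    # Anything else is treated as a literal name.
    return pattern == name


for table in ["users", "user_roles", "sales", "superuser"]:
    print(table, part_matches("/^user/", table))
# users True
# user_roles True
# sales False
# superuser False
```

Because the pattern is anchored with `^`, `superuser` is not matched even though it contains the word user.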

As with the previous filter change, the integration will be updated and will remain in a Modifying state until the changes have been applied. This can take up to 30 minutes. When it shows as Active, the changes have been applied.
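If you automate filter changes, you may want to wait for the integration to return to Active before running queries. The following Python sketch shows one way to poll for that. The `get_status` callable is a hypothetical hook that you would implement yourself (for example, by reading the status from the output of `aws rds describe-integrations`), and the 30-minute default timeout simply mirrors the upper bound mentioned above.

```python
import time


def wait_until_active(get_status, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until it reports Active or the timeout elapses.

    Returns True if the integration became active, False on timeout.
    The comparison is case-insensitive because the exact casing of the
    status string is an assumption here.
    """
    waited = 0
    while True:
        if get_status().lower() == "active":
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with a fake status source and a no-op sleep, so it runs instantly.
statuses = iter(["Modifying", "Modifying", "Active"])
print(wait_until_active(lambda: next(statuses), sleep=lambda seconds: None))  # True
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.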

Clean up

To remove the filter added to the zero-ETL integration, complete the following steps:

On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
Choose your zero-ETL integration.
Choose Modify.
Choose Remove next to the filters you want to remove.
You can also change the Exclude filter type to Include.
Choose Continue.
Choose Save changes.

Alternatively, you can use the AWS CLI to run the following:

```
aws rds modify-integration --integration-identifier "<integration identifier>" --data-filter 'include: *.*'
```

The changes can take up to 30 minutes to apply. After you remove data filters, Aurora reevaluates the remaining filters as if the removed filter had never existed. Any data that previously didn’t match the filtering criteria but now does is replicated into the target Redshift data warehouse.

Conclusion

In this post, we showed you how to set up data filtering on your Aurora zero-ETL integration from Amazon Aurora MySQL to Amazon Redshift. This allows you to enable near real-time analytics on transactional and operational data while replicating only the data required.

With data filtering, you can split workloads into separate Redshift endpoints, limit the replication of private or confidential datasets, and increase performance of workloads by only replicating required datasets.

To learn more about Aurora zero-ETL integration with Amazon Redshift, see Working with Aurora zero-ETL integrations with Amazon Redshift and Working with zero-ETL integrations.

About the authors

Jyoti Aggarwal is a Product Management Lead for AWS zero-ETL. She leads the product and business strategy, including driving initiatives around performance, customer experience, and security. She brings along an expertise in cloud compute, data pipelines, analytics, artificial intelligence (AI), and data services including databases, data warehouses and data lakes.

Sean Beath is an Analytics Solutions Architect at Amazon Web Services. He has experience in the full delivery lifecycle of data platform modernisation using AWS services, and works with customers to help drive analytics value on AWS.

Gokul Soundararajan is a principal engineer at AWS. He received a PhD from the University of Toronto and has been working in the areas of storage, databases, and analytics.

Announcing zero-ETL integrations with AWS Databases and Amazon Redshift

2023-11-28 Jyoti Aggarwal

Post Syndicated from Jyoti Aggarwal original https://aws.amazon.com/blogs/big-data/announcing-zero-etl-integrations-with-aws-databases-and-amazon-redshift/

As customers become more data driven and use data as a source of competitive advantage, they want to easily run analytics on their data to better understand their core business drivers to grow sales, reduce costs, and optimize their businesses. To run analytics on their operational data, customers often build solutions that are a combination of a database, a data warehouse, and an extract, transform, and load (ETL) pipeline. ETL is the process data engineers use to combine data from different sources.

Through customer feedback, we learned that a lot of undifferentiated time and resources go towards building and managing ETL pipelines between transactional databases and data warehouses. At Amazon Web Services (AWS), our goal is to make it easier for our customers to connect to and use all of their data and to do it with the speed and agility they need. We think that by automating the undifferentiated parts, we can help our customers increase the pace of their data-driven innovation by breaking down data silos and simplifying data integration.

Bringing operational data closer to analytics workflows

Customers want flexible data architectures that let them integrate data across their organization to give them a better picture of their customers, streamline operations, and help teams make better, faster decisions. But integrating data isn’t easy. Today, building these pipelines and assembling the architecture to interconnect all the data sources and optimize analytics results is complex, requires highly skilled resources, and can produce data that is erroneous or inconsistent.

Amazon Redshift powers data driven decisions for tens of thousands of customers every day with a fully managed, artificial intelligence (AI)-powered cloud data warehouse that delivers the best price-performance for your analytics workloads.

Zero-ETL is a set of integrations that eliminates the need to build ETL data pipelines. Zero-ETL integrations with Amazon Redshift enable customers to access their data in place using federated queries or ingest it into Amazon Redshift with a fully managed solution from across their databases. With newer features, such as support for autocopy that simplifies and automates file ingestion from Amazon Simple Storage Service (Amazon S3), Redshift Streaming Ingestion capabilities to continuously ingest any amount of streaming data directly into the warehouse, and multi-cluster data sharing architectures that minimize data movement and even provide access to third-party data, Amazon Redshift enables data integration and quick access to data without building manual pipelines.

With all the data integrated and available, Amazon Redshift empowers every data user to run analytics and build AI, machine learning (ML), and generative AI applications. Developers can run Apache Spark applications directly on the data in their warehouse from AWS analytics services, such as Amazon EMR and AWS Glue. They can enrich their datasets by joining operational data replicated through zero-ETL integrations with other sources such as sales and marketing data from SaaS applications and can even create Amazon QuickSight dashboards on top of this data to track key metrics across sales, website analytics, operations, and more—all in one place.

Customers can also use Amazon Redshift data sharing to securely share this data with multiple consumer clusters used by different teams—both within and across AWS accounts—driving a unified view of business and facilitating self-service access to application data within team clusters while maintaining governance over sensitive operational data.

Furthermore, customers can build machine learning models directly on their operational data in Amazon Redshift ML (native integration into Amazon SageMaker) without needing to build any data pipelines and use them to run billions of predictions with SQL commands. Or they can build complex transformations and aggregations on the integrated data using Amazon Redshift materialized views.

We’re excited to share four AWS database zero-ETL integrations with Amazon Redshift:

Amazon Aurora MySQL-Compatible Edition (generally available)
Amazon Aurora PostgreSQL-Compatible Edition (preview)
Amazon RDS for MySQL (preview)
Amazon DynamoDB (limited preview)

By bringing different database services closer to analytics, AWS is streamlining access to data and enabling companies to accelerate innovation, create competitive advantage, and maximize the business value extracted from their data assets.

Amazon Aurora zero-ETL integration with Amazon Redshift

The Amazon Aurora zero-ETL integration with Amazon Redshift unifies transactional data from Amazon Aurora with near real-time analytics in Amazon Redshift. This eliminates the burden of building and maintaining custom ETL pipelines between the two systems. Unlike traditional siloed databases that force a tradeoff between performance and analytics, the zero-ETL integration replicates data from multiple Aurora clusters into the same Amazon Redshift warehouse. This enables holistic insights across applications without impacting production workloads. The entire system can be serverless and can auto-scale to handle fluctuations in data volume without infrastructure management.

Amazon Aurora MySQL zero-ETL integration with Amazon Redshift processes over 1 million transactions per minute (an equivalent of 17.5 million insert/update/delete row operations per minute) from multiple Aurora databases and makes them available in Amazon Redshift in less than 15 seconds (p50 latency lag). Figure 1 shows how the Aurora MySQL zero-ETL integration with Amazon Redshift works at a high level.

Figure 1: High level working of Aurora MySQL zero-ETL integration with Amazon Redshift
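As a quick sanity check on the throughput figures quoted above, the following Python snippet converts them to per-second rates. It uses only the two numbers stated in the text; the variable names are ours, and the results are simple arithmetic rather than additional benchmark data.

```python
transactions_per_minute = 1_000_000
row_operations_per_minute = 17_500_000

# Average number of row operations in each transaction.
rows_per_transaction = row_operations_per_minute / transactions_per_minute

# The same rates expressed per second, rounded to the nearest whole number.
transactions_per_second = round(transactions_per_minute / 60)
row_operations_per_second = round(row_operations_per_minute / 60)

print(rows_per_transaction)       # 17.5
print(transactions_per_second)    # 16667
print(row_operations_per_second)  # 291667
```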

In their own words, see how one of our customers is using Aurora MySQL zero-ETL integration with Amazon Redshift.

In the retail industry, for example, Infosys wanted to gain faster insights about their business, such as best-selling products and high-revenue stores, based on transactions in a store management system. They used Amazon Aurora MySQL zero-ETL integration with Amazon Redshift to achieve this. With this integration, Infosys replicated Aurora data to Amazon Redshift and created Amazon QuickSight dashboards for product managers and channel leaders in just a few seconds, instead of several hours. Now, as part of Infosys Cobalt and Infosys Topaz blueprints, enterprises can have near real-time analytics on transactional data, which can help them make informed decisions related to store management.

– Sunil Senan, SVP and Global Head of Data, Analytics, and AI, Infosys

To learn more, see Aurora Docs, Amazon Redshift Docs, and the AWS News Blog.

Amazon RDS for MySQL zero-ETL integration with Amazon Redshift

The new Amazon RDS for MySQL integration with Amazon Redshift empowers customers to easily perform analytics on their RDS for MySQL data. With a few clicks, it seamlessly replicates RDS for MySQL data into Amazon Redshift, automatically handling initial data loads, ongoing change synchronization, and schema replication. This eliminates the complexity of traditional ETL jobs. The zero-ETL integration enables workload isolation for optimal performance; RDS for MySQL focuses on high-speed transactions while Amazon Redshift handles analytical workloads. Customers can also consolidate data from multiple sources into Amazon Redshift, such as Aurora MySQL-Compatible Edition and Aurora PostgreSQL-Compatible Edition. This unified view provides holistic insights across applications in one place, delivering significant cost and operational efficiencies.

Figure 2 shows how a customer can use the AWS Management Console for Amazon RDS to get started with creating a zero-ETL integration from RDS for MySQL, Aurora MySQL-Compatible Edition, and Aurora PostgreSQL-Compatible Edition to Amazon Redshift.

Figure 2: How to create a zero-ETL integration using Amazon RDS.

This integration is currently in public preview. Visit the getting started guide to learn more.

Amazon DynamoDB zero-ETL integration with Amazon Redshift

The Amazon DynamoDB zero-ETL integration with Amazon Redshift (limited preview) provides a fully managed solution for making data from DynamoDB available for analytics in Amazon Redshift. With minimal configuration, customers can replicate DynamoDB data into Amazon Redshift for analytics without consuming the DynamoDB Read Capacity Units (RCU). This zero-ETL integration unlocks powerful Amazon Redshift capabilities on DynamoDB data such as high-speed SQL queries, machine learning integrations, materialized views for fast aggregations, and secure data sharing.

This integration is currently in limited preview. Use this link to request access.

Integrated services bring us closer to zero-ETL

Our mission is to help customers get the most value from their data, and integrated services are key to this. That’s why we’re building towards a zero-ETL future today. When complex ETL processes are automated, data engineers can redirect their focus to creating value from the data. With this modern approach to data management, organizations can accelerate their use of data to streamline operations and fuel business growth.

About the author

Jyoti Aggarwal is a Product Management lead for Amazon Redshift zero-ETL. She brings along an expertise in cloud compute and storage, data warehouse, and B2B/B2C customer experience.
