All posts by Krishna Gogineni

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

2024-05-30 Krishna Gogineni

Post Syndicated from Krishna Gogineni original https://aws.amazon.com/blogs/big-data/migrate-a-petabyte-scale-data-warehouse-from-actian-vectorwise-to-amazon-redshift/

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.

In this post, we discuss how a financial services industry customer achieved scalability, resiliency, and availability by migrating from an on-premises Actian Vectorwise data warehouse to Amazon Redshift.

Challenges

The customer’s use case required a high-performing, highly available, and scalable data warehouse to process queries against large datasets in a low-latency environment. Their Actian Vectorwise system was designed to replace Excel plugins and stock screeners but eventually evolved into a much larger and ambitious portfolio analysis solution running multiple API clusters on premises, serving some of the largest financial services firms worldwide. The customer saw growing demand that needed high performance and scalability due to 30% year-over-year increase in usage from the success of their products. The customer needed to keep up with increased volume of read requests, but they couldn’t do this without deploying additional hardware in the data center. There was also a customer mandate that business-critical products must have their hardware updated to cloud-based solutions or be deemed on the path to obsolescence. In addition, the business started moving customers onto a new commercial model, and therefore new projects would need to provision a new cluster, which meant that they needed improved performance, scalability, and availability.

They faced the following challenges:

Scalability – The customer understood that infrastructure maintenance was a growing issue and, although operations were a consideration, the existing implementation didn’t have a scalable and efficient solution to meet the advanced sharding requirements needed for query, reporting, and analysis. Over-provisioning of data warehouse capacity to meet unpredictable workloads resulted in underutilized capacity during normal operations by 30%.
Availability and resiliency – Because the customer was running business-critical analytical workloads, it required the highest levels of availability and resiliency, which was a concern with the on-premises data warehouse solution.
Performance – Some of their queries needed to be processed in priority, and users were starting to experience performance degradation with longer-running query times as their solution started getting used more and more. The need for a scalable and efficient solution to manage customer demand, address infrastructure maintenance concerns, replace legacy tooling, and tackle availability led to them choosing Amazon Redshift as the future state solution. If these concerns were not addressed, the customer would be prevented from growing their user base.

Legacy architecture

The customer’s platform was the main source for one-time, batch, and content processing. It served many enterprise use cases across API feeds, content mastering, and analytics interfaces. It was also the single strategic platform within the company for entity screening, on-the-fly aggregation, and other one-time, complex request workflows.

The following diagram illustrates the legacy architecture.

The architecture consists of many layers:

Rules engine – The rules engine was responsible for intercepting every incoming request. Based on the nature of the request, it routed the request to the API cluster that could optimally process that specific request based on the response time requirement.
API – Scalability was one of the primary challenges with the existing on-premises system. It wasn’t possible to quickly scale up and down API service capacity to meet growing business demand. Both the API and data store had to support a highly volatile workload pattern. This included simple data retrieval requests that had to be processed within a few milliseconds vs. power user-style batch requests with complex analytics-based workloads that could take several seconds and significant compute resources to process. To separate these different workload patterns, the API and data store infrastructure was split into multiple isolated physical clusters. This made sure each workload group was provisioned with sufficient reserved capacity to meet the respective response time expectations. However, this model of reserving capacity for each workload type resulted in suboptimal usage of compute resources because each cluster would only process a specific workload type.
Data store – The data store used a custom data model that had been highly optimized to meet low-latency query response requirements. The current on-premises data store wasn’t horizontally scalable, and there was no built-in replication or data sharding capability. Due to this limitation, multiple database instances were created to meet concurrent scalability and availability requirements because the schema wasn’t generic per dataset. This model caused operational maintenance overhead and wasn’t easily expandable.
Data ingestion – Pentaho was used to ingest data sourced from multiple data publishers into the data store. The ingestion framework itself didn’t have any major challenges. However, the primary bottleneck was due to scalability issues associated with the data store. Because the data store didn’t support sharding or replication, data ingestion had to explicitly ingest the same data concurrently across multiple database nodes within a single transaction to provide data consistency. This significantly impacted overall ingestion speed.

Overall, the current architecture didn’t support workload prioritization, therefore a physical model of resources was reserved for this reason. The downside here is over-provisioning. The system had an integration with legacy backend services that were all hosted on premises.

Solution overview

Amazon Redshift is an industry-leading cloud data warehouse. Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning (ML) to deliver the best price-performance at any scale.

Amazon Redshift is designed for high-performance data warehousing, which provides fast query processing and scalable storage to handle large volumes of data efficiently. Its columnar storage format minimizes I/O and improves query performance by reading only the relevant data needed for each query, resulting in faster data retrieval. Lastly, you can integrate Amazon Redshift with data lakes like Amazon Simple Storage Service (Amazon S3), combining structured and semi-structured data for comprehensive analytics.

The following diagram illustrates the architecture of the new solution.

In the following sections, we discuss the features of this solution and how it addresses the challenges of the legacy architecture.

Rules engine and API

Amazon API Gateway is a fully managed service that help developers deliver secure, robust, API-driven application backends at any scale. To address scalability and availability requirements of the rules and routing layer, we introduced API Gateway to do the routing of the client requests to different integration paths using routes and parameter mappings. Having API Gateway as the entry point allowed the customer to move away from the design, testing, and maintenance of their rules engine development workload. In their legacy environment, handling fluctuating amounts of traffic posed a significant challenge. However, API Gateway seamlessly addressed this issue by acting as a proxy and automatically scaling to accommodate varying traffic demands, providing optimal performance and reliability.

Data storage and processing

Amazon Redshift allowed the customer to meet their scalability and performance requirements. Amazon Redshift features such as workload management (WLM), massively parallel processing (MPP) architecture, concurrency scaling, and parameter groups helped address the requirements:

WLM provided the ability for query prioritization and managing resources effectively
The MPP architecture model provided horizontal scalability
Concurrency scaling added additional cluster capacity to handle unpredictable and spiky workloads
Parameter groups defined configuration parameters that control database behavior

Together, these capabilities allowed them to meet their scalability and performance requirements in a managed fashion.

Data distribution

The legacy data center architecture was unable to partition the data without deploying additional hardware in the data center, and it couldn’t handle read workloads efficiently.

The MPP architecture of Amazon Redshift offers efficient data distribution across all the compute nodes, which helped run heavy workloads in parallel and subsequently lowered response times. With the data distributed across all the compute nodes, it allows data to be processed in parallel. Its MPP engine and architecture separates compute and storage for efficient scaling and performance.

Operational efficiency and hygiene

Infrastructure maintenance and operational efficiency was a concern for the customer in their current state architecture. Amazon Redshift is a fully managed service that takes care of data warehouse management tasks such as hardware provisioning, software patching, setup, configuration, and monitoring nodes and drives to recover from failures or backups. Amazon Redshift periodically performs maintenance to apply fixes, enhancements, and new features to your Redshift data warehouse. As a result, the customer’s operational costs reduced by 500%, and they are now able to spend more time innovating and building mission-critical applications.

Workload management

Amazon Redshift WLM was able to resolve issues with the legacy architecture where longer-running queries were consuming all the resources, causing other queries to run slower, impacting performance SLAs. With automatic WLM, the customer was able to create separate WLM queues with different priorities, which allowed them to manage the priorities for the critical SLA-bound workloads and other non-critical workloads. With short query acceleration (SQA) enabled, it prioritized selected short-running queries ahead of longer-running queries. Furthermore, the customer benefited by using query monitoring rules in WLM to apply performance boundaries to control poorly designed queries and take action when a query goes beyond those boundaries. To learn more about WLM, refer to Implementing workload management.

Workload isolation

In the legacy architecture, all the workloads—extract, transform, and load (ETL); business intelligence (BI); and one-time workloads—were running on the same on-premises data warehouse, leading to the noisy neighbor problem and performance issues with the increase in users and workloads.

With the new solution architecture, this issue is remediated using data sharing in Amazon Redshift. With data sharing, the customer is able to share live data with security and ease across Redshift clusters, AWS accounts, or AWS Regions for read purposes, without the need to copy any data.

Data sharing improved the agility of the customer’s organization. It does this by giving them instant, granular, and high-performance access to data across Redshift clusters without the need to copy or move it manually. With data sharing, customers have live access to data, so their users can see the most up-to-date and consistent information as it’s updated in Redshift clusters. Data sharing provides workload isolation by running ETL workloads in its own Redshift cluster and sharing data with other BI and analytical workloads in their respective Redshift clusters.

Scalability

With the legacy architecture, the customer was facing scalability challenges during large events to handle unpredictable spiky workloads and over-provisioning of the database capacity. Using concurrency scaling and elastic resize allowed the customer to meet their scalability requirements and handle unpredictable and spiky workloads.

Data migration to Amazon Redshift

The customer used a home-grown process to extract the data from Actian Vectorwise and store it in Amazon S3 and CSV files. The data from Amazon S3 was then ingested into Amazon Redshift.

The loading process used a COPY command and ingested the data from Amazon S3 in a fast and efficient way. A best practice for loading data into Amazon Redshift is to use the COPY command. The COPY command is the most efficient way to load a table because it uses the Amazon Redshift MPP architecture to read and load data in parallel from a file or multiple files in an S3 bucket.

To learn about the best practices for source data files to load using the COPY command, see Loading data files.

After the data is ingested into Redshift staging tables from Amazon S3, transformation jobs are run from Pentaho to apply the incremental changes to the final reporting tables.

The following diagram illustrates this workflow.

Key considerations for the migration

There are three ways of migrating an on-premises data warehouse to Amazon Redshift: one-step, two-step, and wave-based migration. To minimize the risk of migrating over 20 databases that vary in complexity, we decided on the wave-based approach. The fundamental concept behind wave-based migration involves dividing the migration program into projects based on factors such as complexity and business outcomes. The implementation then migrates each project individually or by combining certain projects into a wave. Subsequent waves follow, which may or may not be dependent on the results of the preceding wave.

This strategy requires both the legacy data warehouse and Amazon Redshift to operate concurrently until the migration and validation of all workloads are successfully complete. This provides a smooth transition while making sure the on-premises infrastructure can be retired only after thorough migration and validation have taken place.

In addition, within each wave, we followed a set of phases to make sure that each wave was successful:

Assess and plan
Design the Amazon Redshift environment
Migrate the data
Test and validate
Perform cutover and optimizations

In the process, we didn’t want to rewrite the legacy code for each migration. With minimal code changes, we migrated the data to Amazon Redshift because SQL compatibility was very important in the process due to existing knowledge within the organization and downstream application consumption. After the data was ingested into the Redshift cluster, we adjusted the tables for best performance.

One of the main benefits we realized as part of the migration was the option to integrate data in Amazon Redshift with other business groups in the future that use AWS Data Exchange, without significant effort.

We performed blue/green deployments to make sure that the end-users didn’t encounter any latency degradation while retrieving the data. We migrated the end-users in a phased manner to measure the impact and adjust the cluster configuration as needed.

Results

The customer’s decision to use Amazon Redshift for their solution was further reinforced by the platform’s ability to handle both structured and semi-structured data seamlessly. Amazon Redshift allows the customer to efficiently analyze and derive valuable insights from their diverse range of datasets, including equities and institutional data, all while using standard SQL commands that teams are already comfortable with.

Through rigorous testing, Amazon Redshift consistently demonstrated remarkable performance, meeting the customer’s stringent SLAs and delivering exceptional subsecond query response times with an impressive latency. With the AWS migration, the customer achieved a 5% improvement in query performance. Scalability of the clusters was done in minutes compared to 6 months in the data center. Operational cost reduced by 500% due to the simplicity of the Redshift cluster operations in AWS. Stability of the clusters improved by 100%. Upgrades and patching cycle time improved by 200%. Overall, improvement in operational posture and total savings for the footprint has resulted in significant savings for the team and platform in general. In addition, the ability to scale the overall architecture based on market data trends in a resilient and highly available way not only met the customer demand in terms of time to market, but also significantly reduced the operational costs and total cost of ownership.

Conclusion

In this post, we covered how a large financial services customer improved performance and scalability, and reduced their operational costs by migrating to Amazon Redshift. This enabled the customer to grow and onboard new workloads into Amazon Redshift for their business-critical applications.

To learn about other migration use cases, refer to the following:

About the Authors

Krishna Gogineni is a Principal Solutions Architect at AWS helping financial services customers. Krishna is Cloud-Native Architecture evangelist helping customers transform the way they build software. Krishna works with customers to learn their unique business goals, and then super-charge their ability to meet these goals through software delivery that leverages industry best practices/tools such as DevOps, Data Lakes, Data Analytics, Microservices, Containers, and Continuous Integration/Continuous Delivery.

Dayananda Shenoy is a Senior Solution Architect with over 20 years of experience designing and architecting backend services for financial services products. Currently, he leads the design and architecture of distributed, high-performance, low latency analytics services for a data provider. He is passionate about solving scalability and performance challenges in distributed systems leveraging emerging technology which improve existing tech stacks and add value to the business to enhance customer experience.

Vishal Balani is a Sr. Customer Solutions Manager based out of New York. He works closely with Financial Services customers to help them leverage cloud for businesses agility, innovation and resiliency. He has extensive experience leading large-scale cloud migration programs. Outside of work he enjoys spending time with family, tinkering with a new project or riding his bike.

Ranjan Burman is a Sr. PostgreSQL Database Specialist SA. He specializes in RDS & Aurora PostgreSQL. He has more than 18 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Muthuvelan Swaminathan is an Enterprise Solutions Architect based out of New York. He works with enterprise customers providing architectural guidance in building resilient, cost-effective and innovative solutions that address business needs.

Temporal data lake architecture for benchmark and indices analytics

2023-07-21 Krishna Gogineni

Post Syndicated from Krishna Gogineni original https://aws.amazon.com/blogs/architecture/temporal-data-lake-architecture-for-benchmark-and-indices-analytics/

Financial trading houses and stock exchanges generate enormous volumes of data in near real-time, making it difficult to perform bi-temporal calculations that yield accurate results. Achieving this requires a processing architecture that can handle large volumes of data during peak bursts, meet strict latency requirements, and scale according to incoming volumes.

In this post, we’ll describe a scenario for an industry leader in the financial services sector and explain how AWS services are used for bi-temporal processing with state management and scale based on variable workloads during the day, all while meeting strict service-level agreement (SLA) requirements.

Problem statement

To design and implement a fully temporal transactional data lake with the repeatable read isolation level for queries is a challenge, particularly with burst events that need the overall architecture to scale accordingly. The data store in the overall architecture needs to record the value history of data at different times, which is especially important for financial data. Financial data can include corporate actions, annual or quarterly reports, or fixed-income securities, like bonds that have variable rates. It’s crucial to be able to correct data inaccuracies during the reporting period.

The example customer seeks a data processing platform architecture to dynamically scale based on the workloads with a capacity of processing 150 million records under 5 minutes. Their platform should be capable of meeting the end-to-end SLA of 15 minutes, from ingestion to reporting, with lowest total cost of ownership. Additionally, managing bi-temporal data requires a database that has critical features, such as ACID (atomicity, consistency, isolation, durability) compliance, time-travel capability, full-schema evolution, partition layout and evolution, rollback to prior versions, and SQL-like query experience.

Solution overview

The solution architecture key building blocks are Amazon Kinesis Data Streams for streaming data, Amazon Kinesis Data Analytics with Apache Flink as processing engine, Flink’s RocksDB for state management, and Apache Iceberg on Amazon Simple Storage Service (Amazon S3) as the storage engine (Figure 1).

Figure 1. End-to-end data-processing architecture

Data processing

Here’s how it works:

A publisher application receives the data from the source systems and publishes data into Kinesis Data Streams using a well-defined JSON format structure.
Kinesis Data Streams holds the data for a duration that is configurable so data is not lost and can auto scale based on the data volume ingested.
Kinesis Data Analytics runs an Apache Flink application, with state management (RocksDB), to handle bi-temporal calculations. The Apache Flink application consumes data from Kinesis Data Streams and performs the following computations:
- Transforms the JSON stream into a row-type record, compatible with a SQL table-like structure, resolving nesting and parent–child relationships present within the stream
- Checks whether the record has already an existing state in in-memory RocksDB or disk attached to Kinesis Data Analytics computational node to avoid read latency from the database, which is critical for meeting the performance requirements
- Performs bi-temporal calculations and creates the resultant records in an in-memory data structure before invoking the Apache Iceberg sink operator
- The Apache Flink application sink operator appends the temporal states, expressed as records into existing Apache Iceberg data store. This will comply with key principles of time series data, which is immutable, and the ability to time-travel along with ACID compliance, schema evolution, and partition evolution
Kinesis Data Analytics is resilient and provides a no-data-loss capability, with features like periodic checkpoints and savepoints. They are used to store the state management in a secure Amazon S3 location that can be accessed outside of Kinesis Data Analytics. This savepoints mechanism can be used to programmatically to scale the cluster size based on the workloads using time-driven scheduling and AWS Lambda functions.
If the time-to-live feature of RocksDB is implemented, old records are stored in Apache Iceberg on Amazon S3. When performing temporal calculations, if the state is not found in memory, data is read from Apache Iceberg into RocksDB and the processing is completed. However, this step is optional and can be circumvented if the Kinesis Data Analytics cluster is initialized with right number of Kinesis processing units to hold the historical information, as per requirements.
Because the data is stored in an Apache Iceberg table format in Amazon S3, data is queried using Trino, which supports Apache Iceberg table format.
The end user queries data using any SQL tool that supports the Trino query engine.

Apache Iceberg maintenance jobs, such as data compaction, expire snapshot, delete orphan files, can be launched using Amazon Athena to optimize performance out of Apache Iceberg data store. Details of each processing step performed in Apache Flink application are captured using Amazon CloudWatch, which logs all the events.

Scalability

Amazon EventBridge scheduler invokes a Lambda function to scale the Kinesis Data Analytics. Kinesis Data Analytics has a short outage during rescaling that is proportional to the amount of data stored in RocksDB, which is why a state management strategy is necessary for the proper operation of the system.

Figure 2 shows the scaling process, which depicts:

Before peak load: The Kinesis Data Analytics cluster is processing off-peak records with minimum configuration before the peak load. A scheduled event is launched from EventBridge that invokes a Lambda function, which shuts down the cluster using the savepoint mechanism and scales up the Kinesis Data Analytics cluster to required Kinesis processing units.
During peak load: When the peak data burst happens, the Kinesis Data Analytics cluster is ready to handle the volume of data from Kinesis Data Stream, and processes it within the SLA of 5 minutes.
After peak load: A scheduled event from EventBridge invokes a Lambda function to scale down the Kinesis Data Analytics cluster to the minimum configuration that holds the required state for the entire volume of records.

Figure 2. Cluster scaling before, during, and after peak data volume processing

Performance insights

With the discussed architecture, we want to demonstrate that the we are able to meet the SLAs, in terms of performance and processing times. We have taken a subset of benchmarks and indices data and processed the same with the end-to-end architecture. During the process, we observed some very interesting findings, which we would like to share.

Processing time for Apache Iceberg Upsert vs Append operations: During our tests, we expected Upsert operation to be faster than append. But on the contrary, we noticed that Append operations were faster compared to Upsert even though more computations are performed in the Apache Flink application. In our test with 3,500,000 records, Append operation took 1556 seconds while Upsert took 1675 seconds to process the data (Figure 3).

Figure 3. Processing times for Upsert vs. Append

Compute consumption for Apache Iceberg Upsert vs. Append operations: Comparing the compute consumption for 10,000,000 records, we noticed that Append operation was able to process the data in the same amount of time as Upsert operation but with less compute resources. In our tests, we have noted that Append operation only consumed 64 Kinesis processing units, whereas Upsert consumed 78 Kinesis processing units (Figure 4).

Figure 4. Comparing consumption for Upsert vs. Append

Scalability vs performance: To achieve the desired data processing performance, we need a specific configuration of Kinesis processing units, Kinesis Data Streams, and Iceberg parallelism. In our test with the data that we chose, we started with four Kinesis processing units and four Kinesis data streams for data processing. We observed an 80% performance improvement in data processing with 16 Kinesis data processing units. An additional 6% performance improvement was demonstrated when we scaled to 32 Kinesis processing units. When we increased the Kinesis data streams to 16, we observed an additional 2% performance improvement (Figure 5).

Figure 5. Scalability vs. performance

Data volume processing times for Upsert vs. Append: For this test, we started with 350,000 records of data. When we increased data volume to 3.5M records, we observed that Append performing better than Upsert, demonstrating a five-fold increase in processing time (Figure 6).

Figure 6. Data volume processing times for Upsert vs. Append

Conclusion

The architecture we explored today scales based on the data-volume requirements of the customer and is capable of meeting the end-to-end SLA of 15 minutes, with a potential lowered total cost of ownership. Additionally, the solution is capable of handling high-volume, bi-temporal computations with ACID compliance, time travel, full-schema evolution, partition layout evolution, rollback to prior versions and SQL-like query experience.

Noise

All posts by Krishna Gogineni

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

Challenges

Legacy architecture

Solution overview

Rules engine and API

Data storage and processing

Data distribution

Operational efficiency and hygiene

Workload management

Workload isolation

Scalability

Data migration to Amazon Redshift

Key considerations for the migration

Results

Conclusion

About the Authors

Temporal data lake architecture for benchmark and indices analytics

Problem statement

Solution overview

Data processing

Scalability

Performance insights

Conclusion

Further reading

The collective thoughts of the interwebz