All posts by Hyeonho Kim

How Karrot built a feature platform on AWS, Part 1: Motivation and feature serving

Post Syndicated from Hyeonho Kim original https://aws.amazon.com/blogs/architecture/how-karrot-built-a-feature-platform-on-aws-part-1-motivation-and-feature-serving/

This post is co-written with Hyeonho Kim, Jinhyeong Seo and Minjae Kwon from Karrot.

Karrot is Korea’s leading local community and a service centered on all possible connections in the neighborhood. Beyond simple flea markets, it strengthens connections between neighbors, local stores, and public institutions, and creates a warm and active neighborhood as its core value.

Karrot uses a recommendation system to provide users with connections that match their interests and neighborhoods, and to provide personalized experiences. In particular, you can check customized content on the home screen of the Karrot application. Personalized content is continuously updated by analyzing the user’s activity patterns without having to set a special interest category. The core of the feed is to provide new and interesting content, and Karrot is constantly working to improve user satisfaction for this purpose. Karrot actively uses a recommendation system to provide personalized and recommended content. In this system, the feature platform plays a key role along with the machine learning (ML) recommendation model. The feature platform acts as a data store that stores and serves data necessary for the ML recommendation model, such as the user’s behavior history and article information.

This two-part series starts by presenting our motivation, our requirements, and the solution architecture, focusing on feature serving. Part 2 covers the process of collecting features in real-time and batch ingestion into an online store, and the technical approaches for stable operation.

Background of the feature platform at Karrot

Karrot recognized the need for a feature platform in early 2021, about 2 years after implementing a recommendation system in their application. At that time, Karrot was achieving significant growth in various metrics through active usage of the recommendation system. By showing personalized feeds to each user beyond chronological feeds, they observed a more than 30% increase in click-through rates and higher user satisfaction. As the recommendation system’s impact continued to grow, the ML team naturally faced the challenge of advancing the system.

In ML-based systems, various high-quality input data (clicks, conversion actions, and so on) is considered a crucial element. These input data are typically called features. At Karrot, data including user behavior logs, action logs, and status values are collectively referred to as user features, and logs related to articles are called article features.

To improve the accuracy of personalized recommendations, various types of features are needed. A system that can efficiently manage these features and quickly deliver them to ML recommendation models is essential. Here, serving means the process of providing real-time data needed when the recommendation system suggests personalized content to users. However, the feature management approach in the existing recommendation system had some limitations, with the following key issues:

  • Dependency on flea market server – Because the initial recommendation system existed as an internal library on the flea market server, the source code of the web application had to be changed whenever the recommendation logic was modified or a feature was added. This reduced the flexibility of deployment and made it difficult to optimize resources.
  • Limited scalability of recommendation logic and features – The initial recommendation system directly depended on the flea market database and only considered flea market articles. This made it impossible to expand to new article types like local community, local jobs, and advertisements, which are managed by different data sources. Additionally, feature-related code was hardcoded, making it difficult to explore, add, or modify features.
  • Lack of feature data source reliability – Although features were retrieved from various repositories such as Amazon Simple Storage Service (Amazon S3), Amazon ElastiCache, and Amazon Aurora, the reliability of data quality was low due to the lack of a consistent schema and collection pipeline. This was a major limitation in securing the latest features and consistency.

The following diagram illustrates the initial recommendation system backend structure.

To solve these problems, we needed a new central system that could efficiently support feature management, real-time ingestion, and serving, and so we started the feature platform project.

Requirements of the feature platform

The following functional requirements were organized by separating the feature platform into an independent service:

  • Record and rapidly serve the top N most recent actions performed by users. Allow parameterization of both the top N value and the lookup period.
  • Support user-specific features such as notification keywords in addition to action features.
  • Process features from various article types beyond just flea market articles.
  • Handle arbitrary data types for all features, including primitive types, lists, sets, and maps.
  • Provide real-time updates for both action features and user characteristic features.
  • Provide flexibility in feature lists, counts, and lookup periods for each request.

To implement these functional requirements, a new platform was necessary. This platform needed three core capabilities: real-time ingestion of various feature types, storage with consistent schema, and quick response to diverse query requests. Although these requirements initially seemed ambiguous, designing a generalized structure enabled efficient configuration of data ingestion pipelines, storage methods, and serving schemas, leading to clearer development objectives.

In addition to functional requirements, the technical requirements included:

  • Serving traffic: 1,500 or more requests per second (RPS)
  • Ingestion traffic: 400 or more writes per second (WPS)
  • Top N values: 30–50
  • Single feature size: Up to 8 KB
  • Total number of features: Over 3 billion or more

At the time, the variety and number of features in use were limited, and the recommendation models were simple, resulting in modest technical requirements. However, considering the rapid growth rate, a significant increase in system requirements was anticipated. Based on this prediction, higher targets were set beyond the initial requirements. As of February 2025, the serving and ingestion traffic has increased by about 90 times compared to the initial requirements, and the total number of features has increased by hundreds of times. The ability to handle this rapid growth was made possible by the highly scalable architecture of the feature platform, which we discuss in the following sections.

Solution overview

The following diagram illustrates the architecture of the feature platform.

The feature platform consists of three main components: feature serving, a stream ingestion pipeline, and a batch ingestion pipeline.

Part 1 of this series will cover feature serving. Feature serving is the core function of receiving client requests and providing the required features. Karrot designed this system with four major components:

  • Server – A server that receives and processes feature serving requests, and is a pod located on Amazon Elastic Kubernetes Service (Amazon EKS)
  • Remote cache – A remote cache layer shared by servers, and uses ElastiCache
  • Database – A persistence layer that stores features, and uses Amazon DynamoDB
  • On-demand feature server – A server that serves features that can’t be stored in the remote cache and database due to compliance issues, or that require real-time calculations every time

From a data store perspective, feature serving should serve high-cardinality features with low latency at scale. Karrot introduced multi-level cache and subdivided serving strategies according to the characteristics of the features:

  • Local cache (tier 1 cache) – An in-memory store located within the server, suitable for cases where the data size is small and is frequently accessed or requires fast response times
  • Remote cache (tier 2 cache) – Suitable for cases where the data size is medium and is frequently accessed
  • Database (tier 3 cache) – Suitable for cases where the data size is large and is not frequently accessed or is less sensitive to response times

Schema design

The feature platform stores multiple features together using the concept of feature groups, such as column families. All feature groups are defined through the feature group schema, called feature group specifications, and each feature group specification defines the name of the feature group, required features, and so on.

Based on this concept, the key design is defined as follows:

  • Partition key: <feature_group_name>#<feature_group_id>
  • Sort key: <feature_group_timestamp> or a string representing null

To illustrate how this works in practice, let’s explore an example of a feature group representing recently clicked flea market articles by user 1234. Consider the following scenario:

  • Feature group name: recent_user_clicked_fleaMarketArticles
  • User ID: 1234
  • Click timestamp: 987654321
  • Features in the feature group:
    • Clicked article ID: a
    • User session ID: 1111

In this example, the keys and feature group are created as follows:

  • Partition key: recent_user_clicked_fleaMarketArticles#1234
  • Sort Key: 987654321
  • Value: {"0": "a", "1": "1111"}

Features defined in the feature group specification maintain a fixed order, using this ordering like an enum when saving the feature group.

Feature serving read/write flow

The feature platform uses a multi-level cache and database for feature serving, as shown in the following diagram.

To illustrate this process, let’s examine how the system retrieves feature groups 1, 2, and 3 from flea market articles. The read flow (solid lines in the preceding diagram) demonstrates data access optimization using a multi-level cache strategy:

  1. When a query request comes in, first check the local cache.
  2. Data not in the local cache is searched in ElastiCache.
  3. Data not in ElastiCache is searched in DynamoDB.
  4. The feature groups found at each stage are collected and returned as the final response.

The write flow (dotted lines in the preceding diagram) consists of the following steps:

  1. Feature groups that have cache misses are stored in each cache level.
  2. Data not found in the local cache but found in the remote cache or database is stored in the upper-level cache.
    1. Data found in ElastiCache is stored in the local cache.
    2. Data found in DynamoDB is stored in both ElastiCache and the local cache.
  3. Cache write operations are performed asynchronously in the background.

This approach presents a strategy to maintain data consistency and improve future access time in the multi-level cache structure. In an ideal situation, serving works well without any problems with just the preceding flow. However, the reality was not like that. The problems experienced included cache misses, consistency, and penetration problems:

  • Cache miss problem – Frequent cache misses slow down the response time and put a burden on the next level cache or database. Karrot uses the Probabilistic Early Expirations (PEE) technique to proactively refresh data that is likely to be retrieved again in the future, thereby maintaining low latency and mitigating cache stampede.
  • Cache consistency problem – If the Time-To-Live (TTL) of a cache is set incorrectly, it can affect recommendation quality or reduce system efficiency. Karrot sets soft and hard TTL separately, and sometimes uses a write-through caching strategy together to synchronize cache and database to alleviate consistency problems. In addition, jitter is added to spread out the TTL deletion time to alleviate the cache stampede of feature groups written at similar times.
  • Cache penetration problem – Continuous queries for non-existent feature groups can lead to DynamoDB queries, resulting in increased costs and response times. The platform resolves this through negative caching, storing information about non-existent feature groups to reduce unnecessary database queries. Additionally, the system monitors the ratio of missing feature groups in DynamoDB, negative cache hit rates, and potential consistency problems.

Future improvements for feature serving

Karrot is considering the following future improvements to their feature serving solution:

  • Large data caching – Recently, the demand for storing large data features has been increasing. This is because as Karrot grows, the number of features also increases. Also, as the demand for embeddings increases along with the rapid growth of large language models (LLMs), the size of data to be stored has increased. Accordingly, we are reviewing more efficient serving by using an embedded database.
  • Efficient use of cache memory – Even if an efficient TTL value is set initially, the efficiency tends to decrease as the user’s usage pattern changes and the model is changed. Also, as more feature groups are defined, monitoring becomes more difficult. It should be straightforward to find the optimal TTL value for the cache based on data. We are considering a method to efficiently use memory while maintaining a high recommendation quality through cache hit rate and feature group loss prevention. Should we cache a feature group that is only retrieved once? What about a feature group that is retrieved twice? The current feature platform attempts caching even if a cache miss occurs only one time. We believe that all feature groups that have cache misses are worth caching. This naturally increases the inefficiency of caching. An advanced policy is needed to determine and cache feature groups that are worth caching based on various data. This will increase the efficiency of cache usage.
  • Multi-level cache optimization – Currently, the feature platform has a multi-level cache structure, and the complexity will increase if an embedded database is added in the future. Therefore, it is necessary to find and set the optimal settings by considering different cache levels. In the future, we will try to maximize efficiency by considering different levels of cache settings.

Conclusion

In this post, we examined how Karrot built their feature platform, focusing on feature serving capabilities. As of February 2025, the platform reliably handles over 100,000 RPS with P99 latency under 30 milliseconds, providing stable recommendation services through a scalable architecture that efficiently manages traffic increases.

Part 2 will explore how features are generated using consistent feature schemas and ingestion pipelines through the feature platform.


About the authors

How Karrot built a feature platform on AWS, Part 2: Feature ingestion

Post Syndicated from Hyeonho Kim original https://aws.amazon.com/blogs/architecture/how-karrot-built-a-feature-platform-on-aws-part-2-feature-ingestion/

This post is co-written with Hyeonho Kim, Jinhyeong Seo and Minjae Kwon from Karrot.

In Part 1 of this series, we discussed how Karrot developed a new feature platform, which consists of three main components: feature serving, a stream ingestion pipeline, and a batch ingestion pipeline. We discussed their requirements, the solution architecture, and feature serving using a multi-level cache. In this post, we share the stream and batch ingestion pipelines and how they ingest data into an online store from various event sources.

Solution overview

The following diagram illustrates the solution architecture, as introduced in Part 1.

Stream ingestion

Stream ingestion is the process of collecting data from various event sources in real time, transforming it into features, and storing them. It consists of two main components:

Consumers handle not only the source events, but also re-published events. When loading features, they are performed by considering different strategies, such as write-through and write-around, and are loaded in detail considering cardinality, data size, and access patterns.

Most features are generated based on two types of events: events that occur due to real-time user actions, and asynchronous events that occur due to state changes in user and article data. These events and features have an M:N relationship, meaning one event can be the source of multiple features, and one feature can be generated based on multiple events.

The following diagram illustrates the architecture of the stream ingestion pipeline.

To efficiently handle M:N relationships, a structure was needed to receive events and distribute them to multiple feature processing logics. Two core components were designed for this purpose:

  • Dispatcher – Receives events from multiple consumer groups and propagates them to relevant feature processing logic
  • Aggregator – Processes events received from the dispatcher into actual features

This stream processing pipeline enables real-time feature generation and storage.

Message broker optimization: Fast at-least-once delivery

The feature platform processes up to 25,000 events per second, including user behavior log events, at high speed. However, when worker traffic surges, event processing failures or infrastructure failures occasionally cause event loss. To solve this problem, the existing automatic commit mode was changed to manual commit in Amazon MSK. This allows events to be committed only when they are definitely processed, and failed events are sent to a separate retry topic and postprocessed through a dedicated worker.

However, processing large volumes of events synchronously with manual commit resulted in approximately 10 times slower processing speed and increased latency. Although consumer group resources were available, simply increasing the number of partitions in Amazon MSK wasn’t a solution due to team-specific partitioning permissions. The platform designed parallel processing within single partitions and implemented a custom consumer supporting retry functionality. The core of the implementation is to read as many messages from the partition as the fetch size at a time and process them by spawning worker threads in parallel for each message. When processing is complete, the offsets of successful messages are sorted and a manual commit is performed for the largest offset, and failed messages are republished to the retry topic. This enables parallel processing even in a single partition, and the concurrency can be controlled automatically. As a result, the event processing speed is faster than the existing automatic commit method, and it is stably processed without delay even when the number of events increases.

Stream processing

The stream ingestion pipeline performs only simple extract, transform, and load (ETL) logic and validation. There were already many requirements for complex stream processing in the feature platform, and a separate service was created to accommodate them. The feature platform didn’t address these requirements for the following reasons:

  • The purpose of stream ingestion in the feature platform is to collect and store features in real time, whereas the main purpose of stream processing is to process data.
  • Not all features require complex processing. We decided that it wasn’t appropriate to make the entire stream collection process complicated for some features.
  • The result data of stream processing could be used outside the feature platform, and there were requirements to consider this. Therefore, creating a separate service was more suitable for Karrot’s situation.
  • Additionally, some source data didn’t exist in AWS, which could have resulted in significant additional costs if everything was handled within the feature platform.

Although it’s a separate service from the feature platform, the following is a brief introduction to how the feature platform uses data through stream processing:

  • Various content embedding cases – We perform stream processing using models, and use various contents (articles, images, and so on) as input values to pre-trained models to create embeddings. These embeddings are stored in the feature platform and used as features during recommendation to improve recommendation quality.
  • Rich feature generation cases – Some of the processed data is further processed using large language models (LLMs) for use as features. One example is predicting which category a specific second-hand product belongs to and using this prediction value as a feature.

Batch ingestion

Batch ingestion is responsible for processing and storing large amounts of data into features in batches. This is divided into a cron job that runs periodically and a backfill job that loads large amounts of data one time.

For this purpose, AWS Batch based on AWS Fargate is used. AWS Batch jobs running on Fargate are provisioned independently from the other environments, enabling safe large-scale processing. For example, even if more than 1,000 servers or 10,000 vCPUs are used for backfilling large amounts of data, they are operated separately from the other services and can be operated efficiently with a usage-based billing method.

When adding new features, batch loading of past data or periodic loading of large amounts of data is one of the core functions of the feature platform. The main requirements considered in the design are as follows:

  • It must be able to process large amounts of data.
  • It must be able to start at the time desired by the user and finish the work within an appropriate time.
  • It must have low operating costs. It should be a managed service if possible, and it’s better if there is less additional work or specific domain knowledge for operation. Also, it should reuse existing service code as much as possible.
  • Complex operations for features or the configuration of Directed Acyclic Graphs (DAGs) are not necessarily required.

There were several options to choose from, such as Apache Airflow, but AWS Batch was chosen to avoid over-engineering considering the operating cost according to the current requirements.

The following diagram illustrates the architecture of the batch ingestion pipeline.

The key components are as follows:

  • Scheduler – It extracts the targets that need to perform the batch jobs according to the specifications such as FeatureGroupSpec and IngestionSpec written by the user on the feature platform, and registers the corresponding job specifications to an AWS Batch job (submit job).
  • AWS Batch – The jobs submitted by the scheduler are executed using the preconfigured job queue and computing environment. In the case of AWS Batch, you can configure a Fargate environment separately from the other production services, so that even if you provision large-scale resources and perform tasks, you can perform tasks stably without affecting the other production services.

Future improvements for batch ingestion

The current configuration works well and reliably, but there are some areas for improvement:

  • No DAG support – The initial feature platform performed relatively simple tasks, such as parsing batch data sources, converting them to the feature schema, and storing them. However, as the platform became more advanced, more complex operations became necessary, and therefore support for DAG configurations that can process features by sequentially performing various dependent jobs became necessary.
  • Manual configuration for parallel processing – Currently, when processing large-scale data in parallel, the worker must manually estimate the number of jobs to be processed in parallel and provide it in the specification, and the scheduler performs a submit job in parallel based on this. This method is based purely on experience, and in order for the system to become more advanced, the system must be able to automatically abstract and optimize the appropriate level of parallel processing.
  • Limited AWS Batch monitoring usability – AWS Batch monitoring has some limitations, such as jobs don’t transition from Runnable to Running state, a lack of appropriate notification systems for such cases, and the inability to directly check failed jobs through URL parameters when receiving alerts. These aspects should be improved from an operational convenience perspective.

Results

As of February 2025, Karrot has addressed the major problems mentioned in the early stages of feature platform development:

  • Decoupling recommendation logic from flea market server – The recommendation system now uses the feature platform across more than 10 different recommendation spaces and services.
  • Securing scalability of features used in recommendation logic – With more than 1,000 high-quality and rich features acquired from various services such as flea market, advertisements, local jobs, and real estate, we are contributing to the advancement of recommendation logic and making it straightforward for all Karrot engineers to explore and add features.
  • Maintaining the reliability of feature data sources – Through the feature platform, we are providing reliable data using a consistent schema and ingestion pipeline.

Karrot engineers are continuously improving the user experience by advancing recommendations through high-quality features through the feature platform. This has contributed to increasing click-through rates by 30% and conversion rates by 70% compared to before by recommending articles that users might be interested in.

This was possible because the AWS services used in the feature platform were firmly supporting it. Amazon DynamoDB has amazing scalability in all aspects of read, write, and storage, so it was possible to handle dynamically changing workloads without incurring separate operating costs. Amazon ElastiCache showed highly reliable service stability, so we could use it with confidence. In addition, it was straightforward and stable to scale up, down, in, and out, so it was possible to reduce the operational burden. It also seamlessly integrated with the ecosystem of Redis OSS, so we could use open source ecosystems such as Redis Exporter. Amazon MSK also supports reliable operation and seamless integration with the Apache Kafka ecosystem, making the development and operation of the feature platform effortless.

Furthermore, working with AWS enables cost-efficient operations based on their various support and expertise. Recently, we had an over-provisioning problem with our ElastiCache cluster. Right-sizing our ElastiCache cluster with various experts (including Solutions Architects) made it possible to optimize ElastiCache costs by nearly 40%. Such technical human resources from AWS have been invaluable in operating the feature platform using AWS products.

Conclusion

In this series, we discussed how Karrot built a feature platform on AWS. We believe that by combining AWS services and our experience, you can develop and operate a feature store without difficulty by modifying it to suit your company’s requirements. Try out this implementation and let us know your thoughts in the comments.


About the authors