[Paper review] Monolith: TikTok’s Real-time Recommender System.
Table of Contents
0. Introduction
1. Data Differences in Recommender Systems
1.1 Sparsity and Dynamism
1.2 Concept Drift
2. Design
2.1 Hash Table
2.2 Online Training
2.3 Fault Tolerance
3. Conclusion
0. Introduction
A “real-time” recommender system consists of two parts that can be developed independently of each other: real-time inference and real-time training. Their challenges and possible remedies are briefly covered below before going into the paper review.
Real-time inference : Goal is to reduce latency.
Challenges :
- large search space → for each user there are too many items to compute a relevance score for.
Remedy :
- ANN : build an index over item embeddings, retrieve the top-k items closest to the given user, then compute relevance scores on only those top-k items (a minimal sketch follows this list).
- Clustering : match the user to the closest cluster, then compute relevance scores only on items within that cluster.
- Methods that improve on the above, such as PECOS, Parabel, and other approaches to XMC (extreme multi-label classification) problems.
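To make the retrieve-then-rank idea concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper). Brute-force top-k stands in for a real ANN index (FAISS, ScaNN, etc.), and expensive_relevance_model is a placeholder for the full ranking model.

```python
import numpy as np

def expensive_relevance_model(user_emb, cand_embs):
    # Placeholder for a full DNN ranker; here just a dot product for illustration.
    return cand_embs @ user_emb

def retrieve_then_rank(user_emb, item_embs, k=100):
    """Two-stage scoring: cheap top-k retrieval, then a costly ranker on k items only."""
    # Stage 1: retrieval. A real system queries an ANN index; we brute-force it here.
    sims = item_embs @ user_emb                      # similarity to every item
    top_k = np.argpartition(-sims, k)[:k]            # indices of the k closest items
    # Stage 2: ranking. Run the expensive model only on the k candidates.
    scores = expensive_relevance_model(user_emb, item_embs[top_k])
    return top_k[np.argsort(-scores)]                # candidate indices, best first
```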
Real-time training : Goal is to make predictions with the most recent data ⇒ update the model as soon as possible.
Challenges :
- Re-training on historical data plus new data takes a long time.
- Training only on the new data to learn new parameters makes the model short-sighted.
- Parameter updates have to happen while the model is serving requests.
Remedy :
- Train on less data while updating (NOT overriding) the existing parameters.
- Update parameters incrementally (covered more in-depth later; a toy sketch follows this list).
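A toy sketch of the “update, don’t re-initialize” idea, under my own simplified assumptions (a linear model with a squared-error loss); it only illustrates taking a few gradient steps on fresh data while keeping the previously trained weights as the starting point.

```python
import numpy as np

def incremental_update(w, new_X, new_y, lr=0.01, steps=5):
    """Refine existing weights on only the newest data instead of retraining from scratch."""
    for _ in range(steps):
        pred = new_X @ w
        grad = new_X.T @ (pred - new_y) / len(new_y)   # squared-error gradient
        w = w - lr * grad                               # update the existing parameters
    return w
```

The existing weights carry the history; a handful of steps on fresh data adapts them without discarding everything (though training too long on only new data brings back the short-sightedness problem above).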
Other challenges
- Separate batch-training and serving stages : the common architecture in most deep learning frameworks, which prevents the model from interacting with customer feedback in real time.
Monolith focuses on real-time training and offers the following benefits:
- Reduced memory footprint : via a collision-less embedding table with optimizations such as expirable embeddings and frequency filtering.
- Fault tolerance via its architecture : follows TensorFlow’s distributed Worker-ParameterServer setting.
- An overall architecture that allows efficient parameter updates while still serving requests.
1. Data Differences in Recommender Systems
Real-world data in recommender systems differ from the data used in conventional deep learning problems such as language modeling and computer vision in two aspects:
- Sparsity and Dynamism : features are sparse, categorical, and dynamically changing.
- Concept Drift : distribution of training data is non-stationary.
These are some challenges to consider when working on recommender systems.
1.1 Sparsity and Dynamism
RecSys data consist mostly of sparse categorical features (userID, itemID, etc.), and each user interacts with only a few items.
The traditional way to handle sparse categorical features is to map them into a high-dimensional embedding space (e.g., the rating matrix is decomposed into user and item embedding matrices), which gives rise to memory issues.
Out-of-memory : memory grows linearly with every item/user added, and since there is a constant stream of new items and users, this soon leads to memory issues.
Low-collision hashing : a traditional way to save memory, but it relies on the over-idealistic assumptions that IDs in the embedding table are distributed evenly in frequency and that collisions are harmless to model quality, neither of which holds in practice.
As the embedding table grows, the chance of a hash-key collision increases, leading to poor model quality. To avoid this, Monolith offers a hash table that:
- captures more features in its parameters
- elastically adjusts the number of user/item IDs in the embedding table → gets rid of old/outdated user/item IDs.
1.2 Concept Drift
A shift in the underlying distribution: the training data and the new online data follow different distributions.
- Ex1: User interests change over time ⇒ their underlying distribution shifts. → Therefore, to capture these changes, the model must be updated as quickly as possible.
- Ex2: Item A initially had a left-skewed distribution over [0, 1] (representing the probability of a view), but after a few months its distribution becomes right-skewed ⇒ a lower average probability of being viewed.
- The concept drift problem is handled by the parameter synchronization scheme offered by Monolith.
2. Design
Here we discuss the design of Monolith that overcomes the challenges of working with real-world data in recommender systems.
The overall architecture follows TF’s distributed Worker-ParameterServer setting.
- Worker : runs the computation defined by the graph ⇒ computes gradients.
- Parameter Server : stores/updates parameters using the gradients computed by the Worker machines.
In RecSys there are two types of parameters:
- dense : weights/variables in the DNN → stored as tf.Variable
- sparse : embedding table that corresponds to sparse features → stored using collision-less hash table.
Both are stored in the Parameter Server (a rough sketch of the two kinds follows).
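A rough, self-invented sketch of what the two parameter kinds could look like on the parameter-server side: dense weights as ordinary arrays, sparse embeddings as a keyed table that grows with new IDs instead of hashing them into a fixed number of slots. This is illustrative Python, not Monolith’s actual PS code.

```python
import numpy as np

class ToyParameterServer:
    """Toy PS holding dense DNN weights and a keyed (collision-less) embedding table."""
    def __init__(self, dense_shapes, emb_dim=16):
        self.dense = {name: np.zeros(shape) for name, shape in dense_shapes.items()}
        self.sparse = {}                       # id -> embedding vector, created on first use
        self.emb_dim = emb_dim

    def lookup(self, ids):
        # A new ID gets its own fresh embedding instead of colliding into an existing slot.
        for i in ids:
            if i not in self.sparse:
                self.sparse[i] = np.random.normal(scale=0.01, size=self.emb_dim)
        return np.stack([self.sparse[i] for i in ids])

    def apply_gradients(self, dense_grads, sparse_grads, lr=0.01):
        for name, g in dense_grads.items():    # dense update (tf.Variable in the real system)
            self.dense[name] -= lr * g
        for i, g in sparse_grads.items():      # sparse update touches only the seen IDs
            self.sparse[i] -= lr * g
```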
2.1 Hash Table
Two design choices make the embedding table in Monolith:
- Collision-less : by utilizing a Cuckoo Hashmap
- Memory-efficient : by adding ID-filtering methods
Collision-less
A key-value hash table for the sparse parameter representations; it utilizes a Cuckoo Hashmap, which avoids collisions.
look-up, delete, insert (amortized) : O(1)
- Quick fact: “amortized” comes from amortized analysis, which computes the average cost of a single operation; insertion therefore costs O(1) on average.
- Two tables T_0, T_1 with different hash functions h_0(x), h_1(x); an element is stored in one of the two.
- To insert A, it first attempts to place A at position h_0(A) in T_0.
- If that slot is occupied by another element B, B is evicted from T_0 and re-inserted into T_1 with the same logic.
- This process repeats until all elements stabilize, or a rehash happens when the insertion runs into a cycle (a compact sketch follows).
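A compact Python sketch of the insertion logic described above (my own illustration; the table-growth policy on rehash and the hash functions are simplified stand-ins for a real implementation).

```python
class CuckooHash:
    """Two tables with two hash functions; evicted entries are re-inserted into the other table."""
    def __init__(self, size=8, max_kicks=32):
        self.size, self.max_kicks = size, max_kicks
        self.tables = [[None] * size, [None] * size]

    def _h(self, t, key):
        return hash((t, key)) % self.size             # stand-in for the two hash functions h_0, h_1

    def insert(self, key, value):
        entry, t = (key, value), 0
        for _ in range(self.max_kicks):
            slot = self._h(t, entry[0])
            entry, self.tables[t][slot] = self.tables[t][slot], entry   # place entry, evict occupant
            if entry is None:                          # the slot was empty: done
                return
            t = 1 - t                                  # the evicted entry goes to the other table
        displaced = entry
        self._rehash()                                 # likely a cycle: grow the tables and rehash
        self.insert(*displaced)

    def _rehash(self):
        old = [e for table in self.tables for e in table if e is not None]
        self.size *= 2
        self.tables = [[None] * self.size, [None] * self.size]
        for k, v in old:
            self.insert(k, v)

    def get(self, key):
        # A key can only live at h_0(key) in T_0 or h_1(key) in T_1, so lookup is O(1).
        for t in (0, 1):
            e = self.tables[t][self._h(t, key)]
            if e is not None and e[0] == key:
                return e[1]
        return None
```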
Memory-efficient
Memory is kept in check by filtering IDs before and after insertion into the embedding table. Three filtering methods (a rough sketch follows this list):
- Frequency filter (the frequency threshold is a hyperparameter) : an ID is admitted into the embedding table only once its number of occurrences reaches the threshold (it’s not clear to me whether the filtering happens independently per embedding table, or what a lookup returns for an ID that hasn’t been admitted).
- Probabilistic filter
- Expire filter (post-filter) : IDs that have been inactive for a certain amount of time get removed.
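A rough sketch of how the frequency and expire filters might wrap the embedding table; the threshold names, the bookkeeping, and the per-ID timestamps are my own assumptions, not Monolith’s implementation, and the probabilistic filter is omitted.

```python
import time
from collections import defaultdict

class FilteredEmbeddingTable:
    """Only admit IDs seen often enough; evict IDs that have been inactive too long."""
    def __init__(self, freq_threshold=5, expire_seconds=7 * 24 * 3600):
        self.freq_threshold = freq_threshold      # hyperparameter: occurrences before admission
        self.expire_seconds = expire_seconds      # hyperparameter: inactivity before eviction
        self.counts = defaultdict(int)            # occurrence counts for not-yet-admitted IDs
        self.table = {}                           # id -> (embedding, last_seen_timestamp)

    def observe(self, item_id, make_embedding):
        now = time.time()
        if item_id in self.table:
            emb, _ = self.table[item_id]
            self.table[item_id] = (emb, now)      # refresh the activity timestamp
            return emb
        self.counts[item_id] += 1
        if self.counts[item_id] >= self.freq_threshold:   # frequency filter passed
            self.table[item_id] = (make_embedding(), now)
            del self.counts[item_id]
            return self.table[item_id][0]
        return None                               # still filtered out: no embedding yet

    def expire(self):
        now = time.time()
        stale = [i for i, (_, t) in self.table.items() if now - t > self.expire_seconds]
        for i in stale:                           # expire filter: drop inactive IDs
            del self.table[i]
```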
2.2 Online Training
Two stages : Batch, Online.
Batch training stage
- (1) In each training step, the Training Worker (TW) requests parameters from the Training Parameter Server (Training PS), computes a forward/backward pass, then sends the updated parameters back to the Training PS.
- Training runs for only one epoch (my guess: to mimic the online setting? The paper doesn’t expand on the reasoning).
Online training stage
- Starts once the trained model from the batch training stage has been deployed.
- (2) The TW trains on online streaming data, then updates the Training PS.
- (3) The Training PS periodically syncs its parameters to the Serving PS (the sync interval is a hyperparameter) → the update takes effect on the user side almost immediately (a toy outline of both stages follows).
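A toy outline of the two stages, with plain dictionaries standing in for the two parameter servers; grad_fn, the single dense weight vector, and the full-copy sync are my own simplifications, not the paper’s design (the real sync is incremental, as described under Parameter Synchronization below).

```python
import numpy as np

def sync(train_params, serving_params):
    # Copy training parameters to the serving side (a full copy, for simplicity).
    serving_params.update({k: v.copy() for k, v in train_params.items()})

def run_training(grad_fn, batch_data, stream, sync_interval=100, lr=0.01):
    """Toy outline: one batch pass over history, then continuous online updates."""
    train_params = {"w": np.zeros(16)}            # stand-in for the model parameters
    for x, y in batch_data:                       # batch stage: a single epoch
        train_params["w"] -= lr * grad_fn(train_params["w"], x, y)

    serving_params = {}
    sync(train_params, serving_params)            # deploy: serving starts from the batch model

    for step, (x, y) in enumerate(stream):        # online stage: train on streaming data
        train_params["w"] -= lr * grad_fn(train_params["w"], x, y)
        if step % sync_interval == 0:             # the sync interval is a hyperparameter
            sync(train_params, serving_params)    # updates reach users almost immediately
    return serving_params
```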
Streaming Engine
This is the part of the architecture that retrieves online data (user actions and the items that were recommended) and generates the training examples passed to both the online and batch training stages.
- User actions and features (of the recommended items) come in from separate Kafka queues → concatenated by the Online Joiner (Flink) to produce training data.
- Consumed by both the online and batch training stages (the batch path stores the data in HDFS and sends it to the TW once enough data has accumulated).
Online Joiner : part of the streaming engine.
Upon a user action, it first looks for the features matching that action (paired by a unique key) in an in-memory cache, then falls back to on-disk key-value storage. This is because keeping all features (for every item the user viewed) in memory until an action arrives would be very memory-intensive (a minimal lookup sketch follows).
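A minimal sketch of that two-level lookup, with a plain dict as the in-memory cache and Python’s shelve module standing in for the on-disk key-value store; the field names (request_key, label) are made up for illustration.

```python
import shelve

def join_action_with_features(action, cache, kv_path="features.db"):
    """Attach the logged item features to a user action: cache first, then disk."""
    key = action["request_key"]                   # unique key shared by the action and its features
    features = cache.get(key)
    if features is None:                          # cache miss -> fall back to the on-disk KV store
        with shelve.open(kv_path) as kv:
            features = kv.get(key)
    if features is None:
        return None                               # features no longer available; drop the example
    return {"features": features, "label": action["label"]}
```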
A log odds correction [6] has also been added to the serving stage to ensure the online model does not suffer from the bias caused by negative sampling (a hedged calibration example follows).
- Negative sampling means downsampling the negative targets, since positive targets are usually scarce relative to negatives; however, this changes the distribution the model is trained on, biasing it towards predicting a higher probability for the positive class.
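As an illustration (my own, not necessarily the exact formula used in the paper, which follows reference [6]): if only a fraction w of the negatives is kept during training, the model’s predicted odds are inflated by 1/w, and rescaling the odds back gives a standard calibration.

```python
def correct_sampled_prediction(p, neg_keep_rate):
    """Calibrate a probability from a model trained with downsampled negatives.

    If only a fraction `neg_keep_rate` of negatives was kept, the learned odds p/(1-p)
    are too high by a factor of 1/neg_keep_rate; rescaling the odds undoes the bias.
    """
    odds = p / (1.0 - p)
    corrected_odds = odds * neg_keep_rate
    return corrected_odds / (1.0 + corrected_odds)   # same as p / (p + (1 - p) / neg_keep_rate)
```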
Parameter Synchronization (Training PS → Serving PS)
Challenges of synchronization:
- the model needs to continue serving while it is being updated.
- transferring the model from the Training PS to the Serving PS requires a lot of network bandwidth and memory (most of it for the sparse parameters).
To overcome these challenges, Monolith uses incremental, periodic parameter synchronization (a small sketch follows this list):
- For sparse parameters → sync only the IDs that had interactions since the last sync.
- For dense parameters → sync at a lower frequency → dense variables are observed to change with smaller magnitude, so a longer sync schedule is tolerable.
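A small sketch of what the incremental sync might look like, with dicts standing in for the two parameter servers; touched_ids, dense_every, and the copy-based transfer are my own assumptions for illustration.

```python
def incremental_sync(step, train_sparse, serving_sparse, touched_ids,
                     train_dense, serving_dense, dense_every=10):
    """Push only what changed: touched embedding IDs on every sync, dense weights less often."""
    for i in touched_ids:                          # sparse: only IDs seen since the last sync
        serving_sparse[i] = train_sparse[i].copy()
    touched_ids.clear()
    if step % dense_every == 0:                    # dense: a lower sync frequency is tolerable
        for name, w in train_dense.items():
            serving_dense[name] = w.copy()
```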
2.3 Fault Tolerance
- Fault tolerance of the PS is handled by taking snapshots of the model state (a minimal sketch follows).
- Trade-off: more frequent snapshots ⇒ better model quality on recovery (+), but more computation (-).
- Monolith takes daily snapshots → after a failure, the recovered parameters can lag by up to one day, but experiments showed the performance loss was tolerable.
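A minimal sketch of periodic snapshot/restore using pickle; the file naming and the use of pickle are my own choices for illustration, not Monolith’s snapshot format.

```python
import pickle
import time

def snapshot(params, path_prefix="ps_snapshot"):
    """Write a recoverable copy of the parameter state (Monolith does this daily)."""
    path = f"{path_prefix}_{int(time.time())}.pkl"
    with open(path, "wb") as f:
        pickle.dump(params, f)
    return path

def restore(path):
    """On PS failure, reload the latest snapshot; updates since then are lost (at most ~a day)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```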
3. Conclusion
- This paper focuses mostly on the engineering side.
- The overall architecture seems flexible → candidate generation (or other search algorithms) could be added in the model-server region to reduce latency.
- Real-time recommenders require a lot of engineering work, much more than batch recommenders.
- After reading this, I thought “maybe batch recommender systems aren’t bad after all”. You really need to think thoroughly about whether shifting from a batch recommender system to a real-time one is really necessary…
References
- Monolith: Real Time Recommendation System with Collisionless Embedding Table
- Core modeling at Instagram → talks about pooling and hashing
- RecSys 2016 video : Deep Neural Networks for YouTube Recommendations
- (paper) Deep Neural Networks for YouTube Recommendations
- Fully Understanding the Hashing Trick | NeurIPS 2018
- Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data
- What is Hash Table and how it works