[Paper review] PECOS: Predictions for Enormous and Correlated Output Spaces
Table of Contents
0. Introduction
0.1 eXtreme Multi-label Classification(XMC)
0.2 Traditional methods and its challenges1. What is PECOS?
2. How does PECOS work?
2.1 Semantic label indexing
2.2 Matching (XR-Linear)
2.3 Ranking
2.4 Inference3. Comparison Study
4. Conclusion
5. References
0. Introduction
This paper review is about a paper that is the basis of open source framework published by Amazon in 2021. It’s is a flexible framework that aims to efficiently solve eXtreme Multi-label Classification problems therefore it has tolerable latency, computation cost for production.
We will briefly go over what eXtreme Multi-label Classification is, how PECOS can mitigate this challenging problem. Each stage of PECOS will be explained with an example.
Enjoy :)
0.1 eXtreme Multi-label Classification (XMC)
Problems that contain extremely large output space for each input. ex: for userA there exists 1 million items to recommend.
XMC (a.k.a. eXtreme Multi-label Ranking(XMR)) problems are commonly seen:
- Tagging : where the goal is to read Wikipedia article then predict categories (at the bottom enclosed in green rectangle). There are large possible categories that could be used as label → Enormous output space if not infinite.
2. Product Search : there can be millions of shoes to recommend from.
3. Keyword Recommendation
4. Related Searches
5. Viewed together/ Bought together Recommendations
As you can see there can be extremely large options if not infinite. We refer to these types of problems as eXtreme Multi-label Ranking/Classification(XMR/XMC) and it is active research area.
Real-world applications:
- Amazon : Product search and recommendation [7]
- Google : Youtube recommendation [7]
- Home Depot : Product search [2]
0.2 Traditional methods and its challenges
Traditionally there are two main methods in solving XMC problems:
OvR(One-vs-Rest): train independent binary classifier on each label → during inference choose output of binary classifier that outputted highest score
- For each additional label another binary classifier has to be trained → scales linearly with output space
Softmax : training deep learning model with softmax layer to output probability score for each label.
- weights need to be updated for each label → slow back propagation
Inference time, memory requirement(feature matrix * label matrix), and training time increases linearly with output space.
Four main bottlenecks when using traditional methods for solving XMC problems
- Train time
- Inference time
- Memory requirement
- Long-tail problem
PECOS solves all above bottlenecks by:
- Minimize search space (hierarchical clustering)
- Compresses data format for efficient training for sparse matrices
- Beam search
Experimental result [10] shows efficiency of PECOS over traditional method (OvR).
1. What is PECOS?
Flexible framework open-sourced by Amazon in 2021, aims to solve XMC problems.
PECOS solves four bottlenecks of traditional methods mentioned above in four steps:
1. Semantic Label Indexing : cluster correlated labels.
- If clustered K is still large, cluster it again. repeat this process until satisfying K → This process creates Hierarchical label tree, K is hyperparameter.
2. Matching : matches input to most relevant cluster
- Match relevant cluster at each layer while traversing down hierarchical label tree.
3. Ranking :
- Train only on labels that belong to matched cluster from previous layer.
4. Inference : traverse tree using beam-search with desired beam_width
- After 3-stage of training, we have trained binary classifier for each node, traverse down tree by
beam_width
which saves computations by ranking only on relevant clusters.
Flexibility is also huge strength of PECOS → different algorithms and hyperparameters can be tailored with ease at each stage, so for example: use deep learning for Ranking model if long train time is affordable else linear model.
Before going to in detail of workflow of PECOS, let’s briefly go over it at a high level.
Training phase : 1~3 stage.
- Recursively apply semantic label indexing to generate hierarchical label tree → ex: cluster labels, again apply clustering on clusters, repeat the process.
- Train on matched clusters while traversing down the hierarchical label tree → all nodes will have trained model.
Inference phase : 4th stage.
- Given new input, traverse down the tree using trained model to do matching and ranking. We set number of top K clusters to be matched(sent to next layer) via hyperparameter
beam_width
.
2. How does PECOS work?
- Written in C++, supports python and CLI APIs.
We will go over each stage of PECOS with an example.
Goal: recommend most similar item based on given input(item).
2.1 Semantic Label Indexing
- Clusters semantically correlated labels, grouping labels with little to none interaction with highly interacted labels mitigating effect of long-tail problem.
- PECOS and other methods for solving XMC problem was developed on base idea that many labels are correlated.
Semantic Label Indexing is a 2-step process:
- Label representation : represent data(item) as embedding vectors, semantically related should be close to each other in embedding vector space.
- If Metadata exists : tf-idf, deep models(item2Vec, GNN, etc…) can be used for learning embeddings
- If Metadata does not exist : Positive Instance Indices(PII), Positive Instance Feature Aggregation(PIFA), Label representation via Graph Spectrum(Spectral), Label feature in addition to PIFA(PIFA + LF) → methods used to generate label embedding
2. Indexing/Clustering : Recursively apply K-means, SK-means to generate hierarchical label tree.
- It has smaller time complexity then regular K-means : O(nnz(Z) * K * n_iter) → O(nnz(Z) * log_B(K) * n_iter)
- nnz: # of non-zeros
- K : # of cluster
- B : # of partitions
- Z : label representation matrix
Now we will apply 1st step, label representation on our example dataset.
Then perform clustering.
Clustering on clustered labels a.k.a. clustering with b-ary partitions(recursive partitions), we have now generated hierarchical label tree.
Just went through (1) part in fig.1
.
2.2 Matching
Match only top-b most relevant clusters from previous layers, labels within matched clusters are moved on to Ranking stage i.e. used to train binary classifier.
Below is pseudocode, don’t worry it will all make sense once we go through it on our example.
Now back to our example, we recursively train starting from root(t=1).
when t = 1
- Train two binary classifiers, one for food and one for sports.
- input gets inserted into each binary classifier and outputs relevance score.
when t = 2 : *NOTE: (n) means to refer to (n) in fig.1
above.
- ((3) from
fig.1
), let b = 2 => match two most relevant cluster. Both food/sport gets matched (if b=1 then only one of them is matched, used for training next layer). - Perform hard negative sampling
binarize()
just takes identity function of dot product of Y and C → if dot product of y_i * c_i > 0 = 1 else 0.- (4) Train binary classifier on each labels from matched cluster, then rank them.
when t = 3
- (5) match top-b relevant clusters.
- use previous layer’s label data(Y) to do hard negative sample on (Y of current layer)
- (6) train binary classifier on labels from matched clusters.
Recursive matching and ranking stage is called XR-{rank_model_type}. Now we will explain Ranking stage more in-depth.
2.3 Ranking
Below is the scoring function that is used to obtain top-b matching clusters.
similar to other stages, choice of classifier model is flexible (ex: linear, GBDT, DNN, etc…)
Ranking model expressed as:
and trained with:
above formula could be represented like :
Which allows independent training of columns in figure below.
Two advantages of training columns independently
- Each label(column) can be trained in embarrassingly parallel manner
- Reduce memory : no need to make all columns the same length which is inefficient when each label are sparse. → compress data so label only contains non-zero elements.
2.4 Inference
When training is done, each node will contain trained binary classifier.
For given input, start from root and search beam_width
clusters(labels), rank them and match beam_width
clusters to go onto next layer → This method is referred to as Beam Search which improves latency by minimizing the search space.
Another key method used during inference stage is
- weight-pruning: For each binary classifier, cut-off weights under given threshold(hyperparameter).
3. Comparison Study
Two comparisons will be made: first, traditional OvR and Parabel which is precursor of PECOS. Second, improvements from Parabel to PECOS.
2.1 Traditional OvR Vs. Parabel(tree-based)
Large improvement in training/inference time with minimal accuracy loss
- Minimal loss due to many correlated output labels hence clustering works well and long-tail problem is remedied.
2.2 Parabel Vs. PECOS
- Some improvement of PECOS from Parabel includes : flexible adjustment of number of branched in hierarchical label tree, new data structure (doubly sparse). These improvements allow PECOS to have 5x less latency with almost no performance loss.
Below chart compares difference between Parabel and PECOS at each stage, configurations used during experiment conducting while writing PECOS paper. *Green color => difference between two methods.
PECOS > Parabel advantages:
- Branching Factor(B) : allows adjustment of tree depth that could lead to faster inference time.
- Doubly sparse data structure : allows efficient computation on sparse matrix (relevance score).
- Sophisticated ensemble: Allow heterogeneous(Parabel only offer Homogeneous) ensemble at each stage with ease.
- More negative sampling schemes : Two more negative sampling schemes → MAN, TFN+MAN which increase performance with cost of longer train time however again, it can be easily adjusted by using simpler negative sampling scheme if you want shorter training time.
3. Conclusion
- 2% performance loss with benefits of 915% reduced training time, 1640% reduced inference time compared to OvR.
- 500% latency time reduction with no performance loss compared to Parabel.
Solves 4 problems faced with traditional methods for solving XMC:
- Training time : hard-negative sampling, matching, doubly sparse data format contribute to reduced training time
- Inference Time : beam search, weight-pruning, data format.
- Memory : data format, reduced search space
- Long-tail problem : reduce sparsity by clustering.
Other strengths
- Modularity : Each stage of PECOS is modular.
- Flexibility : Can choose simple → complex model at each stage depending on your needs
- Customization : Open sourced
- Active research area
What’s next? → Ongoing researches
- Clustering with label belonging to multiple cluster. ex: apple can be in food and electronics cluster.
- Online learning of hierarchical label tree. → Currently, in order to update hierarchical label tree stage-1 needs to be ran again.
- Expand output space to be infinite.
- Generate relevant label (GAN)
4. References
- Multi-Scale Methods for Machine Learning
- Semantic product search — Vector Search for E-Commerce (Haystack 2021)
- **Extreme Classification Repository → No PECOS as of 20221026 however lots of other XMC methods w/ code
- Applying PECOS to product retrieval and text autocompletion
- PECOS open-source code
- Amazon pushes the boundaries of extreme multilabel classification
- DeepXML: A Framework for Deep Extreme Multi-label Learning by Manik Varma
- (2021) DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents
- (2021) Extreme Multi-label Learning for Semantic Matching in product search
- (2022) PECOS
- (2016) DiSMEC — Distributed Sparse Machines for Extreme Multi-label Classification
- (2018) Parabel — Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising