[Paper review] PECOS: Predictions for Enormous and Correlated Output Spaces

10 min readDec 5, 2022

0. Introduction
0.1 eXtreme Multi-label Classification(XMC)
0.2 Traditional methods and its challenges
1. What is PECOS?
2. How does PECOS work?
2.1 Semantic label indexing
2.2 Matching (XR-Linear)
2.3 Ranking
2.4 Inference
3. Comparison Study
4. Conclusion
5. References

0. Introduction

This paper review is about a paper that is the basis of open source framework published by Amazon in 2021. It’s is a flexible framework that aims to efficiently solve eXtreme Multi-label Classification problems therefore it has tolerable latency, computation cost for production.

We will briefly go over what eXtreme Multi-label Classification is, how PECOS can mitigate this challenging problem. Each stage of PECOS will be explained with an example.

Enjoy :)

0.1 eXtreme Multi-label Classification (XMC)

Problems that contain extremely large output space for each input. ex: for userA there exists 1 million items to recommend.

XMC (a.k.a. eXtreme Multi-label Ranking(XMR)) problems are commonly seen:

Tagging : where the goal is to read Wikipedia article then predict categories (at the bottom enclosed in green rectangle). There are large possible categories that could be used as label → Enormous output space if not infinite.

2. Product Search : there can be millions of shoes to recommend from.

3. Keyword Recommendation

4. Related Searches

5. Viewed together/ Bought together Recommendations

As you can see there can be extremely large options if not infinite. We refer to these types of problems as eXtreme Multi-label Ranking/Classification(XMR/XMC) and it is active research area.

Real-world applications:

Amazon : Product search and recommendation [7]
Google : Youtube recommendation [7]
Home Depot : Product search [2]

0.2 Traditional methods and its challenges

Traditionally there are two main methods in solving XMC problems:

OvR(One-vs-Rest): train independent binary classifier on each label → during inference choose output of binary classifier that outputted highest score

For each additional label another binary classifier has to be trained → scales linearly with output space

Softmax : training deep learning model with softmax layer to output probability score for each label.

weights need to be updated for each label → slow back propagation

Inference time, memory requirement(feature matrix * label matrix), and training time increases linearly with output space.

Four main bottlenecks when using traditional methods for solving XMC problems

Train time
Inference time
Memory requirement
Long-tail problem

PECOS solves all above bottlenecks by:

Minimize search space (hierarchical clustering)
Compresses data format for efficient training for sparse matrices
Beam search

Experimental result [10] shows efficiency of PECOS over traditional method (OvR).

1. What is PECOS?

Flexible framework open-sourced by Amazon in 2021, aims to solve XMC problems.

PECOS solves four bottlenecks of traditional methods mentioned above in four steps:

1. Semantic Label Indexing : cluster correlated labels.

If clustered K is still large, cluster it again. repeat this process until satisfying K → This process creates Hierarchical label tree, K is hyperparameter.

2. Matching : matches input to most relevant cluster

Match relevant cluster at each layer while traversing down hierarchical label tree.

3. Ranking :

Train only on labels that belong to matched cluster from previous layer.

4. Inference : traverse tree using beam-search with desired beam_width

After 3-stage of training, we have trained binary classifier for each node, traverse down tree by beam_width which saves computations by ranking only on relevant clusters.

Flexibility is also huge strength of PECOS → different algorithms and hyperparameters can be tailored with ease at each stage, so for example: use deep learning for Ranking model if long train time is affordable else linear model.

Before going to in detail of workflow of PECOS, let’s briefly go over it at a high level.

Training phase : 1~3 stage.

Recursively apply semantic label indexing to generate hierarchical label tree → ex: cluster labels, again apply clustering on clusters, repeat the process.
Train on matched clusters while traversing down the hierarchical label tree → all nodes will have trained model.

Inference phase : 4th stage.

Given new input, traverse down the tree using trained model to do matching and ranking. We set number of top K clusters to be matched(sent to next layer) via hyperparameter beam_width.

2. How does PECOS work?

Written in C++, supports python and CLI APIs.

We will go over each stage of PECOS with an example.

Goal: recommend most similar item based on given input(item).

X: item meta data, Y: items that has been bought with X row(item)

2.1 Semantic Label Indexing

Clusters semantically correlated labels, grouping labels with little to none interaction with highly interacted labels mitigating effect of long-tail problem.

PECOS and other methods for solving XMC problem was developed on base idea that many labels are correlated.

Semantic Label Indexing is a 2-step process:

Label representation : represent data(item) as embedding vectors, semantically related should be close to each other in embedding vector space.

If Metadata exists : tf-idf, deep models(item2Vec, GNN, etc…) can be used for learning embeddings
If Metadata does not exist : Positive Instance Indices(PII), Positive Instance Feature Aggregation(PIFA), Label representation via Graph Spectrum(Spectral), Label feature in addition to PIFA(PIFA + LF) → methods used to generate label embedding

2. Indexing/Clustering : Recursively apply K-means, SK-means to generate hierarchical label tree.

It has smaller time complexity then regular K-means : O(nnz(Z) * K * n_iter) → O(nnz(Z) * log_B(K) * n_iter)
nnz: # of non-zeros
K : # of cluster
B : # of partitions
Z : label representation matrix

Now we will apply 1st step, label representation on our example dataset.

Then perform clustering.

C : label-cluster indexing matrix, represents which cluster each label belongs to.

Clustering on clustered labels a.k.a. clustering with b-ary partitions(recursive partitions), we have now generated hierarchical label tree.

Just went through (1) part in fig.1.

2.2 Matching

Match only top-b most relevant clusters from previous layers, labels within matched clusters are moved on to Ranking stage i.e. used to train binary classifier.

Below is pseudocode, don’t worry it will all make sense once we go through it on our example.

Now back to our example, we recursively train starting from root(t=1).

when t = 1

Train two binary classifiers, one for food and one for sports.
input gets inserted into each binary classifier and outputs relevance score.

when t = 2 : *NOTE: (n) means to refer to (n) in fig.1 above.

((3) from fig.1), let b = 2 => match two most relevant cluster. Both food/sport gets matched (if b=1 then only one of them is matched, used for training next layer).
Perform hard negative sampling

binarize() just takes identity function of dot product of Y and C → if dot product of y_i * c_i > 0 = 1 else 0.
(4) Train binary classifier on each labels from matched cluster, then rank them.

when t = 3

(5) match top-b relevant clusters.
use previous layer’s label data(Y) to do hard negative sample on (Y of current layer)

(6) train binary classifier on labels from matched clusters.

Recursive matching and ranking stage is called XR-{rank_model_type}. Now we will explain Ranking stage more in-depth.

2.3 Ranking

Below is the scoring function that is used to obtain top-b matching clusters.

similar to other stages, choice of classifier model is flexible (ex: linear, GBDT, DNN, etc…)

Ranking model expressed as:

and trained with:

above formula could be represented like :

Which allows independent training of columns in figure below.

Two advantages of training columns independently

Each label(column) can be trained in embarrassingly parallel manner
Reduce memory : no need to make all columns the same length which is inefficient when each label are sparse. → compress data so label only contains non-zero elements.

2.4 Inference

When training is done, each node will contain trained binary classifier.

For given input, start from root and search beam_width clusters(labels), rank them and match beam_width clusters to go onto next layer → This method is referred to as Beam Search which improves latency by minimizing the search space.

Another key method used during inference stage is

weight-pruning: For each binary classifier, cut-off weights under given threshold(hyperparameter).

3. Comparison Study

Two comparisons will be made: first, traditional OvR and Parabel which is precursor of PECOS. Second, improvements from Parabel to PECOS.

2.1 Traditional OvR Vs. Parabel(tree-based)

Large improvement in training/inference time with minimal accuracy loss

Minimal loss due to many correlated output labels hence clustering works well and long-tail problem is remedied.

precision, computation, memory result [12]

2.2 Parabel Vs. PECOS

Some improvement of PECOS from Parabel includes : flexible adjustment of number of branched in hierarchical label tree, new data structure (doubly sparse). These improvements allow PECOS to have 5x less latency with almost no performance loss.

Below chart compares difference between Parabel and PECOS at each stage, configurations used during experiment conducting while writing PECOS paper. *Green color => difference between two methods.

PECOS > Parabel advantages:

Branching Factor(B) : allows adjustment of tree depth that could lead to faster inference time.
Doubly sparse data structure : allows efficient computation on sparse matrix (relevance score).
Sophisticated ensemble: Allow heterogeneous(Parabel only offer Homogeneous) ensemble at each stage with ease.
More negative sampling schemes : Two more negative sampling schemes → MAN, TFN+MAN which increase performance with cost of longer train time however again, it can be easily adjusted by using simpler negative sampling scheme if you want shorter training time.

3. Conclusion

2% performance loss with benefits of 915% reduced training time, 1640% reduced inference time compared to OvR.
500% latency time reduction with no performance loss compared to Parabel.

Solves 4 problems faced with traditional methods for solving XMC:

Training time : hard-negative sampling, matching, doubly sparse data format contribute to reduced training time
Inference Time : beam search, weight-pruning, data format.
Memory : data format, reduced search space
Long-tail problem : reduce sparsity by clustering.

Other strengths

Modularity : Each stage of PECOS is modular.
Flexibility : Can choose simple → complex model at each stage depending on your needs
Customization : Open sourced
Active research area

What’s next? → Ongoing researches

Clustering with label belonging to multiple cluster. ex: apple can be in food and electronics cluster.
Online learning of hierarchical label tree. → Currently, in order to update hierarchical label tree stage-1 needs to be ran again.
Expand output space to be infinite.
Generate relevant label (GAN)