Matrix Factorization Part I: Understanding all the processes and how the stochastic gradient descent step is derived

Haneul Kim
6 min read · Aug 20, 2022

Matrix factorization is one of the traditional algorithms in recommender systems. It was popularized by Simon Funk during the 2006 Netflix Prize and has been widely used and extended since. This article assumes that readers are familiar with recommender systems, so the different types of recommender systems are not explained in depth.

The main focus is to truly understand each step in mathematical detail.

Here is the agenda:

1. Introduction

2. What is matrix factorization

3. Learning Algorithms

4. Further improvements

Introduction

There are two main types of recommender systems: content filtering and collaborative filtering.

Content filtering: based on a user’s previous interactions, recommend similar items (by looking at their metadata, e.g. price, size, category, etc.).

Collaborative filtering: analyzes past user–item interactions to find relationships between users and items. It is split into two categories:

  • nearest neighbor
  • latent factor models, which try to learn item and user embedding vectors from the rating matrix.

What is Matrix Factorization

Matrix factorization belongs to the latent factor model family. It tries to learn item (q_i) and user (p_u) embedding vectors that are mapped to the same embedding vector space, and user–item interactions are modeled with the inner product. For example, the estimate of user u’s rating of item i can be formulated as:

r̂_ui = q_i^T · p_u

The main goal is to learn the embedding vectors q_i and p_u.

Note that the rating matrix is sparse (e.g. each user rates only a small subset of all items). Since conventional SVD cannot learn from an incomplete matrix, we must make some modifications:

  1. Impute the missing values: expensive, since it greatly increases the amount of data, and the imputation may be inaccurate.
  2. Train only on the observed (dense) part: this is prone to overfitting.

Recent work suggests training only on the observed part while avoiding overfitting by adding a regularization term.

Learning Algorithms

We learn by minimizing the loss function below, where K is the set of (u, i) pairs with known ratings and λ controls the strength of the regularization:

min over q, p of  Σ_{(u,i)∈K} (r_ui - q_i^T · p_u)^2 + λ · (||q_i||^2 + ||p_u||^2)

There are two ways of learning: stochastic gradient descent (SGD) and alternating least squares (ALS).

Note: This article covers SGD in depth but not ALS; ALS will be covered in a subsequent article.

Stochastic Gradient Descent

For each observed rating in the rating matrix, the system predicts r̂_ui and calculates the error:

e_ui = r_ui - q_i^T · p_u

and updates the embedding vectors (a gradient descent step) as follows, where γ is the learning rate and λ the regularization constant:

q_i ← q_i + γ · (e_ui · p_u - λ · q_i)
p_u ← p_u + γ · (e_ui · q_i - λ · p_u)

Now let’s truly understand how the above gradient descent step is derived.

For simplicity we will ignore the regularization term; however, once we derive the gradient descent step, you will see how the regularization term is handled as well.

We know that gradient descent works as follows: for each parameter we want to learn, take the partial derivative of the loss function w.r.t. that parameter (the gradient) and update the parameter in the opposite direction, multiplied by the learning rate.

In matrix factorization, all elements of the user and item embedding vectors are parameters, so there are a total of n_user*emb_dim + n_item*emb_dim learnable parameters (e.g. 1,000 users and 500 items with emb_dim = 3 give 4,500 parameters).

For better understanding, here is a small example rating matrix (the values below are hypothetical; “?” marks an unobserved rating):

            item_1  item_2  item_3  item_4
  user_1      5       3       ?       1
  user_2      4       ?       ?       1
  user_3      1       1       ?       5
  user_4      ?       1       5       4

From the above matrix we want to learn the user (p_u) and item (q_i) embedding vectors. Note that the user and item embedding vectors belong to the same vector space, and the dimension of that space is set by the data scientist; it is a hyperparameter.

Each row of the user matrix P is an embedding vector p_u representing one user, and each column of the transposed item matrix Q^T is an embedding vector q_i representing one item.

We will set the embedding dimension to 3.

Now let’s update one of the parameters, p_11 (the first element of user 1’s embedding vector). Suppose user 1 has rated item 1, so r_11 is observed and the prediction is r̂_11 = Σ_k p_1k · q_1k. Ignoring regularization, the loss for this single rating is

L = (r_11 - Σ_{k=1..3} p_1k · q_1k)^2

Taking the partial derivative of L with respect to p_11 (only the k = 1 term of the sum depends on p_11):

∂L/∂p_11 = -2 · (r_11 - Σ_k p_1k · q_1k) · q_11 = -2 · e_11 · q_11

Moving p_11 in the opposite direction of the gradient, and absorbing the constant 2 into the learning rate γ, gives the update

p_11 ← p_11 + γ · e_11 · q_11

Similarly, we update the remaining elements of the vector p_1, namely p_12 and p_13:

p_12 ← p_12 + γ · e_11 · q_12
p_13 ← p_13 + γ · e_11 · q_13

You can see that the error term is the same for each element of the vector p_1, therefore we can simply vectorize this operation into

p_1 ← p_1 + γ · e_11 · q_1

Here we just derived the update process.

Exactly the same process is applied to the item embedding vectors, with the roles of p_u and q_i swapped.
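To make the update rules concrete, here is a minimal NumPy sketch of SGD matrix factorization. It is only an illustrative sketch, not the implementation linked in the conclusion; the function name, hyperparameter values, and toy ratings are made up for the example.

```python
import numpy as np

def sgd_matrix_factorization(ratings, emb_dim=3, lr=0.01, reg=0.02, n_epochs=20, seed=42):
    """Sketch of SGD matrix factorization.

    ratings: list of (user_index, item_index, rating) tuples for the
             observed entries of the rating matrix only.
    Returns P (n_users x emb_dim) and Q (n_items x emb_dim).
    """
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1

    # Initialize user and item embeddings with small random values.
    P = rng.normal(scale=0.1, size=(n_users, emb_dim))
    Q = rng.normal(scale=0.1, size=(n_items, emb_dim))

    for _ in range(n_epochs):
        for u, i, r_ui in ratings:
            # Prediction error e_ui = r_ui - q_i^T p_u.
            e_ui = r_ui - Q[i] @ P[u]
            # Vectorized gradient descent updates with L2 regularization.
            p_u = P[u].copy()  # keep the pre-update values for the q_i update
            P[u] += lr * (e_ui * Q[i] - reg * p_u)
            Q[i] += lr * (e_ui * p_u - reg * Q[i])
    return P, Q

# Hypothetical toy usage: three users, three items, a few observed ratings.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 2, 1.0)]
P, Q = sgd_matrix_factorization(ratings)
print(P @ Q.T)  # reconstructed (predicted) rating matrix
```

Note that each pair of updates uses the pre-update copy of p_u, so q_i is updated with the same values that produced the error e_ui.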

Alternating Least Squares

ALS rotates between fixing q_i and p_u, meaning there are two least-squares problems that we optimize one at a time: when all the p_u are fixed, the system recomputes the q_i’s by solving a least-squares problem, and vice versa.
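As a sketch of what one alternating step looks like (using R(i) for the set of users who rated item i, and λ for the regularization constant), fixing all p_u gives each q_i a closed-form ridge-regression solution:

q_i = ( Σ_{u∈R(i)} p_u · p_u^T + λ · I )^(-1) · Σ_{u∈R(i)} r_ui · p_u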

Advantages over the SGD learning method:

  1. Can be parallelized.
  • Works better with implicit data: implicit feedback makes the interaction matrix effectively dense, so looping over every single training case (as SGD does) becomes impractical, whereas ALS handles it efficiently.

Further Improvements

Adding Bias

Some users may rate higher on average, and some items may get rated higher than other, similar items; these are biases in the system. To take these biases into account, the prediction function is modified:
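With μ denoting the overall average rating, the biased prediction takes the standard form:

r̂_ui = μ + b_i + b_u + q_i^T · p_u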

  • μ : the overall/global average rating
  • b_i, b_u : the observed deviations of item i and user u from the average (e.g. b_i = item i’s average rating - μ)

The loss function is also modified accordingly:
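With the bias terms included, the regularized squared error we minimize becomes:

min over p, q, b of  Σ_{(u,i)∈K} (r_ui - μ - b_u - b_i - q_i^T · p_u)^2 + λ · (||p_u||^2 + ||q_i||^2 + b_u^2 + b_i^2)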

Adding input sources

Some users may have very few interactions (the cold-start problem), which makes it hard to infer their preferences. To alleviate this, two additional sources of data may be added:

  1. Implicit feedback data:
  • N(u) denotes the set of items for which user_u expressed an implicit preference.
  • A user who showed a preference for the items in N(u) is characterized by the vector Σ_{i∈N(u)} x_i, where x_i is the implicit item factor vector.
  • This sum is normalized to |N(u)|^(-0.5) · Σ_{i∈N(u)} x_i.

2. Metadata, e.g. the user’s age, sex, etc.

  • A(u) denotes the set of attributes (features) describing user_u.
  • Each user attribute is represented by a distinct vector y_a, and for each user the sum of these factor vectors is fed into the prediction.
  • Item metadata could be used in the same way as user metadata, but this is not done in the paper.

The prediction function is modified:
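Putting the implicit-feedback and attribute terms together, the enhanced user representation enters the prediction roughly as follows (following the formulation in the paper):

r̂_ui = μ + b_i + b_u + q_i^T · [ p_u + |N(u)|^(-0.5) · Σ_{j∈N(u)} x_j + Σ_{a∈A(u)} y_a ]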

Adding Temporal Information

In the real world, a user’s perception changes and the popularity of items changes over time, so we can add a time factor to the model to capture this non-static nature.

Three terms b_i, b_u, p_u are modeled w.r.t. time:

  • b_i(t) : item bias, since an item’s popularity changes over time.
  • b_u(t) : a user’s characteristics may change → a person may become less critical and therefore rate items higher on average than before.
  • p_u(t) : a user’s preferences for items may change over time. q_i stays the same since the item itself is static in nature.

The prediction function is modified to:
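Making the three time-dependent terms explicit, the prediction for a rating at time t becomes:

r̂_ui(t) = μ + b_i(t) + b_u(t) + q_i^T · p_u(t)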

Adding confidence level to ratings

Not all ratings deserve the same weight: some ratings may be driven by advertising, and an adversarial user may give low ratings to certain items. The paper therefore suggests using implicit data, such as watch history and watch time, to add a confidence term (c_ui) for each rating.

The loss function is modified so that each squared error is weighted by its confidence:
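With c_ui as the confidence in rating r_ui, the objective becomes:

min over p, q, b of  Σ_{(u,i)∈K} c_ui · (r_ui - μ - b_u - b_i - q_i^T · p_u)^2 + λ · (||p_u||^2 + ||q_i||^2 + b_u^2 + b_i^2)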

For example, if a user watched a movie three times, the system could weight that rating three times as heavily.

Conclusion

I’ve implemented matrix factorization using SGD; feel free to take a look here. The current version is not very efficient, so it takes a long time to train. In subsequent articles I plan to (1) optimize SGD to make the computation more efficient and (2) implement matrix factorization using ALS.

Thank you for reading this article. If you have any questions, leave a comment :)
