Neural Collaborative Filtering (WWW’17)

This is another paper that applies deep neural networks to the collaborative filtering problem. Although there are a few strong deep learning models for CF, such as CF-NADE and AutoRec, the authors argue that those models target explicit feedback and position this work to address the ‘implicit feedback CF’ problem.

The model is straightforward and similar to the Neural Factorization Machine. The idea is to learn embedding vectors for users and items and model their interaction as a non-linear function via a multi-layer neural network. Non-linear interactions have been commonly used in many recent works, such as the Neural Factorization Machine [2] and the Deep Relevance Matching Model [3].

The authors propose three models incrementally: (1) Generalized Matrix Factorization (GMF), which is basically MF with an additional non-linear transformation; in this case, they use a sigmoid function. (2) Multi-Layer Perceptron (MLP), which concatenates the user and item embedding vectors and transforms them with a learned non-linear function. (3) A fusion model, which can either share the same embedding vectors and sum the two branches at the last layer, or learn separate embedding vectors and concatenate them at the output layer.
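Below is a minimal PyTorch sketch of the fused variant with separate embeddings, just to make the structure concrete; the layer sizes, names, and the ReLU choice are my own illustrative picks, not the paper's exact configuration.

import torch
import torch.nn as nn

class NeuMF(nn.Module):
    """Sketch of the fused GMF + MLP model (sizes are illustrative)."""
    def __init__(self, n_users, n_items, dim=16, mlp_dims=(32, 16, 8)):
        super().__init__()
        # Separate embeddings for the GMF and MLP branches, as in variant (3).
        self.user_gmf = nn.Embedding(n_users, dim)
        self.item_gmf = nn.Embedding(n_items, dim)
        self.user_mlp = nn.Embedding(n_users, dim)
        self.item_mlp = nn.Embedding(n_items, dim)
        layers, in_dim = [], 2 * dim
        for out_dim in mlp_dims:
            layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
            in_dim = out_dim
        self.mlp = nn.Sequential(*layers)
        # Final layer concatenates the GMF element-wise product and the MLP output.
        self.out = nn.Linear(dim + mlp_dims[-1], 1)

    def forward(self, users, items):
        gmf = self.user_gmf(users) * self.item_gmf(items)  # element-wise product (GMF branch)
        mlp = self.mlp(torch.cat([self.user_mlp(users), self.item_mlp(items)], dim=-1))
        return torch.sigmoid(self.out(torch.cat([gmf, mlp], dim=-1))).squeeze(-1)

model = NeuMF(n_users=1000, n_items=1700)
scores = model(torch.tensor([0, 1]), torch.tensor([10, 42]))  # predicted preference in (0, 1)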

Since they want to predict whether a given item is preferred or not, this is a binary classification problem, so the loss function can be binary cross-entropy. To me, implicit feedback seems to be an easier problem than rating prediction because we only need to make a binary prediction.
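As a toy sketch of that framing (the details, including the four negatives per positive, are illustrative choices of mine): observed interactions become label 1, a few sampled unobserved items become label 0, and predictions are scored with binary cross-entropy.

import numpy as np

rng = np.random.default_rng(0)
n_items = 50
# Observed (user, item) interactions are the positives; everything else is unlabeled.
positives = {0: {3, 7, 19}, 1: {2, 7}}

def training_pairs(user, n_neg=4):
    """Pair each observed item (label 1) with a few sampled unobserved items (label 0)."""
    pos = list(positives[user])
    pairs = [(user, i, 1.0) for i in pos]
    candidates = np.setdiff1d(np.arange(n_items), pos)
    for _ in pos:
        for j in rng.choice(candidates, size=n_neg, replace=False):
            pairs.append((user, int(j), 0.0))
    return pairs

def bce(y, p, eps=1e-8):
    """Binary cross-entropy between labels y and predicted preferences p."""
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

batch = training_pairs(0)   # e.g. [(0, 3, 1.0), ..., (0, 31, 0.0), ...]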

The baselines seem okay, but I wish the authors had included more recent deep learning models such as [4]. The AutoRec model is also applicable to implicit feedback by forcing the output to be binary. Regardless of the baselines, the extensive experiments try to convince readers that a deep neural network can model complex interactions and will improve performance.

Speaking of performance, the authors use hit ratio and NDCG. Basically, there is one held-out test item for each user, and the model tries to rank that item highly among sampled negative items. This metric is more practical than MSE or RMSE since the dataset is an implicit feedback dataset.
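A small sketch of how I understand this leave-one-out protocol: the held-out item is scored against a set of sampled negatives, HR@K checks whether it lands in the top K, and NDCG@K rewards a higher rank. The number of negatives and the scores below are illustrative.

import numpy as np

def hit_ratio_and_ndcg(scores, test_idx, k=10):
    """scores: model scores for the held-out item plus sampled negatives;
    test_idx: position of the held-out item in `scores`."""
    ranked = np.argsort(-scores)                    # highest score first
    rank = int(np.where(ranked == test_idx)[0][0])  # 0-based rank of the true item
    hit = 1.0 if rank < k else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0
    return hit, ndcg

# Example: the held-out item (index 0) is ranked against 100 sampled negatives.
scores = np.random.default_rng(1).normal(size=101)
scores[0] += 2.0                                    # pretend the model favors the true item
print(hit_ratio_and_ndcg(scores, test_idx=0))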

A non-linear interaction is a simple extension of the traditional MF model. The authors show that this simple extension does work on the MovieLens and Pinterest datasets.

References:

[1] He, Xiangnan, et al. “Neural collaborative filtering.” Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

http://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf

[2] Xiangnan He, Tat-Seng Chua, “Neural Factorization Machines for Sparse Predictive Analytics”, ACM SIGIR 2017.

[3] Guo, Jiafeng, et al. “A deep relevance matching model for ad-hoc retrieval.” Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016.

[4] Zheng, Yin, et al. “Neural Autoregressive Collaborative Filtering for Implicit Feedback.” Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016.

 

Neural Factorization Machines for Sparse Predictive Analytics (SIGIR’17)

Deep learning has been applied to many information retrieval problems at SIGIR this year. One of the key ideas is to add non-linearity to existing linear models commonly found in IR.

The factorization machine is a linear model for sparse data prediction, composed of linear regression terms and d-way interaction terms. It is then possible to add non-linear interactions to the factorization machine. Neural FM [1] uses a neural network to model a non-linear interaction between feature latent vectors, and it demonstrates interesting results on recommendation tasks.

The factorization machine model is as follows:

\hat y(\mathbf{x}) = w_0 + \sum_{i=1}^n w_i x_i + f(\mathbf{x})

This paper changes the interaction f(\mathbf{x}) to be a non-linear function. The authors introduce a bi-interaction layer, a pooling operation that converts the embedded feature vectors into a single vector. The resulting vector is then fed to a multi-layer neural network, and the output of this network is the interaction score f(\mathbf{x}).
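A minimal NumPy sketch of the bi-interaction pooling as I read it: it sums the element-wise products of every pair of active feature embeddings, computed with the usual square-of-sum minus sum-of-squares identity. The sizes below are illustrative.

import numpy as np

def bi_interaction(x, V):
    """Bi-interaction pooling: pairwise element-wise products of the active
    feature embeddings, pooled into a single k-dimensional vector.
    x: (n,) sparse feature vector, V: (n, k) embedding matrix."""
    weighted = x[:, None] * V                    # rows x_i * v_i
    sum_sq = np.sum(weighted, axis=0) ** 2       # (sum_i x_i v_i)^2, element-wise
    sq_sum = np.sum(weighted ** 2, axis=0)       # sum_i (x_i v_i)^2
    return 0.5 * (sum_sq - sq_sum)               # shape (k,), fed to the MLP

rng = np.random.default_rng(0)
x = np.zeros(20); x[[1, 4, 9]] = 1.0             # a sparse input with 3 active features
V = rng.normal(scale=0.1, size=(20, 8))
pooled = bi_interaction(x, V)                    # single vector summarizing all pairs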

NFM differs from the Wide&Deep model in its pooling operation: Wide&Deep simply concatenates the feature embeddings, which does not explicitly account for feature interactions.

To train NFM, the authors use a squared loss and SGD for parameter estimation. Dropout and batch normalization are used to avoid overfitting.

The baselines are strong, including the DeepCross and Wide&Deep models. The performance (RMSE) on personalized tag recommendation shows that NFM yields a much lower RMSE than other state-of-the-art models. In conclusion, NFM models higher-order, non-linear feature interactions with few parameters and without requiring a deep structure.

References:

[1] Xiangnan He, Tat-Seng Chua, “Neural Factorization Machines for Sparse Predictive Analytics”, ACM SIGIR 2017.

http://www.comp.nus.edu.sg/~xiangnan/papers/sigir17-nfm.pdf

 

Factorization Machine

I stumbled upon this paper, which focuses on predicting a response variable on a sparse dataset. In a standard regression problem, we want to find a weight vector that transforms a feature vector into a response value.

\hat y = w_0 + \mathbf{w}^T \mathbf{x}

This linear equation assumes that the samples x^{(1)}, x^{(2)}, \cdots, x^{(n)} are independent. This assumption may not be valid for collaborative filtering or ranking problems, where the observations are correlated. Furthermore, an SVM cannot find a reliable hyperplane under such data sparsity; thus, we need a model that is reliable in this setting. The Factorization Machine (FM) is also a general predictor, and [1] shows that many prediction problems are equivalent to FM.

FM models all pairwise interactions in the dataset, which effectively increases the number of observations available for estimating the parameters. If we model pair-wise interactions, there are O(n^2) interactions. The FM model equation is:

\hat y(\mathbf{x}) = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \mathbf{v}_i^T \mathbf{v}_j x_i x_j

The first two terms are the standard regression formulation; the third term is the interaction term. The dot product of the feature embedding vectors \mathbf{v}_i and \mathbf{v}_j is the weight of the interaction between feature i and feature j. This formulation implies that the features are not independent.

The author shows that a 2-way FM is equivalent to an SVM with a polynomial kernel K(\mathbf{x},\mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle. The only difference is that the interaction weight matrix W = [w_{i,j}] is dense in the SVM, while it is low-rank under the FM framework, which leads to fewer parameters to estimate. Writing W = VV^T with W \in \mathbb{R}^{n \times n} and V \in \mathbb{R}^{n \times k}, the number of parameters drops from O(n^2) to O(nk). If k \ll n, then O(nk) is almost linear in n.

Some people have pointed out that the computation of FM is O(n^2), which is not the case. If the input is sparse, the number of non-zero interactions in the term \sum_{i=1}^n \sum_{j=i+1}^n \mathbf{v}_i^T \mathbf{v}_j x_i x_j is actually much smaller than O(k \cdot n^2). If m(n) is the number of non-zero entries, then the computation of the interaction term is O(k \cdot m(n)). As long as m(n) \ll n, FM’s computation is almost linear.
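Here is a small NumPy sketch of that computation, using the standard square-of-sum minus sum-of-squares rewriting of the interaction term and iterating only over the non-zero entries of x; the array shapes are illustrative.

import numpy as np

def fm_interaction(x, V):
    # Interaction term sum_{i<j} <v_i, v_j> x_i x_j, computed in O(k * m) where m is
    # the number of non-zero entries of x, via
    # 1/2 * sum_f [ (sum_i v_{if} x_i)^2 - sum_i v_{if}^2 x_i^2 ].
    nz = np.flatnonzero(x)                 # only non-zero features contribute
    weighted = x[nz, None] * V[nz]         # rows x_i * v_i for the active features
    return 0.5 * float(np.sum(np.sum(weighted, axis=0) ** 2 - np.sum(weighted ** 2, axis=0)))

def fm_predict(x, w0, w, V):
    # Full FM prediction: global bias + linear term + pairwise interaction term.
    return w0 + float(w @ x) + fm_interaction(x, V)

rng = np.random.default_rng(0)
n, k = 1000, 8
x = np.zeros(n); x[[3, 42, 777]] = 1.0     # a sparse feature vector
print(fm_predict(x, 0.1, rng.normal(size=n), rng.normal(scale=0.1, size=(n, k))))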

Nevertheless, the input features for FM still require some feature engineering: the choice of features is important, and feature normalization is necessary.

References:

[1] Rendle, Steffen. “Factorization machines.” Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010.

 

RBMs for Collaborative Filtering

If we are going to write a paper on deep learning for the collaborative filtering problem, then RBM-CF [1] must be cited! Although it was not the algorithm that yielded the lowest RMSE during the Netflix contest (SVD++ was the best algorithm at that time), RBM-CF was used as one of many algorithms for ensemble learning. One of the main reasons is its unique approach at the time: its predictions were slightly different from those of matrix factorization approaches.

It turns out that one of the state-of-the-art models for collaborative filtering right now (as of April 2017) that I am aware of is NADE-CF [2], which is based on a similar idea to RBM-CF. This post summarizes the RBM-CF model, its architecture, and its extensions.

RBM-CF models the joint probability of visible and hidden units. In the CF setting, the visible units are the observed ratings, represented as binary values: each rating is one-hot encoded as a binary vector of length K, where K is the maximum rating. The goal is to infer the hidden units from these observed ratings. This means we need to learn a non-linear function that maps the visible units to a probability distribution over the hidden units. In an RBM, this non-linear function is the sigmoid function.

RBM-CF uses a softmax to model the visible units and a Bernoulli distribution to model the hidden units. The model is trained with contrastive divergence, which approximates the gradient of the log-likelihood. To make a prediction, the author suggests first computing all p(v_q = k | \mathbf{V}) and normalizing them with a softmax, and then computing the expected rating.
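A rough NumPy sketch of that one-pass prediction as I read it; the weight layout and shapes here are my own assumptions about how one might implement it, not code from the paper.

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def predict_rating(V_onehot, W, b_hidden, b_visible, q):
    """Infer hidden probabilities from the observed one-hot ratings, push them back
    through the weights for movie q, softmax over the K rating values, then take
    the expected rating. V_onehot: (n_movies, K) observed ratings (zero rows for
    unrated movies), W: (n_movies, K, n_hidden), b_hidden: (n_hidden,),
    b_visible: (n_movies, K)."""
    n_movies, K, n_hidden = W.shape
    # p(h_j = 1 | V): Bernoulli hidden units with a sigmoid activation.
    h = sigmoid(b_hidden + np.einsum('mk,mkj->j', V_onehot, W))
    # Unnormalized scores for each rating value k of movie q, then softmax.
    logits = b_visible[q] + W[q] @ h
    p = np.exp(logits - logits.max()); p /= p.sum()
    return float(np.dot(np.arange(1, K + 1), p))      # expected rating in [1, K]

rng = np.random.default_rng(0)
M, K, H = 6, 5, 4
V = np.zeros((M, K)); V[0, 3] = 1; V[2, 4] = 1        # user rated movie 0 with 4, movie 2 with 5
print(predict_rating(V, rng.normal(size=(M, K, H)) * 0.1, np.zeros(H), np.zeros((M, K)), q=1))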

One possible variant is to model the hidden units as Gaussian latent variables, which increases the capacity of the model. Another variant utilizes the missing ratings as extra information. The author observes that the items in the test set can be treated as items the user has viewed without providing a rating. This viewing information is represented as a binary random variable and influences the hidden units. The resulting conditional RBM significantly improves performance. Imputing the missing values is another heuristic used to slightly improve performance.

The most interesting contribution is the architecture the author proposes to reduce the number of free parameters: the weight matrix is factored into a product of two lower-rank matrices. Fewer parameters means less overfitting and faster convergence.
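A back-of-the-envelope illustration of the savings (the sizes, including the rank, are illustrative choices of mine, roughly Netflix-scale):

n_visible, n_hidden, rank = 17770 * 5, 100, 30            # movies x K rating values, hidden units, factor rank
dense_params = n_visible * n_hidden                        # full weight matrix W
factored_params = n_visible * rank + rank * n_hidden       # W = A @ B with A: (n_visible, rank), B: (rank, n_hidden)
print(dense_params, factored_params)                       # ~8.9M vs ~2.7M parameters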

RBM-CF has a slightly lower RMSE than a standard SVD. Modern approaches such as autoencoders or PMF are much more scalable than RBM-CF. The model can also be extended to a deeper architecture, with the RBM used for parameter pretraining.

References:

[1] Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. “Restricted Boltzmann machines for collaborative filtering.” Proceedings of the 24th international conference on Machine learning. ACM, 2007.

[2] Zheng, Yin, et al. “A neural autoregressive approach to collaborative filtering.” Proceedings of the 33rd International Conference on Machine Learning. 2016.

Recurrent Recommender Networks (WSDM’17)

The motivation of this work is to tackle the collaborative filtering problem in a realistic setting. Classical collaborative filtering models interpolate a rating based on both past and future ratings, but in a real-world situation there are no future ratings. Therefore, being able to extrapolate, i.e. predict future ratings, is more practical.

One important argument for why many CF models performed well on the Netflix dataset is that the training and testing data came from a mixed temporal distribution: the models were fed future ratings, which makes it easy to predict a user's rating.

Therefore, modeling the temporal and causal aspects of the rating data is the main goal of this work. The authors give the example of the movie ‘Plan 9’, which was initially reviewed as a bad film but became very popular later. Another observation is that some movies are more popular during Christmas and the summer. It is also reasonable to assume that user preferences change over time: as people grow up, their taste in films changes.

With all these motivations and observations, they propose to use an RNN to model user and movie dynamics over time, hoping that the RNN will capture both exogenous and endogenous dynamics. The key ingredient of the model is to incorporate the wall clock (time index) as part of the sample features. Here is what each training sample looks like:

s_t = [x_t, 1_{\text{newbie}}, \tau_t, \tau_{t-1}]

x_t is the vector of the user's ratings at step t: x_{jt} = k means this user rated movie j at time t with a rating of k, and x_{jt} = 0 when the user did not rate movie j. 1_{\text{newbie}} appears to be an indicator that the user has no previous ratings, i.e. a new user. The last two entries, the current and previous time indices, are important because the RNN uses them to handle steps with no ratings.
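A tiny sketch of how such a sample might be assembled (my own reading of the format; the helper name and shapes are hypothetical):

import numpy as np

def build_sample(ratings_at_t, n_movies, is_newbie, tau_t, tau_prev):
    # s_t = [x_t, 1_newbie, tau_t, tau_{t-1}]
    # ratings_at_t: dict movie_id -> rating given at this step (empty if none).
    x_t = np.zeros(n_movies)
    for j, k in ratings_at_t.items():
        x_t[j] = k                                   # x_{jt} = k, stays 0 if unrated
    return np.concatenate([x_t, [float(is_newbie), tau_t, tau_prev]])

s_t = build_sample({12: 4, 305: 5}, n_movies=1000, is_newbie=False, tau_t=37, tau_prev=36)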

Another important component is a function that projects s_t into an embedding space and feeds the embedding vector to an LSTM unit. Adding this linear transformation can be viewed as converting the raw data into a more abstract representation; it also means the model does not feed user ratings to the LSTM directly. The LSTM is used to model user and movie dynamics, so we can view the trained LSTM as a function that models these dynamics. The authors train two RNNs: one for users and another for movies.

Finally, at each time step, the model predicts a rating as follows:

\hat{r} = f(u_{it}, m_{jt}, u_i, m_j) = \langle \tilde u_{it}, \tilde m_{jt} \rangle + \langle u_i, m_j \rangle

This equation extends standard matrix factorization with the dynamic states \tilde u_{it} and \tilde m_{jt}. In other words, at each time step the model effectively solves a matrix factorization based on the ratings seen up to time t.
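A minimal PyTorch sketch of this scoring function as I understand it, with per-user and per-movie LSTMs producing the dynamic states and ordinary embeddings providing the stationary ones; all sizes and names are illustrative.

import torch
import torch.nn as nn

class RRNScorer(nn.Module):
    """Sketch of the rating function: stationary embeddings plus dynamic
    states produced by per-user and per-movie LSTMs (sizes are illustrative)."""
    def __init__(self, n_users, n_movies, in_dim, dim=32):
        super().__init__()
        self.u_static = nn.Embedding(n_users, dim)
        self.m_static = nn.Embedding(n_movies, dim)
        self.u_proj = nn.Linear(in_dim, dim)   # project s_t before the LSTM
        self.m_proj = nn.Linear(in_dim, dim)
        self.u_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.m_lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, i, j, user_seq, movie_seq):
        # user_seq, movie_seq: (1, T, in_dim) histories s_1..s_t for user i / movie j.
        u_dyn, _ = self.u_lstm(self.u_proj(user_seq))
        m_dyn, _ = self.m_lstm(self.m_proj(movie_seq))
        u_t, m_t = u_dyn[:, -1], m_dyn[:, -1]                     # dynamic states at time t
        dynamic = (u_t * m_t).sum(-1)                             # <u~_it, m~_jt>
        static = (self.u_static(i) * self.m_static(j)).sum(-1)    # <u_i, m_j>
        return dynamic + static

scorer = RRNScorer(n_users=100, n_movies=500, in_dim=503, dim=32)  # in_dim = n_movies + 3
score = scorer(torch.tensor([7]), torch.tensor([42]),
               torch.rand(1, 5, 503), torch.rand(1, 5, 503))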

Training this model requires alternating optimization, since we cannot train the user and movie networks simultaneously (otherwise there would be too many RNNs to unroll over all movies). The authors fix the movie dynamics function and train the user dynamics function, then fix the user dynamics function and train the movie dynamics function, alternating between the two. Training the model this way is more scalable.

The experimental results show that this model (RRN) beats TimeSVD++, AutoRec, and PMF. Furthermore, the model captures external factors such as the rating-scale change in the Netflix dataset and events like the Oscar or Golden Globe awards, as well as internal factors such as seasonal changes.

My two cents: I like this paper because the motivation is well written and I can see the benefit of modeling the dynamics of users and movies. I am surprised that there are not many related works that attempt to solve the extrapolation problem.

A quick survey on Deep Learning for Collaborative Filtering

Here is my quick survey of recent works that use deep learning to tackle collaborative filtering and cold-start problems in recommender systems.

Pure Collaborative Filtering problem (No side-information)

I have read a few works that attempt to use deep learning to solve CF problems. In the collaborative filtering problem, we want to infer latent representations of users and items from the rating matrix. The best methods that I know of are CF-NADE [1] and AutoRec [2]: the first is based on an autoregressive model (an extension of the RBM approach), while the second is based on an autoencoder. I have also encountered RNN-based CF, where we learn to predict a rating based on the ratings seen so far; it seems to work well too. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems (CDAE) [3] is a denoising autoencoder that encodes both item and user latent vectors, with the user latent vector used as a bias term.

The traditional method that performs well in this domain is SVD++, a matrix factorization that implicitly incorporates neighborhood information. Probabilistic Matrix Factorization is a Bayesian approach to this problem and provides a better interpretation of the latent vectors.

I have not found any work that provides new insight into CF problems except CF-NADE, which uses a completely different approach. AutoRec seems to work well, but the original paper is a short paper (2 pages), so I am not sure how well it really works.

Content-based recommendation (with side-information)

A few works demonstrate that autoencoders can be used to learn latent vectors from user or item content. Collaborative Deep Learning (CDL) [4] uses a stacked denoising autoencoder to learn item latent vectors coupled with user latent vectors. Some works in music recommendation simply use deep learning to learn better item latent vectors from raw audio features [5, 6].

Deep Collaborative Filtering via Marginalized Denoising Auto-encoder [7] jointly learns user and item latent vectors from user and item content. The model regularizes the latent vectors by adding reconstruction errors on the rating matrix and tries to reconstruct the user and item content by projecting the latent vectors back to the feature space. These additional loss terms could be a novel contribution. However, the model puts less weight on the content reconstruction errors, and these extra hyperparameters add complexity to the model.

One question we need to ask is whether an autoencoder is really a good approach to collaborative filtering. One view of the autoencoder is as a non-linear PCA. Can we elaborate on this model?

References:

[1] https://arxiv.org/abs/1605.09477

[2] http://users.cecs.anu.edu.au/~akmenon/papers/autorec/autorec-paper.pdf

[3] http://alicezheng.org/papers/wsdm16-cdae.pdf

[4] https://arxiv.org/abs/1409.2944

[5] Deep content-based music recommendation

[6] Improving content-based and hybrid music recommendation using deep learning

[7] http://dl.acm.org/citation.cfm?id=2806527