Neural Autoregressive Collaborative Filtering for Implicit Feedback

This short paper extends the NADE-CF for implicit feedback dataset.  The main difference between implicit and explicit dataset is that the implicit feedback data does not have a negative sample. For example, the lack of a number of clicks or view counts does not imply that a particular user dislikes that item. He/she may not aware the existing of that item.

The author extent CF-NADE to solve the implicit feedback problem by predicting the likeness probability:

p(\textbf{t}) = \prod_{i=1}^M p(t_i | \textbf{t}_{<i})

where M is the number of items, t_i = 1 indicates user u likes item i, t_i = 0 when user u dislikes item i. However, we do not know whether user u will like item i. So we need to convert implicit feedback r_{ui} to t_{ui} by setting t_{ui} when r_{ui} > 0.

Since this binarizing an implicit feedback is noisy, the confidence that user u like or dislike the given item should be included in the model. One way to estimate the confidence is:

c_{ui} = 1 + \alpha r_{ui}

The higher implicit feedback score user u gives the higher confidence.

Thus, the final model becomes:

p(\textbf{t}|\textbf{c}) = \prod_{i=1}^M p(t_i | \textbf{t}_{<i}, \textbf{c}_{<i})

This conditional probability is modeled as an autoregressive model with a hidden variable. The confidence will basically act as a weight on the loss function:

C = -\sum_{i=1}^M c_i \log p(t_i|\textbf{t}_{<i}, \textbf{c}_{<i})

Overall, this short paper provides a simple extension of CF-NADE by binarizing an implicit feedback and adding confident random variable to the original CF-NADE model.



A Neural Autoregressive Approach for Collaborative Filtering (ICML’16)

This paper proposed an autoregressive model for rating prediction. The idea of this paper is similar to CF-RBM for collaborative filtering problem. Instead, it replaces an RMB with NADE. Since NADE is a mean-field approximation of RBM and also CF-RBM works for collaborative filtering problem, it makes sense that NADE model is applicable for CF problem.

The NADE model aims to estimate a probability of the given sequence as P(\vec{r}) = \prod_{i=1}^D P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}}) The previously seen sequence of rating can be used to predict the current rating. This also implies that some early rating on the known sequence has a strong influence on the current rating. This observation made NADE a different model from RNN.

The conditional probability distribution is defined as:

P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}}) = \frac{\exp(s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}{\sum_{l=1}^K\exp(s^l_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}

The model needs to estimate a score function s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) such that it gives a high score when a user will likely to give a rating k to an item m_{o_i} given her previous rating history.

The score function can be seen as a dot product of a user state with an item latent vector. s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) = b^k_{m_{o_i}} + V^k_{m_{o_i},:}\textbf{h}(r_{m_{o_{<i}}}). The bias is an item popularity and the dot product is the similarity between the current taste of the given user to the item o_i.

The hidden state captures the user’s current interest. It means that this model captures the change of behavior. This hidden state is defined as \textbf{h}(r_{m_{o_{<i}}}) = g(c + \sum_{j<i}W^{r_{m_{o_j}}}_{:,m_{o_j}}) The function g is an activation function, c is a bias, and a matrix W is an interaction matrix that transforms the input rating vector to a item vector.

The NADE-CF can be complicated but the main idea is the same as RBM-CF. Each user will have her own NADE network. NADE-CF shares all weight parameters among all users to avoid overfitting. Further, since each weight matrix associated with a rating, some weight matrix might be updated less frequently than others. The author changes the formula to compute the hidden state (look at the eq. 9 in the original paper). This way, the rare rating matrix will be updated more often.

In order to reduce the number parameters further, the low-rank matrix approximation is applied to matrix W and V. In addition, to improve the predictability further, the ordinal cost is introduced. The idea is if the user rating item m with rating k. Her rating preference of k-1 should be higher than k-2 and so on. In other words, if a user gave a rating of 3, it is likely that she will give a rating of 2 instead of 1 because a rating of 2 is closer to a rating of 3. The ordinal cost then is modeled as a rank loss similar to the list-wise learning to rank model.

Overall, NADE-CF yields a very good performance on the popular CF data set such as Netflix and MovieLens. The model is complicated and seems to be slow. However, with ordinal cost, low-rank approximation, and its ability to have a deeper model, made this model more attractive and it might be worthwhile to investigate an autoregressive model for this particular task.


Neural Autoregressive Distribution Estimation (NADE)

The generative model estimates P(x) or P(x,z) which is different from the discriminative model which estimates a conditional probability P(z|x) directly. An autoregressive model is one of three popular approaches in deep generative models beside GANs and VAE. It models P(x)  = \prod_i P(x_i | x_{<i})

It models P(x)  = \prod_i P(x_i | x_{<i}) – it factors a joint distribution of an observation as a product of an independent conditional probability distribution. However, implement this model to approximate these distributions directly is intractable because we need parameters for each observation.

NADE [1] proposed a scalable approximation by sharing weight parameters between each observation. The sharing parameters method reduces the number of free parameters and has an effect of regularization because the weight parameters must accommodate for all observation.

For technical details, NADE models the following distribution:

P(x) = \prod_{d=1}^D P(x_{o_d}|x_{o_<d}) where d is an index of the permutation o. For example, if we have x = {x_1, x_2, x_3, x_4}, and o = {2, 3, 4, 1}, then x_{o_1} = x_2. The permutation of the observation is more generic notations. Once we model the observation, the hidden variables can be computed as:

\vec h_d = \sigma(W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c)  And we can generate the observation ( a binary random variable) using a sigmoid function:

P(x_{o_d} = 1| \vec x_{o_{<d}}) = \sigma(V_{o_d}\vec h_d + b_{o_d})

If we look at NADE’s architecture, it is similar to RBM. In fact, [1] shows that NADE is in fact, a mean-field approximation of RBM (see [1] for details).

Another important property of NADE is computing a joint distribution P(x) is linear to the number of dimension of observations because we can express W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c recursively:

Define the basecase as: \vec a_1 = \vec c

And the recurrent relationship as: \vec a_d =  W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c = W_{:,o_{d-1}} x_{o_{d-1}} + \vec a_{d-1}

This means that computing \vec h_d = \sigma(\vec a_d) can be done in a linear fashion.

There are many extension of the Autoregressive model, one of the extension CF-NADE is currently the state-of-the-arts of CF. This model can model a binary/discrete random variable which VAE is unable to model currently. So this can be useful for any problem that requires a discrete random variable.


[1] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” Journal of Machine Learning Research 17.205 (2016): 1-37.