Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention (SIGIR’17)

This paper proposes a content-based recommendation model that uses an attention mechanism. The task is to recommend videos or images to a user based on the user’s interactions with the content, such as view counts or clicks. This is an implicit feedback problem.

The key motivation is that in video or image recommendation, users do not like the entire video or image but only pay attention to a subset of the content. For a video, only some frames appeal to the user; for an image, only some regions are interesting. The attention mechanism lets the model learn which components matter to each user.

The model resembles SVD++ because the overall rating has both latent factor model and neighborhood model:

\hat R_{ij} = (\textbf{u}_i + \sum_{l \in R(i)} \alpha(i, l)\textbf{p}_l)^T \textbf{v}_j \\  =\textbf{u}_i^T \textbf{v}_j + \sum_{l \in R(i)} \alpha(i, l)\textbf{p}_l^T \textbf{v}_j

The first term is a standard latent factor model: it learns an embedding vector for each user and item. The second term is a neighborhood model: the more similar item j is to the items that user i likes, the higher the predicted rating.

The difference is that SVD++ uses uniform weights, whereas this model learns the weights from data. Furthermore, the neighborhood model does not compute a dot product between two item embedding vectors but between the item vector and an auxiliary item vector. The auxiliary vector characterizes users through the set of items they interacted with.
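To make the decomposition concrete, here is a minimal numpy sketch of the prediction rule above; the helper name and dimensions are illustrative, not the paper’s implementation. With uniform weights it reduces to SVD++-style weighting.

```python
import numpy as np

def predict_rating(u_i, v_j, P, alpha):
    """Sketch of the attentive SVD++-style score (names are illustrative).

    u_i:   user embedding, shape (d,)
    v_j:   target item embedding, shape (d,)
    P:     auxiliary vectors of the user's interacted items, shape (n, d)
    alpha: attention weights over those items, shape (n,), summing to 1
    """
    neighborhood = alpha @ P           # attention-weighted auxiliary vectors
    return (u_i + neighborhood) @ v_j  # latent-factor term + neighborhood term

rng = np.random.default_rng(0)
d, n = 8, 5
u, v = rng.normal(size=d), rng.normal(size=d)
P = rng.normal(size=(n, d))
alpha = np.full(n, 1.0 / n)            # uniform weights: SVD++'s fixed scheme
score = predict_rating(u, v, P, alpha)
```

Expanding the parentheses recovers exactly the two-term decomposition written above.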

The model architecture has three parts. The first part computes an item vector as a weighted combination of all its components, e.g., a vector representation for a video with 100 frames. This representation \bar x = \sum_m \beta(l, m) x_{lm} is a personalized item representation because the weight (attention) \beta(l, m) is computed from a nonlinear function b(i, l, m) that takes the user and all item components.
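A rough sketch of this component-level attention, assuming a simple one-hidden-layer score network in place of the paper’s b(i, l, m); the projection matrices W_u, W_x and the vector w are hypothetical parameters:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())            # shift for numerical stability
    return e / e.sum()

def personalized_item_vector(u, X, W_u, W_x, w):
    """u: user embedding (d,); X: component features, one row per frame (m, d_x).
    W_u, W_x, w are hypothetical score-network parameters."""
    h = np.tanh(u @ W_u.T + X @ W_x.T)  # (m, d_h): joint user/component features
    beta = softmax(h @ w)               # beta(l, m): attention over components
    return beta @ X                     # \bar x: personalized item representation

rng = np.random.default_rng(1)
d, d_x, d_h, m = 6, 10, 4, 100          # e.g. a video with 100 frames
u = rng.normal(size=d)
X = rng.normal(size=(m, d_x))
W_u = rng.normal(size=(d_h, d))
W_x = rng.normal(size=(d_h, d_x))
w = rng.normal(size=d_h)
x_bar = personalized_item_vector(u, X, W_u, W_x, w)
```

Because the user vector enters the score, two users get different \bar x for the same video.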

The second part computes a neighborhood item vector, \sum_{l \in R(i)} \alpha(i,l) \textbf{p}_l. The attention is computed from a function a(i, l) that takes the user i, the item l, the auxiliary item vector \textbf{p}_l, and \bar x_l. This neighborhood vector is then added to the user vector \textbf{u}_i.

The final part performs pair-wise learning. The paper uses BPR (Bayesian Personalized Ranking) to minimize the ranking loss between positive and negative samples. The evaluation uses a leave-one-out protocol, keeping each user’s last interaction with an item as the test set, and measures hit rate and NDCG.
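The BPR objective can be sketched as follows (a generic pairwise formulation, not the paper’s exact code):

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """BPR pairwise loss: -log sigmoid(s_pos - s_neg), averaged over pairs.
    Pushes each positive item's score above its sampled negative's."""
    diff = pos_scores - neg_scores
    return float(np.mean(np.log1p(np.exp(-diff))))  # = -log sigmoid(diff)
```

With a large positive margin the loss approaches zero; at a tie it equals log 2.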

The hit rate of the proposed model is better than SVD++’s. I think the authors should include FISM as a baseline, since that model also performs well on implicit feedback tasks. Furthermore, SVD++ does not perform well on the Top-N task since it is optimized for the RMSE metric; hence, SVD++ is not a strong baseline.

Overall, what I like about this work is the hierarchical model that constructs a personalized item representation through attention. This idea makes a lot of sense because (1) people focus on different parts of the content, so the item vector should be personalized; and (2) the model has many parameters, so it works well on large video and image datasets.





Importance Weighted Autoencoders (ICLR’16)

In a classical VAE, the ELBO is

\log P(\textbf{x}) \ge E_{Q(\textbf{z}|\textbf{x})}[\log \frac{P(\textbf{x},\textbf{z})}{Q(\textbf{z}|\textbf{x})}] = L(x)

An unbiased estimate of the gradient of L(\textbf{x}) is:

\nabla_{\theta} L(\textbf{x})= \frac{1}{k}\sum_{i=1}^k \nabla_{\theta}\log w(\textbf{x},\textbf{z}_i; \theta)

where w(\textbf{x},\textbf{z}_i; \theta) = \frac{P(\textbf{x},\textbf{z}_i)}{Q(\textbf{z}_i|\textbf{x})}

The importance weighted autoencoder has a slightly different ELBO:

\log P(\textbf{x}) \ge E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\log \frac{1}{k}\sum_{i=1}^k \frac{P(\textbf{x},\textbf{z}_i)}{Q(\textbf{z}_i|\textbf{x})}]

An unbiased estimate of the gradient is:

\nabla_{\theta} L_k(\textbf{x}) = \nabla_{\theta} E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\log\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\nabla_{\theta} \log\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}{\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}{\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\sum_{i=1}^k \nabla_{\theta} w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}]

where C(\textbf{x};\theta) = \sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta) is shorthand for the sum of the importance weights.

Then, using the identity \nabla_{\theta} w = w \nabla_{\theta} \log w:

= E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\sum_{i=1}^k \frac{w(\textbf{x}, \textbf{z}_i; \theta)}{\sum_{j=1}^k w(\textbf{x}, \textbf{z}_j; \theta)}\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\sum_{i=1}^k \tilde w_i\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)]

where \tilde w_i = w(\textbf{x}, \textbf{z}_i; \theta) / \sum_{j=1}^k w(\textbf{x}, \textbf{z}_j; \theta) are the normalized importance weights.

The Monte Carlo estimate is then:

\sum_{i=1}^k \tilde w_i\nabla_{\theta} \log w(\textbf{x}, \textbf{z}_i; \theta)
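In code, the k-sample bound and the normalized weights \tilde w_i can be computed stably from log-weights (a sketch; the function names are mine):

```python
import numpy as np

def iwae_bound(log_w):
    """k-sample IWAE bound: log (1/k) sum_i w_i, via a stable log-sum-exp."""
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))

def normalized_weights(log_w):
    """tilde w_i = w_i / sum_j w_j, used to weight each sample's gradient."""
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

log_w = np.array([-1.0, -2.0, -0.5])   # toy log importance weights
```

By Jensen’s inequality the k-sample bound is at least the average 1-sample ELBO term, which is the usual argument for why IWAE tightens the bound.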

The Importance Weighted Autoencoder (IWAE) has the same network architecture as the VAE; when k = 1, the Monte Carlo estimate reduces to the standard VAE gradient.

At first glance it is unclear why IWAE is better than a VAE, since we can also draw multiple samples under a VAE to approximate the log-likelihood. The main difference is the weighting: each sample’s gradient is scaled by its normalized importance weight \tilde w_i, which is why the samples are called importance samples.


[1] Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. “Importance weighted autoencoders.” arXiv preprint arXiv:1509.00519 (2015).

Performance of Recommender Algorithms on Top-N Recommendation Tasks (RecSys’10)

This paper found that many recommender algorithms that optimize an error metric (RMSE) do not perform well on the Top-N task because these models do not account for all candidate items, including unrated ones. The Top-N task is a more realistic evaluation than RMSE because it involves more than just the rated items in the dataset. Hence, it is in our interest to understand how to evaluate recommender models under the Top-N task.

The testing methodology (for the Netflix dataset) is as follows:

  • The Netflix dataset comes with a training set and a probe set.
  • Create a training set M from the original Netflix training set.
  • Create a test set T containing all 5-star ratings from the probe set.
  • Train the model on the training set M.
  • For each item i rated 5 stars by user u in T:
    • Randomly select 1,000 additional items unrated by user u, assuming user u is not interested in any of them.
    • Predict ratings for the test item i and the 1,000 additional items.
    • Form a ranked list by ordering all 1,001 items; let p be the position of item i.
    • Form a top-N recommendation list from the N top-ranked items. If p <= N, we have a hit.
The evaluation metrics are:

  • recall(N) = # hits / |T|
  • precision(N) = # hits / (N · |T|) = recall(N) / N
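The protocol above can be sketched as follows (toy sizes instead of the paper’s 1,001 candidates; the names are illustrative):

```python
import numpy as np

def hit_at_n(scores, test_idx, n):
    """True if the held-out item ranks within the top-N of the scored candidates."""
    rank = int(np.sum(scores > scores[test_idx]))  # how many items outrank it
    return rank < n

def recall_precision(num_hits, num_test_cases, n):
    """recall(N) = hits / |T|; precision(N) = recall(N) / N."""
    recall = num_hits / num_test_cases
    return recall, recall / n

scores = np.array([0.9, 0.1, 0.8, 0.3])  # toy stand-in for the 1,001 predictions
```

Because every miss contributes zero to both metrics, precision is recall divided by N exactly, as in the formulas above.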
It turns out that many collaborative filtering algorithms do not perform well on the precision metric. Plain SVD outperforms SVD++. The authors explain that SVD considers all items, while SVD++ only looks at observed items.

Another finding of this paper is that the majority of ratings come from the most popular items. We can divide the dataset into two groups: popular items and non-popular items. Recommending popular items is then a trivial task, while recommending non-popular items is more interesting and more difficult.

Hence, excluding the most popular items from the training data lowers precision and recall. However, PureSVD still outperforms SVD++ and the other matrix factorization methods.

Neural Autoregressive Collaborative Filtering for Implicit Feedback

This short paper extends CF-NADE to implicit feedback datasets. The main difference between implicit and explicit data is that implicit feedback has no negative samples. For example, the lack of clicks or view counts does not imply that a user dislikes an item; he or she may simply be unaware that the item exists.

The authors extend CF-NADE to the implicit feedback problem by predicting the likeness probability:

p(\textbf{t}) = \prod_{i=1}^M p(t_i | \textbf{t}_{<i})

where M is the number of items, t_i = 1 indicates that user u likes item i, and t_i = 0 that user u dislikes it. However, we do not observe whether user u actually likes item i, so we convert the implicit feedback r_{ui} to t_{ui} by setting t_{ui} = 1 when r_{ui} > 0.

Since binarizing implicit feedback is noisy, the confidence that user u likes or dislikes the given item should be included in the model. One way to estimate the confidence is:

c_{ui} = 1 + \alpha r_{ui}

The higher the implicit feedback score user u gives, the higher the confidence.

Thus, the final model becomes:

p(\textbf{t}|\textbf{c}) = \prod_{i=1}^M p(t_i | \textbf{t}_{<i}, \textbf{c}_{<i})

This conditional probability is modeled as an autoregressive model with hidden variables. The confidence essentially acts as a weight on the loss function:

C = -\sum_{i=1}^M c_i \log p(t_i|\textbf{t}_{<i}, \textbf{c}_{<i})
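A minimal sketch of the binarization and the confidence-weighted loss, assuming a Bernoulli likelihood for each t_i (the names are mine, not the paper’s):

```python
import numpy as np

def confidence(r, alpha=1.0):
    """c_ui = 1 + alpha * r_ui: higher implicit feedback, higher confidence."""
    return 1.0 + alpha * np.asarray(r, dtype=float)

def weighted_nll(t, p, c):
    """Confidence-weighted negative log-likelihood over one user's items.
    t: binarized feedback; p: predicted P(t_i = 1 | history); c: confidences."""
    eps = 1e-12
    ll = t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps)
    return float(-np.sum(c * ll))

r = np.array([0.0, 3.0, 1.0])          # implicit feedback, e.g. view counts
t = (r > 0).astype(float)              # t_ui = 1 when r_ui > 0
c = confidence(r, alpha=0.5)
```

Items the user engaged with heavily thus dominate the loss, while unobserved items contribute with the baseline confidence of 1.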

Overall, this short paper provides a simple extension of CF-NADE by binarizing the implicit feedback and adding a confidence variable to the original CF-NADE model.



FISM: Factored Item Similarity Models for Top-N Recommender Systems (KDD’13)

Sparsity is always the main problem in top-N recommendation models because we cannot observe all possible user–item interactions. This model aims to mitigate the problem by borrowing ideas from two existing models: SLIM and NSVD.

The SLIM [1] model is a generalization of the ItemKNN model. Instead of imposing a similarity metric such as Pearson correlation to identify similar items, SLIM learns a similarity matrix from the data. The predicted rating vector of user u is:

\tilde r_u = r_u \textbf{S}

This equation is simply a regression model: we predict the rating of item i by user u as a linear combination of all items rated by user u. The SLIM optimization problem adds a constraint forcing the diagonal of \textbf{S} to be 0, which prevents the model from self-prediction (using the rating of item i to predict the rating of item i). It also adds a non-negativity constraint so that predicted ratings are positive aggregations over items, and an L1 regularizer to force sparsity on \textbf{S}.
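A minimal sketch of the SLIM prediction with the zero-diagonal constraint (illustrative, not the paper’s solver):

```python
import numpy as np

def slim_predict(r_u, S):
    """\tilde r_u = r_u S with diag(S) = 0, so item i cannot predict itself."""
    S = S.copy()
    np.fill_diagonal(S, 0.0)           # enforce the no-self-prediction constraint
    return r_u @ S

r_u = np.array([5.0, 0.0, 3.0])        # one user's rating row
S = np.eye(3)                          # identity would be pure self-prediction
```

With the identity matrix as S, zeroing the diagonal wipes out every prediction, which shows why the constraint forces S to encode genuine item–item relations.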

The NSVD [2] model assumes that the similarity matrix \textbf{S} can be factored into two low-rank matrices. The intuition is that two similar item embedding vectors should yield a higher similarity score. The predicted rating of item i by user u is:

\tilde r_{ui} = b_u + b_i + \sum_{j \in R_u} \textbf{p}_j \textbf{q}_i^T

This equation states that the rating of item i by user u is an aggregation of the similarity scores between item i and all items rated by user u (including item i itself!).

FISM model simply extends NSVD model by making simple changes:

  • It excludes item i from the NSVD formulation to avoid self-prediction.
  • It adds a normalization term to control the degree of agreement between the items rated by the user and the item whose rating is being estimated.

The FISM model is:

\tilde r_{ui} = b_u + b_i + (n_u^+)^{-\alpha} \sum_{j \in R_u \setminus \{i\}} \textbf{p}_j \textbf{q}_i^T

When \alpha is 1.0, the model averages all the similarity scores, so the predicted rating is likely to be high only if most of the user’s rated items agree with item i. On the other hand, when \alpha is 0.0, the scores are simply summed, so a few items with high scores can already force the predicted rating to be high.
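A sketch of the FISM prediction rule, with item i already excluded from the user’s rated set; the names and values are illustrative:

```python
import numpy as np

def fism_predict(b_u, b_i, P_rated, q_i, alpha):
    """r_ui = b_u + b_i + (n_u^+)^{-alpha} * sum_j p_j . q_i.
    P_rated holds p_j for the user's rated items, item i itself excluded."""
    n = len(P_rated)
    return b_u + b_i + (n ** -alpha) * float(np.sum(P_rated @ q_i))

P_rated = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])       # p_j vectors; dot products with q_i: 1, 2, 3
q_i = np.array([1.0, 2.0])
```

With alpha = 1 the three scores are averaged; with alpha = 0 they are summed, matching the two regimes described above.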

The FISM paper proposes two learning models: FISMrmse and FISMauc. The first minimizes the squared loss between known and predicted ratings; the second minimizes a ranking loss that places higher-rated items before lower-rated ones.

The experimental results show that FISMrmse is better than FISMauc on the Hit Rate and Average Reciprocal Hit Rank metrics. NSVD is not among the baselines since it solves rating prediction, not the top-N recommendation problem. In my opinion, FISM is not very different from NSVD at all: FISMrmse is very similar to the NSVD model except that it excludes item i and adds the normalization term.


[1] X. Ning and G. Karypis. SLIM: Sparse linear methods for top-N recommender systems. In ICDM 2011.

[2] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, volume 2007, pages 5-8, 2007.

[3] Kabbur, Santosh, Xia Ning, and George Karypis. “Fism: factored item similarity models for top-n recommender systems.” Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.

A Neural Autoregressive Approach for Collaborative Filtering (ICML’16)

This paper proposes an autoregressive model for rating prediction. The idea is similar to RBM-CF for the collaborative filtering problem, but it replaces the RBM with NADE. Since NADE is a mean-field approximation of an RBM, and RBM-CF works for collaborative filtering, it makes sense that a NADE model is applicable to the CF problem.

The NADE model estimates the probability of a given rating sequence as P(\vec{r}) = \prod_{i=1}^D P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}}). The previously seen ratings are used to predict the current rating. This also implies that early ratings in the known sequence have a strong influence on the current rating, an observation that makes NADE a different model from an RNN.

The conditional probability distribution is defined as:

P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}}) = \frac{\exp(s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}{\sum_{l=1}^K\exp(s^l_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}

The model estimates a score function s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) that gives a high score when the user is likely to give rating k to item m_{o_i} given her previous rating history.

The score function can be seen as a dot product between a user state and an item latent vector: s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) = b^k_{m_{o_i}} + V^k_{m_{o_i},:}\textbf{h}(r_{m_{o_{<i}}}). The bias captures item popularity, and the dot product measures the similarity between the user’s current taste and item m_{o_i}.
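A sketch of this conditional distribution as a softmax over the per-rating scores (parameter names are illustrative):

```python
import numpy as np

def rating_distribution(h, V_item, b_item):
    """P(r = k | history) via a softmax over per-rating scores.

    h:      hidden state computed from the rating history, shape (d,)
    V_item: per-rating item vectors V^k for one item, shape (K, d)
    b_item: per-rating item biases b^k, shape (K,)
    """
    s = b_item + V_item @ h            # s^k for k = 1..K
    e = np.exp(s - s.max())            # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(2)
K, d = 5, 8                            # e.g. ratings 1..5
p = rating_distribution(rng.normal(size=d),
                        rng.normal(size=(K, d)),
                        rng.normal(size=K))
```

Each item thus carries K score vectors, one per rating value, which is also why some weight matrices are updated more rarely than others.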

The hidden state captures the user’s current interests, which means the model can capture changes in behavior. It is defined as \textbf{h}(r_{m_{o_{<i}}}) = g(c + \sum_{j<i}W^{r_{m_{o_j}}}_{:,m_{o_j}}), where g is an activation function, c is a bias, and W is an interaction matrix that transforms the input rating vector into an item vector.

CF-NADE can look complicated, but the main idea is the same as RBM-CF: each user has her own NADE network, and all weight parameters are shared among users to avoid overfitting. Further, since each weight matrix is associated with a rating value, some matrices might be updated less frequently than others. The authors change the formula for computing the hidden state (see Eq. 9 in the original paper) so that the matrices for rare ratings are updated more often.

To reduce the number of parameters further, a low-rank approximation is applied to the matrices W and V. In addition, to improve predictions, an ordinal cost is introduced. The idea is that if a user rates item m with rating k, her preference for rating k-1 should be higher than for k-2, and so on. In other words, if a user gave a rating of 3, she is more likely to have given a 2 than a 1, because 2 is closer to 3. The ordinal cost is modeled as a rank loss similar to list-wise learning-to-rank models.

Overall, CF-NADE yields very good performance on popular CF datasets such as Netflix and MovieLens. The model is complicated and seems slow. However, the ordinal cost, the low-rank approximation, and the ability to build deeper models make it attractive, and it might be worthwhile to investigate autoregressive models for this task.


Deep Semantic Hashing with Generative Adversarial Networks (SIGIR’17)

This paper proposes a deep semantic hashing model that uses a generative adversarial network to learn compact binary codes for the similar-image search task. In semantic hashing research, supervised models, which learn a hash function from labeled images, produce more accurate results than unsupervised models. The main reason is that the similarity criterion is not always based on the image content; it depends on how the annotator interprets the given image. But obtaining labeled images is expensive, so it is important to develop models that require fewer labeled images yet yield results comparable to supervised models.

Hence, the goal of this model is to generate more synthetic images in order to learn a better hash function. It leverages the GAN idea by training a conditional generator and discriminator. Intuitively, once we have a good generator that can produce an image given class labels, we have more image samples with known labels. The GAN framework fits this semi-supervised learning idea well.

The main model, DSH-GANs (Deep Semantic Hashing with GANs), has a generator and a discriminator. The input real image must have labels so that the generator can be conditioned to produce similar images (positive samples). We also select non-relevant labels so that the generator produces dissimilar images (negative samples). These images then go through a deep CNN to learn low-level features. Finally, the features from the last CNN layer are fed to three separate streams. The first is a hash stream, which learns the hash function using a maximum-margin ranking loss. The second is a classification stream, which forces images with similar labels to have similar hash codes; for example, two visually different images with the same labels can still end up with the same hash code. Finally, the adversary stream, part of the GAN framework, is needed to train DSH-GANs adversarially.

It is important to pre-train the generator first. The authors do not explain the main reason, but I think we want the generator to at least produce somewhat useful image samples; pre-training might help the GAN training converge to a good local optimum. Pre-training the generator is also based on adversarial training: we want the generator to produce images that preserve label information, while the discriminator must learn to detect synthetic images and predict the class labels. The pre-training step also utilizes unlabeled images: for any image with no labels, the authors set the class label to a zero vector, so the discriminator just needs to predict all zeros. There seems to be no good justification for choosing zeros, since it means the generator treats unlabeled images as having labels outside the label set of the corpus.

I think this model makes good use of the GAN idea, especially for semi-supervised learning. The key contribution is using GANs to generate more image samples, both positive and negative.