Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention (SIGIR’17)

This paper proposed a content-based recommendation model using an attention mechanism. The task is to recommend videos or images to a user based on the user’s interactions with the content, such as view counts or clicks. This is an implicit feedback problem.

The key motivation is that in video or image recommendation, users do not like an entire video or image but only pay attention to a subset of its content. For a video, only some frames matter to a user; for an image, only some regions are interesting. The attention mechanism is used to learn which components of an item each user actually cares about.

The model resembles SVD++ because the overall rating combines a latent factor model and a neighborhood model:

\hat R_{ij} = \left(\textbf{u}_i + \sum_{l \in R(i)} \alpha(i, l)\,\textbf{p}_l\right)^T \textbf{v}_j = \textbf{u}_i^T \textbf{v}_j + \sum_{l \in R(i)} \alpha(i, l)\,\textbf{p}_l^T \textbf{v}_j

The first term is a standard latent factor model: an embedding vector for the user and one for the item. The second term is a neighborhood model: the more similar item j is to the items user i has interacted with, the higher the predicted rating.

The difference is that SVD++ uses a uniform weight while this model learns the weights from the data. Furthermore, the neighborhood model does not take a dot product between two item embedding vectors but between the item vector and an auxiliary item vector \textbf{p}_l, which characterizes the user through the set of items they have interacted with.

The model architecture has three parts. The first part computes an item vector as a weighted combination of all its components, e.g. a vector representation of a video with 100 frames. This representation \bar x_l = \sum_m \beta(l, m) x_{lm} is a personalized item representation because the attention weight \beta(l, m) is computed by a nonlinear function b(i, l, m) that takes the user and the item components as input.
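Below is a minimal numpy sketch of this component-level pooling. The parameter names, dimensions, and the one-hidden-layer form of the scoring function b(i, l, m) are my own assumptions, not the paper’s exact architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def component_attention(u, X, W_u, W_x, w, b):
    """Pool the component features X (n_components x d_x) of one item into a
    personalized representation for user u.  The score b(i, l, m) is written
    here as a generic one-hidden-layer network."""
    scores = np.array([w @ np.maximum(W_u @ u + W_x @ x + b, 0.0) for x in X])
    beta = softmax(scores)   # beta(l, m): attention over the item's components
    return beta @ X          # personalized content vector \bar x_l
```

For a 100-frame video, X would hold one feature vector per frame, and the returned vector is a user-specific summary of that video.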

The second part computes a neighborhood item vector, \sum_{l \in R(i)} \alpha(i,l) \textbf{p}_l. The attention is computed by a function a(i, l) that takes the user i, the item l, the auxiliary item vector \textbf{p}_l, and the content vector \bar x_l. This neighborhood vector is then added to the user vector \textbf{u}_i.
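Continuing the sketch above (reusing `softmax` and the output of `component_attention`), the item-level attention and the final score could look like this; the exact inputs of a(i, l) are again an assumption.

```python
def acf_score(u, v_j, P, X_bar, att):
    """Item-level attention and the ACF score for user u and candidate item j.
    P holds the auxiliary vectors p_l of the items in R(i), X_bar their
    personalized content vectors, and `att` stands in for the scoring
    function a(i, l)."""
    scores = np.array([att(u, p_l, x_l) for p_l, x_l in zip(P, X_bar)])
    alpha = softmax(scores)      # alpha(i, l): attention over the neighborhood items
    user_repr = u + alpha @ P    # u_i + sum_l alpha(i, l) p_l
    return user_repr @ v_j       # dot product with the item embedding v_j
```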

The final part performs pairwise learning. The paper uses BPR (Bayesian Personalized Ranking) to minimize a ranking loss between positive and negative samples. The evaluation uses a leave-one-out protocol, which keeps each user’s last interaction as the test set, and measures hit rate and NDCG.
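A one-line sketch of the BPR objective for a single (positive, sampled negative) pair, assuming the scores come from `acf_score` above:

```python
def bpr_loss(score_pos, score_neg):
    """BPR pairwise loss: -log sigma(x_pos - x_neg), written with logaddexp
    for numerical stability."""
    return np.logaddexp(0.0, -(score_pos - score_neg))
```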

The hit rate of the proposed model is better than SVD++. I think the authors should have included FISM as a baseline since that model also performs well on implicit feedback tasks. Furthermore, SVD++ does not perform well on the Top-N task because it is optimized for RMSE; hence, SVD++ is not a strong baseline.

Overall, what I like about this work is the hierarchical model that constructs a personalized item representation based on attention. This idea makes a lot of sense because (1) people focus on different parts of the content, so the item vector should be personalized; and (2) the model has many parameters, so it works well on large video and image datasets.

References:

http://www.comp.nus.edu.sg/~xiangnan/papers/sigir17-AttentiveCF.pdf


Performance of Recommender Algorithms on Top-N Recommendation Tasks (RecSys’10)

This paper found that many recommender algorithms that optimize an error metric (RMSE) do not perform well on the Top-N task because these models do not account for all possible items, including unrated ones. The Top-N task is a more realistic evaluation than RMSE-based evaluation because it involves more than just the rated items in the dataset. Hence, it is in our interest to understand how to evaluate recommender models under the Top-N task.

The testing methodology is as follows (for the Netflix dataset):

  • The Netflix dataset has a training set and a probe set.
  • Create a training set M, which is the original Netflix training set.
  • Create a test set T containing all 5-star ratings from the probe set.
  • Train the model on the training set M.
  • For each item i rated 5 stars by user u in T:
    • randomly select 1000 additional items unrated by user u, assuming that user u is not interested in any item in this set;
    • predict ratings for the test item i and the additional 1000 items;
    • form a ranked list by ordering the 1001 items; the position of item i is p;
    • form a top-N recommendation list by picking the N top-ranked items from the list. If p <= N, we have a hit.

The evaluation metrics are:

  • recall(N) = # hits / |T|
  • precision(N) = # hits / (N * |T|) = recall(N) / N
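A small Python sketch of this protocol and the two metrics; `score_fn` is a stand-in for whatever trained model is being evaluated, and the helper names are mine.

```python
import numpy as np

def hit_for_test_item(score_fn, test_item, unrated_items, n=10, rng=None):
    """One trial: rank the 5-star test item against 1000 randomly sampled
    unrated items and check whether it lands in the top N."""
    rng = rng or np.random.default_rng()
    candidates = rng.choice(unrated_items, size=1000, replace=False)
    items = np.concatenate(([test_item], candidates))     # 1001 items total
    ranked = items[np.argsort(-score_fn(items))]          # best score first
    p = int(np.where(ranked == test_item)[0][0]) + 1      # 1-based rank of item i
    return p <= n

def recall_precision(num_hits, num_test, n):
    """recall(N) = #hits / |T|;  precision(N) = recall(N) / N."""
    recall = num_hits / num_test
    return recall, recall / n
```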

It turns out that many collaborative filtering algorithms do not perform well on the precision metric. The simple PureSVD outperforms SVD++. The authors explain that PureSVD considers all items, while SVD++ only looks at observed items.

Another finding from this paper is that the majority of ratings come from the most popular items. We can divide the dataset into two groups: popular items and non-popular items. Recommending popular items is then a relatively trivial task, while recommending non-popular (long-tail) items is more interesting and more difficult.

Hence, excluding the most popular items from the evaluation results in lower precision and recall. However, PureSVD still outperforms SVD++ and other matrix factorization methods.

Neural Autoregressive Collaborative Filtering for Implicit Feedback

This short paper extends CF-NADE to implicit feedback datasets. The main difference between implicit and explicit data is that implicit feedback has no negative samples. For example, the absence of clicks or view counts does not imply that a user dislikes an item; he/she may simply be unaware that the item exists.

The authors extend CF-NADE to the implicit feedback problem by predicting the probability of liking each item:

p(\textbf{t}) = \prod_{i=1}^M p(t_i | \textbf{t}_{<i})

where M is the number of items, t_i = 1 indicates that user u likes item i, and t_i = 0 that user u dislikes it. However, we do not actually observe whether user u likes item i, so we convert the implicit feedback r_{ui} to t_{ui} by setting t_{ui} = 1 when r_{ui} > 0 and t_{ui} = 0 otherwise.

Since binarizing implicit feedback is noisy, the confidence that user u likes or dislikes a given item should be included in the model. One way to estimate the confidence is:

c_{ui} = 1 + \alpha r_{ui}

The higher the implicit feedback score from user u, the higher the confidence.

Thus, the final model becomes:

p(\textbf{t}|\textbf{c}) = \prod_{i=1}^M p(t_i | \textbf{t}_{<i}, \textbf{c}_{<i})

This conditional probability is modeled autoregressively with a hidden state, and the confidence basically acts as a weight on the loss function:

C = -\sum_{i=1}^M c_i \log p(t_i|\textbf{t}_{<i}, \textbf{c}_{<i})
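A small numpy sketch of this weighted cost, assuming the model’s outputs p(t_i = 1 | t_<i, c_<i) are already available as a vector:

```python
import numpy as np

def implicit_nade_cost(t, p_like, r, alpha=1.0, eps=1e-12):
    """Confidence-weighted negative log-likelihood.
    t:      binary preferences (t_i = 1 when r_i > 0)
    p_like: model outputs p(t_i = 1 | t_<i, c_<i)
    r:      raw implicit counts, giving confidence c_i = 1 + alpha * r_i."""
    c = 1.0 + alpha * r
    log_p_observed = t * np.log(p_like + eps) + (1 - t) * np.log(1 - p_like + eps)
    return -np.sum(c * log_p_observed)
```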

Overall, this short paper provides a simple extension of CF-NADE: binarize the implicit feedback and add confidence variables to the original CF-NADE model.

References:

https://arxiv.org/pdf/1606.07674.pdf

FISM: Factored Item Similarity Models for Top-N Recommender Systems (KDD’13)

Sparsity is the main problem in top-N recommendation because we cannot observe all possible user-item interactions. This model aims to address it by borrowing ideas from two existing models: SLIM and NSVD.

The SLIM [1] model is a generalization of ItemKNN. Instead of imposing a similarity metric such as Pearson correlation to identify similar items, SLIM learns the similarity matrix from the data. Thus, the predicted ratings for user u are:

\tilde r_u = r_u \textbf{S}

This equation is simply a regression model: the rating of item i by user u is predicted as a linear combination of all items rated by user u. The SLIM optimization problem adds a constraint forcing the diagonal of S to be zero, which prevents self-prediction (using the rating of item i to predict the rating of item i). It also adds a non-negativity constraint so that the prediction is a positive aggregation over items, and an L1 regularizer on S to force sparsity.
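A minimal sketch of just the prediction step (the constrained learning of S is the hard part and is not shown here):

```python
import numpy as np

def slim_predict(R, S):
    """SLIM-style scores: each user's rating row times the learned item-item
    matrix S, with diag(S) forced to zero so an item never predicts itself."""
    S = S.copy()
    np.fill_diagonal(S, 0.0)
    return R @ S      # tilde r_u = r_u S for every user at once
```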

The NSVD [2] model assumes that the similarity matrix \textbf{S} can be factored into two low-rank matrices. The intuition is that two similar item embedding vectors should yield a higher similarity score. The predicted rating of item i by user u is:

\tilde r_{ui} = b_u + b_i + \sum_{j \in R_u} \textbf{p}_j \textbf{q}_i^T

This equation states that the rating of item i by user u is an aggregation of the similarity scores between item i and all items rated by user u (including item i itself!).

The FISM model extends NSVD with two simple changes:

  • It excludes item i from the NSVD formulation to avoid self-prediction.
  • It adds a normalization term to control the degree of agreement between the items rated by the user and the item whose rating is being estimated.

The FISM model is:

\tilde r_{ui} = b_u + b_i + (n_u^+)^{-\alpha} \sum_{j \in R_u \setminus \{i\}} \textbf{p}_j \textbf{q}_i^T

When \alpha is 1.0, the model averages all similarity scores: the predicted rating is high only when most of the rated items are similar to item i. On the other hand, when \alpha is 0.0, the scores are simply summed, so a few highly similar items are enough to push the predicted rating up.
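A small sketch of the FISM scoring rule under these assumptions (P and q_i are the two low-rank factors; parameter names are mine):

```python
def fism_score(b_u, b_i, P, q_i, rated_items, i, alpha=0.5):
    """FISM score for user u and item i: bias terms plus the normalized sum of
    similarities p_j . q_i over the user's rated items, excluding item i."""
    neighbors = [j for j in rated_items if j != i]
    if not neighbors:
        return b_u + b_i
    agg = sum(P[j] @ q_i for j in neighbors)              # aggregated similarity
    return b_u + b_i + len(neighbors) ** (-alpha) * agg   # (n_u^+)^(-alpha) scaling
```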

The FISM paper proposes two learning models: FISMrmse and FISMauc. The first minimizes the squared loss between the known and predicted ratings; the second minimizes a ranking loss that places higher-rated items before lower-rated ones.

The experimental results show that FISMrmse is better than FISMauc on the Hit Rate and Average Reciprocal Hit Rank metrics. NSVD is not among the baselines since it solves rating prediction rather than top-N recommendation. In my opinion, FISM is not very different from NSVD: FISMrmse is essentially the NSVD model with item i excluded and the normalization term added.

References:

[1] X. Ning and G. Karypis. Slim: Sparse linear methods for top-n recommender systems. In ICDM 2011.

[2] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, volume 2007, pages 5-8, 2007.

[3] Kabbur, Santosh, Xia Ning, and George Karypis. “Fism: factored item similarity models for top-n recommender systems.” Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.

A Neural Autoregressive Approach for Collaborative Filtering (ICML’16)

This paper proposed an autoregressive model for rating prediction. The idea is similar to RBM-CF for the collaborative filtering problem, but it replaces the RBM with NADE. Since NADE can be derived from a mean-field approximation of an RBM, and RBM-CF already works for collaborative filtering, it makes sense that a NADE-based model is applicable to the CF problem.

The NADE model estimates the probability of a rating sequence as P(\vec{r}) = \prod_{i=1}^D P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}}). The previously seen ratings are used to predict the current one, which implies that early ratings in the sequence directly influence every later prediction; this observation distinguishes NADE from an RNN.

The conditional probability distribution is defined as:

P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}}) = \frac{\exp(s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}{\sum_{l=1}^K\exp(s^l_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}

The model needs to estimate a score function s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) that gives a high score when the user is likely to give rating k to item m_{o_i} given her previous rating history.

The score function can be seen as a dot product of a user state with an item latent vector: s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) = b^k_{m_{o_i}} + V^k_{m_{o_i},:}\textbf{h}(\vec{r}_{m_{o_{<i}}}). The bias captures item popularity, and the dot product measures how well the user's current taste matches item m_{o_i}.

The hidden state captures the user's current interests, so the model can capture changes in behavior. It is defined as \textbf{h}(\vec{r}_{m_{o_{<i}}}) = g(\textbf{c} + \sum_{j<i}W^{r_{m_{o_j}}}_{:,m_{o_j}}), where g is an activation function, \textbf{c} is a bias, and W is an interaction matrix (one per rating value) that maps each rated item into the hidden space.
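A rough numpy sketch of the hidden state and the resulting rating distribution for one target item; the shapes here are a simplification of the paper's notation, not its exact parameterization:

```python
import numpy as np

def cfnade_rating_dist(rated_items, ratings, W, V_m, b_m, c, g=np.tanh):
    """W[k] is the (H x M) input matrix tied to rating value k; V_m[k] (H,) and
    b_m[k] are the output weights and bias of the target item m for rating k."""
    h = g(c + sum(W[k][:, j] for j, k in zip(rated_items, ratings)))  # user state h
    s = np.array([b_m[k] + V_m[k] @ h for k in range(len(V_m))])      # scores s^k_m
    e = np.exp(s - s.max())
    return e / e.sum()   # P(r_m = k | previous ratings): softmax over the K rating values
```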

CF-NADE can look complicated, but the main idea is the same as RBM-CF: each user has her own NADE network, and all weight parameters are shared across users to avoid overfitting. Furthermore, since each weight matrix is associated with a rating value, some matrices would be updated less frequently than others; the authors therefore change the formula for the hidden state (see Eq. 9 in the original paper) so that the matrices for rare ratings receive more updates.

To reduce the number of parameters further, a low-rank approximation is applied to the matrices W and V. In addition, an ordinal cost is introduced to improve prediction: if a user rates item m with rating k, her preference for rating k-1 should be higher than for k-2, and so on. In other words, if a user gave a rating of 3, she is more likely to have given a 2 than a 1, because 2 is closer to 3. The ordinal cost is modeled as a ranking loss, similar to list-wise learning to rank.

Overall, CF-NADE yields very good performance on popular CF datasets such as Netflix and MovieLens. The model is complicated and seems slow, but the ordinal cost, the low-rank approximation, and the ability to stack deeper layers make it attractive, and it might be worthwhile to investigate autoregressive models further for this task.

One-Class Collaborative Filtering (ICDM’08)

Negative samples are very important for learning an effective collaborative filtering model. In an implicit feedback CF problem, where we collect signals such as clicks or views, the unclicked or unviewed items can be either positive or negative examples. If we train a CF model without carefully selecting negative samples, the model will be biased because all missing data is treated as negative, even though some of it is actually unobserved positive feedback.

This classic paper proposes two treatments for missing values. The first treats all missing values as negative samples, but the key is to not penalize the model too heavily when it mispredicts one. The approach introduces a weight for each user-item pair in the training data: positive samples have a weight of 1, while negative samples have a much smaller weight, reflecting our uncertainty about whether a supposedly negative sample is actually positive.

The weight can be uniform for all missing data, or it can be user-based or item-based. User-based weighting assumes that if a user has already viewed many items, an unobserved item is more likely to be a true negative. Item-based weighting assumes that if a user has not viewed a popular item, he/she probably does not like it.
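A small sketch of these three weighting schemes on a binary interaction matrix, following the descriptions above; the scale delta and the normalizations are my own choices rather than the paper's exact formulas:

```python
import numpy as np

def missing_value_weights(R, scheme="uniform", delta=0.1):
    """Weights for missing (treated-as-negative) entries of R (users x items).
    Observed positives keep weight 1."""
    W = np.ones_like(R, dtype=float)
    miss = (R == 0)
    if scheme == "uniform":
        W[miss] = delta
    elif scheme == "user":    # active users: their unseen items are likely true negatives
        per_user = delta * R.sum(axis=1) / R.shape[1]
        W[miss] = np.broadcast_to(per_user[:, None], R.shape)[miss]
    elif scheme == "item":    # popular items left unseen: probably disliked
        per_item = delta * R.sum(axis=0) / R.shape[0]
        W[miss] = np.broadcast_to(per_item[None, :], R.shape)[miss]
    return W
```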

This weighting scheme yields a richer dataset that includes negative samples and thus improves the MAP of the CF model. However, since the model is trained with weighted ALS over the full matrix, the computational cost is high; to alleviate this, a sampling-based scheme is used.

We fix the number of negative samples to be fairly small and draw them according to the uniform, user-based, or item-based assumption. We then train ALS to reconstruct an approximate rating matrix. Since each sample set is much smaller, we repeat this several times and average all of the approximate rating matrices to obtain the final predicted rating matrix.

The experimental results show that the user-based assumption performs slightly better than the uniform and item-based assumptions. The uniform assumption still performs about as well as the item-based one because most missing or unlabelled items are negative samples anyway.

References:

[1] Pan, Rong, et al. “One-class collaborative filtering.” Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. IEEE, 2008.

http://rongpan.net/publications/pan-oneclasscf.pdf

Neural Collaborative Filtering (WWW’17)

This is another paper that applies deep neural networks to the collaborative filtering problem. Although there are already strong deep learning models for CF such as CF-NADE and AutoRec, the authors argue that those models target explicit feedback and position this work as solving the implicit feedback CF problem.

The model is straightforward and similar to the Neural Factorization Machine. The idea is to learn embedding vectors for users and items and model their interaction as a non-linear function via a multi-layer neural network. Such non-linear interactions are common in recent work such as the Neural Factorization Machine [2] and the Deep Relevance Matching Model [3].

The authors propose three models incrementally: (1) Generalized Matrix Factorization (GMF), which is basically MF with an additional non-linear transformation (a sigmoid in this case); (2) a Multi-Layer Perceptron (MLP), which concatenates the user and item embedding vectors and transforms them through learned non-linear layers; and (3) a fusion model, which can either share the same embedding vectors and combine the two branches at the last layer, or learn separate embeddings and concatenate the branches at the output layer.
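A minimal numpy sketch of the first two branches (the fusion model combines their outputs at the last layer); layer shapes and names are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gmf_score(u, v, h):
    """Generalized MF branch: element-wise product of the embeddings, a learned
    projection h, then a sigmoid; plain MF is the special case h = all ones."""
    return sigmoid(h @ (u * v))

def mlp_score(u, v, layers):
    """MLP branch: concatenate the embeddings and pass them through a stack of
    (W, b) layers with ReLU, ending in a scalar preference score."""
    x = np.concatenate([u, v])
    for W, b in layers[:-1]:
        x = np.maximum(W @ x + b, 0.0)
    W_out, b_out = layers[-1]
    return float(sigmoid(W_out @ x + b_out))
```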

Since the task is to predict whether a given item is preferred or not, it is a binary classification problem, and the loss function can be binary cross-entropy. To me, implicit feedback seems to be an easier problem than rating prediction because we only need to make a binary prediction.

The baselines seem okay, but I wish the authors had included more recent deep learning models such as [4]. The AutoRec model is also applicable to implicit feedback by forcing the output to be binary. Regardless of the baselines, the extensive experiments try to convince readers that a deep neural network can model complex interactions and improve performance.

Speaking of performance, the authors use hit ratio and NDCG: there is one held-out test item per user, and the model should give it a high score. These metrics are more practical than MSE or RMSE since the dataset is implicit feedback.

A non-linear interaction is a simple extension of the traditional MF model, and the authors show that this simple extension works well on the MovieLens and Pinterest datasets.

References:

[1] He, Xiangnan, et al. “Neural collaborative filtering.” Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

http://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf

[2] Xiangnan He, Tat-Seng Chua, “Neural Factorization Machines for Sparse Predictive Analytics”, ACM SIGIR 2017.

[3] Guo, Jiafeng, et al. “A deep relevance matching model for ad-hoc retrieval.” Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016.

[4] Zheng, Yin, et al. “Neural Autoregressive Collaborative Filtering for Implicit Feedback.” Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016.