Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention (SIGIR’17)

This paper proposed a content-based recommendation model with an attention mechanism. The task is to recommend videos or images to a user based on the user's interactions with the content, such as view counts or clicks. This is an implicit feedback problem.

The key motivation is that in video or image recommendation, users rarely like an entire video or image; they only pay attention to a subset of the content. For a video, only some frames appeal to the user; for an image, only some regions are interesting. The attention mechanism lets the model learn which components of an item matter to each user.

The model resembles SVD++ because the overall rating combines a latent factor model and a neighborhood model:

\hat{R}_{ij} = \left(\textbf{u}_i + \sum_{l \in R(i)} \alpha(i, l)\textbf{p}_l\right)^T \textbf{v}_j = \textbf{u}_i^T \textbf{v}_j + \sum_{l \in R(i)} \alpha(i, l)\textbf{p}_l^T \textbf{v}_j

The first term is a standard latent factor model: it learns an embedding vector for each user and item. The second term is a neighborhood model: the more similar item j is to the items that user i has interacted with, the higher the predicted rating.

The difference is that SVD++ uses uniform weights, whereas this model learns the weights from the data. Furthermore, the neighborhood model does not compute a dot product between two item embedding vectors but between an item vector and an auxiliary item vector \textbf{p}_l, which characterizes a user through the set of items she interacted with.

The model architecture has three parts. The first part computes an item vector as a weighted combination of all of its components, e.g., a vector representation of a video with 100 frames. This representation \bar{x}_l = \sum_m \beta(l, m) x_{lm} is a personalized item representation because each weight (attention) \beta(l, m) is computed by normalizing a nonlinear score function b(i, l, m) that takes the user i and component m of item l.
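
A minimal NumPy sketch of this component-level attention. The one-hidden-layer parametrization of b(i, l, m) and all shape/parameter names here are my assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Assumed shapes (mine): u_i is the user vector (d,), x_l stacks item l's
# component features (M, d_c), and W_u, W_x, w are hypothetical parameters
# of the score function b(i, l, m).
def component_attention(u_i, x_l, W_u, W_x, w):
    # b(i, l, m): a one-hidden-layer score for each component m of item l
    scores = np.array([w @ np.tanh(W_u @ u_i + W_x @ x_m) for x_m in x_l])
    beta = softmax(scores)               # attention over the M components
    return (beta[:, None] * x_l).sum(0)  # personalized representation x̄_l
```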

The second part computes a neighborhood item vector, \sum_{l \in R(i)} \alpha(i, l) \textbf{p}_l. The attention is computed from a score function a(i, l) that takes the user vector \textbf{u}_i, the item vector \textbf{v}_l, the auxiliary item vector \textbf{p}_l, and the component-level representation \bar{x}_l. This neighborhood vector is then added to the user vector, as in the prediction formula above (see the sketch below).
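
Continuing the sketch above, a hedged version of the item-level attention and the SVD++-style prediction; a simple linear score over the concatenated inputs stands in for a(i, l), which is an assumption of mine:

```python
# P stacks the auxiliary vectors p_l for items in R(i) (n, d), xbar stacks
# the x̄_l from the previous step (n, d_c), and W parametrizes the assumed
# linear score a(i, l). Reuses softmax from the sketch above.
def predict(u_i, v_j, P, xbar, W):
    feats = np.hstack([np.tile(u_i, (len(P), 1)), P, xbar])
    alpha = softmax(feats @ W)                 # attention over neighbors
    u_aug = u_i + (alpha[:, None] * P).sum(0)  # augmented user vector
    return u_aug @ v_j                         # \hat{R}_{ij}
```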

The final part performs pair-wise learning. The paper uses BPR (Bayesian Personalized Ranking) to minimize the ranking loss between positive and negative samples. The evaluation uses a leave-one-out protocol, keeping each user's last interaction with an item as the test set, and measures hit ratio and NDCG.
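
The BPR objective on a single (positive, negative) pair is simple enough to state in a few lines, continuing the NumPy sketch:

```python
# BPR pairwise loss: maximize log σ(r_pos − r_neg), i.e. rank the item the
# user actually interacted with above a sampled negative item.
def bpr_loss(r_pos, r_neg):
    return -np.log(1.0 / (1.0 + np.exp(-(r_pos - r_neg))))

# e.g. loss = bpr_loss(predict(u, v_pos, P, xbar, W),
#                      predict(u, v_neg, P, xbar, W))
```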

The hit ratio of the proposed model is better than SVD++'s. I think the authors should have included FISM as a baseline since it also performs well on implicit feedback tasks. Furthermore, SVD++ does not perform well on Top-N tasks because it is optimized for the RMSE metric; hence, SVD++ is not a strong baseline.

Overall, what I like about this work is the hierarchical model that constructs a personalized item representation via attention. This idea makes a lot of sense because (1) people focus on different parts of the content, so the item vector should be personalized; and (2) the model has many parameters, so it can exploit large video and image datasets.

References:

[1] http://www.comp.nus.edu.sg/~xiangnan/papers/sigir17-AttentiveCF.pdf

A Neural Autoregressive Approach for Collaborative Filtering (ICML’16)

This paper proposed an autoregressive model for rating prediction. The idea is similar to RBM-CF for the collaborative filtering problem, but it replaces the RBM with NADE. Since NADE can be seen as a mean-field approximation of an RBM, and RBM-CF works for collaborative filtering, it makes sense that a NADE model is applicable to the CF problem.

The NADE model estimates the probability of a rating sequence as P(\vec{r}) = \prod_{i=1}^D P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}}), so the previously seen ratings are used to predict the current rating. Every conditional has direct access to all previously observed ratings instead of compressing them into a recurrent state, so an early rating can strongly influence the current prediction; this is what distinguishes NADE from an RNN.

The conditional probability distribution is defined as:

P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}}) = \frac{\exp(s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}{\sum_{l=1}^K\exp(s^l_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}

The model needs to estimate a score function s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) that gives a high score when the user is likely to give rating k to item m_{o_i}, given her previous rating history.

The score function can be seen as a dot product between a user state and an item latent vector: s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) = b^k_{m_{o_i}} + V^k_{m_{o_i},:}\textbf{h}(\vec{r}_{m_{o_{<i}}}). The bias captures the item's popularity, and the dot product measures the similarity between the user's current taste and item m_{o_i}.

The hidden state captures the user's current interest, which means the model can capture changes in behavior. It is defined as \textbf{h}(\vec{r}_{m_{o_{<i}}}) = g(c + \sum_{j<i} W^{r_{m_{o_j}}}_{:,m_{o_j}}), where g is an activation function, c is a bias, and W is an interaction matrix (one per rating value) whose column m_{o_j} maps the observed rating of item m_{o_j} into the hidden space.
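
A small NumPy sketch of the two equations above; the tensor shapes are my assumptions (K rating values, H hidden units, M items), not the paper's exact layout:

```python
import numpy as np

# W has shape (K, H, M): one H x M interaction matrix per rating value.
# `history` lists the (item m, rating k) pairs observed so far.
def hidden_state(W, c, history, g=np.tanh):
    return g(c + sum(W[k][:, m] for m, k in history))

# V has shape (K, M, H) and b has shape (K, M).
def rating_distribution(V, b, h, m):
    # s^k_m = b^k_m + V^k_{m,:} · h, softmax over the K rating values
    s = np.array([b[k][m] + V[k][m] @ h for k in range(len(V))])
    e = np.exp(s - s.max())
    return e / e.sum()
```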

NADE-CF can look complicated, but the main idea is the same as RBM-CF: each user has her own NADE network, and all weight parameters are shared among users to avoid overfitting. Further, since each weight matrix is associated with a rating value, the matrices for rare ratings would be updated less frequently than others. The authors therefore change the formula for the hidden state (see eq. 9 in the original paper) so that the matrices for rare ratings are updated more often.

To reduce the number of parameters further, a low-rank matrix approximation is applied to the matrices W and V. In addition, to improve prediction further, an ordinal cost is introduced. The idea is that if a user rates item m with rating k, her preference for rating k-1 should be higher than for k-2, and so on. In other words, if a user gave a rating of 3, she is more likely to give a rating of 2 than a rating of 1, because 2 is closer to 3. The ordinal cost is therefore modeled as a rank loss, similar to list-wise learning-to-rank models.
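
The low-rank trick is easy to sketch on the W tensor from above: each H x M matrix W^k becomes a product of two thin matrices. Whether the second factor is shared across rating values is my simplification, not necessarily the paper's exact factorization:

```python
import numpy as np

K, H, M, d = 5, 500, 17770, 32           # e.g. a Netflix-sized item count
A = np.random.randn(K, H, d) * 0.01      # per-rating low-rank factors A^k
B = np.random.randn(d, M) * 0.01         # assumed-shared low-rank factor
W = np.einsum('khd,dm->khm', A, B)       # W^k = A^k B, full tensor on demand
# parameter count drops from K*H*M to K*H*d + d*M, since d << min(H, M)
```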

Overall, NADE-CF yields very good performance on popular CF datasets such as Netflix and MovieLens. The model is complicated and seems slow; however, the ordinal cost, the low-rank approximation, and the ability to stack deeper layers make it more attractive, and it might be worthwhile to investigate autoregressive models for this task.

Autoencoding beyond pixels using a learned similarity metric (ICML’16)

One of the key components of an autoencoder is the reconstruction error. This term measures how much useful information is compressed into the learned latent vector. The common reconstruction error is an element-wise measurement, such as binary cross entropy for a black-and-white image or the squared error between the reconstructed image and the input image.

The authors argue that an element-wise measurement is not an accurate indicator of the quality of a learned latent vector. Hence, they propose to learn a similarity metric via adversarial training. Here is how they set up the objective functions:

They use a VAE to learn a latent vector for the input image. There are two loss functions in a vanilla VAE: a KL loss and the negative log-likelihood of the data. They replace the second loss with a new reconstruction loss, described in the next paragraph. Then they add a discriminator that tries to distinguish real input data from data generated by the VAE's decoder. The discriminator encourages the VAE to learn a stronger encoder and decoder.

The discriminator can be decomposed into two parts. If it has L + 1 layers, the first L layers form a transform function that maps the input into a new representation, and the last layer is a binary classifier. Passing any input through the first L layers yields a representation that the last layer can classify easily. When the discriminator is good at detecting real inputs, the representation at layer L separates real from fake much more easily than the raw pixels do. Consequently, the squared error between a transformed input and its transformed reconstruction should be small exactly when the two inputs are perceptually similar.
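
A minimal PyTorch sketch of this learned-similarity reconstruction loss. I assume a helper disc_features(x) that returns the activations of the discriminator's L-th layer; the paper phrases the loss as a Gaussian likelihood on these features, which reduces to a squared error up to constants:

```python
import torch.nn.functional as F

def feature_recon_loss(disc_features, x, x_recon):
    f_real = disc_features(x)          # layer-L features of the input
    f_fake = disc_features(x_recon)    # layer-L features of the VAE output
    return F.mse_loss(f_fake, f_real)  # compare in feature space, not pixels
```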

This model is trained in the same fashion as GANs: the VAE and the discriminator are trained simultaneously. The idea works well for images because squared error is not a good metric of image quality. It may work on text datasets as well, because we assess the quality of reconstructed text on the whole input rather than evaluating one word at a time.

References:

[1] https://arxiv.org/abs/1512.09300

Adversarial Variational Bayes

A variational autoencoder (VAE) requires an expressive inference network in order to learn a complex posterior distribution. A more expressive inference network yields a better posterior approximation and, in turn, higher-quality generated data.

This work uses adversarial training to learn a function T(x, z) that approximates \log q_{\phi}(z|x) - \log p(z). The expectation of this term w.r.t. q_{\phi}(z|x) is in fact the KL-divergence term. Since the authors prove that the optimal T^*(x, z) = \log q_{\phi}(z|x) - \log p(z), the ELBO becomes:

E_{p_{\mathcal{D}}(x)}E_{q_{\phi}(z|x)}[-T^*(x,z) + \log p_{\theta}(x|z)]

To approximate T^*(x, z), the discriminator needs to learn to distinguish samples from the prior pairing p_{\mathcal{D}}(x)p(z) from samples of the current inference model p_{\mathcal{D}}(x)q_{\phi}(z|x). Thus, the objective function for the discriminator is set up as:

\max_T E_{p_{\mathcal{D}}(x)}E_{q_{\phi}(z|x)}[\log \sigma(T(x,z))] + E_{p_{\mathcal{D}}(x)}E_{p(z)}[\log(1 - \sigma(T(x,z)))] \quad (1)

Taking the gradient of T(x, z) w.r.t. the parameters \phi could be problematic because the optimum of this function depends on q_{\phi}(z|x). However, the authors show that the expectation of the gradient of T^*(x, z) w.r.t. \phi is zero, so T^*(x, z) can be treated as a constant with respect to \phi and contributes no parameter update.

Since T(x, z) requires a sample z, the reparametrization trick is applied and the ELBO becomes:

E_{p_{\mathcal{D}}(x)}E_{\epsilon}[-T^*(x, z_{\phi}(x, \epsilon)) + \log p_{\theta}(x|z_{\phi}(x, \epsilon))] \quad (2)

This step is crucial because sampling is now just a deterministic transformation of noise, and T^*(x, z) approximates the KL-divergence term. This makes the model a black box: we never explicitly define the distribution q_{\phi}(z|x).

The model optimizes equations (1) and (2) with adversarial training: it optimizes eq. (1) for several steps, to keep T(x, z) close to optimal, while jointly optimizing eq. (2).
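
A sketch of one such training step in PyTorch, under assumed module signatures (mine, not the authors' code): enc(x, eps) returns a reparametrized sample z, dec(z) returns Bernoulli logits over x, and T(x, z) returns a scalar logit; opt_T and opt_vae optimize the discriminator and the encoder/decoder respectively:

```python
import torch
import torch.nn.functional as F

def avb_step(x, enc, dec, T, opt_T, opt_vae, z_dim, n_disc=5):
    # eq. (1): several discriminator steps to keep T close to T*
    for _ in range(n_disc):
        eps = torch.randn(x.size(0), z_dim)
        z_q = enc(x, eps).detach()           # sample from q_phi(z|x)
        z_p = torch.randn_like(z_q)          # sample from the prior p(z)
        loss_T = -(F.logsigmoid(T(x, z_q)).mean()
                   + F.logsigmoid(-T(x, z_p)).mean())
        opt_T.zero_grad(); loss_T.backward(); opt_T.step()

    # eq. (2): VAE step; T* stands in for the KL term and its own
    # parameters receive no update here (gradients flow only through z)
    eps = torch.randn(x.size(0), z_dim)
    z = enc(x, eps)
    recon = -torch.distributions.Bernoulli(logits=dec(z)).log_prob(x).sum(-1)
    loss_vae = (T(x, z) + recon).mean()      # negative ELBO estimate
    opt_vae.zero_grad(); loss_vae.backward(); opt_vae.step()
```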

The adaptive contrast technique is used to keep T(x, z) sufficiently close to the optimum. Basically, the KL term in the ELBO is replaced by KL(q_{\phi}(z|x) \| r_{\alpha}(z|x)), where r_{\alpha}(z|x) is an auxiliary distribution, e.g., a Gaussian.

This model has connections to the variational autoencoder, the adversarial autoencoder, f-GANs, and BiGANs. Training a VAE via adversarial training allows us to use a flexible inference network that better approximates the true posterior over the latent vectors.

References:

[1] https://arxiv.org/abs/1701.04722

LDA2Vec

This paper proposed an extension of word2vec that adds document and topic vectors, inspired by LDA (Latent Dirichlet Allocation). The model can discover linear relationships between words, e.g., "California + technology = Silicon Valley". Topics are interpretable: collect the word vectors nearest to a selected topic vector. In short, the work augments word2vec with topic modeling, trained in a similar fashion to word2vec.

The key difference in LDA2Vec is its loss function, which has two parts. The first is the skip-gram negative sampling (SGNS) loss, as in word2vec: maximize the probability of the target word \vec{w}_j given a context vector \vec{c}_j while minimizing it for sampled non-target (negative) words. This loss teaches the model to distinguish a positive word, related to the given context, from negatively sampled words.

The innovation is the context vector \vec{c}_j. The intuition is that predicting a nearby word given a pivot word also depends on the theme of the document. For example, if the document is about airlines and we want to predict nearby words given the word "Germany", we would expect words related to airlines rather than country names. Thus, the context vector is the sum of a word vector and a document vector, \vec{c}_j = \vec{w}_j + \vec{d}_j; LDA2Vec thereby captures both the document-wide theme and the local interactions between words within the context window.
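
A small NumPy sketch of the context vector and the SGNS loss; the array names are mine, with w_target the positive word's output vector and w_negs stacking the sampled negative words' output vectors:

```python
import numpy as np

def context_vector(word_vecs, doc_vecs, pivot_idx, doc_idx):
    return word_vecs[pivot_idx] + doc_vecs[doc_idx]   # c_j = w_j + d_j

def sgns_loss(c_j, w_target, w_negs):
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    return -(np.log(sig(c_j @ w_target))              # pull target close
             + np.log(sig(-(w_negs @ c_j))).sum())    # push negatives away
```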

To learn topic vectors, the document vector is further decomposed into a linear combination of topic vectors, \vec{d}_j = \sum_{k} p_{jk} \cdot \vec{t}_k, where p_{jk} is the probability that document j is about topic k. The interpretability then comes from the sparsity of the topic assignment vector p_j. One way to enforce sparsity is to add the Dirichlet term:

L^{d} = \lambda \sum_{jk} (\alpha - 1)\log p_{jk}

When \alpha < 1, this encourages each topic assignment distribution to put most of its mass on a small set of topics.
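
Continuing the sketch, the topic decomposition and the Dirichlet term from above; treating p_j as a softmax over free topic logits is my assumption for differentiability:

```python
def doc_vector(topic_vecs, topic_logits):
    p = np.exp(topic_logits - topic_logits.max())
    p /= p.sum()                          # p_j lives on the simplex
    return p @ topic_vecs, p              # d_j = sum_k p_jk * t_k

def dirichlet_term(p, alpha=0.7, lam=1.0):
    # λ Σ (α − 1) log p_jk: the Dirichlet log-prior up to a constant;
    # with α < 1 it favors sparse topic assignments p_j
    return lam * ((alpha - 1.0) * np.log(p)).sum()
```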

The results look interesting. This paper shows a simple way to combine topic modeling with word embeddings: by embedding document and topic vectors in the same semantic space as word vectors, we can learn global semantic structure as well as word-level local interactions.

References:

[1] https://arxiv.org/pdf/1605.02019.pdf

[2] http://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/

[3] Skip-gram tutorial part1: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[4] Skip-gram tutorial part2: http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

Neural Collaborative Filtering (WWW’17)

This is another paper that applies deep neural networks to the collaborative filtering problem. Although there are a few notable deep learning models for CF, such as CF-NADE and AutoRec, the authors point out that those models solve explicit feedback problems and position this work as solving the implicit feedback CF problem.

The model is straightforward and similar to the Neural Factorization Machine: find embedding vectors for users and items and model their interaction as a non-linear function via a multi-layer neural network. Non-linear interactions have been used in many recent works, such as the Neural Factorization Machine [2] and the Deep Relevance Matching Model [3].

The authors propose three models incrementally: (1) Generalized Matrix Factorization (GMF), which is basically MF with an additional non-linear transformation, in this case a sigmoid; (2) a Multi-Layer Perceptron (MLP), which concatenates the user and item embedding vectors and transforms them with a learned non-linear function; and (3) NeuMF, a fusion model that either shares the same embedding vectors and adds the two branches at the last layer, or learns separate embedding vectors and concatenates them at the output layer.
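
A minimal PyTorch sketch of the fused model with separate embeddings per branch; the embedding size and single MLP layer are my simplifications, not the paper's tuned tower configuration:

```python
import torch
import torch.nn as nn

class NeuMF(nn.Module):
    def __init__(self, n_users, n_items, d=16):
        super().__init__()
        self.gmf_u = nn.Embedding(n_users, d)   # GMF branch embeddings
        self.gmf_i = nn.Embedding(n_items, d)
        self.mlp_u = nn.Embedding(n_users, d)   # separate MLP embeddings
        self.mlp_i = nn.Embedding(n_items, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.out = nn.Linear(2 * d, 1)          # concat of both branches

    def forward(self, u, i):
        gmf = self.gmf_u(u) * self.gmf_i(i)     # element-wise MF interaction
        mlp = self.mlp(torch.cat([self.mlp_u(u), self.mlp_i(i)], -1))
        return torch.sigmoid(self.out(torch.cat([gmf, mlp], -1)))
```

The sigmoid output can then be trained with binary cross-entropy against observed interactions and sampled negatives, which is exactly the classification framing discussed next.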

Since they want to predict whether a given item is preferred or not, this is a binary classification problem, so the loss function can be binary cross-entropy. To me, implicit feedback seems an easier problem than rating prediction because we only need a binary prediction.

The baselines seem okay, but I wish the authors had included more recent deep learning models such as [4]. The AutoRec model is also applicable to implicit feedback by forcing the output to be binary. Regardless of the baselines, the extensive experiments try to convince readers that a deep neural network can model complex interactions and improve performance.

Speaking of performance, the authors use hit ratio and NDCG. Basically, there is one held-out test sample per user, and the model should rank it highly among a sample of unobserved items. These metrics are more practical than MSE and RMSE given that this is an implicit feedback dataset.
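
Both metrics are simple under this leave-one-out protocol with a single relevant item; a sketch where rank is the 0-based position of the held-out item in the scored candidate list:

```python
import numpy as np

def hit_ratio(rank, k=10):
    return float(rank < k)               # did the test item make the top-k?

def ndcg(rank, k=10):
    # with one relevant item, NDCG@k reduces to 1 / log2(rank + 2)
    return 1.0 / np.log2(rank + 2) if rank < k else 0.0
```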

Non-linear interaction is a simple extension of the traditional MF model, and the authors show that this simple extension works well on the MovieLens and Pinterest datasets.

References:

[1] He, Xiangnan, et al. “Neural collaborative filtering.” Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

http://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf

[2] Xiangnan He, Tat-Seng Chua, “Neural Factorization Machines for Sparse Predictive Analytics”, ACM SIGIR 2017.

[3] Guo, Jiafeng, et al. “A deep relevance matching model for ad-hoc retrieval.” Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016.

[4] Zheng, Yin, et al. "Neural Autoregressive Collaborative Filtering for Implicit Feedback." Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016.

IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models (SIGIR’17)

This paper uses the GANs framework to combine generative and discriminative information retrieval models. It shows promising results on web search, item recommendation, and question answering tasks.

Typically, relevance models are classified into two types:

  • Generative retrieval model
    • It generates (selects) a document given a query and a relevance level; the model is p(d|q, r).
  • Discriminative retrieval model
    • It computes a relevance score for a given query-document pair; the model is p(r|d, q).

The generative model tries to find a connection between documents and queries, whereas the discriminative model learns the interaction between a query and a document through relevance scores.

Both models have shortcomings. Many generative models require a predefined data-generating story, and a wrong assumption leads to poor performance; the generative model usually fits the data without external guidance. Meanwhile, the discriminative model requires a lot of labeled data to be effective, especially for deep neural network models.

By training both models in the GANs framework, their shortcomings can be addressed. The generative model becomes adaptive because the discriminator rewards it whenever it creates or selects good samples. This adaptive guidance from the discriminator is unique to the GANs framework and helps the generator learn to pick good samples from the data distribution. At the same time, the discriminator receives additional training data from the generative model, similar to how semi-supervised learning utilizes unlabeled data.

Adversarial training thus improves both the generative and discriminative models by jointly learning through the minimax game. Traditional maximum likelihood training has no principled way for the two models to give each other feedback.

The paper seems promising and its results on the three information retrieval tasks are really good. But I notice that the training procedure requires pretraining, which makes me wonder how much of the performance boost at test time comes from it. I could not find a part of the paper that explains the benefit of pretraining in their setting.

The discriminative model is straightforward: it is a sigmoid function that gives a high probability when the given document-query pair is relevant. The generative model is more interesting. In standard GANs, the generator creates a sample from a simple distribution, but IRGAN does not generate a new document-query pair; instead, the generator selects a sample from the document pool. In my opinion, this is simpler than creating new data because the sample is realistic, and since IRGAN only needs a function that computes relevance scores, generating completely new data is unnecessary.

However, the cost function for the generator is an expectation over all documents in the corpus, and a Monte Carlo approximation of it has high variance. They therefore use the policy gradient (REINFORCE), which handles the discrete sampling from p(d|q, r): although p(d|q, r) is a discrete distribution, backprop is applicable because the documents are sampled from p(d|q, r) beforehand, making eq. 5 differentiable. Extra care is needed to reduce the variance further; they use an advantage function as a baseline (see the reference on reinforcement learning [2]).
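
A hedged REINFORCE sketch of the generator update. Here p_theta(d|q) is a softmax over the generator's document scores, the reward is whatever signal the discriminator provides for the sampled document, and the baseline implements the advantage; the function and argument names are mine:

```python
import numpy as np

def generator_score_grad(scores, sampled_idx, reward, baseline=0.0):
    p = np.exp(scores - scores.max())
    p /= p.sum()                      # p_theta(d|q) over the document pool
    grad_log_p = -p                   # d/ds_j log softmax_i = 1{j=i} - p_j
    grad_log_p[sampled_idx] += 1.0
    return (reward - baseline) * grad_log_p   # REINFORCE gradient estimate
```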

How positive and negative samples are generated is still confusing in this paper and seems application specific. The authors mention using a softmax with a temperature hyper-parameter to put more or less focus on top documents. My guess is that with less focus on top documents, the generator has more chance to pick up negative samples. After reading the paper again, it seems that all samples selected by the generator are treated as negative samples for the discriminator. This part remains unclear, and I would need to ask the authors for details.
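
The temperature trick itself is a one-liner, continuing the NumPy sketch above:

```python
def softmax_with_temperature(scores, t=1.0):
    # t < 1 concentrates probability on top-scored documents;
    # t > 1 flattens the distribution toward lower-ranked ones
    s = np.asarray(scores) / t
    e = np.exp(s - s.max())
    return e / e.sum()
```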

In conclusion, I like this paper because it combines generative and discriminative retrieval models via the GANs framework. I would not have appreciated it if they had simply applied GANs to IR blindly; instead, they explain their motivation and discuss the advantages of jointly training both models. It seems adversarial training is useful for IR tasks as well.

References:

[1] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models. In Proceedings of SIGIR’17, Shinjuku, Tokyo, Japan, August 7-11, 2017, 10 pages.

[2] Richard S. Sutton, David A. McAllester, Satinder P. Singh, Yishay Mansour, et al. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NIPS.

[3] IRGAN code (http://lantaoyu.com/publications/IRGAN)