# Neural Collaborative Filtering (WWW’17)

This is another paper that applies deep neural network for collaborative filtering problem. Although there are a few outstanding deep learning models for CF problems such as CF-NADE and AutoRec, the author claims that those models are solving for explicit feedback and positioned this work to solve for ‘implicit feedback CF’ problem.

The model is straightforward and similar to Neural Factorization Machine. The idea is to find embedding vectors for users and items and model their interaction as a non-linear function via a multi-layers neural network. A non-linear interaction has been commonly used in many recent works such as Neural Factorization Machine [2], Deep Relevant Matching Model [3].

The authors proposed 3 models incrementally: (1) Generalized Matrix Factorization – which is basically MF with additional non-linear transformation. In this case, they use sigmoid function; (2) Multi-layer Perceptron – which concatenate user and item embedded vectors and transform them by a learned non-linear function; (3) Fusion model can be either shared the same embedding vectors and add them up at the last layers or learned separate embedding vectors and concatenate them at the output layer.

Since they want to predict either the given item is preferable or not, it is a binary classification problem. Then the loss function can be a binary cross-entropy. To me, implicit feedback seems to be an easier problem than rating prediction problem because we only need to make a binary prediction.

The baseline seems okay but I wish the authors include more recent deep learning models such as [4]. The AutoRec model is also applicable for an implicit feedback by forcing the output to be a binary output. Regardless of their baseline, the extensive experiments tried to convince readers that deep neural network can model a complex interaction and will improve the performance.

Speak of the performance, the author uses hit-ratio and NDCG. Basically, there is one test sample for each user. The model tries to give a high score for that sample. This metric is more practical than MSE and RMSE since the dataset is implicit feedback dataset.

Non-linear interaction is a simple extension to a traditional MF model. This author shows that this simple extension does work for MovieLens and Pinterest dataset.

Reference:

[1] He, Xiangnan, et al. “Neural collaborative filtering.” Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

http://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf

[2] Xiangnan He, Tat-Seng Chua, “Neural Factorization Machines for Sparse Predictive Analytics”, ACM SIGIR 2017.

[3] Guo, Jiafeng, et al. “A deep relevance matching model for ad-hoc retrieval.” Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016.

[4] Zheng, Yin, et al. “Neural Autoregressive Collaborative Filtering for Implicit Feedback.” Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016. APA

# IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models (SIGIR’17)

This paper uses GANs framework to combine generative and discriminative information retrieval model. It shows a promising result on web search, item recommendations, and Q/A tasks.

Typically, many relevant models are classified into 2 types:

• Generative retrieval model
• It generates a document given a query and possibly relevant score. The model is $p(d|q, r)$.
• Discriminative retrieval model
• It computes a relevance score for the given query and document pair. The model is $p(r| d, q)$.

The generative model tried to find a connection between document and query. On the other hand, the discriminative model attempts to model the interaction between query and document based on relevance scores.

Both models have their shortcoming. Many generative models require a predefined data generating story. The wrong assumption will lead to the poor performance. The generative model is usually trying to fit the data to its model without external guidance. Meanwhile, the discriminative model requires a lot of labeled data to be effective, especially for a deep neural network model.

By train both models using GANs framework, it is now possible to solve their shortcoming. The generative model is now adaptive because the discriminator will reward the generative model when it can create or select good samples. This adaptive guidance from the discriminator is unique in GANs framework and will help the generator learns to pick good samples from the data distribution. At the same time, the discriminator can receive even more training data from the generative model. This is similar to semi-supervised learning where unlabeled data are utilized. Adversarial training allows us to improve both generative and discriminative models via jointly learning through the

Adversarial training allows us to improve both generative and discriminative models via jointly learning through the minimax training. The traditional training based on maximum likelihood does not have principle way to allow both models to give each other feedbacks.

The paper seems to be promising and their results on 3 information retrieval tasks are really good. But I notice that their training procedure requires pretraining. This made me wonder if pre-training is part of the performance boost during testing. I don’t find the part in the paper that explains the benefit of pretraining in their settings.

The discriminative model is straight forward. It is a sigmoid function. The discriminator basically gives a high probability when the given document-query pair is relevant. The generative model is more interesting. In the standard GANs, the generator will create a sample from a simple distribution, but IRGAN does not generate a new document-query pair. Instead, the author chose to let the generator select the sample from the document pool. In my opinion, this approach is simpler than creating a new data because the sample is realistic. Also, IRGAN cares about finding a function to compute a relevance score so it is unnecessary to generate a completely new data.

However, the cost function for the generator is an expectation over all documents in the corpus. The Monte Carlo approximation will have a high variance. Thus, they use policy gradient to reduce the variance so that the model can learn a useful representation. Although $p(d|q,r)$ is a discrete distribution,  the backprop is applicable because we pre-sample all documents from $p(d|q,r)$ beforehand. Thus, eq. 5 is differentiable. The extra care may need in order to reduce variance further. They use an advantage function. (Please look at the reference on Reinforcement Learning [2]).

Generating positive and negative samples are still confusing in this paper. It seems to be application specific. The author mentioned about using softmax with temperature hyper-parameter to put more or less focus on top documents. My guess is when we put less focus on top documents, the generator has more chance to pick up more negative samples. After I read the paper again, it seems that all samples selected by the generator model are negative samples. This part remains unclear and I need to ask the author for more details.

In conclusion, I like this paper because it tried to combine generative and discriminative retrieval models via GANs framework. I would not be appreciated if they simply applied the IR task to GANs blindly. Instead, they explained their motivation and discussed the advantage of jointly train both models. It seems adversarial training is useful for IR tasks as well.

References:

[1] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models. In Proceedings of SIGIR’17, Shinjuku, Tokyo, Japan, August 7-11, 2017, 10 pages.

[2] Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, and others. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NIPS.

[3] IRGAN code (http://lantaoyu.com/publications/IRGAN)

# My takes on GANs (Generative Adversarial Nets)

What is the fuss with GANs? Everybody loves GANs now. The joke I heard from my advisor is “If your current model does not work, try it aGAN !” (pun-intended).  I realize that it comes down to its adversarial objective function. Some people may think it is sexier than VAE and NADE’s objective function in which they just maximize the log-probability – the likelihood that the model will maximize the given data samples. For VAE, we will maximize the evidence lower-bound (ELBO) while NADE will maximize the conditional log probability. GANs simultaneously maximize and minimize two cost functions.

GANs’ framework has two components: data generator and discriminator. The generator will attempt to create a sample that is hopefully similar to sample from the actual data while the discriminator needs to guess if the given sample is fake or real. This process can be viewed as a competitive game where a generator tries to fool the discriminator. The game is complete when the discriminator is unable to distinguish a generated sample from a real one. On the other hand, we can also look at GANs as a cooperating game where discriminator helps the generator to create a more realistic sample by providing a feedback.

There are 2 cost functions: one for the generator and another for the discriminator. The cost function for the discriminator is a binary cross entropy:

$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\frac{1}{2}E_{x \sim p_{\text{data}}}[\log D(x)] - \frac{1}{2}E_z[\log(1 - D(G(z)))]$

There are two set of mini-batches: one from the real samples and one coming from the generator. The optimal discriminator will predict the chance of being a true sample as $\frac{p_{\text{data}}}{p_{\text{data}} + p_{\text{model}}}$.

There are many cost functions for a generator. The original paper proposed 2 cost functions. If we construct this learning procedure as a zero-sum game, then the cost function for the generator is simply, $J^{(G)} = -J^{(D)}$. We simply want to minimize the likelihood of the discriminator for being correct.

Goodfellow [1] mentioned that the above cost function has a vanish gradient problem when the sample is likely to be fake. Hence, to overcome this, he proposed a new cost function for the generator by maximizing the likelihood of the discriminator for being wrong. The new cost function is:

$J^{(G)} = -\frac{1}{2}E_z[\log D(G(z))]$

The new cost function is more useful in practice but the former function is useful for theoretical analysis, in which we will discuss next. For now, the objective function is:

$V(D,G) = \min_G \max_D J^{(D)}(\theta^{(D)}, \theta^{(G)})$

Goodfellow shows that learning the zero-sum game is the same as minimizing the Jensen-Shannon divergence. Given that we have an optimal discriminator, the cost function is now:

$C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log D^*(x)] + E_z[\log(1 - D^*(G(z)))]$

Since we know the optimal discriminator, we now have:

$C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log (\frac{p_{\text{data}}}{p_{\text{data}} + p_{\text{model}}})] + E_z[\log(\frac{p_{\text{model}}}{p_{\text{data}} + p_{\text{model}}})]$

Then, simplify further:

$C(G) = \text{KL}(p_{\text{data}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \text{KL}(p_{\text{model}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \log({\frac{1}{4}})$

At the global minimum, we will have $latex p_{\text{data}} = p_{\text{model}}$, hence:

$C^*(G) = \log({\frac{1}{4}})$

People believed that minimizing the Jensen-Shannon divergence is the reason why GANs can produce a sharp image while VAE produces blurry image since it minimizes KL divergence. However, this belief is no longer true [2].

However, GANs has received a lot of criticism. For one instance, Schmidhuber mentioned that Goodfellow should have cited Schmidhuber’s 1992 paper [3]. (I found Reddit’s article on the public attack during NIPS2016 tutorial here: https://www.reddit.com/r/MachineLearning/comments/5go4sa/n_whats_happening_at_nips_2016_jurgen_schmidhuber/ ). Interestingly, Schmidhuber is one of the reviewers on GANs original paper and he rejected it.

While many researchers claim that GANs generates a realistic sample, a few NLP researchers do not believe so [4]. Meanwhile,  the standard dataset for some particular tasks is necessary because many machine learning research paper is really depending on the empirical results.

References:

[1] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.

[2] Goodfellow, Ian. “NIPS 2016 Tutorial: Generative Adversarial Networks.” arXiv preprint arXiv:1701.00160 (2016).

[3] ftp://ftp.idsia.ch/pub/juergen/factorial.pdf

# Learning a Hierarchical Embedding Model for Personalized Product Search (SIGIR’17)

This paper proposed a personalized product search model. Given a user and query, the model needs to retrieve a relevant product that a user will likely to purchase. Due to this requirement, the relevancy is based on the likelihood of the user will buy retrieved products. This can be finding a product that meets a user’s intent or need.

The model uses product reviews written by users to model a user’s preference. The authors assume that a query is independent (not personalized). It means by observing a query alone, we can’t tell which user creates that query. This assumption is necessary because the dataset does not have queries and the authors need to construct a set of queries based on the product’s description. On the other hands, The item embedding is also based on its product reviews. By looking all reviews, the model should be able to summarize the item and be able to learn its embedding vector.

The last information is a query. The model also learns word embedding for queries and then construct a query based on either averaging all words, non-linear projecting word embedding or simply use RNN to get a final embedding vector. It means that this model will learn word embedding for all words observed in the pseudo-queries, word embedding for all users and items.

But they need to relate query, user, and item. And I think this is the key assumption of this model. They assume that a linear combination of user and query should be similar to a relevant item. Hence, the model assumes knowing user and query, we should have a good idea what should be relevant products/items. As a result, the objective function is to maximize the log likelihood of the language model of all users, the language model of all items, and the item generation likelihood.

The neural networks are utilized in their model. To construct a query vector, projecting an average word embedding yields the best accuracy. However, this model requires approximating a lot of language models. This could be challenging to train. Thus, the author approximates the softmax calculation based on negative sampling technique proposed by [2]. The experiment results seem valid but this model performs poorly on some product searches. But this model beats the state-of-the-arts.

My impression of this model is that it seems to be product specific due to nature of the given reviews. Some products like electronics have a different pattern of reviews than books. This may limit the scalability of this model because we need separate models for each product categories. Furthermore, I think to personalize a user, their model uses review texts. It means that the model could summarize user’s sentiment or word usage. If we mix multiple product categories, the sentiment might mix up and it will be difficult to tell a user preference. And this is probably the main reason why the author separate different product categories and train separate models.

References:

[1] Qingyao Ai, Youngfeng Zhang, Keping Bi, Xu Chen, W. Bruce Croft. 2017. Learning a Hierarchical Embedding Model for Personalized Product Search. In Proceedings of SIGIR’17, Shinjuku, Tokyo, Japan, August 7-11, 2017, 10 pages.

[2] Tomax Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arxiv prepreint arXiv:1301.3781.