# A Neural Autoregressive Approach for Collaborative Filtering (ICML’16)

This paper proposes an autoregressive model for rating prediction. The idea is similar to RBM-CF for the collaborative filtering problem, but it replaces the RBM with a NADE. Since NADE is a mean-field approximation of an RBM, and RBM-CF already works for collaborative filtering, it makes sense that a NADE model is applicable to the CF problem.

The NADE model aims to estimate the probability of a given rating sequence as $P(\vec{r}) = \prod_{i=1}^D P(r_{m_{o_i}}|\vec{r}_{m_{o_{<i}}})$. The previously seen ratings are used to predict the current rating. This also implies that an early rating in the sequence can have a strong influence on the current rating. This observation makes NADE a different model from an RNN.

The conditional probability distribution is defined as:

$P(r_{m_{o_i}} = k|\vec{r}_{m_{o_{<i}}}) = \frac{\exp(s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}{\sum_{k'=1}^K \exp(s^{k'}_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}))}$

The model needs to estimate a score function $s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}})$ such that it gives a high score when a user is likely to give a rating k to item $m_{o_i}$ given her previous rating history.

The score function can be seen as a dot product between a user state and an item latent vector, plus a bias: $s^k_{m_{o_i}}(\vec{r}_{m_{o_{<i}}}) = b^k_{m_{o_i}} + V^k_{m_{o_i},:}\textbf{h}(\vec{r}_{m_{o_{<i}}})$. The bias captures item popularity, and the dot product measures the similarity between the current taste of the given user and the item $m_{o_i}$.

The hidden state captures the user's current interest, which means this model captures changes in behavior. The hidden state is defined as $\textbf{h}(\vec{r}_{m_{o_{<i}}}) = g(\textbf{c} + \sum_{j<i} \textbf{W}^{r_{m_{o_j}}}_{:,m_{o_j}})$. The function g is an activation function, $\textbf{c}$ is a bias, and $\textbf{W}$ is an interaction matrix that transforms the input ratings into a hidden representation.
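The three equations above fit together in a few lines of numpy. This is a toy sketch with random weights: the dimensions M, K, H are made up, and tanh stands in for the activation g.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (all hypothetical): M items, K rating levels, H hidden units.
M, K, H = 6, 5, 4
W = rng.normal(size=(K, H, M))   # W[k][:, m]: connections for item m rated k
V = rng.normal(size=(K, M, H))   # V[k][m, :]: output weights for item m, rating k
b = rng.normal(size=(K, M))      # per-item, per-rating biases
c = rng.normal(size=H)           # hidden bias

def conditional(history, item):
    """P(r_item = k | history), history = list of (item, rating) pairs."""
    # Hidden state: h = g(c + sum_{j<i} W^{r_j}_{:, m_j}), with g = tanh here.
    h = np.tanh(c + sum(W[r][:, m] for m, r in history))
    # Scores: s^k = b^k_item + V^k_{item,:} . h
    s = np.array([b[k, item] + V[k, item] @ h for k in range(K)])
    e = np.exp(s - s.max())       # softmax over the K rating levels
    return e / e.sum()

# Distribution over the K ratings for item 4, given two earlier ratings.
p = conditional([(0, 3), (2, 1)], item=4)
```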

NADE-CF can be complicated, but the main idea is the same as RBM-CF: each user has her own NADE network, and NADE-CF shares all weight parameters among users to avoid overfitting. Further, since each weight matrix is associated with a rating value, some weight matrices would be updated less frequently than others. The authors therefore change the formula for computing the hidden state (see eq. 9 in the original paper) so that the matrices for rare ratings are updated more often.

In order to reduce the number of parameters further, a low-rank approximation is applied to the matrices W and V. In addition, to improve predictive performance, an ordinal cost is introduced. The idea is that if a user rates item m with rating k, her preference for rating k-1 should be higher than for k-2, and so on. In other words, if a user gave a rating of 3, she is more likely to have given a rating of 2 than a rating of 1, because 2 is closer to 3. The ordinal cost is then modeled as a rank loss similar to list-wise learning-to-rank models.

Overall, NADE-CF yields very good performance on popular CF datasets such as Netflix and MovieLens. The model is complicated and seems slow. However, the ordinal cost, the low-rank approximation, and the ability to stack deeper layers make this model attractive, and it might be worthwhile to investigate autoregressive models further for this task.

# My takes on GANs (Generative Adversarial Nets)

What is the fuss with GANs? Everybody loves GANs now. The joke I heard from my advisor is “If your current model does not work, try it aGAN!” (pun intended). I realize that it comes down to the adversarial objective function. Some people may find it sexier than the objectives of VAE and NADE, which simply maximize the log-probability of the given data samples: a VAE maximizes the evidence lower bound (ELBO), while NADE maximizes conditional log probabilities. GANs, in contrast, simultaneously maximize and minimize two cost functions.

The GAN framework has two components: a data generator and a discriminator. The generator attempts to create a sample that is hopefully similar to a sample from the actual data, while the discriminator must guess whether the given sample is fake or real. This process can be viewed as a competitive game in which the generator tries to fool the discriminator; the game is complete when the discriminator is unable to distinguish a generated sample from a real one. On the other hand, we can also view GANs as a cooperative game in which the discriminator helps the generator create more realistic samples by providing feedback.

There are two cost functions: one for the generator and one for the discriminator. The discriminator's cost is a binary cross-entropy:

$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\frac{1}{2}E_{x \sim p_{\text{data}}}[\log D(x)] - \frac{1}{2}E_z[\log(1 - D(G(z)))]$

There are two sets of mini-batches: one drawn from the real samples and one coming from the generator. The optimal discriminator predicts the chance that a sample is real as $\frac{p_{\text{data}}}{p_{\text{data}} + p_{\text{model}}}$.

There are many possible cost functions for the generator; the original paper proposed two. If we construct this learning procedure as a zero-sum game, then the cost function for the generator is simply $J^{(G)} = -J^{(D)}$: we minimize the likelihood of the discriminator being correct.

Goodfellow [1] mentioned that the above cost function has a vanishing-gradient problem when the sample is confidently recognized as fake. To overcome this, he proposed a new cost function for the generator that maximizes the likelihood of the discriminator being wrong:

$J^{(G)} = -\frac{1}{2}E_z[\log D(G(z))]$
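The two generator costs can be compared directly in numpy. This is a toy sketch following the cost functions written above; `d_fake` stands for hypothetical discriminator outputs on generated samples.

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator cost: binary cross-entropy on real and generated batches."""
    return -0.5 * np.mean(np.log(d_real)) - 0.5 * np.mean(np.log(1.0 - d_fake))

def g_loss_minimax(d_fake):
    """Zero-sum generator cost: the negated fake half of the discriminator cost."""
    return 0.5 * np.mean(np.log(1.0 - d_fake))

def g_loss_nonsaturating(d_fake):
    """Heuristic cost: maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))."""
    return -0.5 * np.mean(np.log(d_fake))

# When the discriminator confidently rejects fakes (D(G(z)) near 0), the
# minimax cost is nearly flat, while the non-saturating cost still gives
# the generator a strong training signal.
d_fake = np.array([1e-3, 2e-3])
flat = g_loss_minimax(d_fake)          # close to 0, tiny gradient
steep = g_loss_nonsaturating(d_fake)   # large, steep
```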

The new cost function is more useful in practice, but the former is useful for the theoretical analysis we discuss next. For now, the objective function is:

$\min_G \max_D V(D,G) = \min_G \max_D \left( E_{x \sim p_{\text{data}}}[\log D(x)] + E_z[\log(1 - D(G(z)))] \right)$

Goodfellow shows that learning the zero-sum game is the same as minimizing the Jensen-Shannon divergence. Given that we have an optimal discriminator, the cost function is now:

$C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log D^*(x)] + E_z[\log(1 - D^*(G(z)))]$

Since we know the optimal discriminator, we now have:

$C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log (\frac{p_{\text{data}}}{p_{\text{data}} + p_{\text{model}}})] + E_z[\log(\frac{p_{\text{model}}}{p_{\text{data}} + p_{\text{model}}})]$

Then, simplify further:

$C(G) = \text{KL}(p_{\text{data}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \text{KL}(p_{\text{model}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \log({\frac{1}{4}})$

At the global minimum, we will have $p_{\text{data}} = p_{\text{model}}$, hence:

$C^*(G) = \log({\frac{1}{4}})$
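The derivation above can be checked numerically on a toy discrete distribution. This is a sketch; `p_data` is an arbitrary example distribution.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence KL(p || q)."""
    return float(np.sum(p * np.log(p / q)))

def c_g(p_data, p_model):
    """C(G) = KL(p_data || mix) + KL(p_model || mix) + log(1/4), mix = (p_data + p_model)/2."""
    mix = 0.5 * (p_data + p_model)
    return kl(p_data, mix) + kl(p_model, mix) + float(np.log(0.25))

p_data = np.array([0.2, 0.5, 0.3])

# At the global optimum p_model = p_data, both KL terms vanish: C*(G) = log(1/4).
at_optimum = c_g(p_data, p_data.copy())

# Any other generator distribution scores strictly worse (higher cost).
off_optimum = c_g(p_data, np.array([0.4, 0.3, 0.3]))
```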

People believed that minimizing the Jensen-Shannon divergence is the reason why GANs can produce sharp images while VAEs produce blurry ones, since the VAE minimizes a KL divergence. However, this belief is no longer held [2].

GANs have also received a lot of criticism. For instance, Schmidhuber argued that Goodfellow should have cited Schmidhuber's 1992 paper [3]. (There is a Reddit thread on the public confrontation during the NIPS 2016 tutorial here: https://www.reddit.com/r/MachineLearning/comments/5go4sa/n_whats_happening_at_nips_2016_jurgen_schmidhuber/ ). Interestingly, Schmidhuber was one of the reviewers of the original GAN paper, and he rejected it.

While many researchers claim that GANs generate realistic samples, a few NLP researchers are not convinced [4]. A standard dataset for such tasks would also be necessary, because many machine learning papers really depend on empirical results.

References:

[1] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.

[2] Goodfellow, Ian. “NIPS 2016 Tutorial: Generative Adversarial Networks.” arXiv preprint arXiv:1701.00160 (2016).

[3] Schmidhuber, Jürgen. “Learning Factorial Codes by Predictability Minimization.” Neural Computation (1992). ftp://ftp.idsia.ch/pub/juergen/factorial.pdf

# Learning to Reweight Terms with Distributed Representations (SIGIR’15)

A retrieval model computes a relevance score between a query and a document. Typically, the model is a weighted sum of matching scores between each term t and document D:

$s(q, D) = \sum_{t \in q \cap D} w(t)f(t, D)$

The most common weight function is the inverse document frequency (IDF), which emphasizes rare terms over common ones. IDF is a reasonable choice in most relevance models, but it is independent of the query.

This paper proposes a term-weight learning method based on distributed word vectors (Word2Vec). Motivated by the additive property of distributed word representations, a query q can be summarized as the average of the vectors of all its terms. The authors believe that distributed representations can provide a more accurate estimate of a query-dependent weight function $w(t)$.

This idea sounds better than IDF because the new weight function depends on the query: the weight of a term t is not fixed, but depends on the context of query q.

The paper uses a simple estimation based on the assumption that the average of the word vectors summarizes the whole query. Any term in query q that deviates too much from this average should be less relevant. For each term t in query q with word vector $\textbf{w}_{i,j}$, the query-dependent feature vector of term t is:

$\textbf{x}_{i,j} = \textbf{w}_{i,j} - \bar{\textbf{w}}_{q_i}$

Where $\bar{\textbf{w}}_{q_i}$ is the average of all word vectors in the query. Then, to estimate the true weight, the authors train a linear regression with an L1 regularizer, using $\textbf{x}_{i,j}$ as the input feature and a relevance judgment score based on term recall weight [2] as the ground truth. Once this mapping is trained, for any new query the feature vector $\textbf{x}$ is constructed and passed through the mapping function to compute the weight score.
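The feature construction is easy to sketch in numpy. Random vectors stand in for trained Word2Vec embeddings, a made-up target stands in for the term-recall ground truth, and plain least squares stands in for the paper's L1-regularized regression.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pretrained embeddings: one d-dim vector per query term.
d = 8
query_vecs = rng.normal(size=(3, d))   # e.g. the terms of a 3-word query

# Query-dependent feature: each term's vector minus the query centroid,
# i.e. x_{i,j} = w_{i,j} - w_bar_{q_i}.
centroid = query_vecs.mean(axis=0)
X = query_vecs - centroid

# The paper fits an L1-regularized regression from X to term-recall weights;
# an unregularized least-squares fit sketches the idea.
y = rng.uniform(size=3)                # stand-in relevance weights
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# At query time, a new term's feature vector is scored with the learned map.
new_weight = float(X[0] @ theta)
```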

Although the idea proposed in this paper is simple, it shows how to exploit the additive property of distributed representations. However, the way the query-dependent feature vector is calculated does not satisfy me. Some queries contain multiple similar words, such as “United States” or “San Francisco”; the query average will then be biased toward the most common concept in the query. This may weaken a rare term in the same query, even though that term may be the best indicator of the user's intent.

References:

[1] Zheng, Guoqing, and Jamie Callan. “Learning to reweight terms with distributed representations.” SIGIR’15

[2] Zhao, Le, and Jamie Callan. “Term necessity prediction.” CIKM’10.

# Learning Deep Structured Semantic Models for Web Search using Clickthrough Data (CIKM’13)

When we compute a relevance score between a query and the documents in a corpus, we want a higher score when the query is relevant to the document. A vector-space model based on exact word matching may fail when the query and document share no common words. A latent semantic model can address this vocabulary-gap problem by giving higher probability to words that are semantically similar to words appearing in the document.

As of this paper's publication in 2013, latent semantic models actually did not perform well compared to simple heuristic vector-space models such as BM25. Exact word matching remained the most important indicator of relevance.

Anyhow, this paper demonstrates that a deep neural network can embed both queries and documents into a semantic space, under the assumption that a relevant query and document should be nearby in that space. Although this assumption is reasonable, the model does not emphasize exact matching, which is very valuable information in my opinion.

The deep architecture contains three main parts: a word hashing layer; a mapping layer, which maps a document into the semantic space; and a relevance measurement layer, which computes the cosine similarity between the given query and document. The relevance score is then computed through a softmax layer that normalizes all the cosine similarities.

The performance gain comes from the fact that the authors use supervised information (click-through data) to train the model. By maximizing the conditional log probability $\log \prod_{(Q,D^+)} P(D^+|Q)$, the model learns to map relevant query-document pairs to similar embedded vectors.
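The scoring side can be sketched in numpy. The smoothing factor gamma in the softmax follows the paper, but the vectors here are random stand-ins for the network's semantic outputs.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(6)
q = rng.normal(size=4)            # query semantic vector
docs = rng.normal(size=(3, 4))    # one clicked document plus sampled negatives

# P(D|Q): softmax over smoothed cosine similarities, as in the paper.
gamma = 10.0                      # smoothing factor
sims = np.array([cosine(q, d) for d in docs])
e = np.exp(gamma * sims)
p_d_given_q = e / e.sum()

# Training maximizes log P(D+|Q) over clicked (query, document) pairs.
```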

The word hashing layer is introduced to reduce the dimension of the input vector: a one-hot word vector would be far too long, and letter trigrams reduce the dimension significantly.
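Letter-trigram hashing itself is easy to sketch; the `#` boundary marker follows the paper's convention.

```python
def letter_trigrams(word):
    """Split a word into letter trigrams after adding boundary marks (#)."""
    marked = "#" + word + "#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

# "good" breaks into four trigrams: #go, goo, ood, od#.
tris = letter_trigrams("good")

# Every word in a vocabulary of hundreds of thousands maps into the much
# smaller space of distinct trigrams, shrinking the input layer.
```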

The obvious extension is to replace the transformation layer with CNNs [2] or RNNs so that the model can capture local structure in documents. [2] shows that using a convolutional layer with max pooling slightly improves the NDCG score. I don't know if anyone has tried RNNs yet.

But what I am not comfortable with is that the model treats a query as a document. I think this assumption is too rigid. To me, a query is a signal that indicates a user's information need, which is clearly not a document. I still prefer to treat a query with a language model rather than a vector-space model.

References:

[1] Huang, P., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. 2013. Learning deep structured semantic models for web search using clickthrough data. In CIKM’13.

[2] Shen, Yelong, et al. “Learning semantic representations using convolutional neural networks for web search.” WWW’14

# Relevance-based Word Embedding (SIGIR’17)

Word embeddings have become a standard tool in many IR tasks. Popular tools such as Word2Vec [2] and GloVe [3] map each word to an embedding space where similar words are nearby. The similarity between words is semantic or syntactic: similar words tend to appear in the same contexts, and objective functions such as CBOW and skip-gram exploit this assumption. As a result, the models are very good at guessing a word given its context.

However, the authors believe that semantic and syntactic similarity is not an appropriate metric for IR tasks such as query expansion and classification. In query expansion, we want to suggest additional terms that are relevant to the query but not necessarily semantically or syntactically similar to it. For instance, for the query “indian american museum”, Word2Vec will suggest words such as “united” and “states” because they are semantically similar to “american”, but these are not good expansion terms. A relevance-based embedding will instead suggest words that are relevant to the ‘whole’ query, such as “heye”, “chumash”, and “apa”, which are more directly related to the query. This observation motivates the authors to develop relevance-based word embeddings.

This paper proposes two models: Relevance Likelihood Maximization (RLM) and Relevance Posterior Estimation (RPE). The first model, RLM, consists of two distributions:

• $p(w|R_i)$ – the relevance model distribution for the given query $q_i$, where $p(w|R_i) = \sum_{d \in F_i} p(w|d) \prod_{w' \in q_i} p(w'|d)$. Basically, this distribution is a product of ‘how likely word w is under each of the top-K feedback documents’ and ‘how similar query $q_i$ is to each of those documents’. The first term acts as a likelihood and the second acts as a weight.
• $\hat p(w|q_i; \theta_R)$ – the similarity between the embedded word w and query q: $\hat p(w|q_i; \theta_R) = \frac{\exp(\vec{w}^T\vec{q})}{\sum_{w'\in V}\exp(\vec{w}'^T\vec{q})}$. The normalization makes the distribution sum to one.

To interpret this model: the relevance model distribution, computed via a traditional PRF model (see eq. 3), acts as a weight that signifies the similarity between a particular query and word. The neural network is constrained by this weight and finds embedding vectors that hopefully maximize the likelihood.
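The second distribution is a softmax over the vocabulary, which can be sketched in numpy; the vocabulary size, dimension, and random vectors here are toy stand-ins for learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy setup (all values hypothetical): a 50-word vocabulary in 8 dimensions.
vocab_size, dim = 50, 8
W = rng.normal(size=(vocab_size, dim))   # one embedding per vocabulary word
q = rng.normal(size=dim)                 # query embedding

# p_hat(w | q) = exp(w . q) / sum_{w'} exp(w' . q): softmax over the vocabulary.
scores = W @ q
scores = scores - scores.max()           # subtract the max for numerical stability
p_hat = np.exp(scores) / np.exp(scores).sum()
```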

The second model, RPE, uses a mixture model, assuming two language models for the feedback set: a relevance language model and a noisy language model. The authors cast this assumption as a binary classification problem: predict whether a given word w comes from the relevance distribution of query q. In other words, RPE estimates the following distribution:

$\hat p(R=1|\vec{w},\vec{q};\theta_R) = \frac{1}{1+\exp{(-\vec{w}^T\vec{q})}}$

When the random variable R is one, it implies that word w is relevant to query q.
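A minimal sketch of the RPE posterior, with random vectors as stand-ins for learned embeddings:

```python
import numpy as np

def p_relevant(w_vec, q_vec):
    """RPE relevance posterior: a sigmoid of the word/query dot product."""
    return float(1.0 / (1.0 + np.exp(-(w_vec @ q_vec))))

rng = np.random.default_rng(2)
q = rng.normal(size=5)
aligned = q           # a word vector pointing with the query
opposed = -q          # and one pointing against it

p_hi = p_relevant(aligned, q)   # near 1: likely from the relevance model
p_lo = p_relevant(opposed, q)   # near 0: likely from the noise model
```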

Two optimizations are necessary to train the RLM and RPE models. In RLM, computing the normalization term is expensive, so the authors use a hierarchical approximation of the softmax function [4] to speed up training. Meanwhile, training RPE requires negative samples; the authors chose the NCE method [5].

The experimental results show that RLM performs better than RPE on a query expansion task. Both models outperform Word2Vec and GloVe due to their ability to put high probability on words that are relevant to the entire query. However, RPE beats RLM on a query classification task.

References:

[1] Zamani, Hamed, and W. Bruce Croft. “Relevance-based Word Embedding.” SIGIR’17

[2] Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” NIPS’13.

[3] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “Glove: Global Vectors for Word Representation.” In EMNLP’14.

[4] Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In AISTATS’05.

[5] Michael U. Gutmann and Aapo Hyvarinen. 2012. Noise-contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. In Mach. Learn. Res. 2012.

# Labeled-LDA (ACL’09)

In a classical topic model such as LDA, the learned topics can be uninterpretable: we need to eyeball the top words to interpret each topic's meaning. Labeled LDA (L-LDA) is a model that ties each topic to a supervision signal. Many documents come with tags or labels, and these can be treated as target topics. If we can group similar words based on the corresponding labels, then each topic becomes readable.

The trick is to constrain the topics available to each document. In LDA, a document draws a topic distribution from the K-simplex (a Dirichlet distribution), but L-LDA draws it from the L-simplex, where L is the number of tags of the given document. This implies that each document draws its topic distribution from a differently shaped simplex. L-LDA derives the parameters of the Dirichlet distribution from the tags.

For example, if there are K possible tags with $\bf{\alpha} = \{ \alpha_1, \alpha_2, \cdots, \alpha_K \}$, a document with two tags (k=2 and k=5) has $\bf{\alpha^{(d)}} = \{\alpha_2, \alpha_5 \}$, and its topic assignments are constrained to topics 2 and 5.
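The constrained draw can be sketched in numpy; the label set and Dirichlet values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# K possible labels with a shared symmetric Dirichlet prior (illustrative).
K = 7
alpha = np.full(K, 0.5)

def doc_topic_prior(labels):
    """Restrict the Dirichlet prior to the document's own labels (L-simplex)."""
    return labels, alpha[labels]

# A document tagged with labels 2 and 5 draws theta over only those topics.
labels, alpha_d = doc_topic_prior(np.array([2, 5]))
theta = rng.dirichlet(alpha_d)    # a point on the 2-simplex, not the K-simplex

# Topic assignments for its words are then constrained to {2, 5}.
z = rng.choice(labels, size=10, p=theta)
```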

Due to this model assumption, L-LDA assumes that all words appearing in the same document are drawn from a small set of topics. For a single-label dataset, this means all words in a document come from the same topic. Is this a good idea?

It turns out that this is a valid assumption. When a document is assigned a label, it is because the combination of words in the document contributes to a very specific theme or idea. Another view of this constraint is that documents sharing the same labels will share the same set of words that are likely to appear.

When L-LDA is applied to a multi-label dataset, the learned topics become more interesting, because the way documents share words becomes more complicated.

References:

Ramage, Daniel, et al. “Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-labeled Corpora.” https://www.aclweb.org/anthology/D/D09/D09-1026.pdf

# Factorization Machine

I stumbled upon this paper, which focuses on predicting a response variable on a sparse dataset. In a standard regression problem, we want to find a weight vector that maps a feature vector to a response value:

$\hat y = w_0 + \bf{w}^T \bf{x}$

This linear model assumes that the samples $x^{(1)}, x^{(2)}, \cdots, x^{(n)}$ are independent. This assumption may not hold for collaborative filtering or ranking problems, where the observations are correlated. Furthermore, an SVM cannot find a reliable hyperplane under such data sparsity; thus, we need a model that is reliable in this setting. The Factorization Machine (FM) is also a general predictor, and [1] shows that many prediction problems are equivalent to an FM.

An FM models all interactions between features, which effectively increases the number of observations each parameter is estimated from, since interaction parameters are shared through the factorization. With pairwise interactions, there are $O(n^2)$ of them. The FM model equation is:

$\hat y(\bf{x}) = w_0 + \sum_{i=1}^n w_ix_i + \sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_jx_ix_j$

The first two terms are the usual regression formulation; the third is an interaction term. The dot product of the feature embedding vectors $\bf{v}_i$ and $\bf{v}_j$ is the weight of the interaction between features i and j. This formulation implies that the features are not independent.

The author shows that a 2-way FM is equivalent to an SVM with a polynomial kernel $K(\bf{x},\bf{z}) = \langle\phi(\bf{x}),\phi(\bf{z})\rangle$. The only difference is that the interaction weight matrix $w_{i,j}$ in the SVM is dense, while under the FM framework it is low-rank: $W = VV^T$ with $W \in R^{n \times n}$ and $V \in R^{n \times k}$. This leads to fewer parameters to estimate, $O(nk)$ instead of $O(n^2)$; if $k \ll n$, the number of parameters is almost linear in n.

Some people have pointed out that the computation of an FM is $O(n^2)$, which is not the case. If the input matrix is sparse, the number of non-zero interactions in the term $\sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_jx_ix_j$ is actually much smaller than $O(k \cdot n^2)$. If $m(n)$ is the number of non-zero entries, the interaction term can be computed in $O(k \cdot m(n))$. As long as $m(n) \ll n$, FM's computation is almost linear.
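The linear-time computation rests on rewriting the pairwise sum as $\sum_{i<j} \vec{v}_i^T\vec{v}_j x_i x_j = \frac{1}{2}\sum_f \left[ (\sum_i v_{i,f} x_i)^2 - \sum_i v_{i,f}^2 x_i^2 \right]$, which only touches non-zero entries. A numpy sketch with a made-up sparse input and random factors:

```python
import numpy as np

rng = np.random.default_rng(4)

n, k = 100, 5
x = np.zeros(n)
x[[3, 17, 42]] = [1.0, 2.0, 0.5]     # sparse input: only 3 non-zero features
V = rng.normal(size=(n, k))          # factor vectors v_i

# Naive O(k * n^2) pairwise sum over all feature pairs.
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Rewritten sum in O(k * m(n)), touching only the non-zero entries.
nz = np.nonzero(x)[0]
sq_of_sum = (V[nz].T @ x[nz]) ** 2           # (sum_i v_{i,f} x_i)^2 per factor f
sum_of_sq = (V[nz] ** 2).T @ (x[nz] ** 2)    # sum_i v_{i,f}^2 x_i^2 per factor f
fast = 0.5 * float(np.sum(sq_of_sum - sum_of_sq))
```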

Nevertheless, the input features for an FM still need some feature engineering: the choice of features is important, and feature normalization is necessary.

References:

[1] Rendle, Steffen. “Factorization machines.” Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010.