Recurrent Recommender Networks (WSDM’17)

The motivation of this work is to tackle the collaborative filtering problem in the realistic setting. The classical collaborative filtering models interpolate rating based on the past and future rating. But in a real-world situation, there are no future rating scores. Therefore, to be able to extrapolate or predict the future rating is more practical.

One important argument that explains why many CF models had performed well on the Netflix dataset is due to the mixing distribution of training and testing data. The models were fed with future ratings; hence, it is easy to predict a user rating.

Therefore, modeling the temporal and causal aspects of the rating data are the main goal of this work. They gave an example of the movie ‘Plan 9’ which initially was reviewed as a bad film but became very popular later. Another observation is that some movies are more popular during the Christmas and summer. It is also valid to assume that a user preference will change over time as they grow up their taste of films will change.

With all these motivations and observations, they propose to use RNN to model user and movie dynamics over time and hope that RNN will capture both exogenous and endogenous dynamics. The key ingredient of their models is to incorporate a wall clock (time index) as part of the sample features. Here is how each training sample looks like:

s_t = [x_t, 1_{\text{newbie}}, \tau_t, \tau_{t-1}]

x_t is a vector of user rating up to time t.  x_{jt} = k represents this user has rated movie j at time t with a rating of k. x_{jt} = 0 when this user did not rate the movie j.1_{\text{newbie}} seems to be an indicator if a user has no previous rating – a new user. The next two parameters are important because the RNN will use time index to handle no-rating steps.

Another important component is a projecting function of s_t to an embedding space and feeds the embedding vector to an LSTM unit. Adding a linear transformation can be viewed as converting raw data into more abstract representation. This also implies that this model does not feed user ratings to an LSTM unit directly. The LSTM is used to model user and movie dynamics. We can look the trained LSTM as a function that model these dynamics. The authors trained two RNN models: one for users and another for movies.

Finally, at each time step, the model predicts a rating as follows:

\hat{r} = f(u_{it}, m_{jt}, u_i, m_j) = <\tilde u_{it}, \tilde m_{jt}> + <u_i, m_j>

This equation extends a standard matrix factorization with dynamic states \tilde u_{it}, \tilde m_{jt}. It means that at each time step, this model will solve a matrix factorization based on the rating up to time t.

To train this model requires an alternate training since we can’t train user and movie simultaneously. Otherwise, there will be too many RNN for all movies. Thus, the author fixes the movie dynamic function, and train user dynamic function. Then, fix the user dynamic function and train movie dynamic function alternately. Training the model this way will be more scalable.

The experimental results show that this model (RRN) beats TimeSVD++, AutoRec, and PMF. Further, this model can capture many external factors such as rating scale changes in Netflix dataset, external factor such as Oscar or Golden Globe awards, and internal factor such as season change.

My 2-cent, I like this paper because the motivation is well written and I can see the benefit of modeling the dynamic systems in user and movie. I am surprised that there are not many related works that attempt to solve extrapolating problem.

A Recurrent Latent Variable Model for Sequential Data (NIPS’15)

This paper presents a sequential model that is incorporating uncertainty to better model variability that arises from the data itself.

The motivation comes from the fact that data itself especially speech signal has a high variability that does not come from the noise alone. The complex relationship between observed data and an underlying factor of the variability cannot be modeled by the basic RNN alone. For example, vocal quality of the speaker affects the wave audio even though the speaker says the same word.

In a classical RNN, the state transition h_t = f_{\theta}(x_t, h_{t-1}) is a deterministic function and typically f_{\theta} is either LSTM or GRU. RNN models the joint probability of the entire sequencep(x_1, x_2, \cdot, x_T) = \prod_{t=1}^T p(x_t | x_{<t}) =\prod_{t=1}^T g_{\tau}(h_{t-1}) whereg_{\tau} is an output function that maps hidden state to the probility distribution of the output. The choice ofg_{\tau} depends on the problem. Typically, function g has 2 parts: (1) parameter generator,\phi_t = \varphi_{\tau}(h_{t-1}) and (2) density function: P_{\phi_t}(x_t | x_{<t}) . We can also make function g as a GMM; hence, function \phi_t will generate a mixture coefficient parameters.

The source of variability in RNN comes from the output function g alone. This can be problematic in speech signal because RNN must map many variants of input wave to a potentially large variation of the hidden state h_t . The limitation of RNN motivates the author to introduce uncertainty into RNN.

In order to turn RNN to an un-deterministic model, the author assumes that each data point x_t has a latent variable z_t where the latent variable is drawn from a standard Gaussian distribution initially. The generative process is as follows:

  • For each step t to T
    • Compute prior parameters:[\mu_{0,t}, \text{diag}(\sigma_{0,t})] = \phi_{\tau}^{\text{prior}}(h_{t-1})
    • Draw a prior:z_t \sim N(\mu_{0,t}, \text{diag}(\sigma_{0,t}^2))
    • Compute likelihood parameters: [\mu_{x,t},\sigma_{x,t}] = \phi_{\tau}^{\text{dec}}(\phi_{\tau}^z(z_t), h_{t-1})
    • Draw a sample:x_t | z_t \sim N(\mu_{x,t}, \text{diag}(\sigma_{x,t}^2))
    • Compute a hidden state:h_t = f_{\theta}(\phi_{\tau}^x(x_t), \phi_{\tau}^z(z_t), h_{t-1})

The state transition function is now an un-deterministic function because z_t is a random variable. Also, the hidden state h_t depends on  x_{<t}, z_{<t}, therefore, we can replace  h_t with:

  • z_t \sim p(z_t | x_{<t}, z_{<t})
  • x_t|z_t \sim p(x_t | z_{\le t}, x_{\le t})

Thus, the joint distribution becomes:

p(x_{\le T}, z_{\le T}) = \prod_{t=1}^T p(x_t|z_{\le t}, x_{<t})p(z_t|x_{<t},z_{<t})

The objective function is to maximize the log-likelihood of the input sequence:p(x_{\le T}) = \int_z p(x_{\le T}, z_{\le T}) dz . By assuming the approximate posterior distribution q(z_{\le T} | x_{\le T}) = \prod_{t=1}^T q(z_t | x_{\le t}, z_{<t}) is factorizable, the ELBO is:

E_{q(z_{\le T}|x_{\le T})}\big[ \sum_{t=1}^T \log p(x_t|z_{\le t},x_{<t}) - KL(q(z_t|x_{\le t},z_{<t}) || p(z_t|x_{<t},z_{<t})) \big]

The ELBO can be trained efficiently through variational autoencoder framework. In fact, this model is a sequential version of the classical variational autoencoder.


Chung, Junyoung, et al. “A recurrent latent variable model for sequential data.” Advances in neural information processing systems. 2015.

A Deep Relevance Matching Model for Ad-hoc Retrieval (CIKM2016)

This work points out that there are fundamentally different between representation-focused model and interaction-focused model. The first model builds an effective representation for each word and conduct matching through a simple operation. The second model focuses less on representation learning but attempts to model a complex interaction between two words. This paper focuses on the 2nd idea.

The main motivation to focus more on word interaction because ad hoc retrieval problem requires a word matching. The retrieval performance depends on the actual matching as evidently see in many past language models for IR that use weight average between word matching and word semantic. The semantic matching only is not sufficient.

This paper proposed a deep learning model to learn a representation of word and documents interaction. The key idea is to build a word similarity histogram to represents the interaction. Here is how they build a histogram: the histogram has a fixed number of buckets. Each bucket is associated with a range of similarity scores. The author uses 5 buckets: [-1, -0.5), [-0.5, 0), [0, 0.5), [0.5, 1), [1, 1]. The last bin counts the number of the exact word matching. This is how this model captures word matching. However, the proposed model depends on pre-trained word embedding. With a good word embedding, the interaction between two similar words on the same document will be similar.

The deep neural network will act as a non-linear function that transforms each interaction vector (histogram) to a relevant score. They use ‘tanh’ to squeeze the final score within the range of [-1, +1].

A query word will interact with all documents in the corpus and the model will output a vector of relevant scores for a single query word. However, there are more than one query words. The model sums all relevant score vectors to have a final relevant score. The author assumes that some query words are more important than others. For example, stopwords are not important compared to a name of person or city name. Thus, this model also computes a weight vector so that the final relevant score is computed as:

s = \sum_{i=1}^M g_i z_i^{(L)}

Where s is the final relevant score, M is the number of query words, g_i is a weight for term i, and z_i is a relevant score vector for term i. They compute this weight through a softmax function. The transformation matrix will be learned during the training.

Finally, to train this model, they employ a pairwise ranking loss such as hinge loss. Each training sample is a triple of (query, positive document, and negative document) where a positive document has a higher ranked than a negative document.  This model is trained to a standard backprop algorithm.

I found that mapping an interaction between word and document to a relevant score is novel. Typically, we focus more on semantic matching, but this work focuses more or relevant matching. And one way to estimate this relevancy is by computing a matching histogram.


Guo, Jiafeng, et al. “A deep relevance matching model for ad-hoc retrieval.” Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016.

A quick survey on Deep Learning for Collaborative Filtering

Here is my quick survey on recent works of using deep learning to tackle collaborative filtering and cold start problems in recommender systems.

Pure Collaborative Filtering problem (No side-information)

I have read a few works that attempt to use deep learning to solve CF problems. In the Collaborative filtering problem, we want to infer latent representation of users and items from rating matrix. The best method that I have known is CF-NADE [1] and AutoRec [2]. The first model is based on Auto-Regressive model (which is an extension of RBF ) while the second model is based on Autoencoder. I encounter RNN-based CF where we learn to predict the rating based on the seen rating. It seems to work well too. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems (CDAE) [3] is a denoise autoencoder to encode both item and user latent vectors. The user latent is used as a bias term.

The traditional method that performs well in this domain is SVD++ which is a matrix factorization that incorporating neighbor users implicitly. Probabilistic Matrix Factorization is bayesian approach to this problem and provides a better interpretation of the latent vectors.

I have not found any work that provides a new insight in CF problems except CF-NADE that uses completely different approach. AutoRec seems to work so well but the original paper is a short paper ( 2 pages) so I am not sure if it really works.

Content-based recommendation (with side-information)

A few works demonstrate that Autoencoder can be used to learn latent vectors from user or item contents. Collaborative deep learning (CDL) [4] uses stacked denoise autoencoder to learn item latent vectors and coupled with user latent vectors. Some works in Music recommendation simply uses deep learning to learn a better item latent vectors from music raw features [5, 6].

Deep Collaborative Filtering via Marginalized Denoising Auto-encoder [7] jointly learns user and item latent vectors from user and item contents. This model regularizes the latent vectors by adding reconstruction errors on rating matrix and try to reconstruct a user and item contents by projecting latent vectors back to the feature space. This additional loss terms could be a novel. However, this model put less weight on content reconstruction errors, hence, these extra hyper parameters add extra complexity to this model.

One question we need to ask if autoencoder is really a good approach to solve Collaborative filtering. One view of the autoencoder is a non-linear PCA. Can we elaborate this model?






[5] Deep content-based music recommendation

[6] Improving content-based and hybrid music recommendation using deep learning


TopicRNN : A Recurrent Neural Network with Long-Range Semantic Dependency

This paper presents a RNN-based language model that is designed to capture a long-range semantic dependency. The proposed model is a simple and elegant, and yields sensible topics.

The key insight of this work is the difference between semantic and syntax. Semantic is relating to an over structure and information of the given context. If we are given a document, its semantic is a theme or topic. Semantic is meant to capture a global meaning of the context. We need to see enough words to understand its semantic.

In contrast, a syntax is dealt with local information. The likelihood of the current word heavily depends on the preceding words. This local information depends on the word ordering whereas the global information does not depend on word ordering.

This paper points out the weakness in probabilistic topic models such as LDA such as its lack of word ordering, its poor performance on word prediction. If we use bigram or trigram then these higher order models become intractable. Furthermore, LDA does not model stopwords very well because LDA is based on word co-occurrence. Stopwords tend to appear everywhere because stopwords do not carry semantic information but it acts as a filler to make the language more readable. Thus, when training LDA, the stopwords are usually discarded during the preprocessing.

RNN-based language models attempt to capture sequential information. It models a word joint distribution as P(y_1, y_2, \cdots, y_T) = P(y_1) \prod_{t=2}^T p(y_t | y_{1:t-1}). The Markov assumption is necessary to keep the inference tractable. The shortcoming is the limitation of the context windows. The higher order Markov assumption makes an inferencing becomes more difficult.

The neural network language model avoids Markov assumption by modeling a conditional probability P(y_t | y_{1:t-1}) = p(y_t|h_t) where h_t = f(h_{t-1}, x_t). Basically, h_t is a summarization of the preceding words and it uses this information to predict the current word. The RNN-based language model works pretty well but it has difficulty with long-range dependency due to the difficulty in optimization and overfitting.

Combining the advantage from both topic modeling and RNN-based is the contribution of this paper. The topic model will be used as a bias to the learned word conditional probability. They chose to make the topic vector as a bias because they don’t want to mix it up with the hidden state of RNN that includes stopwords.

The model has a binary switch variable. When it encounters a stopword, the switch is off and disable a topic vector. The switch is on otherwise. The word probability is defined as follows:

p(y_t = i | h_t, \theta, l_t, B) \propto \exp ( v_t^T h_t + (1 - l_t)b_i^T \theta)

The switch variable, l_t turn on and off the topic vector \theta.

This model is end-to-end network, meaning that it will jointly learn topic vectors and local state from RNN. The topic vector is coupled with RNN’s state so the local dynmic from word sequence will influence the topic vector and wise verse.

RNN can be replaced with GRU or LSTM. The paper shows that using GRU yields the best perplexity on Penn Treebank (PTB) dataset. The learned representation can be used to as a feature for many tasks including sentiment analysis where we want to classify positive and negative reviews on IMDB dataset.

I found this model is simple and elegantly combine VAE with RNN. The motivation is clear and we can see why using contextual information learned from VAE will improve the quality of the representation.

reference: (ICLR 2017 – Poster)


A Biterm Topic Model for Short Texts

Topic model is an unsupervised learning algorithm that discovers the theme of an individual document from a large text collection. The well-known topic models algorithms are PLSA and LDA. Both popular methods are good at summarizing a document, able to capture the long-range dependency which depicts as the theme of that particular document. However, both algorithms suffer when there is not enough word in the given document such as twitter text or short texts because they assume that two words are related if they appear in the same document. When the given document contains only a few texts, there are not enough observation of word co-occurrence – which leads to poorly estimate document summary/theme.

A Biterm Topic Model [1] by Yan solves the data sparsity in a short document by introducing a biterm in place of a single word in order to increase more observations. A biterm is a word pair in the given context. For example, a document “visit apple store” will have the following biterms: “(visit apple), (visit store), (apple store)”. In fact, biterm explicitly model word  co-occurrence while LDA and PLSA implicitly model word  co-occurrence. The author introduce BTM method (Biterm Topic Modeling) to tackle this problem.

Unlike LDA and PLSA, BTM does not model an individual document but model all biterms in the corpus. It puts a strong assumption that each biterm is associated with one topic. This is different from LDA/PLSA that allows each word to be mapped into multiple topics. However, I found this assumption is valid because biterm is more specific than a single word. The word ‘bank’ can either ‘river bank’ or ‘bank teller’ – when adding the second word, we disambiguate the meaning of the first word and be able to indicate its topic.

Without modeling the document directly, BTM needs to inference document topics based on its biterms. According to the paper, this process is simple and straight-forward. However, marginalizing all biterms ( finding the average ) could be improved with more sophisticated assumption. However, the experiment shows that this approximation is good enough to outperform LSA and other state-of-the-arts methods.

The state-of-the-arts including LDA, LDA-u, and a mixture of unigrams. LDA models a document directly and will suffer the data sparsity. LDA-u is a heuristic approach by expanding the short document with more documents from the same authors. For an instance, it combines all twitter texts from the same author so that LDA has more texts to make a better summary. A mixture of unigrams assumes that each document has one fixed topic. This assumption might be reasonable for a short text but not all short texts contain a single topic. BTM seems to be a better model in this respect.

In term of training model is straight forward. The author employed  a collapsed Gibbs Sampling. The posterior formulation is influenced by two factors: the proportion of the topic in the corpus ( if topic K is dominated, then it is likely that that biterm will be in topic K) and the proportion of the given two words in topic K ( if both words are likely to be in topic K, then the given biterm is probably in the topic K as well).

In sum, BTM is a simple extension of LDA that performs well on short text corpus such as twitter. The experimental section is well-written and provide many experiments and metrics. The author also provide the source code which make it is easier for researchers to reproduce and extend this model.


[1] Yan, Xiaohui, et al. “A biterm topic model for short texts.” Proceedings of the 22nd international conference on World Wide Web. ACM, 2013.




ListNet and ListMLE

I just read about Learn2Rank and stumped over the ListNet [3] and ListMLE [4]. These two models are part of Learn2Rank algorithms. The goal is to learn a score function s(d) where d is a document that will yield a high score for a relevant document but low scores for non-relevant documents.

There are three main methods to learn this score function: point-wise, pair-wise, and list-wise. Point-wise: we learn this function from one sample. Pair-wise: we learn from a pair of samples, its relationship. And List-wise: we learn from a collection of samples, its order/rank.

The main essence of ListNet and ListMLE is Plackett-Luce model. (see. [1, 2]). The PL model defines a probability distribution of objects. If we have a collection of N objects, and each object has its associated score, then we can compute the probability of seeing a particular permutation. For example, I have items {A, B, C} and scores {5, 2, 1}; then PL model defines a probability distribution of all permutation of {A, B, C}.

I will not go over its definition, but you can read more from [2]. My point is we have a function that will assign the score to every possible permutation of the collection of items.

What ListNet [3] tried to do is to model a score function as a neural network. Then, the author observes that given the ground truth, we can compute the true probability of the given permutation of the items. Then, we want to learn a score function such that it will map a raw feature (document) to a score. The key idea is the probability distribution calculated from the score function should be as similar as the true probability distribution of all permutation based on the ground truth. They use KL divergence to measure the deviation of the learned distribution from the true distribution.

As we expect, ListNet can be expensive to train because we need to compute all possible permutations. Thus, people approximate the permutation by only calculate up to K permutations. K can be 1, 5, 10, 20, or 50, whatever you think is a good number.

ListMLE [4] is even a simpler model. Minimizing loss based on KL divergence is still a computational expensive. Instead, they just want to maximize the probability of the observed permutation directly. If we are given a ground truth that ‘A, B, C’ is the best permutation, then P( ABC) should have the highest probability. Then they train a neural network to learn a score function that will give a score that will satisfy this objective function.


[1] Plackett, R. (1975). The analysis of permutations. Applied Stat, 24, 193-202.

[2] Guiver, John, and Edward Snelson. “Bayesian inference for Plackett-Luce ranking models.” proceedings of the 26th annual international conference on machine learning. ACM, 2009.

[3] Cao, Zhe, et al. “Learning to rank: from pairwise approach to listwise approach.” Proceedings of the 24th international conference on Machine learning. ACM, 2007.

[4] Xia, Fen, et al. “Listwise approach to learning to rank: theory and algorithm.” Proceedings of the 25th international conference on Machine learning. ACM, 2008.