Neural Factorization Machines for Sparse Predictive Analytics (SIGIR17)

Deep learning has been applied to many information retrieval problems at SIGIR this year. One of the key ideas is to add non-linearity to existing linear models commonly found in IR.

Factorization machine is one of the linear models for a sparse data prediction. The model is composed of a linear regression terms and d-way interaction terms. Then, it is possible to add non-linear interaction to Factorization machine. Neural FM [1] uses a neural network to model a non-linear interaction between feature latent vectors and it demonstrates interesting results on recommendation tasks.

The model of Factorization machine is as follows:

\hat y(\mathbf{x} ) = w_0 + \sum_{i=1}^n w_ix_i + f(\mathbf{x})

This paper changes the interaction f(\mathbf{x}) to be a non-linear function. The author introduces a bi-interaction layer, which is a pooling operation that converts embedding feature vectors to a single vector. Then, the result vector will be fed to a multi-layers neural network and the output from this operation will be an interaction score f(\mathbf{x}).

NFM is different from Wide&Deep model due to its pooling operation. The Wide&Deep model concatenates features together which does not account for feature interactions.

To train NFM, the author uses a square loss and SGD for parameter estimation. Dropout and batch normalization are utilized to avoid an overfitting.

The baselines are strong such as DeepCross and Wide&Deep models. The performance (RMSE) on personalized tag recommendation shows that NFM yields a much lower RMSE than other state-of-the-arts models. In conclusion, NFM models a higher-order and non-linear feature interaction that use a few parameters and does not require a deep structure.


[1] Xiangnan He, Tat-Seng Chua, “Neural Factorization Machines for Sparse Predictive Analytics”, ACM SIGIR 2017.


Labeled-LDA (ACL’09)

In a classical topic model such as LDA, the learned topic can be uninterpretable. We need to eyeball the top words to interpret the semantic of the learned topic. Labeled LDA (L-LDA) is a model that assigns each topic to a supervisor signal. Many documents contain tags and labels, these data can be treated as target topics. If we can group similar words based on the corresponding labels, then each topic will be readable.

The trick is to constrain the available topics that each document can learn. In LDA, a document draws a topic distribution from a K-simplex ( a Dirichlet distribution ). But L-LDA draws a topic distribution from L-simplex where L is the number of tags for the given document. This implies that each document will draw a topic distribution from a different shape of simplex. L-LDA calculates the parameter of dirichlet distribution based on the tags.

For example, if there are K possible tags, \bf{\alpha} = \{ \alpha_1, \alpha_2, \cdots, \alpha_K \}, a document with L tags (k=2, k=5) will have an alpha \bf{\alpha^{(d)}} = \{\alpha_2, \alpha_5 \}. Then the topic assignment will be constraint to only topic 2 and 5.

Due to this model assumption, L-LDA assumes that all words appear in the same document are definitely drawn from a small set of topics. For a single-label dataset, this means that all words in the same document are from the same topic. Is this a good idea?

It turns out that this is a valid assumption. When a document is assigned a label, it is because the combination of words in the document contributes a very specific theme or ideas. Another view of this constraint is that the documents that share the same labels will share the same set of words that likely to appear.

When L-LDA is applied on a multi-labeled dataset, then the learned topic will become more interesting because the way each document share the same set of words become more complicated.


DiscLDA (NIPS’08)

We can learn two type of document representations: generative and discriminative features. For example, when we learn a document representation using an autoencoder, we will get a generative representation because we only care the information that will help us reconstruct the original document back. When we train the document for prediction task by providing side information such as a class label, then the learned representation become a discriminative representation because we only want the features that correspond to the correct class label.

LDA is an example of learning a generative representation. We want to learn a theme of each document and word distributions for each topic. We constraint this model by sharing the word distributions among all documents in the same corpus.

However, we can also learn the word distribution shared by documents in the same class. Then, this distribution indeed represents discriminative information for a particular class label.

DiscLDA (discriminative LDA) is a model that combine the unsupervised topic learning from the corpus with discriminative information from the class labels. The goal is to find a more efficient representation for a classification task or transfer learning.

The paper proposed two models: DiscLDA and DiscLDA with auxiliary variables. The first model extends a standard LDA by introducing a class-dependent transformation matrix. The topic assignment is drawn from z_{dn} \sim T^{yd}\theta_d. Intuitively, in LDA, the document representation resides in a K-simplex space. The transformation $T^{yd}$ moves a point (document) in the K-simplex so that documents within the same class labels will be nearby. The important part is that this model is still constraint by the global word distribution shared by all documents. Thus, we cannot just map any document to any location we want because we still need to maintain a corpus-level word-topic relationship.

The second model introduces an auxiliary variable. First, a new parameter, \phi^y_k is a class-dependent word distribution for class y and topic k. Thus, a word is drawn from the following distribution:

w_{dn} | z_{dn}, y_d,\Phi \sim \text{Mult}(\phi^{yd}_{z_{dn}})

Learning $p(y|w,\Phi)$ can be difficult because this distribution has a lot of parameters and highly non-convex. The author proposed alternate inference between a transformation matrix T^y by maximizing the conditional probability p(y_d|w_d; {T^y},\Phi). Then maximizing the posterior of the model in order to capture the topic structure that is shared in document throughout the corpus.

By learning a transformation matrix, DiscLDA is able to find top words for each class category. The learned topic from this model yields higher performance on text classification task compared to the original LDA.


Factorization Machine

I’ve stumbled upon this paper that focused on predicting a response variable on a sparse dataset. In a standard regression problem, we want to find a weight vector that transforms a feature vector to a response value.

\hat y = w_0 + \bf{w}^T \bf{x}

This linear equation assumes that each sample, x^{(1)}, x^{(2)}, \cdots, x^{(n)} are independent. This assumption may not be valid for the collaborative filtering or ranking problems where the observation is correlated. Furthermore, SVM cannot find a reliable hyperplane due to the data sparsity; thus, we need the model that is reliable on this setting. Factorization Machine (FM) is also a general predictor and [1] shows that many predicting problems are equivalent to FM.

FM models all interactions in the dataset. This approach increases the number of observations. If we model a pair-wise interaction, there are O(n^2) interactions. The FM model equation is:

\hat y(\bf{x}) = w_0 + \sum_{i=1}^n w_ix_i + \sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_jx_ix_j

The first two terms are regression formulation. The third term is an interaction term. The dot product of feature embedding vectors \bf{v}_i, \bf{v}_j is a weight of interaction between feature i and feature j. This formulation implies that each feature is not independent.

The author shows that a 2-way FM is equivalent to SVM with polynomial kernel K(\bf{x},\bf{z}) = <\phi(\bf{x}),\phi(\bf{z})>. The only difference is that the interaction weight parameters $w_{i,j}$ in SVM is dense matrix, but it is a low-rank matrix under FM framework. This lead to the less parameters to estimate. If W \in R^{n,n} = VV^T where V \in R^{n,k} and the number of parameters is O(n^2) v.s. O(nk). If k << n, then the term O(nk) is almost a linear.

Some people pointed out that the computation of FM is O(n^2), which is not the case. If the input matrix is sparse, then the number of non-zero interactions in this term:  $latex \sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_jx_ix_j$ is actually much small than O(k \cdot n^2). If m(n) is the number of non-zero entries, then the computation of the interaction term is O(k \cdot m(n). As long as m(n) << n then FM’s computation is almost linear.

Nevertheless, the input feature for FM still needs some feature engineering. The choice of features is important and feature normalization is necessary.


Rendle, Steffen. “Factorization machines.” Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010.


Inference network: RNNs vs NNets

The standard variational autoencoder [1] uses neural networks to approximate the true posterior distribution by mapping an input to mean and variance of a standard Gaussian distribution. A simple modification is to replace the inference network from neural nets to RNN. That what exactly this paper present [2].

Intuitively, the RNN will work on the dataset that each consecutive features are highly correlated. It means that for the public dataset such as MNIST, RNN should have no problem approximate posterior distribution of any MNIST digit.

I started with a classical VAE. First, I trained VAE on MNIST dataset, with the hidden units of 500 for both encoders and decoders. I set the latent dimension to 2 so that I can quickly visualize on 2D plot.



2D embedding using Neural Nets (2-layers) as inference network

Some digits are clustered together but some are mixed together because VAE does not know the label of the digits. Thus, it will still put similar digits nearby, aka digit 7’s are right next to digit 9’s. Many digit 3 and 2 are mixed together. To have a better separation between each digit classes, the label information shall be utilized. In fact, our recent publication to SIGIR’2017 utilizes the label information in order to cluster similar documents together.

But come back to our original research question. Is RNN really going to improve the quality of the embedding vectors?


2D embedding using LSTM as inference network


The above 2D plot shows that using LSTM as an inference network has a slightly different embedding space.


2D embedding vectors of randomly chosen MNIST digits using GRU as inference network

LSTM and GRU also generate slightly different embedding vectors. The recurrent model tends to spread out each digit class. For example, digit 6’s (orange) are spread out. All models mixed digit 4 and 9 together. We should know that mixing digits together might not be a bad thing because some writing digit 4 are very similar to 9. This probably indicates that the recurrent model can capture more subtle similarity between digits.

Now, we will see if RNN model might generate better-looking digits than a standard model.






neural nets

It is difficult to tell which models are better. In term of training time, neural nets are the fastest, and LSTM is the slowest. It could be that we have not utilize the strength of RNN yet. Since we are working on MNIST dataset, it might be easy for a traditional model (Neural nets) to perform well. What if we train the model on text datasets such as Newsgroup20? Intuitively, RNN should be able to capture the sequential information. We might get a better embedding space, maybe? Next time we will investigate further on text dataset.


[1] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[2] Fabius, Otto, and Joost R. van Amersfoort. “Variational recurrent auto-encoders.” arXiv preprint arXiv:1412.6581 (2014).

Blackbox Variational Inference

When we want to inference the hidden variables from our generative model, we usually employ either Gibbs sampler or Mean-field variational inference. Both methods have pro sand cons. Gibbs sampler is easily derived because we only need a transitional probability distribution and MCMC will eventually converge to the true posterior. But it is super slow to converge and we don’t know when it will converge. Mean-field variational inference turns inference problem to an optimization problem. The inference procedure is no longer un-deterministic, no more sampling. It is faster than Gibbs sampling and typically harder to derive the closed form update equations.

David Blei’s quote on these two methods:

“Variational inference is that thing you implement while waiting for your Gibbs sampler to converge.”

His message is that deriving the closed form of VI might take as long as Gibbs sampler to converge!

So here is an alternative approach from David Blei’s student: “Blackbox Variational Inference” [1]. They proposed a generic framework for variational inference and help us avoid derivation of variational parameters. The paper has a lot of mathematical details; however, the key ideas are following:

First, it uses stochastic gradient descent to optimize ELBO.

Second, it computes the expectation term w.r.t to the variational distribution with Monte Carlo samples. Basically, we will draw many latent variables from an approximate posterior distribution, and compute an average of the gradient of ELBO.

Third, this is an important contribution. The Monte Carlo samples have a high variance, thus, optimizing the model will be difficult because the learning rate needs to be very small. The author uses Rao-Blackwellization to reduce the variance by replacing random variables with its conditional expectation w.r.t. the conditioning set.

The author also adds control variates and show that a good control variate has high covariance with the function whose expectation is being computed.

The Blackbox VI seems to be a useful tool for quickly exploring the model assumption.





Neural Autoregressive Distribution Estimation (NADE)

The generative model estimates P(x) or P(x,z) which is different from the discriminative model which estimates a conditional probability P(z|x) directly. An autoregressive model is one of three popular approaches in deep generative models beside GANs and VAE. It models P(x)  = \prod_i P(x_i | x_{<i})

It models P(x)  = \prod_i P(x_i | x_{<i}) – it factors a joint distribution of an observation as a product of an independent conditional probability distribution. However, implement this model to approximate these distributions directly is intractable because we need parameters for each observation.

NADE [1] proposed a scalable approximation by sharing weight parameters between each observation. The sharing parameters method reduces the number of free parameters and has an effect of regularization because the weight parameters must accommodate for all observation.

For technical details, NADE models the following distribution:

P(x) = \prod_{d=1}^D P(x_{o_d}|x_{o_<d}) where d is an index of the permutation o. For example, if we have x = {x_1, x_2, x_3, x_4}, and o = {2, 3, 4, 1}, then x_{o_1} = x_2. The permutation of the observation is more generic notations. Once we model the observation, the hidden variables can be computed as:

\vec h_d = \sigma(W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c)  And we can generate the observation ( a binary random variable) using a sigmoid function:

P(x_{o_d} = 1| \vec x_{o_{<d}}) = \sigma(V_{o_d}\vec h_d + b_{o_d})

If we look at NADE’s architecture, it is similar to RBM. In fact, [1] shows that NADE is in fact, a mean-field approximation of RBM (see [1] for details).

Another important property of NADE is computing a joint distribution P(x) is linear to the number of dimension of observations because we can express W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c recursively:

Define the basecase as: \vec a_1 = \vec c

And the recurrent relationship as: \vec a_d =  W_{:, o_{<d}} \vec x_{o_{<d}} + \vec c = W_{:,o_{d-1}} x_{o_{d-1}} + \vec a_{d-1}

This means that computing \vec h_d = \sigma(\vec a_d) can be done in a linear fashion.

There are many extension of the Autoregressive model, one of the extension CF-NADE is currently the state-of-the-arts of CF. This model can model a binary/discrete random variable which VAE is unable to model currently. So this can be useful for any problem that requires a discrete random variable.


[1] Uria, Benigno, et al. “Neural Autoregressive Distribution Estimation.” Journal of Machine Learning Research 17.205 (2016): 1-37.