Learning to Match Using Local and Distributed Representations of Text for Web Search (WWW’17)

The idea of this paper is to combine local matching with distributed learning. In local matching, relevance is determined by exact term matches between a query and a candidate document. This approach is common in traditional IR and has key advantages: it requires no training data, it scales well, and it is not domain specific. Moreover, an exact-match model is important for finding new documents: if the matched term is rare, we should be able to locate a newly indexed document right away.

The distributed learning approach embeds both the query and the document into a semantic space and computes their similarity in that space. Traditional approaches such as LSI, PLSA, and LDA learn a document's topic distribution as a continuous vector, and the most relevant document is the one closest to the embedded query. Distributed representations are important because they can capture semantic similarity where exact matching cannot.

Combining these two models is not a new idea. Wei and Croft [2] use LDA to augment a traditional retrieval model and demonstrate a performance gain. This paper combines two deep neural network models, one for local matching and one for distributed learning, by jointly learning the parameters of both.

The paper describes the model architecture, which is heavily engineered, with convolutional and pooling layers and sophisticated input features. A Hadamard product layer is used to compute exact matches between the embedded document and query.
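As a minimal sketch of what the local model's input could look like, the exact matches can be encoded as a binary query-by-document interaction matrix, with the document truncated or padded to a fixed length (the function name and padding scheme here are my own, not the paper's):

```python
import numpy as np

def exact_match_matrix(query_terms, doc_terms, max_doc_len=100):
    # Binary interaction matrix: entry (i, j) is 1 when query term i
    # exactly matches document term j. The document is truncated (or
    # padded) to a fixed length so all matrices share the same shape.
    doc = (doc_terms + [None] * max_doc_len)[:max_doc_len]
    m = np.zeros((len(query_terms), max_doc_len), dtype=np.float32)
    for i, q in enumerate(query_terms):
        for j, d in enumerate(doc):
            if q == d:
                m[i, j] = 1.0
    return m

m = exact_match_matrix(["neural", "ranking"],
                       ["a", "neural", "model", "for", "ranking"],
                       max_doc_len=8)
```

A convolutional layer over such a matrix can then pick up where in the document the matches occur, which is exactly the positional information the local model is after.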

I feel the paper gives a good motivation for why combining local matching with distributed representations might be a good idea. However, the motivation for the specific model architecture is not clear, which made me doubt whether the performance gain really comes from combining the two approaches or simply from the choice of architecture. Furthermore, the local model wants to capture matching positions, and to do so all documents must have the same length; the authors therefore take the first 100 or 1000 words as the document representation, which discards a lot of information.

I am not sure why the authors chose character n-graphs as input features rather than simple term frequencies or raw documents; the missing explanation weakens the paper. On the other hand, Figure 6, in which the authors embed each model's retrieval-performance vector into a 2D space, is interesting: this simple visualization can be used to compare model similarity.


[1] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to Match Using Local and Distributed Representations of Text for Web Search. In WWW’17. 1291-1299.

[2] X. Wei and W. B. Croft. 2006. LDA-based Document Models for Ad-hoc Retrieval. In SIGIR'06. 178-185.



Neural Ranking Models with Weak Supervision (SIGIR’17)

The paper provides a good motivation for applying deep learning to the ranking problem. The authors claim that there are not enough labels for training a deep model; for example, there are too few relevance judgments for query-document pairs in ad-hoc retrieval. This lack of labels hinders the model from generalizing any useful representation of the interaction between query and document.

The authors propose to use BM25 to compute a relevance score and use that score as a noisy relevance judgment. This approach is reasonable: BM25 is a strong classical ranking function for document retrieval, so we can generate a large number of relevance scores (although some of them may be inaccurate or even misleading).

The key challenge is how to train the model so that the noisy labels do not degrade its performance. The paper shows that a ranking model is superior to a scoring model: learning a relative order keeps the model from fitting the noise, whereas a scoring model ends up fitting the low-quality labels.
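A common way to express such a ranking objective is a pairwise hinge loss over a preferred/non-preferred document pair; a minimal sketch (the margin value and function name are illustrative, not necessarily the paper's exact formulation):

```python
import numpy as np

def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    # Only the relative order of the two documents matters, not the
    # absolute (noisy) BM25 values that were used to pick the pair.
    # Loss is zero once the preferred document wins by at least `margin`.
    return np.maximum(0.0, margin - (score_pos - score_neg))

pairwise_hinge_loss(2.0, 0.5)  # correctly ordered with margin -> 0.0
pairwise_hinge_loss(0.5, 2.0)  # wrongly ordered -> positive loss
```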

Another key insight is not to limit the representation of the input. If we force the input features to be discrete units, we prevent the model from learning semantic similarity between words in queries and documents. Representing the input as embedding vectors therefore boosts performance. To compose all the embedded vectors, the authors use a weighted average, where the model learns the weight of each vector.
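A sketch of this weighted composition, assuming softmax-normalized learned weights (the normalization choice here is my assumption, not necessarily the paper's):

```python
import numpy as np

def weighted_average(embeddings, weights):
    # embeddings: (num_terms, dim); weights: (num_terms,) learned scalars.
    # Softmax-normalize so the result is a convex combination of the
    # term embeddings; the weights themselves are trained with the model.
    w = np.exp(weights - weights.max())
    w = w / w.sum()
    return (w[:, None] * embeddings).sum(axis=0)
```

With all-zero weights this reduces to the plain mean of the embeddings; training then moves the weights away from uniform.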

In the experimental section, the authors find that dense and sparse representations suffer from overfitting even with heavy regularization, whereas the embedding representation does not overfit.

When the model is allowed to learn a weight for each embedded word, the learned weights have a strong linear correlation with inverse document frequency. This is an interesting discovery: the model only ever sees a single document-query pair at a time, yet it manages to learn global, corpus-level information.

The authors demonstrate that pretraining the embedding vectors on an external corpus does not perform as well as training on the target corpus, especially when the target corpus is small. When the corpus is large (the ClueWeb dataset), there is no difference in performance. The learned weights also boost performance, but not significantly.

Finally, when supervised data is limited, it is feasible to pretrain the model on weak labels and fine-tune it on the high-quality labels.

Overall, I really like this paper. It shows that weak labels mitigate the lack of supervised data, and that a ranking objective suits weak labels because it does not overfit to the label values.


[1] Mostafa Dehghani, et al. 2017. Neural Ranking Models with Weak Supervision. In SIGIR'17.


Neural Factorization Machines for Sparse Predictive Analytics (SIGIR'17)

Deep learning has been applied to many information retrieval problems at SIGIR this year. One of the key ideas is to add non-linearity to existing linear models commonly found in IR.

The factorization machine is a linear model for prediction on sparse data, composed of linear regression terms and d-way interaction terms. It is possible to add non-linear interactions to the factorization machine: Neural FM [1] uses a neural network to model non-linear interactions between feature latent vectors and demonstrates interesting results on recommendation tasks.

The model of Factorization machine is as follows:

\hat y(\mathbf{x} ) = w_0 + \sum_{i=1}^n w_ix_i + f(\mathbf{x})

This paper replaces the interaction term f(\mathbf{x}) with a non-linear function. The authors introduce a bi-interaction layer, a pooling operation that converts the embedded feature vectors into a single vector. That vector is then fed into a multi-layer neural network, and the output of this operation is the interaction score f(\mathbf{x}).
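The bi-interaction pooling sums the elementwise products of all pairs of embedded features, and a standard algebraic identity lets it be computed in linear time rather than over all pairs; a minimal sketch:

```python
import numpy as np

def bi_interaction(vx):
    # vx: (num_features, dim), rows are x_i * v_i for the non-zero features.
    # Pairwise elementwise products summed in linear time via the identity
    # sum_{i<j} a_i * a_j = 0.5 * ((sum_i a_i)^2 - sum_i a_i^2),
    # applied independently to each embedding dimension.
    s = vx.sum(axis=0)
    return 0.5 * (s * s - (vx * vx).sum(axis=0))
```

The result is a single dim-sized vector regardless of how many features are active, which is what allows the subsequent MLP to have a fixed input size.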

NFM differs from the Wide&Deep model in its pooling operation: Wide&Deep concatenates features together, which does not account for feature interactions.

To train NFM, the authors use a squared loss and SGD for parameter estimation. Dropout and batch normalization are utilized to avoid overfitting.

The baselines are strong, including the DeepCross and Wide&Deep models. On personalized tag recommendation, NFM achieves a much lower RMSE than the other state-of-the-art models. In conclusion, NFM models higher-order, non-linear feature interactions with few parameters and without requiring a deep structure.


[1] Xiangnan He and Tat-Seng Chua. 2017. Neural Factorization Machines for Sparse Predictive Analytics. In SIGIR'17.



Labeled-LDA (ACL’09)

In a classical topic model such as LDA, the learned topics can be uninterpretable; we need to eyeball the top words to interpret the semantics of each topic. Labeled LDA (L-LDA) is a model that ties each topic to a supervision signal. Many documents come with tags and labels, and these can be treated as the target topics. If we can group similar words under the corresponding labels, then each topic becomes readable.

The trick is to constrain the topics each document can use. In LDA, a document draws its topic distribution from the K-simplex (via a Dirichlet distribution), but L-LDA draws it from an L-simplex, where L is the number of tags of the given document. This implies that each document draws its topic distribution from a differently shaped simplex. L-LDA computes the Dirichlet parameter from the document's tags.

For example, with K possible tags and \bf{\alpha} = \{ \alpha_1, \alpha_2, \cdots, \alpha_K \}, a document with two tags (k=2 and k=5) has \bf{\alpha}^{(d)} = \{\alpha_2, \alpha_5 \}, and its topic assignments are constrained to topics 2 and 5 only.
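As a sketch, restricting the Dirichlet parameter to a document's labels might look like the following (a hypothetical helper of my own, with 1-indexed topic ids):

```python
import numpy as np

def restrict_alpha(alpha, doc_labels):
    # L-LDA restricts a document's Dirichlet prior to the topics that
    # correspond to its labels; the document then draws its topic
    # distribution from this lower-dimensional simplex.
    return np.array([alpha[k - 1] for k in sorted(doc_labels)])

alpha = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
restrict_alpha(alpha, {2, 5})  # -> array([0.2, 0.5])
```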

Due to this model assumption, L-LDA assumes that all words appearing in the same document are drawn from a small set of topics. For a single-label dataset, this means that all words in the same document come from the same topic. Is this a good idea?

It turns out that this is a valid assumption. When a document is assigned a label, it is because the combination of words in the document conveys a very specific theme or idea. Another view of this constraint is that documents sharing the same labels will share the same set of words that are likely to appear.

When L-LDA is applied to a multi-label dataset, the learned topics become more interesting, because the way documents share the same sets of words becomes more complicated.



DiscLDA (NIPS’08)

We can learn two types of document representations: generative and discriminative. For example, when we learn a document representation with an autoencoder, we get a generative representation, because we only care about the information that helps us reconstruct the original document. When we train on a prediction task by providing side information such as a class label, the learned representation becomes discriminative, because we only want the features that correspond to the correct class label.

LDA is an example of learning a generative representation: we learn a theme for each document and a word distribution for each topic. The model is constrained by sharing the word distributions among all documents in the corpus.

However, we can also learn word distributions shared only by documents in the same class. Such a distribution then represents discriminative information for that particular class label.

DiscLDA (discriminative LDA) is a model that combines unsupervised topic learning on the corpus with discriminative information from the class labels. The goal is to find a more efficient representation for classification or transfer learning.

The paper proposes two models: DiscLDA and DiscLDA with auxiliary variables. The first model extends standard LDA by introducing a class-dependent transformation matrix: the topic assignment is drawn from z_{dn} \sim \text{Mult}(T^{y_d}\theta_d). Intuitively, in LDA a document's representation resides in the K-simplex; the transformation T^{y_d} moves a point (document) within the simplex so that documents with the same class label end up nearby. Importantly, the model is still constrained by the global word distributions shared by all documents, so we cannot map a document to an arbitrary location; we still need to maintain the corpus-level word-topic relationship.
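A hypothetical sketch of the class-dependent transformation (for a column-stochastic T the product already sums to one; the explicit renormalization is just a guard I added):

```python
import numpy as np

def transform_theta(T_y, theta):
    # T_y: (K, K) column-stochastic matrix for class y.
    # theta: (K,) topic proportions of a document, a point on the simplex.
    # The topic assignments are then drawn from the transformed point,
    # which pulls documents of the same class toward the same region.
    out = T_y @ theta
    return out / out.sum()  # keep the result on the simplex
```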

The second model introduces an auxiliary variable. A new parameter \phi^y_k is a class-dependent word distribution for class y and topic k, and a word is drawn from the following distribution:

w_{dn} \mid z_{dn}, y_d, \Phi \sim \text{Mult}(\phi^{y_d}_{z_{dn}})

Learning $p(y|w,\Phi)$ directly can be difficult because this distribution has many parameters and is highly non-convex. The authors propose an alternating inference: update the transformation matrices T^y by maximizing the conditional probability p(y_d|w_d; \{T^y\}, \Phi), then maximize the posterior of the model in order to capture the topic structure shared by documents throughout the corpus.

By learning the transformation matrices, DiscLDA is able to find top words for each class category. The topics learned by this model yield higher performance on text classification than the original LDA.



Factorization Machine

I stumbled upon this paper, which focuses on predicting a response variable on a sparse dataset. In a standard regression problem, we want to find a weight vector that maps a feature vector to a response value:

\hat y = w_0 + \bf{w}^T \bf{x}

This linear equation assumes that the samples x^{(1)}, x^{(2)}, \cdots, x^{(n)} are independent. This assumption may not be valid for collaborative filtering or ranking problems, where the observations are correlated. Furthermore, an SVM cannot find a reliable hyperplane under such data sparsity, so we need a model that is reliable in this setting. The factorization machine (FM) is also a general predictor, and [1] shows that many prediction problems are equivalent to FM.

FM models all pairwise interactions in the dataset, which increases the number of effective observations. If we model pairwise interactions, there are O(n^2) of them. The FM model equation is:

\hat y(\bf{x}) = w_0 + \sum_{i=1}^n w_ix_i + \sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_jx_ix_j

The first two terms are the regression formulation; the third is the interaction term. The dot product of the feature embedding vectors \bf{v}_i, \bf{v}_j is the weight of the interaction between features i and j. This formulation implies that the features are not independent.

The author shows that a 2-way FM is equivalent to an SVM with a polynomial kernel K(\bf{x},\bf{z}) = \langle \phi(\bf{x}), \phi(\bf{z}) \rangle. The difference is that the interaction weight matrix $w_{i,j}$ is dense for the SVM but low-rank under the FM framework: W = VV^T with W \in R^{n \times n} and V \in R^{n \times k}. This leads to fewer parameters to estimate, O(n^2) versus O(nk); if k \ll n, then O(nk) is almost linear.

Some people point out that the computation of FM is O(n^2), which is not the case. If the input matrix is sparse, the interaction term \sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_j x_i x_j can be computed in O(k \cdot m(n)), where m(n) is the number of non-zero entries; this is much smaller than O(k \cdot n^2). As long as m(n) \ll n, FM's computation is almost linear.
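The two forms can be checked against each other; a sketch of the naive pairwise sum versus the linear-time rewriting from [1]:

```python
import numpy as np

def fm_interaction_naive(V, x):
    # O(k * n^2): sum <v_i, v_j> x_i x_j over all pairs i < j.
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += float(V[i] @ V[j]) * x[i] * x[j]
    return total

def fm_interaction_linear(V, x):
    # O(k * n) via Rendle's identity, per embedding dimension f:
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f ((sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2)
    s = V.T @ x                 # (k,) per-dimension weighted sums
    sq = (V * V).T @ (x * x)    # (k,) per-dimension sums of squares
    return 0.5 * float((s * s - sq).sum())
```

In the linear form, only the non-zero entries of x contribute to the per-dimension sums, which is where the O(k \cdot m(n)) cost comes from on sparse inputs.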

Nevertheless, the input features for FM still need some feature engineering: the choice of features is important, and feature normalization is necessary.


[1] Steffen Rendle. 2010. Factorization Machines. In ICDM'10.


Inference network: RNNs vs NNets

The standard variational autoencoder (VAE) [1] uses neural networks to approximate the true posterior distribution, mapping an input to the mean and variance of a diagonal Gaussian. A simple modification is to replace the feed-forward inference network with an RNN, which is exactly what this paper presents [2].

Intuitively, the RNN should help on datasets where consecutive features are highly correlated. For a public dataset such as MNIST, an RNN should have no problem approximating the posterior distribution of any digit.

I started with a classical VAE, trained on MNIST with 500 hidden units for both the encoder and the decoder. I set the latent dimension to 2 so that I could quickly visualize the result in a 2D plot.
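The stochastic layer is the same regardless of whether a feed-forward net or an RNN produces the Gaussian parameters; a minimal NumPy sketch of the reparameterization step (a stand-in for the actual training code, not the code used in these experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # The inference network outputs mu and log-variance of a diagonal
    # Gaussian q(z|x). Sampling z = mu + sigma * eps, with eps drawn
    # from a standard normal, keeps the sample differentiable w.r.t.
    # the network parameters.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

Swapping the inference network only changes how mu and log_var are computed from the input; everything downstream of this step stays the same.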



2D embedding using Neural Nets (2-layers) as inference network

Some digits are clustered together, but some are mixed, because the VAE does not know the digit labels. It still places similar digits nearby: the 7's are right next to the 9's, and many 3's and 2's are mixed together. To get better separation between the digit classes, the label information would have to be utilized; in fact, our recent SIGIR'17 publication uses label information to cluster similar documents together.

But back to our original research question: will an RNN really improve the quality of the embedding vectors?


2D embedding using LSTM as inference network


The 2D plot above shows that using an LSTM as the inference network yields a slightly different embedding space.


2D embedding vectors of randomly chosen MNIST digits using GRU as inference network

LSTM and GRU also generate slightly different embedding vectors. The recurrent models tend to spread out each digit class; for example, the 6's (orange) are spread out. All models mix the 4's and 9's together. Mixing digits is not necessarily a bad thing, since some handwritten 4's are very similar to 9's; it probably indicates that the recurrent models capture more subtle similarities between digits.

Now let us see whether the RNN models generate better-looking digits than the standard model.






Generated digits using neural nets as inference network

It is difficult to tell which model is better. In terms of training time, the plain neural network is the fastest and the LSTM is the slowest. It could be that we have not yet utilized the strength of RNNs: since we are working on MNIST, it might be easy for a traditional model (a feed-forward neural net) to perform well. What if we train the models on a text dataset such as Newsgroup20? Intuitively, an RNN should be able to capture the sequential information, and we might get a better embedding space. Next time we will investigate further on text datasets.


[1] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[2] Fabius, Otto, and Joost R. van Amersfoort. “Variational recurrent auto-encoders.” arXiv preprint arXiv:1412.6581 (2014).