# Learning to Reweight Terms with Distributed Representations (SIGIR’15)

The retrieval model concerns with computing a relevant score between a query and document. Typically, the model is a weighted sum of matching score of term t and document D:

$s(q, D) = \sum_{t \in q \cap D} w(t)f(t, D)$

The most common weight function is an inverse document frequency (IDF). It signifies the unique terms over common terms. IDF is an appropriate assumption in any relevant model.

This paper proposes a term weight learning method based on distributed word vectors (Word2Vec). Motivated by additive property of word distributed representation, query q can be summarized as an average of all term in query q. The author believes that distributed representation will provide a more accurate estimation of query-dependent weight function $w(t)$.

This idea sounds better than IDF because the new weight function depends on a query. The weight will not be fixed for the given term t, but depends on the context of query q.

The paper uses a simple estimation based on the assumption that the average of word vectors will summarize the whole query. Then, any term in query q that deviates too much from the average should be less relevant. For each term t in query q and its corresponding word vector $\textbf{w}_{i,j}$, the query-dependent feature vector of term t is:

$\textbf{x}_{i,j} = \textbf{w}_{i,j} - \bar{\textbf{w}}_{q_i}$

Where $latex \bar{\textbf{w}}_{q_i}$ is an average of all word vectors in the query q. Then, in order to calculate the true weight, the author trains a linear regression with L1-regularizer using $\textbf{x}_{i,j}$ as an input feature and relevance judgment score based on term recall weight [2] as the ground truth. Once the mapping function is trained, for any given new query, the feature vector $\textbf{x}$ will be constructed and uses a mapping function to calculate the weight score.

Although the idea proposed in this paper is simple, it shows how to exploit the additive property of distributed representation. However, the way to calculate the query-dependent feature vector does not satisfy me. Some queries may contain multiple similar words based on syntactic similarities such as “United States” or “San Francisco”, then the average of the query will be biased toward the most common concept in the query. This may weaken the rare term in the same query in which that term may be the best indicator of the user’s intent.

References:

[1] Zheng, Guoqing, and Jamie Callan. “Learning to reweight terms with distributed representations.” SIGIR’15

[2] L Zhao and J. Callan. “Term necessity prediction.” In CIKM’10.

# Learning Deep Structured Semantic Models for Web Search using Clickthrough Data (CIKM’13)

When we compute the relevant score between a query and documents in the corpus, we want the higher score when the given query is relevant to the document. The vector-space model based on word matching may fail sometimes when both query and document do not share any common word. Hence, the latent semantic model could solve the vocabulary gap problem by giving a higher probability for a word that is semantically similar to words appears in the document.

Since this paper was published in 2013, the latent semantic model actually does not perform well compared to a simple heuristic vector-space model such as BM25. The actual word matching remains the most important indicator of relevancy.

Anyhow, this paper demonstrates that they can use a deep neural network to embed both query and documents to a semantic space. They assume that the relevant query and documents should be nearby in a semantic space. Although this assumption is valid, this model does not emphasize on true matching, which is very valuable information in my opinion.

The deep model architecture containing 3 main parts: word hashing layer, mapping layer which maps a document to semantic space, and relevance measurement layer which computes a cosine similarity between the given query and document. Then, the relevant score is computed through the softmax layer, normalized all cosine similarity.

The performance gain comes from the fact that the authors use supervised information (click-through data) to train the model. By maximizing the conditional probability of $\log \prod_{(Q,D^+)} P(D^+|Q)$, the model will learn to map a relevant pair of query and documents to similar embedded vectors.

The word hashing layer is introduced in order to reduce the dimension of an input vector. If we use one-hot vector, the length of a vector is too long. Using a letter-trigram reduces the dimension significantly.

The obvious extension is to replace the transformation layer with CNNs [2] or RNNs so that the deep model can capture local structures in the documents. [2] shows that using a convolutional layer with max pooling will slightly improve the NDCG score. I don’t know if anyone has tried RNN yet.

But what I don’t feel comfortable with their model is that they treat a query as a document. I think this assumption is too rigid. To me, a query is more like a signal indicates a user information need, which is clearly not a document. I still prefer to treat a query based on a language model rather than a vector space model.

References:

[1] Huang, P., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. 2013. Learning deep structured semantic models for web search using clickthrough data. In CIKM’13.

[2] Shen, Yelong, et al. “Learning semantic representations using convolutional neural networks for web search.” WWW’14

# Relevance-based Word Embedding (SIGIR’17)

Word embedding has become a standard tool in many IR tasks. The popular tools such as Word2Vec[2] and GloVe [3] map each word to an embedding space where similar words are nearby. The similarity measurement is based on semantic or syntactic between words. Typically, similar words tend to stay in the same context. Thus, the objective functions exploit this assumption such as CBOW or skip-gram models. As a result, the models are very good at guessing another word given the context.

However, the authors believe that semantic and syntactic similarities are not an appropriate metric in IR tasks such as query expansion and classification. In the query expansion, we want to suggest an additional term that is relevant to the query but not necessary semantically or syntactically similar to. For instance, the given query is “indian american museum”, the Word2Vec will suggest words such as “united, states” because “united” and “states” are semantically similar to the word “american”. But these are not good term expansions. The relevent-based embedding will suggest the words that are relevant to the ‘whole’ query such as “heye, chumash, apa” which are more directly related to the query terms. Based on this idea, it motivates the authors to develop a relavance-based word embedding.

This paper proposes two models: Relevance Likelihood Maximization (RLM) and Relevance Posterior Estimation (RPE). The first model, RLM, is consists of two distributions:

• $p(w|R_i)$ – the relevance model distribution for the given query $q_i$ where $p(w|R_i) = \sum_{d \in F_i} p(w|d) \prod_{w' \in q_i} p(w'|d)$. Basically, this distribution is a product of ‘how likely word w is relevant to all relevant documents in the top K documents’ and ‘how similar between query $q_i$ to each document in the top K’. The first term acts as likelihood function and the second term acts as a weight.
• $\hat p(w|q_i; \theta_R)$ – the similarity between embedded word w and query q. $\hat p(w|q_i; \theta_R) = \frac{\exp{\vec{w}^T\vec{q}}}{\sum_{w'\in V}\exp(\vec{w}'^T\vec{q})}$. We normalized it so that the distribution will sum to a unity.

To interpret this model, we basically use the relevance model distribution as a weight to signify the similarity between a particular query and word. This weight is computed via a traditional PRF model. (see eq.3). The neural network will be constraint by this given weight and will be able to find an embedding vector that hopefully maximize the likelihood function.

The second model, RPE, uses a mixture model by assuming that there are two language models for the feedback sets: the relevance language model and noisy language model. The author opts to cast this assumption to a binary classification problem by predicting if the given w does come from the relevance distribution of the query q. In other words, RPE wants to estimate the following distribution:

$\hat p(R=1|\vec{w},\vec{q};\theta_R) = \frac{1}{1+\exp{(-\vec{w}^T\vec{q})}}$

When a random variable R is one, it implies that word w in relevant to query q.

Two optimizations are necessary to train both RLM and RPE models. In RLM model, computing a normalization term is expensive, the author uses a hierarchical approximation of the softmax function [4] to speed up the training. Meanwhile, training RPE requires negative samples. The author chose the NCE [5] method.

The experimental results show that RLM model has a better performance than RPE models on a query expansion task. Both models outperform Word2Vec and GloVe due to its ability to put a high probability on words that are relevant to the entire query. However, RPE is better than RLM model on query classification task.

References:

[1] Zamani, Hamed, and W. Bruce Croft. “Relevance-based Word Embedding.” SIGIR’17

[2] Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” NIPS’13.

[3] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “Glove: Global Vectors for Word Representation.” In EMNLP’14.

[4] Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In AISTATS’05.

[5] Michael U. Gutmann and Aapo Hyvarinen. 2012. Noise-contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. In Mach. Learn. Res. 2012.

# Learning to Match Using Local and Distributed Representations of Text for Web Search (WWW’17)

The idea of this paper is to combine a local matching with distributed learning. In a local matching, a relevance is determined by the exact match between a query and a candidate document. This approach is commonly used in a traditional IR and it has a key advantage such as it is required no training information, scalable, and it is not domain specific. Moreover, the matching model is important to find a new document. If the match term is unique, then we should be able to locate the new document right away.

The distributed learning approach embeds both query and document to a semantic space and computes similarity in that space. The traditional approach such as LSI, PLSA and LDA learn document topic distribution as a continuous vector and the most relevant document is the one that is closest to the embedded query. The distributed representation is important because it can capture semantical similarity where the exact match is unable to do so.

Combining these two models are not a new idea. Wei [2] uses LDA to augment a traditional retrieval model and demonstrates a performance gain. This paper combined two deep neural networks models: one for local matching, another for distributed learning by jointly learning both model parameters.

This paper explains the model architecture, which is heavily engineered, with convolutional and pooling layers, sophisticated input. The Hadamard product layer is used for computing exact match on embedded documents and query.

I feel this paper has a good motivation of why combing local matching with distributed representation might be a good idea. However, the motivation of choosing specific model architecture is not clear. This made me doubt if the performance does really come from combining two approaches or simply the choice of architecture. Furthermore, the local matching wants to capture the matching positions, but in order to do so, all documents must have the same length. Thus, the author picked the first 100 or 1000 words as the document representation. This means that we discard a lot of information.

I am not sure why the author chose to use character n-graph as an input feature. Why he did not use a simple term frequency or raw documents? The missing explanation has weakened this paper. However, the figure 6, the author embedded each retrieval performance vector onto 2D space is interesting. This simple visualization can be used to compare model similarity.

References:

[1] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to Match Using Local and Distributed Representations of Text for Web Search. In WWW’17. 1291-1299.

[2] X. Wei and W. B. Croft. Lda-based document models for ad-hoc retrieval. In SIGIR’06. 178-185.

# Neural Ranking Models with Weak Supervision (SIGIR’17)

The paper provides a good motivation for applying deep learning to the ranking problem. The author claims that there are not enough labels for training a deep model, e.g. there are too few relevance judgments between query and document pair for an ad-hoc retrieval task. Hence, the lack of labels has hindered the model from generalizing any useful representation/interaction between query and document.

The author proposes to use BM25 to compute a relevant score and use the score as a noisy relevance judgment. This approach is reasonable since BM25 is a state-of-the-arts vector space model for document ranking, thus we could be able to generate a lot of relevant scores (although some scores may not be accurate or even mislead).

The key challenge is how can we train the model so that the noisy label will not degrade the model performance. This paper shows that the ranking model is more superior than scoring model because learning a relative order avoid the model from fitting to the noisy information where scoring model ends up fitting to the low-quality labels.

Another key insight is to not limit the representation of the input. If we force the input feature as a discrete unit, we could prohibit the model from learning semantical similar between words in queries and documents. Thus, represent an input vector as an embedding vector representation boosts the performance. In order to compose all embedded vectors together, the author uses weight average of all embedded vectors. The model could learn to weight each vector.

In the experimental section, the author found that dense and sparse representations suffer from overfitting even with heavily use of a regularizer. But embedding representation does not overfit.

When we allow the model to learn a weighting score for each embedded word, the learned weight has a strong linear correlation with inverse document frequency. This is an interesting discovery because the author feeds a single instance of a document-query pair to the model, but yet the model can memorize and learn a global/corpus information.

The author demonstrates that pretraining the embedding vectors with the external corpus do not perform as good as training on the target corpus, especially when the corpus size is small. When a corpus is large (ClueWeb dataset), there is no difference in performance. The learned weight also boosts the performance but not significant.

Finally, when we have limit supervised data, it is doable to pretrain the model on weak labels and fine-tunes with quality labels.

Overall, I really like this paper. It shows that weak label mitigates the lack of supervised data and ranking objective function is suitable for a weak label due to it does not overfit to the label information.

References:

[1] Dehghani, Mostafa, et al. “Neural Ranking Models with Weak Supervision.” SIGIR’2017