Learning to Reweight Terms with Distributed Representations (SIGIR’15)

The retrieval model is concerned with computing a relevance score between a query and a document. Typically, the score is a weighted sum of the matching scores between each query term t and document D:

$s(q, D) = \sum_{t \in q \cap D} w(t)f(t, D)$

The most common weight function is inverse document frequency (IDF), which gives rare terms more weight than common ones. IDF is a reasonable default in most retrieval models.
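As a minimal sketch of the scoring formula above, the snippet below uses IDF as $w(t)$ and raw term frequency as $f(t, D)$. Both choices are assumptions made for illustration: the formula does not fix $f$, and $\log(N/\mathrm{df})$ is only one common IDF variant. The tiny corpus and the function names are hypothetical.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency, log(N / df); one common variant."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def score(query, doc, docs):
    """s(q, D): sum over terms shared by q and D of w(t) * f(t, D)."""
    tf = Counter(doc)  # f(t, D) taken as raw term frequency (an assumption)
    return sum(idf(t, docs) * tf[t] for t in set(query) if t in tf)

# Toy corpus, made up for illustration.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "chased", "the", "dog"],
]
query = ["the", "cat"]
scores = [score(query, d, docs) for d in docs]
```

Because "the" appears in every document, its IDF is zero and only "cat" contributes to the score. Note that the weight of "cat" is the same for every query that contains it.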

This paper proposes a term-weight learning method based on distributed word vectors (Word2Vec). Motivated by the additive property of distributed word representations, it summarizes a query q as the average of the vectors of all terms in q. The authors believe that distributed representations will provide a more accurate estimate of a query-dependent weight function $w(t)$.

This idea sounds better than IDF because the new weight function depends on the query. The weight is no longer fixed for a given term t; it depends on the context of query q.

The paper uses a simple estimate based on the assumption that the average of the word vectors summarizes the whole query. Any term in query q that deviates too much from that average should then be less relevant. For each term t in the query (the j-th term of query $q_i$) and its corresponding word vector $\textbf{w}_{i,j}$, the query-dependent feature vector of term t is:

$\textbf{x}_{i,j} = \textbf{w}_{i,j} - \bar{\textbf{w}}_{q_i}$

where $\bar{\textbf{w}}_{q_i}$ is the average of all word vectors in the query. Then, to estimate the true weight, the authors train a linear regression with an L1 regularizer, using $\textbf{x}_{i,j}$ as the input feature and a relevance-judgment score based on the term recall weight [2] as the ground truth. Once the mapping function is trained, the feature vector $\textbf{x}$ is constructed for each term of any new query, and the mapping function is applied to it to compute the term's weight.
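The pipeline described above can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the 2-d embeddings, the target values, the regularization strength `lam`, and the choice of coordinate descent as the solver are all made up here. One detail does follow from the construction: each query's features sum to zero, so the columns of the training matrix have zero mean and the unpenalized intercept is simply the mean target.

```python
import math

def query_features(vectors):
    """x_{i,j} = w_{i,j} - mean of all word vectors in the query."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return [[a - m for a, m in zip(v, mean)] for v in vectors]

def soft_threshold(z, t):
    """Shrink z toward zero by t; this is how L1 produces exact zeros."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def fit_lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - b0 - X b||^2 + lam * ||b||_1.

    Assumes the columns of X have zero mean, so b0 = mean(y).
    """
    n, d = len(X), len(X[0])
    intercept = sum(y) / n
    centered = [t - intercept for t in y]
    beta = [0.0] * d
    for _ in range(n_iter):
        for k in range(d):
            rho = norm = 0.0
            for row, t in zip(X, centered):
                # Prediction from every coefficient except the k-th.
                partial = sum(b * x for j, (b, x) in enumerate(zip(beta, row)) if j != k)
                rho += row[k] * (t - partial)
                norm += row[k] ** 2
            beta[k] = soft_threshold(rho / n, lam) / (norm / n) if norm else 0.0
    return intercept, beta

# Hypothetical 2-d embeddings and made-up term recall targets.
emb = {
    "cheap": [0.9, 0.1], "flights": [0.2, 0.8], "to": [0.1, 0.1],
    "paris": [0.3, 0.9], "hotel": [0.4, 0.7], "in": [0.1, 0.2],
}
train = [
    (["cheap", "flights", "to", "paris"], [0.6, 0.9, 0.1, 0.95]),
    (["hotel", "in", "paris"], [0.9, 0.1, 0.9]),
]
X, y = [], []
for terms, recalls in train:
    X.extend(query_features([emb[t] for t in terms]))
    y.extend(recalls)

intercept, beta = fit_lasso(X, y, lam=0.001)

# Weights for an unseen query: build features, then apply the learned mapping.
new_query = ["cheap", "hotel", "paris"]
new_weights = [intercept + sum(b * v for b, v in zip(beta, x))
               for x in query_features([emb[t] for t in new_query])]
```

The same word can receive different weights in different queries, because its feature vector changes with the query mean.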

Although the idea proposed in this paper is simple, it shows how to exploit the additive property of distributed representations. However, the way the query-dependent feature vector is computed does not satisfy me. Some queries contain several closely related words, such as the parts of a phrase like “United States” or “San Francisco”, so the average of the query will be biased toward the most heavily represented concept in the query. This may weaken a rare term in the same query, even though that term may be the best indicator of the user’s intent.
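A toy example makes this concern concrete. The 2-d vectors below are invented for illustration, with "san" and "francisco" placed close together and "weather" far from both. The learned regression need not map a large deviation to a low weight, so this only shows how the query mean shifts, not what the trained model would predict.

```python
import math

def deviations(vectors):
    """Euclidean distance of each word vector from the query mean."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return [math.sqrt(sum((a - m) ** 2 for a, m in zip(v, mean))) for v in vectors]

# Hypothetical embeddings: two near-duplicate vectors and one distinct term.
emb = {"san": [1.0, 0.0], "francisco": [0.9, 0.1], "weather": [0.0, 1.0]}
query = list(emb)
dev = dict(zip(query, deviations([emb[t] for t in query])))
```

Here the two place-name vectors pull the mean toward themselves, so "weather" ends up with the largest deviation even though it carries the user's actual information need.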

References:

[1] Zheng, Guoqing, and Jamie Callan. “Learning to reweight terms with distributed representations.” SIGIR’15

[2] Zhao, Le, and Jamie Callan. “Term necessity prediction.” CIKM’10