This work points out that there are fundamentally different between representation-focused model and interaction-focused model. The first model builds an effective representation for each word and conduct matching through a simple operation. The second model focuses less on representation learning but attempts to model a complex interaction between two words. This paper focuses on the 2nd idea.
The main motivation to focus more on word interaction because ad hoc retrieval problem requires a word matching. The retrieval performance depends on the actual matching as evidently see in many past language models for IR that use weight average between word matching and word semantic. The semantic matching only is not sufficient.
This paper proposed a deep learning model to learn a representation of word and documents interaction. The key idea is to build a word similarity histogram to represents the interaction. Here is how they build a histogram: the histogram has a fixed number of buckets. Each bucket is associated with a range of similarity scores. The author uses 5 buckets: [-1, -0.5), [-0.5, 0), [0, 0.5), [0.5, 1), [1, 1]. The last bin counts the number of the exact word matching. This is how this model captures word matching. However, the proposed model depends on pre-trained word embedding. With a good word embedding, the interaction between two similar words on the same document will be similar.
The deep neural network will act as a non-linear function that transforms each interaction vector (histogram) to a relevant score. They use ‘tanh’ to squeeze the final score within the range of [-1, +1].
A query word will interact with all documents in the corpus and the model will output a vector of relevant scores for a single query word. However, there are more than one query words. The model sums all relevant score vectors to have a final relevant score. The author assumes that some query words are more important than others. For example, stopwords are not important compared to a name of person or city name. Thus, this model also computes a weight vector so that the final relevant score is computed as:
Where s is the final relevant score, M is the number of query words, is a weight for term i, and is a relevant score vector for term i. They compute this weight through a softmax function. The transformation matrix will be learned during the training.
Finally, to train this model, they employ a pairwise ranking loss such as hinge loss. Each training sample is a triple of (query, positive document, and negative document) where a positive document has a higher ranked than a negative document. This model is trained to a standard backprop algorithm.
I found that mapping an interaction between word and document to a relevant score is novel. Typically, we focus more on semantic matching, but this work focuses more or relevant matching. And one way to estimate this relevancy is by computing a matching histogram.
Guo, Jiafeng, et al. “A deep relevance matching model for ad-hoc retrieval.” Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2016.