Learning to Match Using Local and Distributed Representations of Text for Web Search (WWW’17)

The idea of this paper is to combine local matching with distributed representations. In local matching, relevance is determined by exact matches between the terms of a query and a candidate document. This approach is common in traditional IR, and its key advantages are that it requires no training data, scales well, and is not domain specific. Moreover, exact matching is important for finding new documents: if a matching term is unique, we should be able to locate a new document containing it right away.
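To make the notion of local matching concrete, here is a minimal sketch of a binary exact-match matrix between query terms and document positions. The function names and the toy query and document are my own assumptions for illustration, not the paper's code.

```python
def exact_match_matrix(query_terms, doc_terms):
    """Binary matrix: one row per query term, one column per document position."""
    return [[1 if q == d else 0 for d in doc_terms] for q in query_terms]


def exact_match_score(query_terms, doc_terms):
    """Count the (query term, document position) pairs that match exactly."""
    matrix = exact_match_matrix(query_terms, doc_terms)
    return sum(sum(row) for row in matrix)


# Toy example (hypothetical data); no training is needed to compute this signal.
query = "duet model".split()
doc = "the duet model combines a local model and a distributed model".split()
matrix = exact_match_matrix(query, doc)
score = exact_match_score(query, doc)
```

The matrix keeps the positions of the matches, not just their count, which is the information a position-aware local model can exploit.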

The distributed approach embeds both the query and the document into a semantic space and computes similarity in that space. Traditional approaches such as LSI, PLSA, and LDA represent each document as a continuous vector (for example, a topic distribution), and the most relevant document is the one closest to the embedded query. The distributed representation is important because it can capture semantic similarity that exact matching cannot.
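As a rough illustration of matching in an embedded space, the sketch below compares texts by cosine similarity. The three-dimensional vectors are made-up values and averaging term vectors is a simplifying assumption; a real system would learn the embeddings.

```python
import math

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
TOY_EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.9],
}


def embed(terms):
    """Average the term vectors to get a single vector for the text."""
    vectors = [TOY_EMBEDDINGS[t] for t in terms]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


query_vec = embed(["car"])
sim_related = cosine(query_vec, embed(["automobile"]))
sim_unrelated = cosine(query_vec, embed(["banana"]))
```

Here "car" and "automobile" share no characters in common as terms, so exact matching scores zero, yet their vectors are close in the embedded space.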

Combining these two models is not a new idea. Wei and Croft [2] use LDA to augment a traditional retrieval model and demonstrate a performance gain. This paper [1] combines two deep neural network models, one for local matching and another for distributed representations, and learns the parameters of both jointly.

The paper explains the model architecture, which is heavily engineered, with convolutional and pooling layers and elaborate input representations. The local model computes exact matches between query and document terms, while a Hadamard product layer in the distributed model combines the embedded query and document.
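For reference, the Hadamard product is simply the element-wise product of two equal-length vectors. The sketch below uses made-up embedding values and is not the paper's layer, which operates on learned representations.

```python
def hadamard(u, v):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a * b for a, b in zip(u, v)]


# Hypothetical embeddings of a query and a document.
query_embedding = [0.5, -1.0, 2.0]
doc_embedding = [2.0, 0.5, 0.0]
interaction = hadamard(query_embedding, doc_embedding)
```

Summing the entries of `interaction` gives the dot product; keeping them separate lets later layers weight each dimension differently.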

I feel this paper gives a good motivation for why combining local matching with distributed representations might be a good idea. However, the motivation for the specific model architecture is not clear. This made me doubt whether the performance gain really comes from combining the two approaches or simply from the choice of architecture. Furthermore, the local model wants to capture matching positions, but to do so, all documents must have the same length. Thus, the authors use only the first 100 or 1000 words as the document representation, which means that a lot of information is discarded.
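The fixed-length constraint can be sketched as truncating long documents and padding short ones. The helper name and the `<pad>` token are assumptions for illustration.

```python
def to_fixed_length(terms, length, pad_token="<pad>"):
    """Keep the first `length` terms and pad shorter inputs up to `length`."""
    truncated = terms[:length]
    return truncated + [pad_token] * (length - len(truncated))


long_doc = ["a", "b", "c", "d"]
short_doc = ["a"]
fixed_long = to_fixed_length(long_doc, 2)    # everything after "b" is dropped
fixed_short = to_fixed_length(short_doc, 3)  # padded up to three positions
```

Any term beyond the cutoff never reaches the model, which is exactly the information loss noted above.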

I am not sure why the authors chose character n-graphs as input features. Why did they not use simple term frequencies or the raw documents? The missing explanation weakens the paper. However, Figure 6, in which the authors embed each model's retrieval performance vector into a 2D space, is interesting. This simple visualization can be used to compare how similar the models are.
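For readers unfamiliar with the input feature questioned above, a character n-graph representation counts the character substrings of a term. The sketch below is my own illustration; the `#` boundary marker and the maximum length of 3 are assumptions, not settings taken from the paper.

```python
from collections import Counter


def char_ngraphs(term, max_n=3, boundary="#"):
    """Count all character substrings of length 1..max_n in the marked term."""
    marked = f"{boundary}{term}{boundary}"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(marked) - n + 1):
            counts[marked[i:i + n]] += 1
    return counts


features = char_ngraphs("web", max_n=3)
```

One plausible benefit of such features is that unseen or misspelled words still share substrings with known words, whereas a term-frequency vector would treat them as unrelated.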


[1] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to Match Using Local and Distributed Representations of Text for Web Search. In WWW’17. 1291-1299.

[2] Xing Wei and W. Bruce Croft. 2006. LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR’06. 178-185.