The paper motivates applying deep learning to the ranking problem well. The authors argue that there are too few labels to train a deep model; for example, an ad-hoc retrieval task has too few relevance judgments for query-document pairs. This lack of labels has kept models from learning useful representations of, or interactions between, queries and documents.
The authors propose using BM25 to compute a relevance score and treating that score as a noisy relevance judgment. This approach is reasonable because BM25 is a strong, widely used unsupervised ranking function, so we can generate a large number of relevance scores (although some may be inaccurate or even misleading).
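To make the weak-labeling idea concrete, here is a minimal sketch of BM25 scoring over a toy collection. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the three toy documents are all my own assumptions for illustration; the paper's actual retrieval setup may differ.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; k1 and b are common defaults,
    not values taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy collection (made up for this example).
docs = [
    "neural ranking models with weak supervision".split(),
    "cooking pasta at home".split(),
    "weak supervision for ranking".split(),
]
# These scores would play the role of noisy relevance labels.
weak_labels = bm25_scores("weak supervision ranking".split(), docs)
```

The second document shares no terms with the query and scores zero, while the other two receive positive scores that could serve as weak labels for training.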
The key challenge is how to train the model so that the noisy labels do not degrade its performance. The paper shows that a ranking model is superior to a scoring model: learning a relative order keeps the model from fitting the noise, whereas a scoring model ends up fitting the low-quality labels.
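The difference between the two objectives can be sketched as follows. I use a squared error for the scoring objective and a pairwise hinge loss for the ranking objective; both are common choices and are my assumption here, not necessarily the paper's exact formulation. The function names and the numbers are made up for illustration.

```python
def pointwise_squared_loss(pred, weak_score):
    """Scoring objective: the prediction must match the noisy score itself."""
    return (pred - weak_score) ** 2

def pairwise_hinge_loss(pred_a, pred_b, weak_a, weak_b, margin=1.0):
    """Ranking objective: only the order implied by the weak scores matters."""
    sign = 1.0 if weak_a > weak_b else -1.0
    return max(0.0, margin - sign * (pred_a - pred_b))

# Hypothetical weak scores for two documents and the model's predictions.
weak_a, weak_b = 12.3, 4.1
pred_a, pred_b = 2.0, 0.5

# The scoring loss is large because the predicted value is far from the weak score.
score_loss = pointwise_squared_loss(pred_a, weak_a)
# The ranking loss is zero because the order is correct by at least the margin.
rank_loss = pairwise_hinge_loss(pred_a, pred_b, weak_a, weak_b)
```

With the ranking loss, the model is not penalized for disagreeing with the exact BM25 values as long as it orders the pair the same way, which is one way to see why it is less sensitive to label noise.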
Another key insight is not to restrict the input representation. Forcing the input features to be discrete units can prevent the model from learning semantic similarity between words in queries and documents. Representing the input with embedding vectors therefore boosts performance. To compose the embedded vectors, the authors use a weighted average, and the model can learn the weight for each vector.
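The composition step described above can be sketched like this. I assume the per-word weights are normalized with a softmax before averaging; that normalization, the function name `weighted_average`, and the toy vectors are my assumptions for illustration. In a real model both the embeddings and the weights would be learned parameters.

```python
import math

def weighted_average(embeddings, weights):
    """Combine word embeddings into one vector using softmax-normalized weights."""
    # Subtract the max before exponentiating for numerical stability.
    top = max(weights)
    exps = [math.exp(w - top) for w in weights]
    total = sum(exps)
    normalized = [e / total for e in exps]
    dim = len(embeddings[0])
    return [
        sum(a * vec[i] for a, vec in zip(normalized, embeddings))
        for i in range(dim)
    ]

# Two toy 2-dimensional word embeddings.
vectors = [[1.0, 0.0], [0.0, 1.0]]

# Equal weights reduce to a plain average.
uniform = weighted_average(vectors, [0.0, 0.0])
# A much larger weight on the first word pulls the result toward its embedding.
skewed = weighted_average(vectors, [5.0, 0.0])
```

If the learned weights end up tracking how informative a word is, rare words dominate the composed vector, which fits the correlation with inverse document frequency that the paper reports.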
In the experimental section, the authors find that the dense and sparse representations overfit even with heavy regularization, whereas the embedding representation does not.
When the model is allowed to learn a weight for each embedded word, the learned weights show a strong linear correlation with inverse document frequency. This is an interesting discovery: the authors feed the model one query-document pair at a time, yet it still picks up global, corpus-level information.
The authors demonstrate that pretraining the embedding vectors on an external corpus does not perform as well as training them on the target corpus, especially when the corpus is small. When the corpus is large (the ClueWeb dataset), there is no difference in performance. The learned weights also improve performance, but not significantly.
Finally, when supervised data is limited, it is feasible to pretrain the model on weak labels and then fine-tune it on high-quality labels.
Overall, I really like this paper. It shows that weak labels mitigate the lack of supervised data, and that a ranking objective suits weak labels because it does not overfit to the label values.
Dehghani, Mostafa, et al. “Neural Ranking Models with Weak Supervision.” SIGIR 2017.