I just read about Learn2Rank and stumped over the ListNet [3] and ListMLE [4]. These two models are part of Learn2Rank algorithms. The goal is to learn a score function where d is a document that will yield a high score for a relevant document but low scores for non-relevant documents.

There are three main methods to learn this score function: point-wise, pair-wise, and list-wise. Point-wise: we learn this function from one sample. Pair-wise: we learn from a pair of samples, its relationship. And List-wise: we learn from a collection of samples, its order/rank.

The main essence of ListNet and ListMLE is Plackett-Luce model. (see. [1, 2]). The PL model defines a probability distribution of objects. If we have a collection of N objects, and each object has its associated score, then we can compute the probability of seeing a particular permutation. For example, I have items {A, B, C} and scores {5, 2, 1}; then PL model defines a probability distribution of all permutation of {A, B, C}.

I will not go over its definition, but you can read more from [2]. My point is we have a function that will assign the score to every possible permutation of the collection of items.

What ListNet [3] tried to do is to model a score function as a neural network. Then, the author observes that given the ground truth, we can compute the true probability of the given permutation of the items. Then, we want to learn a score function such that it will map a raw feature (document) to a score. The key idea is the probability distribution calculated from the score function should be as similar as the true probability distribution of all permutation based on the ground truth. They use KL divergence to measure the deviation of the learned distribution from the true distribution.

As we expect, ListNet can be expensive to train because we need to compute all possible permutations. Thus, people approximate the permutation by only calculate up to K permutations. K can be 1, 5, 10, 20, or 50, whatever you think is a good number.

ListMLE [4] is even a simpler model. Minimizing loss based on KL divergence is still a computational expensive. Instead, they just want to maximize the probability of the observed permutation directly. If we are given a ground truth that ‘A, B, C’ is the best permutation, then P( ABC) should have the highest probability. Then they train a neural network to learn a score function that will give a score that will satisfy this objective function.

References:

[1] Plackett, R. (1975). The analysis of permutations. Applied Stat, 24, 193-202.

[2] Guiver, John, and Edward Snelson. “Bayesian inference for Plackett-Luce ranking models.” proceedings of the 26th annual international conference on machine learning. ACM, 2009.

[3] Cao, Zhe, et al. “Learning to rank: from pairwise approach to listwise approach.” Proceedings of the 24th international conference on Machine learning. ACM, 2007.

APA

[4] Xia, Fen, et al. “Listwise approach to learning to rank: theory and algorithm.” Proceedings of the 25th international conference on Machine learning. ACM, 2008.

APA