A Biterm Topic Model for Short Texts

A topic model is an unsupervised learning algorithm that discovers the themes of individual documents in a large text collection. The best-known topic models are PLSA and LDA. Both are good at summarizing a document and can capture the long-range dependencies that make up its theme. However, both suffer when a document contains too few words, as in tweets or other short texts, because they assume that two words are related if they appear in the same document. When a document contains only a few words, there are too few observations of word co-occurrence, which leads to a poorly estimated document theme.

The Biterm Topic Model (BTM) [1] by Yan et al. addresses data sparsity in short documents by using biterms in place of single words, which increases the number of observations. A biterm is a word pair that co-occurs in a given context. For example, the document “visit apple store” yields the biterms (visit, apple), (visit, store), and (apple, store). In effect, BTM models word co-occurrence explicitly, while LDA and PLSA model it only implicitly.
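
To make the biterm idea concrete, here is a minimal Python sketch of biterm extraction. It assumes the whole short document is the context and that whitespace tokenization is good enough; the function name extract_biterms is my own and is not taken from the authors' code.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every word pair from a short, tokenized document.

    Assumption: the whole document is treated as one context window,
    which is only sensible for short texts.
    """
    return list(combinations(tokens, 2))


# The example document from the text above.
biterms = extract_biterms("visit apple store".split())
print(biterms)
```

A document with n words yields n(n-1)/2 biterms, which is where the extra observations come from.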

Unlike LDA and PLSA, BTM does not model individual documents; it models all biterms in the corpus. It makes the strong assumption that each biterm is associated with a single topic. This differs from LDA/PLSA, which allow each word to be mapped to multiple topics. However, I find this assumption reasonable because a biterm is more specific than a single word. The word ‘bank’ can mean either ‘river bank’ or ‘bank teller’; adding the second word disambiguates the first and indicates its topic.

Because it does not model documents directly, BTM has to infer a document's topics from its biterms. According to the paper, this process is simple and straightforward: the document's topic distribution is obtained by marginalizing over its biterms (in effect, averaging their topic distributions). This step could probably be improved with a more sophisticated assumption. Nevertheless, the experiments show that the approximation is good enough to outperform LDA and the other state-of-the-art methods.
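
The marginalization can be sketched as P(z|d) = sum over biterms b of P(z|b) P(b|d), where P(z|b) is proportional to P(z) P(w_i|z) P(w_j|z) and P(b|d) is the relative frequency of b in the document. The Python below is only an illustration under those assumptions: the function name, the theta and phi data layout, and the toy numbers are all hypothetical and do not come from the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def infer_doc_topics(tokens, theta, phi):
    """Estimate P(z|d) for one tokenized document from its biterms.

    theta: list of K corpus-level topic proportions P(z)      (assumed layout)
    phi:   list of K dicts mapping word -> P(w|z)              (assumed layout)
    """
    # Treat biterms as unordered pairs and count them within the document.
    counts = Counter(tuple(sorted(pair)) for pair in combinations(tokens, 2))
    total = sum(counts.values())
    doc_topics = [0.0] * len(theta)
    for (wi, wj), count in counts.items():
        # P(z|b) is proportional to P(z) * P(wi|z) * P(wj|z).
        scores = [theta[z] * phi[z].get(wi, 0.0) * phi[z].get(wj, 0.0)
                  for z in range(len(theta))]
        norm = sum(scores)
        if norm == 0.0:
            continue  # biterm has no support under any topic; skip it
        for z, score in enumerate(scores):
            # Weight P(z|b) by P(b|d), the biterm's relative frequency.
            doc_topics[z] += (score / norm) * (count / total)
    return doc_topics


# Toy parameters, invented purely for illustration.
theta = [0.5, 0.5]
phi = [{"apple": 0.5, "store": 0.4, "visit": 0.1},
       {"apple": 0.1, "store": 0.1, "visit": 0.8}]
print(infer_doc_topics(["visit", "apple", "store"], theta, phi))
```

A document with fewer than two words has no biterms, so this sketch returns all zeros for it; a real implementation would need a fallback for that case.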

The state-of-the-art methods used for comparison are LDA, LDA-u, and a mixture of unigrams. LDA models each document directly and therefore suffers from data sparsity. LDA-u is a heuristic approach that expands each short document with other documents by the same author. For instance, it combines all tweets from the same author so that LDA has more text to work with. A mixture of unigrams assumes that each document has exactly one topic. This assumption may be reasonable for short texts, but not every short text contains a single topic. BTM seems to be the better model in this respect.

Training the model is straightforward. The authors employ collapsed Gibbs sampling. The conditional posterior of a biterm's topic is driven by two factors: the proportion of the topic in the corpus (if topic K is dominant, the biterm is likely to belong to topic K) and the proportion of the two words in topic K (if both words are likely under topic K, the biterm probably belongs to topic K as well).
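
The two factors can be seen in a small Python sketch of the sampling step. It assumes the usual Dirichlet-smoothed form, P(z | rest) proportional to (n_z + alpha) (n_wi|z + beta) (n_wj|z + beta) / (sum_w n_w|z + M beta)^2, which is how I read the paper's update; the function name, count layout, and toy counts are my own assumptions, so check the exact expression against the paper or the released code before relying on it.

```python
import random


def topic_conditional(wi, wj, n_z, n_wz, vocab_size, alpha, beta):
    """Return the normalized conditional over topics for one biterm (wi, wj).

    n_z[z]:     number of biterms currently assigned to topic z
    n_wz[z][w]: number of times word w is assigned to topic z
    Assumption: the counts already exclude the biterm being resampled.
    """
    weights = []
    for z in range(len(n_z)):
        denom = sum(n_wz[z].values()) + vocab_size * beta
        # Factor 1: how dominant topic z is in the corpus.
        topic_factor = n_z[z] + alpha
        # Factor 2: how likely both words are under topic z.
        word_factor = ((n_wz[z].get(wi, 0) + beta)
                       * (n_wz[z].get(wj, 0) + beta) / denom ** 2)
        weights.append(topic_factor * word_factor)
    total = sum(weights)
    return [w / total for w in weights]


# Toy counts, invented for illustration only.
n_z = [10, 2]
n_wz = [{"apple": 5, "store": 5, "visit": 10}, {"bank": 2, "river": 2}]
probs = topic_conditional("apple", "store", n_z, n_wz,
                          vocab_size=5, alpha=1.0, beta=0.01)
# One Gibbs step: draw the biterm's new topic from the conditional.
new_topic = random.choices(range(len(probs)), weights=probs)[0]
```

A full sampler would repeat this for every biterm over many sweeps, decrementing the counts before the draw and incrementing them afterwards.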

In sum, BTM is a simple extension of LDA that performs well on short-text corpora such as Twitter. The experimental section is well written and provides many experiments and metrics. The authors also provide the source code, which makes it easier for researchers to reproduce and extend the model.


[1] Yan, Xiaohui, et al. “A biterm topic model for short texts.” Proceedings of the 22nd international conference on World Wide Web. ACM, 2013.