DiscLDA (NIPS’08)

We can learn two type of document representations: generative and discriminative features. For example, when we learn a document representation using an autoencoder, we will get a generative representation because we only care the information that will help us reconstruct the original document back. When we train the document for prediction task by providing side information such as a class label, then the learned representation become a discriminative representation because we only want the features that correspond to the correct class label.

LDA is an example of learning a generative representation. We want to learn a theme of each document and word distributions for each topic. We constraint this model by sharing the word distributions among all documents in the same corpus.

However, we can also learn the word distribution shared by documents in the same class. Then, this distribution indeed represents discriminative information for a particular class label.

DiscLDA (discriminative LDA) is a model that combine the unsupervised topic learning from the corpus with discriminative information from the class labels. The goal is to find a more efficient representation for a classification task or transfer learning.

The paper proposed two models: DiscLDA and DiscLDA with auxiliary variables. The first model extends a standard LDA by introducing a class-dependent transformation matrix. The topic assignment is drawn from z_{dn} \sim T^{yd}\theta_d. Intuitively, in LDA, the document representation resides in a K-simplex space. The transformation $T^{yd}$ moves a point (document) in the K-simplex so that documents within the same class labels will be nearby. The important part is that this model is still constraint by the global word distribution shared by all documents. Thus, we cannot just map any document to any location we want because we still need to maintain a corpus-level word-topic relationship.

The second model introduces an auxiliary variable. First, a new parameter, \phi^y_k is a class-dependent word distribution for class y and topic k. Thus, a word is drawn from the following distribution:

w_{dn} | z_{dn}, y_d,\Phi \sim \text{Mult}(\phi^{yd}_{z_{dn}})

Learning $p(y|w,\Phi)$ can be difficult because this distribution has a lot of parameters and highly non-convex. The author proposed alternate inference between a transformation matrix T^y by maximizing the conditional probability p(y_d|w_d; {T^y},\Phi). Then maximizing the posterior of the model in order to capture the topic structure that is shared in document throughout the corpus.

By learning a transformation matrix, DiscLDA is able to find top words for each class category. The learned topic from this model yields higher performance on text classification task compared to the original LDA.