This paper proposed an extension of word2vec by adding document and topic vectors inspired by LDA (Latent Dirichlet Allocation). The model can discover a linear relationship between words. For example, “California + technology = Silicon Valley”. The topic is interpretable by collecting all nearby word vectors to the selected topic vector. This work boosts word2vec with topic modeling via training in the similar fashion to word2vec.

The key difference of LDA2Vec is its loss function. There are 2 loss functions: the first one is Skipgram Negative Sampling Loss which is similar to Word2Vec. It wants to maximize the probability of predicting a target word \vec w_j and non-target word (negative samples) given a context vector \vec c_j. This loss wants the model to distinguish a positive word (which related to the given context) from negative sampled words.

The innovation is the context vector \vec c_j. The intuition is that predict a nearby word given a pivot word also depends on the theme of the context. For example, if the document is about an airline when we want to predict nearby words given a word “Germany”, we will likely want to see word related to airlines but not country names. Thus, a context vector is a sum of a word vector and document vector. \vec c_j = \vec w_j + \vec d_j. Thus, LDA2vec attempts to capture both document-wide relationship and local interaction between words within its context window.

In order to learn a topic vector, the document is further decomposed as a linear combination of topic vectors. \vec d_j = \sum_{k} p_{jk} \cdot \vec t_k where p_{jk} is a probability of document j will be a topic k. Finally, the interpretability comes from a sparsity of topic assignment vector, p_j. One way to enforce sparsity is to design the loss function as:

L^{d} = \lambda \sum_{jk} (\alpha - 1)\log p_{jk}

When \alpha < 1, we encourage a topic assignment probability to put more mass on a small set of topics.

The results look interesting. This paper shows a simple way to combine topic modeling with word embedding. By embedding document vectors and topic vectors into the same semantic space as word vectors, we can learn a global semantic structure as well as word-level local interaction.


[1] https://arxiv.org/pdf/1605.02019.pdf

[2] http://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/

[3] Skip-gram tutorial part1: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

[4] Skip-gram tutorial part2: http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/


Labeled-LDA (ACL’09)

In a classical topic model such as LDA, the learned topic can be uninterpretable. We need to eyeball the top words to interpret the semantic of the learned topic. Labeled LDA (L-LDA) is a model that assigns each topic to a supervisor signal. Many documents contain tags and labels, these data can be treated as target topics. If we can group similar words based on the corresponding labels, then each topic will be readable.

The trick is to constrain the available topics that each document can learn. In LDA, a document draws a topic distribution from a K-simplex ( a Dirichlet distribution ). But L-LDA draws a topic distribution from L-simplex where L is the number of tags for the given document. This implies that each document will draw a topic distribution from a different shape of simplex. L-LDA calculates the parameter of dirichlet distribution based on the tags.

For example, if there are K possible tags, \bf{\alpha} = \{ \alpha_1, \alpha_2, \cdots, \alpha_K \}, a document with L tags (k=2, k=5) will have an alpha \bf{\alpha^{(d)}} = \{\alpha_2, \alpha_5 \}. Then the topic assignment will be constraint to only topic 2 and 5.

Due to this model assumption, L-LDA assumes that all words appear in the same document are definitely drawn from a small set of topics. For a single-label dataset, this means that all words in the same document are from the same topic. Is this a good idea?

It turns out that this is a valid assumption. When a document is assigned a label, it is because the combination of words in the document contributes a very specific theme or ideas. Another view of this constraint is that the documents that share the same labels will share the same set of words that likely to appear.

When L-LDA is applied on a multi-labeled dataset, then the learned topic will become more interesting because the way each document share the same set of words become more complicated.



DiscLDA (NIPS’08)

We can learn two type of document representations: generative and discriminative features. For example, when we learn a document representation using an autoencoder, we will get a generative representation because we only care the information that will help us reconstruct the original document back. When we train the document for prediction task by providing side information such as a class label, then the learned representation become a discriminative representation because we only want the features that correspond to the correct class label.

LDA is an example of learning a generative representation. We want to learn a theme of each document and word distributions for each topic. We constraint this model by sharing the word distributions among all documents in the same corpus.

However, we can also learn the word distribution shared by documents in the same class. Then, this distribution indeed represents discriminative information for a particular class label.

DiscLDA (discriminative LDA) is a model that combine the unsupervised topic learning from the corpus with discriminative information from the class labels. The goal is to find a more efficient representation for a classification task or transfer learning.

The paper proposed two models: DiscLDA and DiscLDA with auxiliary variables. The first model extends a standard LDA by introducing a class-dependent transformation matrix. The topic assignment is drawn from z_{dn} \sim T^{yd}\theta_d. Intuitively, in LDA, the document representation resides in a K-simplex space. The transformation $T^{yd}$ moves a point (document) in the K-simplex so that documents within the same class labels will be nearby. The important part is that this model is still constraint by the global word distribution shared by all documents. Thus, we cannot just map any document to any location we want because we still need to maintain a corpus-level word-topic relationship.

The second model introduces an auxiliary variable. First, a new parameter, \phi^y_k is a class-dependent word distribution for class y and topic k. Thus, a word is drawn from the following distribution:

w_{dn} | z_{dn}, y_d,\Phi \sim \text{Mult}(\phi^{yd}_{z_{dn}})

Learning $p(y|w,\Phi)$ can be difficult because this distribution has a lot of parameters and highly non-convex. The author proposed alternate inference between a transformation matrix T^y by maximizing the conditional probability p(y_d|w_d; {T^y},\Phi). Then maximizing the posterior of the model in order to capture the topic structure that is shared in document throughout the corpus.

By learning a transformation matrix, DiscLDA is able to find top words for each class category. The learned topic from this model yields higher performance on text classification task compared to the original LDA.



TopicRNN : A Recurrent Neural Network with Long-Range Semantic Dependency

This paper presents a RNN-based language model that is designed to capture a long-range semantic dependency. The proposed model is a simple and elegant, and yields sensible topics.

The key insight of this work is the difference between semantic and syntax. Semantic is relating to an over structure and information of the given context. If we are given a document, its semantic is a theme or topic. Semantic is meant to capture a global meaning of the context. We need to see enough words to understand its semantics.

In contrast, a syntax is dealt with local information. The likelihood of the current word heavily depends on the preceding words. This local information depends on the word order whereas the global information does not depend on word order.

This paper points out the weakness in probabilistic topic models such as LDA such as its lack of word order, its poor performance on word prediction. If we use bigram or trigram then these higher order models become intractable. Furthermore, LDA does not model stopwords very well because LDA is based on word co-occurrence. Stopwords tend to appear everywhere because stopwords do not carry semantic information but it acts as a filler to make the language more readable. Thus, when training LDA, the stopwords are usually discarded during the preprocessing.

RNN-based language models attempt to capture sequential information. It models a word joint distribution as P(y_1, y_2, \cdots, y_T) = P(y_1) \prod_{t=2}^T p(y_t | y_{1:t-1}). The Markov assumption is necessary to keep the inference tractable. The shortcoming is the limitation of the context windows. The higher order Markov assumption makes an inferencing becomes more difficult.

The neural network language model avoids Markov assumption by modeling a conditional probability P(y_t | y_{1:t-1}) = p(y_t|h_t) where h_t = f(h_{t-1}, x_t). Basically, h_t is a summarization of the preceding words and it uses this information to predict the current word. The RNN-based language model works pretty well but it has difficulty with long-range dependency due to the difficulty in optimization and overfitting.

Combining the advantage from both topic modeling and RNN-based is the contribution of this paper. The topic model will be used as a bias to the learned word conditional probability. They chose to make the topic vector as a bias because they don’t want to mix it up with the hidden state of RNN that includes stop words.

The model has a binary switch variable. When it encounters a stopword, the switch is off and disable a topic vector. The switch is on otherwise. The word probability is defined as follows:

p(y_t = i | h_t, \theta, l_t, B) \propto \exp ( v_t^T h_t + (1 - l_t)b_i^T \theta)

The switch variable, l_t turn on and off the topic vector \theta.

This model is an end-to-end network, meaning that it will jointly learn topic vectors and local state from RNN. The topic vector is coupled with RNN’s state so the local dynamic from word sequence will influence the topic vector and wise verse.

RNN can be replaced with GRU or LSTM. The paper shows that using GRU yields the best perplexity on Penn Treebank (PTB) dataset. The learned representation can be used to as a feature for many tasks including sentiment analysis where we want to classify positive and negative reviews on IMDB dataset.

I found this model is simple and elegantly combine VAE with RNN. The motivation is clear and we can see why using contextual information learned from VAE will improve the quality of the representation.


https://arxiv.org/abs/1611.01702 (ICLR 2017 – Poster)


A Biterm Topic Model for Short Texts (WWW’13)

Topic model is an unsupervised learning algorithm that discovers the theme of an individual document from a large text collection. The well-known topic models algorithms are PLSA and LDA. Both popular methods are good at summarizing a document, able to capture the long-range dependency which depicts as the theme of that particular document. However, both algorithms suffer when there is not enough word in the given document such as twitter text or short texts because they assume that two words are related if they appear in the same document. When the given document contains only a few texts, there are not enough observation of word co-occurrence – which leads to poorly estimate document summary/theme.

A Biterm Topic Model [1] by Yan solves the data sparsity in a short document by introducing a biterm in place of a single word in order to increase more observations. A biterm is a word pair in the given context. For example, a document “visit apple store” will have the following biterms: “(visit apple), (visit store), (apple store)”. In fact, biterm explicitly model word  co-occurrence while LDA and PLSA implicitly model word  co-occurrence. The author introduce BTM method (Biterm Topic Modeling) to tackle this problem.

Unlike LDA and PLSA, BTM does not model an individual document but model all biterms in the corpus. It puts a strong assumption that each biterm is associated with one topic. This is different from LDA/PLSA that allows each word to be mapped into multiple topics. However, I found this assumption is valid because biterm is more specific than a single word. The word ‘bank’ can either ‘river bank’ or ‘bank teller’ – when adding the second word, we disambiguate the meaning of the first word and be able to indicate its topic.

Without modeling the document directly, BTM needs to inference document topics based on its biterms. According to the paper, this process is simple and straight-forward. However, marginalizing all biterms ( finding the average ) could be improved with more sophisticated assumption. However, the experiment shows that this approximation is good enough to outperform LSA and other state-of-the-arts methods.

The state-of-the-arts including LDA, LDA-u, and a mixture of unigrams. LDA models a document directly and will suffer the data sparsity. LDA-u is a heuristic approach by expanding the short document with more documents from the same authors. For an instance, it combines all twitter texts from the same author so that LDA has more texts to make a better summary. A mixture of unigrams assumes that each document has one fixed topic. This assumption might be reasonable for a short text but not all short texts contain a single topic. BTM seems to be a better model in this respect.

In term of training model is straight forward. The author employed  a collapsed Gibbs Sampling. The posterior formulation is influenced by two factors: the proportion of the topic in the corpus ( if topic K is dominated, then it is likely that that biterm will be in topic K) and the proportion of the given two words in topic K ( if both words are likely to be in topic K, then the given biterm is probably in the topic K as well).

In sum, BTM is a simple extension of LDA that performs well on short text corpus such as twitter. The experimental section is well-written and provide many experiments and metrics. The author also provide the source code which make it is easier for researchers to reproduce and extend this model.


[1] Yan, Xiaohui, et al. “A biterm topic model for short texts.” Proceedings of the 22nd international conference on World Wide Web. ACM, 2013.