My takes on GANs (Generative Adversarial Nets)

What is the fuss with GANs? Everybody loves GANs now. The joke I heard from my advisor is “If your current model does not work, try it aGAN!” (pun intended). I think it comes down to the adversarial objective function. Some people may find it sexier than the objectives of VAE and NADE, which simply maximize the log-probability, that is, the likelihood the model assigns to the given data samples. VAE maximizes the evidence lower bound (ELBO), while NADE maximizes the conditional log-probability. GANs, in contrast, play a minimax game: the same objective is maximized by one player and minimized by the other at the same time.

GANs’ framework has two components: a data generator and a discriminator. The generator attempts to create a sample that is hopefully similar to a sample from the actual data, while the discriminator needs to guess whether a given sample is fake or real. This process can be viewed as a competitive game where the generator tries to fool the discriminator. The game is complete when the discriminator is unable to distinguish a generated sample from a real one. On the other hand, we can also look at GANs as a cooperative game where the discriminator helps the generator create more realistic samples by providing feedback.

There are 2 cost functions: one for the generator and another for the discriminator. The cost function for the discriminator is a binary cross entropy:

J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\frac{1}{2}E_{x \sim p_{\text{data}}}[\log D(x)] - \frac{1}{2}E_z[\log(1 - D(G(z)))]

There are two sets of mini-batches: one from the real samples and one from the generator. The optimal discriminator predicts the probability that a sample x is real as D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}.

There are many possible cost functions for the generator. The original paper proposed 2 of them. If we frame the learning procedure as a zero-sum game, then the cost function for the generator is simply J^{(G)} = -J^{(D)}. We simply want to minimize the likelihood of the discriminator being correct.

Goodfellow [1] mentioned that the above cost function has a vanishing gradient problem when the discriminator confidently rejects the generated samples as fake. To overcome this, he proposed a new cost function for the generator that maximizes the likelihood of the discriminator being wrong. The new cost function is:

J^{(G)} = -\frac{1}{2}E_z[\log D(G(z))]
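To see the saturation numerically, here is a minimal sketch of my own (not code from [1]). It assumes the discriminator output is a sigmoid of a logit a, so D(G(z)) = sigmoid(a), and compares the gradient of the two generator costs with respect to that logit. The function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_minimax(a):
    """d/da of 0.5 * log(1 - sigmoid(a)), the generator term of -J^(D)."""
    return -0.5 * sigmoid(a)

def grad_non_saturating(a):
    """d/da of -0.5 * log(sigmoid(a)), the alternative generator cost."""
    return -0.5 * (1.0 - sigmoid(a))

# Very negative logits mean the discriminator is sure the sample is fake.
logits = np.array([-8.0, -2.0, 0.0, 2.0])
for a in logits:
    print(f"D(G(z))={sigmoid(a):.4f}  minimax grad={grad_minimax(a):+.4f}  "
          f"non-saturating grad={grad_non_saturating(a):+.4f}")
```

When D(G(z)) is close to 0, the zero-sum gradient is close to 0 as well, while the alternative cost still gives a gradient of about -0.5, which is the behaviour described above.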

The new cost function is more useful in practice, but the former is useful for theoretical analysis, which we will discuss next. For now, the objective is the minimax game:

\min_G \max_D V(D,G), \quad V(D,G) = E_{x \sim p_{\text{data}}}[\log D(x)] + E_z[\log(1 - D(G(z)))] = -2J^{(D)}(\theta^{(D)}, \theta^{(G)})

Goodfellow shows that learning the zero-sum game is the same as minimizing the Jensen-Shannon divergence. Given that we have an optimal discriminator, the cost function is now:

C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log D^*(x)] + E_z[\log(1 - D^*(G(z)))]

Since we know the optimal discriminator, we now have:

C(G) = \max_D V(D,G) = E_{x \sim p_{\text{data}}}[\log (\frac{p_{\text{data}}}{p_{\text{data}} + p_{\text{model}}})] + E_z[\log(\frac{p_{\text{model}}}{p_{\text{data}} + p_{\text{model}}})]

Then, simplify further:

C(G) = \text{KL}(p_{\text{data}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \text{KL}(p_{\text{model}}||\frac{p_{\text{data}} + p_{\text{model}}}{2}) + \log({\frac{1}{4}})

At the global minimum, we will have p_{\text{data}} = p_{\text{model}}, so both KL terms vanish, hence:

C^*(G) =  \log({\frac{1}{4}})
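The algebra above is easy to sanity-check on a toy example. The sketch below is my own illustration: it assumes small discrete distributions (the numbers are made up) and writes the expectation over z as an expectation over samples from p_model. It compares C(G) computed from the optimal discriminator with the KL form.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return float(np.sum(p * np.log(p / q)))

def c_of_g(p_data, p_model):
    """C(G) with the optimal discriminator D* = p_data / (p_data + p_model)."""
    d_star = p_data / (p_data + p_model)
    return float(np.sum(p_data * np.log(d_star))
                 + np.sum(p_model * np.log(1.0 - d_star)))

def c_via_kl(p_data, p_model):
    """The same quantity written as two KL terms plus log(1/4)."""
    mix = 0.5 * (p_data + p_model)
    return kl(p_data, mix) + kl(p_model, mix) + float(np.log(0.25))

p_data = np.array([0.5, 0.3, 0.2])
p_model = np.array([0.2, 0.2, 0.6])
print(c_of_g(p_data, p_model), c_via_kl(p_data, p_model))   # the two forms agree
print(c_of_g(p_data, p_data), np.log(0.25))                  # minimum when p_model = p_data
```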

People used to believe that minimizing the Jensen-Shannon divergence is the reason why GANs produce sharp images while VAE, which minimizes a KL divergence, produces blurry images. However, this explanation no longer holds [2].

GANs have also received a lot of criticism. For instance, Schmidhuber mentioned that Goodfellow should have cited Schmidhuber’s 1992 paper [3]. (I found a Reddit thread on the public attack during the NIPS 2016 tutorial here: https://www.reddit.com/r/MachineLearning/comments/5go4sa/n_whats_happening_at_nips_2016_jurgen_schmidhuber/ ). Interestingly, Schmidhuber was one of the reviewers of the original GANs paper and he rejected it.

While many researchers claim that GANs generate realistic samples, a few NLP researchers do not believe so [4]. Meanwhile, a standard dataset for each particular task is necessary because much machine learning research really depends on empirical results.

References:

[1] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.

[2] Goodfellow, Ian. “NIPS 2016 Tutorial: Generative Adversarial Networks.” arXiv preprint arXiv:1701.00160 (2016).

[3] ftp://ftp.idsia.ch/pub/juergen/factorial.pdf

[4] https://medium.com/@yoav.goldberg/an-adversarial-review-of-adversarial-generation-of-natural-language-409ac3378bd7

 

Learning to Reweight Terms with Distributed Representations (SIGIR’15)

A retrieval model is concerned with computing a relevance score between a query and a document. Typically, the model is a weighted sum of the matching scores between each term t and document D:

s(q, D) = \sum_{t \in q \cap D} w(t)f(t, D)

The most common weight function is the inverse document frequency (IDF). It emphasizes rare terms over common terms. IDF is a reasonable assumption in any retrieval model.

This paper proposes a term weight learning method based on distributed word vectors (Word2Vec). Motivated by the additive property of distributed word representations, query q can be summarized as the average of the vectors of all terms in query q. The authors believe that distributed representations will provide a more accurate estimation of a query-dependent weight function w(t).
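As a concrete picture of the scoring formula, here is a minimal sketch on a toy corpus. Everything in it is an assumption for illustration: the documents are made up, f(t, D) is taken to be the raw term frequency, and w(t) is the plain log(N / df) variant of IDF.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "the stock markets fell".split(),
}

def idf(term, docs):
    """log(N / df): one common IDF variant (an assumption, other variants exist)."""
    df = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / df) if df else 0.0

def score(query, doc_words, docs):
    """s(q, D) = sum over terms in both q and D of w(t) * f(t, D), with f = raw term frequency."""
    tf = Counter(doc_words)
    return sum(idf(t, docs) * tf[t] for t in set(query) if t in tf)

query = "the dog".split()
for name, words in docs.items():
    print(name, round(score(query, words, docs), 3))
```

The term “the” appears in every document, so its weight is zero and only the rare term “dog” contributes to the score.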

This idea sounds better than IDF because the new weight function depends on a query. The weight will not be fixed for the given term t, but depends on the context of query q.

The paper uses a simple estimation based on the assumption that the average of word vectors will summarize the whole query. Then, any term in query q that deviates too much from the average should be less relevant. For each term t in query q and its corresponding word vector \textbf{w}_{i,j}, the query-dependent feature vector of term t is:

\textbf{x}_{i,j} = \textbf{w}_{i,j} - \bar{\textbf{w}}_{q_i}

Where \bar{\textbf{w}}_{q_i} is the average of all word vectors in the query q. Then, in order to calculate the true weight, the authors train a linear regression with an L1 regularizer, using \textbf{x}_{i,j} as the input feature and a relevance judgment score based on term recall weight [2] as the ground truth. Once the mapping function is trained, for any new query, the feature vector \textbf{x} is constructed and the mapping function is used to calculate the weight score.

Although the idea proposed in this paper is simple, it shows how to exploit the additive property of distributed representations. However, the way the query-dependent feature vector is calculated does not satisfy me. Some queries contain multiple closely related words, such as “United States” or “San Francisco”, so the average of the query will be biased toward the most common concept in the query. This may weaken a rare term in the same query, even though that term may be the best indicator of the user’s intent.
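The feature construction is short enough to sketch. The word vectors below are made-up 3-d toy values, not real Word2Vec output, and the function name is hypothetical. Fitting the L1-regularized regression on top of these features is left out.

```python
import numpy as np

# Toy 3-d word vectors (made up for illustration, not real Word2Vec output).
word_vectors = {
    "cheap":   np.array([0.9, 0.1, 0.0]),
    "flights": np.array([0.1, 0.8, 0.3]),
    "tokyo":   np.array([0.0, 0.7, 0.6]),
}

def query_features(query_terms, word_vectors):
    """Return {term: x} where x = w_term - mean of all word vectors in the query."""
    vecs = np.stack([word_vectors[t] for t in query_terms])
    q_bar = vecs.mean(axis=0)
    return {t: word_vectors[t] - q_bar for t in query_terms}

features = query_features(["cheap", "flights", "tokyo"], word_vectors)
for term, x in features.items():
    print(term, np.round(x, 3), "distance from mean:", round(float(np.linalg.norm(x)), 3))
```

By construction the feature vectors of one query sum to zero, so each x only encodes how a term deviates from the query average.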

References:

[1] Zheng, Guoqing, and Jamie Callan. “Learning to reweight terms with distributed representations.” SIGIR’15

[2] L Zhao and J. Callan. “Term necessity prediction.” In CIKM’10.

Learning Deep Structured Semantic Models for Web Search using Clickthrough Data (CIKM’13)

When we compute the relevance score between a query and the documents in the corpus, we want a higher score when the given query is relevant to the document. The vector-space model based on word matching may fail when the query and document do not share any common word. A latent semantic model could solve this vocabulary gap problem by giving a higher probability to a word that is semantically similar to the words that appear in the document.

As of 2013, when this paper was published, latent semantic models actually did not perform well compared to a simple heuristic vector-space model such as BM25. Actual word matching remained the most important indicator of relevancy.

Anyhow, this paper demonstrates that a deep neural network can embed both queries and documents into a semantic space. The authors assume that a relevant query and document should be nearby in the semantic space. Although this assumption is valid, this model does not emphasize exact matching, which is very valuable information in my opinion.

The deep model architecture contains 3 main parts: a word hashing layer, a mapping layer which maps a document to the semantic space, and a relevance measurement layer which computes the cosine similarity between the given query and document. Then, the relevance score is computed through a softmax layer that normalizes all cosine similarities.

The performance gain comes from the fact that the authors use supervised information (click-through data) to train the model. By maximizing the conditional log-likelihood \log \prod_{(Q,D^+)} P(D^+|Q), the model learns to map a relevant pair of query and document to similar embedded vectors.

The word hashing layer is introduced in order to reduce the dimension of the input vector. If we use a one-hot vector over the vocabulary, the vector is too long. Using letter-trigrams reduces the dimension significantly.
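The relevance layer and the training signal can be sketched in a few lines. This is only my illustration with made-up 2-d embeddings. The smoothing factor gamma and its value are assumptions, and a real model would produce the vectors with the deep network rather than take them as given.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def posterior(query_vec, doc_vecs, gamma=1.0):
    """P(D | Q) as a softmax over cosine similarities; gamma is a smoothing factor (assumed value)."""
    sims = np.array([cosine(query_vec, d) for d in doc_vecs])
    logits = gamma * sims
    logits -= logits.max()          # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q = np.array([1.0, 0.2])
docs = [np.array([0.9, 0.1]),      # clicked document D+
        np.array([-0.5, 1.0]),     # unclicked
        np.array([0.0, -1.0])]     # unclicked
p = posterior(q, docs, gamma=10.0)
loss = -np.log(p[0])               # negative log-likelihood of the clicked document
print(np.round(p, 4), round(float(loss), 4))
```

Minimizing this loss over many (Q, D+) pairs pushes the clicked document toward the query in the embedding space.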
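Letter-trigram hashing is simple to write down. The sketch below is my own illustration. It assumes “#” is used as the word boundary mark and represents the bag of trigrams as a plain dict instead of a sparse vector.

```python
def letter_trigrams(word):
    """Split a word into letter trigrams after adding '#' boundary marks."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def hash_words(words):
    """Bag of letter-trigrams for a list of words, as a {trigram: count} dict."""
    counts = {}
    for w in words:
        for tri in letter_trigrams(w.lower()):
            counts[tri] = counts.get(tri, 0) + 1
    return counts

print(letter_trigrams("good"))   # ['#go', 'goo', 'ood', 'od#']
print(hash_words(["good", "goods"]))
```

The number of distinct letter trigrams is bounded by the alphabet size, not by the vocabulary size, which is where the reduction in dimension comes from.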

The obvious extension is to replace the transformation layer with CNNs [2] or RNNs so that the deep model can capture local structures in the documents. [2] shows that using a convolutional layer with max pooling will slightly improve the NDCG score. I don’t know if anyone has tried RNN yet.

But what I don’t feel comfortable with in their model is that they treat a query as a document. I think this assumption is too rigid. To me, a query is more like a signal that indicates a user’s information need, which is clearly not a document. I still prefer to treat a query based on a language model rather than a vector space model.

References:

[1] Huang, P., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. 2013. Learning deep structured semantic models for web search using clickthrough data. In CIKM’13.

[2] Shen, Yelong, et al. “Learning semantic representations using convolutional neural networks for web search.” WWW’14

 

Relevance-based Word Embedding (SIGIR’17)

Word embedding has become a standard tool in many IR tasks. Popular tools such as Word2Vec [2] and GloVe [3] map each word to an embedding space where similar words are nearby. The similarity is based on semantic or syntactic relations between words. Typically, similar words tend to appear in the same context, and objective functions such as CBOW or skip-gram exploit this assumption. As a result, the models are very good at guessing another word given the context.

However, the authors believe that semantic and syntactic similarities are not appropriate criteria for IR tasks such as query expansion and query classification. In query expansion, we want to suggest additional terms that are relevant to the query but not necessarily semantically or syntactically similar to it. For instance, given the query “indian american museum”, Word2Vec will suggest words such as “united, states” because “united” and “states” are semantically similar to the word “american”. But these are not good expansion terms. A relevance-based embedding will suggest words that are relevant to the ‘whole’ query, such as “heye, chumash, apa”, which are more directly related to the query terms. This idea motivates the authors to develop a relevance-based word embedding.

This paper proposes two models: Relevance Likelihood Maximization (RLM) and Relevance Posterior Estimation (RPE). The first model, RLM, consists of two distributions:

  • p(w|R_i) – the relevance model distribution for the given query q_i, where p(w|R_i) = \sum_{d \in F_i} p(w|d) \prod_{w' \in q_i} p(w'|d). Basically, for each document d in the top-K feedback set F_i, this is a product of ‘how likely word w is under document d’ and ‘how well query q_i matches document d’, summed over the top K documents. The first term acts as a likelihood function and the second term acts as a weight.
  • \hat p(w|q_i; \theta_R) – the similarity between embedded word w and query q, where \hat p(w|q_i; \theta_R) = \frac{\exp(\vec{w}^T\vec{q})}{\sum_{w'\in V}\exp(\vec{w}'^T\vec{q})}. It is normalized so that the distribution sums to one.

To interpret this model, we basically use the relevance model distribution as a weight that signifies the similarity between a particular query and word. This weight is computed via a traditional PRF model (see eq. 3). The neural network is constrained by this given weight and should find embedding vectors that hopefully maximize the likelihood function.
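The relevance model distribution can be sketched on a toy feedback set. This is my own illustration, not the authors’ code: the three “top K” documents are made up, p(w|d) is an additively smoothed unigram model (the smoothing and the value of mu are assumptions), and the result is normalized over the vocabulary so that it sums to one.

```python
from collections import Counter

def doc_lm(words, vocab, mu=0.1):
    """Additively smoothed unigram model p(w|d); the smoothing is my assumption."""
    tf = Counter(words)
    total = len(words) + mu * len(vocab)
    return {w: (tf[w] + mu) / total for w in vocab}

def relevance_model(query, feedback_docs):
    """p(w|R) proportional to sum_d p(w|d) * prod_{w' in q} p(w'|d), normalized over the vocabulary."""
    vocab = sorted({w for d in feedback_docs for w in d} | set(query))
    scores = dict.fromkeys(vocab, 0.0)
    for d in feedback_docs:
        lm = doc_lm(d, vocab)
        query_likelihood = 1.0
        for q in query:
            query_likelihood *= lm[q]          # how well the query matches document d
        for w in vocab:
            scores[w] += lm[w] * query_likelihood
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

feedback = [
    "indian american museum heye collection".split(),
    "american museum chumash exhibit".split(),
    "united states election news".split(),
]
p = relevance_model(["american", "museum"], feedback)
for w, pw in sorted(p.items(), key=lambda kv: -kv[1])[:5]:
    print(w, round(pw, 4))
```

Words from documents that match the whole query, such as “heye”, get more mass than words from the off-topic document, which is the weight the embedding is then trained to reproduce.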

The second model, RPE, uses a mixture model by assuming that there are two language models for the feedback sets: the relevance language model and a noise language model. The authors cast this assumption as a binary classification problem: predict whether the given word w comes from the relevance distribution of the query q. In other words, RPE wants to estimate the following distribution:

\hat p(R=1|\vec{w},\vec{q};\theta_R) = \frac{1}{1+\exp{(-\vec{w}^T\vec{q})}}

When the random variable R is one, it implies that word w is relevant to query q.

Two optimizations are necessary to train the RLM and RPE models. In the RLM model, computing the normalization term is expensive, so the authors use a hierarchical approximation of the softmax function [4] to speed up the training. Meanwhile, training RPE requires negative samples, for which the authors chose the NCE [5] method.

The experimental results show that the RLM model performs better than the RPE model on a query expansion task. Both models outperform Word2Vec and GloVe due to their ability to put a high probability on words that are relevant to the entire query. However, RPE is better than RLM on a query classification task.

 

References:

[1] Zamani, Hamed, and W. Bruce Croft. “Relevance-based Word Embedding.” SIGIR’17

[2] Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” NIPS’13.

[3] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “Glove: Global Vectors for Word Representation.” In EMNLP’14.

[4] Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In AISTATS’05.

[5] Michael U. Gutmann and Aapo Hyvarinen. 2012. Noise-contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. In J. Mach. Learn. Res. 2012.

 

Labeled-LDA (EMNLP’09)

In a classical topic model such as LDA, the learned topics can be uninterpretable. We need to eyeball the top words to interpret the semantics of a learned topic. Labeled LDA (L-LDA) is a model that ties each topic to a supervised signal. Many documents come with tags and labels, and these can be treated as target topics. If we can group similar words based on the corresponding labels, then each topic will be readable.

The trick is to constrain the topics that each document can use. In LDA, a document draws a topic distribution over all K topics from a Dirichlet distribution, that is, a point on the K-simplex. But L-LDA draws a topic distribution from an L-simplex, where L is the number of tags for the given document. This implies that each document draws a topic distribution from a simplex of a different shape. L-LDA calculates the parameters of the Dirichlet distribution based on the tags.

For example, if there are K possible tags, \bf{\alpha} = \{ \alpha_1, \alpha_2, \cdots, \alpha_K \}, a document with L = 2 tags (k=2 and k=5) will have \bf{\alpha^{(d)}} = \{\alpha_2, \alpha_5 \}. Then the topic assignment will be constrained to only topics 2 and 5.

Due to this modeling assumption, L-LDA assumes that all words that appear in the same document are definitely drawn from a small set of topics. For a single-label dataset, this means that all words in the same document are from the same topic. Is this a good idea?

It turns out that this is a valid assumption. When a document is assigned a label, it is because the combination of words in the document expresses a very specific theme or idea. Another view of this constraint is that documents that share the same labels will share the same set of likely words.

When L-LDA is applied to a multi-labeled dataset, the learned topics become more interesting because the way documents share the same sets of words becomes more complicated.
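The restriction of the Dirichlet parameter can be sketched directly. This is only my illustration of the example above, not the paper’s inference code: the value of alpha is assumed, labels use 0-based indices, and the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 6
alpha = np.full(K, 0.5)                 # symmetric prior over all K label-topics (assumed value)

def restricted_theta(alpha, doc_labels, rng):
    """Draw a topic distribution that is non-zero only on the document's labels."""
    labels = np.asarray(sorted(doc_labels))
    theta = np.zeros(len(alpha))
    theta[labels] = rng.dirichlet(alpha[labels])   # Dirichlet over the L-dimensional sub-vector
    return theta

theta_d = restricted_theta(alpha, doc_labels={1, 4}, rng=rng)   # 0-based indices for topics 2 and 5
print(np.round(theta_d, 3))
```

Every other topic gets probability exactly zero for this document, so none of its words can be assigned to a topic outside its tags.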

References:

https://www.aclweb.org/anthology/D/D09/D09-1026.pdf

Factorization Machine

I stumbled upon this paper, which focuses on predicting a response variable on a sparse dataset. In a standard regression problem, we want to find a weight vector that transforms a feature vector to a response value.

\hat y = w_0 + \bf{w}^T \bf{x}

This linear equation assumes that the samples x^{(1)}, x^{(2)}, \cdots, x^{(n)} are independent. This assumption may not be valid for collaborative filtering or ranking problems, where the observations are correlated. Furthermore, SVM cannot find a reliable hyperplane due to the data sparsity; thus, we need a model that is reliable in this setting. Factorization Machine (FM) is also a general predictor, and [1] shows that many prediction models are equivalent to FM.

FM models all interactions in the dataset. This approach increases the number of observations. If we model pairwise interactions, there are O(n^2) of them. The FM model equation is:

\hat y(\bf{x}) = w_0 + \sum_{i=1}^n w_ix_i + \sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_jx_ix_j

The first two terms form the linear regression part. The third term is the interaction term. The dot product of the feature embedding vectors \bf{v}_i, \bf{v}_j is the weight of the interaction between feature i and feature j. This formulation implies that the features are not independent.

The author shows that a 2-way FM is equivalent to an SVM with a polynomial kernel K(\bf{x},\bf{z}) = \langle\phi(\bf{x}),\phi(\bf{z})\rangle. The only difference is that the interaction weights w_{i,j} form a dense matrix in SVM, but a low-rank matrix under the FM framework. This leads to fewer parameters to estimate. With W = VV^T, where W \in R^{n \times n} and V \in R^{n \times k}, the number of parameters is O(nk) instead of O(n^2). If k << n, then O(nk) is almost linear in n.

Some people pointed out that the computation of FM is O(n^2), which is not the case. If the input is sparse, then the number of non-zero interactions in the term \sum_{i=1}^n \sum_{j=i+1}^n \bf{v}_i^T\bf{v}_jx_ix_j is actually much smaller than O(k \cdot n^2). If m(n) is the number of non-zero entries, then the computation of the interaction term is O(k \cdot m(n)). As long as m(n) << n, FM’s computation is almost linear.
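The model equation translates almost literally into code. The sketch below is a naive version with explicit loops over all pairs. The parameters are random toy values, not a trained model, and the function name is hypothetical.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """2-way FM: w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, computed with explicit loops."""
    n = len(x)
    y = w0 + float(w @ x)
    for i in range(n):
        for j in range(i + 1, n):
            y += float(V[i] @ V[j]) * x[i] * x[j]
    return y

rng = np.random.default_rng(0)
n, k = 5, 2                                # 5 features, embedding size 2 (toy values)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
x = np.array([1.0, 0.0, 0.0, 1.0, 0.5])    # a sparse-ish feature vector
print(fm_predict_naive(x, w0, w, V))
```

Setting V to zero removes the third term and leaves plain linear regression.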
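The linear-time claim rests on a standard rewriting of the pairwise term: \sum_{i<j} \bf{v}_i^T\bf{v}_j x_i x_j = \frac{1}{2}\sum_{f=1}^k [(\sum_i v_{i,f}x_i)^2 - \sum_i v_{i,f}^2 x_i^2]. Here is a minimal sketch that checks the identity numerically on random toy data and only touches the non-zero entries of x. The function names are hypothetical.

```python
import numpy as np

def interaction_naive(x, V):
    """Pairwise term computed over all O(n^2) pairs."""
    n = len(x)
    return sum(float(V[i] @ V[j]) * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))

def interaction_fast(x, V):
    """0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2], touching only non-zero x_i."""
    nz = np.flatnonzero(x)                 # indices of the m non-zero features
    xv = V[nz] * x[nz, None]               # shape (m, k)
    return 0.5 * float(np.sum(xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0)))

rng = np.random.default_rng(1)
n, k = 200, 8
V = rng.normal(size=(n, k))
x = np.zeros(n)
x[rng.choice(n, size=10, replace=False)] = rng.normal(size=10)   # only 10 non-zero entries
print(interaction_naive(x, V), interaction_fast(x, V))
```

The fast version does O(k \cdot m) work for m non-zero entries, which matches the cost stated above.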

Nevertheless, the input feature for FM still needs some feature engineering. The choice of features is important and feature normalization is necessary.

References:

Rendle, Steffen. “Factorization machines.” Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010.

 

Inference network: RNNs vs NNets

The standard variational autoencoder [1] uses neural networks to approximate the true posterior distribution by mapping an input to the mean and variance of a Gaussian distribution. A simple modification is to replace the neural nets in the inference network with an RNN. That is exactly what this paper presents [2].

Intuitively, an RNN will work on datasets where consecutive features are highly correlated. It means that for a public dataset such as MNIST, an RNN should have no problem approximating the posterior distribution of any MNIST digit.
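To make the inference network concrete, here is a minimal NumPy sketch of a feed-forward encoder with the reparameterisation trick. It is my own illustration, not the code behind the plots: the weights are random and untrained, the single tanh hidden layer is an assumption, and the sizes are just placeholders for flattened MNIST images.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, latent_dim = 784, 500, 2   # placeholder sizes for flattened MNIST

# Randomly initialised weights; a real model would learn these by maximising the ELBO.
W_h = rng.normal(scale=0.01, size=(input_dim, hidden_dim))
W_mu = rng.normal(scale=0.01, size=(hidden_dim, latent_dim))
W_logvar = rng.normal(scale=0.01, size=(hidden_dim, latent_dim))

def encode(x):
    """Map inputs to the mean and log-variance of a diagonal Gaussian q(z|x)."""
    h = np.tanh(x @ W_h)
    return h @ W_mu, h @ W_logvar

def sample_z(mu, logvar, rng):
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL(q(z|x) || N(0, I)) per example, the regulariser in the ELBO."""
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)

x = rng.random((4, input_dim))         # stand-in for 4 flattened MNIST images
mu, logvar = encode(x)
z = sample_z(mu, logvar, rng)
print(z.shape, np.round(kl_to_standard_normal(mu, logvar), 4))
```

Swapping the inference network for an RNN would only change encode: the image would be read as a sequence and the final hidden state mapped to the mean and log-variance.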

I started with a classical VAE. First, I trained VAE on MNIST dataset, with the hidden units of 500 for both encoders and decoders. I set the latent dimension to 2 so that I can quickly visualize on 2D plot.

 

[Figure: 2D embedding using neural nets (2 layers) as the inference network]

Some digits are clustered together but some are mixed together because VAE does not know the labels of the digits. Thus, it will still put similar-looking digits nearby, e.g. digit 7’s are right next to digit 9’s. Many digit 3’s and 2’s are mixed together. To get a better separation between the digit classes, the label information should be utilized. In fact, our recent publication at SIGIR’2017 utilizes the label information in order to cluster similar documents together.

But let us come back to our original research question. Is an RNN really going to improve the quality of the embedding vectors?

[Figure: 2D embedding using LSTM as the inference network]

 

The above 2D plot shows that using LSTM as the inference network gives a slightly different embedding space.

[Figure: 2D embedding vectors of randomly chosen MNIST digits using GRU as the inference network]

LSTM and GRU also generate slightly different embedding vectors. The recurrent models tend to spread out each digit class. For example, digit 6’s (orange) are spread out. All models mix digits 4 and 9 together. We should note that mixing digits together might not be a bad thing because some handwritten 4’s look very similar to 9’s. This probably indicates that the recurrent models can capture more subtle similarities between digits.

Now, we will see whether the RNN models generate better-looking digits than a standard model.

[Figures: digits generated with GRU, LSTM, and neural nets as the inference network]

It is difficult to tell which model is better. In terms of training time, neural nets are the fastest and LSTM is the slowest. It could be that we have not utilized the strength of RNNs yet. Since we are working on the MNIST dataset, it might be easy for a traditional model (neural nets) to perform well. What if we train the model on a text dataset such as 20 Newsgroups? Intuitively, an RNN should be able to capture the sequential information. We might get a better embedding space, maybe? Next time we will investigate further on a text dataset.

References:

[1] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[2] Fabius, Otto, and Joost R. van Amersfoort. “Variational recurrent auto-encoders.” arXiv preprint arXiv:1412.6581 (2014).