Importance Weighted Autoencoders (ICLR’16)

In a classical VAE, the ELBO is

\log P(\textbf{x}) \ge E_{Q(\textbf{z}|\textbf{x})}[\log \frac{P(\textbf{x},\textbf{z})}{Q(\textbf{z}|\textbf{x})}] = L(\textbf{x})

An unbiased estimator of the gradient of L(\textbf{x}) is:

\nabla_{\theta} L(\textbf{x})= \frac{1}{k}\sum_{i=1}^k \nabla_{\theta}\log w(\textbf{x},\textbf{z}_i; \theta)

where w(\textbf{x},\textbf{z}_i; \theta) = \frac{P(\textbf{x},\textbf{z}_i)}{Q(\textbf{z}_i|\textbf{x})}.

The importance weighted autoencoder uses a slightly different lower bound:

\log P(\textbf{x}) \ge E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\log \frac{1}{k}\sum_{i=1}^k \frac{P(\textbf{x},\textbf{z}_i)}{Q(\textbf{z}_i|\textbf{x})}]

An unbiased estimator of the gradient of this bound is derived as follows:

\nabla_{\theta} L_k(\textbf{x}) = \nabla_{\theta} E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\log\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\nabla_{\theta} \log\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}{\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\sum_{i=1}^k \nabla_{\theta} w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}]

where C(\textbf{x};\theta) = \sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta).

Then, using the identity \nabla_{\theta} w(\textbf{x}, \textbf{z}_i; \theta) = w(\textbf{x}, \textbf{z}_i; \theta)\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta):

= E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\sum_{i=1}^k \frac{w(\textbf{x}, \textbf{z}_i; \theta)}{\sum_{j=1}^k w(\textbf{x}, \textbf{z}_j; \theta)}\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\sum_{i=1}^k \tilde w_i\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)]

where \tilde w_i = \frac{w(\textbf{x}, \textbf{z}_i; \theta)}{\sum_{j=1}^k w(\textbf{x}, \textbf{z}_j; \theta)} are the normalized importance weights.

The Monte Carlo estimate is therefore:

\sum_{i=1}^k \tilde w_i\nabla_{\theta} \log w(\textbf{x}, \textbf{z}_i; \theta)

The Importance Weighted Autoencoder (IWAE) has the same network architecture as a VAE; when k = 1, the estimator above reduces to the standard VAE gradient estimate.

At first glance it is unclear why IWAE should be better than a VAE, since we can also draw multiple samples from a VAE to approximate its log-likelihood. The key difference is the weighting: each sample's gradient is scaled by its normalized importance weight \tilde w_i instead of being averaged uniformly, which is why the samples are called importance samples, and the resulting bound L_k becomes tighter as k grows.
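To make the estimator concrete, here is a minimal PyTorch-style sketch of the k-sample bound; `encoder`, `decoder`, and `prior` are assumed interfaces (returning `torch.distributions` objects), not code from the paper:

```python
import math
import torch

def iwae_loss(x, encoder, decoder, prior, k):
    """Monte Carlo estimate of the negative k-sample bound L_k (a sketch).

    encoder(x) -> torch.distributions.Normal, q(z|x)
    decoder(z) -> a torch distribution over x, p(x|z)
    prior      -> torch.distributions.Normal, p(z)
    """
    q = encoder(x)                               # q(z|x)
    z = q.rsample((k,))                          # (k, batch, z_dim), reparameterized
    log_w = (decoder(z).log_prob(x).sum(-1)      # log P(x | z_i)
             + prior.log_prob(z).sum(-1)         # + log P(z_i)
             - q.log_prob(z).sum(-1))            # - log Q(z_i | x)
    # log (1/k) sum_i w_i, computed stably via logsumexp
    bound = torch.logsumexp(log_w, dim=0) - math.log(k)
    return -bound.mean()
```

Backpropagating through the logsumexp automatically reproduces the weighted gradient above, since the softmax of the log-weights is exactly \tilde w_i.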

References:

[1] Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. “Importance weighted autoencoders.” arXiv preprint arXiv:1509.00519 (2015).

Autoencoding beyond pixels using a learned similarity metric (ICML’16)

One of the key components of an autoencoder is the reconstruction error. This term measures how much useful information is compressed into the learned latent vector. The common reconstruction error is based on an element-wise measurement, such as a binary cross-entropy for a black-and-white image or a squared error between the reconstructed image and the input image.

The authors argue that an element-wise measurement is not an accurate indicator of the goodness of a learned latent vector. Hence, they propose to learn a similarity metric via adversarial training. Here is how they set up the objective functions:

They use a VAE to learn a latent vector for the input image. A vanilla VAE has two loss terms: a KL loss and a negative log-likelihood of the data. They replace the second term with a new reconstruction loss, described in the next paragraph. They then add a discriminator that tries to distinguish real input data from data generated by the VAE's decoder. The discriminator encourages the VAE to learn a stronger encoder and decoder.

The discriminator can be decomposed into two parts. If it has L + 1 layers, the first L layers form a transform function that maps the input data into a new representation, and the last layer is a binary classifier. This means that if we pass any input through the first L layers, we get a representation that is easily classified by the last layer. When the discriminator is very good at detecting real inputs, the representation at layer L is much easier to classify than the raw input data. It follows that the squared error between a transformed input and its transformed reconstruction should be small when the two inputs are similar.
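A minimal sketch of this feature-space reconstruction term, following the description above; `disc_features` is assumed to be the first L layers of the discriminator:

```python
import torch.nn.functional as F

def feature_recon_loss(x, x_recon, disc_features):
    # Reconstruction error measured in the discriminator's feature space.
    # The target features are detached so this term only updates the
    # encoder/decoder, not the discriminator (one possible design choice).
    return F.mse_loss(disc_features(x_recon), disc_features(x).detach())
```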

This model is trained in the same fashion as a GAN: the VAE and the discriminator are trained simultaneously. The idea works well for images because pixel-wise squared error is not a good measure of image quality. It may work on text datasets as well, because we judge the quality of a reconstructed text as a whole rather than evaluating one word at a time.

 

References:

[1] Larsen, Anders Boesen Lindbo, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. “Autoencoding beyond pixels using a learned similarity metric.” arXiv preprint arXiv:1512.09300 (2015). https://arxiv.org/abs/1512.09300

Adversarial Variational Bayes

A variational autoencoder (VAE) requires an expressive inference network in order to learn a complex posterior distribution. A more expressive inference network generally results in higher-quality generated data.

This work utilizes adversarial training to learn a function T(x,z) that approximates \log q_{\phi}(z|x) - \log p(z). The expectation of this term w.r.t. q_{\phi}(z|x) is exactly the KL divergence KL(q_{\phi}(z|x) \| p(z)). Since the authors prove that the optimal T^*(x, z) = \log q_{\phi}(z|x) - \log p(z), the ELBO becomes:

E_{p_{D(x)}}E_{q_{\phi}(z|x)}[-T^*(x,z) + \log p_{\theta}(x|z)]

In order to approximate T^*(x, z), the discriminator needs to learn to distinguish a pair sampled from the prior pairing p_{D(x)}p(z) from a pair sampled from the current inference model p_{D(x)}q_{\phi}(z|x). Thus, the objective function for the discriminator is set up as:

\max_T E_{p_{D(x)}}E_{q_{\phi}(z|x)} \log \sigma(T(x,z)) + E_{p_{D(x)}}E_{p(z)} \log(1 - \sigma(T(x,z))) (1)
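A sketch of this discriminator objective in PyTorch; `sample_z` (the black-box encoder z_{\phi}(x, \epsilon)) and `T` are assumed module names, not the paper's code:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(x, T, sample_z, prior):
    """Sketch of objective (1): T(x, z) should score pairs with z ~ q(z|x)
    high and pairs with z ~ p(z) low."""
    z_q = sample_z(x).detach()              # z ~ q_phi(z|x); no encoder gradient here
    z_p = prior.sample((x.size(0),))        # z ~ p(z)
    logit_q = T(x, z_q)
    logit_p = T(x, z_p)
    # Minimizing this BCE is equivalent to maximizing
    # E_q[log sigma(T)] + E_p[log(1 - sigma(T))].
    return (F.binary_cross_entropy_with_logits(logit_q, torch.ones_like(logit_q))
            + F.binary_cross_entropy_with_logits(logit_p, torch.zeros_like(logit_p)))
```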

Taking the gradient of T(x,z) with respect to the parameters \phi could be problematic, because the optimal T^* itself depends on q_{\phi}(z|x). However, the authors show that the expectation of the gradient of T^*(x, z) w.r.t. \phi is zero, so this implicit dependence contributes no gradient and requires no parameter update.

Since T(x,z) requires a sample z, the reparameterization trick is applied and the ELBO becomes:

E_{p_{D(x)}}E_{\epsilon}[-T^*(x, z_{\phi}(x, \epsilon)) + \log p_{\theta}(x|z_{\phi}(x, \epsilon))] (2)

This step is crucial: sampling is now just a transformation of noise, and T^*(x, z) approximates the KL-divergence term. This makes the inference model a black box, because we never explicitly define the distribution q_{\phi}(z|x).

The model optimizes equations (1) and (2) with adversarial training: eq. (1) is optimized for several steps in order to keep T(x, z) close to optimal, while eq. (2) is optimized jointly.
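A sketch of one training iteration under this scheme, reusing the `discriminator_loss` sketched above; the number of discriminator steps and all module names are assumptions:

```python
def avb_step(x, T, sample_z, decoder, prior, opt_T, opt_vae, t_steps=5):
    """One alternating update (a sketch; t_steps=5 is an arbitrary choice)."""
    # Several discriminator steps on eq. (1) keep T close to T*.
    for _ in range(t_steps):
        opt_T.zero_grad()
        discriminator_loss(x, T, sample_z, prior).backward()
        opt_T.step()

    # One encoder/decoder step on the surrogate ELBO, eq. (2).
    opt_vae.zero_grad()
    z = sample_z(x)                                         # z_phi(x, eps), reparameterized
    elbo = (-T(x, z) + decoder(z).log_prob(x).sum(-1)).mean()
    (-elbo).backward()                                      # maximize eq. (2)
    opt_vae.step()                                          # opt_vae holds only encoder/decoder params
```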

An adaptive contrast technique is used to keep T(x, z) sufficiently close to optimal. Basically, the KL term in the ELBO is replaced by KL(q_{\phi}(z|x), r_{\alpha}(z|x)), where r_{\alpha}(z|x) is an auxiliary distribution, which could be a Gaussian distribution.

This model has connections to variational autoencoders, adversarial autoencoders, f-GANs, and BiGANs. Training a VAE via adversarial training allows us to use a flexible inference network that can approximate the true distribution over the latent vectors.

References:

[1] Mescheder, Lars, Sebastian Nowozin, and Andreas Geiger. “Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks.” arXiv preprint arXiv:1701.04722 (2017). https://arxiv.org/abs/1701.04722

Inference network: RNNs vs NNets

The standard variational autoencoder [1] uses a feed-forward neural network to approximate the true posterior distribution, mapping an input to the mean and variance of a Gaussian distribution. A simple modification is to replace this feed-forward inference network with an RNN, which is exactly what this paper presents [2].

Intuitively, an RNN should work well on datasets where consecutive features are highly correlated. For a dataset such as MNIST, an RNN should therefore have no problem approximating the posterior distribution of any digit.
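As a rough sketch of what such an inference network can look like, the encoder below reads a 28x28 digit row by row with an LSTM and outputs the Gaussian posterior parameters (the row-by-row ordering and all names are my choices, not necessarily those of [2]):

```python
import torch
import torch.nn as nn

class RNNEncoder(nn.Module):
    """LSTM inference network: reads a digit as a sequence of 28 rows and
    maps the last hidden state to the mean and log-variance of q(z|x)."""
    def __init__(self, latent_dim=2, hidden_dim=500):
        super().__init__()
        self.rnn = nn.LSTM(input_size=28, hidden_size=hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):                      # x: (batch, 28, 28)
        _, (h, _) = self.rnn(x)                # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)
```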

I started with a classical VAE trained on the MNIST dataset, with 500 hidden units for both the encoder and the decoder. I set the latent dimension to 2 so that I could quickly visualize the embeddings on a 2D plot.

 

[Figure MNIST_2D_VAE: 2D embedding using a 2-layer neural network as the inference network]

Some digits are clustered together while others are mixed, because the VAE does not know the digit labels. It still places similar digits nearby, e.g. the 7’s are right next to the 9’s, and many 3’s and 2’s are mixed together. To obtain a better separation between the digit classes, label information should be utilized; in fact, our recent SIGIR 2017 publication uses label information to cluster similar documents together.

But back to our original research question: does an RNN really improve the quality of the embedding vectors?

[Figure MNIST_2D_RNN: 2D embedding using an LSTM as the inference network]

 

The above 2D plot shows that using an LSTM as the inference network yields a slightly different embedding space.

[Figure MNIST_2D_GRU: 2D embedding of randomly chosen MNIST digits using a GRU as the inference network]

LSTM and GRU also produce slightly different embedding vectors. The recurrent models tend to spread out each digit class; for example, the 6’s (orange) are spread out. All models mix 4’s and 9’s together. Note that mixing these digits is not necessarily a bad thing, because some handwritten 4’s look very similar to 9’s; this probably indicates that the recurrent models can capture more subtle similarities between digits.

Now we will see whether the RNN models generate better-looking digits than the standard model.

[Figure GRU_gen_ditis: digits generated with the GRU model]

[Figure LSTM_gen_digits: digits generated with the LSTM model]

[Figure VAE_gen_digits: digits generated with the feed-forward (neural nets) model]

It is difficult to tell which model is better. In terms of training time, the feed-forward network is the fastest and the LSTM is the slowest. It could be that we have not utilized the strength of RNNs yet: since we are working with MNIST, it might simply be easy for a traditional feed-forward model to perform well. What if we train the models on a text dataset such as 20 Newsgroups? Intuitively, an RNN should be able to capture the sequential information, so we might get a better embedding space. Next time we will investigate further on text data.

References:

[1] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[2] Fabius, Otto, and Joost R. van Amersfoort. “Variational recurrent auto-encoders.” arXiv preprint arXiv:1412.6581 (2014).

A Recurrent Latent Variable Model for Sequential Data (NIPS’15)

This paper presents a sequential model that incorporates uncertainty to better capture the variability that arises from the data itself.

The motivation comes from the fact that data, especially speech signals, have high variability that does not come from noise alone. The complex relationship between the observed data and the underlying factors of variability cannot be modeled by a basic RNN; for example, the vocal quality of the speaker affects the audio waveform even when the speaker says the same word.

In a classical RNN, the state transition h_t = f_{\theta}(x_t, h_{t-1}) is a deterministic function, and typically f_{\theta} is an LSTM or GRU. The RNN models the joint probability of the entire sequence as p(x_1, x_2, \cdots, x_T) = \prod_{t=1}^T p(x_t | x_{<t}) = \prod_{t=1}^T g_{\tau}(h_{t-1}), where g_{\tau} is an output function that maps the hidden state to a probability distribution over the output. The choice of g_{\tau} depends on the problem. Typically, g has two parts: (1) a parameter generator, \phi_t = \varphi_{\tau}(h_{t-1}), and (2) a density function, P_{\phi_t}(x_t | x_{<t}). We can also make g a GMM, in which case \phi_t contains the mixture coefficients and component parameters.
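For concreteness, a minimal g_{\tau} with a diagonal Gaussian density could look like the following sketch (my own illustration, not code from the paper):

```python
import torch
import torch.nn as nn

class GaussianOutput(nn.Module):
    """A minimal g_tau: maps the previous hidden state to the parameters of
    p(x_t | x_<t), here a diagonal Gaussian (one possible choice of density)."""
    def __init__(self, hidden_dim, x_dim):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, x_dim)
        self.to_logvar = nn.Linear(hidden_dim, x_dim)

    def forward(self, h_prev):
        mu, logvar = self.to_mu(h_prev), self.to_logvar(h_prev)      # phi_t = varphi_tau(h_{t-1})
        return torch.distributions.Normal(mu, (0.5 * logvar).exp())  # P_{phi_t}(x_t | x_<t)
```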

The only source of variability in an RNN is the output function g. This can be problematic for speech, because the RNN must map the many variants of the input waveform into a potentially large variation of the hidden state h_t. This limitation motivates the authors to introduce uncertainty into the RNN.

In order to turn the RNN into a stochastic model, the authors assume that each data point x_t has a latent variable z_t, drawn from a Gaussian prior whose parameters depend on the previous hidden state. The generative process is as follows (a code sketch is given after the list):

  • For each step t = 1 to T:
    • Compute the prior parameters: [\mu_{0,t}, \text{diag}(\sigma_{0,t})] = \phi_{\tau}^{\text{prior}}(h_{t-1})
    • Draw a latent variable: z_t \sim N(\mu_{0,t}, \text{diag}(\sigma_{0,t}^2))
    • Compute the likelihood parameters: [\mu_{x,t}, \sigma_{x,t}] = \phi_{\tau}^{\text{dec}}(\phi_{\tau}^z(z_t), h_{t-1})
    • Draw an observation: x_t | z_t \sim N(\mu_{x,t}, \text{diag}(\sigma_{x,t}^2))
    • Update the hidden state: h_t = f_{\theta}(\phi_{\tau}^x(x_t), \phi_{\tau}^z(z_t), h_{t-1})
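A sketch of one generative step, with each \phi_{\tau}^{*} represented as an assumed module returning a mean and log-variance where applicable:

```python
import torch

def vrnn_generate_step(h_prev, phi_prior, phi_z, phi_dec, phi_x, rnn_cell):
    """One step of the generative process above (a sketch; module names are
    assumptions, and rnn_cell is e.g. an nn.GRUCell)."""
    mu_0, logvar_0 = phi_prior(h_prev)                                   # prior parameters from h_{t-1}
    z_t = mu_0 + torch.randn_like(mu_0) * (0.5 * logvar_0).exp()         # z_t ~ N(mu_0, diag(sigma_0^2))
    mu_x, logvar_x = phi_dec(phi_z(z_t), h_prev)                         # likelihood parameters
    x_t = mu_x + torch.randn_like(mu_x) * (0.5 * logvar_x).exp()         # x_t | z_t ~ N(mu_x, diag(sigma_x^2))
    h_t = rnn_cell(torch.cat([phi_x(x_t), phi_z(z_t)], dim=-1), h_prev)  # h_t = f_theta(...)
    return x_t, z_t, h_t
```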

The state transition function is now stochastic because z_t is a random variable. Moreover, the hidden state h_t is a function of x_{<t} and z_{<t}, so we can replace the conditioning on h_t with conditioning on these histories:

  • z_t \sim p(z_t | x_{<t}, z_{<t})
  • x_t|z_t \sim p(x_t | z_{\le t}, x_{<t})

Thus, the joint distribution becomes:

p(x_{\le T}, z_{\le T}) = \prod_{t=1}^T p(x_t|z_{\le t}, x_{<t})p(z_t|x_{<t},z_{<t})

The objective is to maximize the log-likelihood of the input sequence, \log p(x_{\le T}), where p(x_{\le T}) = \int p(x_{\le T}, z_{\le T}) dz_{\le T}. By assuming the approximate posterior distribution factorizes as q(z_{\le T} | x_{\le T}) = \prod_{t=1}^T q(z_t | x_{\le t}, z_{<t}), the ELBO is:

E_{q(z_{\le T}|x_{\le T})}\big[ \sum_{t=1}^T \log p(x_t|z_{\le t},x_{<t}) - KL(q(z_t|x_{\le t},z_{<t}) || p(z_t|x_{<t},z_{<t})) \big]

The ELBO can be optimized efficiently within the variational autoencoder framework; in fact, this model is a sequential version of the classical variational autoencoder.
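A per-timestep sketch of this objective with Gaussian prior, posterior, and decoder; `phi_enc` is the assumed inference network for q(z_t | x_{\le t}, z_{<t}), conditioned here on x_t and h_{t-1}:

```python
import math
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) )
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                  - 1.0).sum(-1)

def vrnn_step_elbo(x_t, h_prev, phi_prior, phi_enc, phi_z, phi_dec):
    """One term of the ELBO above: reconstruction minus KL (a sketch)."""
    mu_0, logvar_0 = phi_prior(h_prev)                              # prior p(z_t | x_<t, z_<t)
    mu_q, logvar_q = phi_enc(x_t, h_prev)                           # posterior q(z_t | x_<=t, z_<t)
    z_t = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()    # reparameterized sample
    mu_x, logvar_x = phi_dec(phi_z(z_t), h_prev)                    # decoder parameters
    log_px = -0.5 * (((x_t - mu_x) ** 2) / logvar_x.exp()
                     + logvar_x + math.log(2 * math.pi)).sum(-1)    # log p(x_t | z_<=t, x_<t)
    return log_px - gaussian_kl(mu_q, logvar_q, mu_0, logvar_0)
```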

References:

[1] Chung, Junyoung, et al. “A recurrent latent variable model for sequential data.” Advances in Neural Information Processing Systems. 2015.

TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency (ICLR’17)

This paper presents an RNN-based language model designed to capture long-range semantic dependencies. The proposed model is simple and elegant, and yields sensible topics.

The key insight of this work is the difference between semantics and syntax. Semantics relates to the overall structure and information of the given context: if we are given a document, its semantics is its theme or topic. Semantics captures the global meaning of the context, and we need to see enough words to understand it.

In contrast, syntax deals with local information: the likelihood of the current word depends heavily on the preceding words. This local information depends on word ordering, whereas the global information does not.

The paper points out weaknesses of probabilistic topic models such as LDA: they ignore word ordering and perform poorly at word prediction, and extending them with bigrams or trigrams makes the higher-order models intractable. Furthermore, LDA does not model stop words well because it is based on word co-occurrence. Stop words appear everywhere, since they carry little semantic information and act as fillers that make the language more readable. Thus, stop words are usually discarded during preprocessing when training LDA.

Language models attempt to capture this sequential information by modeling the joint distribution of words as P(y_1, y_2, \cdots, y_T) = P(y_1) \prod_{t=2}^T P(y_t | y_{1:t-1}). In traditional n-gram models, a Markov assumption is necessary to keep inference tractable; the shortcoming is the limited context window, and a higher-order Markov assumption makes inference more difficult.

The RNN-based neural language model avoids the Markov assumption by modeling the conditional probability P(y_t | y_{1:t-1}) = P(y_t | h_t), where h_t = f(h_{t-1}, x_t). Basically, h_t summarizes the preceding words, and the model uses this information to predict the current word. RNN-based language models work well, but they have difficulty with long-range dependencies due to optimization difficulties and overfitting.
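A minimal sketch of one step of such an RNN language model (a GRU cell followed by a linear layer; sizes and names are arbitrary choices of mine):

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """p(y_t | y_{1:t-1}) = softmax(W h_t), with h_t = f(h_{t-1}, x_t). A sketch."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, y_prev, h_prev):
        x_t = self.embed(y_prev)          # embed the previous word
        h_t = self.cell(x_t, h_prev)      # h_t summarizes y_{1:t-1}
        logits = self.out(h_t)            # unnormalized log p(y_t | h_t)
        return logits, h_t
```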

Combining the advantages of topic modeling and RNN-based language modeling is the contribution of this paper. The topic vector is used as a bias on the learned word conditional probability; the authors chose to add it as a bias because they do not want it mixed into the hidden state of the RNN, which includes stop words.

The model has a binary switch variable: when it encounters a stop word, the switch turns off and disables the topic vector; otherwise the switch is on. The word probability is defined as follows:

p(y_t = i | h_t, \theta, l_t, B) \propto \exp ( v_i^T h_t + (1 - l_t)b_i^T \theta)

The switch variable l_t turns the topic vector \theta on and off.
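A sketch of this biased softmax; `V` and `B` are the output word embeddings and the topic-to-word matrix, and the shapes below are my assumptions:

```python
import torch

def word_logits(h_t, theta, l_t, V, B):
    """Unnormalized log-probabilities over the vocabulary (a sketch).

    h_t:   RNN hidden state, shape (hidden_dim,)
    theta: topic vector,     shape (num_topics,)
    l_t:   1.0 if y_t is a stop word (topic bias switched off), else 0.0
    V:     output embedding matrix, shape (vocab, hidden_dim)
    B:     topic-to-word matrix,    shape (vocab, num_topics)
    """
    return V @ h_t + (1.0 - l_t) * (B @ theta)   # logit_i = v_i^T h_t + (1 - l_t) b_i^T theta

# p(y_t = i | h_t, theta, l_t) = torch.softmax(word_logits(...), dim=-1)[i]
```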

This model is an end-to-end network, meaning that it jointly learns the topic vector and the local state from the RNN. The topic vector is coupled with the RNN’s state, so the local dynamics of the word sequence influence the topic vector and vice versa.

The RNN can be replaced with a GRU or LSTM; the paper shows that using a GRU yields the best perplexity on the Penn Treebank (PTB) dataset. The learned representation can also be used as a feature for many tasks, including sentiment analysis, where we want to classify positive and negative reviews on the IMDB dataset.

I find this model simple, and it elegantly combines variational inference with an RNN. The motivation is clear, and we can see why using the contextual information learned by the variational component improves the quality of the representation.

References:

[1] Dieng, Adji B., Chong Wang, Jianfeng Gao, and John Paisley. “TopicRNN: A recurrent neural network with long-range semantic dependency.” arXiv preprint arXiv:1611.01702 (2016). ICLR 2017 (poster). https://arxiv.org/abs/1611.01702