VAE: Adding More Stochastic Layers gives a tighter lower-bound.

Does adding more stochastic layers to the recognition model (encoder function) give a tighter lower-bound? Daniel Jiwoong (Bengio’s student)’s AAAI paper, “Denoising Criterion for Variational Auto-Encoding Framework. (AAAI 2017)”, claims that this is true.

It has been known that multi-modal recognition models can learn a more complex posterior distribution from the input data (see [2], [3]). Intuitively, adding more stochastic layers increases the complexity of the recognition models.

Lemma 0:

This proof shows that the following inequality is true:

E_{f(x)}[\log f(x)] \ge E_{f(x)}[\log g(x)]

By using the KL divergence property, which is defined as:

D_{KL}(f || g) = \int_x f(x) \log \frac{f(x)}{g(x)} dx = E_{f(x)}[ \log \frac{f(x)}{g(x)} ] \ge 0

Since KL divergence is always non-negative (you can also prove that too.), arranging the inequality will result in the following expression:

E_{f(x)}[ \log f(x)] - E_{f(x)}[ \log g(x)] \ge 0


E_{f(x)}[ \log f(x)] \ge E_{f(x)}[ \log g(x)]

This statement says that the cross entropy for f and g is always at least the entropy of $f$. This makes sense because when we set distribution $g$ to be the same as $f$, we will get the lowest cross entropy.

Lemma 1:

A feedforward network with multiple stochastic layers can be defined as a marginal distribution of multiple latent variables:

q(\textbf{z}|\textbf{x}) = \int_{\textbf{h}}q(\textbf{z}|\textbf{h})q(\textbf{h}|\textbf{x})d\textbf{h} = E_{q(\textbf{h}|\textbf{x})}[q(\textbf{z}|\textbf{h})]


\log p(\textbf{x}) \ge E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \ge E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}]

We will prove that left and right inequality are satisfied.

Right inequality

We start with the right inequality by simplifying the expression:

E_{q(\textbf{z}|\textbf{x})}[\log p(\textbf{x},\textbf{z})]- E_{q(\textbf{z}|\textbf{x})}[q(\textbf{z}|\textbf{h})] \ge E_{q(\textbf{z}|\textbf{x})}[\log p(\textbf{x},\textbf{z})]-E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{x})]

From Lemma 0: if we replace f(x) with q(\textbf{z}|\textbf{x}) and g(x) with q(\textbf{z}|\textbf{h}), we will end up with the following inequality:

E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{x})] \ge E_{q(\textbf{z}|\textbf{x})}[\log q(\textbf{z}|\textbf{h})]

This shows that the right inequality is satisfied.

Left inequality

We expand the encoder function q(\textbf{z}|\textbf{x}) as:

E_{q(\textbf{z}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(z|\textbf{x})}] = \int_\textbf{z} q(\textbf{z}|\textbf{x})\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}d\textbf{z} \\ = \int_\textbf{z} \int_h q(\textbf{z}|\textbf{h})q(\textbf{h}|\textbf{z})\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}d\textbf{z} d\textbf{h} \\ = E_{q(\textbf{z}|\textbf{h})}E_{q(\textbf{h}|\textbf{x})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{x})}]

According to the Jensen’s inequality:

E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\log \frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \le \log E_{q(\textbf{h}|\textbf{x})}E_{q(\textbf{z}|\textbf{h})}[\frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}] \\ = \log E_{q(\textbf{h}|\textbf{x})}\int_{\textbf{z}} q(\textbf{z}|\textbf{h})[\frac{p(\textbf{x},\textbf{z})}{q(\textbf{z}|\textbf{h})}]d\textbf{z} \\ = \log E_{q(\textbf{h}|\textbf{x})}[p(\textbf{x})] = \log p(\textbf{x})

The left inequality is also satisfied.


I went over the proof presented in the paper, “Denoising Criterion for Variational Auto-Encoding Framework”. The simple proof on adding one extra stochastic layer shows that we get a tighter lowerbound. The original paper also generalizes its claim to L stochastic layers. By following the same proof strategy, they show that the lowerbound will be tighter as we add more stochastic layers.


[1] Im, Daniel Jiwoong, et al. “Denoising Criterion for Variational Auto-Encoding Framework.” AAAI. 2017.

[2] Kingma, Diederik P., et al. “Improved variational inference with inverse autoregressive flow.” Advances in Neural Information Processing Systems. 2016.

[3] Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using Real NVP.” arXiv preprint arXiv:1605.08803 (2016).



Generating images from captions with attention (ICLR’16)


This work extends the original DRAW paper [2] to generate images given captions. We can treat this model as conditional DRAW. That is we model a conditional probability of P(\text{image}|\text{caption}) The additional textual data controls where to read and write the image.


Generating images from text descriptions is a structure prediction task. That is given a sequence of words, we want to generate an image. Although AlignDRAW has borrowed the same approach as DRAW by combining progressive refinement with attention, incorporating text sequence is their contribution.

The latent variable in DRAW model is sampled from spherical Gaussians, z_t \sim \mathcal{N}(Z_t|\mu_t, \sigma_t) where its mean and variance are functions of the current hidden state of the encoder, e.g. \mu_t = W(h_t^{\text{enc}}). However, AlignDRAW adds dependency between latent variables: z_t \sim \mathcal{N}(\mu(h_{t-1})^{\text{gen}}, \sigma(h_{t-1}^{\text{gen}}).

During the image generation, DRAW iteratively samples a latent variable z_t from a prior \mathcal{N}(0, I), but AlignDRAW will draw z_t from P(Z_t|Z_{<t}). It means that there is a dependency between each latent vector in AlignDRAW model.


Align Operator

The input caption is fed to the BI-Directional RNN. Each output from each time-step needs to be aligned with the current drawing patch. Attention weight is then learned from caption representation up to k words and the current hidden state of the decoder h_{t-1}^{\text{gen}}. Finally, compute the weight average of all hidden state of the language model to obtain the caption context, s_t. This context together with a latent vector z_t will be fed to the LSTM decoder.

Objective Function

This model maximizes the expectation of the variational lowerbound. There are 2 terms: the data likelihood and KL loss.

Closing Thoughts

AlignDRAW uses bi-directional LSTM with attention to aligning each word context with the patches in the image. Some generated images from caption are interesting such as ‘A herd of elephants walking across a dry grass field’. The model generalizes the training data and able to generate novel images.


[1] Mansimov, Elman, et al. “Generating images from captions with attention.” arXiv preprint arXiv:1511.02793 (2015).

[2] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).

DRAW: A Recurrent Neural Network For Image Generation (ICML’15)

This paper proposes a new method for image generation by progressively improve the reconstructed image.

The previous image generation models generate the entire image by learning a sampling function (GANs), distribution over a latent vector (VAE), or generate one pixel at a time (PixelRNN, PixelCNN). Although the generated images from these models are in a good quality, these models are forced to learn a complicated and high-dimensional distribution. For example, to generate a car image, the models need to approximate the distribution of all possible cars. This is a difficult task.


(Note: I took this Figure from the original paper)

Incremental Update

Progressive refinement breaks down the complex distribution into a chain of conditional distribution:

P(X, C_T) = P(X|C_T)P(C_T) = P(X|C_T)P(C_T|C_{T-1}) \cdots P(C_1|C_0)

Therefore, estimating a conditional distribution is much easier. The conditional probability is modeled by the standard LSTM.

Latent Variable

Use VAE framework helps us project the input image which has a high dimension into a low-dimensional space. Working on the smaller latent space is much easier than the original image space.

Attention Mechanism

The progressive refinement through LSTM has simplified the complex distribution through time, then the attention mechanism simplifies the spatial data into a smaller patch. The encoder and decoder now only needs to deal with a small fraction of the image instead of the image as a whole. This idea again reduces the input space by focusing on the important part of the image only.

Read and Write Operations

This part can be intimated to read at the first glance due to the use of the Gaussian filters. There are many nice blogs that described Read and Write operations with attention mechanism in detail. The main idea is the Read operation crops the input image. The Write operation draws a patch to the canvas matrix.


This is a must-read paper. The combine of progress refinement through time with attention mechanism is a nice idea to simplify the complex image distribution. This is one of the early paper that combine RNN with attention to handle the spatial data such image. I think this is an amazing accomplishment.


[1] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).


Towards a Neural Statistician (ICLR2017)

One extension of VAE is to add a hierarchy structure. In the classical VAE, the prior is drawn from a standard Gaussian distribution. We can learn this prior from the dataset so that each dataset has its own prior distribution.

The generative process is:

  • Draw a dataset prior \mathbf{c} \sim N(\mathbf{0}, \mathbf{I})
  • For each data point in the dataset
    • Draw a latent vector \mathbf{z} \sim P(\cdot | \mathbf{c})
    • Draw a sample \mathbf{x} \sim P(\cdot | \mathbf{z})

The likelihood of the dataset is:

p(D) = \int p(c) \big[ \prod_{x \in D} \int p(x|z;\theta)p(z|c;\theta)dz \big]dc

The paper define the approximate inference network, q(z|x,c;\phi) and q(c|D; \phi) to optimize a variational lowerbound. The single dataset log likelihood lowerboud is:

\mathcal{L}_D = E_{q(c|D;\phi)}\big[  \sum_{x \in d} E_{q(z|c, x; \phi)}[ \log p(x|z;\theta)] - D_{KL}(q(z|c,x;\phi)||p(z|c;\theta)) \big] - D_{KL}(q(c|D;\phi)||p(c))

The interesting contribution of this paper is their statistic network q(c|D; \phi) that approximates the posterior distribution over the context c given the dataset D. Basically, this inference network has an encoder to take each datapoint into a vector e_i = E(x_i). Then, add a pool layer to aggregate e_i into a single vector. This paper uses an element-wise mean. Finally, the final vector is used to generate parameters of a diagonal Gaussian.


This model surprisingly works well for many tasks such as topic models, transfer learning, one-shot learning, etc.

Reference: (Poster ICLR 2017)

Improved Variational Autoencoders for Text Modeling using Dilated Convolutions (ICML’17)

One of the reasons that VAE with LSTM as a decoder is less effective than LSTM language model due to the LSTM decoder ignores conditioning information from the encoder. This paper uses a dilated CNN as a decoder to improve a perplexity on held-out data.

Language Model

The language model can be modeled as:

p(\textbf{x}) = \prod_t p(x_t | x_1, x_2, \cdots, x_{t-1})

LSTM language model use this conditional distribution to predict the next word.

By adding an additional contextual random variable [2], the language model can be expressed as:

p(\textbf{x}, \textbf{z}) = \prod_t p(x_t | x_1, x_2, \cdots, x_{t-1}, \textbf{z})

The second model is more flexible as it explicitly model a high variation in the sequential data. Without a careful training, the VAE-based language model often degrades to a standard language model as the decoder chooses to ignore the latent variable generated by the encoder.

Dilated CNN

The authors replace LSTM decoder with Dilated CNN decoder to control the contextual capacity. That is when the convolutional kernel is large, the decoder covers longer context as it resembles an LSTM. But if the kernel becomes smaller, the model becomes more like a bag-of-word. The size of kernel controls the contextual capacity which is how much the past context we want to use to predict the current word.


Stacking Dilated CNN is crucial for a better performance because we want to exponentially increase the context windows. WaveNet [3] also uses this approach.


By replacing VAE with a more suitable decoder, VAE can now perform well on language model task. Since the textual sequence does not contain a lot of variation, we may not notice an obvious improvement. We may see more significant improvement in a more complex sequential data such as speech or audio signals. Also, the experimental results show that Dilated CNN is better than LSTM as a decoder but the improvement in terms of perplexity and NLL are still incremental to the standard LSTM language model. We hope to see stronger language models using VAE in the future.


[1] Yang, Zichao, et al. “Improved Variational Autoencoders for Text Modeling using Dilated Convolutions.” arXiv preprint arXiv:1702.08139 (2017).

[2] Bowman, Samuel R., et al. “Generating sentences from a continuous space.” arXiv preprint arXiv:1511.06349 (2015).

[3] Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).

Importance Weighted Autoencoders (ICLR’16)

In a classical VAE, the ELBO is

\log P(\textbf{x}) \ge E_{Q(\textbf{z}|\textbf{x})}[\log \frac{P(\textbf{x},\textbf{z})}{Q(\textbf{z}|\textbf{x})}] = L(x)

The unbiased estimation of the gradient of L(x) is:

\nabla_{\theta} L(\textbf{x})= \frac{1}{k}\sum_{i=1}^k \nabla_{\theta}\log w(\textbf{x},\textbf{z}_i; \theta)

where w(\textbf{x},\textbf{z}_i; \theta) = w(\textbf{x},\textbf{z}_i; \theta) = \frac{p(\textbf{x},\textbf{z}_i)}{q(\textbf{z}_i|\textbf{x})}

The importance weighted autoencoders has a slightly different ELBO:

\log P(\textbf{x}) \ge E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\log \frac{1}{k}\sum_{i=1}^k \frac{P(\textbf{x},\textbf{z}_i)}{Q(\textbf{z}_i|\textbf{x})}]

The unbiased estimation of the gradient is:

\nabla_{\theta} L_k(\textbf{x}) = \nabla_{\theta} E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\log\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta))] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\nabla_{\theta} \log\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta))] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta))}{\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta))}{\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}] \\= E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}{\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\sum_{i=1}^k \nabla_{\theta} w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}]

Then, use the log identity:

= E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\sum_{i=1}^k \frac{w(\textbf{x}, \textbf{z}_i; \theta)}{\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\sum_{i=1}^k \tilde w_i\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)]

The Monte Carlo estimation is then:

$latex \sum_{i=1}^k \tilde w_i\nabla_{\theta} \log w(\textbf{x}, \textbf{z}_i; \theta)$.

The Importance Weighted Autoencoders (IWAE) has a similar network architecture as VAE because when k = 1, the Monte Carlo estimation becomes the standard VAE.

It is unclear why IWAE is better than VAE since we can draw multiple samples from VAE to approximate its log data likelihood. The main difference is the weight function. That is why it is called Importance samples.


[1] Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. “Importance weighted autoencoders.” arXiv preprint arXiv:1509.00519 (2015).

Autoencoding beyond pixels using a learned similarity metric (ICML’16)

One of the key components of an autoencoder is a reconstruction error. This term measures how much useful information is compressed into a learned latent vector. The common reconstruction error is based on an element-wise measurement such as a binary cross entropy for a black-and-white image or a square error between a reconstructed image and the input image.

The authors think that an element-wise measurement is not an accurate indicator of goodness of a learned latent vector. Hence, they proposed to learn a similar metric via an adversarial training. Here is how they set up the objective functions:

They use VAE to learn a latent vector of the input image. There are two loss functions in a vanilla VAE: a KL loss and a negative log data-likelihood. They replace the second loss with a new reconstruction loss function. We will talk about this new loss function in the next paragraph. Then, they have a discriminator that tries to distinguish the real input data from the generated data from the decoder of VAE. The discriminator will encourage the VAE to learn stronger encoder and decoder.

The discriminator can be decomposed into 2 parts. If it has L + 1 layers, then the first L layers is a transform function that maps the input data into a new representation. The last layer is a binary classifier. This means that if we input any input through the first L layer, we will get a new representation that is easily classified by the last layer of the discriminator. When the discriminator is very good at detecting the real input, the representation at L layer is going to be much easier to classify compared to the input data. It means that a square error between a transformed input and its transformed reconstruction input should be somewhat small when these inputs are similar.

This model has trained the same fashion as GANs; simultaneously train VAE and GANs. This idea works well for an image because the square-error is not a good metric for an image quality. This idea may work on text dataset as well because we assess the quality of the reconstructed texts based on the whole input but not collectively evaluate one word at a time.