Generating images from captions with attention (ICLR’16)


This work extends the original DRAW paper [2] to generate images given captions. We can treat this model as conditional DRAW. That is we model a conditional probability of P(\text{image}|\text{caption}) The additional textual data controls where to read and write the image.


Generating images from text descriptions is a structure prediction task. That is given a sequence of words, we want to generate an image. Although AlignDRAW has borrowed the same approach as DRAW by combining progressive refinement with attention, incorporating text sequence is their contribution.

The latent variable in DRAW model is sampled from spherical Gaussians, z_t \sim \mathcal{N}(Z_t|\mu_t, \sigma_t) where its mean and variance are functions of the current hidden state of the encoder, e.g. \mu_t = W(h_t^{\text{enc}}). However, AlignDRAW adds dependency between latent variables: z_t \sim \mathcal{N}(\mu(h_{t-1})^{\text{gen}}, \sigma(h_{t-1}^{\text{gen}}).

During the image generation, DRAW iteratively samples a latent variable z_t from a prior \mathcal{N}(0, I), but AlignDRAW will draw z_t from P(Z_t|Z_{<t}). It means that there is a dependency between each latent vector in AlignDRAW model.


Align Operator

The input caption is fed to the BI-Directional RNN. Each output from each time-step needs to be aligned with the current drawing patch. Attention weight is then learned from caption representation up to k words and the current hidden state of the decoder h_{t-1}^{\text{gen}}. Finally, compute the weight average of all hidden state of the language model to obtain the caption context, s_t. This context together with a latent vector z_t will be fed to the LSTM decoder.

Objective Function

This model maximizes the expectation of the variational lowerbound. There are 2 terms: the data likelihood and KL loss.

Closing Thoughts

AlignDRAW uses bi-directional LSTM with attention to aligning each word context with the patches in the image. Some generated images from caption are interesting such as ‘A herd of elephants walking across a dry grass field’. The model generalizes the training data and able to generate novel images.


[1] Mansimov, Elman, et al. “Generating images from captions with attention.” arXiv preprint arXiv:1511.02793 (2015).

[2] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).


DRAW: A Recurrent Neural Network For Image Generation (ICML’15)

This paper proposes a new method for image generation by progressively improve the reconstructed image.

The previous image generation models generate the entire image by learning a sampling function (GANs), distribution over a latent vector (VAE), or generate one pixel at a time (PixelRNN, PixelCNN). Although the generated images from these models are in a good quality, these models are forced to learn a complicated and high-dimensional distribution. For example, to generate a car image, the models need to approximate the distribution of all possible cars. This is a difficult task.


(Note: I took this Figure from the original paper)

Incremental Update

Progressive refinement breaks down the complex distribution into a chain of conditional distribution:

P(X, C_T) = P(X|C_T)P(C_T) = P(X|C_T)P(C_T|C_{T-1}) \cdots P(C_1|C_0)

Therefore, estimating a conditional distribution is much easier. The conditional probability is modeled by the standard LSTM.

Latent Variable

Use VAE framework helps us project the input image which has a high dimension into a low-dimensional space. Working on the smaller latent space is much easier than the original image space.

Attention Mechanism

The progressive refinement through LSTM has simplified the complex distribution through time, then the attention mechanism simplifies the spatial data into a smaller patch. The encoder and decoder now only needs to deal with a small fraction of the image instead of the image as a whole. This idea again reduces the input space by focusing on the important part of the image only.

Read and Write Operations

This part can be intimated to read at the first glance due to the use of the Gaussian filters. There are many nice blogs that described Read and Write operations with attention mechanism in detail. The main idea is the Read operation crops the input image. The Write operation draws a patch to the canvas matrix.


This is a must-read paper. The combine of progress refinement through time with attention mechanism is a nice idea to simplify the complex image distribution. This is one of the early paper that combine RNN with attention to handle the spatial data such image. I think this is an amazing accomplishment.


[1] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).


Towards a Neural Statistician (ICLR2017)

One extension of VAE is to add a hierarchy structure. In the classical VAE, the prior is drawn from a standard Gaussian distribution. We can learn this prior from the dataset so that each dataset has its own prior distribution.

The generative process is:

  • Draw a dataset prior \mathbf{c} \sim N(\mathbf{0}, \mathbf{I})
  • For each data point in the dataset
    • Draw a latent vector \mathbf{z} \sim P(\cdot | \mathbf{c})
    • Draw a sample \mathbf{x} \sim P(\cdot | \mathbf{z})

The likelihood of the dataset is:

p(D) = \int p(c) \big[ \prod_{x \in D} \int p(x|z;\theta)p(z|c;\theta)dz \big]dc

The paper define the approximate inference network, q(z|x,c;\phi) and q(c|D; \phi) to optimize a variational lowerbound. The single dataset log likelihood lowerboud is:

\mathcal{L}_D = E_{q(c|D;\phi)}\big[  \sum_{x \in d} E_{q(z|c, x; \phi)}[ \log p(x|z;\theta)] - D_{KL}(q(z|c,x;\phi)||p(z|c;\theta)) \big] - D_{KL}(q(c|D;\phi)||p(c))

The interesting contribution of this paper is their statistic network q(c|D; \phi) that approximates the posterior distribution over the context c given the dataset D. Basically, this inference network has an encoder to take each datapoint into a vector e_i = E(x_i). Then, add a pool layer to aggregate e_i into a single vector. This paper uses an element-wise mean. Finally, the final vector is used to generate parameters of a diagonal Gaussian.


This model surprisingly works well for many tasks such as topic models, transfer learning, one-shot learning, etc.

Reference: (Poster ICLR 2017)

Improved Variational Autoencoders for Text Modeling using Dilated Convolutions (ICML’17)

One of the reasons that VAE with LSTM as a decoder is less effective than LSTM language model due to the LSTM decoder ignores conditioning information from the encoder. This paper uses a dilated CNN as a decoder to improve a perplexity on held-out data.

Language Model

The language model can be modeled as:

p(\textbf{x}) = \prod_t p(x_t | x_1, x_2, \cdots, x_{t-1})

LSTM language model use this conditional distribution to predict the next word.

By adding an additional contextual random variable [2], the language model can be expressed as:

p(\textbf{x}, \textbf{z}) = \prod_t p(x_t | x_1, x_2, \cdots, x_{t-1}, \textbf{z})

The second model is more flexible as it explicitly model a high variation in the sequential data. Without a careful training, the VAE-based language model often degrades to a standard language model as the decoder chooses to ignore the latent variable generated by the encoder.

Dilated CNN

The authors replace LSTM decoder with Dilated CNN decoder to control the contextual capacity. That is when the convolutional kernel is large, the decoder covers longer context as it resembles an LSTM. But if the kernel becomes smaller, the model becomes more like a bag-of-word. The size of kernel controls the contextual capacity which is how much the past context we want to use to predict the current word.


Stacking Dilated CNN is crucial for a better performance because we want to exponentially increase the context windows. WaveNet [3] also uses this approach.


By replacing VAE with a more suitable decoder, VAE can now perform well on language model task. Since the textual sequence does not contain a lot of variation, we may not notice an obvious improvement. We may see more significant improvement in a more complex sequential data such as speech or audio signals. Also, the experimental results show that Dilated CNN is better than LSTM as a decoder but the improvement in terms of perplexity and NLL are still incremental to the standard LSTM language model. We hope to see stronger language models using VAE in the future.


[1] Yang, Zichao, et al. “Improved Variational Autoencoders for Text Modeling using Dilated Convolutions.” arXiv preprint arXiv:1702.08139 (2017).

[2] Bowman, Samuel R., et al. “Generating sentences from a continuous space.” arXiv preprint arXiv:1511.06349 (2015).

[3] Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).

Importance Weighted Autoencoders (ICLR’16)

In a classical VAE, the ELBO is

\log P(\textbf{x}) \ge E_{Q(\textbf{z}|\textbf{x})}[\log \frac{P(\textbf{x},\textbf{z})}{Q(\textbf{z}|\textbf{x})}] = L(x)

The unbiased estimation of the gradient of L(x) is:

\nabla_{\theta} L(\textbf{x})= \frac{1}{k}\sum_{i=1}^k \nabla_{\theta}\log w(\textbf{x},\textbf{z}_i; \theta)

where w(\textbf{x},\textbf{z}_i; \theta) = w(\textbf{x},\textbf{z}_i; \theta) = \frac{p(\textbf{x},\textbf{z}_i)}{q(\textbf{z}_i|\textbf{x})}

The importance weighted autoencoders has a slightly different ELBO:

\log P(\textbf{x}) \ge E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\log \frac{1}{k}\sum_{i=1}^k \frac{P(\textbf{x},\textbf{z}_i)}{Q(\textbf{z}_i|\textbf{x})}]

The unbiased estimation of the gradient is:

\nabla_{\theta} L_k(\textbf{x}) = \nabla_{\theta} E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\log\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta))] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\nabla_{\theta} \log\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta))] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta))}{\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta))}{\frac{1}{k}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}] \\= E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}{\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\nabla_{\theta}\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\sum_{i=1}^k \nabla_{\theta} w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}]

Then, use the log identity:

= E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\frac{\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)}{C(\textbf{x};\theta)}] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\sum_{i=1}^k \frac{w(\textbf{x}, \textbf{z}_i; \theta)}{\sum_{i=1}^k w(\textbf{x}, \textbf{z}_i; \theta)}\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)] \\ = E_{\textbf{z}_1,\textbf{z}_2, \cdots, \textbf{z}_k \sim Q(\textbf{z}|\textbf{x})}[\sum_{i=1}^k \tilde w_i\nabla_{\theta}\log w(\textbf{x}, \textbf{z}_i; \theta)]

The Monte Carlo estimation is then:

$latex \sum_{i=1}^k \tilde w_i\nabla_{\theta} \log w(\textbf{x}, \textbf{z}_i; \theta)$.

The Importance Weighted Autoencoders (IWAE) has a similar network architecture as VAE because when k = 1, the Monte Carlo estimation becomes the standard VAE.

It is unclear why IWAE is better than VAE since we can draw multiple samples from VAE to approximate its log data likelihood. The main difference is the weight function. That is why it is called Importance samples.


[1] Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. “Importance weighted autoencoders.” arXiv preprint arXiv:1509.00519 (2015).

Autoencoding beyond pixels using a learned similarity metric (ICML’16)

One of the key components of an autoencoder is a reconstruction error. This term measures how much useful information is compressed into a learned latent vector. The common reconstruction error is based on an element-wise measurement such as a binary cross entropy for a black-and-white image or a square error between a reconstructed image and the input image.

The authors think that an element-wise measurement is not an accurate indicator of goodness of a learned latent vector. Hence, they proposed to learn a similar metric via an adversarial training. Here is how they set up the objective functions:

They use VAE to learn a latent vector of the input image. There are two loss functions in a vanilla VAE: a KL loss and a negative log data-likelihood. They replace the second loss with a new reconstruction loss function. We will talk about this new loss function in the next paragraph. Then, they have a discriminator that tries to distinguish the real input data from the generated data from the decoder of VAE. The discriminator will encourage the VAE to learn stronger encoder and decoder.

The discriminator can be decomposed into 2 parts. If it has L + 1 layers, then the first L layers is a transform function that maps the input data into a new representation. The last layer is a binary classifier. This means that if we input any input through the first L layer, we will get a new representation that is easily classified by the last layer of the discriminator. When the discriminator is very good at detecting the real input, the representation at L layer is going to be much easier to classify compared to the input data. It means that a square error between a transformed input and its transformed reconstruction input should be somewhat small when these inputs are similar.

This model has trained the same fashion as GANs; simultaneously train VAE and GANs. This idea works well for an image because the square-error is not a good metric for an image quality. This idea may work on text dataset as well because we assess the quality of the reconstructed texts based on the whole input but not collectively evaluate one word at a time.



Adversarial Variational Bayes

Variational autoencoder (VAE) requires an expressive inference network in order to learn a complex posterior distribution. The more complex inference network will result in generating high-quality data.

This work utilizes an adversarial training to learn a function T(x,z) that approximates \log q_{\phi}(z|x) - \log p(z). The expectation of this term w.r.t $latex q_{\phi}(z|x)$ is in fact a KL-divergence term. Since the authors prove that the optimal T^*(x, z) = \log q_{\phi}(z|x) - \log p(z), the ELBO becomes:

E_{p_{D(x)}}E_{q_{\phi}(z|x)}[-T^*(x,z) + \log p_{\theta}(x|z)]

In order to approximate T^*(x, z), the discriminator needs to learn to distinguish between a sample from a prior p_{D(x)}p(z) and the current inference model p_{D(x)}q_{\phi}(z|x). Thus, the objective function for the discriminator is setup as:

\max_T E_{p_{D(x)}}E_{q_{\phi(z|x)}} \log \sigma(T(x,z)) + E_{p_{D(x)}}E_{p(z)} \log(1 - \sigma(T(x,z))) (1)

Taking a gradient on T(x,z) w.r.t. parameter \phi can be problematic because the solution of this function depends on q_{\phi}(z|x). But the author shows that the expectation of gradient of T^*(x, z) w.r.t \phi is 0. Thus, there is no gradient and no parameter update when taking a gradient of T^*(x,z).

Since T(x,z) requires sample z, the parametrization trick is applied and the ELBO becomes:

E_{p_{D(x)}}E_{\epsilon}[-T^*(x, z_{\phi}(x, \epsilon) + \log p_{\theta}(x|z_{\phi}(x, \epsilon))] (2)

This step is crucial because now the sampling is just a transformation from a noise and let T^*(x, z) to approximate the KL-divergence term. This made this model looks like a blackbox model because we do not explicitly define a distribution q_{\phi}(z|x).

This model optimizes equation (1) and (2) using adversarial training. It optimizes eq.(1) several steps in order to keep T(x, z) close to optimal while jointly optimizes eq. (2).

Adaptive contrast technique is used to make T(x, y) to be sufficiently close to the optimal. Basically, the KL term in ELBO is replaced by KL(q_{\phi}(z|x), r_{\alpha}(z|x)) where r_{\alpha}(z|x) is an auxiliary distribution which could be a Gaussian distribution.

This model has a connection to variational autoencoder, adversarial autoencoder, f-GANs, and BiGANs. A new training method for VAE via adversarial training allows us to use a flexible inference that approximate a true distribution over the latent vectors.