Generating images from captions with attention (ICLR’16)


This work extends the original DRAW paper [2] to generate images given captions. We can treat this model as a conditional DRAW. That is, we model the conditional probability P(\text{image}|\text{caption}). The additional textual data controls where to read and write the image.


Generating images from text descriptions is a structured prediction task. That is, given a sequence of words, we want to generate an image. AlignDRAW borrows the same approach as DRAW by combining progressive refinement with attention; its contribution is incorporating the text sequence.

The latent variable in the DRAW model is sampled from a spherical Gaussian, z_t \sim \mathcal{N}(Z_t|\mu_t, \sigma_t), where the mean and variance are functions of the current hidden state of the encoder, e.g. \mu_t = W(h_t^{\text{enc}}). AlignDRAW, however, adds a dependency between latent variables: z_t \sim \mathcal{N}(\mu(h_{t-1}^{\text{gen}}), \sigma(h_{t-1}^{\text{gen}})).

During image generation, DRAW iteratively samples a latent variable z_t from the prior \mathcal{N}(0, I), whereas AlignDRAW draws z_t from P(Z_t|Z_{<t}). In other words, each latent vector in AlignDRAW depends on the ones before it.


Align Operator

The input caption is fed to a bidirectional RNN. The output at each time step needs to be aligned with the current drawing patch. An attention weight is then learned from the caption representation up to k words and the current hidden state of the decoder h_{t-1}^{\text{gen}}. Finally, we compute the weighted average of all hidden states of the language model to obtain the caption context, s_t. This context, together with a latent vector z_t, is fed to the LSTM decoder.
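The align step described above can be sketched in NumPy as additive attention over the caption states. This is only an illustration under assumptions: the parameter names U, W, v, b and the tanh scoring form are placeholders chosen here, not code from the paper.

```python
import numpy as np

def align(h_lang, h_gen_prev, U, W, v, b):
    """Return the caption context s_t and the attention weights alpha.

    h_lang:     (N, d_lang) bidirectional RNN states, one per caption word
    h_gen_prev: (d_gen,)    decoder hidden state h_{t-1}^{gen}
    U, W, v, b: hypothetical alignment parameters with shapes
                (d_a, d_lang), (d_a, d_gen), (d_a,), (d_a,)
    """
    # Unnormalized alignment score for each word.
    scores = np.tanh(h_lang @ U.T + h_gen_prev @ W.T + b) @ v   # (N,)
    # Softmax over words (shifted for numerical stability).
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Weighted average of the language-model hidden states.
    s_t = alpha @ h_lang                                        # (d_lang,)
    return s_t, alpha
```

The weights alpha sum to one, so s_t always stays inside the convex hull of the word representations.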

Objective Function

This model maximizes the expectation of the variational lower bound. There are two terms: the data likelihood and the KL loss.
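For the KL term, a minimal sketch assuming both the approximate posterior and the prior at a given step are diagonal Gaussians (the function name and argument layout are chosen here for illustration):

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over dimensions."""
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )
```

In a DRAW-style model this quantity would be accumulated over all time steps and subtracted from the log-likelihood of the final canvas.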

Closing Thoughts

AlignDRAW uses a bidirectional LSTM with attention to align each word context with the patches in the image. Some images generated from captions are interesting, such as ‘A herd of elephants walking across a dry grass field’. The model generalizes beyond the training data and is able to generate novel images.


[1] Mansimov, Elman, et al. “Generating images from captions with attention.” arXiv preprint arXiv:1511.02793 (2015).

[2] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).


DRAW: A Recurrent Neural Network For Image Generation (ICML’15)

This paper proposes a new method for image generation by progressively improving the reconstructed image.

Other image generation models generate the entire image by learning a sampling function (GANs) or a distribution over a latent vector (VAE), or generate one pixel at a time (PixelRNN, PixelCNN). Although the images generated by these models are of good quality, the models are forced to learn a complicated, high-dimensional distribution. For example, to generate a car image, a model needs to approximate the distribution of all possible cars. This is a difficult task.


(Note: I took this Figure from the original paper)

Incremental Update

Progressive refinement breaks down the complex distribution into a chain of conditional distributions:

P(X, C_T) = P(X|C_T)P(C_T) = P(X|C_T)P(C_T|C_{T-1}) \cdots P(C_1|C_0)

Estimating each conditional distribution in this chain is much easier than estimating the full distribution at once. The conditional probability is modeled by a standard LSTM.

Latent Variable

Using the VAE framework helps us project the high-dimensional input image into a low-dimensional space. Working in the smaller latent space is much easier than working in the original image space.

Attention Mechanism

Progressive refinement through the LSTM simplifies the complex distribution over time; the attention mechanism then simplifies the spatial data into a smaller patch. The encoder and decoder now only need to deal with a small fraction of the image instead of the image as a whole. This idea again reduces the input space by focusing only on the important part of the image.

Read and Write Operations

This part can be intimidating at first glance because of the Gaussian filters. Many good blog posts describe the Read and Write operations with the attention mechanism in detail. The main idea is that the Read operation crops a patch from the input image, and the Write operation draws a patch onto the canvas matrix.
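To make the crop-and-draw idea concrete, here is a rough NumPy sketch of a Gaussian filterbank read and write. It assumes 0-indexed pixel coordinates and takes the attention parameters (center, delta, sigma, gamma) as plain numbers; in the model they would be predicted from the decoder state, which is omitted here.

```python
import numpy as np

def filterbank(center, delta, sigma, N, size):
    """Build an (N, size) bank of 1-D Gaussian filters along one image axis."""
    i = np.arange(1, N + 1)
    mu = center + (i - N / 2 - 0.5) * delta            # filter centres, (N,)
    a = np.arange(size)                                 # pixel coordinates
    F = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2.0 * sigma ** 2))
    F /= np.maximum(F.sum(axis=1, keepdims=True), 1e-8)  # each filter sums to 1
    return F

def read(image, Fx, Fy, gamma=1.0):
    """Extract an N x N patch from a (B, A) image."""
    return gamma * Fy @ image @ Fx.T

def write(patch, Fx, Fy, gamma=1.0):
    """Spread an N x N patch back onto a (B, A) canvas region."""
    return (1.0 / gamma) * Fy.T @ patch @ Fx
```

A small delta with a small sigma zooms in on a sharp region, while a large delta covers most of the image at low resolution.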


This is a must-read paper. Combining progressive refinement through time with an attention mechanism is a nice idea for simplifying the complex image distribution. This is one of the early papers that combine an RNN with attention to handle spatial data such as images. I think this is an amazing accomplishment.


[1] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).



An autoregressive model can be used to generate realistic audio signals. Given a waveform \textbf{x} = \{x_1, x_2, \cdots, x_T\}, its joint probability is

p(\textbf{x}) = \prod_{t=1}^T p(x_t | x_1, \cdots, x_{t-1})

Recurrent neural nets are a natural model for learning and predicting a one-dimensional sequence. However, since an audio signal has a very large number of samples per second (e.g. 44.1 kHz for CD-quality audio), using an RNN to process one sample at a time is too slow.

This paper uses causal convolutional neural nets to predict p(x_{t+1}|x_1, \cdots, x_t). The convolutional kernel is masked so that the current output cannot see future inputs. A convolutional architecture is more scalable than RNNs because we can process multiple inputs in parallel. The main limitation is that the convolutional kernel has to be very large to cover a longer-range dependency.

The dilated convolution architecture uses a convolution kernel with holes in order to cover a large span of the input signal. The kernel is dilated by inserting zeros between its taps, which expands the kernel while skipping some inputs in between.
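A small NumPy sketch of a causal, dilated 1-D convolution may help here. It is a single-channel toy with hypothetical function names, not the layer used in the paper, and it also shows how the receptive field grows when layers with different dilations are stacked.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """y[t] = sum_k w[k] * x[t - k * dilation], with zeros before t = 0."""
    K = len(w)
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])   # left padding keeps the output causal
    y = np.zeros(len(x))
    for k in range(K):
        # Tap k looks k * dilation steps into the past.
        start = pad - k * dilation
        y += w[k] * xp[start:start + len(x)]
    return y

def receptive_field(kernel_size, dilations):
    """Number of input samples seen by one output of the stacked layers."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

With a kernel of size 2 and dilations 1, 2, 4, 8, the stack sees 16 past samples while using only four layers.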

Stacking convolutional layers is an effective way to increase the depth of the network. Residual and skip connections are used to speed up convergence of the model.

WaveNet can model the conditional probability given an external input such as a speaker identity, text, or phonetic features: p(\textbf{x}|\textbf{h}) = \prod_{t=1}^T p(x_t|x_1,\cdots,x_{t-1},\textbf{h}). There are two ways to construct an activation function that depends on \textbf{h}, an extra input/context.

Global conditioning: all the output depends on the given context \textbf{h} directly.

Local conditioning: the input context is broken down into a time series h_t. This signal is upsampled to match the length of the input audio sample.
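The two conditioning modes can be sketched as an extra term added inside a gated activation. The snippet below is a simplified NumPy illustration: the convolution outputs are passed in as arrays, V is a hypothetical projection matrix, and the local context is upsampled by plain repetition (assuming the audio length is a multiple of the context length), which is only one possible choice.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_activation(conv_f, conv_g, cond_f=0.0, cond_g=0.0):
    """z = tanh(conv_f + cond_f) * sigmoid(conv_g + cond_g)."""
    return np.tanh(conv_f + cond_f) * sigmoid(conv_g + cond_g)

def global_condition(h, V):
    """Project one context vector h (d,) to a bias (1, C) shared by all time steps."""
    return (V @ h)[None, :]

def local_condition(h_seq, V, n_samples):
    """Upsample a slow time series h_seq (T_h, d) by repetition, then project to (n_samples, C)."""
    factor = n_samples // h_seq.shape[0]
    up = np.repeat(h_seq, factor, axis=0)[:n_samples]
    return up @ V.T
```

The global term broadcasts the same bias over every time step, whereas the local term supplies a different bias per audio sample.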

WaveNet generates realistic human voices in both English and Mandarin. It can also generate a piece of music audio, but it sounds like a mad pianist storming on the piano.

You can check some generated output from DeepMind website:


Pixel Recurrent Neural Networks (ICML’16)

This paper proposes a novel autoregressive model to generate an image pixel by pixel. This idea makes sense since each pixel is correlated with its neighboring pixels: the same object has similar color and gradient. The same observation has been exploited in image compression such as JPEG, where the color does not change too frequently from one pixel to the next.

With this intuition, an autoregressive model assigns a probability to the observed sequence \vec x as p(\vec x) = \prod_{i=1}^{n^2} p(x_i | \vec x_{<i}). For an image, each pixel is conditioned on the previously seen pixels.

A recurrent neural network such as LSTM has been an effective architecture for a sequence learning task. Hence, LSTM can be applied for an image generation task as well. There are two main challenges with using an autoregressive model for an image: first, an image is a 2-dimensional object – squeezing it to a 1-d vector will lose spatial information; second, training pixel by pixel is time-consuming and can’t be parallelized because we have to process the previous pixel before the current pixel. This paper attempts to solve these problems.

Model a color image 

Each pixel x_i has 3 values: an intensity in the red, green, and blue channels. The conditional distribution is modified as:

p(x_i|\vec x_{<i}) = p(x_{i,R}|\vec x_{<i})p(x_{i,G}|\vec x_{<i},x_{i,R})p(x_{i,B}|\vec x_{<i},x_{i,R},x_{i,G})

Each color channel is conditioned on the preceding channels of the same pixel (G on R, and B on both R and G) as well as on all previous pixel values.

Pixels as Discrete Value

This is a neat idea. Each channel value is treated as one of 256 discrete intensities and modeled with a softmax. Using a discrete distribution provides more flexibility because we do not assume the shape of the distribution.
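As a small illustration of why a discrete distribution is flexible, the sketch below turns 256 logits into a categorical distribution over intensities and samples from it. The logits would come from the network; here they are just an input array, and the function names are made up for this example.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()       # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_intensity(logits, rng):
    """Draw one channel value in {0, ..., 255} from the categorical distribution."""
    return int(rng.choice(256, p=softmax(logits)))

def nll(logits, target):
    """Negative log-likelihood (in nats) of an observed intensity."""
    return float(-np.log(softmax(logits)[target]))
```

Nothing stops the distribution from being multimodal, for example putting most of its mass on both 0 and 255, which a single Gaussian over intensities could not express.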

Now we will dive into the proposed architectures:


When we apply an LSTM to a 2-dimensional input, we can first compute all the hidden states given the current pixel values and the previous states. In the Row LSTM architecture, the previous states are the hidden states from the row above. We can define a context window to control how much information is captured from the row above. If we set the context window to 3 (meaning we take the hidden states of the top, top-right, and top-left pixels):

h_{r, c} = f( h_{r-1, c-1}, h_{r-1, c}, h_{r-1, c+1}, x_{r, c})

We can parallelize this computation row by row.

Diagonal BiLSTM

One limitation of the Row LSTM is its lack of full context from the row above. By using a bidirectional LSTM, this architecture can capture forward and backward dependencies. Furthermore, each hidden state depends on the top-left and left hidden states:

h_{r,c} = f(h_{r-1,c-1}, h_{r,c-1}, x_{r,c})

However, the input image needs to be skewed by shifting each row one pixel to the right relative to the row above, so that this operation can be parallelized column by column.
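The skewing step is easy to see on a tiny array. The helper names below are made up for this sketch; row r is shifted right by r positions, so every column of the skewed array holds one anti-diagonal of the original image and can be processed in one step.

```python
import numpy as np

def skew(x):
    """Shift row r of an (n_rows, n_cols) array right by r positions."""
    n_rows, n_cols = x.shape
    out = np.zeros((n_rows, n_cols + n_rows - 1), dtype=x.dtype)
    for r in range(n_rows):
        out[r, r:r + n_cols] = x[r]
    return out

def unskew(s, n_cols):
    """Undo skew() and recover the original (n_rows, n_cols) array."""
    return np.stack([s[r, r:r + n_cols] for r in range(s.shape[0])])
```

In the skewed array, the entry to the left of a pixel is still its left neighbor, while the entry one row up and one column left is the pixel that sat directly above it in the original image.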

Pixel CNN

The sequential model is slow because it needs to process the previous states. Using convolutional layers to capture the spatial information is possible. But directly applying a convolutional kernel violates the autoregressive model’s assumption because the current pixel must not see the next pixel. Otherwise, we can’t generate the pixel because we will not know any incoming pixel. Hence, the mask is used to prevent the current pixel from seeing the future pixels. This trick is pretty neat.
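A minimal sketch of such a mask for a single-channel k x k kernel is shown below. It ignores the additional ordering between the R, G, and B channels, and the "A"/"B" naming is used here to distinguish a mask that hides the current pixel (for the first layer) from one that keeps it (for later layers).

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Return a (k, k) 0/1 mask that hides the pixels at and after the centre."""
    assert k % 2 == 1 and mask_type in ("A", "B")
    mask = np.ones((k, k))
    c = k // 2
    mask[c, c + 1:] = 0      # pixels to the right in the same row
    mask[c + 1:, :] = 0      # all rows below
    if mask_type == "A":
        mask[c, c] = 0       # the first layer must not see the current pixel
    return mask
```

The mask is multiplied elementwise with the kernel weights before every convolution, so training stays fully parallel while generation remains autoregressive.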


Of the three architectures, Diagonal BiLSTM achieves the best negative log-likelihood scores (measured in nats). The autoregressive model does not rely on a low-dimensional manifold assumption and models the conditional distribution directly. However, the computation is quite expensive for a large input.


Autoencoding beyond pixels using a learned similarity metric (ICML’16)

One of the key components of an autoencoder is the reconstruction error. This term measures how much useful information is compressed into the learned latent vector. The common reconstruction error is an element-wise measurement, such as a binary cross entropy for a black-and-white image or a squared error between the reconstructed image and the input image.

The authors argue that an element-wise measurement is not an accurate indicator of the goodness of a learned latent vector. Hence, they propose to learn a similarity metric via adversarial training. Here is how they set up the objective functions:

They use a VAE to learn a latent vector of the input image. There are two loss functions in a vanilla VAE: a KL loss and a negative log data-likelihood. They replace the second loss with a new reconstruction loss function. We will talk about this new loss function in the next paragraph. Then, they add a discriminator that tries to distinguish the real input data from the data generated by the decoder of the VAE. The discriminator encourages the VAE to learn a stronger encoder and decoder.

The discriminator can be decomposed into 2 parts. If it has L + 1 layers, then the first L layers are a transform function that maps the input data into a new representation. The last layer is a binary classifier. This means that if we pass any input through the first L layers, we get a new representation that is easily classified by the last layer of the discriminator. When the discriminator is very good at detecting the real input, the representation at layer L is going to be much easier to classify than the input data. It means that the squared error between a transformed input and its transformed reconstruction should be fairly small when these inputs are similar.
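As a toy illustration of this feature-space error, the sketch below uses a hypothetical fully connected discriminator body with ReLU layers, given as a list of weight matrices. The real model is convolutional; only the idea of comparing layer-L features instead of pixels is kept.

```python
import numpy as np

def features(x, weights):
    """Apply the first L layers of a toy discriminator (ReLU MLP, no biases)."""
    h = x
    for W in weights:
        h = np.maximum(0.0, h @ W)
    return h

def feature_reconstruction_loss(x, x_rec, weights):
    """Squared error between layer-L features of the input and its reconstruction."""
    diff = features(x, weights) - features(x_rec, weights)
    return float(np.sum(diff ** 2))
```

A perfect reconstruction gives zero loss, and two images that differ only in ways the discriminator features ignore are also treated as close.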

This model is trained in the same fashion as GANs: the VAE and the GAN are trained simultaneously. This idea works well for images because the squared error is not a good metric for image quality. It may work on text datasets as well, because we assess the quality of reconstructed text based on the whole input rather than by evaluating one word at a time.



Adversarial Variational Bayes

A variational autoencoder (VAE) requires an expressive inference network in order to learn a complex posterior distribution. A more expressive inference network tends to result in higher-quality generated data.

This work uses adversarial training to learn a function T(x,z) that approximates \log q_{\phi}(z|x) - \log p(z). The expectation of this term w.r.t. q_{\phi}(z|x) is in fact a KL-divergence term. Since the authors prove that the optimal T^*(x, z) = \log q_{\phi}(z|x) - \log p(z), the ELBO becomes:

E_{p_{D(x)}}E_{q_{\phi}(z|x)}[-T^*(x,z) + \log p_{\theta}(x|z)]
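A quick numerical check of the claim that the expectation of T^* under q is the KL term, using a 1-D toy case chosen here for illustration: q = N(1, 1) and a standard normal prior, so both log densities are known in closed form.

```python
import numpy as np

def log_normal(z, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at z."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

rng = np.random.default_rng(0)
mu_q, sigma_q = 1.0, 1.0

# Samples from the (toy) inference model q(z|x).
z = rng.normal(mu_q, sigma_q, size=200_000)

# Optimal discriminator output: T*(z) = log q(z|x) - log p(z).
t_star = log_normal(z, mu_q, sigma_q) - log_normal(z, 0.0, 1.0)

kl_monte_carlo = t_star.mean()
kl_exact = np.log(1.0 / sigma_q) + (sigma_q ** 2 + mu_q ** 2) / 2.0 - 0.5
```

The two numbers agree up to Monte Carlo noise, which is what lets the model replace the explicit KL term with the output of a well-trained T.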

In order to approximate T^*(x, z), the discriminator needs to learn to distinguish between a sample from the prior p_{D(x)}p(z) and a sample from the current inference model p_{D(x)}q_{\phi}(z|x). Thus, the objective function for the discriminator is set up as:

\max_T E_{p_{D(x)}}E_{q_{\phi}(z|x)} \log \sigma(T(x,z)) + E_{p_{D(x)}}E_{p(z)} \log(1 - \sigma(T(x,z))) (1)
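In code, eq. (1) is a binary classification objective on the raw outputs of T. The sketch below assumes the outputs of T for posterior pairs and for prior pairs have already been computed and are passed in as arrays; the function names are illustrative.

```python
import numpy as np

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    return -np.logaddexp(0.0, -t)

def discriminator_objective(T_posterior, T_prior):
    """Monte Carlo estimate of eq. (1), to be maximized w.r.t. T.

    T_posterior: T(x, z) with z ~ q(z|x)
    T_prior:     T(x, z) with z ~ p(z)
    """
    # log(1 - sigmoid(t)) equals log(sigmoid(-t)).
    return np.mean(log_sigmoid(T_posterior)) + np.mean(log_sigmoid(-T_prior))
```

An uninformative T that outputs zero everywhere scores 2 log(1/2), and the objective approaches zero as T separates the two kinds of pairs.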

Taking the gradient of T^*(x,z) w.r.t. the parameter \phi can be problematic because the optimal T^* itself depends on q_{\phi}(z|x). But the authors show that the expectation of the gradient of T^*(x, z) w.r.t. \phi is 0. Thus, this dependence contributes no gradient and no parameter update, and it can be ignored.

Since T(x,z) requires a sample z, the reparameterization trick is applied and the ELBO becomes:

E_{p_{D(x)}}E_{\epsilon}[-T^*(x, z_{\phi}(x, \epsilon)) + \log p_{\theta}(x|z_{\phi}(x, \epsilon))] (2)

This step is crucial because now the sampling is just a transformation of noise, and T^*(x, z) approximates the KL-divergence term. This makes the model look like a black-box model because we do not explicitly define a distribution q_{\phi}(z|x).

This model optimizes equations (1) and (2) using adversarial training. It optimizes eq. (1) for several steps in order to keep T(x, z) close to optimal while jointly optimizing eq. (2).

The adaptive contrast technique is used to keep T(x, z) sufficiently close to the optimum. Basically, the KL term in the ELBO is replaced by KL(q_{\phi}(z|x), r_{\alpha}(z|x)), where r_{\alpha}(z|x) is an auxiliary distribution, which could be a Gaussian distribution.

This model has connections to variational autoencoders, adversarial autoencoders, f-GANs, and BiGANs. A new training method for VAEs via adversarial training allows us to use a flexible inference model that approximates the true distribution over the latent vectors.



IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models (SIGIR’17)

This paper uses the GAN framework to combine generative and discriminative information retrieval models. It shows promising results on web search, item recommendation, and Q/A tasks.

Typically, retrieval models are classified into 2 types:

  • Generative retrieval model
    • It generates a document given a query and possibly a relevance score. The model is p(d|q, r).
  • Discriminative retrieval model
    • It computes a relevance score for the given query and document pair. The model is p(r| d, q).

The generative model tries to find a connection between document and query. On the other hand, the discriminative model attempts to model the interaction between query and document based on relevance scores.

Both models have their shortcomings. Many generative models require a predefined data-generating story, and a wrong assumption will lead to poor performance. The generative model usually tries to fit the data to its model without external guidance. Meanwhile, the discriminative model requires a lot of labeled data to be effective, especially for a deep neural network model.

By training both models within the GAN framework, it becomes possible to address these shortcomings. The generative model is now adaptive because the discriminator rewards it when it creates or selects good samples. This adaptive guidance from the discriminator is unique to the GAN framework and helps the generator learn to pick good samples from the data distribution. At the same time, the discriminator receives even more training data from the generative model. This is similar to semi-supervised learning, where unlabeled data are utilized.

Adversarial training allows us to improve both generative and discriminative models by learning them jointly through minimax training. Traditional training based on maximum likelihood has no principled way for the two models to give each other feedback.

The paper seems promising and its results on 3 information retrieval tasks are really good. But I notice that the training procedure requires pretraining. This makes me wonder whether pretraining accounts for part of the performance boost at test time. I could not find the part of the paper that explains the benefit of pretraining in their settings.

The discriminative model is straightforward. It is a sigmoid function. The discriminator basically gives a high probability when the given document-query pair is relevant. The generative model is more interesting. In standard GANs, the generator creates a sample from a simple distribution, but IRGAN does not generate a new document-query pair. Instead, the authors chose to let the generator select samples from the document pool. In my opinion, this approach is simpler than creating new data because the sample is realistic. Also, IRGAN cares about finding a function to compute a relevance score, so it is unnecessary to generate completely new data.

However, the cost function for the generator is an expectation over all documents in the corpus, and a Monte Carlo approximation of it will have high variance. Thus, they use policy gradient to reduce the variance so that the model can learn a useful representation. Although p(d|q,r) is a discrete distribution, backprop is applicable because the documents are sampled from p(d|q,r) beforehand. Thus, eq. 5 is differentiable. Extra care may be needed to reduce the variance further, and for that they use an advantage function. (Please see the reference on reinforcement learning [2].)
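The policy-gradient idea can be sketched for a generator that is a softmax over per-document scores. This is a generic REINFORCE estimator written for illustration, not the update from the paper: the reward array stands in for whatever signal the discriminator provides, and the sample mean is used as a simple baseline (which introduces a small bias).

```python
import numpy as np

def softmax(scores, temperature=1.0):
    z = scores / temperature
    z = z - z.max()                      # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def reinforce_gradient(scores, rewards, n_samples, rng, temperature=1.0):
    """Monte Carlo estimate of d E_{d ~ p}[reward(d)] / d scores."""
    p = softmax(scores, temperature)
    docs = rng.choice(len(scores), size=n_samples, p=p)
    baseline = rewards[docs].mean()      # simple variance-reduction baseline
    grad = np.zeros_like(scores)
    for d in docs:
        one_hot = np.zeros_like(scores)
        one_hot[d] = 1.0
        # Gradient of log p(d) w.r.t. the scores is (one_hot - p) / temperature.
        grad += (rewards[d] - baseline) * (one_hot - p) / temperature
    return grad / n_samples
```

Following this gradient raises the scores of documents whose reward is above the baseline and lowers the others; a smaller temperature concentrates the sampling on the top-scored documents.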

How positive and negative samples are generated is still confusing in this paper. It seems to be application specific. The authors mention using a softmax with a temperature hyper-parameter to put more or less focus on the top documents. My guess is that when we put less focus on the top documents, the generator has more chance to pick negative samples. After reading the paper again, it seems that all samples selected by the generator are treated as negative samples. This part remains unclear and I need to ask the authors for more details.

In conclusion, I like this paper because it tries to combine generative and discriminative retrieval models via the GAN framework. I would not have appreciated it if they had simply applied GANs to the IR task blindly. Instead, they explain their motivation and discuss the advantage of jointly training both models. It seems adversarial training is useful for IR tasks as well.


[1] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models. In Proceedings of SIGIR’17, Shinjuku, Tokyo, Japan, August 7-11, 2017, 10 pages.

[2] Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, and others. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NIPS.

[3] IRGAN code (