Generating images from captions with attention (ICLR’16)


This work extends the original DRAW paper [2] to generate images from captions. We can treat the model as a conditional DRAW: it models the conditional probability P(\text{image}|\text{caption}). The additional textual data controls where the model reads from and writes to the image.


Generating images from text descriptions is a structured prediction task: given a sequence of words, we want to generate an image. AlignDRAW borrows the approach of DRAW, combining progressive refinement with attention; its contribution is incorporating the text sequence.

In DRAW, the latent variable is sampled from a spherical Gaussian, z_t \sim \mathcal{N}(Z_t|\mu_t, \sigma_t), whose mean and variance are functions of the current hidden state of the encoder, e.g. \mu_t = W(h_t^{\text{enc}}). AlignDRAW, in contrast, adds a dependency between latent variables: z_t \sim \mathcal{N}(\mu(h_{t-1}^{\text{gen}}), \sigma(h_{t-1}^{\text{gen}})).

During image generation, DRAW iteratively samples a latent variable z_t from the prior \mathcal{N}(0, I), whereas AlignDRAW draws z_t from P(Z_t|Z_{<t}). Each latent vector in AlignDRAW therefore depends on the ones before it.


Align Operator

The input caption is fed to a bidirectional RNN. The output at each time step needs to be aligned with the current drawing patch. The attention weights are learned from the caption representation up to k words and the current hidden state of the decoder, h_{t-1}^{\text{gen}}. Finally, the weighted average of all hidden states of the language model gives the caption context, s_t. This context, together with the latent vector z_t, is fed to the LSTM decoder.
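The alignment step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the dot-product scoring function and the names align, dot_score, h_lang and h_gen_prev are hypothetical, and the real model learns its scoring function.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_score(h_k, h_gen):
    # Assumption: a plain dot product stands in for the learned scoring function.
    return sum(a * b for a, b in zip(h_k, h_gen))

def align(h_lang, h_gen_prev, score_fn=dot_score):
    """Return (attention weights, caption context s_t) for one drawing step."""
    scores = [score_fn(h_k, h_gen_prev) for h_k in h_lang]
    weights = softmax(scores)
    dim = len(h_lang[0])
    # Weighted average of the language-model hidden states.
    context = [sum(w * h_k[d] for w, h_k in zip(weights, h_lang)) for d in range(dim)]
    return weights, context

# Toy caption of three words with 2-dimensional hidden states.
h_lang = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_gen_prev = [0.5, -0.5]
weights, s_t = align(h_lang, h_gen_prev)
```

The weights sum to one, so s_t always lies inside the convex hull of the word representations; the decoder state only decides which words dominate at the current step.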

Objective Function

This model maximizes the expectation of the variational lower bound, which has two terms: the data likelihood and the KL loss.
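Written out, the two terms take roughly the following form. The notation is an assumption made for illustration (x is the image, y the caption, Q the approximate posterior); see the paper for the exact statement.

```latex
\mathcal{L} = \mathbb{E}_{Q(Z_{1:T} \mid x, y)} \Big[ \log P(x \mid y, Z_{1:T})
  - \sum_{t=1}^{T} \mathrm{KL}\big( Q(Z_t \mid Z_{<t}, x, y) \,\|\, P(Z_t \mid Z_{<t}, y) \big) \Big]
```

The first term is the data likelihood; the second penalizes the posterior over each latent variable for drifting away from the learned prior P(Z_t|Z_{<t}).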

Closing Thoughts

AlignDRAW uses a bidirectional LSTM with attention to align each word context with patches in the image. Some images generated from captions are interesting, such as ‘A herd of elephants walking across a dry grass field’. The model generalizes beyond the training data and is able to generate novel images.


[1] Mansimov, Elman, et al. “Generating images from captions with attention.” arXiv preprint arXiv:1511.02793 (2015).

[2] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).


DRAW: A Recurrent Neural Network For Image Generation (ICML’15)

This paper proposes a new method for image generation that progressively improves the reconstructed image.

Other image generation models generate the entire image at once by learning a sampling function (GANs) or a distribution over a latent vector (VAE), or generate one pixel at a time (PixelRNN, PixelCNN). Although the images generated by these models are of good quality, the models are forced to learn a complicated, high-dimensional distribution. For example, to generate a car image, a model needs to approximate the distribution of all possible cars. This is a difficult task.


(Note: I took this Figure from the original paper)

Incremental Update

Progressive refinement breaks the complex distribution down into a chain of conditional distributions:

P(X, C_T) = P(X|C_T)P(C_T) = P(X|C_T)P(C_T|C_{T-1}) \cdots P(C_1|C_0)

Each conditional distribution is much easier to estimate than the full distribution. The conditional probabilities are modeled by a standard LSTM.

Latent Variable

Using the VAE framework lets us project the high-dimensional input image into a low-dimensional latent space. Working in the smaller latent space is much easier than working in the original image space.

Attention Mechanism

Progressive refinement through the LSTM simplifies the complex distribution over time; the attention mechanism then simplifies the spatial data into a smaller patch. The encoder and decoder now only need to deal with a small fraction of the image instead of the image as a whole. This idea again reduces the input space by focusing only on the important part of the image.

Read and Write Operations

This part can be intimidating to read at first glance because of the Gaussian filters. Many nice blog posts describe the Read and Write operations with the attention mechanism in detail. The main idea is that the Read operation crops the input image and the Write operation draws a patch onto the canvas matrix.
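To make the Gaussian filters less intimidating, here is a minimal sketch of the Read operation in plain Python. It assumes the filterbank form of the DRAW paper (a grid of N Gaussian filters with center g, stride delta and width sigma); the function names, the list-of-lists image format and the zero-based pixel positions are choices made for this example.

```python
import math

def filterbank(center, delta, sigma, n_filters, size):
    """Each row is a 1-D Gaussian filter over `size` pixel positions, normalized to sum to 1."""
    bank = []
    for i in range(1, n_filters + 1):
        mu = center + (i - n_filters / 2 - 0.5) * delta  # center of the i-th filter
        row = [math.exp(-((a - mu) ** 2) / (2 * sigma ** 2)) for a in range(size)]
        norm = sum(row) or 1e-8  # guard against an all-zero filter
        bank.append([v / norm for v in row])
    return bank

def read(image, g_x, g_y, delta, sigma, n_filters, gamma=1.0):
    """Extract an n_filters x n_filters glimpse: gamma * F_Y * image * F_X^T."""
    height, width = len(image), len(image[0])
    f_x = filterbank(g_x, delta, sigma, n_filters, width)
    f_y = filterbank(g_y, delta, sigma, n_filters, height)
    # Apply the vertical filters first, then the horizontal ones.
    rows = [[sum(f_y[i][a] * image[a][b] for a in range(height)) for b in range(width)]
            for i in range(n_filters)]
    return [[gamma * sum(rows[i][b] * f_x[j][b] for b in range(width))
             for j in range(n_filters)] for i in range(n_filters)]

# A constant 8x8 image: every glimpse pixel should equal gamma.
image = [[1.0] * 8 for _ in range(8)]
patch = read(image, g_x=3.5, g_y=3.5, delta=1.0, sigma=1.0, n_filters=3)
```

The Write operation uses the same kind of filterbank in the transposed direction, spreading a small patch back onto the full-size canvas.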


This is a must-read paper. Combining progressive refinement through time with an attention mechanism is a nice idea for simplifying the complex image distribution. This is one of the early papers that combine an RNN with attention to handle spatial data such as images. I think this is an amazing accomplishment.


[1] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).


Sequence Autoencoder

As early as 2010, the RNN was shown to be a good architecture for language models [3] thanks to its ability to remember the previous context. In this post we will explore a few RNN architectures for learning document representations.

Semi-supervised Sequence Learning [2] (NIPS 2015)

This model uses two RNNs: the first as an encoder and the second as a decoder.


Instead of learning to generate an output as in the seq2seq model [1], this model learns to reconstruct its input. Hence, it is a sequence autoencoder. The paper uses LSTMs. This unsupervised model is used to pretrain an LSTM for different tasks such as sentiment analysis, text classification, and object classification.

A Hierarchical Neural Autoencoder for Paragraphs and Documents [4] (ACL 2015)

This model introduces a hierarchical LSTM to learn document structure. The architecture has both an encoder and a decoder. The encoder processes the sequence of word tokens in each sentence, and the final output of the LSTM represents the input sentence. A second LSTM layer then takes the sequence of sentence vectors and outputs a document vector.

The decoder works in the reverse direction. It takes a document vector and feeds it to an LSTM to decode the sentence vectors. Each sentence vector is then fed to another LSTM to decode each word in the sentence.

The authors also introduce an attention mechanism to put emphasis on particular sentences. The attention boosts the performance of the hierarchical model.


Generating Sentences from a Continuous Space [5]

This model combines an RNNLM [3] with a variational autoencoder. The architecture again consists of an encoder and a decoder and attempts to reconstruct the given input. An additional stochastic layer converts the encoder output into the mean and variance of a Gaussian distribution, and the document representation is sampled from this distribution. The decoder takes this representation and reconstructs the input word by word through another LSTM.


Training a VAE with this architecture is challenging because of component collapsing. The authors use KL annealing, incrementally increasing the weight of the KL loss over time. This modification helps the model learn a much better representation.
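KL annealing itself is only a few lines. The sketch below uses a linear ramp, which is an assumption chosen for simplicity and not necessarily the schedule used in the paper; kl_weight, annealed_loss and warmup_steps are hypothetical names.

```python
def kl_weight(step, warmup_steps):
    """Linear ramp from 0 to 1 over warmup_steps, then held at 1 (assumed schedule)."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def annealed_loss(reconstruction_nll, kl, step, warmup_steps=10000):
    """Negative ELBO with the KL term scaled by the annealing weight."""
    return reconstruction_nll + kl_weight(step, warmup_steps) * kl

# Early in training the KL term is almost free; later it counts in full.
early = annealed_loss(2.0, 4.0, step=0, warmup_steps=100)
halfway = annealed_loss(2.0, 4.0, step=50, warmup_steps=100)
late = annealed_loss(2.0, 4.0, step=200, warmup_steps=100)
```

With a weight near zero the model behaves like a plain autoencoder and learns to put information into the latent code; raising the weight gradually then pulls the posterior toward the prior without discarding that information.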


There are a few more RNN architectures that learn document or sentence representations. The goal of learning the representation can vary. If the goal is to generate realistic text or dialogue, then it is critical to retain syntactic accuracy as well as semantic information. However, if the goal is to obtain a global view of a document, then we may bypass syntactic details and focus on semantic meaning. These three models show how RNN architectures can be used for such tasks.


[1] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in neural information processing systems. 2014.

[2] Dai, Andrew M., and Quoc V. Le. “Semi-supervised sequence learning.” Advances in Neural Information Processing Systems. 2015.

[3] Mikolov, Tomas, et al. “Recurrent neural network based language model.” Interspeech. Vol. 2. 2010.

[4] Li, Jiwei, Minh-Thang Luong, and Dan Jurafsky. “A hierarchical neural autoencoder for paragraphs and documents.” arXiv preprint arXiv:1506.01057 (2015).

[5] Bowman, Samuel R., et al. “Generating sentences from a continuous space.” arXiv preprint arXiv:1511.06349 (2015).


Inference network: RNNs vs NNets

The standard variational autoencoder [1] uses neural networks to approximate the true posterior distribution by mapping an input to the mean and variance of a Gaussian distribution. A simple modification is to replace the feed-forward inference network with an RNN. That is exactly what this paper presents [2].

Intuitively, an RNN should work on datasets in which consecutive features are highly correlated. This means that for a public dataset such as MNIST, an RNN should have no problem approximating the posterior distribution of any MNIST digit.

I started with a classical VAE. First, I trained a VAE on the MNIST dataset with 500 hidden units for both the encoder and the decoder. I set the latent dimension to 2 so that I could quickly visualize it on a 2D plot.



2D embedding using Neural Nets (2-layers) as inference network

Some digits are clustered together but some are mixed because the VAE does not know the labels of the digits. It still puts similar digits nearby, e.g. the 7s are right next to the 9s, and many 3s and 2s are mixed together. To get a better separation between the digit classes, the label information should be utilized. In fact, our recent SIGIR’2017 publication utilizes label information to cluster similar documents together.

But let us come back to the original research question: is an RNN really going to improve the quality of the embedding vectors?


2D embedding using LSTM as inference network


The 2D plot above shows that using an LSTM as the inference network yields a slightly different embedding space.


2D embedding vectors of randomly chosen MNIST digits using GRU as inference network

The LSTM and the GRU also generate slightly different embedding vectors. The recurrent models tend to spread out each digit class; for example, the 6s (orange) are spread out. All models mix the digits 4 and 9 together. Mixing digits together might not be a bad thing, because some handwritten 4s look very similar to 9s. This probably indicates that the recurrent models can capture more subtle similarities between digits.

Now we will see whether the RNN models generate better-looking digits than the standard model.






Generated digits using neural nets as inference network

It is difficult to tell which model is better. In terms of training time, the neural nets are the fastest and the LSTM is the slowest. It could be that we have not utilized the strength of the RNN yet. Since we are working on the MNIST dataset, it might be easy for a traditional model (neural nets) to perform well. What if we train the model on a text dataset such as 20 Newsgroups? Intuitively, an RNN should be able to capture the sequential information, so we might get a better embedding space. Next time we will investigate further on text datasets.


[1] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[2] Fabius, Otto, and Joost R. van Amersfoort. “Variational recurrent auto-encoders.” arXiv preprint arXiv:1412.6581 (2014).

Recurrent Recommender Networks (WSDM’17)

The motivation of this work is to tackle the collaborative filtering problem in a realistic setting. Classical collaborative filtering models interpolate ratings based on both past and future ratings. But in a real-world situation, there are no future rating scores. Being able to extrapolate, i.e. predict future ratings, is therefore more practical.

One important argument for why many CF models performed well on the Netflix dataset is that the training and test data were mixed in time. The models were fed future ratings; hence, it was easy for them to predict a user rating.

Therefore, modeling the temporal and causal aspects of the rating data is the main goal of this work. The authors give the example of the movie ‘Plan 9’, which was initially reviewed as a bad film but became very popular later. Another observation is that some movies are more popular during Christmas and summer. It is also reasonable to assume that user preferences change over time: as users grow up, their taste in films changes.

With these motivations and observations, the authors propose using RNNs to model user and movie dynamics over time, in the hope that the RNNs will capture both exogenous and endogenous dynamics. The key ingredient of their model is to incorporate a wall clock (time index) as part of the sample features. Here is what each training sample looks like:

s_t = [x_t, 1_{\text{newbie}}, \tau_t, \tau_{t-1}]

x_t is the vector of the user’s ratings at time t: x_{jt} = k means that the user rated movie j at time t with a rating of k, and x_{jt} = 0 means that the user did not rate movie j. 1_{\text{newbie}} appears to be an indicator that the user has no previous ratings, i.e. a new user. The last two entries, \tau_t and \tau_{t-1}, are important because the RNN uses the time index to handle steps with no ratings.
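As a concrete illustration, a single training sample could be assembled as follows. This is a sketch under assumptions: build_sample and its arguments are hypothetical names, and the ratings for a time step are passed as a dict from movie index to rating.

```python
def build_sample(ratings_at_t, num_movies, is_newbie, tau_t, tau_prev):
    """Concatenate the rating vector, the newbie indicator and the two wall-clock times."""
    x_t = [0.0] * num_movies          # 0 means "not rated at this step"
    for j, k in ratings_at_t.items():
        x_t[j] = float(k)             # x_{jt} = k: movie j rated k at time t
    return x_t + [1.0 if is_newbie else 0.0, float(tau_t), float(tau_prev)]

# A user who rated movie 0 with 4 stars and movie 3 with 2 stars at step 12.
s_t = build_sample({0: 4, 3: 2}, num_movies=5, is_newbie=False, tau_t=12, tau_prev=11)
```

In a real dataset x_t is extremely sparse, so a sparse representation would be the practical choice; the dense list here only keeps the example readable.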

Another important component is a function that projects s_t into an embedding space; the embedding vector is then fed to an LSTM unit. The linear transformation can be viewed as converting the raw data into a more abstract representation. This also implies that the model does not feed user ratings to the LSTM unit directly. The LSTM is used to model user and movie dynamics, and we can view the trained LSTM as a function that models these dynamics. The authors train two RNN models: one for users and another for movies.

Finally, at each time step, the model predicts a rating as follows:

\hat{r} = f(u_{it}, m_{jt}, u_i, m_j) = \langle \tilde u_{it}, \tilde m_{jt} \rangle + \langle u_i, m_j \rangle

This equation extends standard matrix factorization with the dynamic states \tilde u_{it} and \tilde m_{jt}. It means that at each time step, the model solves a matrix factorization based on the ratings up to time t.
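The prediction rule is just the sum of two inner products, one over the dynamic states and one over the stationary factors. A minimal sketch, with predict_rating and the toy vectors chosen purely for illustration:

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def predict_rating(u_dyn, m_dyn, u_stat, m_stat):
    """Dynamic part <u_it, m_jt> plus stationary part <u_i, m_j>."""
    return dot(u_dyn, m_dyn) + dot(u_stat, m_stat)

# Toy 2-dimensional factors: dynamic part gives 2.0, stationary part gives 3.0.
r_hat = predict_rating([0.5, 1.0], [2.0, 1.0], [1.0, 0.0], [3.0, 4.0])
```

Dropping the dynamic term recovers ordinary matrix factorization, which is why the model can be read as a time-aware extension of it.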

Training this model requires alternating updates, since we cannot train the user and movie networks simultaneously; otherwise, there would be too many RNNs for all the movies. Thus, the authors fix the movie dynamic function and train the user dynamic function, then fix the user dynamic function and train the movie dynamic function, and alternate. Training the model this way is more scalable.

The experimental results show that this model (RRN) beats TimeSVD++, AutoRec, and PMF. Further, the model can capture external factors, such as the rating scale change in the Netflix dataset and events like the Oscar or Golden Globe awards, as well as internal factors such as seasonal changes.

My two cents: I like this paper because the motivation is well written and I can see the benefit of modeling the dynamics of users and movies. I am surprised that there is not much related work that attempts to solve the extrapolation problem.

A Recurrent Latent Variable Model for Sequential Data (NIPS’15)

This paper presents a sequential model that incorporates uncertainty to better model the variability that arises from the data itself.

The motivation is that data, especially speech signals, has high variability that does not come from noise alone. The complex relationship between the observed data and the underlying factors of variability cannot be modeled by a basic RNN alone. For example, the vocal quality of the speaker affects the audio waveform even when the same word is spoken.

In a classical RNN, the state transition h_t = f_{\theta}(x_t, h_{t-1}) is a deterministic function, and f_{\theta} is typically an LSTM or a GRU. The RNN models the joint probability of the entire sequence, p(x_1, x_2, \cdots, x_T) = \prod_{t=1}^T p(x_t | x_{<t}) = \prod_{t=1}^T g_{\tau}(h_{t-1}), where g_{\tau} is an output function that maps the hidden state to a probability distribution over the output. The choice of g_{\tau} depends on the problem. Typically, g has two parts: (1) a parameter generator, \phi_t = \varphi_{\tau}(h_{t-1}), and (2) a density function, P_{\phi_t}(x_t | x_{<t}). We can also make g a GMM, in which case \varphi_{\tau} generates the mixture coefficient parameters.

The only source of variability in an RNN is the output function g. This can be problematic for speech signals because the RNN must map many variants of the input wave to a potentially large variation of the hidden state h_t. This limitation motivates the authors to introduce uncertainty into the RNN.

To turn the RNN into a non-deterministic model, the authors assume that each data point x_t has a latent variable z_t, which is initially drawn from a standard Gaussian distribution. The generative process is as follows:

  • For each step t = 1 to T
    • Compute prior parameters: [\mu_{0,t}, \text{diag}(\sigma_{0,t})] = \phi_{\tau}^{\text{prior}}(h_{t-1})
    • Draw a prior: z_t \sim N(\mu_{0,t}, \text{diag}(\sigma_{0,t}^2))
    • Compute likelihood parameters: [\mu_{x,t}, \sigma_{x,t}] = \phi_{\tau}^{\text{dec}}(\phi_{\tau}^z(z_t), h_{t-1})
    • Draw a sample: x_t | z_t \sim N(\mu_{x,t}, \text{diag}(\sigma_{x,t}^2))
    • Compute a hidden state: h_t = f_{\theta}(\phi_{\tau}^x(x_t), \phi_{\tau}^z(z_t), h_{t-1})
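The steps above can be sketched as a short sampling loop. This is a scalar toy version under loud assumptions: prior_net, dec_net and transition are hand-written stand-ins for the learned networks, and the feature extractors \phi^x and \phi^z are taken to be the identity.

```python
import math
import random

def generate(T, prior_net, dec_net, transition, h0=0.0, rng=None):
    """Sample a sequence x_1..x_T and its latents z_1..z_T from a toy scalar VRNN."""
    rng = rng or random.Random(0)   # fixed seed by default so runs are repeatable
    h, xs, zs = h0, [], []
    for _ in range(T):
        mu_0, sigma_0 = prior_net(h)      # prior parameters from h_{t-1}
        z = rng.gauss(mu_0, sigma_0)      # z_t ~ N(mu_0, sigma_0^2)
        mu_x, sigma_x = dec_net(z, h)     # likelihood parameters from z_t and h_{t-1}
        x = rng.gauss(mu_x, sigma_x)      # x_t | z_t ~ N(mu_x, sigma_x^2)
        h = transition(x, z, h)           # h_t = f(x_t, z_t, h_{t-1})
        xs.append(x)
        zs.append(z)
    return xs, zs

# Hypothetical stand-ins for the learned networks.
def prior_net(h):
    return math.tanh(h), 1.0

def dec_net(z, h):
    return z + 0.5 * h, 0.1

def transition(x, z, h):
    return math.tanh(0.5 * h + 0.3 * x + 0.2 * z)

xs, zs = generate(5, prior_net, dec_net, transition)
```

Because z_t feeds into the transition, the randomness injected at one step propagates to every later hidden state, which is what separates this model from an RNN with a stochastic output layer only.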

The state transition function is now non-deterministic because z_t is a random variable. Also, the hidden state h_{t-1} depends on x_{<t} and z_{<t}; therefore, we can replace the conditioning on h_{t-1} with:

  • z_t \sim p(z_t | x_{<t}, z_{<t})
  • x_t|z_t \sim p(x_t | z_{\le t}, x_{<t})

Thus, the joint distribution becomes:

p(x_{\le T}, z_{\le T}) = \prod_{t=1}^T p(x_t|z_{\le t}, x_{<t})p(z_t|x_{<t},z_{<t})

The objective is to maximize the log-likelihood of the input sequence, \log p(x_{\le T}) = \log \int_z p(x_{\le T}, z_{\le T}) dz. Assuming the approximate posterior factorizes as q(z_{\le T} | x_{\le T}) = \prod_{t=1}^T q(z_t | x_{\le t}, z_{<t}), the ELBO is:

E_{q(z_{\le T}|x_{\le T})}\Big[ \sum_{t=1}^T \big( \log p(x_t|z_{\le t},x_{<t}) - \mathrm{KL}(q(z_t|x_{\le t},z_{<t}) \| p(z_t|x_{<t},z_{<t})) \big) \Big]

The ELBO can be optimized efficiently through the variational autoencoder framework. In fact, this model is a sequential version of the classical variational autoencoder.
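Since both the prior and the approximate posterior at each step are diagonal Gaussians, the KL term has a closed form. The helper below is a minimal sketch of that standard formula; the function name and the list-based interface are assumptions made for the example.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

# Identical distributions have zero divergence; shifting the mean by 1 costs 0.5.
same = kl_diag_gaussians([0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0])
shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

In training, this quantity is computed at every time step from the outputs of the prior and inference networks and summed over the sequence.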


Chung, Junyoung, et al. “A recurrent latent variable model for sequential data.” Advances in neural information processing systems. 2015.

TopicRNN : A Recurrent Neural Network with Long-Range Semantic Dependency

This paper presents an RNN-based language model designed to capture long-range semantic dependencies. The proposed model is simple and elegant, and yields sensible topics.

The key insight of this work is the difference between semantics and syntax. Semantics relates to the overall structure and information of a given context. For a document, its semantics is its theme or topic. Semantics is meant to capture the global meaning of the context, and we need to see enough words to understand it.

In contrast, syntax deals with local information: the likelihood of the current word depends heavily on the preceding words. This local information depends on word order, whereas the global information does not.

This paper points out weaknesses of probabilistic topic models such as LDA: they ignore word order and perform poorly on word prediction. If we use bigrams or trigrams, these higher-order models become intractable. Furthermore, LDA does not model stop words very well because it is based on word co-occurrence. Stop words appear everywhere: they do not carry semantic information but act as filler that makes the language more readable. Thus, stop words are usually discarded during preprocessing when training LDA.

Language models attempt to capture sequential information. They model the joint distribution of words as P(y_1, y_2, \cdots, y_T) = P(y_1) \prod_{t=2}^T p(y_t | y_{1:t-1}). Traditional n-gram models need a Markov assumption to keep inference tractable. The shortcoming is the limited context window, and a higher-order Markov assumption makes inference more difficult.

The RNN-based language model avoids the Markov assumption by modeling the conditional probability as P(y_t | y_{1:t-1}) = p(y_t|h_t), where h_t = f(h_{t-1}, x_t). Basically, h_t is a summary of the preceding words, and the model uses this information to predict the current word. The RNN-based language model works pretty well, but it has difficulty with long-range dependencies because of optimization difficulties and overfitting.

The contribution of this paper is combining the advantages of topic models and RNN-based language models. The topic model is used as a bias on the learned conditional word probability. The authors make the topic vector a bias because they do not want to mix it with the RNN hidden state, which includes stop words.

The model has a binary switch variable. When it encounters a stop word, the switch turns off and disables the topic vector; otherwise the switch is on. The word probability is defined as follows:

p(y_t = i | h_t, \theta, l_t, B) \propto \exp ( v_i^T h_t + (1 - l_t) b_i^T \theta)

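The switch variable l_t turns the topic vector \theta on and off: the topic term is multiplied by (1 - l_t), so l_t = 1 (a stop word) removes the topic bias and l_t = 0 keeps it.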
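A minimal sketch of this word distribution, conditional on the switch, is shown below. The names word_probs, V and B and the toy vectors are assumptions made for illustration: V holds one output vector per vocabulary word, and B holds one topic-loading vector per word.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def word_probs(h_t, theta, V, B, is_stopword):
    """Softmax over the vocabulary, with a topic bias that is dropped for stop words."""
    l_t = 1.0 if is_stopword else 0.0
    logits = [dot(v_i, h_t) + (1.0 - l_t) * dot(b_i, theta) for v_i, b_i in zip(V, B)]
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Two-word vocabulary; the topic favors word 1, the RNN state is neutral.
V = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.0], [2.0]]
h_t, theta = [0.0, 0.0], [1.0]
p_stop = word_probs(h_t, theta, V, B, is_stopword=True)
p_topic = word_probs(h_t, theta, V, B, is_stopword=False)
```

With the switch off the distribution is determined by the RNN state alone, so the topic vector never has to explain stop words.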

This model is an end-to-end network, meaning that it jointly learns the topic vectors and the local state of the RNN. The topic vector is coupled with the RNN state, so the local dynamics of the word sequence influence the topic vector and vice versa.

The RNN can be replaced with a GRU or an LSTM. The paper shows that using a GRU yields the best perplexity on the Penn Treebank (PTB) dataset. The learned representation can be used as a feature for many tasks, including sentiment analysis, where we want to classify positive and negative reviews on the IMDB dataset.

I find this model simple, and it elegantly combines a VAE with an RNN. The motivation is clear, and we can see why using contextual information learned by the VAE improves the quality of the representation.

reference: (ICLR 2017 – Poster)