This work extends the original DRAW paper to generate images given captions; we can treat the model as a conditional DRAW. That is, instead of modeling P(x), it models the conditional distribution P(x | y) of an image x given a caption y, and the additional textual input controls where the model reads from and writes to the image.
Generating images from text descriptions is a structured prediction task: given a sequence of words, we want to generate an image. AlignDRAW borrows the same approach as DRAW, combining progressive refinement with attention; incorporating the text sequence into this process is its main contribution.
The latent variable in the DRAW model is sampled from a spherical Gaussian whose mean and variance are functions of the current hidden state of the encoder, i.e. Q(Z_t | x, Z_{1:t-1}) = N(mu(h_t^enc), sigma(h_t^enc)). AlignDRAW, however, adds a dependency between the latent variables through the prior: P(Z_{1:T}) = P(Z_1) * prod_{t=2}^{T} P(Z_t | Z_{1:t-1}).
During image generation, DRAW iteratively samples each latent variable from the fixed prior N(0, I), whereas AlignDRAW draws from the learned prior P(Z_t | Z_{1:t-1}) = N(mu(h_{t-1}^gen), sigma(h_{t-1}^gen)), whose mean and variance are functions of the previous generator state. This means the latent vectors in AlignDRAW are sequentially dependent.
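The sequential prior above can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's implementation: the toy scalar "weights" `w_mu` and `w_sigma` stand in for the learned matrices that map the previous generator state to the mean and standard deviation, and the previous latent stands in for h_{t-1}^gen.

```python
import math
import random

def sample_autoregressive_prior(T=4, z_dim=3, seed=0):
    """Sketch of AlignDRAW-style sequential latent sampling.

    Each z_t is drawn from a Gaussian whose mean and standard deviation
    depend on the previous latent (a stand-in for h_{t-1}^gen). At t=0
    the history is zero, so z_1 is effectively drawn from N(0, 1).
    """
    rng = random.Random(seed)
    w_mu, w_sigma = 0.5, -0.1  # hypothetical scalar parameters
    z_prev = [0.0] * z_dim
    latents = []
    for t in range(T):
        z_t = []
        for i in range(z_dim):
            mu = math.tanh(w_mu * z_prev[i])            # mean depends on history
            sigma = math.exp(w_sigma * abs(z_prev[i]))  # spread depends on history
            z_t.append(rng.gauss(mu, sigma))
        latents.append(z_t)
        z_prev = z_t
    return latents
```

In DRAW, by contrast, the loop body would ignore `z_prev` entirely and always draw from N(0, 1).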
The input caption is fed to a bidirectional RNN. The output at each time-step must be aligned with the current drawing patch. An attention weight is learned from the caption representation of each word h_k^lang and the previous hidden state of the decoder h_{t-1}^gen. Finally, a weighted average of all hidden states of the language model gives the caption context s_t. This context, together with a latent vector, is fed to the LSTM decoder.
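The alignment step can be sketched as a small additive-attention function. This is a pure-Python sketch under simple assumptions: `h_lang` is the list of per-word representations from the language model, `h_dec` is the decoder state, and `U`, `W`, `v` are toy parameters in place of the learned ones.

```python
import math

def align(h_lang, h_dec, U, W, v):
    """Score each word representation against the decoder state, softmax the
    scores into attention weights, and return the weighted-average context."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Additive attention score for each word k: v . tanh(U h_k + W h_dec)
    scores = []
    for h_k in h_lang:
        hidden = [math.tanh(dot(u_row, h_k) + dot(w_row, h_dec))
                  for u_row, w_row in zip(U, W)]
        scores.append(dot(v, hidden))

    # Softmax (shifted by the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # Context s_t: weighted average of the word representations
    dim = len(h_lang[0])
    s_t = [sum(a * h_k[i] for a, h_k in zip(alpha, h_lang)) for i in range(dim)]
    return alpha, s_t
```

The weights `alpha` sum to one, so `s_t` is a convex combination of the word representations; it changes at every generation step as `h_dec` changes.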
The model maximizes the variational lower bound on the log-likelihood, which consists of two terms: the data (reconstruction) likelihood and a KL divergence between the approximate posterior and the prior.
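The KL term has a closed form when both distributions are diagonal Gaussians, which is the case here (posterior Q from the encoder, learned prior P from the generator). A minimal sketch of that formula, with means and standard deviations passed as plain lists:

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) for diagonal Gaussians, summed over latent dimensions.

    Per dimension: log(sp/sq) + (sq^2 + (mq - mp)^2) / (2 sp^2) - 1/2.
    This is the regularization term of the variational lower bound.
    """
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )
```

When Q and P coincide the KL is zero, so the bound reduces to the reconstruction likelihood; in AlignDRAW this term is accumulated over all generation steps.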
AlignDRAW uses a bidirectional LSTM with attention to align each word's context with patches in the image. Some images generated from captions are striking, such as 'A herd of elephants walking across a dry grass field'. The model generalizes beyond its training data and is able to generate novel images.
 Mansimov, Elman, et al. “Generating images from captions with attention.” arXiv preprint arXiv:1511.02793 (2015).
 Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).