DRAW: A Recurrent Neural Network For Image Generation (ICML’15)

This paper proposes a new method for image generation by progressively improve the reconstructed image.

The previous image generation models generate the entire image by learning a sampling function (GANs), distribution over a latent vector (VAE), or generate one pixel at a time (PixelRNN, PixelCNN). Although the generated images from these models are in a good quality, these models are forced to learn a complicated and high-dimensional distribution. For example, to generate a car image, the models need to approximate the distribution of all possible cars. This is a difficult task.


(Note: I took this Figure from the original paper)

Incremental Update

Progressive refinement breaks down the complex distribution into a chain of conditional distribution:

P(X, C_T) = P(X|C_T)P(C_T) = P(X|C_T)P(C_T|C_{T-1}) \cdots P(C_1|C_0)

Therefore, estimating a conditional distribution is much easier. The conditional probability is modeled by the standard LSTM.

Latent Variable

Use VAE framework helps us project the input image which has a high dimension into a low-dimensional space. Working on the smaller latent space is much easier than the original image space.

Attention Mechanism

The progressive refinement through LSTM has simplified the complex distribution through time, then the attention mechanism simplifies the spatial data into a smaller patch. The encoder and decoder now only needs to deal with a small fraction of the image instead of the image as a whole. This idea again reduces the input space by focusing on the important part of the image only.

Read and Write Operations

This part can be intimated to read at the first glance due to the use of the Gaussian filters. There are many nice blogs that described Read and Write operations with attention mechanism in detail. The main idea is the Read operation crops the input image. The Write operation draws a patch to the canvas matrix.


This is a must-read paper. The combine of progress refinement through time with attention mechanism is a nice idea to simplify the complex image distribution. This is one of the early paper that combine RNN with attention to handle the spatial data such image. I think this is an amazing accomplishment.


[1] Gregor, Karol, et al. “DRAW: A recurrent neural network for image generation.” arXiv preprint arXiv:1502.04623 (2015).



DocNADE (JMLR 2016)

This work extends Neural Autoregressive Distribution Estimation (NADE) for a document modeling.


The key idea of NADE is each hidden and output vectors are modeled as a conditional probability of previously seen vectors:

p(\textbf{v}) = \prod_{i=1}^D p(v_i | \textbf{v}_{<i})

\textbf{h}( \textbf{v}_{<i}) = g(c + W_{:,<i}\textbf{v}_{<i})

Then, the probability of the output is:

p(v_i=1|\textbf{v}_{<i}) = \sigma(b_i + V_{i,:}\textbf{h}_i(\textbf{v}_{<i}))


NADE has a set of separated hidden layers, each represents the previously seen context. However, NADE is not applicable for a variable length input such as a sequence of words.


DocNADE model tackles a variable length input issue by computing the hidden vector as follows:

\textbf{h}( \textbf{v}_{<i}) = g(c + \sum_{k<i} W_{:,v_k})

Each word v_k is an index in the vocabulary of fixed length. Each column of matrix W is a word embedding. Hence, a summation of word vectors represents a previous word context. This does not preserve the word order since the model simply sums all word vectors.


The output layer requires a softmax function to compute the word probability. A hierarchy softmax is necessary to scale up this calculation.

DocNADE Language Model

The previous model may not suitable for language model because it focuses on learning a semantic representation of the document. The hidden layer now needs to pay more attention to the previous terms. It can be accomplished by using n-gram model:

\textbf{h}_i(\textbf{v}_{<i}) = g(\textbf{b} + \textbf{h}_i^{DN}(\textbf{v}_{<i}) + \textbf{h}_i^{LM}(\text{v}_{<i}))

The additional hidden unit $latex \textbf{h}_i^{LM}$ models a n-gram language model:

$latex \textbf{h}_i^{LM}(\text{v}_{<i}) = \sum_{k=1}^{n-1}U_k \dot W_{:,v_{i-k}}^{LM}$

The matrix W^{LM} is a word embedding based on n-gram model.



DocNADE is similar to Recurrent Neural Network model where both models estimate the conditional probability of the current input given the previous input. For language modeling task, RNN is less explicit on how much word or context to look back. But DocNADE requires us to explicitly tell the model the number of words to look back. On the other hand, DocNADE has a similar favor to Word2Vec where the document representation is simply an aggregate of all previously seen words. However, DocNADE adds additional transformation on top of hidden units.

Will this type of Autoregressive model fall out of fashion due to the success of Recurrent Network with Attention mechanism and memory model? The current trend suggests that RNN is more flexible and extensible than NADE. Hence, there will be more development and extension of RNN models more and more in the coming year.


Lauly, Stanislas, et al. “Document neural autoregressive distribution estimation.” arXiv preprint arXiv:1603.05962 (2016).


Towards a Neural Statistician (ICLR2017)

One extension of VAE is to add a hierarchy structure. In the classical VAE, the prior is drawn from a standard Gaussian distribution. We can learn this prior from the dataset so that each dataset has its own prior distribution.

The generative process is:

  • Draw a dataset prior \mathbf{c} \sim N(\mathbf{0}, \mathbf{I})
  • For each data point in the dataset
    • Draw a latent vector \mathbf{z} \sim P(\cdot | \mathbf{c})
    • Draw a sample \mathbf{x} \sim P(\cdot | \mathbf{z})

The likelihood of the dataset is:

p(D) = \int p(c) \big[ \prod_{x \in D} \int p(x|z;\theta)p(z|c;\theta)dz \big]dc

The paper define the approximate inference network, q(z|x,c;\phi) and q(c|D; \phi) to optimize a variational lowerbound. The single dataset log likelihood lowerboud is:

\mathcal{L}_D = E_{q(c|D;\phi)}\big[  \sum_{x \in d} E_{q(z|c, x; \phi)}[ \log p(x|z;\theta)] - D_{KL}(q(z|c,x;\phi)||p(z|c;\theta)) \big] - D_{KL}(q(c|D;\phi)||p(c))

The interesting contribution of this paper is their statistic network q(c|D; \phi) that approximates the posterior distribution over the context c given the dataset D. Basically, this inference network has an encoder to take each datapoint into a vector e_i = E(x_i). Then, add a pool layer to aggregate e_i into a single vector. This paper uses an element-wise mean. Finally, the final vector is used to generate parameters of a diagonal Gaussian.


This model surprisingly works well for many tasks such as topic models, transfer learning, one-shot learning, etc.


https://arxiv.org/abs/1606.02185 (Poster ICLR 2017)

Improved Variational Autoencoders for Text Modeling using Dilated Convolutions (ICML’17)

One of the reasons that VAE with LSTM as a decoder is less effective than LSTM language model due to the LSTM decoder ignores conditioning information from the encoder. This paper uses a dilated CNN as a decoder to improve a perplexity on held-out data.

Language Model

The language model can be modeled as:

p(\textbf{x}) = \prod_t p(x_t | x_1, x_2, \cdots, x_{t-1})

LSTM language model use this conditional distribution to predict the next word.

By adding an additional contextual random variable [2], the language model can be expressed as:

p(\textbf{x}, \textbf{z}) = \prod_t p(x_t | x_1, x_2, \cdots, x_{t-1}, \textbf{z})

The second model is more flexible as it explicitly model a high variation in the sequential data. Without a careful training, the VAE-based language model often degrades to a standard language model as the decoder chooses to ignore the latent variable generated by the encoder.

Dilated CNN

The authors replace LSTM decoder with Dilated CNN decoder to control the contextual capacity. That is when the convolutional kernel is large, the decoder covers longer context as it resembles an LSTM. But if the kernel becomes smaller, the model becomes more like a bag-of-word. The size of kernel controls the contextual capacity which is how much the past context we want to use to predict the current word.


Stacking Dilated CNN is crucial for a better performance because we want to exponentially increase the context windows. WaveNet [3] also uses this approach.


By replacing VAE with a more suitable decoder, VAE can now perform well on language model task. Since the textual sequence does not contain a lot of variation, we may not notice an obvious improvement. We may see more significant improvement in a more complex sequential data such as speech or audio signals. Also, the experimental results show that Dilated CNN is better than LSTM as a decoder but the improvement in terms of perplexity and NLL are still incremental to the standard LSTM language model. We hope to see stronger language models using VAE in the future.


[1] Yang, Zichao, et al. “Improved Variational Autoencoders for Text Modeling using Dilated Convolutions.” arXiv preprint arXiv:1702.08139 (2017).

[2] Bowman, Samuel R., et al. “Generating sentences from a continuous space.” arXiv preprint arXiv:1511.06349 (2015).

[3] Oord, Aaron van den, et al. “Wavenet: A generative model for raw audio.” arXiv preprint arXiv:1609.03499 (2016).

A Dual Embedding Space Model for Document Ranking (WWW’16)

This paper uses word embedding learned from word2vec model to improve the text retrieval task.

In and out Matrices in Word2Vec

Word2Vec boils down to learn two sets of word embedding matrices: W_{in} and W_{out}. The objective function is to maximize the similarity between the source word w_s \in W_{in} and the target word w_t \in W_{out}:

w_s^Tw_t should be high if both words appear in the same context window.

The relationship among word vectors in W_{in} is different from W_{out}. Each similar word vectors in W_{in} have common semantic. For an instance, Cambridge and Oxford are both well-known university in the UK. On the other hands, The similar word vectors in W_{out} share the common context. The word such as highlights, jersey, and stadium are all related to sport. This paper exploits this fact for the retrieval task.


The authors postulate that document’s aboutness is a key ingredient of retrieving more relevant documents.  For example, when a query word contains word ‘jersey’, the retrieved documents that contain the related words to sports context should probably be more relevant than the document that simply has synonym words such as ‘t-shirt’ or ‘cloths’. Without context information, the word such as ‘t-shirt’ can appear also in shopping context as well as sports context.

Document Retrieval Task

The goal is to find a relevant document that matches the information need. The vector space model such as BM25 scores a higher point to those documents with an exact keyword match. Although keyword matching is very accurate, it has a few flaws. It is not robust to the keyword stuffing problem where some spam website loaded with keywords that are irrelevant to the website. The vector space model cannot detect this problem.

Hence, the relevant score should reflect how well the query matches the context of the document. If the query keyword is ‘Cambridge’, the relevant context should contain words related to universities such as faculty, students, or campus. This intuition motivates the authors to use word embedding from IN matrix (W_{in}) for a query and word embedding from OUT matrix (W_{out}) for documents.

Document Representation

The document representation is a centroid (an average) of all word vectors in the documents. This area could be improved because not all words represent the document equally.

Relevant Score

Finally, to compute a relevant score between query and document, the author uses a cosine similarity between a query vector and document centroid. Furthermore, the exact matching is still important. The final model averages the score between their proposed model and BM25. This simple change yields the best NDCG score.

This paper made me re-think about the key difference between local context used in word2vec and global context used in a topic model. I hope by utilizing both information (local and global) will help us find an even more relevant documents.



Sequence Autoencoder

Back in 2010, RNN is a good architecture for language models [3] due to its ability to remember the previous context. We will explore a few RNN architecture for learning document representation in this post.

Semi-supervised Sequence Learning [2] (NIPS 2014)

This model uses two RNN, the first one as an encoder, and later as a decoder.


Instead of learning to generate the output like in seq2seq model [1], this model learns to reconstruct the input. Hence, this model is a sequence autoencoder. LSTM is used in this paper. This unsupervised learning model is used for pretraining LSTM for different tasks such as sentiment analysis, text classification, and object classification.

A Hierarchical Neural Autoencoder for Paragraphs and Documents [4] (ACL 2015)

This model introduces a hierarchical LSTM to learn a document structure. The architecture has both encoder and decoder. The encoder processes a sequence of word tokens for each sentence. The final output from LSTM represents the input sentence. Then, the second LSTM layer will take a sequence of sentence vectors and output a document vector.

The decoder works in a backward fashion. It takes a document vector and feeds it to the LSTM to decode a sentence vector. Each sentence vector is then fed to another LSTM to decode each word in the sentence.

The author also introduces attention mechanism to put emphasis on particular sentences. The attention boosts the performance of the hierarchical model.


Generating Sentences from a Continuous Space [5]

This model combines RNNLM [3] with Variational autoencoder. The architecture is again composed of an encoder and decoder and attempts to reconstruct the given input. The additional stochastic layer converts an output from an encoder to mean and variance of the target Gaussian distribution. The document representation is sampled from this distribution. The decoder takes this representation and reconstructs word by word through another LSTM.


Training VAE under this architecture poses a challenge due to the component collapsing. The authors use KL annealing method by incrementally increases the weight of the KL loss over time. This modification helps the model to learn a much better representation.


There are a few more RNN architectures that learn document/sentence representation. The goal of learning the representation can be varied. If the goal is to generate a realistic text or dialogue then it is critical to retain syntactic accuracy as well as semantic information. However, if our goal is to obtain a global view of the given document, then we may bypass syntactic details but focus more on semantic meaning. These 3 models show how RNN architectures can be used to model for such tasks.


[1] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in neural information processing systems. 2014.

[2] Dai, Andrew M., and Quoc V. Le. “Semi-supervised sequence learning.” Advances in Neural Information Processing Systems. 2015.

[3] Mikolov, Tomas, et al. “Recurrent neural network based language model.” Interspeech. Vol. 2. 2010.

[4] Li, Jiwei, Minh-Thang Luong, and Dan Jurafsky. “A hierarchical neural autoencoder for paragraphs and documents.” arXiv preprint arXiv:1506.01057 (2015).

[5] Bowman, Samuel R., et al. “Generating sentences from a continuous space.” arXiv preprint arXiv:1511.06349 (2015).



The autoregressive model is applicable to generate a realistic audio signal. Given the waveform \textbf{x} = {x_1, x_2, \cdots, x_T}, the joint probability of a waveform is

p(\textbf{x}) = \prod_{t=1}^T p(x_t | x_1, \cdots, x_{t-1})

Recurrent neural nets is a perfect model to learn and predict a one-dimensional sequence. However, since an audio signal has a lot of samples per second (44.1 kHz), using RNN to process one sample at a time is too slow.

This paper uses casual convolutional neural nets by predicting: p(x_{t+1}|x_1, \cdots, x_t). Masking the convolutional kernel to avoid the current output to see the future input. A convolutional NN architecture is more scalable than RNNs because we can process multiple inputs in parallel. The main limitation is that the convolutional kernel has to be very large to cover a longer range dependency.

Dilated convolution architecture uses a convolution kernel with holes in order to cover a large area of input signals. The kernel is dilated by filling zeros to expand the kernel while skipping some inputs in between.

Stacking convolutional NNs are an effective method to increase the depth of the networks. Residual and skip connections are utilized to speed up convergence of the model.

Wavenet can model the conditional probability given an external input such as a speaker identity, output text or phonic, p(\textbf{x}|\textbf{h}) = \prod_{t=1}^T p(x_t|x_1,\cdots,x_{t-1},\textbf{h}). There are two ways to construct an activation function that depends on textbf{h}, an extra input/context.

Global conditioning: all the output depends on the given context \textbf{h} directly.

Local conditioning: the input context is broken down into a timeseries h_t. Upsampling this signal to match the length of the input audio sample.

Wavenet generates realistic human voices both in English and mandarin.  It can also generate a piece of music audio but it sounds like a mad pianist storming on the piano.

You can check some generated output from DeepMind website: https://deepmind.com/blog/wavenet-generative-model-raw-audio/